My AI coding loop: design first, tests next

Coding with AI makes me faster at writing good code and bad code at roughly equal rates. The thing that decides which one I ship on a given day is not the model, and not the prompt. It is whether the test suite and the type-checker can tell the difference before I hit commit.

The PR that made me change my flow

Sometime last year I shipped a migration on Rabbitholes.ai that I had generated in about forty minutes. It passed the type-checker, passed every test in the suite, and broke a production edge case within a week. The failure was not in the code the model wrote. It was in the tests I had signed off on to validate it.

I had been treating the AI like a junior engineer I was supervising. Read the diff, nod at the tests, move on. The problem was that I was the only one in the loop thinking about whether the tests covered what actually mattered. The model was writing both the code and the tests to match it, and I was rubber-stamping both.

I stopped doing that the same afternoon. The flow below is what I have landed on after a year of working this way on Rabbitholes.ai, NeverCram, and a few smaller projects.

Design first, prompt second

The first round of any non-trivial task is a design conversation, not a code request. I open Claude Code, describe the problem in the shape of a spec, and ask it to poke holes before writing anything.

The prompt looks close to this:

I want to add background jobs for processing large uploads on NeverCram. The constraints: jobs must be idempotent, must survive a deploy mid-run, and must never duplicate a user-visible action. Before writing any code, walk me through the failure modes you would worry about and the questions I have not answered.

Two things come back. First, a short list of concrete failure modes I usually have not thought of. Second, a set of questions where the model is honest that it does not know what I want: retry budget, what the user sees on failure, what to do when the same job is enqueued twice by a retry.

I answer those questions inline. Then I ask for a one-page design, written as a markdown doc. Only when that doc reads like something I would have written myself do I let the model write the first line of actual code.

This is the single biggest change in how I work. The old loop was prompt-and-patch. The new one is design, disagree, design again, then code.
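To make one of those failure modes concrete, here is a minimal sketch of the "same job enqueued twice by a retry" case. It is an illustration only, not the NeverCram implementation: the `JobQueue` class, the idempotency key format, and the in-memory storage are all assumptions. A real queue would persist the keys so they survive a deploy mid-run.

```python
from dataclasses import dataclass, field


@dataclass
class JobQueue:
    """In-memory stand-in for a real job queue. All names are hypothetical."""

    # Maps idempotency key -> payload. A real system would persist this
    # so the record survives a deploy in the middle of a run.
    seen: dict[str, dict] = field(default_factory=dict)
    pending: list[str] = field(default_factory=list)

    def enqueue(self, idempotency_key: str, payload: dict) -> bool:
        """Enqueue at most once per key. Returns False for a duplicate."""
        if idempotency_key in self.seen:
            return False  # a retry re-sent the same job, so do nothing
        self.seen[idempotency_key] = payload
        self.pending.append(idempotency_key)
        return True


queue = JobQueue()
first = queue.enqueue("upload-123:process", {"upload_id": 123})
retry = queue.enqueue("upload-123:process", {"upload_id": 123})
```

Here `first` is `True` and `retry` is `False`, and only one job sits in `queue.pending`. How the key gets derived is exactly the sort of question the design conversation should settle before any code is written.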

The test suite is the leash

This is the section that took me the longest to build and it is the one I guard hardest.

The models I use daily are good enough at self-correction that if I hand them a test suite and a type-checker and a linter, they will run them, read the failures, and try again without me asking. That shortcuts a huge amount of back-and-forth. It only works if the tests are written in a way that actually catches the thing I care about.

So the tests have to come before the code. For each task, the flow is:

1. Write a test rubric in plain English: three to five bullet points describing what "done" means in observable terms.
2. Ask the model to translate the rubric into failing tests before touching the implementation.
3. Read those tests carefully. This is the step I cannot delegate.
4. Only then let the model write the implementation, with a standing instruction to run the tests and fix forward until the suite is green.

The rubric is where I spend most of my thinking time now. "It works" is not a rubric. "Given a user with a paused subscription, when they try to add a card, the API returns 402 and the UI shows the exact text X" is a rubric. The rubric is the contract. The tests are the enforcement. The model is the executor.
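As a rough sketch of how the paused-subscription rubric line could become a failing test before any implementation exists: the `User`, `Response`, and `add_card` names are hypothetical, and `EXPECTED_UI_TEXT` is a placeholder for the "exact text X" that the rubric would spell out.

```python
from dataclasses import dataclass
from typing import Callable

# Placeholder for "the exact text X" from the rubric. The real string is not given here.
EXPECTED_UI_TEXT = "X"


@dataclass
class User:
    subscription_status: str  # for example "active" or "paused"


@dataclass
class Response:
    status_code: int
    ui_message: str


def add_card(user: User, card_token: str) -> Response:
    """Hypothetical handler, deliberately left unimplemented at this stage."""
    raise NotImplementedError("written only after the tests have been reviewed")


def check_paused_subscription_cannot_add_card(
    handler: Callable[[User, str], Response],
) -> None:
    # Given a user with a paused subscription
    user = User(subscription_status="paused")
    # When they try to add a card
    response = handler(user, "tok_test")
    # Then the API returns 402 and the UI shows the exact text
    assert response.status_code == 402
    assert response.ui_message == EXPECTED_UI_TEXT
```

Calling `check_paused_subscription_cannot_add_card(add_card)` fails right now because the handler raises `NotImplementedError`, which is the intended starting state. The check takes the handler as an argument so the same assertions can run unchanged once an implementation exists. In a pytest suite it would more likely be a plain `test_` function.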

On Rabbitholes.ai this has changed how I touch the node-chat code entirely. Each node holds its own context, and a bug in how contexts propagate is nearly invisible until a user opens a branch twelve levels deep. The tests for that subsystem are the ones I review most carefully, because they are the only thing standing between a confident-looking diff and a broken canvas.

On NeverCram the story is different. The tests that matter are the ones around FSRS scheduling, because if the scheduler drifts by even a small margin, every student using the app gets slightly worse review intervals and I have no way to notice from the outside. Those tests use real review histories sampled from a staging database. A generated test with generated data is a test I do not trust.
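One way such a drift check could be structured, as a sketch under assumptions: the scheduler is passed in as a plain callable, because the real FSRS interface is not shown here, and `RecordedReview`, `find_drift`, and the tolerance value are hypothetical names and numbers. The toy scheduler and two-step history at the bottom exist only so the sketch runs. They are not FSRS and not real review data.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class RecordedReview:
    """One step of a recorded history, with the interval scheduled at capture time."""

    card_id: str
    ratings_so_far: tuple[int, ...]  # ratings up to and including this review
    expected_interval_days: float    # interval recorded when the history was captured


# Hypothetical scheduler signature: ratings in, next interval in days out.
Scheduler = Callable[[Sequence[int]], float]


def find_drift(
    history: Sequence[RecordedReview],
    scheduler: Scheduler,
    tolerance_days: float = 0.01,
) -> list[tuple[RecordedReview, float]]:
    """Return each recorded step whose recomputed interval is outside the tolerance."""
    drifted = []
    for step in history:
        actual = scheduler(step.ratings_so_far)
        if abs(actual - step.expected_interval_days) > tolerance_days:
            drifted.append((step, actual))
    return drifted


# Toy stand-ins so the sketch runs: NOT FSRS, NOT real review data.
def toy_scheduler(ratings: Sequence[int]) -> float:
    return float(len(ratings))


toy_history = [
    RecordedReview("card-1", (3,), 1.0),
    RecordedReview("card-1", (3, 4), 2.0),
]
drift = find_drift(toy_history, toy_scheduler)
```

With the toy inputs, `drift` is empty because the toy scheduler reproduces the recorded intervals exactly. The point of the shape is that the expected values come from recorded behaviour rather than from the model that wrote the code under test.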

I write the rubric. The model writes the tests. I read the tests like my job depends on them, because on the scheduling code, it does.

Reviews happen before the commit, not after

A year ago my review loop was: push the branch, open the PR, re-read my own diff, hit merge. Now there is a gate before the commit even lands.

The gate is a set of small agents that run in sequence against the diff: one that looks for code I could delete without changing behaviour, one that reads the diff the way an adversarial reviewer would, and one that checks for the security issues I know I am bad at catching — auth edges, input validation, anything that touches a secrets handler.

I did not invent these. They are small skills I keep in my Claude Code config and invoke as a pre-commit step on anything non-trivial. The useful part is not the specific checks. It is that the review happens on a version of me that still has the energy to rewrite things, instead of on a version that has already mentally moved to the next ticket.

The first skill catches the kind of over-engineered scaffolding that AI writes by default: a helper for something used once, a wrapper around a wrapper. The second catches the thing I am personally worst at, which is stopping to ask what this looks like if the inputs are hostile. The third is narrower, but the one time it caught a token-logging line I had written at 1 a.m. it earned its place for life.
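The gate's shape can be sketched in a few lines. This is an illustration under assumptions, not the actual skills: the real reviewers are model-backed and their invocation is not shown, so a plain regex check stands in for the security one. `Reviewer`, `flag_secret_logging`, and `run_gate` are hypothetical names, and the pattern is deliberately crude.

```python
import re
from typing import Callable

# A reviewer takes the diff text and returns its findings (an empty list means pass).
Reviewer = Callable[[str], list[str]]

# Toy pattern: an added line that calls print or a log function and
# also mentions a secret-like name.
SECRET_LOG = re.compile(
    r"^\+.*\b(?:print|log\w*)\b.*\b(?:token|secret|password|api_key)\b",
    re.IGNORECASE,
)


def flag_secret_logging(diff: str) -> list[str]:
    """Flag added lines that appear to write a secret-like value to a log."""
    return [
        f"possible secret in log output: {line[1:].strip()}"
        for line in diff.splitlines()
        if not line.startswith("+++") and SECRET_LOG.search(line)
    ]


def run_gate(diff: str, reviewers: list[Reviewer]) -> list[str]:
    """Run every reviewer in order and collect all findings."""
    findings: list[str] = []
    for reviewer in reviewers:
        findings.extend(reviewer(diff))
    return findings
```

A pre-commit hook could feed the output of `git diff --cached` into `run_gate` and exit non-zero when the returned list is not empty, which is what keeps the commit from landing.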

What AI turned out to be better at than I expected

I went into this thinking AI would be best at the grunt work. Rename this, fill in the boilerplate, port this file to the new syntax. It is good at that. It is better than I expected at two other things.

The first is writing the first draft of a design. A lot of design work for me used to be staring at a blank doc. Handing the model my constraints and asking for three different approaches with trade-offs gets me to a usable first draft in twenty minutes. Most of those drafts are wrong, but being wrong on paper is a faster way to get to right than being blank on paper.

The second is edge-case coverage. If I write a rubric and ask for tests against it, the model often comes back with a question like "what should happen when this runs with an empty input" before writing a single test. That question is sometimes the whole point of the task. I used to find those edge cases in production. Now I find them in a prompt response, and I find them before the implementation exists.

The flip side: the model is still worse than I am at deciding what not to build. It will happily write a feature into a codebase where the right move is to delete three files. That judgement has to come from me.

The part I am still figuring out

The piece I have not locked in is how to handle tasks where the design and the tests are both expensive to write up-front because the requirements are genuinely unclear.

If a user tells me the app feels slow, a rubric-first flow is too rigid. The right first step is a dirty exploration: instrument something, read some traces, let the shape of the problem reveal itself. I still do that by hand, because every time I have tried to push an agent into that mode it has come back with plausible-looking investigations pointed at the wrong layer.

Maybe the answer is a different kind of prompt for that phase. Maybe the answer is that the exploratory loop is the part that stays mine. I do not know yet. I am working on it the way I work on anything new: one real task at a time, until I can describe the loop well enough to write it down.