A green build is not a correct feature — I measured the gap

I ran a small, frozen benchmark to answer one question: when an AI coding CLI says a task is done, how often is the code actually correct — and does an explicit spec plus a verification gate change that?

Setup (frozen, so it’s reproducible)
A real Next.js + TypeScript repo, pinned at one commit. 5 invented features, each a user story + 5 acceptance criteria. Every run in its own isolated git worktree. Six models: Claude Sonnet 4.6, Claude Haiku 4.5, Codex, gpt-5-mini, gpt-5-nano, Kimi.

The scoring that matters isn’t an LLM reading a diff. It’s an execution gate: re-apply each diff, import the route, call it with real requests, and assert every acceptance criterion in code. Pass = all 5 ACs actually execute correctly. A green build doesn’t count.

The number
A raw one-line request — what people actually type — ships a genuine bug (a crash, an empty response, an ignored requirement, or no change at all) in 10 of 27 runs (~37%).

It’s concentrated in cheap models. gpt-5-nano produced a real bug or nothing on 4 of 4 tasks. Strong models are mostly fine raw (Codex: 0 genuine bugs). The point isn’t “models are bad” — it’s that which model you pick silently decides whether your unverified output is safe.

With the same task + spec + the execution gate, genuine bugs drop to 1 of 23 runs (~4%), and the gate catches the rest before anything reaches “done”.

The part I didn’t expect
This makes cheap models safe. A model 10–50x cheaper, run with a spec and a gate, lands correct output the raw frontier model misses. Verification decouples cost from trust.

Honest limits (because you’d find them anyway)
n=5, one repo, one run per cell — directional, not a paper. I deliberately exclude arbitrary interface mismatches (a param named minMag vs minMagnitude) from the genuine-bug count: the model can’t know an interface it was never told. The claim rests only on real correctness bugs. And on some features the raw agent nailed it unaided — the gap isn’t universal.

Why I built it this way
Same philosophy as everywhere in PaellaDoc: the source of truth is what executed, verified — not what a model assumed. The spec makes intent travel with the task; the gate refuses “done” without proof.

This is directional — n=5, one repo, one run per cell. Not a paper. But the pattern is consistent: unverified output is a coin-flip on cheap models, and a gate turns it into something you can trust.

Happy to go into the method in the replies. Tell me where you think this breaks.

Hi, I find your benchmark interesting. Would it be possible for you to share the repo you used and the prompt you gave to the LLMs??

1 Like

Mauro — the repo’s here, and I rebuilt it to be worth attacking: https://github.com/jlcases/paelladoc-verification-benchmark

The prompts are in run_v2.py (build_prompt). There are two: raw is the one-line request, spec is the same plus the five acceptance criteria. Scoring is an independent execution gate that runs each diff and checks every criterion. A green build doesn’t count.

The headline I’d defend: this isn’t an IQ test for models, it’s a trust problem. Across 120 runs a raw request shipped a genuine bug in 40% of cases with the build green, and the same model flips between right and wrong on the same prompt (16% of cells, all in the raw arm).

Then I added the configs people name to dismiss it: Opus 4.8 at xhigh and max effort, Codex 5.5 at xhigh. Even Opus 4.8 at max still ships a genuine bug in 13% of raw runs, 2 of 3 on the hard feature, and the two Opus configs failed on different runs of it. With the criteria up front every config goes to 0%, and a cheap model with the criteria matches the frontier one.

Everything is published, all 210 diffs included. Tell me where it breaks.