I ran a small, frozen benchmark to answer one question: when an AI coding CLI says a task is done, how often is the code actually correct — and does an explicit spec plus a verification gate change that?
Setup (frozen, so it’s reproducible)
A real Next.js + TypeScript repo, pinned at one commit. 5 invented features, each a user story + 5 acceptance criteria. Every run in its own isolated git worktree. Six models: Claude Sonnet 4.6, Claude Haiku 4.5, Codex, gpt-5-mini, gpt-5-nano, Kimi.
The scoring that matters isn’t an LLM reading a diff. It’s an execution gate: re-apply each diff, import the route, call it with real requests, and assert every acceptance criterion in code. Pass = all 5 ACs actually execute correctly. A green build doesn’t count.
The number
A raw one-line request — what people actually type — ships a genuine bug (a crash, an empty response, an ignored requirement, or no change at all) in 10 of 27 runs (~37%).
It’s concentrated in cheap models. gpt-5-nano produced a real bug or nothing on 4 of 4 tasks. Strong models are mostly fine raw (Codex: 0 genuine bugs). The point isn’t “models are bad” — it’s that which model you pick silently decides whether your unverified output is safe.
With the same task + spec + the execution gate, genuine bugs drop to 1 of 23 runs (~4%), and the gate catches the rest before anything reaches “done”.
The part I didn’t expect
This makes cheap models safe. A model 10–50x cheaper, run with a spec and a gate, lands correct output the raw frontier model misses. Verification decouples cost from trust.
Honest limits (because you’d find them anyway)
n=5, one repo, one run per cell — directional, not a paper. I deliberately exclude arbitrary interface mismatches (a param named minMag vs minMagnitude) from the genuine-bug count: the model can’t know an interface it was never told. The claim rests only on real correctness bugs. And on some features the raw agent nailed it unaided — the gap isn’t universal.
Why I built it this way
Same philosophy as everywhere in PaellaDoc: the source of truth is what executed, verified — not what a model assumed. The spec makes intent travel with the task; the gate refuses “done” without proof.
This is directional — n=5, one repo, one run per cell. Not a paper. But the pattern is consistent: unverified output is a coin-flip on cheap models, and a gate turns it into something you can trust.
Happy to go into the method in the replies. Tell me where you think this breaks.
