PaellaDoc + Claude Code: how are you running it?

Claude Code is one of the agents PaellaDoc drives per task. The thing I keep coming back to: “it built” isn’t “it’s correct”, so I let an independent gate run the criteria instead of trusting the green build.

How are you wiring Claude Code in? Bring-your-own-subscription, one repo or many? Anything that broke?

Full comparison: “PaellaDoc vs Claude Code: Running and Verifying the Agent” (PAELLADOC)

Claude Code is my most-used engine inside PaellaDoc, so this is not a knock. It writes code as well as anything I have.

The reason PaellaDoc exists is the moment right after. The agent writes the code and also runs its own check, and “it built” quietly becomes “it is done”. I measured how often a green build lies: across 210 runs, output that passed the build was genuinely wrong about 40% of the time. Even the strongest frontier model at max effort slipped a real bug on a hard task two times out of three, and the bug was different on each run.

I did not want to be the only thing standing between that and shipping. So the gate runs criteria the agent did not write and cannot edit. Claude Code does the work; the gate decides if it is actually done. How are you all catching the green-but-broken ones?
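To make "criteria the agent did not write and cannot edit" concrete, here is a minimal sketch of the idea in Python. It is not PaellaDoc's implementation: `Criterion`, `Verdict`, `run_gate`, and the `agent_slugify` example are all hypothetical names invented for illustration, and the sketch assumes the criteria are defined somewhere the agent has no write access to.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Criterion:
    """One acceptance check, written by a human (hypothetical type)."""
    name: str
    check: Callable[[Callable[[str], str]], bool]


@dataclass(frozen=True)
class Verdict:
    build_passed: bool
    failed: Tuple[str, ...]

    @property
    def done(self) -> bool:
        # A green build is necessary but not sufficient.
        return self.build_passed and not self.failed


def run_gate(candidate, criteria, build_passed: bool) -> Verdict:
    """Run every criterion against the agent's output and collect failures."""
    failed = []
    for criterion in criteria:
        try:
            ok = bool(criterion.check(candidate))
        except Exception:
            # A crashing check counts as a failure, not as a pass.
            ok = False
        if not ok:
            failed.append(criterion.name)
    return Verdict(build_passed=build_passed, failed=tuple(failed))


# Stand-in for agent output: it "builds" (imports and runs) but is wrong.
def agent_slugify(title: str) -> str:
    return title.lower().replace(" ", "-")


# Assumption: in a real setup these would live outside the agent's
# writable workspace; here they are simply an immutable tuple.
CRITERIA = (
    Criterion("lowercases", lambda f: f("Hello") == "hello"),
    Criterion("spaces become hyphens", lambda f: f("a b") == "a-b"),
    Criterion("strips punctuation", lambda f: f("Hi, there!") == "hi-there"),
)

verdict = run_gate(agent_slugify, CRITERIA, build_passed=True)
print(verdict.done, verdict.failed)
```

Here the build is green, yet the gate reports the task as not done because the punctuation criterion fails. The point of the design is only that the verdict comes from checks the agent never authored, so it cannot pass by weakening its own tests.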