Same idea as the other engine threads: Codex writes the code, PaellaDoc decides done by executing your acceptance criteria instead of trusting a passing build.
Anyone running Codex this way? How is it holding up vs using it raw?
Full comparison: PaellaDoc vs Codex: Running and Verifying OpenAI's Coding Agent · PAELLADOC
Codex is one of the engines I rotate through PaellaDoc. On a hard task I will route it to Claude, Codex and a couple of others and see who actually solves it, instead of marrying one vendor.
The reason none of them get to self-certify, Codex included: in my benchmark even frontier models at max effort slipped a real bug on the hard task about two thirds of the time, and not the same way twice, so you cannot predict it. A passing build is a weak signal coming from the same thing that wrote the code.
The surprise was that a cheaper model with the criteria-gate in front matched a frontier model running raw. That is the whole point of the gate for me: it lets me pick the engine on cost and privacy, not on faith. Anyone else routing across engines like this?