Running Codex inside PaellaDoc — your experience?

jlcases · June 10, 2026, 7:42pm

Same idea as the other engine threads: Codex writes the code, PaellaDoc decides done by executing your acceptance criteria instead of trusting a passing build.

Anyone running Codex this way? How is it holding up vs using it raw?

Full comparison: PaellaDoc vs Codex: Running and Verifying OpenAI's Coding Agent · PAELLADOC

jlcases · June 10, 2026, 7:51pm

Codex is one of the engines I rotate through PaellaDoc. On a hard task I will route it to Claude, Codex and a couple of others and see who actually solves it, instead of marrying one vendor.

The reason none of them get to self-certify, Codex included: in my benchmark even frontier models at max effort slipped a real bug on the hard task about two thirds of the time, and not the same way twice, so you cannot predict it. A passing build is a weak signal coming from the same thing that wrote the code.

The surprise was that a cheaper model with the criteria-gate in front matched a frontier model running raw. That is the whole point of the gate for me: it lets me pick the engine on cost and privacy, not on faith. Anyone else routing across engines like this?