The Golden Gate: why we verify what users see, not what compiles

jlcases · May 18, 2026, 7:03pm

The Golden Gate is PaellaDoc's verification engine: it checks what the user actually sees on screen, not what the test runner says.

🇬🇧 English

Every other tool stops at “the code compiles” or “the tests pass”. The Golden Gate stops at “we have proof a real user-visible behavior exists”.

A user story does not transition to passed until all of these are true on the diff produced by the engine:

Screenshot of the result — captured by Playwright on the running app
Playwright trace — full user-visible interaction recorded
Exit code = 0 — all declared checks (pytest, ruff, pyright, eslint, tsc, custom) pass
WAL hash chain intact — the engine didn’t sneak in hidden writes to the database between checks
Diff is non-empty — actual work was done, not just a no-op claim

If any one of these fails, the user story stays in its previous state. No silent “almost done”. No “trust me, I changed it”. The orchestrator refuses.
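The all-or-nothing rule above can be sketched in a few lines. This is only an illustration, not PaellaDoc's actual code: the `Evidence` class, its field names, `evaluate_gate`, `next_state`, and the state strings are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, Optional


@dataclass(frozen=True)
class Evidence:
    """Hypothetical evidence bundle for one iteration (field names are illustrative)."""
    screenshot: Optional[Path]   # screenshot captured on the running app
    trace: Optional[Path]        # recorded Playwright trace archive
    exit_codes: Dict[str, int]   # declared check name -> process exit code
    wal_chain_intact: bool       # outcome of the WAL hash chain verification
    diff: str                    # diff text produced by the engine


def evaluate_gate(ev: Evidence) -> Dict[str, bool]:
    """Evaluate all five conditions; nothing short-circuits, so every failure is visible."""
    return {
        "screenshot": ev.screenshot is not None and ev.screenshot.is_file(),
        "trace": ev.trace is not None and ev.trace.is_file(),
        # At least one declared check must exist, and all of them must exit with 0.
        "exit_code": bool(ev.exit_codes) and all(c == 0 for c in ev.exit_codes.values()),
        "wal_chain": ev.wal_chain_intact,
        "diff_non_empty": bool(ev.diff.strip()),
    }


def next_state(previous: str, ev: Evidence) -> str:
    """Return 'passed' only if every condition holds; otherwise keep the previous state."""
    return "passed" if all(evaluate_gate(ev).values()) else previous
```

Returning one boolean per condition, instead of stopping at the first failure, mirrors the idea that each check reports its own result.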

Why this matters

Three months after the agent generated the code, when something breaks, you can answer:

“Was this acceptance criterion (AC) ever actually verified?” → Yes, here are the screenshot and the Playwright trace from the moment it was marked passed
“Did the engine change anything outside the diff?” → No, the WAL hash chain is intact
“Was the work real or did it just claim to be?” → The recorded diff shows it: its size and content are exactly what changed
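For readers unfamiliar with hash chains, here is a minimal sketch of why an intact chain rules out hidden writes. It shows the general technique only. The entry layout, the `GENESIS` value, and the function names are assumptions for the example, and PaellaDoc's real WAL format may differ.

```python
import hashlib
from typing import Dict, List

GENESIS = "0" * 64  # assumed starting value for the first link


def link_hash(prev_hash: str, payload: bytes) -> str:
    """SHA-256 over the previous entry's hash followed by this entry's payload."""
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()


def append(log: List[Dict], payload: bytes) -> None:
    """Append an entry whose hash commits to everything written before it."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"payload": payload, "hash": link_hash(prev, payload)})


def chain_intact(log: List[Dict]) -> bool:
    """Recompute every link; any edited, inserted, or removed entry breaks the chain."""
    prev = GENESIS
    for entry in log:
        if entry["hash"] != link_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True
```

Because each hash depends on the previous one, a write slipped in between two checks cannot be hidden without recomputing every later entry.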

Common confusion

“But the tests pass and the code looks good. Why won’t the gate let it through?”

Because looking good and compiling is what every other tool already does. The Golden Gate is the one part of PaellaDoc that is explicitly not optimized for speed. It’s optimized for being honest later. If you find it annoying, that’s the point: it’s annoying for the same reason a seatbelt is annoying.

How to debug a failing gate

Look at the evidence pack of the failed iteration (it’s displayed in the UI and also stored on disk under .casia/ in the project)
Each of the 5 checks reports red/amber/green individually with the specific reason
The decision_record for that iteration tells you what the engine claimed it did vs what the gate observed
If the gate is wrong (it rejected work that was actually correct), open a Help post with the evidence pack attached
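As a rough illustration of the triage steps above, the sketch below lists the non-green checks with their reasons, worst first. The JSON shape, the check names, and the reasons are invented for the example. This post does not document the actual layout of the evidence pack under .casia/, so treat the structure as an assumption.

```python
import json
from typing import List, Tuple

# Hypothetical per-check report; the real evidence pack format may differ.
SAMPLE_REPORT = """
{
  "screenshot": {"status": "green", "reason": "captured"},
  "trace": {"status": "amber", "reason": "trace recorded but incomplete"},
  "exit_code": {"status": "red", "reason": "pyright exited with 1"},
  "wal_chain": {"status": "green", "reason": "chain verified"},
  "diff_non_empty": {"status": "green", "reason": "diff has changes"}
}
"""

SEVERITY = {"red": 0, "amber": 1, "green": 2}


def failing_checks(report_json: str) -> List[Tuple[str, str, str]]:
    """Return (check, status, reason) for every non-green check, red before amber."""
    report = json.loads(report_json)
    rows = [
        (name, check["status"], check["reason"])
        for name, check in report.items()
        if check["status"] != "green"
    ]
    return sorted(rows, key=lambda row: SEVERITY[row[1]])


if __name__ == "__main__":
    for name, status, reason in failing_checks(SAMPLE_REPORT):
        print(f"{status.upper():5} {name}: {reason}")
```

Pasting this kind of summary into a Help post, along with the evidence pack itself, makes it easier for others to see which check blocked the story.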

The gate is not infallible. But it’s strictly better than “trust me”.
