How a database of validation checks replaced hope as a quality strategy.
I run autonomous AI agent pipelines that build software. Claude writes PRDs, generates marketing copy, implements features, deploys apps. The agents operate inside a loop: they receive context, do work, and emit a <complete> tag when they think they're done. The runner sees that tag and moves on to the next stage.
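To make that concrete, the original loop was roughly this shape. This is a sketch with made-up helper names, not the actual runner:

```js
// The naive loop: the agent's own <complete> tag is the exit condition.
async function runStage(context) {
  while (true) {
    const output = await runAgent(context);       // hypothetical call into the agent
    if (output.includes('<complete>')) return;    // the agent decides when it's done
    context = foldIntoContext(context, output);   // hypothetical: carry results forward
  }
}
```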
The problem is that agents are confidently wrong. A PRD with a feature matrix that references APIs that don't exist. Marketing content with hallucinated customer quotes. A build ticket that specifies dependencies with version conflicts. The agent wrote <complete>DONE</complete> and the system believed it.
For a while I treated this like a people-management problem. Give better prompts. Add more examples. Be more specific about expectations. But the failure mode isn't confusion -- it's misplaced confidence. The agent genuinely believes it did good work. And the runner has no mechanism to disagree.
My first instinct was reactive validation. After the agent finishes, check the output.
I built health checks that run after deployment -- Playwright tests, Lighthouse scores above 80, responsive layout verification. I added scoring rubrics with weighted dimensions to evaluate niche research quality. I wired up human-in-the-loop approval gates through Telegram so I could review screenshots before anything went live. For marketing content, I ran creative tournaments where a critic agent scored three competing copywriter agents on eight dimensions, and only the winner advanced.
These all help. But they share a structural flaw: the agent runs first, and you check later. When validation fails, you either start the whole thing over or try to patch the output. There's no structured re-validation loop. No way to say "these three specific things are broken, fix them and prove they're fixed." You're back to hoping the next attempt goes better.
The insight came from thinking about what makes deployment health checks work better than prompt-based quality control. Health checks are defined before deployment happens. They're external to the thing being checked. And they're binary -- pass or fail, no negotiation.
That gave me two rules. First: generate validation criteria before the work begins, not after. The criteria should be based on the requirements, not the output. Second: the agent should never control its own exit. An external system -- specifically, a database query -- should be the gatekeeper. The agent can claim it's done all it wants. Only the data decides.
The system I built has five roles, and they never overlap.
A Validation Builder runs first. It reads the work request, assesses complexity, and generates a set of typed validation checks -- stored as rows in SQLite. Each check has a type (lint, build, test, design, accessibility, content), a description of what to verify, acceptance criteria, and a dependency list pointing to other checks that must pass first. The builder doesn't do any implementation work. It only asks: "What would need to be true for this to be done well?"
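Concretely, the rows look something like this. It's a sketch with illustrative table and column names (using better-sqlite3 for the SQLite access), not the exact schema:

```js
// Illustrative schema for the validation checks -- a sketch, not the real columns.
const Database = require('better-sqlite3');
const db = new Database('pipeline.db');

db.exec(`
  CREATE TABLE IF NOT EXISTS checks (
    id          INTEGER PRIMARY KEY,
    type        TEXT NOT NULL,            -- lint | build | test | design | accessibility | content
    description TEXT NOT NULL,            -- what to verify
    criteria    TEXT,                     -- acceptance criteria the sub-validator applies
    severity    TEXT NOT NULL,            -- critical | important | minor
    status      TEXT DEFAULT 'pending',   -- pending | pass | fail
    evidence    TEXT                      -- filled in later by the sub-validator
  );
  CREATE TABLE IF NOT EXISTS check_deps (
    check_id   INTEGER NOT NULL REFERENCES checks(id),
    depends_on INTEGER NOT NULL REFERENCES checks(id)   -- must pass before check_id can run
  );
`);
```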
Then the Implementation Agent does the actual work. It writes code, generates content, builds whatever was requested. It has no awareness of the validation checks. It just works from the original requirements and signals completion when it thinks it's finished.
The Validation Agent processes the checks bottom-up, ordered by dependency. It spawns typed sub-validators -- a linter runs lint checks, a test runner handles test checks, a design reviewer evaluates visual checks. Each sub-validator marks its check as pass or fail with evidence. No check runs until everything it depends on has passed.
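In database terms, "nothing runs until its dependencies pass" is just a query. Here's a sketch of the selection-and-dispatch pass, assuming the schema above; the sub-validator functions are hypothetical stand-ins for the spawned agents and tools:

```js
// A sketch of the Validation Agent's pass, assuming the schema sketched earlier.
const Database = require('better-sqlite3');
const db = new Database('pipeline.db');

// Hypothetical typed sub-validators -- in the real system these are spawned agents/tools.
const subValidators = {
  lint:          check => runLinter(check),
  build:         check => runBuild(check),
  test:          check => runTestSuite(check),
  design:        check => reviewDesign(check),
  accessibility: check => auditAccessibility(check),
  content:       check => reviewContent(check),
};

async function processChecks() {
  // Runnable = still pending, and every check it depends on has already passed.
  const runnable = db.prepare(`
    SELECT c.* FROM checks c
    WHERE c.status = 'pending'
      AND NOT EXISTS (
        SELECT 1 FROM check_deps d
        JOIN checks dep ON dep.id = d.depends_on
        WHERE d.check_id = c.id AND dep.status != 'pass'
      )
  `).all();

  for (const check of runnable) {
    const { passed, evidence } = await subValidators[check.type](check);
    db.prepare(`UPDATE checks SET status = ?, evidence = ? WHERE id = ?`)
      .run(passed ? 'pass' : 'fail', evidence, check.id);
  }
  return runnable.length;   // 0 means nothing was runnable at this level
}
```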
When checks fail, a Fix Agent queries the database for failures grouped by type and hires specialists: a designer for visual issues, a frontend engineer for UI bugs, a backend engineer for API problems. After fixes are applied, control returns to the Validation Agent, which re-processes from the lowest failing dependency upward.
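The handoff to the Fix Agent is another query. A sketch, again assuming the same schema, with an illustrative routing table and a hypothetical hireSpecialist dispatcher:

```js
// A sketch of the Fix Agent's intake, assuming the same schema.
const Database = require('better-sqlite3');
const db = new Database('pipeline.db');

// Failures grouped by type, so each specialist gets one coherent brief.
const failures = db.prepare(`
  SELECT type,
         COUNT(*)                        AS failing,
         GROUP_CONCAT(description, '; ') AS what_failed
  FROM checks
  WHERE status = 'fail'
  GROUP BY type
`).all();

// Illustrative routing -- the real mapping hires designer / frontend / backend specialists.
const specialistFor = {
  design: 'designer',
  accessibility: 'frontend-engineer',
  test: 'backend-engineer',
};

for (const group of failures) {
  hireSpecialist(specialistFor[group.type] ?? 'generalist', group);   // hypothetical dispatch
}
```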
Exit only happens when every check in the database is marked as passing. Not when the agent says so. Not when a timer runs out. When the data says so.
The checks aren't a flat list. They form a directed acyclic graph -- a DAG -- where each check declares what it depends on. Lint checks and dependency checks have no prerequisites. Build depends on both. Unit tests depend on build. End-to-end tests depend on unit tests. Design review depends on e2e because there's no point evaluating visual polish on a page that crashes.
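As a sketch of seeding that chain, assuming the schema above (addCheck is an illustrative helper; the real Validation Builder generates these rows from the work request):

```js
const Database = require('better-sqlite3');
const db = new Database('pipeline.db');

// Illustrative helper: insert a check plus its prerequisite edges in one step.
function addCheck(type, description, severity, dependsOn = []) {
  const { lastInsertRowid: id } = db.prepare(
    `INSERT INTO checks (type, description, severity) VALUES (?, ?, ?)`
  ).run(type, description, severity);
  const edge = db.prepare(`INSERT INTO check_deps (check_id, depends_on) VALUES (?, ?)`);
  for (const dep of dependsOn) edge.run(id, dep);
  return id;
}

// The chain from the text: lint and dependency checks first, design review last.
const lint  = addCheck('lint',  'Linter passes with no errors',           'critical');
const deps  = addCheck('build', 'Dependencies resolve without conflicts', 'critical');
const build = addCheck('build', 'Production build succeeds',              'critical', [lint, deps]);
const unit  = addCheck('test',  'Unit tests pass',                        'critical', [build]);
const e2e   = addCheck('test',  'End-to-end tests pass',                  'critical', [unit]);
addCheck('design', 'Design review: visual polish meets the brief', 'important', [e2e]);
```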
This ordering matters because of cascading invalidation. When the Fix Agent patches a lint error, the build check gets reset to pending. When a build fix changes an import, unit tests go back to pending. You always re-validate from foundations up. The system never evaluates a higher-level check on top of a broken foundation.
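The cascade itself is one small walk over the dependency table. A sketch, assuming the same schema:

```js
const Database = require('better-sqlite3');
const db = new Database('pipeline.db');

// When a fix touches a check, every check that depends on it -- directly or
// transitively -- goes back to pending, so validation resumes from that level up.
function invalidateDependents(checkId) {
  const dependents = db.prepare(
    `SELECT check_id FROM check_deps WHERE depends_on = ?`
  ).all(checkId);

  for (const { check_id } of dependents) {
    db.prepare(`UPDATE checks SET status = 'pending', evidence = NULL WHERE id = ?`)
      .run(check_id);
    invalidateDependents(check_id);   // the graph is acyclic, so this walk terminates
  }
}
```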
This also prevents the oscillation problem. Without dependency ordering, fix A breaks B and fix B breaks A, and the system ping-pongs forever. With bottom-up validation, fixes are applied at the lowest failing level first. By the time you reach higher-level checks, the foundation is solid.
The final piece is the runner itself -- the JavaScript process that orchestrates everything. It uses the Ralph Loop pattern: fresh context window per iteration, state persisted externally, signals parsed from agent output. But now the exit condition isn't the agent's <complete> tag. It's a database query.
After every iteration, the runner queries SQLite: "How many critical and important checks are still failing?" If the answer is anything other than zero, it ignores the agent's completion signal and spawns a new iteration. The new iteration gets context about which checks failed, what the evidence was, and which types of fixes are needed. The agent can claim it's done in every single iteration. The runner doesn't care.
There's a hard ceiling on iterations -- the system won't loop forever. But within that ceiling, the runner is relentless. It doesn't accept "mostly done" or "good enough." Every critical check passes, or the work continues. The agent does the creative work. The database holds it accountable. The runner enforces the standard.
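Put together, the runner's gate fits in a few lines. A sketch assuming the schema above; MAX_ITERATIONS and the spawn/parse helpers are illustrative stand-ins, not the actual runner code:

```js
const Database = require('better-sqlite3');
const db = new Database('pipeline.db');

const MAX_ITERATIONS = 15;   // illustrative hard ceiling

// Context for the next iteration: which checks failed and what the evidence was.
function failureContext() {
  return db.prepare(
    `SELECT type, description, evidence FROM checks WHERE status = 'fail'`
  ).all();
}

// The gatekeeper query: how many critical or important checks are still not passing?
const gate = db.prepare(`
  SELECT COUNT(*) AS n FROM checks
  WHERE severity IN ('critical', 'important') AND status != 'pass'
`);

async function runPipeline(request) {
  for (let i = 0; i < MAX_ITERATIONS; i++) {
    const output = await spawnIteration(request, failureContext());   // fresh context window
    parseSignals(output);                     // the <complete> tag is recorded, never trusted
    if (gate.get().n === 0) return 'done';    // only the data decides the exit
  }
  return 'ceiling-reached';                   // escalate instead of looping forever
}
```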
This isn't about building smarter agents. The agents are already smart enough to do remarkable work. The problem was never capability -- it was accountability. An agent with no external validation will produce confident garbage as often as it produces confident excellence. You can't tell the difference until something breaks in production.
The database of validation checks changed the game. Define what "done" looks like before work begins. Let the agent do the creative work. Let a separate system judge the results. Let the runner enforce the standard. Trust the process, not the agent.