AI · February 6, 2026 · 4 min read

5 Rounds of AI-on-AI Rejection: The System Behind Trustworthy Agents

Using AI to validate AI sounds circular. It works — because no model ever checks its own output.

AI Agents · A-Team · Engineering · Validation

Matthew Warneford

CEO

Most teams try to make AI more reliable by improving a single agent — better prompts, more context, tighter guardrails. We went the other direction. Our insight cards are written by one model, rejected by two others, and rewritten an average of 5 times before they pass.

[Image: Insight card showing YouTube Premium educational content gap]

The card above was made entirely by AI agents. Claude Opus writes it. GPT-4o checks data accuracy and narrative. GPT-4o vision reads the rendered image and flags layout and readability issues. Three models, three jobs, no self-review.

Using AI to validate AI sounds circular. It isn't.

We tried having the agent review its own output first. It approved everything. Makes sense in hindsight — the same model that produced an error won't catch it. It reads what it meant to write, not what's on the page.

So we split it: one agent creates, different ones review.
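In code terms, the split is just a hard constraint on who gets to review. A minimal sketch (the model assignments follow the description above; the names and helper are illustrative, not the production code):

```python
# Minimal sketch of the create/review split. Model assignments follow the
# text above; everything else here is illustrative.
ROLES = {
    "writer": "claude-opus",        # drafts the insight card
    "data_reviewer": "gpt-4o",      # checks numbers and narrative
    "vision_reviewer": "gpt-4o",    # reads the rendered image
}

def reviewers_for(writer_role: str = "writer") -> list[str]:
    """The creator is never the reviewer: exclude whoever wrote the card."""
    return [role for role in ROLES if role != writer_role]

assert "writer" not in reviewers_for()
```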

The most common rejection reason: overclaiming. The work agent turns a 15% difference into a "dramatic shift." The validator catches it every time. Other frequent rejections: missing sample sizes that give findings context, text overlaps, font sizes that sacrifice readability for design. Every one of these would have shipped if the creator had checked its own work.

But "try again" wasn't enough

Early versions sent freeform feedback. "The chart doesn't accurately represent the data." The work agent would make vague revisions — fix one thing, break another.

What changed: structured validation data. Specific fields — numerical accuracy, data representation, narrative fairness, design compliance — each with pass/fail and a concrete issue description. The work agent knows exactly what to fix and what to leave alone.

The vision validator does the same for the rendered output: rendering quality, brand compliance, readability, composition. Each scored and explained.
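As a sketch, the structured result might look something like this, assuming a simple pass/fail plus an issue string per field (the field names mirror the categories above; the real schema and scoring scale are simplified here):

```python
from dataclasses import dataclass

@dataclass
class Check:
    passed: bool
    issue: str = ""   # concrete description of what to fix; empty when the check passes

@dataclass
class ContentValidation:      # returned by the data/narrative validator
    numerical_accuracy: Check
    data_representation: Check
    narrative_fairness: Check
    design_compliance: Check

@dataclass
class VisionValidation:       # returned by the validator that reads the rendered image
    rendering_quality: Check
    brand_compliance: Check
    readability: Check
    composition: Check

def failures(result) -> dict[str, Check]:
    """Only the failing fields go back to the work agent, so it knows exactly
    what to fix and what to leave alone."""
    return {name: check for name, check in vars(result).items() if not check.passed}
```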

Freeform text between agents is lossy. Structured data is precise. The more structure you put between agents, the faster the loop converges.

Validation only works if it runs every time

Early experiments were fragile. The agent would skip validation steps, forget to check the database, hallucinate data points. We couldn't trust the pipeline to run consistently.

What fixed it: reliable tool calling. Once you can trust an agent to call a validation tool every time — not sometimes, every time — you can build real loops. Create, validate, revise, validate. The agent doesn't decide whether to check its work. It's built into the workflow.
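The loop itself is simple once validation is a fixed step rather than a choice. A sketch, with stubs standing in for the real model and tool calls and an arbitrary round cap:

```python
MAX_ROUNDS = 10   # arbitrary cap; in practice cards pass after roughly 5 rounds

def create_card(feedback=None):
    """Work agent drafts (or revises) the card. Stubbed here."""
    return {"card": "draft", "revised_from": feedback}

def validate_card(card):
    """Validator agents return structured, field-level results. Stubbed here;
    an empty list means every check passed."""
    return []

def produce_card():
    feedback = None
    for round_number in range(1, MAX_ROUNDS + 1):
        card = create_card(feedback)
        failures = validate_card(card)   # runs every round, by construction
        if not failures:
            return card, round_number
        feedback = failures              # structured feedback, never freeform prose
    raise RuntimeError(f"card still failing after {MAX_ROUNDS} rounds")

card, rounds = produce_card()
```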

The full pipeline

The insight cards start further upstream. Our Insight Finder agent has access to around 15 tools across two servers: search survey questions, build SQL queries, execute them against the database, check schema documentation, recall past explorations, evaluate whether a finding is worth reporting.
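Grouped into an illustrative manifest, the named capabilities look something like this (the tool identifiers and the split across the two servers are simplified):

```python
# Illustrative manifest of the Insight Finder's tools. Only capabilities named
# above are listed; identifiers and server grouping are simplified.
INSIGHT_FINDER_TOOLS = {
    "survey_server": [
        "search_survey_questions",
        "check_schema_documentation",
        "recall_past_explorations",
    ],
    "database_server": [
        "build_sql_query",
        "execute_sql_query",
        "evaluate_finding",   # is this worth reporting?
    ],
}
```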

It directs a Data Analyst agent — telling it what to query, evaluating the responses, deciding what to explore next. When it finds something, multi-gate verification kicks in: re-running every SQL query against the database (not trusting its own memory), checking that every data reference actually exists, having a separate model verify the numbers support the conclusion.
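Sketched as plain functions, with placeholders standing in for the real database and model calls, the gates look roughly like this:

```python
# Sketch of the multi-gate verification step. The three gates mirror the checks
# above; the callables passed in stand for the real database and model calls.
def rerun_queries(finding, execute_sql):
    """Gate 1: re-run every query rather than trusting the agent's memory."""
    return all(execute_sql(sql) == cached for sql, cached in finding["queries"].items())

def references_exist(finding, lookup_reference):
    """Gate 2: every data reference in the finding must actually exist."""
    return all(lookup_reference(ref) for ref in finding["references"])

def numbers_support_conclusion(finding, ask_separate_model):
    """Gate 3: a separate model confirms the numbers support the conclusion."""
    return ask_separate_model(finding["numbers"], finding["conclusion"])

def passes_all_gates(finding, execute_sql, lookup_reference, ask_separate_model):
    return (rerun_queries(finding, execute_sql)
            and references_exist(finding, lookup_reference)
            and numbers_support_conclusion(finding, ask_separate_model))
```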

One run: 50+ queries executed, work equivalent to days of analyst time, completed in about 20 minutes. The agent that explores doesn't validate. The agent that validates doesn't explore. Each does one job well.

What holds it together

Four principles, one idea — no agent does two jobs:

  1. The creator is never the reviewer.
  2. Feedback is structured, not freeform.
  3. Validation is in the tools, not the agent's judgment.
  4. Each specialist does one job.

The insight cards on our feed went through all of it. Found by an agent, verified by others, designed by another, checked again before publishing. We spent a lot of time getting here. Hopefully sharing what we learned saves you some of that time.

