AI · February 8, 2026 · 4 min read

Agents Got Good at Calling Tools. What If the Tools Are Agents?


AI · Agents · Engineering · Reliability

Matthew Warneford

CEO

Every additional AI agent in a chain should make the output less reliable. Each step compounds errors from the last. We chain five. The output is publishable.

Our research agent explores a survey database, runs 50+ queries, & produces verified insights in about 20 minutes.1 A content agent takes those insights, generates data visualisations, validates them visually & numerically, & renders publishable cards. A social agent formats for LinkedIn & X. Each agent hands structured, verified data to the next. No human intervention.

We wrote previously about how we made individual agents trustworthy — separate creators from validators, use structured feedback, build validation into tools.2 This post is about what that unlocked.

Once an agent reliably produces structured, validated output, it looks like a deterministic tool from the outside. Input in, trusted output out. The caller doesn't know whether the work was done by a database query or a multi-step reasoning loop.

At that point the agent is a tool. Same interface, same contract.

Agents can call agents.
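A minimal sketch of that contract in Python (hypothetical names and placeholder data, not our production code): the caller programs against one interface and cannot tell which implementation did the work.

```python
# Illustrative sketch; names and values are hypothetical.
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Insight:
    claim: str
    numbers: dict[str, float] = field(default_factory=dict)
    verified: bool = False  # set only once validation has passed


class InsightTool(Protocol):
    """The shared contract: a question in, trusted structured output out."""
    def run(self, question: str) -> Insight: ...


class QueryTool:
    """Deterministic: a single lookup behind the interface."""
    def run(self, question: str) -> Insight:
        # Placeholder data for illustration only.
        return Insight(claim="example finding", numbers={"share": 0.62}, verified=True)


class ResearchAgentTool:
    """A stateful, multi-step agent behind the very same interface."""
    def run(self, question: str) -> Insight:
        draft = Insight(claim="example finding", numbers={"lift": 1.4})
        draft.verified = self._validate(draft)  # validation runs before handing off
        return draft

    def _validate(self, draft: Insight) -> bool:
        return all(v > 0 for v in draft.numbers.values())  # stand-in for real gates


def next_step(tool: InsightTool, question: str) -> Insight:
    """The caller neither knows nor cares which implementation did the work."""
    insight = tool.run(question)
    if not insight.verified:
        raise ValueError("refusing unvalidated output")
    return insight
```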

Before validated outputs, chaining agents meant compounding errors. The first agent produces something roughly right. The second trusts it completely and builds on the imprecisions. By the third, the output is unusable. Every step amplifies the uncertainty from the last.

Validated outputs break this. Each agent verifies its work before handing it on. The next agent receives structured data it can rely on. Five validated steps compound capability, not errors.

Three layers

The creating agent is stateful — it accumulates context across dozens of tool calls. What it's tried, what data it found, why previous approaches failed. Context is what makes creative work good.

The validators are stateless. No history, no context. Artifact in, pass/fail out. Different models from the creator — Claude Opus creates, GPT validates.3 The model that produced an error won't catch it.

Output generation is deterministic. Render a PNG, save to a database, send an email. No LLM.

If it needs memory, it's a stateful agent. If it's input → evaluate → pass/fail, it's a stateless tool. If it doesn't need an LLM, it's deterministic.
Stateful Agent: accumulated context (user request, data fetched, previous attempts, reasoning)
  ↓ calls
Stateless Validators: no context; artifact + criteria → pass/fail + structured feedback
  ↓ on pass
Deterministic Output: render, save, send (no LLM)
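To make the split concrete, here is a hedged sketch of the three layers with hypothetical names and stub bodies; a real creator and validator would each call an LLM, and the error type shown is illustrative, not one of our named types.

```python
# Illustrative sketch; names, values, and the error type are hypothetical.
from dataclasses import dataclass


@dataclass
class ValidationResult:
    passed: bool
    error_type: str | None = None  # structured feedback on failure


def validate_finding(artifact: dict, criteria: dict) -> ValidationResult:
    """Stateless: artifact + criteria in, pass/fail + structured feedback out."""
    if artifact.get("sample_size", 0) < criteria["min_sample_size"]:
        return ValidationResult(False, error_type="insufficient_sample")
    return ValidationResult(True)


def render_card(artifact: dict) -> bytes:
    """Deterministic output: render, save, send. No LLM involved."""
    return repr(artifact).encode()


class CreatorAgent:
    """Stateful: accumulates context across attempts and tool calls."""

    def __init__(self) -> None:
        self.history: list[str] = []  # what was tried, what data was found, why it failed

    def produce(self, request: str) -> dict:
        self.history.append(f"request: {request}")
        # Placeholder: a real creator would run many tool calls here.
        return {"claim": "example finding", "sample_size": 240}


def run(request: str) -> bytes | None:
    agent = CreatorAgent()
    for attempt in range(3):  # bounded retries; abandon after that
        artifact = agent.produce(request)
        result = validate_finding(artifact, {"min_sample_size": 100})
        if result.passed:
            return render_card(artifact)  # only validated work reaches output
        agent.history.append(f"attempt {attempt} failed: {result.error_type}")
    return None
```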

Five gates

Our research agent's validation pipeline has five gates.4 Re-run every SQL query against the database. Parse each query to extract references. Validate those references against the schema. Check the numbers support the conclusion — tolerances of ±0.5pp for percentages, ±0.05x for ratios.5 Have a different model re-evaluate the finding's significance with verified data.

On failure, feedback is structured with named error types: possible_overcounting, missing_variant_filter, raw_count_comparison, ambiguous_aggregation.6 Three attempts per finding. After that, abandoned.

Agent submits finding
  1. Re-run SQL queries against database
  2. Extract & parse references via LLM
  3. Validate references against schema
  4. Check numbers support conclusion (tolerances: ±0.5pp percentages, ±0.05x ratios)
  5. Re-evaluate significance via LLM
Save with embedding + dedup
On failure: structured feedback with named error type → agent revises → retry (max 3)

Each gate is a stateless tool. Adding a gate is adding a tool. The pipeline is composable in the same way the agents are.
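One way to express that composability, as a sketch with placeholder gate bodies: the gate names follow the pipeline above, but the function signatures and everything inside them are assumptions, not our implementation.

```python
# Illustrative sketch: each gate is a stateless check; adding a gate is appending a function.
from typing import Callable, NamedTuple


class GateResult(NamedTuple):
    passed: bool
    error_type: str | None = None  # e.g. "possible_overcounting"
    detail: str = ""


Gate = Callable[[dict], GateResult]


def rerun_sql(finding: dict) -> GateResult:
    return GateResult(True)  # placeholder: re-execute queries, compare results


def extract_references(finding: dict) -> GateResult:
    return GateResult(True)  # placeholder: an LLM parses the queries for references


def validate_schema(finding: dict) -> GateResult:
    return GateResult(True)  # placeholder: check references exist in the schema


def check_numbers(finding: dict) -> GateResult:
    return GateResult(True)  # placeholder: ±0.5pp for percentages, ±0.05x for ratios


def reevaluate_significance(finding: dict) -> GateResult:
    return GateResult(True)  # placeholder: a different model re-judges the finding


PIPELINE: list[Gate] = [
    rerun_sql,
    extract_references,
    validate_schema,
    check_numbers,
    reevaluate_significance,
]


def run_gates(finding: dict) -> GateResult:
    """Sequential gates: the first failure returns structured feedback."""
    for gate in PIPELINE:
        result = gate(finding)
        if not result.passed:
            return result
    return GateResult(True)
```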

The content agent has its own validation: a model checks every number against source data, and a vision model checks the rendered output for readability & layout.7 Different pipeline, same pattern.

Research Agent (explore → evaluate → verify → structured output)
  ↓
Content Agent (create → validate code → validate visually → render)
  ↓
Published Output (cards, reports, social posts)

Each handoff: structured, verified data
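In code terms, the chain reduces to function composition over verified, structured types. A sketch with hypothetical names and placeholder bodies, not the actual pipeline:

```python
# Illustrative sketch; each hop consumes the previous step's verified output.
from dataclasses import dataclass


@dataclass
class VerifiedInsight:
    claim: str
    verified: bool


@dataclass
class RenderedCard:
    png_path: str
    verified: bool


def research_agent(question: str) -> VerifiedInsight:
    # Placeholder: explore, evaluate, verify, then return structured output.
    return VerifiedInsight(claim="example finding", verified=True)


def content_agent(insight: VerifiedInsight) -> RenderedCard:
    if not insight.verified:
        raise ValueError("refusing unvalidated input")  # nothing unverified crosses a handoff
    return RenderedCard(png_path="card.png", verified=True)


def social_agent(card: RenderedCard) -> list[str]:
    if not card.verified:
        raise ValueError("refusing unvalidated input")
    return [f"LinkedIn post for {card.png_path}", f"X post for {card.png_path}"]


def publish(question: str) -> list[str]:
    # The chain is just function composition over verified, structured types.
    return social_agent(content_agent(research_agent(question)))
```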

Chain five unreliable steps & errors compound. Chain five reliable steps & complex workflows hold together. The difference is whether each step's output was validated before the next step used it.

Footnotes

  1. The Insight Finder agent has access to ~15 tools across two MCP servers. A typical run executes 50+ queries in roughly 20 minutes.

  2. How We Built Agents Whose Work We Can Trust — covers the creator/validator split, structured feedback, & mandatory tool-based validation.

  3. Claude Opus for creation, GPT for code & numerical validation, GPT vision for visual validation. Three models, three jobs.

  4. Five sequential gates. Each must pass. Failure at any gate returns structured feedback.

  5. Tolerances defined in a single canonical spec file shared across all prompts. ±0.5pp for percentages, ±0.05x for ratios, minimum sample sizes of n≥100 for headline findings & n≥50 for confirmatory. Duplicate detection: cosine similarity ≥0.85.

  6. These error types emerged from observing common agent failures over several months. Each maps to a specific check in the verification pipeline.

  7. The content pipeline validates code accuracy (numerical, data representation, narrative fairness) & visual quality (rendering, brand compliance, readability, composition).
