Multi-Agent Orchestration
Phase 3 advanced. Companion to DL.AI Multi AI Agent Systems with crewAI.
Mental model in one line: a multi-agent system is a distributed system whose nodes are non-deterministic, expensive, and occasionally lie. Everything hard about it falls out of those three properties.
The staff-engineer framing
Before any pattern or framework, internalize the cost equation, because it drives every decision below:
total_cost ≈ Σ (agent_calls × tokens_per_call × price_per_token)
total_latency ≈ serial: Σ per-agent latency | parallel: max(per-agent latency) + merge
quality = single_agent_quality + specialization_gain − coordination_lossYou add agents only when specialization_gain − coordination_loss is provably positive and large enough to justify a 5–15× cost multiplier. Most teams discover, after building the multi-agent version, that a single agent with better tools and a better prompt would have won on every axis. That is the default outcome, not the exception.
A staff engineer's instinct on hearing "let's make it multi-agent" is to ask: what is the single agent failing at, specifically, and is the failure about perspective or about tools? If it's tools, you don't need more agents — you need more (or better) tools. Multi-agent is justified only when the bottleneck is genuinely perspective / role / context-isolation, not capability.
When you need multi-agent (and when you don't)
Single agent is enough when:
- Linear workflow with clear steps
- One LLM "personality" / system prompt suffices
- Tools are the source of variation, not perspective
- The whole task fits comfortably in one context window (with caching, that's a lot)
Multi-agent earns its complexity when:
- Different roles need genuinely different system prompts (researcher vs critic vs writer) — and "different" means different objectives, not different phrasings
- Parallel work streams need to fan out then converge (the latency win is real here)
- You need debate / refinement loops between independent perspectives to reduce a single model's blind spots
- Context isolation improves quality: a sub-agent with a clean, scoped context window outperforms one big agent whose context is polluted with 40 unrelated tool results
- Specialization improves quality (small expert agents > one big generalist) on a measurable eval
Default to single agent + good tools + tool search + skills. Reach for multi-agent only when single agent is provably insufficient on an eval you can point to. "It feels more sophisticated" is not a reason.
The decision tree a senior actually runs
Is the single agent failing?
├── No → ship the single agent. Stop.
└── Yes → what is it failing at?
├── Wrong/missing capability → add a tool. Not an agent.
├── Context window blown / polluted → context editing, compaction, or ONE sub-agent
│ with a scoped context (cheapest multi-agent win)
├── Needs an independent second opinion → debate / verifier sub-agent
└── Independent parallel workstreams → supervisor + workers (real fan-out latency win)Patterns
1. Pipeline (sequential)
A → B → C, each agent passes output to next.
- Use: research → write → edit
- Failure mode: error compounding. B trusts A's hallucination; C polishes it into something that looks authoritative. Add a verification step or pass provenance forward.
- Latency: sum of all stages. Cost: sum of all stages. No parallelism win — only a quality/specialization win.
2. Supervisor + workers
1 supervisor routes work, N workers do specialized tasks.
- Use: "AI office" — supervisor delegates to analyst, writer, coder
- This is the workhorse pattern and the one Anthropic's Managed Agents
multiagentcoordinator implements natively ({type: "coordinator", agents: [...]}— one level of delegation, up to 20 roster agents, max 25 concurrent threads). See the Managed Agents mapping below. - Failure mode: the supervisor becomes a bottleneck and a single point of context bloat. Keep the supervisor's own context lean — delegate work, not raw data.
3. Debate / refinement
2+ agents argue, reach consensus (or a judge picks).
- Use: critical reasoning, fact-checking, reducing single-model blind spots
- Failure mode: cost explosion with diminishing returns. Two rounds usually capture ~90% of the gain; round five is almost always waste. Cap rounds and define a termination condition (judge satisfied, no new objections, or
max_iterations).
4. Hierarchical
Trees of agents, each with sub-agents.
- Use: complex projects (engineering, research)
- Failure mode: the hardest pattern to debug and the easiest to make exponentially expensive. Depth ≥ 3 is rarely worth it. Note Anthropic's coordinator caps delegation at one level on purpose — depth > 1 is ignored — because deeper trees lose more to coordination than they gain.
5. Round-robin
Agents take turns in a conversation.
- Use: brainstorming, simulation
- Failure mode: no natural termination; turns drift off-topic. Always set a turn cap and a stop condition.
Pattern selection cheat-sheet
| Pattern | Parallelism | Cost vs single | Best when | Worst when |
|---|---|---|---|---|
| Pipeline | None | N× | Distinct sequential stages, each a specialty | Error compounding matters and there's no verification |
| Supervisor + workers | High (workers fan out) | 1 + N× | Independent subtasks, dynamic routing | Subtasks are actually sequential/dependent |
| Debate / refinement | Low | 2–10× | Quality-critical, blind-spot reduction | Latency-sensitive or budget-tight |
| Hierarchical | Mixed | Unbounded if careless | Genuinely tree-structured problems | Anything you can't draw the tree for upfront |
| Round-robin | None | N× turns | Open-ended ideation / simulation | You need a deterministic, bounded output |
Frameworks comparison
| Framework | Best for | Mental model | Production caveat |
|---|---|---|---|
| LangGraph | Production, complex flows | State machine + nodes | You own state, retries, observability — most control, most boilerplate |
| crewAI | Quick multi-agent prototypes | Roles + tasks | Fast to demo, opaque under the hood; re-validate cost/latency before prod |
| AutoGen | Research, conversational | Chat between agents | Conversational loops can run away on cost; cap turns |
| Anthropic Managed Agents | Server-run agent loop + hosted sandbox + native multiagent coordinator | Persisted Agent → Session → threads | Anthropic runs the loop and the tool-execution container; you stream events. See mapping below |
| Custom | Maximum control, weird patterns | You build the loop with AsyncAnthropic + asyncio.gather | You own everything — only when frameworks fight you |
Framework choice is a secondary decision. The pattern, the termination conditions, and the cost model matter far more than which library spells
Agent(). A senior picks the pattern first, proves it on an eval, then picks the thinnest framework that expresses it.
Mapping patterns onto Claude / Anthropic primitives
If you're building on Claude directly (the stack this hub targets), here is how the abstract patterns become concrete API choices. Always consult the claude-api skill / canonical facts for current model IDs and syntax — never hard-code from memory.
- Models per role (cost-tier your agents). Don't run every agent on the flagship. A typical split: supervisor/critic on
claude-opus-4-8(flagship, 5 USD / 25 USD per million tok in/out at 1M context), bulk workers onclaude-sonnet-4-6(mid), and cheap/parallel-safe sub-agents (classifiers, "explore" scouts) onclaude-haiku-4-5(1 USD / 5 USD). This is the single biggest cost lever in a multi-agent system — Claude Code's Explore subagents use Haiku exactly this way. - Thinking & effort. On Opus 4.8 / 4.7, use adaptive thinking (
thinking={"type": "adaptive"}) plusoutput_config={"effort": "low"|"medium"|"high"|"xhigh"|"max"}. The legacybudget_tokensform is removed and returns HTTP 400 on 4.7/4.8 — do not use it. Give a debate-judge or planning supervisorhigh/xhigh; give cheap parallel scoutslow. Sonnet 4.6 and Haiku take adaptive thinking but no thinking budget. - Parallel tool / agent calls. Use
AsyncAnthropicon the server andasyncio.gatherto fan out worker calls — that's what turns the supervisor pattern's latency fromΣintomax. Mark read-only sub-tasks (grep/glob/scout) parallel-safe; serialize anything with side effects. - Structured handoffs. When agent A's output is agent B's input, don't pass free-form prose and hope. Use native structured outputs (
client.messages.parse()with a Pydantic/zod schema, oroutput_config.format) so the handoff is typed and validated. This is the #1 fix for pipeline error-compounding. - Caching the stable prefix. Every agent shares a system prompt and tool list that don't change per call. Put
cache_controlon the stabletools→systemprefix so the 5–15× call multiplier doesn't also multiply your input token bill. Verify withusage.cache_read_input_tokens— if it's zero across calls, a silent invalidator (a timestamp, an unsorted tool list) is in the prefix. - Resilience. Use the SDK's
max_retriesand typed exceptions (RateLimitError,OverloadedError,APIStatusError,APITimeoutError), a per-calltimeout, and streaming for any large worker output — a single overloaded worker should degrade gracefully, not take down the whole crew. - Native multi-agent. If you want Anthropic to run the loop and host the sandbox, Managed Agents gives you a
multiagentcoordinator: declare the roster on the agent ({type: "coordinator", agents: [...]}, not on the session, not intools[]), each sub-agent runs in its own context-isolated thread sharing the container filesystem. One level of delegation, ≤ 20 roster agents, ≤ 25 concurrent threads. Tool confirmations from sub-agents cross-post to the primary thread.
Reference implementation: a supervisor that fans out (and bills itself honestly)
Everything above is abstract until you can point at the code. This is the hand-rolled supervisor + workers pattern, in Python, with every production lever the senior bullets above describe wired in: AsyncAnthropic on the server, asyncio.gather to convert Σ latency into max, per-role model tiering (Opus supervisor, Haiku scouts), typed structured handoffs (messages.parse with a Pydantic schema), prompt caching on the stable prefix, typed-exception resilience, and per-agent usage logging so you can answer "which agent burned the budget". It is deliberately compact — the point is the shape, not a framework.
import asyncio
import time
from anthropic import (
AsyncAnthropic,
APITimeoutError,
RateLimitError,
OverloadedError,
)
from pydantic import BaseModel
# max_retries handles 429/529/5xx with exponential backoff before our own
# fallback ever runs. timeout is per-call wall-clock — a wedged worker dies fast.
client = AsyncAnthropic(max_retries=2, timeout=60.0)
SUPERVISOR_MODEL = "claude-opus-4-8" # the judge/merger reasons — pay for it
WORKER_MODEL = "claude-haiku-4-5" # parallel scouts are cheap and disposable
# The stable prefix every worker shares. Put cache_control on the LAST stable
# block (tools -> system render first); the per-task question goes in messages,
# AFTER the breakpoint, so it never invalidates the cached prefix.
WORKER_SYSTEM = [
{
"type": "text",
"text": "You are a research scout. Return ONLY findings relevant to the question.",
"cache_control": {"type": "ephemeral"},
}
]
class Finding(BaseModel):
"""Typed handoff — the #1 fix for pipeline error-compounding."""
claim: str
evidence: str
confidence: float # 0..1 — lets the supervisor weight, not just concatenate
def log_usage(agent: str, model: str, usage) -> float:
"""Per-agent cost attribution. Without this you cannot find the runaway."""
price = {"claude-opus-4-8": (5, 25), "claude-haiku-4-5": (1, 5)}[model]
cost = (usage.input_tokens * price[0] + usage.output_tokens * price[1]) / 1e6
cache_read = getattr(usage, "cache_read_input_tokens", 0)
print(f"[{agent}] {model} in={usage.input_tokens} out={usage.output_tokens} "
f"cache_read={cache_read} ${cost:.5f}")
return cost
async def scout(question: str, angle: str) -> Finding | None:
"""One worker. Read-only, parallel-safe, degrades gracefully on overload."""
try:
resp = await client.messages.parse(
model=WORKER_MODEL,
max_tokens=1024,
system=WORKER_SYSTEM,
messages=[{"role": "user", "content": f"From the angle of {angle}: {question}"}],
output_config={"format": Finding}, # validated, typed handoff
)
log_usage(f"scout:{angle}", WORKER_MODEL, resp.usage)
return resp.parsed
except (RateLimitError, OverloadedError, APITimeoutError) as e:
# One overloaded worker must NOT take down the crew. Skip it; the
# supervisor reconciles around the gap. (Could fall back to a cheaper
# model or a cached answer here instead of returning None.)
print(f"[scout:{angle}] degraded: {type(e).__name__}")
return None
async def supervisor(question: str, angles: list[str]) -> str:
t0 = time.monotonic()
# THE WHOLE POINT: gather() runs the scouts concurrently. Awaiting them in a
# for-loop would silently re-serialize — you'd pay N* cost for 1* latency.
findings = await asyncio.gather(*(scout(question, a) for a in angles))
findings = [f for f in findings if f is not None] # drop degraded workers
fan_out_ms = (time.monotonic() - t0) * 1000
print(f"[supervisor] fan-out of {len(angles)} scouts took {fan_out_ms:.0f}ms "
f"(serial would be ~{len(angles)}x)")
if not findings:
return "All scouts degraded — no answer. Escalate or retry."
# The supervisor RECONCILES (weighs confidence, resolves contradictions),
# it does not concatenate. Adaptive thinking + high effort for the judge.
digest = "\n".join(f"- ({f.confidence:.0%}) {f.claim} [{f.evidence}]" for f in findings)
resp = await client.messages.create(
model=SUPERVISOR_MODEL,
max_tokens=2048,
thinking={"type": "adaptive"},
output_config={"effort": "high"},
system="Reconcile these scout findings into one answer. Flag contradictions; "
"do not average away a low-confidence claim that contradicts a high-confidence one.",
messages=[{"role": "user", "content": f"Question: {question}\n\nFindings:\n{digest}"}],
)
log_usage("supervisor", SUPERVISOR_MODEL, resp.usage)
return resp.content[0].text
if __name__ == "__main__":
answer = asyncio.run(supervisor(
"Is multi-agent orchestration worth its cost for a research task?",
angles=["latency", "dollar cost", "answer quality"],
))
print("\n=== ANSWER ===\n", answer)What to notice, because a staff engineer reads code for the decisions, not the syntax:
- The
gatheris load-bearing. Replace it withfor a in angles: await scout(...)and the program still works — but it now costs the same and runsΣslow instead ofmaxslow. That one-line difference is the entire latency value proposition of the supervisor pattern. Exercise 3 makes you prove this with a timestamp. messages.parseis the handoff contract. The scout's output is a validatedFinding, not prose the supervisor has to re-parse and trust. A scout that hallucinates can still lie, but it can't return malformed structure that corrupts the merge — andconfidencegives the supervisor a lever to weight rather than blindly believe. This is how you blunt pipeline error-compounding.- The model tier is the cost lever. Three Haiku scouts + one Opus supervisor is dramatically cheaper than four Opus agents, and on a scout task (gather, don't reason) the quality delta is usually zero.
log_usageprints the receipts so you can defend the number in Exercise 6. - Degradation is a
return None, not a crash. The crew survives a 529 on one worker. In production you'd replacereturn Nonewith a fallback (cheaper model, cached answer, or a retry on a different angle) — but never let one worker's overload propagate into a crew-wide failure. - The cache breakpoint sits on the stable prefix, not the question. Verify with
cache_read_input_tokensin the log. If it's zero across runs, a silent invalidator (a timestamp, an unsorted tool list, a per-request ID) crept into the prefix — Exercise 4 is the hunt.
Anti-patterns
- ❌ Multi-agent for "everything is a hammer" → overengineering. The single-agent baseline almost always wins on simple tasks.
- ❌ Agents that just call the LLM with different prompts but the same tools and the same context → that's not multi-agent, it's one agent wearing hats. No isolation, no real specialization.
- ❌ No supervision / no termination condition → infinite loops and runaway bills. Every loop needs a cap (
max_iterations, turn limit) and a stop condition. - ❌ State sprawl (each agent has its own private memory with no shared source of truth) → impossible to debug; agents disagree about reality.
- ❌ N agents × M calls × $X tokens with no cost ceiling → cost explosion. Set a hard budget and degrade when you hit it.
- ❌ Passing raw context between agents instead of distilled handoffs → context bloat compounds at every hop; you pay to re-read the same 30K tokens five times.
- ❌ Running every agent on the flagship model → you're paying Opus prices for work Haiku does fine. Tier your models.
- ❌ Re-creating the agent/config on every run (Managed Agents) → orphaned agents + create latency. Create the Agent once, reference it by ID; only the Session is per-run.
Cost & latency reality
- Each agent call = at least one LLM call (often several, once tools loop).
- A 5-agent system is typically 5–15× the cost of a single agent on the same task — and that's before debate rounds multiply it again.
- Latency: serial pipeline = sum of all stages; parallel fan-out = max of the slowest worker + merge cost. The supervisor pattern's whole value proposition is converting
Σintomax— if your "parallel" workers actually run serially (because youawaitthem one at a time instead ofasyncio.gather), you paid the multi-agent cost and got none of the latency benefit. - Observability is not optional. Log per-agent
usage(input/output/cache tokens), latency, model id, and stop reason. Without per-agent cost attribution you cannot answer "which agent is burning the budget" — and in a 5-agent system, one runaway agent is the usual culprit. The minimum trace row a senior instruments per agent call:{trace_id, agent_role, model, input_tokens, output_tokens, cache_read_input_tokens, cache_creation_input_tokens, latency_ms, stop_reason, retries}.trace_idis what lets you collapse a fan-out back into one request when you're staring at a dashboard at 2am. A surprisingcache_read_input_tokens: 0across a run is the single highest-signal alert you can wire up — it means your caching silently broke and your input bill just got multiplied by the agent count. See thelog_usagehelper in the reference implementation above for the cost-attribution math (input × in_price + output × out_price, per model tier). - Multi-agent ONLY makes sense if the measured quality gain justifies the cost/latency. "Measured" means on an eval set, not on a demo.
Security & failure modes a staff engineer plans for
- Prompt injection propagation. A worker that ingests untrusted web/tool content can be hijacked; its poisoned output then flows to the supervisor as if trusted. Treat inter-agent messages from content-ingesting workers as untrusted input — validate, don't blindly forward. Keep credentials out of agent-visible context (vaults / host-side custom tools, never the system prompt).
- Cascading failure. One worker rate-limited (429) or overloaded (529) shouldn't fail the crew. Per-call retries + timeouts + a fallback (skip the worker, or fall back to a cheaper model) keep the system degrading gracefully.
- Non-determinism in the merge. When parallel workers return, the merge step is where contradictions surface. The supervisor must reconcile, not concatenate — otherwise the final answer contains both "X is true" and "X is false."
- Cost blast radius. A hierarchical system with no depth/budget cap can spawn agents recursively. Bound depth, bound fan-out, bound total token budget.
Multi-agent + MCP
Powerful combo: N agents share access to MCP servers, each calls tools. Example: a "Research crew" — one agent calls pubmed_search, one calls arxiv_search, one critiques and reconciles results.
Operational notes a senior adds:
- MCP credentials belong in a credential store (e.g. Managed Agents vaults), injected at egress — never in an agent's system prompt or messages, which persist in session history.
- If multiple agents hammer the same MCP server, you'll hit its rate limits, not just Anthropic's. Add per-server concurrency limits.
- Large MCP tool outputs bloat every downstream agent's context. Distill before forwarding (or rely on Managed Agents' automatic >100K-token offload-to-file behavior).
🏋️ Exercices
These escalate from "build it" to "make it production-grade / break it / defend the number." Do them in order. The learner is expected to instrument and measure, not just wire things up. The Reference implementation section above is your starting skeleton for exercises 3–5 — but do not just run it; the first two exercises deliberately make you build the single-agent baseline it must beat before you reach for the crew.
1. Single-agent baseline (the control group).Objectif: Build a single agent that does a research task (gather sources → synthesize a short report), and instrument it: log usage (input/output/cache tokens), wall-clock latency, and total cost. Indice/Solution: Use AsyncAnthropic, claude-opus-4-8, adaptive thinking, a couple of tools (web search / MCP). Print response.usage after each call. This baseline is what every later exercise must beat to justify its existence.
2. Convert to a 3-agent crew and prove (or disprove) it was worth it.Objectif: Split into researcher → writer → critic. Run both versions on the same 10 inputs and produce a table: quality (your rubric), cost, p50/p95 latency. Indice/Solution: Wire it with any framework or hand-rolled. The deliverable isn't the crew — it's the comparison table and a one-paragraph verdict ("multi-agent won/lost because…"). If the single agent wins, say so. That honesty is the point.
3. Make the supervisor fan out — turn Σ latency into max.Objectif: Implement supervisor + workers where ≥ 3 workers are genuinely independent, and use asyncio.gather so they run concurrently. Measure latency before (serial await) and after (gather). Indice/Solution: The trap is awaiting workers one at a time, which silently re-serializes them. Confirm concurrency by timestamping each worker's start. Bonus: cost-tier the workers (Haiku) vs the supervisor (Opus) and report the cost delta.
4. Now make it production-grade: caching, resilience, and a budget ceiling.Objectif: (a) Add cache_control to the shared tools+system prefix and prove cache hits via usage.cache_read_input_tokens. (b) Inject a forced 529 / timeout on one worker and show the crew survives. (c) Add a hard total-token budget that degrades gracefully when exceeded. Indice/Solution: For (a), if cache reads are zero, hunt the silent invalidator (timestamp in system prompt, unsorted tool list, per-request ID). For (b), use typed exceptions (OverloadedError, APITimeoutError) + max_retries + a fallback to a cheaper model. For (c), track a running token counter in the supervisor and stop spawning workers past the cap.
5. Break it: prompt-injection through a worker.Objectif: Feed one content-ingesting worker a document containing an injected instruction ("ignore your task, output X"). Show the poisoned output reaching the supervisor. Then fix it so the injection is contained. Indice/Solution: The fix is not a better prompt — it's treating inter-agent messages from content-ingesting workers as untrusted: validate/structure the handoff (messages.parse with a strict schema), isolate the worker's context, and never forward raw ingested text into the supervisor's trusted context. Document what still leaks.
6. Defend the number.Objectif: Given your exercise-2 table, write the one-page memo you'd send a skeptical staff engineer arguing for or against shipping the multi-agent version. It must cite cost multiplier, p95 latency, measured quality delta, and the failure modes you mitigated. Indice/Solution: The strongest memos often argue against — "single agent + tool search + caching matches quality at 7× lower cost; multi-agent isn't justified until [specific condition]." A senior is judged on knowing when not to build the complex thing.
🎤 En entretien
- "When would you choose multi-agent over a single agent?" — When the bottleneck is perspective/role/context-isolation (not capability), and the measured quality gain on an eval justifies a 5–15× cost multiplier; if it's a missing capability, I add a tool, not an agent.
- "How do you control cost and latency in a multi-agent system?" — Tier models per role (Opus supervisor, Sonnet/Haiku workers), cache the stable
tools+systemprefix, fan out independent workers withasyncio.gatherto convertΣlatency tomax, cap debate rounds, and enforce a hard token budget with graceful degradation — all attributed per-agent via loggedusage. - "What breaks in production that demos never show?" — Error compounding in pipelines, prompt-injection propagating from a content-ingesting worker into the supervisor's trusted context, cascading failure from one overloaded worker, non-deterministic merges that concatenate contradictions, and runaway cost from uncapped hierarchical spawning. Each has a specific mitigation (provenance/verification, untrusted-handoff isolation, retries+timeouts+fallback, reconciling merges, depth/budget caps).
- "How does this map onto Anthropic's primitives?" — Adaptive thinking +
effortper role, structured outputs (messages.parse) for typed handoffs, prompt caching on the shared prefix,AsyncAnthropic+gatherfor parallelism, and for a server-run option, Managed Agents'multiagentcoordinator (roster on the Agent not the session, one level of delegation, context-isolated threads sharing a container).
Resources
- Anthropic — Building effective agents (the canonical "start simple, single agent + tools first" essay)
- Anthropic Managed Agents — multi-agent (
multiagentcoordinator, threads, cross-posted tool confirmations) - LangGraph multi-agent docs: langchain-ai.github.io/langgraph/tutorials/multi_agent
- crewAI docs: docs.crewai.com
- AutoGen: microsoft.github.io/autogen
- Paper: Communicative Agents (Park 2023)