The number Anthropic put in its own engineering blog.

In June 2025, Anthropic published a detailed writeup of how it built its multi-agent research system: an orchestrator model that plans a research task and fans it out to several subagents working in parallel. Past the architecture diagrams is the number that should be on every AI infrastructure team's dashboard:

"agents typically use about 4× more tokens than chat interactions, and multi-agent systems use about 15× more tokens than chats."

Source: Anthropic Engineering, "How we built our multi-agent research system," June 13, 2025.

That is not a warning buried in a limitations section. It is stated as the cost of the architecture doing its job well. Anthropic's own data explains why teams pay it anyway: token usage alone explained 80% of the performance variance in its BrowseComp evaluation, and on its internal research eval an Opus-lead-plus-Sonnet-subagents configuration beat a single Opus agent working alone by 90.2%.

Multi-agent systems are not a novelty pattern anymore. They are measurably better at the tasks they are built for. On Anthropic's own numbers, the single-agent architecture most cost-optimization advice is still written for already costs about four times as much as chat, and multi-agent takes that already-inflated baseline roughly 3.75x further (15 / 4).

Multi-agent went from research pattern to standardized infrastructure in about six months.

Google donated its Agent2Agent (A2A) protocol to the Linux Foundation in June 2025, with more than 100 companies, including AWS, Cisco, Microsoft, Salesforce, SAP, and ServiceNow, signing on as founding supporters. Source: Linux Foundation, "Linux Foundation Launches the Agent2Agent Protocol Project," June 2025.
Six months later, Anthropic donated the Model Context Protocol to a newly formed Agentic AI Foundation, co-founded with Block and OpenAI and backed by Google, Microsoft, AWS, Cloudflare, and Bloomberg, reporting more than 10,000 active public MCP servers, 97 million-plus monthly SDK downloads, and 75+ Claude connectors already in production. Source: Anthropic, "Donating the Model Context Protocol and establishing the Agentic AI Foundation," December 9, 2025.

Two of the largest labs in the industry just standardized how agents talk to each other and to tools, and put the governance in the hands of a neutral foundation instead of either vendor. That is not the behavior of an experimental pattern people will grow out of.

Multi-agent orchestration (an orchestrator or supervisor spawning subagents, each with its own context window, tool bindings, and model call) is now default infrastructure for anything more complex than a single tool-use loop. Its token bill scales with every subagent you add to a workflow, not with every task the workflow completes.

Tokens per task by agent architecture: a single chat turn runs 1,580 tokens (1x), a single ReAct-style agent runs 6,320 tokens (4x), and a multi-agent orchestrator-plus-subagents workflow runs 23,700 tokens (15x). This is an illustrative model calibrated to Anthropic's published multipliers.

Why the multiplier compounds instead of adding up.

A single agent's cost problem is well documented by now: every turn resends the full conversation history, so a 10-step task doesn't pay for 10 turns of context, it pays for the 1+2+3…+10 accumulation of all of them, the same pattern we've measured in ordinary multi-turn conversations. A multi-agent workflow inherits that problem and multiplies it by every subagent in the fan-out, then adds three failure modes a single agent never has to pay for:

Duplicated static context.
The system prompt and the tool-schema catalog aren't shared once across a task. They get reloaded in full by the orchestrator and by every subagent it spawns. The MCP tool-schema tax alone can run past 13,000 tokens per session before any subagent has done anything.

Full-transcript handoffs. The naive way to pass work from an orchestrator to a subagent, and from a subagent back to the orchestrator, is to hand over the entire message history instead of a condensed result. Every hop through the chain re-bills tokens that were already paid for once, upstream.

Fixed per-spawn overhead. Spinning up a subagent costs tokens before it does anything useful: system instructions, tool bindings, role framing. One independent audit of Claude Code's Task-tool subagents measured this fixed cost at roughly 20,000 tokens per spawn, with one traced example of a subagent burning 160,000 tokens on work that would have cost about 3,000 tokens run directly in the parent context. Source: Amit Kothari, "Claude Code: Task Tool vs. Subagents," 2025, updated June 2026. This is an independent practitioner's measurement, not a vendor benchmark, but it matches the shape of Anthropic's own 4x-to-15x multiplier.

Stack those three on top of the 4x single-agent baseline and Anthropic's 15x figure stops looking like an outlier. It looks like the expected outcome of an architecture where every subagent independently pays the system-prompt tax, the tool-schema tax, and the fixed spawn tax that a single agent only pays once.

The fix isn't "don't use multi-agent." It's stop billing every subagent the orchestrator's rate.

None of the published fixes for this ask teams to give up the architecture that's winning on quality. They target the three failure modes above directly.
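Before turning to those fixes, the three failure modes can be put into rough numbers. The sketch below is only an illustration: the function names are hypothetical, the 500-tokens-per-turn figure and the four-subagent shape are placeholders, and only the 13,000-token schema cost and the 20,000-token spawn cost come from the figures cited above.

```python
def accumulated_history_tokens(turns: int, tokens_per_turn: int) -> int:
    """Tokens billed when each turn resends every earlier turn (1 + 2 + ... + n)."""
    return tokens_per_turn * turns * (turns + 1) // 2


def duplicated_static_tokens(agents: int, static_tokens: int) -> int:
    """Static context (system prompt plus tool schemas) reloaded once per agent."""
    return agents * static_tokens


def spawn_overhead_tokens(subagents: int, tokens_per_spawn: int) -> int:
    """Fixed cost paid each time a subagent is spun up."""
    return subagents * tokens_per_spawn


if __name__ == "__main__":
    subagents = 4            # placeholder fan-out
    agents = subagents + 1   # subagents plus the orchestrator

    # 10 turns at a placeholder 500 tokens each: 27,500 tokens billed,
    # versus 5,000 if earlier turns were never resent.
    print(accumulated_history_tokens(10, 500))

    # 13,000-token tool-schema block reloaded by all five agents: 65,000 tokens,
    # versus 13,000 if it were shared once.
    print(duplicated_static_tokens(agents, 13_000))

    # Roughly 20,000 tokens of fixed overhead per spawn: 80,000 tokens.
    print(spawn_overhead_tokens(subagents, 20_000))
```

None of these terms depends on how much useful work the subagents do, which is why the overhead grows with the fan-out rather than with the task.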
| Technique | What it does | Reported effect |
|---|---|---|
| Context editing | Drops stale tool results, including old subagent handoffs, once a session crosses a token threshold | 70,000 → 25,000 tokens, a 64% cut, on Anthropic's own worked example. Source: Anthropic, "Context editing," Claude Platform Docs |
| Result-only handoffs | Subagent returns a condensed answer to the orchestrator, not its full transcript | Standard pattern in Anthropic's own architecture and Claude Code's subagent docs. Source: Claude Code documentation, "Workflows" |
| Supervisor-managed fan-out | A lightweight supervisor role prunes redundant subagent calls before they run | 15.45% supervisor overhead paid back as an 18.9-29.7% net token reduction across GAIA, HumanEval, and AIME. Source: Lin et al., "Stop Wasting Your Tokens: Towards Efficient Runtime Multi-Agent Systems," arXiv:2510.26585, March 2026 |
| Fan-out limits | Cap concurrent and total subagent spawns per task instead of letting depth grow unbounded | Shipped as a default in OpenAI's Codex subagents (6 concurrent threads, depth 1). Source: OpenAI, Codex subagents documentation |
| Role-based model routing | Route each subagent's calls to the cheapest model that role needs, not whatever the orchestrator is running | Worked example below |

The first four are architectural hygiene; most teams that ship a multi-agent workflow have implemented at most one of them. The fifth is the one this post focuses on, because it's the one that doesn't require rewriting your fan-out logic at all.

Implementation: route subagents by role, not by default.

Most multi-agent codebases hardcode one model string and every role (orchestrator, lookup, analysis, verification) calls it. That's the same mistake single-model defaults make everywhere else, just multiplied by the number of subagents in the fan-out.
A bounded, single-purpose lookup subagent and the orchestrator planning the whole task are not the same job, and there's no reason they should be billed on the same model by default.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    ORCHESTRATOR = "orchestrator"
    LOOKUP = "lookup"              # bounded, single-purpose retrieval
    ANALYSIS = "analysis"          # multi-step reasoning over retrieved data
    VERIFICATION = "verification"  # checks a claim against a source


# Map each subagent role to the cheapest model that role has been benchmarked
# to handle. The orchestrator keeps the frontier model -- it's the one place a
# bad call is expensive to recover from.
ROLE_MODEL = {
    Role.ORCHESTRATOR: "claude-opus-4-8",
    Role.LOOKUP: "auto",  # bounded, low-stakes -- let the router pick
    Role.ANALYSIS: "claude-sonnet-4-6",
    Role.VERIFICATION: "auto",
}


@dataclass
class SubagentCall:
    role: Role
    prompt: str
    shared_context_id: str  # points at a cached system-prompt/tool-schema block
    parent_task_id: str


def dispatch(call: SubagentCall, client) -> str:
    """Send a subagent call through the shared gateway, routed by role.

    Returns only the condensed result, never the raw transcript, so the
    orchestrator's context doesn't inherit the subagent's intermediate
    tool calls.
    """
    response = client.chat.completions.create(
        model=ROLE_MODEL[call.role],
        messages=[
            # Reference the shared block by ID instead of resending it.
            {"role": "system", "content": f"@cache:{call.shared_context_id}"},
            {"role": "user", "content": call.prompt},
        ],
        extra_headers={"X-Task-Id": call.parent_task_id},
    )
    return response.choices[0].message.content  # condensed handoff, not the full trace
```

Two changes here have nothing to do with prompt engineering. First, the system prompt and tool schema are referenced once, by ID, and served from cache rather than retransmitted by every subagent.
Second, every role except the orchestrator points at model="auto" or at a specific, cheaper fixed model, so a lookup subagent doing a bounded retrieval never runs on the same model the orchestrator uses to plan the entire task.

Nadir sits at exactly that dispatch point: point every subagent's OpenAI-compatible call at Nadir with model=auto, and each one routes independently (by role, cost, and observed difficulty) without a hardcoded model string anywhere in your orchestration code. Prompt caching handles the shared system prompt and tool catalog every subagent would otherwise reload from scratch, and Context Optimize strips redundant structure out of whatever handoff payload still crosses the wire.

What the multiplier costs at 10,000 tasks a day.

Model a workflow with one orchestrator and four subagents (two bounded lookups, one multi-step analysis pass, one verification check) running 10,000 times a day. Priced entirely on Claude Opus 4.8 ($5/M input, $25/M output) because that's what "every role calls the same model" defaults to in practice, the naive version costs $54,750 a month.

Keep the orchestrator's planning and synthesis role on Opus, route the two lookups and the verification pass to a small, fast model, run the analysis step on a mid-tier model, and the same 10,000 tasks a day cost $27,600 a month. Same five roles. Same task. A 49.6% reduction, without changing what the orchestrator is allowed to do or how the workflow is structured.

Monthly cost for a 5-agent workflow at 10,000 tasks a day: naive, with every agent on the frontier model, costs $54,750/month; routed by role, with the orchestrator staying on the frontier model and subagents routed to cheaper models by task, costs $27,600/month, a 49.6% reduction.

These figures are illustrative, modeled on the published pricing and multipliers cited throughout this post, not measurements from a single production deployment.
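The headline reduction follows directly from the two monthly totals. The check below is a minimal sketch: the helper names are hypothetical, and the 30-day month used for the per-task figure is an assumption, since the post does not state the month length behind its totals.

```python
NAIVE_MONTHLY_USD = 54_750.0   # every role on the frontier model (figure from this post)
ROUTED_MONTHLY_USD = 27_600.0  # routed by role (figure from this post)
TASKS_PER_DAY = 10_000
DAYS_PER_MONTH = 30            # assumption, not stated in the post


def percent_reduction(before: float, after: float) -> float:
    """Relative saving of `after` against `before`, in percent."""
    return (before - after) / before * 100


def cost_per_task(monthly_usd: float, tasks_per_day: int, days: int) -> float:
    """Average cost of one task, given a monthly bill and a daily task count."""
    return monthly_usd / (tasks_per_day * days)


if __name__ == "__main__":
    reduction = percent_reduction(NAIVE_MONTHLY_USD, ROUTED_MONTHLY_USD)
    print(f"reduction: {reduction:.1f}%")  # reduction: 49.6%
    print(f"monthly saving: ${NAIVE_MONTHLY_USD - ROUTED_MONTHLY_USD:,.0f}")  # $27,150
    for label, total in (("naive", NAIVE_MONTHLY_USD), ("routed", ROUTED_MONTHLY_USD)):
        per_task = cost_per_task(total, TASKS_PER_DAY, DAYS_PER_MONTH)
        print(f"{label}: ${per_task:.4f} per task")
```

Swapping in your own monthly totals and task volume is the quickest way to see whether a routing change of this kind is worth measuring in production.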
Your real numbers depend on how many subagents you fan out to, how much of their context is genuinely shared, and how aggressively you prune handoffs. But the direction, and the fact that it's the routing layer doing the work rather than a smaller model everywhere, holds regardless of the exact fan-out shape.

Checklist before you add a second agent.

Measure your current fan-out. Log tokens per subagent spawn, separated from tokens per orchestrator turn. Most teams have only ever measured the total.

Cache the shared context once. The system prompt and tool schema should be referenced, not retransmitted, by every subagent in the fan-out.

Return results, not transcripts. A subagent's handoff to the orchestrator should be the answer, not the trace that produced it.

Cap the fan-out. Unbounded subagent spawning is where the token-variance patterns documented in single-agent research (the same task costing 30x more on one run than another) show up multiplied across every branch.

Route by role, not by default. The orchestrator's planning step and a subagent's bounded lookup are not the same task, and shouldn't default to the same model.

Conclusion.

Anthropic didn't publish the 15x number as a cautionary tale. They published it because the architecture it describes wins on quality, and they use it in production anyway. That's the right call for tasks where the value of the extra performance clears the extra cost.

What Anthropic's post doesn't spend much time on, because it isn't their problem to solve, is that almost no team fanning out to subagents today routes each one to the model its role actually needs. The orchestrator and a bounded lookup subagent get billed identically, by default, in most multi-agent codebases shipping right now. Fixing that isn't a rewrite. It's a routing decision at the one point in the stack, the dispatch call, where every subagent's request already passes through.
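As a starting point for the first and fourth checklist items, the sketch below keeps orchestrator tokens and per-spawn tokens in separate counters and refuses to spawn past a cap. `TaskLedger`, `FanOutLimitError`, and the method names are hypothetical, not part of any SDK mentioned in this post; the token counts would come from whatever usage data your gateway reports.

```python
from dataclasses import dataclass, field


class FanOutLimitError(RuntimeError):
    """Raised when a task tries to spawn more subagents than its cap allows."""


@dataclass
class TaskLedger:
    max_spawns: int                      # total spawns allowed for this task
    orchestrator_tokens: int = 0
    spawn_tokens: list[int] = field(default_factory=list)

    def record_orchestrator_turn(self, tokens: int) -> None:
        """Add the tokens billed for one orchestrator turn."""
        self.orchestrator_tokens += tokens

    def start_spawn(self) -> int:
        """Reserve a spawn slot before the subagent runs; raise once the cap is hit."""
        if len(self.spawn_tokens) >= self.max_spawns:
            raise FanOutLimitError(f"fan-out cap of {self.max_spawns} spawns reached")
        self.spawn_tokens.append(0)
        return len(self.spawn_tokens) - 1

    def add_spawn_tokens(self, spawn_id: int, tokens: int) -> None:
        """Add tokens billed to one specific subagent spawn."""
        self.spawn_tokens[spawn_id] += tokens

    def summary(self) -> dict[str, int]:
        """Report orchestrator and subagent tokens separately, not just the total."""
        return {
            "orchestrator_tokens": self.orchestrator_tokens,
            "spawn_count": len(self.spawn_tokens),
            "spawn_tokens_total": sum(self.spawn_tokens),
            "spawn_tokens_max": max(self.spawn_tokens, default=0),
        }
```

Calling `start_spawn()` before each dispatch and `add_spawn_tokens()` after it returns gives a per-task breakdown in which an expensive spawn shows up on its own line instead of disappearing into the total.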
Data in this post is illustrative, modeled from the published research and documentation cited throughout, not derived from proprietary production traces.

Sources:

- Anthropic Engineering, "How we built our multi-agent research system," June 13, 2025.
- Linux Foundation, "Linux Foundation Launches the Agent2Agent Protocol Project," June 2025.
- Anthropic, "Donating the Model Context Protocol and establishing the Agentic AI Foundation," December 9, 2025.
- Amit Kothari, "Claude Code: Task Tool vs. Subagents," 2025, updated June 2026.
- Anthropic, "Context editing," Claude Platform Docs.
- Claude Code documentation, "Workflows."
- Lin et al., "Stop Wasting Your Tokens: Towards Efficient Runtime Multi-Agent Systems," arXiv:2510.26585, March 2026.
- OpenAI, Codex subagents documentation.
- Anthropic, Claude Opus 4.8 and Sonnet 4.6 pricing, 2026.