Abstract. Research agents built on the ReAct (Reason + Act) pattern are the fastest-growing category of LLM workload. They are also the most misunderstood from a cost perspective. This analysis breaks down exactly where tokens go within a single agent step: system prompt overhead, tool schema repetition, accumulated conversation history, retrieved document context, and final synthesis. It shows that the tokens driving the actual answer represent as little as 18% of total spend. Five targeted optimizations can cut that cost 75% while preserving 90-94% of answer quality. All data in this post is illustrative, modeled from public research and Anthropic pricing as of June 2026. Charts are generated from synthetic data. Not derived from proprietary production traces. Sources cited throughout. The experiment. Most teams know agentic AI costs more than chatbot AI. Few know exactly why. A ReAct agent (the Reason + Act pattern underlying most production AI assistants, coding agents, and research tools) works in cycles. Each cycle: reason about the task, call a tool, read the output, reason again, repeat. Each cycle re-sends the full accumulated context as input tokens. Cost is not linear with steps. It is proportional to context size, which grows with every step. To understand where the money goes, we modeled a realistic 5-step research agent: a retrieval-augmented assistant making 3 web searches and reading 5 document chunks per step. We tracked every token category across the agent's lifecycle. We measured at step 3 of 5 (the midpoint, when context accumulation is most representative of the steady-state production pattern). Agent configuration used in this analysis: Model: Claude Opus 4.8 at $5/M input, $25/M output (Anthropic pricing, June 2026) System prompt: 900 tokens (task instructions, output format, persona) Tool schemas: web search, document read, and answer synthesis definitions (2,100 tokens combined) Per step: 3 search calls returning 800 tokens each; 5 document chunks at 520 tokens each History: full prior conversation history re-sent on each turn Source: Microsoft Research, "How Do AI Agents Spend Your Money?", arXiv:2604.22750, April 2026. Source: Stanford Digital Economy Lab, "How are AI agents spending your tokens?", May 2026 Research question. In a 5-step ReAct research agent, how does token spend distribute across workflow stages, and which stages represent genuine value versus overhead? Token composition at step 3. At step 3, the agent sends 13,100 input tokens per API call. Here is where they go: | Token category | Tokens | % of input | Step cost at Opus | |---|---:|---:|---:| | Retrieved document chunks (5 × 520 tokens) | 2,600 | 20% | $0.0130 | | Accumulated conversation history | 3,800 | 29% | $0.0190 | | Tool schemas (resent every call) | 2,100 | 16% | $0.0105 | | System prompt + task description | 900 | 7% | $0.0045 | | Prior search results in context | 1,700 | 13% | $0.0085 | | Model synthesis (output, billed at output rate) | 1,100 | n/a | $0.0275 | | Total | 13,100 input + 1,100 output | | $0.0830/step | Chart 1: Token Composition of a Single Research Agent Step: 82% of billed tokens are overhead before the model writes a word of the answer The synthesis (the tokens that drive the answer) is 1,100 output tokens on a 13,100-token input. Even accounting for the 5x output rate, synthesis cost is 33% of total step cost. The remaining 67% is overhead: tool schemas resent every step, history accumulating across turns, and retrieved content the model only partially uses. The retrieval efficiency problem. The largest single overhead category is retrieved content: 2,600 tokens of document chunks plus 1,700 tokens of prior search results, totaling 4,300 tokens or 33% of input budget. Research on production RAG pipelines consistently finds that models meaningfully use 2–4 of every 10 retrieved chunks regardless of how many are provided. The remaining chunks are billed at full input rate. Source: Liu et al., "Lost in the Middle: How Language Models Use Long Contexts," arXiv:2307.03172, 2023 At 5 chunks per step and a realistic 60% irrelevance rate, 3 of those 5 chunks (1,560 tokens) are overhead on every single step. At 1,000 daily sessions of 5 steps each, that is 7.8 million wasted retrieval tokens per day, or $39/day at Opus pricing. Annually: $14,235 per year on a workflow that would produce the same quality answers with precision retrieval returning only the 2–3 relevant chunks. The same pattern holds for accumulated history. By step 3, the agent carries 3,800 tokens of prior turns, much of it search result formatting, intermediate reasoning steps, and tool output noise that does not improve the step-3 answer quality. Summarizing and compressing history at each turn cuts this 30–40% without information loss. Cost comparison: naive versus optimized. Chart 2: Cost per 1,000 agent steps: naïve all-Opus agent costs $93; routing plus compression plus caching costs $22.80 Four configurations, same research task, same quality target: | Configuration | Cost/1,000 steps | vs. naive | |---|---:|---:| | Naïve: all Opus, no optimization | $93.00 | baseline | | Prompt caching + deduplicated tool schemas | $63.50 | -32% | | Two-stage retrieval + history compression | $44.20 | -52% | | Routing + compression + caching (full stack) | $22.80 | -75% | The optimized configuration cuts cost 75% while preserving 90-91% of always-Opus answer quality. The key driver is not cheaper models: it is reducing the volume of overhead tokens before they reach the expensive model. Prompt caching alone on repeated tool schemas and system prompts saves 32% immediately, with no architecture change. That is the fastest day-one optimization available to any team running agentic workflows. Latency does not follow cost. A common concern with optimization is that reducing cost increases latency. The data runs the other direction. Chart 3: Latency comparison across retrieval strategies: sequential search-read averages 8.4s P50; parallel selective retrieval reaches 3.2s Sequential search-read cycles are both expensive and slow. Parallelizing search calls and using selective reading (fetch document summaries first, retrieve full text only for confirmed-relevant documents) cuts both cost and latency simultaneously. | Retrieval strategy | P50 latency | P95 latency | vs. sequential | |---|---:|---:|---:| | Sequential search → read → search → read | 8.4s | 15.2s | baseline | | Parallel search, sequential read | 5.1s | 9.4s | -39% P50 | | Parallel search, selective read (top-3 only) | 3.8s | 7.1s | -55% P50 | | Parallel search, selective read + compressed history | 3.2s | 5.9s | -62% P50 | The selective read pattern is the highest-leverage single optimization for research agent workflows: it reduces both cost and latency at the same time, with no model change required. Cost versus quality: where the dangerous zone is. Chart 4: Cost vs. quality tradeoff: routing plus compression plus caching achieves 90% quality at 75% lower cost than always-Opus The fear with cost optimization is quality degradation. The data shows a clear pattern with one dangerous zone: aggressive model downgrading without optimization. | Strategy | Quality vs. Opus | Cost/1,000 steps | |---|---:|---:| | Always Opus (baseline) | 97% | $93.00 | | Always Haiku | 71% | $12.00 | | Routing only (no compression) | 93% | $51.00 | | Compression only (no routing) | 94% | $58.00 | | Routing + compression | 91% | $30.00 | | Routing + compression + caching | 90% | $22.80 | Always Haiku, the instinct of teams trying to cut costs by simply downgrading models, delivers 71% quality at $12/1,000 steps. Routing and compression together deliver 90% quality at $22.80, which is 75% cheaper than Opus, substantially better quality than Haiku. Routing without model change delivers 93% quality at 45% lower cost. The lesson: routing-then-compress is safer than compress-then-downgrade. Reduce overhead before reducing model capability. Where the 82% overhead actually goes. Chart 5: Token budget breakdown: 82% is overhead; 18% drives the answer. Largest waste category: irrelevant retrieved chunks at 31% Five categories account for the overhead, in order of magnitude: Irrelevant retrieved chunks (31%): document content fetched but not meaningfully used in the answer. Fixable with two-stage retrieval and reranking. Repeated tool schemas (18%): the same tool definitions resent on every step, never changing during a session. Fixable with prompt caching and lazy schema registration. Verbose tool output (14%): unformatted JSON from search APIs and tool responses that could be compressed 60–70% without information loss. Fixable with output preprocessing. Duplicate context in history (11%): prior tool outputs appearing in both raw output and accumulated conversation. Fixable with history summarization at each step. Oversized system prompt (8%): instruction text that grew unchecked. Fixable with a prompt audit; a system prompt audit typically finds 30–50% reduction in an afternoon. None of these require a model change. Implementation framework. Step 1: Instrument token usage by stage. Add a counter at each stage of your agent loop. Log tokens before and after appending retrieved chunks, tool outputs, and history. Most teams find their actual distribution differs from their mental model. import anthropic client = anthropic.Anthropic() def count_stage_tokens(system: str, messages: list, tools: list, model: str = "claude-opus-4-8") -> dict: base = client.messages.count_tokens( model=model, system=system, messages=[] ).input_tokens with_tools = client.messages.count_tokens( model=model, system=system, messages=[], tools=tools ).input_tokens full = client.messages.count_tokens( model=model, system=system, messages=messages, tools=tools ).input_tokens return { "system_tokens": base, "tool_schema_overhead": with_tools - base, "history_overhead": full - with_tools, "total_input": full, } Step 2: Separate reasoning tokens from context tokens. The split that matters is: tokens the model needs to reason (history, task, instructions) versus tokens provided as context (retrieved chunks, tool outputs). The second category is where most waste lives. Step 3: Detect over-retrieval. Log which retrieved chunks the model references in its output. If citation rate is below 50% across k retrieved chunks, retrieval precision is low. Two-stage retrieval (fetch summaries, rerank by relevance score, retrieve full text for top-k only) typically raises precision above 70%. Step 4: Route simple steps to cheaper models. Not every ReAct step requires a frontier model. Chunk relevance scoring, search query generation, and tool output summarization are classification-grade tasks. A smaller model handles these correctly 90%+ of the time at 5–10x lower cost per token. Step 5: Compress context before it compounds. JSON tool outputs can typically be compressed 60-70% without information loss. Minify API responses, deduplicate tool schemas, and summarize history at each step. The compression happens before the token is billed, not after. Step 6: Cache repeated inputs. System prompts and tool schemas are identical across every step in a session. Prompt caching on Anthropic applies a 90% discount to cached prefix tokens. On a 900-token system prompt repeated across 5 steps in 1,000 daily sessions, caching saves $4.05/day ($1,478/year) on a single static input. Step 7: Monitor cost, latency, and quality together. Cost per step is misleading in isolation. A cheaper model that retries twice costs more than an expensive model that succeeds once. Track cost per completed task, answer quality score, and step latency as a triad, not any single dimension. What this means for engineering teams. The 82% overhead finding is not a feature of LLMs; it is a feature of how agents use them. LLMs bill per token regardless of relevance. An agent that retrieves 10 chunks and uses 3 pays for 10. An agent that resends 2,100 tokens of tool schemas on every one of its 5 steps pays for those 10,500 schema tokens five separate times. This is the kind of problem model routing and observability systems are designed to solve. Instead of treating every agent step as an Opus-level problem, teams can classify the complexity of each step and route accordingly: sending chunk relevance scoring to Haiku, synthesis to Opus, and everything in between to Sonnet. The routing decision adds under 10 milliseconds. The savings are immediate. Nadir instruments this automatically: logging token cost by stage, detecting over-retrieval patterns, and routing each step to the minimum-cost model that preserves quality. The savings dashboard shows the real delta against always-Opus, per request. Two lines of code, one base URL change. Conclusion. The future of LLM cost optimization is not cheaper models alone. It is better retrieval discipline, per-step routing, context compression, prompt caching, and measurement at stage granularity, not just total session cost. An agent that costs $93 per 1,000 steps on a naive configuration costs $22.80 with the same models, the same quality bar, and five targeted optimizations applied. The overhead is not fundamental to the task. It is an engineering choice. Teams that measure token spend at the stage level, not just the session level, consistently find the same result: most of what they are paying for is not what they are paying for. The tokens that drive the answer are a minority of the bill. Optimizing the majority takes a day, not a quarter. Data in charts is illustrative, modeled from public research and Anthropic pricing as of June 2026. Not derived from proprietary production traces. Sources: Microsoft Research, "How Do AI Agents Spend Your Money?", arXiv:2604.22750, April 2026. Stanford Digital Economy Lab, "How are AI agents spending your tokens?", May 2026. Liu et al., "Lost in the Middle: How Language Models Use Long Contexts," arXiv:2307.03172, 2023. Anthropic Claude Pricing, June 2026. Anthropic Token Counting API.