The first systematic study of agent token spending is here.

In April 2026, researchers from Stanford's Digital Economy Lab and Microsoft Research published the first large-scale empirical study of how coding agents actually consume tokens. The paper, "How Do AI Agents Spend Your Money?", analyzed trajectories from eight frontier LLMs on SWE-bench Verified, the standard benchmark for real-world software engineering tasks.

The findings are worth reading carefully if you pay for LLM inference at scale. Not because the numbers are surprising (practitioners already suspected most of this), but because the study puts hard data behind patterns that were previously anecdotal.

Source: Stanford Digital Economy Lab / Microsoft Research, "How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks," arXiv:2604.22750, April 2026

Finding 1: Coding agents use 1,000x more tokens than chat

This is the headline number. Agentic coding tasks consume roughly 1,000 times more tokens than conversational code reasoning or code chat.

The gap exists because agents loop: they read files, call tools, interpret results, make changes, verify, and repeat. Each loop iteration sends the full accumulated context back through the model. A chat interaction is a prompt and a response. An agentic task is 10 to 50 API calls, each carrying the full history of every previous call.

Input tokens, not output tokens, drive the cost. The study found that input token volume is the primary determinant of total spend in agentic workflows. This tracks with what GitHub reported after auditing its own agentic workflows earlier this year (a 37% reduction in token consumption) and with independent analyses showing that agentic sessions average 1 to 3.5 million tokens per task including retries.

Finding 2: The same task can cost 30x more on one run than another

Token consumption in agentic coding is highly stochastic.
The study found that repeated runs of the same agent on the same task can differ by up to 30x in total token usage. Same model, same prompt, same codebase. Different execution path, wildly different bill.

This happens because agent behavior is non-deterministic. The model might find the right file on the first try or explore five wrong files first. It might generate a correct fix immediately or enter a debug-retry loop that consumes 500,000 additional tokens. The variance is inherent to the agentic paradigm, not a bug in any specific model.

The practical implication: forecasting agent costs from averages is unreliable. A task that costs $0.50 on one run can cost $15 on the next. Budget planning based on average cost per task will underestimate actual spend for any team running agents at scale.

Finding 3: Spending more tokens does not improve accuracy

This is the finding that matters most for cost optimization. The study found no meaningful correlation between token consumption and task success. Agents that consumed more tokens were not more likely to solve the task correctly.

In other words, the expensive runs are not the productive ones. They are the ones where the agent got lost, explored dead ends, retried failed approaches, or accumulated irrelevant context. The cheapest successful runs were often the most direct: the agent identified the issue, made the fix, and stopped.

This directly challenges the implicit assumption behind "let the agent run until it succeeds" strategies. Giving an agent more budget does not make it smarter. It makes it more expensive. The quality of the routing decision (which model handles which task) matters more than the token budget allocated to each task.

Finding 4: Models vary dramatically in token efficiency

Not all models consume tokens at the same rate on the same tasks. The study found that on identical SWE-bench tasks, some models consumed over 1.5 million more tokens than others. The gap is not explained by accuracy.
In several cases, more token-efficient models also achieved higher solve rates.

This is a model selection problem, not a prompt engineering problem. Two frontier models given the same task and the same tools can differ by millions of tokens in total consumption. The cost difference at $5 per million input tokens is $7.50 or more per task, before accounting for output tokens.

The implication for teams choosing a default agent model: benchmark token efficiency alongside accuracy. A model that scores 2 percentage points lower on SWE-bench but uses 40% fewer tokens per task will cost dramatically less at production scale. And at 500+ tasks per day, "dramatically less" means tens of thousands of dollars per month.

Finding 5: Models cannot predict their own costs

The researchers asked each model to estimate its own token consumption before executing a task. The correlation between predicted and actual consumption was weak to moderate, peaking at 0.39. Models systematically underestimated what they would actually spend.

This matters for pre-routing and budget allocation. If you are using an LLM to classify task difficulty as an input to routing decisions, the model's self-assessment of "how hard is this" is not a reliable proxy for "how many tokens will this consume."

The study also found that human expert assessments of task difficulty only weakly correlated with actual token costs. Humans and models both misjudge the computational effort agents expend.

Effective routing needs a purpose-built classifier trained on observed token consumption and task outcomes, not self-reported difficulty estimates. The classifier has to learn the mapping between prompt characteristics and actual routing cost from historical data, not from the model's introspection.

Parallel research: 40 to 60% of agent tokens are removable

The Stanford/Microsoft study measured the problem. Other recent work measured the solution.
AgentDiet, a trajectory reduction framework evaluated on SWE-bench Verified, demonstrated that 39.9 to 59.7% of input tokens in coding agent runs can be removed with no measurable loss in task accuracy. The cost reduction: 21.1 to 35.9% of total computational spend.

Source: "Reducing Cost of LLM Agents with Trajectory Reduction," arXiv:2509.23586

The waste categories:

Stale context. File listings, error messages, and tool outputs from early turns that are no longer relevant by turn 20+.
Redundant tool schemas. Full JSON schemas for every registered tool re-sent on every API call, even when most tools are irrelevant to the current step.
Uncompressed tool output. Raw JSON payloads, whitespace, and verbose formatting that can be minified without information loss.
Failed attempt history. Full traces of approaches that did not work, carried forward as context for the rest of the session.

A separate study, Squeez (task-conditioned tool-output pruning), found that extracting only the relevant evidence block from each tool observation, while discarding the rest, reduces token volume with no performance loss on downstream agent decisions.

Source: "Squeez: Task-Conditioned Tool-Output Pruning for Coding Agents," arXiv:2604.04979

These are not marginal improvements. Cutting 40 to 60% of input tokens on a 2-million-token agentic session saves 800,000 to 1,200,000 tokens per task. At $5 per million input tokens, that is $4 to $6 per task. At 500 tasks per day, that is $2,000 to $3,000 per day, or $60,000 to $90,000 per month.

What this means for your agent architecture

The research converges on a clear picture: coding agents are expensive not because the work is hard, but because the execution is wasteful. Most tokens are spent finding things, not doing things. Most retries do not need a frontier model. Most context carried forward is stale.

Three levers move the needle:

1. Route per turn, not per session.
The study's findings imply that not every API call in an agentic session requires the same model. File reads, status checks, output formatting, error parsing, and simple code generation do not need Opus. A classifier that evaluates each turn independently and routes to the cheapest capable model captures the efficiency gap the researchers measured between models.

2. Compress context aggressively.
The 40 to 60% removable token finding from AgentDiet is likely conservative for long sessions. Minifying JSON, deduplicating schemas, and summarizing stale turns can cut input tokens substantially without affecting the agent's ability to complete the task. Every token removed from context saves money on every subsequent turn, because agentic sessions are cumulative.

3. Scope tools per step.
Every tool schema sent to the model is input tokens billed at full rate. An agent with 40 registered tools pays for 40 tool definitions on every API call, even when only 3 are relevant. Narrowing tool registrations to the current step reduces the fixed cost per turn.

Where Nadir fits

Nadir addresses the first lever directly. The trained classifier evaluates each API call in under 10 ms and routes to the cheapest model that can handle it. For agentic workloads where 60 to 70% of turns are low complexity, this means the majority of your turns hit Haiku-class pricing ($1 per million input tokens) instead of Opus-class pricing ($5 per million).

The research validates this approach from multiple angles:

More tokens do not equal better results, so routing to a cheaper model on low-complexity turns does not degrade outcomes.
Models vary dramatically in efficiency, so selecting the right model per turn is the highest-leverage cost decision.
Self-assessment of difficulty is unreliable, so routing needs a trained classifier, not the model's own judgment.

The integration is two lines. Change the base URL, set model="auto".
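
To put rough numbers on the per-turn routing claim in this section, the sketch below computes a blended input price from the two price points quoted above ($1 and $5 per million input tokens) and the 60 to 70% low-complexity share. It is a back-of-the-envelope model, not a measurement: the function names are illustrative, and it assumes every turn carries the same number of input tokens and ignores output tokens, neither of which holds exactly in real sessions.

```python
def blended_price(low_share, cheap_price, frontier_price):
    """Average price per million input tokens when `low_share` of the
    tokens go to the cheap model and the rest go to the frontier model.

    Assumption: token volume is spread evenly across turns, so the share
    of turns equals the share of tokens.
    """
    if not 0.0 <= low_share <= 1.0:
        raise ValueError("low_share must be between 0 and 1")
    return low_share * cheap_price + (1.0 - low_share) * frontier_price


def reduction(low_share, cheap_price, frontier_price):
    """Fractional saving versus sending every token to the frontier model."""
    blended = blended_price(low_share, cheap_price, frontier_price)
    return 1.0 - blended / frontier_price


# Prices quoted in the article, in dollars per million input tokens.
HAIKU_CLASS = 1.0
OPUS_CLASS = 5.0

for share in (0.60, 0.70):
    price = blended_price(share, HAIKU_CLASS, OPUS_CLASS)
    saved = reduction(share, HAIKU_CLASS, OPUS_CLASS)
    print(f"{share:.0%} low-complexity turns: "
          f"${price:.2f} per million input tokens ({saved:.0%} lower)")
```

Under these assumptions the blended input price lands at about $2.60 to $2.20 per million tokens, roughly 48 to 56% below the all-Opus rate. Because later turns in a session carry more accumulated context than early ones, the real figure depends on which turns are the low-complexity ones.
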
Per-request response headers (x-nadir-routed-to, x-nadir-cost-saved) show exactly where each turn was routed and what it saved.

For teams running coding agents at production scale, the Stanford/Microsoft data makes the case clearly: the cheapest token is the one you never send, and the next cheapest is the one routed to the right model.

Sources:

Stanford Digital Economy Lab / Microsoft Research, "How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks," arXiv:2604.22750 (April 2026).
AgentDiet, "Reducing Cost of LLM Agents with Trajectory Reduction," arXiv:2509.23586.
Squeez, "Squeez: Task-Conditioned Tool-Output Pruning for Coding Agents," arXiv:2604.04979.
GitHub Blog, "How we reduced token consumption in GitHub Agentic Workflows by 37%" (April 2026).
Artificial Analysis, "Coding Agent Index" (May 2026).
Anthropic, OpenAI model pricing as of May 2026.