The paradox: prices fall, bills rise LLM API prices dropped roughly 80% between early 2025 and early 2026. Anthropic cut Opus pricing from $15/$75 per million tokens to $5/$25 with the 4.6 release. OpenAI, Google, and DeepSeek all followed with aggressive cuts of their own. Yet enterprise LLM spend is accelerating. The reason is not that teams are wasting money on chatbots. It is that the workloads changed. The shift from single-turn chat to multi-step agentic workflows multiplied token consumption per task by 5 to 30x, according to Gartner's March 2026 analysis. Falling prices times rising volume equals a bigger bill. Where the tokens actually go A standard chatbot interaction sends a prompt, gets a response, and is done. An agentic workflow is different. The agent reads the task, calls a tool, reads the output, decides what to do next, calls another tool, reads that output, and repeats. Every turn re-sends the full conversation history as input tokens. Microsoft Research quantified this in their April 2026 paper on agentic coding tasks. The numbers are striking: Agentic coding tasks consume 1,000x more tokens than code reasoning or code chat tasks A typical coding session uses roughly 1 million input tokens and 40,000 output tokens, a 25:1 ratio The same agent on the same task can vary by up to 30x in total token consumption between runs Stanford's Digital Economy Lab confirmed the pattern. Their research found that input tokens, not output tokens, drive the cost. An agent at turn 1 might send 5,000 input tokens. By turn 30, it carries 25,000 to 35,000 tokens of accumulated context on every single API call. By turn 50, the context is so large that each retry loop costs more than the first ten turns combined. The context accumulation tax Vantage published a detailed breakdown of where agentic coding costs hide. The key insight: context accumulates, and every API call pays the full accumulated price. Here is what a typical agentic coding session looks like: | Turn | Input tokens | Cumulative cost driver | |------|-------------|----------------------| | 1 | ~5,000 | System prompt + task description | | 10 | ~12,000 | + file reads, tool schemas, initial edits | | 20 | ~22,000 | + test output, error messages, first retry | | 30 | ~32,000 | + second retry cycle, more file reads | | 40 | ~40,000 | + third retry, accumulated conversation | When the agent hits a test failure at turn 35 and retries three times, those three retries each carry 35,000+ input tokens. That retry loop alone can cost more than the first twenty turns of the session. The model choice compounds this. A retry loop at turn 40 on Opus ($5/M input) costs 5x what the same loop costs on Haiku ($1/M input). Teams that default every request to a premium model pay the premium rate on wasted retry work, not just productive work. Why "just use a cheaper model" does not work The obvious response is to run everything on the cheapest model. But that breaks on complex tasks. A coding agent using Haiku to architect a distributed system will produce bad output, retry more, and potentially cost more in wasted tokens than if it had used Opus from the start. The real distribution of agentic work looks like this: 60 to 70% of turns are low complexity. Reading files, checking status, formatting output, running tests, parsing error messages. These are classification-grade tasks that Haiku handles correctly. 20 to 30% are medium complexity. Writing a function, explaining a bug, generating a test. Sonnet handles these well. 5 to 15% are genuinely hard. Architecture decisions, complex debugging, multi-file refactors where the agent needs to reason across a large codebase. These need Opus or an equivalent frontier model. Pinning everything to one model means either overpaying on the 70% (all Opus) or degrading quality on the 15% (all Haiku). Neither is a good trade. The math on per-turn routing Model routing evaluates each turn independently and sends it to the cheapest model that can handle it. Applied to agentic sessions, the savings compound because the expensive turns (high context, high token count) are exactly the ones most likely to be low complexity. Consider a 40-turn agentic coding session on Opus at $5/M input tokens: | Segment | Turns | Avg input tokens | Model (routed) | Cost (all Opus) | Cost (routed) | |---------|-------|-----------------|----------------|-----------------|---------------| | File reads, status checks | 25 | 18,000 | Haiku ($1/M) | $2.25 | $0.45 | | Code generation, tests | 10 | 28,000 | Sonnet ($3/M) | $1.40 | $0.84 | | Architecture, debugging | 5 | 35,000 | Opus ($5/M) | $0.88 | $0.88 | | Total | 40 | | | $4.53 | $2.17 | That is a 52% reduction on input tokens alone. Multiply by hundreds of sessions per week across a team, and the difference is five figures per month. The classifier overhead is under 10 ms per turn. In a session where each turn takes 2 to 30 seconds for the LLM to respond, 10 ms is noise. What the research says about routing The industry is converging on this approach. A 2026 survey found that 37% of enterprises now use five or more models in production. The teams seeing the best results treat model selection like air traffic control, routing each request to the right destination rather than sending everything to the same runway. Multiple independent analyses put the savings from intelligent routing at 40 to 60% for mixed-complexity workloads. That aligns with our own held-out benchmark: Nadir's verifier-gated cascade cuts cost 60% versus always-Opus on 11,420 RouterBench held-out triples, preserving 98% of always-Opus quality. A calibrated verifier (AUROC 0.961) reads the cheap-model answer before we ship it, so the routing decision is recoverable rather than absorbed. The key is that routing must be automatic and per-request. Manual rules break as prompt distributions shift. Static classifiers plateau at 88 to 93% accuracy. A trained classifier that adapts to live response quality (what we call Outcome-Conditioned Routing) closes the gap. Practical steps to cut your agentic AI bill 1. Audit your token distribution. Before optimizing, measure. Pull your API logs and bucket requests by input token count and task type. Most teams discover that 60%+ of their API calls are low-complexity turns that do not need a frontier model. 2. Route per request, not per session. Pinning an entire session to one model wastes money on low-complexity turns. Per-turn routing catches the file reads, status checks, and formatting tasks that accumulate through a session. 3. Compress context before it compounds. Minifying JSON, deduplicating tool schemas, and trimming old conversation turns can cut input tokens 30 to 60% on long sessions. These transforms are lossless. The model receives the same information in fewer tokens. 4. Watch the retry tax. If your agent retries failed tasks on a premium model, those retries carry the full accumulated context at premium rates. Routing retries to a cheaper model when the retry is a simple fix (syntax error, missing import) saves disproportionately. 5. Measure cost per completed task, not cost per token. A cheaper model that fails and retries five times can cost more than an expensive model that succeeds on the first try. Track task completion cost, not just per-token price. The bottom line Per-token prices will keep falling. That is not going to fix your bill. The shift to agentic workloads changed the unit economics: more turns, more context per turn, more tokens per task. The lever that matters now is not the price of a token. It is how many tokens each task consumes, and whether each of those tokens is hitting the right model. Nadir routes each turn to the cheapest model that can handle it. The classifier adds under 10 ms. The savings show up on the first request, in the x-nadir-cost-saved response header. No SDK swap, no refactor. Change the base URL, set model="auto", and let the router do the work. Sources: Stanford Digital Economy Lab, "How are AI agents spending your tokens?" (May 2026). Microsoft Research, "How Do AI Agents Spend Your Money?" (April 2026). Vantage, "The Hidden Cost Driver in Agentic Coding Sessions" (2026). Gartner, "Agentic AI Token Consumption Analysis" (March 2026). Anthropic Claude model pricing as of May 2026.