The bill that multiplies itself. LLMs have no persistent memory. Every API call is stateless. When you send turn 20 of a conversation, you include turns 1 through 19 in the messages array — not as a summary, but as the full text of every message, in order. The model sees everything, every time. This is not a bug or an oversight in the API design. It is an architectural requirement: the model needs the full conversation history to produce coherent, contextually aware responses. But it creates a billing structure that most engineering teams have never explicitly modeled: in a 20-turn conversation, your first message is tokenized and billed 20 times. Your second message is billed 19 times. Your 10th message is billed 11 times. The cumulative input tokens for a conversation are not the sum of each message's length — they are the sum of the cumulative history at each turn. For a conversation with n turns where each turn adds m tokens, the total input token count is approximately n² × m / 2. A 20-turn conversation with 200 tokens per turn does not consume 4,000 input tokens. It consumes roughly 40,000. Most teams never run this calculation. Source: Anthropic, "Messages API reference," Anthropic Docs Source: OpenAI, "Chat Completions API," OpenAI Docs The math that compounds faster than teams expect. A customer support session with 15 turns and an average message length of 200 tokens: | Turn | New tokens (user + assistant) | Input tokens billed this turn | |---|---:|---:| | 1 | 200 | 200 | | 2 | 200 | 400 | | 3 | 200 | 600 | | 5 | 200 | 1,000 | | 10 | 200 | 2,000 | | 15 | 200 | 3,000 | | Total | 3,000 | 22,500 | 3,000 tokens of actual content. 22,500 tokens billed. The ratio worsens as conversations grow longer. At Opus 4.8 input pricing ($5/M tokens), that 15-turn session costs $0.11 in input tokens. The same content sent as a single stateless call would cost $0.015. The conversational structure multiplied the input cost 7.5x. Scale that to a customer service platform handling 500,000 sessions per month averaging 15 turns. The input token bill from conversation history accumulation alone runs to $55,000 per month — $660,000 per year — for tokens that represent no new information. Just history. Source: Datadog, "State of AI Engineering 2026" Why this gets worse with agentic workloads. For conversational support bots, the accumulation is predictable. For agentic workflows — coding assistants, research agents, document processors — it compounds harder. A Stanford and Microsoft Research study found that coding agents consume 1,000x more tokens than chat applications per completed task. The mechanism is largely context accumulation: each tool call, each code block generated and evaluated, each error trace fed back to the model adds to the running history. By step 50 of a long coding session, the model context may contain 150,000 tokens — the majority of which are tool outputs and prior reasoning steps from earlier in the session. Source: Stanford and Microsoft Research, "SWE-bench Agent Token Analysis," 2026 At Opus 4.8 pricing, 150,000 input tokens per agent call is $0.75 per step. A 50-step agent session bills $37.50 in input tokens — before counting output. For an enterprise running 1,000 coding sessions per day, that is $37,500 per day, $13.7 million per year, with roughly 60 to 80% of input spend representing repeated context rather than new instructions. The four strategies that cut context accumulation costs. Rolling window truncation. The simplest fix: instead of sending the full history on each turn, truncate to the last N turns. A support bot with a 10-turn window sends turns 5 through 15 on turn 15, rather than turns 1 through 15. The tradeoff is context loss. Information from early turns is dropped. For many support conversations this is acceptable — the relevant context is recent context. For complex agentic workflows, truncation loses critical early decisions. A rolling window of 8 to 12 turns cuts input costs 30 to 50% on long sessions with no infrastructure changes. Most conversational LLM apps can implement this in a few lines. Progressive summarization. Instead of truncating and losing history, summarize older turns periodically. When the conversation reaches turn N, replace turns 1 through N/2 with a summary generated by a cheap, fast model. The summary is typically 70 to 90% shorter than the original turns. The core information is retained. This is more complex to implement than truncation but preserves semantic coherence. A 15-turn conversation with turns 1 through 8 summarized might reduce from 22,500 cumulative input tokens to 8,000 — a 64% reduction — while retaining the full context the model needs to continue. The summarization call itself costs tokens, but at cheap model rates ($0.08/M for Haiku-class models), the cost is negligible relative to the savings on long sessions. Prompt compression on conversation history. LLMLingua-2 and similar compression tools can compress conversation history in place, rather than summarizing or truncating it. At 3x compression, 10 turns of history might reduce from 2,000 tokens to 670 tokens before being passed to the frontier model. The advantage over summarization: compression is faster (no LLM call required), deterministic (no hallucination risk in the summary), and preserves more verbatim content. For agentic workflows where tool outputs dominate the history, compression on the tool output portions typically achieves 40 to 60% reduction without affecting reasoning quality. Source: Microsoft Research, "LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression," 2024 Turn-aware routing. Context accumulation changes the economics of routing in a way that most teams do not model. A routing decision made at turn 1 of a conversation (when the context is 200 tokens) carries a very different cost implication than the same decision at turn 15 (when the context is 3,000 tokens). A request that qualifies for a cheap model at turn 1 may carry enough accumulated context by turn 15 to push it past the context window of cheaper models, forcing an upgrade to a larger, more expensive one. A routing layer that is context-accumulation-aware routes differently based on the current conversation length: Early turns (1–5): Route aggressively to cheap models. Context is minimal. Mid turns (6–15): Apply rolling summarization or compression before routing. Keep the effective context window manageable. Long sessions (15+): Route to models with larger context windows only when needed. Apply aggressive compression on history before routing. This turn-aware routing reduces the average input cost per turn by 30 to 50% on conversational workloads, without any change in response quality. The math at enterprise scale. A customer service platform with 500,000 monthly sessions, average 15 turns, routed entirely to Claude Sonnet 4.5 without any history management. | Factor | No optimization | Rolling window (10 turns) + compression | Turn-aware routing | |---|---:|---:|---:| | Avg cumulative input tokens/session | 22,500 | 8,000 | 5,500 | | Monthly input tokens | 11.25B | 4.0B | 2.75B | | Blended model cost | $3.00/M | $3.00/M | $1.50/M | | Monthly input cost | $33,750 | $12,000 | $4,125 | | Annual input cost | $405,000 | $144,000 | $49,500 | Annual savings from optimization: $355,500. From conversation history management alone. No reduction in turns, no quality loss, no changes to the model or the prompt design. Implementation. A minimal rolling window implementation with progressive summarization for long sessions: import anthropic client = anthropic.Anthropic() def manage_conversation_history( messages: list, max_recent_turns: int = 10, summarize_threshold: int = 20, ) -> list: if len(messages) <= max_recent_turns: return messages if len(messages) >= summarize_threshold: to_summarize = messages[:-max_recent_turns] recent = messages[-max_recent_turns:] summary_response = client.messages.create( model="claude-haiku-4-5", max_tokens=512, messages=[{ "role": "user", "content": ( "Summarize the following conversation history in 3-5 sentences, " "preserving all decisions made, information exchanged, and action items:\n\n" "\n".join( f"{m['role'].upper()}: {m['content']}" for m in to_summarize ) ) }] ) summary_text = summary_response.content[0].text return [ {"role": "user", "content": f"[Previous conversation summary: {summary_text}]"}, {"role": "assistant", "content": "Understood. I have the context from our earlier conversation."}, recent, ] return messages[-max_recent_turns:] def chat(messages: list, user_input: str) -> str: messages.append({"role": "user", "content": user_input}) managed = manage_conversation_history(messages) response = client.messages.create( model="claude-sonnet-4-5", max_tokens=1024, messages=managed, ) assistant_reply = response.content[0].text messages.append({"role": "assistant", "content": assistant_reply}) return assistant_reply The claude-haiku-4-5 summarization call costs roughly $0.001 per session at typical conversation lengths — less than 1% of the savings it generates. Two implementation notes: Separate your message store from your API payload. Keep the full conversation history in your own database. Pass only the managed (truncated or summarized) version to the LLM API. Never truncate the stored history — you need it for auditing, debugging, and fine-tuning. Measure turn depth distribution before setting window sizes. Pull your production conversation logs and plot the distribution of session lengths. If 80% of sessions end within 8 turns, a 10-turn window costs you nothing and saves you significantly on the long tail. If 40% of sessions exceed 20 turns, you need the summarization path. Three changes that take less than a day. 1. Calculate your actual cumulative input token spend per session. Pull 1,000 production sessions from your logs, sum the input tokens per call within each session, and divide by the raw message content tokens. The ratio tells you how much of your input spend is history re-billing. If it exceeds 3x, you have an optimization opportunity larger than your system prompt. 2. Add a rolling window of 8 to 10 turns. For most conversational workloads, the last 8 to 10 turns contain the actionable context. Truncating older turns costs nothing in quality and cuts input costs 30 to 50% on sessions longer than your window. It is a one-line change in the API call construction. 3. Add Haiku-class summarization for sessions crossing 15 turns. When a session passes the 15-turn mark, trigger a cheap summarization of turns 1 through 8. The summarization call costs fractions of a cent. The input savings on the remaining turns of a long session repay that cost in a single API call. Sources: Anthropic, "Messages API reference," Anthropic Docs. OpenAI, "Chat Completions API," OpenAI Docs. Stanford and Microsoft Research, "SWE-bench Agent Token Analysis," 2026. Microsoft Research, "LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression," 2024. Datadog, "State of AI Engineering 2026". GitHub, "Auditing token consumption across agentic workflows," 2026. Anthropic, Claude Sonnet 4.5 and Claude Haiku 4.5 pricing as of June 2026.*