## The window got bigger. The advice did not catch up.

For the last two years, the pitch from every model provider has been the same: context windows keep growing, so the fix for "the model forgot something" is to give it more room. Gemini 2.5 Pro ships a 1-million-token window. Claude and GPT-4.1 sit at 200K and 1M respectively. The implicit promise is that a bigger window is strictly better than a smaller one, so if you are not sure whether something is relevant, include it anyway.

Chroma Research tested that promise directly. They fed 18 frontier models, including GPT-4.1, the Claude 4 family, Gemini 2.5, and Qwen3, inputs ranging from 1,000 to 1,000,000 tokens and scored accuracy at every length on tasks simple enough that a human would not need the full document to answer them. Every model got worse as input grew, and for most of them, the decline started well before they ran out of window.

Source: Chroma Research, "Context Rot: How Increasing Input Tokens Impacts LLM Performance," 2025.

The finding has a name now: context rot. It matters for two reasons most teams treat as separate problems. The first is quality: a model buried in unnecessary context gives worse answers, silently, with no error message. The second is cost: the tokens causing the rot are almost always the same tokens inflating the bill. Fixing one fixes the other.

## What the report actually measured

Chroma's methodology extends the classic "needle in a haystack" test, which checks whether a model can retrieve one fact planted in a long document. The extension is what makes the result interesting: instead of one clean needle in one clean haystack, they varied needle-question similarity, added distractors (plausible-but-wrong answers scattered through the context), and tested structured versus unstructured input, all while scaling total length from 1K to 1M tokens. The result was not a clean cliff at some token count.
It was a steady, task-dependent decline that got sharply worse once distractors were introduced:

| Condition | What happens as input grows |
|---|---|
| Simple retrieval, no distractors | Accuracy drops gradually; the decline is present but modest |
| Retrieval with 1+ plausible distractors | Accuracy drops sharply; models increasingly answer using the distractor, not the needle |
| Low needle-question similarity (the answer must be inferred, not pattern-matched) | Accuracy degrades fastest of any condition tested |
| Structured input (uniform formatting) | Degrades less than unstructured input at the same length |

No model in the study held flat across its advertised window. In practice, models marketed with 1M-token windows stopped behaving like 1M-token models once input moved much past roughly 200K tokens.

Source: Chroma Research, "Context Rot," 2025; replication toolkit at github.com/chroma-core/context-rot.

## Why this happens

Two mechanisms compound.

The first is architectural: transformer attention does not weight every position in the context equally, and the positional encoding schemes most models use (RoPE and its variants) have a long-term decay property, so the similarity signal between distant token pairs shrinks as the distance grows. Tokens 400,000 positions apart get a structurally weaker attention signal than tokens 400 positions apart, independent of relevance.

The second is older and better known: a 2023 study found that LLM accuracy follows a U-shaped curve based on where the relevant information sits in the prompt. Accuracy is highest when the information is near the start or end and drops by more than 30% when it is buried in the middle, a pattern confirmed across six model families since.

Source: Liu et al., "Lost in the Middle: How Language Models Use Long Contexts," arXiv:2307.03172, 2023.

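The long-term decay attributed to RoPE above can be poked at numerically. What follows is a toy sketch, not a measurement of any real model: it assumes a query and key that are identical unit vectors in every 2-D rotary pair, a head dimension of 128, and a base of 10000 (common defaults, but assumptions here). Under those assumptions each pair contributes `cos(distance * theta_i)` to the pre-softmax score, and the function names are illustrative only.

```python
import math


def rope_similarity(distance: int, dim: int = 128, base: float = 10000.0) -> float:
    """Pre-softmax score between two identical unit vectors under RoPE.

    Each 2-D pair is rotated by distance * theta_i, with theta_i = base^(-2i/dim),
    so each pair contributes cos(distance * theta_i) to the dot product.
    """
    pairs = dim // 2
    return sum(math.cos(distance * base ** (-2 * i / dim)) for i in range(pairs))


def windowed_mean_abs(start: int, width: int = 200) -> float:
    """Average |score| over a window of distances, to smooth out the oscillation."""
    return sum(abs(rope_similarity(d)) for d in range(start, start + width)) / width


# Compare windows at growing relative distances against the distance-0 maximum.
max_score = rope_similarity(0)
for start in (400, 4_000, 40_000, 400_000):
    print(f"distance ~{start:>7}: mean |score| = {windowed_mean_abs(start):.2f} of max {max_score:.0f}")
```

Individual distances oscillate, which is why the sketch averages over a window. Real queries and keys are learned and not identical, so treat the output as an illustration of the tendency the text describes and check it yourself before relying on it.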
Context rot is what happens when you combine that positional bias with distractor-heavy, poorly-structured input at scale: the model is not failing to see the answer, it is failing to weight it correctly against everything else you handed it.

## The part almost nobody connects: rot and spend are the same tokens

Here is the detail that turns a research finding into an engineering priority. The content responsible for context rot is not exotic. It is the same overhead that shows up in every token-cost audit:

- Stale conversation history carried forward turn after turn, most of it irrelevant to the current question. See "multi-turn conversations bill your first message 20 times."
- Over-fetched retrieval chunks. Production RAG pipelines routinely fetch far more than the model uses, and the unused chunks are exactly the kind of low-similarity distractor content Chroma found most damaging. See "your RAG pipeline fetches 20 chunks, the model reads 3."
- Duplicated tool schemas resent on every turn of an agent loop, adding volume without adding signal. See "function calling adds 2,000 tokens to every agentic call."
- System prompts that grew by accretion, with old instructions nobody removed sitting between the ones that still matter. See "your system prompt grew from 500 to 8,000 tokens and nobody noticed."

None of these categories exist because the task needed them. They exist because sending more context is the path of least resistance, and until now the only visible cost was the invoice. Chroma's report adds a second, invisible cost: every one of those unnecessary tokens is also a small tax on accuracy, and the tax compounds with length.

This changes the argument for context compression. It is not just "the same answer for less money." It is "a more reliable answer, for less money," because trimming stale history, over-fetched chunks, and duplicated schemas removes exactly the distractor-shaped content that most reliably confuses a model at scale.

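To see how much of a request falls into the overhead categories listed above, a rough per-source breakdown is often enough to start. The sketch below is a minimal illustration, not a real tokenizer or any vendor's audit tool: it assumes roughly four characters per token, and `audit_request` and `non_question_share` are hypothetical helper names introduced here.

```python
import json

CHARS_PER_TOKEN = 4  # rough heuristic (assumption); use your provider's tokenizer for real numbers


def estimate_tokens(text: str) -> int:
    """Approximate token count from character length."""
    return len(text) // CHARS_PER_TOKEN


def audit_request(system_prompt, history, retrieved_chunks, tool_schemas, question):
    """Break one request's estimated input tokens down by source."""
    breakdown = {
        "system_prompt": estimate_tokens(system_prompt),
        "history": sum(estimate_tokens(turn) for turn in history),
        "retrieved_chunks": sum(estimate_tokens(chunk) for chunk in retrieved_chunks),
        # Tool schemas are serialized to JSON before being counted.
        "tool_schemas": sum(estimate_tokens(json.dumps(schema)) for schema in tool_schemas),
        "question": estimate_tokens(question),
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


def non_question_share(breakdown: dict) -> float:
    """Fraction of estimated input tokens that are not the current question."""
    total = breakdown["total"]
    return 0.0 if total == 0 else 1 - breakdown["question"] / total


# Illustrative inputs only; plug in a real logged request instead.
example = audit_request(
    system_prompt="You are a support assistant. " * 50,
    history=["user: earlier question", "assistant: earlier answer"] * 10,
    retrieved_chunks=["chunk text " * 100] * 20,
    tool_schemas=[{"name": "search", "parameters": {"query": "string"}}] * 3,
    question="What is the refund window?",
)
for source, tokens in example.items():
    print(f"{source:>16}: {tokens}")
print(f"non-question share: {non_question_share(example):.1%}")
```

Not everything outside the question is waste, since some retrieved chunks and history are needed. The breakdown only shows where to look first.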
## Three ways to handle a long-context task, compared

| Approach | Accuracy at scale | Cost per query | Where it breaks |
|---|---|---|---|
| Dump everything into context (rely on the 1M window) | Degrades 20-50% between 10K and 100K+ tokens on retrieval-style tasks | Highest: you pay full input rate for every distractor token | Silent: no error, just a wrong or vague answer |
| RAG with aggressive top-k and reranking | Holds closer to baseline; irrelevant chunks are filtered before they reach the model | Low: only the retrieved chunks are billed | Recall risk if top-k is set too low for the task |
| Compression + routing (dedupe, trim history, minify, route by complexity) | Preserves 90%+ of full-context quality on most workloads | Lowest: removes overhead before it is billed, then routes the remainder to the cheapest capable model | Requires instrumentation to verify quality is actually held |

The first row is the default most teams ship with, because it requires no engineering. It is also the row with both the worst accuracy profile and the worst cost profile, which is an unusual combination: normally you trade one for the other. Context rot is the reason you do not have to.

## Providers are already building around this

The frontier labs are not ignoring the problem. Anthropic shipped context editing and a memory tool on the Claude Developer Platform. The `clear_tool_uses_20250919` strategy automatically clears the oldest tool results once a conversation crosses a configured token threshold and replaces them with a placeholder, while the memory tool lets Claude persist information to files outside the context window instead of re-sending it every turn. Anthropic reports that context editing cut token consumption by 84% in a 100-turn web search evaluation.

Source: Anthropic, "Managing context on the Claude Developer Platform"; Context editing docs.

That is a real, meaningful fix, but it is provider-side, opt-in, and specific to tool-heavy agent loops on one provider.
It does not touch the other sources of rot: bloated system prompts, over-fetched RAG chunks, or repeated schemas sent to a provider without the feature enabled. A compression layer that runs before the request leaves your infrastructure catches all of it, on any provider, without waiting for each one to ship an equivalent.

## Implementation: keep context lean without losing recall

**Step 1: Benchmark your own degradation curve.** Do not trust the advertised context window as a proxy for usable context. Build a small golden-answer eval, run it at 5K, 20K, 50K, 100K, and 200K input tokens with your actual document and prompt shapes, and plot accuracy against length. Most teams find the useful window is a fraction of the advertised one.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Model ID as given in the original example; substitute one your account can call.
MODEL = "claude-opus-4-8"


def build_context(chunks: list[str], target_tokens: int) -> str:
    """Placeholder helper: join chunks up to ~target_tokens (assumes ~4 chars per token).

    Make sure the chunks that contain the answers survive the cut.
    """
    budget_chars = target_tokens * 4
    parts, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > budget_chars:
            break
        parts.append(chunk)
        used += len(chunk)
    return "\n\n".join(parts)


def grade(answer: str, expected: str) -> bool:
    """Placeholder grader: case-insensitive substring match. Swap in your own."""
    return expected.lower() in answer.lower()


def eval_at_length(golden_qa: list[tuple[str, str]], context_chunks: list[str],
                   target_tokens: int, model: str) -> float:
    """Return the fraction of golden questions answered correctly at one context length."""
    context = build_context(context_chunks, target_tokens)
    correct = 0
    for question, expected in golden_qa:
        response = client.messages.create(
            model=model,
            max_tokens=200,
            system=context,
            messages=[{"role": "user", "content": question}],
        )
        if grade(response.content[0].text, expected):
            correct += 1
    return correct / len(golden_qa)


# Run across a length sweep and look for the point where accuracy drops below
# your quality floor, not the point where tokens run out.
# golden_qa (question, expected) pairs and chunks are your own eval data.
for length in [5_000, 20_000, 50_000, 100_000, 200_000]:
    score = eval_at_length(golden_qa, chunks, length, MODEL)
    print(f"{length:>7} tokens -> {score:.1%} accuracy")
```

**Step 2: Set an effective context budget as policy, not an emergency measure.** If your eval shows accuracy dropping past 60K tokens, cap requests there by default and treat anything beyond it as a case that needs summarization or a second retrieval pass, not a bigger prompt.

**Step 3: Put the highest-value information at the start or end.** Given the U-shaped positional bias, the worst place for the single most important fact is the middle of a long prompt.
Reorder retrieved chunks so the top-ranked result sits last, immediately before the question.

**Step 4: Rerank retrieval to a real top-k, not a generous one.** If your model cites 2-3 of the 10 chunks you retrieve, the other 7 are pure distractor risk. Cut the retrieval count and measure whether answer quality actually moves.

**Step 5: Compress and dedupe before the request leaves your infrastructure.** Minify JSON payloads, deduplicate repeated tool schemas across turns, and summarize conversation history instead of carrying it verbatim. Minifying and deduplicating are the same lossless-transform work covered in Context Optimize; summarizing history is lossy, so check it against your Step 1 eval. All three remove distractor-shaped content along with the token count.

**Step 6: Route by task, not by context size.** A short classification step buried inside a long-context session does not need the full window forwarded to it. Route it to a cheaper model with only the slice of context it actually requires.

**Step 7: Track accuracy and cost on the same dashboard.** A cheaper prompt that quietly answers worse is not an optimization. Measure both together, or you will not know which one moved.

## What this means for engineering teams

The instinct to hand a model more context "just in case" made sense when context was scarce and mistakes from omission were the visible failure mode. Chroma's report says the opposite failure mode is now the bigger risk on long-context tasks: mistakes from dilution, where the model had the right answer somewhere in the input and lost it in the noise. That failure is invisible in your logs and expensive on your invoice at the same time.

Nadir is built for exactly this overlap. Context Optimize removes duplicate tool schemas, over-fetched retrieval content, and stale history before a request is billed, and the routing layer sends the trimmed request to the cheapest model that can still handle it. The result is not just a smaller invoice.
On workloads with the overhead patterns described above, it is fewer distractor tokens between the model and the right answer. Integration is two lines of code and one base URL change.

## Conclusion

A bigger context window was never a substitute for a curated one. Chroma's data makes that explicit: all 18 frontier models got measurably worse as input grew, well before any of them ran out of room, and the tokens most responsible for the decline are structurally identical to the tokens engineering teams already flag as waste in every cost audit. Trimming stale history, over-fetched chunks, and duplicated schemas is not a tradeoff between quality and cost anymore. On the workloads Chroma tested, it is the same fix for both.

Sources:

- Chroma Research, "Context Rot: How Increasing Input Tokens Impacts LLM Performance," 2025.
- Chroma context-rot replication toolkit, GitHub.
- Liu et al., "Lost in the Middle: How Language Models Use Long Contexts," arXiv:2307.03172, 2023.
- Anthropic, "Managing context on the Claude Developer Platform," 2025.
- Anthropic context editing documentation.