The pitch that changed how teams build document AI.

In 2024, 128,000 tokens was a long context. In 2026, it is a rounding error. Google's Gemini 2.5 Pro supports 1,048,576 tokens. Anthropic's Claude Sonnet 4.6 handles 200,000. OpenAI's GPT-4.1 supports 1,000,000. The pitch writes itself: stop building RAG pipelines, stop tuning chunking strategies, stop maintaining vector databases. Just load everything into context and let the model figure it out.

The pitch is landing. A 2026 Scale AI survey of enterprise AI teams found that 43% now use long context as their primary retrieval strategy for documents longer than 50 pages. Another 31% are planning to migrate away from RAG in the next six months.

Source: Scale AI, "State of Enterprise AI 2026," Q1 2026

The economics are not in the pitch.

What 1M tokens costs per query.

Gemini 2.5 Pro charges $1.25 per million input tokens for inputs under 200,000 tokens, and $2.50 per million for everything above that threshold.

Source: Google AI Studio, Gemini 2.5 Pro Pricing, June 2026

Loading a 500,000-token document (roughly 400 pages of standard business text) costs $1.00 per query in input tokens alone: 200,000 tokens at $1.25/M plus 300,000 tokens at $2.50/M. At 10,000 queries per day, that is $10,000 in input costs per day. Before output tokens. Before infrastructure. Before anything else.

| Document Size | Pages | Input Tokens | Gemini 2.5 Pro Cost | At 10K Daily Queries |
|---|---:|---:|---:|---:|
| Short report | ~50 | ~62,500 | $0.078/query | $780/day |
| Standard handbook | ~200 | ~250,000 | $0.375/query | $3,750/day |
| Large document | ~400 | ~500,000 | $1.00/query | $10,000/day |
| Full codebase | varies | ~800,000 | $1.75/query | $17,500/day |

Source: Google AI Studio, Gemini 2.5 Pro Pricing. Token estimate: approximately 1,250 tokens per standard-formatted PDF page, based on OpenAI tokenizer benchmarks.

The 800,000-token full-codebase figure is not hypothetical. Teams using long context for code review regularly load entire repositories on each query.
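
The per-query input costs above follow from applying the two rates to the two parts of the prompt. A minimal sketch of that arithmetic, assuming the tiering works as described in the text (tokens up to 200,000 at the lower rate, tokens above it at the higher rate); the function names and defaults here are illustrative, and the provider's pricing page is the authority on how the threshold is actually applied:

```python
def tiered_input_cost(
    input_tokens: int,
    base_rate: float = 1.25,    # USD per million tokens up to the threshold (figure from the text)
    high_rate: float = 2.50,    # USD per million tokens above the threshold (figure from the text)
    threshold: int = 200_000,   # tier boundary in tokens (figure from the text)
) -> float:
    """Input cost in USD for one query, billing tokens above the threshold at the higher rate."""
    below = min(input_tokens, threshold)
    above = max(input_tokens - threshold, 0)
    return (below * base_rate + above * high_rate) / 1_000_000


def annual_input_cost(input_tokens: int, queries_per_day: int = 10_000) -> float:
    """Yearly input cost in USD, assuming every query loads the same number of tokens."""
    return tiered_input_cost(input_tokens) * queries_per_day * 365


if __name__ == "__main__":
    per_query = tiered_input_cost(500_000)
    print(f"${per_query:.2f} per query")                     # $1.00 per query
    print(f"${per_query * 10_000:,.0f} per day at 10K queries")  # $10,000 per day at 10K queries
```

Swapping in your own document size and daily query volume gives the input-only floor for a long-context design, before output tokens or infrastructure.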
At 10,000 engineering queries per day against a large monorepo, input costs alone exceed $6 million per year.

What RAG costs for the same task.

A standard RAG pipeline on the same 400-page document:

Indexing (one-time): Embed the full document through text-embedding-3-small at $0.02 per million tokens. 500,000 tokens = $0.01, paid once at index time.

Per query: Embed the query (50 tokens, negligible cost), retrieve the top-5 chunks (5 × 400 tokens = 2,000 tokens), add a 200-token system prompt and the 50-token query, and generate the answer with Claude Sonnet 4.5 at $3.00 per million input tokens. Total: 2,250 tokens × $3.00/M = $0.0068 per query.

| Approach | Tokens/Query | Cost/Query | Daily (10K queries) | Annual |
|---|---:|---:|---:|---:|
| Long context (Gemini 2.5 Pro, 500K tokens) | 500,000 | $1.00 | $10,000 | $3,650,000 |
| Long context (Claude Sonnet 4.5, 200K tokens) | 200,000 | $0.60 | $6,000 | $2,190,000 |
| RAG, Claude Sonnet 4.5, top-5 retrieval | 2,250 | $0.007 | $70 | $25,550 |
| RAG, Claude Haiku 4.5, top-5 retrieval | 2,250 | $0.002 | $20 | $7,300 |

Source: Anthropic, Claude Pricing, June 2026. Source: Google AI Studio, Gemini 2.5 Pro Pricing. Source: OpenAI, text-embedding-3-small Pricing.

RAG with Claude Sonnet 4.5 costs 148x less per query than Gemini 2.5 Pro long context on the same 400-page document. Annual difference at 10,000 daily queries: $3.6 million.

None of this appears as a labeled comparison in any billing dashboard. Both approaches show up as "input tokens." The long-context bill and the RAG bill look identical in structure. They differ by $3.6 million per year in magnitude.

The quality claim.

The long context pitch rests on quality: a model reading the full document answers better than a model reading 5 retrieved chunks. For some task types, this is true. For the majority of enterprise document workloads, research disagrees. The "Lost in the Middle" study (Liu et al., 2023) found that accuracy degrades as contexts grow and depends on where in the context the answer sits.
When the answer appears in the first 10% of the context, accuracy runs 73%. When it is in the middle of a long document, accuracy drops to 45%.

Source: Liu et al., "Lost in the Middle: How Language Models Use Long Contexts," arXiv:2307.03172, 2023

This effect persists in 2026 frontier models, though at reduced magnitude. A 2026 Stanford HAI benchmark across five leading models on documents over 300 pages found that well-tuned RAG (a BM25 plus dense retrieval hybrid, top-5 chunks) matched or exceeded long-context QA accuracy on 78% of tested query types, including standard Q&A, fact extraction, and summarization.

Source: Stanford HAI, "Long Context vs. RAG in 2026: A Benchmark Report," April 2026

The cases where long context genuinely outperforms well-tuned RAG are narrower than the pitch suggests.

Three tasks where long context wins.

Global document reasoning. When the answer requires synthesizing patterns that span the entire document ("what are the three recurring objections across this 300-page customer feedback corpus?"), retrieval misses the signal. No single chunk contains the answer. Long context reads everything and identifies distributed patterns that retrieval cannot surface.

Multi-hop reasoning across non-contiguous passages. When the correct answer requires connecting two passages 200 pages apart, and neither passage alone is a likely retrieval hit, long context wins. Standard RAG retrieves locally relevant chunks and cannot always surface the bridging connection. This is the hardest RAG failure mode to address without specialized multi-hop retrieval architectures.

Stateful coding agents. When an agent is working through a large codebase over many turns (reading, editing, running tests, opening PRs), keeping the full codebase in context at each step trades a higher per-turn cost for coherence. Long context maintains the full state of the agent's understanding. This is the use case most likely to justify the cost for engineering teams.

For standard Q&A, fact extraction, summarization, and document search, which account for 70 to 80% of enterprise document AI workloads, RAG wins on both cost and quality.

The routing solution.

The correct architecture is not always-RAG or always-long-context. It is a router that classifies each query and sends it to the appropriate retrieval strategy.

    from enum import Enum


    class RetrievalStrategy(Enum):
        RAG = "rag"                    # ~$0.007/query
        LONG_CONTEXT = "long_context"  # ~$1.00/query


    # Query types that need the whole document (or codebase) in view.
    GLOBAL_REASONING = {"synthesis", "pattern_analysis", "cross_section_comparison"}
    AGENTIC_CODING = {"code_agent", "stateful_edit", "multi_doc_synthesis"}


    def select_retrieval_strategy(
        query_type: str,
        doc_size_tokens: int,
    ) -> RetrievalStrategy:
        """Use long context only for global-reasoning and stateful coding tasks."""
        if query_type in GLOBAL_REASONING and doc_size_tokens < 500_000:
            return RetrievalStrategy.LONG_CONTEXT
        if query_type in AGENTIC_CODING and doc_size_tokens < 900_000:
            return RetrievalStrategy.LONG_CONTEXT
        # Default: Q&A, extraction, summarization, lookup -> RAG
        return RetrievalStrategy.RAG


    def answer_query(query: str, query_type: str, document: dict) -> str:
        # `retriever` and `llm` stand in for your own retrieval and model
        # clients; they are not defined in this snippet.
        strategy = select_retrieval_strategy(
            query_type=query_type,
            doc_size_tokens=document["token_count"],
        )
        if strategy == RetrievalStrategy.RAG:
            chunks = retriever.get_top_k(query, k=5)
            return llm.answer(query, context=chunks, model="claude-sonnet-4-5")
        return llm.answer(query, context=document["full_text"], model="gemini-2.5-pro")

A production document AI that routes correctly (RAG for Q&A and extraction, long context only for synthesis and stateful agent tasks) runs at a blended cost of $0.05 to $0.10 per query. At 10,000 queries per day, that is $500 to $1,000 per day versus $10,000 for a long-context-only architecture. The query classification step itself costs almost nothing: a 200-token call to Haiku 4.5 at $0.80/M is $0.00016 per classification.
That is roughly 6,000 times smaller than the $0.993 you save on each correctly routed query.

What to measure this week.

Average input tokens per query, segmented by task type. Log the token count for every API call and group by what users are actually asking. If your Q&A queries average over 50,000 input tokens, you are loading far more document context than the task requires. Most Q&A questions need 2,000 to 5,000 tokens of relevant context, not 500,000.

    from collections import defaultdict

    import anthropic

    client = anthropic.Anthropic()
    token_samples = defaultdict(list)


    def measure_query_cost(query_type: str, messages: list, model: str = "claude-sonnet-4-5"):
        """Count input tokens for a request and record its estimated input cost."""
        count = client.messages.count_tokens(model=model, messages=messages)
        rate = 3.0  # Claude Sonnet 4.5 input rate, USD per million tokens
        cost = count.input_tokens * rate / 1_000_000
        token_samples[query_type].append((count.input_tokens, cost))
        return count.input_tokens, cost

After logging 1,000 queries, print the cost breakdown:

    for qtype, samples in token_samples.items():
        avg_tokens = sum(t for t, _ in samples) / len(samples)
        avg_cost = sum(c for _, c in samples) / len(samples)
        annual = avg_cost * 10_000 * 365  # 10K queries per day for a year
        print(f"{qtype}: {avg_tokens:,.0f} avg tokens | ${avg_cost:.4f}/query | ${annual:,.0f}/year at 10K daily")

RAG accuracy versus long context accuracy on your own queries. Pull 100 queries from production logs. Run both approaches against a ground-truth answer set. Most teams find RAG within 2 to 5 percentage points of long context for standard Q&A, at 148x lower cost. The teams that run this comparison typically discover their long-context architecture was not buying quality. It was just paying for it.

The 1M token context window is a genuine engineering achievement. The use cases that justify paying $1.00 per query for it are narrower than its marketing suggests. Measure before you architect. The teams routing correctly are spending $500 to $1,000 per day where long-context-only teams spend $10,000.

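
The second measurement above, RAG accuracy versus long-context accuracy on your own queries, can be sketched as a small harness. This is an illustration under stated assumptions, not a definitive evaluation method: `EvalCase`, `contains_expected`, `accuracy`, and `compare` are hypothetical names, the substring check is a deliberately crude stand-in for whatever grading you trust, and `rag_fn` and `long_context_fn` are assumed to be your own functions that take a query and return an answer string.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class EvalCase:
    query: str
    expected: str  # ground-truth answer from your own labeled set


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are not scored as errors."""
    return " ".join(text.lower().split())


def contains_expected(answer: str, expected: str) -> bool:
    """Crude correctness check: the normalized expected answer appears in the model's answer."""
    return normalize(expected) in normalize(answer)


def accuracy(cases: Sequence[EvalCase], answer_fn: Callable[[str], str]) -> float:
    """Fraction of cases whose answer passes the correctness check."""
    if not cases:
        raise ValueError("need at least one evaluation case")
    correct = sum(contains_expected(answer_fn(c.query), c.expected) for c in cases)
    return correct / len(cases)


def compare(
    cases: Sequence[EvalCase],
    rag_fn: Callable[[str], str],
    long_context_fn: Callable[[str], str],
    rag_cost: float,           # measured USD per query for the RAG path
    long_context_cost: float,  # measured USD per query for the long-context path
) -> dict:
    """Report both accuracies, the gap in percentage points, and the cost ratio."""
    rag_acc = accuracy(cases, rag_fn)
    lc_acc = accuracy(cases, long_context_fn)
    return {
        "rag_accuracy": rag_acc,
        "long_context_accuracy": lc_acc,
        "accuracy_gap_points": (lc_acc - rag_acc) * 100,
        "cost_ratio": long_context_cost / rag_cost,
    }


if __name__ == "__main__":
    # Stub answer functions for demonstration only; replace with real pipelines.
    demo_cases = [EvalCase("What is the notice period?", "30 days")]
    report = compare(
        demo_cases,
        rag_fn=lambda q: "The notice period is 30 days.",
        long_context_fn=lambda q: "30 days",
        rag_cost=0.00675,
        long_context_cost=1.00,
    )
    print(report)
```

Passing the two pipelines in as plain callables keeps the harness independent of any particular retriever or model client, so the same 100 production queries can be replayed against both paths and the accuracy gap read next to the cost ratio.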
Sources: Scale AI, "State of Enterprise AI 2026," Q1 2026. Google AI Studio, Gemini 2.5 Pro Pricing, June 2026. Anthropic, Claude Pricing, June 2026. OpenAI, Embedding and API Pricing. Liu et al., "Lost in the Middle: How Language Models Use Long Contexts," arXiv:2307.03172, 2023. Stanford HAI, "Long Context vs. RAG in 2026: A Benchmark Report," April 2026. OpenAI Tokenizer documentation.