98% cost reduction from LLM cascades is achievable. Here is how to build it. FrugalGPT, published by Stanford researchers in TMLR 2024, demonstrated up to 98% cost reduction by running queries through a cascade: cheap model first, escalate to frontier only when necessary. Three papers since have refined the approach. None require machine learning expertise to implement. This is the implementation guide. The core idea. A cascade router has three components: A cheap model that attempts every query A quality estimator that evaluates the response An escalation threshold that decides when to retry with a frontier model The cheap model handles 70-90% of queries in most enterprise workloads. The frontier model only sees the hard ones. The cost math is straightforward: if 80% of queries route to a $0.10/M model and 20% to a $15/M model, your blended rate is $3.08/M instead of $15/M — an 80% reduction with no quality loss on queries the cheap model handles well. FrugalGPT found model combinations achieving 98% cost reduction at matched quality because the quality estimator was trained on the specific domain. Untrained implementations typically achieve 60-80%. Implementation: basic cascade. CHEAP_MODEL = "gemini-2.0-flash" # $0.10/M input FRONTIER_MODEL = "claude-opus-4-8" # $15/M input QUALITY_THRESHOLD = 0.75 def estimate_quality(response: str, query: str) -> float: """Heuristic quality scorer. Replace with trained classifier in production.""" score = 0.7 # base penalties = ["i don't know", "i cannot", "i'm not sure"] bonuses = ["specifically", "in particular", "according to"] for phrase in penalties: if phrase in response.lower(): score -= 0.2 for phrase in bonuses: if phrase in response.lower(): score += 0.05 length_factor = min(len(response) / 500, 1.0) 0.15 return min(max(score + length_factor, 0.0), 1.0) def cascade_query(query: str) -> tuple[str, str]: """Returns (response, model_used).""" cheap_response = call_model(CHEAP_MODEL, query) quality = estimate_quality(cheap_response, query) if quality >= QUALITY_THRESHOLD: return cheap_response, CHEAP_MODEL Escalate to frontier return call_model(FRONTIER_MODEL, query), FRONTIER_MODEL The BEST-Route improvement. Microsoft's BEST-Route (arXiv:2506.22716, ICML 2025) found a flaw in naive cascades: escalating after one failure discards information from the cheap model's output. BEST-Route samples the cheap model multiple times with varied temperatures, then checks for consensus before deciding whether to escalate. When multiple samples agree, the answer is likely correct even if a single sample looked uncertain. When samples diverge, escalation is warranted. def best_route_query(query: str, n_samples: int = 3) -> tuple[str, str]: """Sample cheap model N times. Escalate only on genuine disagreement.""" samples = [] for temp in [0.2, 0.5, 0.8]: samples.append(call_model(CHEAP_MODEL, query, temperature=temp)) quality_scores = [estimate_quality(r, query) for r in samples] avg_quality = sum(quality_scores) / len(quality_scores) score_variance = max(quality_scores) - min(quality_scores) Consensus: high average quality AND low variance between samples if avg_quality >= QUALITY_THRESHOLD and score_variance < 0.2: best = samples[quality_scores.index(max(quality_scores))] return best, CHEAP_MODEL Genuine uncertainty — escalate return call_model(FRONTIER_MODEL, query), FRONTIER_MODEL BEST-Route reduces frontier escalations by an additional 15-30% compared to single-sample cascades. If a naive cascade escalates 25% of queries, BEST-Route brings that to 17-21%. The ETH Zurich finding: cascade always beats pure routing. An ICML 2025 paper from ETH Zurich (arXiv:2410.10347) proved theoretically that a cascade router always dominates pure routing or pure cascading when the quality estimator is calibrated. Pure routing picks one model per query. Pure cascading tries cheap first, then frontier. The unified approach combines both: route clearly easy queries to cheap models directly, use cascades for medium-difficulty queries, and send genuinely hard queries straight to the frontier. def unified_router(query: str, difficulty_score: float) -> tuple[str, str]: """ Unified routing + cascading. difficulty_score: 0.0 = trivial, 1.0 = frontier required. Use a lightweight classifier to produce this score. """ if difficulty_score < 0.3: Easy: direct cheap model, skip cascade overhead return call_model(CHEAP_MODEL, query), CHEAP_MODEL elif difficulty_score < 0.7: Medium: BEST-Route cascade return best_route_query(query) else: Hard: direct frontier, escalation is certain return call_model(FRONTIER_MODEL, query), FRONTIER_MODEL The difficulty_score comes from a lightweight classifier (logistic regression on query length, keyword features, domain) — not a large model. The classifier itself costs negligible compute. Realistic cost outcomes. | Implementation | Escalation Rate | Cost vs. Always-Frontier | |---------------|----------------|--------------------------| | Naive cascade (single sample) | 20-30% | -70 to -80% | | BEST-Route (3 samples) | 15-22% | -78 to -85% | | Unified routing + cascade | 10-18% | -82 to -90% | | Domain-trained quality estimator | 5-15% | -85 to -98% | The 98% figure from FrugalGPT required a domain-trained estimator on a specific enterprise workload. For general workloads without training data, 80-85% is a realistic target with the unified approach. The quality estimator is the bottleneck. The heuristic estimator above is adequate for prototyping. Production systems need something better: Keyword heuristics — fast, cheap, domain-specific, breaks on novel failure modes Small classifier on response features — logistic regression on (query, response) pairs labeled good/bad by a frontier model. Requires 500-2,000 labeled examples. Reward model — fine-tune a small LLM to predict quality. Higher accuracy, higher cost per query. Most teams start with option 2. Labeling 1,000 pairs costs around $200 and recovers in the first hour of production traffic. Implementation checklist. Before going to production with any cascade: Establish quality baseline: run 200 queries through frontier model only, human-rate outputs Measure cheap model quality on the same 200 queries before implementing cascade Set escalation threshold to preserve 95% of frontier quality — start conservative Log every escalation decision with query features for offline analysis Monitor quality metrics weekly — model updates change calibration Skip the build if this is your first routing system. Building and calibrating a cascade from scratch takes 2-4 weeks for an engineering team that has done it before. Six to eight weeks is realistic for teams doing it the first time, and the quality estimator will need multiple calibration cycles before it is stable. Nadir ships a pre-calibrated cascade router with adaptive threshold tuning, available in one afternoon of integration work. Most teams achieve 75-85% cost reduction within the first week of traffic. Sources: FrugalGPT (arXiv:2305.05176), Stanford, TMLR 2024. BEST-Route (arXiv:2506.22716), Microsoft, ICML 2025. ETH Zurich cascade routing (arXiv:2410.10347), ICML 2025.*