The largest routing benchmark ever published found a surprising result. LLMRouterBench (arXiv:2601.07206, ACL 2026 Findings) benchmarked 10 routing methods across 33 models, 21 datasets, and over 400,000 query instances. It is the most comprehensive evaluation of LLM routing to date. The headline finding: most commercial routers fail to outperform a simple size-based baseline.

What the benchmark measured.

The 10 routing methods tested included embedding-based classifiers, reward model routing, LLM-as-judge routing, cascade systems, and commercial routing APIs. All were evaluated across three dimensions:

- Quality preservation: how close routed outputs are to always using the best model
- Cost reduction: savings relative to always using the frontier model
- Routing overhead: latency and compute added by the router

The pool included 33 models, ranging from 7B open-source models to GPT-5.5 and Claude Opus 4.8, and the 21 datasets span coding, reasoning, summarization, and factual QA.

The failure mode.

Most routers underperform because they optimize for query difficulty rather than the actual failure mode they face: model recall failures. When a cheap model fails a query, it is rarely because the query was genuinely hard. It is because cheap-model failures are inconsistent. The same query, rephrased slightly, succeeds. The same query, unchanged, fails on a different sample. Routers trained to detect difficulty learn from consistent failure patterns. When failure is stochastic, the training signal is noise.

The study also found that embedding backbone choice changes routing quality by less than 2%. Swapping BERT for a larger model, or using domain-specific fine-tuned embeddings, made negligible difference. Most teams over-invest in the embedding pipeline and under-invest in the model pool.

What worked.

Ensemble routing consistently outperformed single-classifier baselines.
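As a rough illustration of ensemble routing (a minimal sketch, not the method evaluated in LLMRouterBench or any cited paper), several independent predictors each estimate the chance that the cheap model will answer acceptably, and the router escalates only when the averaged estimate falls below a threshold. The three predictor functions, their scores, the `0.7` threshold, and the `"cheap"`/`"frontier"` labels are all hypothetical.

```python
from statistics import mean
from typing import Callable, Sequence

# A predictor maps a query to an estimated probability (0..1) that the
# cheap model answers it acceptably. All three below are made-up signals.
Predictor = Callable[[str], float]


def length_predictor(query: str) -> float:
    """Assumption: short queries are more likely to be handled cheaply."""
    return 1.0 if len(query.split()) < 30 else 0.3


def keyword_predictor(query: str) -> float:
    """Assumption: a few verbs hint at tasks where cheap models struggle."""
    hard_terms = {"prove", "derive", "refactor", "optimize"}
    words = set(query.lower().split())
    return 0.2 if words & hard_terms else 0.9


def constant_prior(query: str) -> float:
    """Assumption: a fixed prior success rate for the cheap model."""
    return 0.7


def ensemble_route(
    query: str,
    predictors: Sequence[Predictor],
    threshold: float = 0.7,
) -> str:
    """Average the predictors' estimates and escalate below the threshold."""
    score = mean([p(query) for p in predictors])
    return "cheap" if score >= threshold else "frontier"


PREDICTORS = [length_predictor, keyword_predictor, constant_prior]

if __name__ == "__main__":
    for q in ("What is the capital of France?",
              "Prove that the sum of two even numbers is even"):
        print(q, "->", ensemble_route(q, PREDICTORS))
```

Averaging is the simplest aggregation; a weighted vote or a learned combiner would slot into `ensemble_route` without changing the rest.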
Combining multiple independent predictors averages out stochastic failures that break any single classifier. Avengers-Pro (arXiv:2508.12631, ACM DAI 2025) pushed this further: instead of routing to one best model, it assembles a dynamic committee weighted by query type and domain. On the RouterBench evaluation set, it surpassed GPT-5-medium by 7% on quality while operating at 63% lower cost.

RouterEval (EMNLP 2025) confirmed a parallel finding: adding more models to the routing pool improves quality more reliably than improving the router itself. More candidates mean better coverage of query types.

Three tests before buying a commercial router.

| Test | What to check |
|------|---------------|
| Beat the size baseline | Route cheap by default, frontier for hard queries by word count or domain. If the router doesn't beat this, it adds no value. |
| Check oracle gap by query type | Aggregate 95% quality can hide a 40% gap on specific distributions. Audit by domain, not just average. |
| Skip embedding optimization | The data shows it doesn't move the needle. Expand the model pool instead. |

The practical implication.

The ACL 2026 findings confirm what practitioners are discovering independently: routing quality is dominated by model pool breadth and ensemble signal aggregation, not router sophistication. If your routing setup is underperforming, the most likely cause is a training distribution mismatch or an insufficient model pool, not the classifier architecture.

Nadir uses multi-signal routing with dynamic threshold calibration against observed production performance, directly addressing the oracle gap the study identifies. See where your current setup is leaving cost or quality on the table.

Sources:

- LLMRouterBench (arXiv:2601.07206), ACL 2026 Findings
- Avengers-Pro (arXiv:2508.12631), ACM DAI 2025
- RouterEval, EMNLP 2025
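The first two tests in the table above can be sketched in a few lines, assuming you have logged a per-query quality score for every model in your pool. This is an illustrative audit, not the benchmark's evaluation code: the `Record` fields, the `HARD_DOMAINS` set, the 40-word cutoff, and the toy scores are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass
class Record:
    query: str
    domain: str
    quality: dict[str, float]  # model name -> quality score in [0, 1]
    routed_to: str             # model picked by the router under test


HARD_DOMAINS = {"coding", "reasoning"}  # assumption: domains sent to frontier


def size_baseline(rec: Record, max_words: int = 40) -> str:
    """Cheap by default; frontier for long queries or 'hard' domains."""
    hard = rec.domain in HARD_DOMAINS or len(rec.query.split()) > max_words
    return "frontier" if hard else "cheap"


def mean_quality(records: list[Record], choose: Callable[[Record], str]) -> float:
    """Average quality obtained when `choose` picks the model per record."""
    return sum(r.quality[choose(r)] for r in records) / len(records)


def oracle_gap_by_domain(records: list[Record]) -> dict[str, float]:
    """Per-domain gap between the best available model and the routed one."""
    by_domain: dict[str, list[Record]] = defaultdict(list)
    for r in records:
        by_domain[r.domain].append(r)
    gaps = {}
    for domain, recs in by_domain.items():
        oracle = sum(max(r.quality.values()) for r in recs) / len(recs)
        routed = sum(r.quality[r.routed_to] for r in recs) / len(recs)
        gaps[domain] = oracle - routed
    return gaps


# Toy log with made-up scores, for illustration only.
LOG = [
    Record("What year did WWII end?", "factual_qa", {"cheap": 1.0, "frontier": 1.0}, "cheap"),
    Record("Fix this off-by-one bug", "coding", {"cheap": 0.0, "frontier": 1.0}, "cheap"),
    Record("Summarize this memo", "summarization", {"cheap": 1.0, "frontier": 1.0}, "cheap"),
    Record("Refactor this module", "coding", {"cheap": 1.0, "frontier": 1.0}, "frontier"),
]

if __name__ == "__main__":
    print("router:  ", mean_quality(LOG, lambda r: r.routed_to))
    print("baseline:", mean_quality(LOG, size_baseline))
    print("gaps:    ", oracle_gap_by_domain(LOG))
```

On this toy log the router's aggregate quality looks respectable while the coding domain hides a large gap, which is the pattern the second test is meant to surface. A fuller comparison would also track cost for both the router and the baseline.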