
Routing Benchmark Results: 10-30% Savings on Real Prompts

LLM routing promises cost savings, but promises are cheap. We wanted hard numbers. So we designed a controlled benchmark that tests our routing and context optimization features against a pure baseline of sending everything to the most expensive model.

The short version: Router + Safe optimize saves 10% on aggregate cost across a diverse 30-prompt mix, with the real value showing up on medium-complexity prompts where savings reach 28-30%. Quality was verified by an independent LLM judge at 87% (26/30 prompts maintained). Here is how we ran the experiment and what we learned.

The configurations we tested

We tested several variants, progressively layering on cost-saving features. Three stood out as meaningful:

  1. V1 — Baseline (Opus): Every prompt goes to Claude Opus 4.6. No routing, no optimization. This is the control.
  2. V2 — Router Only: NadirClaw's heuristic classifier analyzes each prompt and routes simple ones to cheaper models (Haiku, Sonnet) while keeping complex prompts on Opus.
  3. V4 — Router + Safe Optimize: Routing combined with lossless context compaction (JSON minification, whitespace normalization, tool schema dedup).

We also tested aggressive optimization (V6 — Router + Aggressive), which adds semantic deduplication via sentence embeddings. It did not improve over the baseline in our benchmark (details in its own section below), so we do not recommend it as a default; safe mode remains the better choice for production.
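Safe mode's transforms are lossless by construction: the optimized context carries the same information in fewer tokens. Here is a minimal sketch of two of them, JSON minification and whitespace normalization. This is an illustration of the technique, not NadirClaw's actual code:

```python
import json
import re

def minify_json(text: str) -> str:
    """Re-serialize parseable JSON with no extra whitespace (lossless)."""
    try:
        return json.dumps(json.loads(text), separators=(",", ":"))
    except ValueError:
        return text  # not JSON; leave untouched

def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs and strip trailing whitespace per line."""
    lines = (re.sub(r"[ \t]+", " ", line).rstrip() for line in text.splitlines())
    return "\n".join(lines)

def safe_optimize(text: str) -> str:
    """Apply lossless transforms; the output says exactly what the input said."""
    return normalize_whitespace(minify_json(text))

context = '{\n  "tool":   "search",\n  "args": {"q":  "llm routing"}\n}'
print(safe_optimize(context))  # {"tool":"search","args":{"q":"llm routing"}}
```

Because every transform is reversible in meaning, there is nothing for the downstream model to misread; the only effect is a smaller token count.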

The prompts

We assembled 30 diverse real-world prompts, ranging from simple factual questions to complex system design tasks. They fell into three natural categories: simple (quick factual lookups), medium (code generation, explanations, multi-step reasoning), and complex (system design and other work that genuinely needs a frontier model).

This distribution mirrors what we see in real production traffic: a mix of trivial and hard requests, with the majority falling somewhere in the middle.

Methodology

Every configuration was tested under identical conditions: the same 30 prompts, with each response scored by the same independent LLM judge (Claude Sonnet 4.6) against the baseline Opus output on correctness, completeness, and clarity.

Using an LLM as judge is not perfect, but it scales and removes human bias. We chose Sonnet specifically because it is capable enough to evaluate quality but is not the model being tested, avoiding circular evaluation.
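Concretely, the judge sees both answers side by side and returns a verdict. A minimal sketch of how such a pairwise-comparison prompt can be assembled (the wording and rubric here are illustrative, not NadirClaw's exact judge prompt):

```python
RUBRIC = ("correctness", "completeness", "clarity")

def build_judge_prompt(question: str, baseline: str, candidate: str) -> str:
    """Assemble a pairwise-comparison prompt for the judge model."""
    criteria = ", ".join(RUBRIC)
    return (
        f"You are comparing two answers to the same question on {criteria}.\n\n"
        f"Question:\n{question}\n\n"
        f"Answer A (baseline):\n{baseline}\n\n"
        f"Answer B (candidate):\n{candidate}\n\n"
        "Reply MAINTAINED if Answer B is at least as good as Answer A, "
        "otherwise reply DEGRADED."
    )

prompt = build_judge_prompt("What is 2+2?", "4", "Four.")
```

The prompt is then sent to the judge model (Sonnet in our setup), and a MAINTAINED verdict counts toward the pass rate.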

Results: cost per 30-prompt run

Configuration          Total cost   Savings
V1 Baseline (Opus)     $0.246
V2 Router Only         $0.225       9%
V4 Router + Safe       $0.223       10%

The aggregate 10% savings is real but modest. However, the aggregate number masks where the real value lies: medium-complexity prompts, which make up the bulk of typical workloads.

Results: per-category savings (Router + Safe)

Breaking down V4 (Router + Safe) by prompt category reveals the full picture:

Category          Savings vs baseline
Simple prompts    14%
Medium prompts    28-30%
Complex prompts   0% (preserved)

Medium prompts are the sweet spot. These typically get routed to Sonnet, which costs significantly less than Opus while delivering equivalent quality for this tier of work. At 28-30% savings, this is where the router pays for itself.

Simple prompts saved 14% by being routed to cheaper models like Haiku. Complex prompts were correctly kept on Opus; the router does not downgrade tasks that genuinely need a frontier model.

Quality verification

Quality was verified by an independent LLM-as-judge (Claude Sonnet 4.6). Of the 30 prompts, 26 maintained quality according to the judge — an 87% pass rate. The judge compared each routed response against the baseline Opus response on correctness, completeness, and clarity.

The 4 prompts that showed quality differences were edge cases at category boundaries where the classifier routed to a cheaper model than ideal. This is an area where classifier improvements can drive further gains.

Latency

Latency was similar across all configurations. The routing classifier adds negligible overhead to request processing, so response times are determined by the downstream model, not the router.

What about aggressive optimization?

We also tested aggressive optimization (V6 — Router + Aggressive), which applies semantic deduplication via sentence embeddings on top of routing. The result: $0.249 total cost, essentially flat compared to the $0.246 baseline. The aggressive optimizer did not help in this benchmark.

This does not mean semantic deduplication is useless — it may show value on workloads with heavily duplicated context (e.g., repeated tool schemas in long conversations). But for the diverse prompt mix we tested, safe mode (lossless transforms only) is the better default.
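To illustrate the idea behind semantic deduplication: near-duplicate sentences in the context are dropped when their similarity to something already kept exceeds a threshold. NadirClaw's aggressive mode uses sentence embeddings for the similarity measure; this dependency-free sketch substitutes token-set Jaccard similarity, so it is a stand-in for the real technique, not the actual implementation:

```python
def _tokens(sentence: str) -> frozenset:
    return frozenset(sentence.lower().split())

def jaccard(a: frozenset, b: frozenset) -> float:
    """Overlap of two token sets: 1.0 means identical, 0.0 means disjoint."""
    return len(a & b) / len(a | b) if a | b else 1.0

def dedup_sentences(sentences: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a sentence only if it is not near-identical to one already kept."""
    kept: list[str] = []
    for s in sentences:
        if all(jaccard(_tokens(s), _tokens(k)) < threshold for k in kept):
            kept.append(s)
    return kept

context = [
    "The search tool returns JSON results.",
    "The search tool returns JSON results.",  # exact repeat: dropped
    "Use the calculator for arithmetic.",
]
print(dedup_sentences(context))
```

Unlike safe mode's transforms, this is lossy: a dropped sentence might have carried a nuance its near-duplicate lacked, which is one plausible reason aggressive mode showed no net benefit on our diverse mix.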

The heuristic classifier

NadirClaw's routing classifier is deliberately simple. It is a rule-based heuristic system, not a neural network. It analyzes prompt length, vocabulary complexity, presence of code or structured data, and a handful of other surface-level features to assign a complexity tier.

Why heuristic instead of ML? Three reasons:

  1. Speed: classification runs in under 1ms, adding no measurable latency to requests.
  2. Zero dependencies: no model weights or embedding services to deploy; the classifier runs locally.
  3. Transparency: rule-based decisions are easy to inspect, debug, and tune.

The heuristic approach means the classifier is not perfect — the 87% quality rate reflects this. But the benchmark shows that even with these edge cases, the aggregate savings are meaningful, especially for medium-complexity workloads.
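A toy version of such a classifier might look like this. The specific thresholds, regex, and scoring weights below are illustrative assumptions; NadirClaw's actual rules differ:

```python
import re

def classify(prompt: str) -> str:
    """Assign a complexity tier from cheap surface-level features."""
    words = prompt.split()
    has_code = bool(re.search(r"```|def |class |SELECT |\{.*\}", prompt))
    long_words = sum(1 for w in words if len(w) > 9)  # crude vocabulary signal

    score = 0
    score += 2 if len(words) > 150 else (1 if len(words) > 40 else 0)
    score += 2 if has_code else 0
    score += 1 if long_words > 5 else 0

    if score >= 3:
        return "complex"   # stay on the frontier model (Opus)
    if score >= 1:
        return "medium"    # route to a mid-tier model (Sonnet)
    return "simple"        # route to a cheap model (Haiku)

print(classify("What is 2+2?"))               # simple
print(classify("def add(a, b): return a + b"))  # medium
```

Every feature is computable in a single pass over the prompt, which is why this style of classifier can run in well under a millisecond.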

ROI at scale

At $10,000/month in LLM spend, the 10% savings from Router + Safe translates to roughly $1,000/month saved (before routing fees). After NadirClaw's routing fee, the net savings are approximately $750/month, or $9,000/year.

For workloads skewed toward medium-complexity prompts, savings can reach 20-30%, significantly improving the ROI.

The math scales linearly. The classifier runs locally at negligible cost, so the only variable cost is the routing fee, which is a percentage of savings — you only pay when you save.
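The arithmetic above, made explicit. The 25% fee share is inferred from the $1,000 gross vs $750 net figures in this post and is an assumption, not a published rate:

```python
def net_monthly_savings(monthly_spend: float,
                        savings_rate: float = 0.10,
                        fee_share_of_savings: float = 0.25) -> float:
    """Gross savings minus a routing fee charged as a share of those savings."""
    gross = monthly_spend * savings_rate
    return gross * (1 - fee_share_of_savings)  # you only pay when you save

print(round(net_monthly_savings(10_000)))                       # 750
print(round(net_monthly_savings(10_000) * 12))                  # 9000
print(round(net_monthly_savings(10_000, savings_rate=0.30)))    # medium-heavy mix
```

Because the fee is proportional to savings, a workload with zero routable prompts pays nothing; the downside is bounded.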

How to enable routing

NadirClaw is a drop-in proxy. Point your base URL at it and add your API key:

# Install
pip install nadirclaw

# Start the router with safe context optimization
nadirclaw serve --optimize safe

Then configure your client to use the local proxy:

# Claude Code
claude config set --global apiBaseUrl http://localhost:8856/v1

# Or via environment variable for any OpenAI-compatible client
export OPENAI_BASE_URL=http://localhost:8856/v1

For per-request control, you can pass routing parameters in the JSON body:

curl http://localhost:8856/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-key" \
  -d '{
    "model": "auto",
    "optimize": "safe",
    "messages": [{"role": "user", "content": "What is 2+2?"}]
  }'

Setting model to "auto" enables routing. The router will classify the prompt and select the appropriate model tier. You can also set model to a specific model name to bypass routing for individual requests.
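The same per-request controls work from Python. This standard-library sketch mirrors the curl example above; the endpoint path, header, and body fields come from that example, and the helper names are my own:

```python
import json
from urllib import request

BASE_URL = "http://localhost:8856/v1"  # the local NadirClaw proxy

def build_payload(prompt: str, optimize: str = "safe") -> dict:
    """Per-request routing controls live in the JSON body."""
    return {
        "model": "auto",       # "auto" enables routing; a concrete name bypasses it
        "optimize": optimize,  # NadirClaw-specific optimization level
        "messages": [{"role": "user", "content": prompt}],
    }

def send(payload: dict, api_key: str = "your-key") -> dict:
    """POST to the proxy; requires `nadirclaw serve` to be running."""
    req = request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

payload = build_payload("What is 2+2?")
```

Any OpenAI-compatible client library should work the same way, since only the base URL and the extra `optimize` field differ from a stock request.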

What we learned

Three takeaways from running this benchmark:

  1. Medium-complexity prompts are the sweet spot. At 28-30% savings, this is where routing delivers the most value. Simple prompts save less in absolute terms, and complex prompts are correctly preserved on the same tier.
  2. Aggregate savings depend on your prompt mix. Our 10% overall figure reflects a diverse mix. If your workload is heavy on medium-complexity prompts (code generation, explanations, multi-step reasoning), expect higher savings.
  3. Simple heuristics work reasonably well. Our rule-based classifier, running in under 1ms with zero dependencies, achieved 87% quality preservation. There is room for improvement, and classifier upgrades are our top priority for the next release.

Try it yourself

NadirClaw is open source and MIT licensed. Install it with pip install nadirclaw, run nadirclaw serve, and check your savings in the built-in dashboard. The full benchmark methodology and raw data are available on GitHub.

If you want to see projected savings before committing, the ROI calculator on our homepage lets you plug in your monthly spend and see the numbers.