Context Optimize: How We Cut LLM Input Tokens 30-70% Without Losing Accuracy
The problem: you are paying for formatting, not meaning
LLM input tokens are expensive. Claude Opus 4.6 charges $15 per million input tokens. GPT-4 is in the same ballpark. And in production workloads — agent loops, RAG pipelines, tool-heavy workflows — a large fraction of those tokens carry zero semantic weight.
Consider what actually gets stuffed into context on every request: pretty-printed JSON with indentation and newlines that a model does not need. Identical tool schemas repeated on every turn of an agent loop. System prompts duplicated inside user messages by certain frameworks. Long chat histories where early turns are no longer relevant. Multiple spaces and blank lines that inflate token counts without adding information.
We measured real payloads from agentic coding sessions, RAG pipelines, and API response analysis workflows. The formatting waste ranged from 29% to 71% of the total input tokens. That is money burned on whitespace and repetition.
Two modes: safe and aggressive
Context Optimize ships with two modes, both off by default.
Safe mode: five deterministic, lossless transforms
Safe mode applies five transforms that are fully reversible and never alter semantic content:
- JSON minification — finds JSON objects and arrays in message content using `json.JSONDecoder.raw_decode`, then re-serializes them with no whitespace. Content inside fenced code blocks is skipped. The compact form only replaces the original if it is actually shorter.
- Tool schema deduplication — detects identical tool/function schemas across messages by comparing their canonical JSON representation. The first occurrence is kept in full. Subsequent copies are replaced with a short reference like `[see tool "web_search" schema above]`.
- System prompt deduplication — if the system prompt text appears verbatim in a later user message (common in some orchestration frameworks), the duplicate is removed from the later message. The system message itself is never modified.
- Whitespace normalization — collapses three or more consecutive blank lines to two, and collapses runs of multiple spaces to one. Lines inside fenced code blocks are skipped to preserve formatting where it matters.
- Chat history trimming — for conversations exceeding a configurable turn limit (default: 40), keeps the system prompt, the first user/assistant turn for task context, and the last N turns. A placeholder message notes how many turns were trimmed.
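The trimming rule above can be sketched as follows. This is an illustrative version, not the actual implementation: the function name, the `keep_last` parameter, and the placeholder wording are assumptions.

```python
def trim_history(messages: list[dict], max_turns: int = 40, keep_last: int = 10) -> list[dict]:
    """Keep the system prompt, the first user/assistant exchange, and the
    last `keep_last` turns; summarize the rest with a placeholder message."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    if len(turns) <= max_turns:
        return messages  # under the limit: nothing to trim
    head = turns[:2]           # first user/assistant turn for task context
    tail = turns[-keep_last:]  # most recent turns
    trimmed = len(turns) - len(head) - len(tail)
    placeholder = {"role": "user", "content": f"[{trimmed} earlier turns trimmed]"}
    return system + head + [placeholder] + tail
```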
Every transform checks that its output is smaller than its input. If a replacement would somehow be larger, the original is kept untouched.
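The JSON minification transform (the first bullet above) can be sketched with `raw_decode` scanning, including the shorter-output guard. This is a simplified illustration: the real transform also skips fenced code blocks.

```python
import json

def minify_embedded_json(text: str) -> str:
    """Find JSON objects/arrays embedded in text and re-serialize them
    compactly, keeping the original whenever the compact form is not shorter."""
    decoder = json.JSONDecoder()
    out, i = [], 0
    while i < len(text):
        if text[i] in "{[":
            try:
                obj, end = decoder.raw_decode(text, i)
            except ValueError:
                out.append(text[i])  # not actually JSON; copy the character
                i += 1
                continue
            compact = json.dumps(obj, separators=(",", ":"), ensure_ascii=False)
            original = text[i:end]
            out.append(compact if len(compact) < len(original) else original)
            i = end
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```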
Aggressive mode: safe transforms + semantic deduplication
Aggressive mode runs all five safe transforms, then adds a sixth: semantic deduplication using sentence embeddings. It loads the all-MiniLM-L6-v2 model to compute embeddings for each message, compares them pairwise by cosine similarity, and compacts near-duplicate messages (similarity ≥ 0.85) into a short reference plus the unique differences.
System messages and short messages (under 60 characters) are never touched by semantic dedup.
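A minimal sketch of that gate, with plain Python in place of the real model — the vectors below stand in for all-MiniLM-L6-v2 embeddings, and the threshold and length cutoff come from the text above:

```python
import math

SIMILARITY_THRESHOLD = 0.85  # messages at or above this are near-duplicates
MIN_CHARS = 60               # shorter messages are never deduplicated

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def is_near_duplicate(later_msg: str, emb_a: list[float], emb_b: list[float]) -> bool:
    """Apply the dedup gate: skip short messages, then compare embeddings."""
    if len(later_msg) < MIN_CHARS:
        return False
    return cosine_similarity(emb_a, emb_b) >= SIMILARITY_THRESHOLD
```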
The accuracy problem with naive semantic dedup
The obvious approach to semantic deduplication: if two messages are 85%+ similar, replace the later one with a reference to the earlier one. In practice, this has a critical flaw.
Users frequently refine their instructions across turns. A user might say "write a function that returns the indices of the two numbers" in turn 1, then say "write a function that returns the actual values, not indices" in turn 3. Those two messages are about 90% similar by embedding cosine similarity. Naive dedup would replace turn 3 entirely with a reference to turn 1, and the model would never see the correction.
That kind of silent accuracy loss is worse than wasting tokens. The model produces the wrong output, the user has no idea why, and debugging is a nightmare because the payload looks fine before the optimizer touched it.
Our solution: diff-preserving dedup
Instead of replacing the later message wholesale, we extract the specific words that differ and preserve them. The implementation uses difflib.SequenceMatcher for word-level diffing:
```python
def _extract_diff_phrases(earlier: str, later: str) -> str:
    from difflib import SequenceMatcher

    a_words = earlier.split()
    b_words = later.split()
    sm = SequenceMatcher(None, a_words, b_words, autojunk=False)
    diff_parts: list[str] = []
    for tag, _i1, _i2, j1, j2 in sm.get_opcodes():
        if tag in ("insert", "replace"):
            diff_parts.append(" ".join(b_words[j1:j2]))
    return " ".join(diff_parts)
```
The function splits both messages into word tokens, runs the sequence matcher, and collects only the inserted or replaced runs of words from the later message. For the "indices vs. values" example above, the extracted diff is: actual values, not indices.
The replacement message looks like this:
```
[similar to earlier message: "Write a Python function that takes a list of integers and re..."]
Key differences: actual values, not indices.
```
The critical refinement is preserved. The boilerplate that was repeated word-for-word is removed. And there is one final guard: if the replacement (reference + diff) would be larger than the original message, the dedup is skipped entirely. This guarantees that aggressive mode never produces negative savings.
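Putting the pieces together, the compaction step with the size guard might look like this. The reference wording and the 60-character truncation are illustrative assumptions, not the actual implementation.

```python
from difflib import SequenceMatcher

def _extract_diff_phrases(earlier: str, later: str) -> str:
    """Collect words from `later` that were inserted or changed vs. `earlier`."""
    a_words, b_words = earlier.split(), later.split()
    sm = SequenceMatcher(None, a_words, b_words, autojunk=False)
    return " ".join(
        " ".join(b_words[j1:j2])
        for tag, _i1, _i2, j1, j2 in sm.get_opcodes()
        if tag in ("insert", "replace")
    )

def compact_near_duplicate(earlier: str, later: str) -> str:
    """Build the reference-plus-diff replacement; fall back to the original
    message if the replacement would not actually be smaller."""
    diff = _extract_diff_phrases(earlier, later)
    replacement = f'[similar to earlier message: "{earlier[:60]}..."]'
    if diff:
        replacement += f"\nKey differences: {diff}."
    # final guard: never produce negative savings
    return replacement if len(replacement) < len(later) else later
```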
Accurate token counting with tiktoken
Early in development, we used a len(text) // 4 heuristic for token estimation. This is the rule of thumb you see everywhere — roughly four characters per token for English text. We measured the actual error and found it was about 18% off compared to the real BPE tokenizer output, sometimes overcounting and sometimes undercounting.
For a feature whose entire value proposition is measured in token savings, an 18% measurement error is unacceptable. We switched to tiktoken with the cl100k_base encoding — the same BPE tokenizer GPT-4 uses, and a far better approximation for Claude-family models (which use their own tokenizer) than the character heuristic. The fallback to len // 4 only activates if tiktoken is not installed.
```python
try:
    import tiktoken as _tiktoken

    _enc = _tiktoken.get_encoding("cl100k_base")

    def _estimate_tokens_str(text: str) -> int:
        return max(1, len(_enc.encode(text, disallowed_special=())))
except Exception:
    def _estimate_tokens_str(text: str) -> int:
        return max(1, len(text) // 4)
```
Token counting also handles multi-part messages (the list[dict] content format used by vision models) and adds a per-message overhead of 4 tokens for role markers, matching the actual overhead LLM APIs apply.
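Message-level counting with the multi-part content format and the 4-token role overhead might look like the sketch below. It uses the len // 4 fallback so it runs without tiktoken; the function name and part-filtering details are assumptions.

```python
def _estimate_tokens_str(text: str) -> int:
    # fallback heuristic; the real counter prefers tiktoken's cl100k_base
    return max(1, len(text) // 4)

def estimate_message_tokens(message: dict) -> int:
    """Count tokens for one message, handling both plain-string content and
    the multi-part list[dict] format, plus 4 tokens of role-marker overhead."""
    content = message.get("content", "")
    if isinstance(content, list):
        # vision-style multi-part content: sum only the text parts
        total = sum(
            _estimate_tokens_str(part.get("text", ""))
            for part in content
            if isinstance(part, dict) and part.get("type") == "text"
        )
    else:
        total = _estimate_tokens_str(str(content))
    return total + 4  # per-message overhead for role markers
```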
Benchmarks
We benchmarked safe mode against five real-world payload types, measured in tokens with tiktoken, priced at Claude Opus 4.6 rates ($15/1M input tokens):
| Payload | Before (tokens) | After (tokens) | Saved % | Saved $ / 1K req |
|---|---|---|---|---|
| Agentic coding assistant | 3,657 | 1,573 | 57% | $31.26 |
| RAG pipeline (6 chunks) | 544 | 386 | 29% | $2.37 |
| API response analysis (nested JSON) | 1,634 | 616 | 62% | $15.27 |
| Long debug session (50 turns) | 3,856 | 1,414 | 63% | $36.63 |
| OpenAPI spec context (5 endpoints) | 2,649 | 762 | 71% | $28.30 |
The heaviest savings come from payloads with deeply nested JSON (the OpenAPI spec, the API response analysis) and from agent loops where tool schemas repeat across turns. RAG payloads benefit less because retrieval chunks are mostly prose with little formatting waste, but 29% is still meaningful at scale.
Combined with routing: NadirClaw's tier-based routing already saves 40-70% by sending simple prompts to cheaper models. Context Optimize stacks on top of that. If routing cuts your bill by 50% and optimize cuts input tokens by another 60%, you pay 0.5 × 0.4 = 20% of the original input cost, an 80% effective reduction, more than either saving alone.
How to enable
Context Optimize is off by default. When disabled, it adds zero overhead — the code path is never entered.
```
# Start the server with safe mode enabled
nadirclaw serve --optimize safe

# Or set it via environment variable
NADIRCLAW_OPTIMIZE=safe nadirclaw serve

# Per-request override in the JSON body
{"model": "auto", "optimize": "aggressive", "messages": [...]}

# Dry-run on a payload file (no server needed)
nadirclaw optimize payload.json --mode safe --format json
```
Three settings are available:
- `off` — default. No processing. Zero overhead.
- `safe` — lossless transforms only. Recommended for production.
- `aggressive` — all safe transforms plus semantic dedup via sentence embeddings. Requires `sentence-transformers`.
The chat history trim threshold is configurable via NADIRCLAW_OPTIMIZE_MAX_TURNS (default: 40). Per-request overrides take precedence over the server-wide setting.
Test coverage
The optimize module has 60 automated tests covering correctness, edge cases, and roundtrip integrity. Here are the categories that matter most:
- JSON roundtrip — `json.loads(before) == json.loads(after)` for every minification. No keys dropped, no values changed, no type coercion.
- Code block preservation — content inside fenced code blocks is never modified by any transform.
- Unicode safety — all transforms use `ensure_ascii=False`. CJK characters, emoji, and RTL text pass through unchanged.
- No negative savings — every transform checks that its output is smaller than its input. If not, the original is kept.
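The roundtrip property from the first bullet can be expressed as a single check. Here `json.dumps` with compact separators stands in for the minifier; the function name is illustrative.

```python
import json

def check_roundtrip(original: str) -> bool:
    """Minified JSON must parse back to exactly the same structure,
    and must never be larger than the original."""
    obj = json.loads(original)
    minified = json.dumps(obj, separators=(",", ":"), ensure_ascii=False)
    return json.loads(minified) == obj and len(minified) <= len(original)
```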
Five tests specifically target aggressive mode accuracy:
- Refined instructions preserved — when a user changes "return indices" to "return values, not indices," the diff phrase survives dedup.
- Format changes preserved — switching from "use a list" to "use a dictionary" is captured in the diff output.
- Language changes preserved — if a user switches from "write in Python" to "write in Rust," the language change is kept.
- No negative savings — if the reference-plus-diff replacement is larger than the original, dedup is skipped.
- Exact duplicates compacted — truly identical messages produce a clean reference with no diff section.
Implementation details
The optimizer is a single Python module (nadirclaw/optimize.py) with no dependency on FastAPI, Pydantic, or the rest of the server. All functions operate on plain list[dict] messages, which makes the module easy to test in isolation and reuse outside the server context.
The main entry point is optimize_messages(), which returns an OptimizeResult dataclass containing the optimized messages, the original and optimized token counts, and a list of which transforms were applied. This makes it straightforward to log exactly what happened to each request.
```python
from nadirclaw.optimize import optimize_messages

result = optimize_messages(messages, mode="safe", max_turns=40)
print(f"Tokens: {result.original_tokens} -> {result.optimized_tokens}")
print(f"Saved: {result.tokens_saved} ({result.tokens_saved / result.original_tokens:.0%})")
print(f"Transforms: {result.optimizations_applied}")
```
For aggressive mode, the sentence-transformers dependency is loaded lazily. If it is not installed, semantic dedup is silently skipped and the optimizer falls back to safe-mode transforms only. This means you can run pip install nadirclaw without pulling in PyTorch for safe mode, and only install the ML dependencies if you want aggressive dedup.
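The lazy-loading pattern is roughly this (the function name is illustrative; the try/except-ImportError shape is the point):

```python
def load_embedder(model_name: str = "all-MiniLM-L6-v2"):
    """Return a sentence-transformers model, or None so callers can
    silently skip semantic dedup when the ML extras are not installed."""
    try:
        # heavy import: pulls in PyTorch, so defer it until actually needed
        from sentence_transformers import SentenceTransformer
    except ImportError:
        return None
    return SentenceTransformer(model_name)
```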
What is next
Context Optimize is available now in NadirClaw. Install with pip install nadirclaw, add --optimize safe to your serve command, and check your token savings in the dashboard. If you want to see what it would do before enabling it in production, the nadirclaw optimize CLI command lets you dry-run any payload file and inspect the results.
The Optimize feature page has the full benchmark table, before/after examples, and projected monthly savings at different request volumes. The source code is open and MIT-licensed.