Context Optimize
Routes each request to the right model, then trims the payload before it hits your bill. Two modes: safe (lossless) and aggressive (adds diff-preserving semantic dedup). Off by default.
Before and after
Same data. Same meaning. Fewer tokens.
// Tool schema sent every turn (turn 4 of 8)
{
"name": "web_search",
"parameters": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "Search query"
},
"num_results": {
"type": "integer",
"default": 5
},
"site_filter": {
"type": "string",
"description": "Restrict to domain"
}
},
"required": [
"query"
]
}
}
// First turn: full schema preserved
{"name":"web_search","parameters":{"type":"object","properties":{"query":{"type":"string","description":"Search query"},"num_results":{"type":"integer","default":5},"site_filter":{"type":"string","description":"Restrict to domain"}},"required":["query"]}}

// Turns 2-8: replaced with reference
[see tool "web_search" schema above]
The LLM sees identical semantic content. The tool name, all parameters, types, and descriptions are preserved in the first occurrence. Subsequent turns reference it.
Benchmarked on Claude Opus 4.6
Real payloads. Measured savings. $15/1M input tokens.
| Payload | Transforms | Before | After | Saved | % | Saved / 1K req |
|---|---|---|---|---|---|---|
| Agentic coding assistant | tool_schema_dedup, json_minify, whitespace_normalize | 3,657 | 1,573 | 2,084 | 57% | $31.26 |
| RAG pipeline (6 chunks) | json_minify | 544 | 386 | 158 | 29% | $2.37 |
| API response analysis (nested JSON) | json_minify | 1,634 | 616 | 1,018 | 62% | $15.27 |
| Long debug session (50 turns) | json_minify, chat_history_trim | 3,856 | 1,414 | 2,442 | 63% | $36.63 |
| OpenAPI spec context (5 endpoints) | json_minify | 2,649 | 762 | 1,887 | 71% | $28.30 |
Input tokens only. Output tokens are unaffected. Savings scale linearly with request volume.
Accuracy is not harmed
Every safe-mode transform is deterministic and lossless. The LLM receives identical semantic content.
JSON roundtrips exactly
Pretty-printed JSON is parsed and re-serialized compact. json.loads(before) == json.loads(after) always holds. No keys dropped, no values changed, no type coercion.
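A minimal sketch of this invariant (the helper name `minify_json` is illustrative, not the project's API):

```python
import json

def minify_json(text: str) -> str:
    # Parse pretty-printed JSON, then re-serialize with no extra
    # whitespace. Parsing first is what makes the transform lossless.
    obj = json.loads(text)
    return json.dumps(obj, separators=(",", ":"), ensure_ascii=False)

before = '{\n  "query": "cats",\n  "num_results": 5\n}'
after = minify_json(before)

# The roundtrip invariant the docs promise: identical parsed values.
assert json.loads(before) == json.loads(after)
```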
Code blocks untouched
Content inside fenced code blocks (```) is never modified. Indentation, whitespace, and formatting inside code are preserved exactly as written.
Tool schemas preserved
Deduplication keeps the full schema on first occurrence. Later turns get a named reference. The LLM sees the complete schema definition once and knows where to find it.
Recent context kept
Chat history trimming preserves the system prompt, the first turn (for task context), and the last N turns. A placeholder notes how many turns were trimmed.
Unicode and emoji safe
All transforms use ensure_ascii=False. CJK characters, emoji, RTL text, and special Unicode are preserved byte-for-byte through the optimization pipeline.
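A one-line illustration of the documented `ensure_ascii=False` behavior:

```python
import json

# With ensure_ascii=False, CJK characters and emoji stay as-is
# instead of being escaped to \uXXXX sequences.
compact = json.dumps({"msg": "こんにちは 👋"}, ensure_ascii=False,
                     separators=(",", ":"))
```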
URLs never altered
URLs, file paths, and query strings pass through unchanged. Whitespace normalization only affects runs of spaces outside code blocks, never inside structured strings.
Refinements survive dedup
Aggressive mode uses word-level diffing (difflib.SequenceMatcher) to extract unique phrases before compacting. "Return values, not indices" is never lost even when the rest of the message is near-identical.
No negative savings
Every transform checks that its output is smaller than its input. If a replacement would be larger (e.g., diff overhead exceeds savings), the original message is kept untouched.
Accurate token counting
Token estimates use tiktoken (cl100k_base BPE encoding, same as GPT-4/Claude). Falls back to len//4 heuristic only if tiktoken is unavailable.
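A sketch of the documented fallback logic (the function name is an assumption, not the project's API):

```python
def count_tokens(text: str) -> int:
    # Prefer tiktoken's cl100k_base BPE; fall back to the len//4
    # heuristic when tiktoken is unavailable (or its data files
    # cannot be loaded).
    try:
        import tiktoken
        enc = tiktoken.get_encoding("cl100k_base")
        return len(enc.encode(text))
    except Exception:
        return len(text) // 4
```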
Why lossless matters: Lossy compression (semantic summarization) risks changing meaning. An LLM might answer differently if a nuance is lost. Safe mode avoids this entirely — it only removes formatting redundancy that carries zero semantic weight. Aggressive mode goes further with semantic dedup, but preserves every unique phrase via word-level diff extraction. Both are backed by 60 automated tests covering accuracy, edge cases, and roundtrip integrity.
Projected monthly savings
Claude Opus 4.6 at $15/1M input tokens. Based on a token-weighted average reduction of 61.5% across the benchmarks above.
These are input-token savings only. Combined with routing (40-70% via cheaper models), total cost reduction is higher.
Six transforms across two modes
Five deterministic transforms run in safe mode; aggressive mode adds one embedding-based transform.
JSON minification (safe)
Finds JSON objects and arrays in message content using raw_decode. Re-serializes with no whitespace. Skips content inside fenced code blocks. Only replaces when the compact form is shorter.
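A simplified sketch of a `raw_decode`-based scan (illustrative only; the real transform also skips fenced code blocks):

```python
import json

def minify_embedded_json(text: str) -> str:
    # Walk the text; wherever a JSON object or array starts, try to
    # decode it in place and substitute the compact form, but only
    # when the compact form is actually shorter.
    decoder = json.JSONDecoder()
    out, i = [], 0
    while i < len(text):
        if text[i] in "{[":
            try:
                obj, end = decoder.raw_decode(text, i)
            except ValueError:
                out.append(text[i])
                i += 1
                continue
            compact = json.dumps(obj, separators=(",", ":"),
                                 ensure_ascii=False)
            original = text[i:end]
            out.append(compact if len(compact) < len(original) else original)
            i = end
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```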
Tool schema deduplication (safe)
Detects identical tool/function schemas across messages (by canonical JSON comparison). Keeps the first occurrence, replaces subsequent copies with [see tool "name" schema above].
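A sketch of canonical-JSON deduplication, assuming a hypothetical message shape where each message may carry a `tool_schema` dict (the real wire format may differ):

```python
import json

def dedup_tool_schemas(messages: list[dict]) -> list[dict]:
    # Canonicalize each schema (sorted keys, compact separators) so
    # identical schemas compare equal regardless of key order or
    # formatting. First occurrence is kept; later copies become a
    # named reference.
    seen: dict[str, str] = {}  # canonical JSON -> tool name
    result = []
    for msg in messages:
        schema = msg.get("tool_schema")
        if schema is not None:
            canon = json.dumps(schema, sort_keys=True,
                               separators=(",", ":"))
            if canon in seen:
                msg = {**msg, "tool_schema": None,
                       "content": f'[see tool "{seen[canon]}" schema above]'}
            else:
                seen[canon] = schema["name"]
        result.append(msg)
    return result
```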
System prompt deduplication (safe)
If the system prompt text appears verbatim in a later user message (common in some frameworks), the duplicate is removed from that later message. The system message itself is never modified.
Whitespace normalization (safe)
Collapses 3+ consecutive blank lines to 2. Collapses runs of multiple spaces to one. Skips lines inside fenced code blocks to preserve formatting where it matters.
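A sketch of the fence-aware normalization (illustrative; the real implementation may differ in detail):

```python
import re

def normalize_whitespace(text: str) -> str:
    # Collapse runs of 3+ blank lines to 2 and runs of spaces to one,
    # but leave everything inside fenced code blocks untouched.
    out, in_fence, blank_run = [], False, 0
    for line in text.split("\n"):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            out.append(line)
            blank_run = 0
            continue
        if in_fence:
            out.append(line)
            continue
        if line.strip() == "":
            blank_run += 1
            if blank_run <= 2:
                out.append("")
            continue
        blank_run = 0
        out.append(re.sub(r" {2,}", " ", line))
    return "\n".join(out)
```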
Chat history trimming (safe)
For conversations exceeding max_turns (default: 40), keeps the system prompt, the first user/assistant turn, and the last N turns. Inserts a placeholder noting how many turns were trimmed. Configurable via NADIRCLAW_OPTIMIZE_MAX_TURNS.
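A sketch of the documented trimming behavior; the `keep_last` parameter and the placeholder wording are assumptions:

```python
def trim_history(messages: list[dict], max_turns: int = 40,
                 keep_last: int = 10) -> list[dict]:
    # Keep the system prompt, the first user/assistant turn, and the
    # last N turns; insert a placeholder noting how many were dropped.
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    if len(turns) <= max_turns:
        return messages
    head, tail = turns[:2], turns[-keep_last:]
    trimmed = len(turns) - len(head) - len(tail)
    placeholder = {"role": "user",
                   "content": f"[{trimmed} earlier turns trimmed]"}
    return system + head + [placeholder] + tail
```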
Semantic deduplication (aggressive)
Uses all-MiniLM-L6-v2 sentence embeddings to find near-duplicate messages (cosine similarity ≥ 0.85). Replaces the later message with a compact reference plus any unique differences extracted via word-level diffing. System messages and short messages (<60 chars) are never touched. Only fires when the replacement is actually smaller than the original.
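The similarity gate can be sketched with plain cosine similarity; in the real pipeline the vectors would come from all-MiniLM-L6-v2 via sentence embeddings, but the encoder is left abstract here:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_near_duplicates(embeddings: list[np.ndarray],
                         threshold: float = 0.85) -> list[tuple[int, int]]:
    # For each later message, find the first earlier message whose
    # cosine similarity meets the threshold; that earlier message
    # becomes the reference target.
    pairs = []
    for j in range(len(embeddings)):
        for i in range(j):
            if cosine_sim(embeddings[i], embeddings[j]) >= threshold:
                pairs.append((i, j))
                break
    return pairs
```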
Aggressive mode preserves what matters
Near-duplicate messages get compacted, but unique details are extracted and kept.
// Turn 1
Write a Python function that takes a list of integers and returns the two numbers that add up to a target sum. Use a hash map for O(n) time complexity. Handle edge cases like empty lists and duplicates. Return the indices of the two numbers.

// Turn 3 (near-duplicate, different ending)
Write a Python function that takes a list of integers and returns the two numbers that add up to a target sum. Use a hash map for O(n) time complexity. Handle edge cases like empty lists and duplicates. Return the actual values, not indices.
// Turn 1 — preserved in full
Write a Python function that takes a list of integers and returns the two numbers that add up to a target sum. Use a hash map for O(n) time complexity. Handle edge cases like empty lists and duplicates. Return the indices of the two numbers.

// Turn 3 — compacted with diff preserved
[similar to earlier message: "Write a Python function that takes a list of integers and re..."]
Key differences: actual values, not indices.
The diff is extracted using difflib.SequenceMatcher at the word level. Only inserted or replaced words are kept. If the replacement would be larger than the original, the dedup is skipped entirely.
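The word-level extraction can be sketched directly with `difflib.SequenceMatcher` (helper name is illustrative):

```python
import difflib

def extract_unique_words(earlier: str, later: str) -> str:
    # Diff at word granularity and keep only the words that were
    # inserted or replaced in the later message.
    a, b = earlier.split(), later.split()
    matcher = difflib.SequenceMatcher(a=a, b=b)
    kept = []
    for op, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if op in ("insert", "replace"):
            kept.extend(b[j1:j2])
    return " ".join(kept)
```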
Enable in one flag
Off by default. Zero overhead when disabled.
# On the server
nadirclaw serve --optimize safe

# Or via environment variable
NADIRCLAW_OPTIMIZE=safe nadirclaw serve

# Per-request override (in the JSON body)
{"model": "auto", "optimize": "safe", "messages": [...]}

# Dry-run on a file (no server needed)
nadirclaw optimize payload.json --mode safe --format json
Route smarter. Send less. Pay less.
NadirClaw picks the right model and trims the payload before it hits your bill.