Most of what your agent does does not need a frontier model. NVIDIA's research team wrote the paper that says so. In June 2025, eight NVIDIA researchers led by Peter Belcak published a position paper with an unambiguous title: "Small Language Models are the Future of Agentic AI." The argument is not that small models are catching up to frontier models in general capability. It is narrower and more useful than that: most individual steps inside an agent are repetitive, scoped, and non-conversational, and a small model fine-tuned on that narrow slice of work matches a frontier model's accuracy at a fraction of the cost. The paper backs this with a number that should reorder every agent architecture diagram drawn in the last year: serving a 7-billion-parameter SLM is 10 to 30x cheaper, in latency, energy, and FLOPs, than serving a 70 to 175-billion-parameter LLM. Not 10 to 30% cheaper. Ten to thirty times. This post covers what the paper actually claims, the case-study numbers it published, the six-step conversion algorithm it proposes, and where a routing layer has to sit for any of this to work in production without breaking the agent on its hardest steps. The claim, precisely. The paper's central argument has three parts: SLMs are already capable enough. Small models in the 1 to 10 billion parameter range now match or beat much larger predecessor models on narrow tasks. The paper cites Microsoft's Phi-3 small (7B) hitting code generation scores that took 70B models a generation earlier, and DeepMind's RETRO-7.5B matching GPT-3's 175B language modeling performance using 25x fewer parameters. Agent architectures are structurally suited to this. An agent is not one long open-ended conversation. It is a sequence of narrow calls: parse this tool output, extract this field, decide which of these four functions to call, format this response as JSON. Each of those calls, in isolation, is a task a specialized small model can learn cold. The economics compound at agent scale. A chatbot makes one model call per user turn. An agentic loop makes five, twenty, sometimes hundreds of model calls to finish one task. If most of those calls are routed to a model that costs 10 to 30x less to serve, the savings apply to the majority of the loop, not a single request. The paper is careful not to claim SLMs replace frontier models outright. Where a step genuinely requires open-ended reasoning, broad world knowledge, or general conversation, a large model is still the right tool. The recommendation is a heterogeneous agent: small, specialized models handling the routine steps, a frontier model available for the steps that need it, and something deciding which is which on every call. That "something" is a router. NVIDIA's paper stops at the model layer. This is the layer directly below where a routing product like Nadir operates. The evidence: three production systems, audited step by step. The paper's appendix does not stop at theory. The authors took three published, open-source agentic systems and manually classified every LLM call they make by whether a properly specialized SLM could handle it as reliably as the LLM currently deployed. | Agent system | Domain | Share of LLM calls SLM-viable | |---|---|---| | Open Operator | Browser automation, form filling | 40% | | MetaGPT | Multi-agent software engineering | 60% | | Cradle | General computer control (GUI agent) | 70% | The variance across the three systems is the finding worth sitting with. Open Operator's lower number reflects heavier reliance on open-ended page understanding, a task where broad world knowledge still helps. Cradle's 70% reflects a system built almost entirely from bounded, repeated actions: click here, read this field, check this condition. The pattern holds across all three: the more an agent's step distribution skews toward narrow, repeated subtasks, the higher the fraction a small model can absorb without hurting quality. None of these systems were rebuilt around SLMs during the study. This is what a well-audited, unchanged production agent looks like once you separate "this step needs a frontier model" from "this step happens to run on one." The six-step conversion algorithm. The paper proposes a concrete process for turning an existing LLM-only agent into a heterogeneous one. It is not a training recipe. It is an operational sequence: Secure usage data collection. Log every model call the agent makes in production: the prompt, the response, the tool context, and whether the outcome was correct. This is the raw material for every later step. Data curation and filtering. Remove sensitive data, deduplicate near-identical calls, and discard low-signal traces (calls that errored before the model responded, calls with no ground truth to check against). Task clustering. Group the remaining calls by what they are actually doing, not by which prompt template produced them. A "summarize this tool output" cluster and a "decide next action" cluster behave very differently under a small model, even if today they hit the same LLM. SLM selection. For each cluster, pick a candidate small model based on the skill the cluster needs. Extraction and classification clusters usually need less capability than planning or multi-step reasoning clusters. Specialized SLM fine-tuning. Fine-tune the selected model on the curated, clustered data for that specific task. This is where most of the accuracy gap between "a small model" and "a small model that reliably handles your workload" closes. Iteration and refinement. Re-run the agent with the SLM live on a shadow or canary path, measure quality against the LLM baseline, and adjust the model, the cluster boundaries, or the fine-tuning data until the SLM matches baseline on that cluster. Step 1-3: collect and cluster agent traces before touching any model. from collections import defaultdict def cluster_agent_traces(traces: list[dict]) -> dict[str, list[dict]]: """ traces: [{"step_type": "extract_field", "prompt": ..., "response": ..., "tool_context": ..., "correct": True}, ...] step_type should come from your agent's own instrumentation, not a guess after the fact. If you do not tag steps today, start there. """ clusters = defaultdict(list) for trace in traces: if trace.get("correct") is None: continue # no ground truth, not useful for step 5 fine-tuning clusters[trace["step_type"]].append(trace) return clusters def rank_clusters_by_slm_fit(clusters: dict[str, list[dict]]) -> list[tuple[str, float]]: """ Rough heuristic for step 4: clusters with low output diversity and short, bounded outputs are the best first candidates for an SLM. Route the ranking, not the raw call volume, to a human for the first pass. """ scores = [] for step_type, examples in clusters.items(): avg_output_len = sum(len(e["response"]) for e in examples) / len(examples) Shorter, more uniform outputs are typically easier for a small model to match reliably. This is a starting heuristic, not a proof. fit_score = 1 / (1 + avg_output_len / 200) scores.append((step_type, fit_score, len(examples))) return sorted(scores, key=lambda x: (-x[1], -x[2])) Step 6 is where teams underinvest. A small model that matched baseline quality in offline evaluation can still regress once it is live, because production traffic drifts from the traces it was fine-tuned on. Treat the SLM conversion as a rolling process, not a one-time migration. The routing layer this creates. Once even two of an agent's step types run on a fine-tuned SLM instead of a single frontier model, the agent is no longer choosing one model. It is choosing among several, on every step, based on which step type it is currently executing. That decision, made correctly, is the entire job of a router. What a converted agent's per-step routing config looks like in practice. routes: extract_field: model: slm-extract-v3 # fine-tuned 3B, self-hosted fallback: claude-haiku-4-5 confidence_threshold: 0.85 # below this, escalate to fallback summarize_tool_output: model: slm-summarize-v2 # fine-tuned 7B fallback: claude-haiku-4-5 confidence_threshold: 0.80 plan_next_action: model: claude-opus-4-8 # open-ended reasoning, no SLM candidate yet general_conversation: model: claude-opus-4-8 The confidence threshold and fallback are not optional. NVIDIA's paper is explicit that SLMs are the right default for the narrow slice of steps they were fine-tuned on, not a universal replacement. A production router needs the same escalation discipline covered in cascade routing: try the specialized model, check its confidence or a lightweight quality signal, and fall back to a frontier model on the steps where the SLM's training data did not cover what actually showed up. This is also where the cost math gets better than a pure LLM cascade. A hosted frontier model at $0.10/M still costs something on every call. A self-hosted 3B to 7B SLM, once fine-tuned and deployed on your own GPU or CPU inference, has a cost profile closer to your compute bill than your API bill. The 10 to 30x figure in the paper is not a pricing discount, it is a different cost structure entirely. Why the caps that companies are reaching for do not solve this. Gartner's June 2025 forecast landed the same season as NVIDIA's paper: more than 40% of agentic AI projects will be canceled by the end of 2027, largely because of escalating costs, unclear business value, or inadequate risk controls once the project scales past a pilot. The projects that survive that cliff are, disproportionately, the ones where the cost per completed task was solved at the architecture level, not patched after the bill arrived. A hard spending cap on a fully-LLM agent does not change which model handles the "extract this field" step. It changes how many times the agent is allowed to run before it gets throttled. A converted, heterogeneous agent changes the unit cost of each of those steps. One approach buys you fewer runs of an expensive process. The other buys you a cheaper process. Where to start if you have not audited your agent's steps. You do not need NVIDIA's full six-step pipeline to get useful signal in a week: Tag every step type your agent executes, if you are not already logging structured traces of what each call is actually doing, that is the prerequisite for everything else. Cluster by step type, not by prompt template. Two prompts that look different can be the same task; two calls using the same prompt template can be doing genuinely different work. Pick your two or three highest-volume, most bounded clusters first. Field extraction, classification, and output formatting are usually the easiest wins and the most common across agent codebases. Test a small open-source model (Llama 3.2 1B/3B, Qwen2.5 1.5B/3B, or a fine-tuned Phi-3 small) against your current frontier model on that cluster's held-out traces before touching production traffic. Keep a confidence-gated fallback to the frontier model on every converted step. The savings come from the steps the SLM handles correctly, not from removing the option to escalate. Where Nadir fits. Nadir already runs the escalation half of this: classify a request, route it to the cheapest model that can handle it, fall back to a stronger model when the quality floor is not met, all under 10 milliseconds of overhead. Adding a fine-tuned or open-source SLM as a route target is the same integration as adding any other model, since Nadir is OpenAI compatible: point the route at your self-hosted endpoint, set the confidence threshold, and the same cascade logic that escalates from Haiku to Opus escalates from your SLM to a frontier model on the steps it was not trained for. Nadir does not fine-tune the SLM for you, that part of NVIDIA's algorithm is still your team's work. What it removes is building and maintaining the routing and fallback layer that makes deploying a small, specialized model in production safe instead of a gamble on every step type you have not audited yet. Conclusion. NVIDIA's paper is a research position, not a product pitch, and it reads like one: careful about where SLMs do not fit, specific about the fraction of calls that do, and honest that the conversion work is real engineering, not a config flag. The headline number, 10 to 30x cheaper to serve, is the kind of gap that changes an architecture decision on its own. The case-study numbers, 40 to 70% of calls in real agent systems, are the kind that changes a roadmap. The teams that act on this first will not be the ones that swap every model call for a small one. They will be the ones that audit their agent's steps, find the 40 to 70% that were always narrow and repetitive, and put a router in front of the result so the other 30 to 60% still gets the frontier model it actually needs. Sources: Belcak et al., "Small Language Models are the Future of Agentic AI," arXiv:2506.02153, NVIDIA Research, June 2025. NVIDIA Research, SLM Agents project page. Gartner, "Gartner Predicts Over 40% of Agentic AI Projects Will Be Canceled by End of 2027," June 25, 2025. Anthropic, Claude Opus 4.8 and Haiku 4.5 pricing as of June 2026.