The industry just spent billions proving something that won't show up on your invoice. On July 1, 2026, Together AI closed an $800 million Series C at an $8.3 billion valuation, with open-source model inference reportedly crossing a $1 billion industry revenue run rate as growth in closed-model API spend stalls. Source: Tech Times, "Together AI Raises $800M: Open-Source Inference Breaks $1B as Closed Models Stall," July 3, 2026.

Three weeks earlier, Groq, fresh off licensing its Language Processing Unit chip architecture to NVIDIA for a reported $20 billion in December 2025, raised $650 million to rebuild itself as an inference cloud. Source: TechCrunch, "AI chipmaker Groq confirms $650M raise, re-staffs after Nvidia's $20B not-acqui-hire deal," June 22, 2026; Source: CNBC, "Nvidia buying AI chip startup Groq's assets for about $20 billion," December 24, 2025.

And as of late May, Fireworks AI, a $4 billion company in October 2025, was reportedly in talks to raise at $15 billion, nearly quadrupling in seven months. Source: Bloomberg, "Fireworks AI in Talks for Funding at $15 Billion Valuation," May 27, 2026.

Three of the largest rounds of the year, three companies selling the exact same output token, just delivered faster. Much of the engineering underneath is the same family of techniques: speculative decoding. It is real, it is mathematically lossless, and on a hosted API it is almost entirely invisible in what you're billed. Here's the mechanism, where the savings actually land, and what does move your bill.

What speculative decoding actually does

The core idea predates the current funding wave by three years: a small, fast "draft" model proposes several tokens ahead, and the large "target" model checks all of them in a single forward pass instead of one token at a time. Accepted draft tokens are kept for free; the first rejected one is resampled from a corrected (residual) distribution, and drafting restarts from that position.
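The accept-or-resample rule described above can be sketched on a toy vocabulary. This is a minimal illustration, not any provider's implementation: the function names and the example distributions are assumptions made for this sketch, and a real system applies the rule to several drafted positions per verification pass rather than one. The point it demonstrates is that tokens produced this way follow the target distribution, whatever the draft distribution is.

```python
import random


def accept_or_resample(p_target, q_draft, drafted_token, rng):
    """Apply the speculative sampling rule to one drafted token.

    p_target and q_draft are probability lists over the same toy vocabulary.
    Returns (token, accepted).
    """
    p = p_target[drafted_token]
    q = q_draft[drafted_token]  # > 0, because the token was sampled from q_draft
    # Accept the drafted token with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return drafted_token, True
    # On rejection, resample from the residual max(0, p - q).
    # rng.choices normalizes the weights for us.
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    token = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return token, False


def sample_via_speculation(p_target, q_draft, rng):
    """Draft one token from q_draft, then verify it against p_target."""
    drafted = rng.choices(range(len(q_draft)), weights=q_draft, k=1)[0]
    token, _ = accept_or_resample(p_target, q_draft, drafted, rng)
    return token


def empirical_distribution(p_target, q_draft, n, seed=0):
    """Frequency of each token over n draft-then-verify samples."""
    rng = random.Random(seed)
    counts = [0] * len(p_target)
    for _ in range(n):
        counts[sample_via_speculation(p_target, q_draft, rng)] += 1
    return [c / n for c in counts]


if __name__ == "__main__":
    # Illustrative distributions only; the draft deliberately disagrees with the target.
    target = [0.6, 0.3, 0.1]
    draft = [0.3, 0.5, 0.2]
    print("target:   ", target)
    print("empirical:", empirical_distribution(target, draft, n=100_000))
```

The empirical frequencies track the target distribution even though every candidate came from a mismatched draft, which is the sense in which the technique is lossless.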
Source: Leviathan, Kalman, Matias, "Fast Inference from Transformers via Speculative Decoding," Google Research, arXiv:2211.17192, ICML 2023.

The output distribution is mathematically identical to running the target model alone, token by token; this is not a quality-for-speed trade. It's closer to a hardware-utilization bug fix: autoregressive decoding is memory-bandwidth-bound, not compute-bound, so most of a GPU's math throughput sits idle while the model's weights are streamed from memory for each new token. Verifying several draft tokens in one pass uses that idle capacity instead of wasting it.

Three production variants dominate 2026 deployments:

- Medusa bolts extra decoding heads directly onto the target model, so it drafts its own candidate continuations with no separate model to serve. Source: Cai et al., "Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads," arXiv:2401.10774, 2024.
- EAGLE-3 predicts tokens directly instead of intermediate features and fuses hidden states from early, middle, and late layers of the target model. It's the first version of EAGLE to show a genuine scaling law (more draft-head training data yields steadily larger speedups) and reports 2.3x on 8B-class models climbing to 4-6x on 70B-class ones. Source: Li et al., "EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test," arXiv:2503.01840, 2025.
- Multi-Token Prediction (MTP) skips the separate draft model entirely: DeepSeek trained V3 from scratch to predict more than one token per step, reporting an 85-90% acceptance rate on the second predicted token alone. Source: DeepSeek-AI, "DeepSeek-V3 Technical Report," arXiv:2412.19437, 2024.

Together AI's ATLAS, live on its platform since October 2025, adds a further layer: two speculators running simultaneously, arbitrated by a confidence-aware controller. One is a heavyweight speculator trained on broad traffic for a reliable floor; the other is a lightweight one that keeps retraining on your live traffic pattern in real time.
Together's own benchmark: DeepSeek-V3.1 goes from 105 tokens/second on an FP8 baseline to about 500 (roughly 4x), and the system runs 2.65x faster than standard decoding on average, ahead of dedicated inference silicon in their internal comparison. Source: Together AI, "AdapTive-LeArning Speculator System (ATLAS)," October 10, 2025; Source: VentureBeat, "Together AI's ATLAS adaptive speculator delivers 400% inference speedup by learning from workloads in real-time".

Speculative decoding cuts forward passes roughly 3-4x. It does not cut how many output tokens you're billed for: standard autoregressive decoding runs 1.00 large-model forward pass per output token, speculative decoding (EAGLE-3 / ATLAS-class adaptive speculator) runs about 0.30.

The catch the funding announcements don't spell out

Every number above describes forward passes, wall-clock latency, and GPU throughput, not output tokens. A response that's 800 tokens long is 800 tokens long whether the provider generated it with 800 sequential forward passes of the large model (standard decoding) or roughly 200-350 passes because most draft tokens were accepted (speculative decoding). What a hosted API bills you is the 800 output tokens, at whatever the published $/MTok rate is. The forward-pass count is an internal serving-efficiency number that lives entirely on the provider's side of the API boundary; it is not a parameter you set, and it does not appear anywhere on your invoice.

That's exactly why the technique commands billion-dollar funding rounds without ever becoming a line item a customer requests: it's the provider's unit-economics problem to solve. It shows up to you, if it shows up at all, in one of two indirect ways: a faster response at the same price, or a lower price if competition forces the provider to pass the GPU-hour savings through.

Where the savings actually land

If you call a hosted API, you pay the rate card, full stop.
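A short sketch of that arithmetic, under stated assumptions: the expected-tokens-per-pass formula is the one given by Leviathan et al. for a draft of `gamma` tokens when each drafted token is accepted independently with probability `alpha`, while the specific `alpha`, `gamma`, and the $/MTok rate below are illustrative placeholders, not measurements or any provider's price. The function names are hypothetical.

```python
import math


def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model forward pass.

    Assumes each of the gamma drafted tokens is accepted independently
    with probability alpha: (1 - alpha**(gamma + 1)) / (1 - alpha).
    alpha = 0 reduces to standard decoding (one token per pass).
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def target_passes(output_tokens: int, alpha: float, gamma: int) -> int:
    """Approximate large-model forward passes needed for a response."""
    return math.ceil(output_tokens / expected_tokens_per_pass(alpha, gamma))


def hosted_bill_usd(output_tokens: int, usd_per_mtok: float) -> float:
    """Hosted-API charge: output tokens times the rate card, nothing else."""
    return output_tokens * usd_per_mtok / 1_000_000


if __name__ == "__main__":
    tokens = 800
    rate = 0.90  # hypothetical $/MTok, for illustration only

    for label, alpha in [("standard decoding", 0.0), ("speculative decoding", 0.8)]:
        passes = target_passes(tokens, alpha, gamma=5)
        bill = hosted_bill_usd(tokens, rate)
        print(f"{label}: {passes} target passes, bill ${bill:.6f}")
```

The forward-pass count changes with the acceptance rate; the bill function never sees it, because the only inputs on the customer's side of the API are the token count and the rate.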
Whether the provider behind that endpoint used speculative decoding, MTP, or a warehouse of GPUs running at a loss doesn't change what a given response costs you. What it changes is the provider's floor: a team running EAGLE-3 or ATLAS can profitably quote a lower rate than one running naive decoding on the same hardware, which is a large part of why open-weight model hosting turned into an outright price war in 2026. Fireworks reportedly prices DeepSeek V4 Pro under Together on a per-token basis, while Groq is cheaper than Fireworks on the majority of shared models. Source: Morph, "Fireworks vs Together AI Pricing (2026): Per-Token and GPU Rates Compared". Speculative decoding is a meaningful part of what lets these providers sustain that price war without losing money on every call, and it's part of the same dynamic behind DeepSeek's own aggressive pricing moves.

If you self-host an open-weight model (Llama, DeepSeek, Qwen, on your own or rented GPUs via vLLM, SGLang, or TensorRT-LLM), speculative decoding is a direct cost lever, because you're the one paying the GPU-hour bill, not a per-token rate card. Cutting large-model forward passes 3-4x on the same hardware lets the same GPU fleet serve up to 3-4x more requests per hour, or the same request volume run on a proportionally smaller fleet. That is the best case rather than the typical one: the draft model adds its own compute, each verification pass is heavier than a single-token pass, and decoding becomes less memory-bound at large batch sizes, so the realized saving is smaller. Industry estimates put the resulting inference-cost reduction from EAGLE-3, Medusa, and MTP-class techniques at roughly 47% against a naive decoding baseline. Source: SyncSoft.AI, "Speculative Decoding 2026: EAGLE-3, Medusa, DeepSeek MTP".

Self-hosting an open-weight model: speculative decoding cuts your GPU bill directly, a roughly 48% drop on the same hardware.
Naive vLLM decoding costs $0.42 per 1M output tokens; vLLM with speculative decoding (EAGLE-3 draft model) costs $0.22.

Turning it on is a serving-config change, not a training run. Draft models for the popular open-weight base models already ship on the Hugging Face Hub:

```bash
# Confirm the draft checkpoint matches your target model before deploying.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --speculative-config '{
    "method": "eagle3",
    "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-70B",
    "num_speculative_tokens": 5
  }'
```

Source: vLLM documentation, "Speculative Decoding"; Source: vLLM documentation, "EAGLE Draft Models".

A few lines of server config. The output distribution matches running the target model alone, up to floating-point numerics, so there is no accuracy given up for the speed. Individual sampled responses are not guaranteed to be byte-for-byte identical.

What actually moves your bill on a hosted API

| Lever | Who controls it | Shows up on your bill? |
|---|---|---|
| Speculative decoding, MTP, custom kernels | The provider, server-side | No. It's serving efficiency, invisible to the customer |
| Provider's published $/MTok rate | The provider, but shaped by the price war above | Directly: this is the number you're actually billed |
| Which model handles a given request | You | Directly: the single largest lever you control |
| Which provider serves a given model | You | Directly: same model, different rate card, different day |

Because the decoding trick is invisible to you as a customer, the lever you actually control is which model, and which provider, handles each request. The speed race above is making the correct answer change faster than most routing setups can track. A rate card that's cheapest today can be undercut within weeks by a competitor whose adaptive speculator just got faster on traffic that looks like yours; that's the literal pitch behind ATLAS's "learns from live traffic" design.
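As a rough illustration of treating the provider choice as data rather than a constant, the sketch below picks the lowest output rate for a model from a rate table. Everything in it is hypothetical: the `Quote` type, the `cheapest` helper, the provider names, and the prices are placeholders invented for this example, and a real router would also weigh latency, availability, and input-token rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quote:
    """One provider's output-token price for one model (hypothetical data)."""
    provider: str
    model: str
    usd_per_mtok_out: float


def cheapest(quotes, model):
    """Return the quote with the lowest output rate for the given model."""
    candidates = [q for q in quotes if q.model == model]
    if not candidates:
        raise LookupError(f"no provider quotes for model {model!r}")
    return min(candidates, key=lambda q: q.usd_per_mtok_out)


if __name__ == "__main__":
    # Placeholder rate table; refresh it from published rate cards on a schedule.
    quotes = [
        Quote("provider-a", "open-model-70b", 0.90),
        Quote("provider-b", "open-model-70b", 0.80),
        Quote("provider-c", "open-model-70b", 0.85),
    ]
    print(cheapest(quotes, "open-model-70b").provider)

    # A price cut changes the answer without any code change.
    quotes.append(Quote("provider-a", "open-model-70b", 0.70))
    print(cheapest(quotes, "open-model-70b").provider)
```

The design choice worth noting is that the rate table is an input that can be refreshed, so a price cut changes the routing decision without touching the integration code.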
Hardcoding one vendor's endpoint into your codebase locks in whichever price war happened to be running when your team wrote the integration. It is the same static-default mistake that shows up whenever teams pick one model for every job.

This is the layer Nadir sits at: point traffic at model=auto and every request gets priced and routed against the current state of that competitive market (DeepSeek through Fireworks this week, Llama through Groq the next) instead of whichever integration a team shipped before the last price cut. You get the benefit of the infrastructure arms race described above without re-benchmarking three inference providers on a calendar reminder.

Checklist before you assume "faster" means "cheaper"

- If you call a hosted API: speculative decoding does not lower your bill directly. Treat it as a signal about a provider's serving efficiency and headroom to compete on price, not a savings mechanism you configure.
- If you self-host an open-weight model: turning on speculative decoding (EAGLE-3, Medusa, or whatever your inference server ships by default) is one of the highest-ROI, lowest-effort changes available. It cuts your GPU-hour bill directly and the output distribution is unchanged.
- Don't conflate speed with cost. A faster response at an unchanged $/MTok rate is a latency win, not a cost win. Output tokens already cost more than input on most rate cards, often around 5x, so track the rate and the latency as two separate numbers, not one.
- Re-benchmark providers on a schedule, not once. The per-model pricing gap between Together, Fireworks, and Groq has moved meaningfully within a single year, driven substantially by decoding efficiency, not model licensing changes.
- Route, don't pin. A single hardcoded provider integration can't capture a price war it wasn't built to watch.

Conclusion
Speculative decoding is real, it's mathematically lossless, and the money chasing it (Together's $800M, Groq's $650M, NVIDIA's reported $20B license, Fireworks' reported run at a $15B valuation) is not hype. It's several of the best-capitalized companies in AI infrastructure racing to own the same 3-4x speedup. None of that changes what you owe for an output token on a hosted API today. It changes how fast that token arrives, and how much room your provider has to compete on price tomorrow. If your workload runs on someone else's endpoint, the decoding trick underneath it was never yours to configure, but which provider you point that workload at, this week, still is.

Data in charts is illustrative, modeled from the public research, documentation, and reporting cited throughout, not derived from proprietary production traces.

Sources:

- Tech Times, "Together AI Raises $800M," July 3, 2026.
- TechCrunch, "AI chipmaker Groq confirms $650M raise," June 22, 2026.
- CNBC, "Nvidia buying AI chip startup Groq's assets for about $20 billion," December 24, 2025.
- Bloomberg, "Fireworks AI in Talks for Funding at $15 Billion Valuation," May 27, 2026.
- Leviathan, Kalman, Matias, "Fast Inference from Transformers via Speculative Decoding," arXiv:2211.17192.
- Cai et al., "Medusa," arXiv:2401.10774.
- Li et al., "EAGLE-3," arXiv:2503.01840.
- DeepSeek-AI, "DeepSeek-V3 Technical Report," arXiv:2412.19437.
- Together AI, "ATLAS," October 10, 2025.
- vLLM documentation, "Speculative Decoding".