The bottleneck nobody talks about

A router is two systems wearing one name. A decision rule picks the cheapest model whose predicted quality clears a floor, and fails up to a frontier model when nothing clears. Underneath it sits the part that actually sets your bill: a fit predictor, f(prompt, model), that estimates whether this model will handle this prompt. The decision rule is a solved problem. The fit predictor is where routers win or lose money, and almost every one of them guesses.

Ours used to guess too. Our first-generation fit engine, which we call DEADRECKON, scored model fit the way the rest of the field does, with a dot product. It worked, it shipped, and it left cost and quality on the table on exactly the prompts that matter most: the hard, multi-requirement ones.

So we rebuilt the fit predictor from a different branch of statistics. The result is TUMBLER. It is more accurate than DEADRECKON on every model we serve, it scores models from any provider with zero setup, and it is rolling out inside Nadir's plan-space router now, behind a live shadow evaluation on real traffic before it takes the wheel.

Every scorer in the field shares one silent assumption

Survey the fit predictors in the routing literature and a pattern appears. RouteLLM factorizes a preference matrix: prompt vector times model vector. P2L fits per-prompt Bradley-Terry coefficients: a dot product against a model embedding. EmbedLLM, GraphRouter, IRT-Router, UniRoute's per-cluster error vectors, and DEADRECKON, our own first-generation heads, all reduce to the same move: project the prompt into a latent space, give each model a vector in that space, and score fit as an inner product or a sum.

Inner products are compensatory. Predicted strength on one axis offsets predicted weakness on another, always, by construction. That is the correct inductive bias for preference ("which answer do humans like more") and the wrong one for competence.

Real tasks are conjunctive.
A prompt that requires writing SQL against a schema buried in 40,000 tokens of context needs the SQL skill AND the long-context skill. A model that is superb at SQL and mediocre at long-context integration fails the task, and no amount of SQL excellence buys that back.

Run the numbers on a compensatory scorer and the failure is visible. Say a prompt demands two skills at strength 0.9 each, and a candidate model has a 5 percent failure rate on the first skill but a 60 percent failure rate on the second. A dot product blends those into a comfortable middling score: the strong skill compensates, the model clears the floor, and the route fails in production. This is exactly where catastrophic down-routes live: multi-requirement prompts, agentic payloads, format-strict extraction over long inputs.

Psychometrics solved this problem decades ago, because human test items have the same structure. The DINA family of cognitive diagnosis models (the acronym stands for "deterministic inputs, noisy 'and' gate") scores a student against an item as a product over required skills: master every required skill or miss the item, softened by slip and guess parameters. To our knowledge, no LLM router has ever used a non-compensatory model class. TUMBLER is that model class, with the skill assignments learned end to end instead of hand-authored.

The architecture

The name is the mechanism. In a pin-tumbler lock the cylinder turns only when every pin lifts to the shear line. One low pin, no rotation, regardless of the others.

TUMBLER architecture: a shared neural network reads a prompt's 417 features and outputs 32 skill demands. A per-model mastery table (learned, or predicted from model metadata for a model never seen before) gives each model its failure rate per skill. A noisy-AND gate multiplies demand by failure so a single weak skill vetoes the route, then the router sends the prompt to the cheapest model that clears the bar.

TUMBLER has three learned components:

A shared skill-demand extractor.
The prompt's existing feature vector (a 384-dimensional MiniLM sentence embedding concatenated with 33 structural features: code blocks, reasoning markers, token counts, tool schemas) passes through a two-layer network into K = 32 latent skills. The output r(x) is a vector of demand intensities in [0, 1]: how much this prompt requires each skill. A sparsity penalty keeps demands few and sharp. Crucially, this extractor is shared across every model, so it trains on every labeled outcome of every model in the corpus, not one model's slice.

A per-model mastery table. Each model m carries one scalar per skill: phi(m, k), the probability that model m fails skill k when the skill is demanded. It also carries two scalars, slip and guess. A model's entire identity in TUMBLER is K + 2 = 34 numbers.

A noisy-AND gate. Fit is a product, not a sum:

    P(pass | x, m) = Π over k of ( 1 − r_k(x) · phi(m, k) )

    P(correct | x, m) = (1 − slip) · P(pass | x, m) + guess · (1 − P(pass | x, m))

Take the log and the semantics are explicit: log P(pass) is a sum of penalties, one per demanded skill the model is weak on, and each penalty is unbounded below. A demanded skill the model fails often is a veto. A dot product cannot represent a veto; a noisy-AND is made of them.

Everything downstream is unchanged. TUMBLER's calibrated probability feeds the same per-model isotonic calibrators, the same Mondrian conformal floors, the same cheapest-clearing selection with fail-up. It ships as a convex blend with the incumbent scorer, with the blend weight fit out-of-fold per model and a hard rule: if the blend does not beat the incumbent on held-out data, that model keeps its old head. Adoption is measurable, reversible, and per-model.

The whole thing is about 58,000 parameters and runs in 0.09 milliseconds on a single CPU core. Routing latency did not move.
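The gate and the decision rule described above are small enough to check by hand. What follows is a minimal Python sketch under stated assumptions, not Nadir's implementation: the function names, the two-model menu, its costs and mastery values, and the floor are all hypothetical, and the "compensatory" scorer is an assumed stand-in (a demand-weighted average of per-skill success rates) rather than any specific router's formula. It reuses the two-skill example from earlier: demand 0.9 on both skills, failure rates of 0.05 and 0.60. Calibration and conformal floors are left out.

```python
import math

def p_pass(demand, phi):
    """Noisy-AND gate: product over skills of (1 - r_k * phi_k)."""
    out = 1.0
    for r_k, phi_k in zip(demand, phi):
        out *= 1.0 - r_k * phi_k
    return out

def p_correct(demand, phi, slip=0.0, guess=0.0):
    """Soften the gate with slip and guess, as in the second formula."""
    pp = p_pass(demand, phi)
    return (1.0 - slip) * pp + guess * (1.0 - pp)

def pin_penalties(demand, phi):
    """Per-skill log penalties; they sum to log P(pass)."""
    terms = [1.0 - r_k * phi_k for r_k, phi_k in zip(demand, phi)]
    return [math.log(t) if t > 0.0 else float("-inf") for t in terms]

def compensatory_score(demand, phi):
    """Assumed stand-in for a compensatory scorer: demand-weighted mean success."""
    weighted = sum(r_k * (1.0 - phi_k) for r_k, phi_k in zip(demand, phi))
    return weighted / sum(demand)

def route(demand, menu, floor, frontier):
    """Cheapest model whose gated fit clears the floor; otherwise fail up."""
    for name, _cost, phi in sorted(menu, key=lambda entry: entry[1]):
        if p_pass(demand, phi) >= floor:
            return name
    return frontier

# The two-skill example: both skills demanded at 0.9.
demand = [0.9, 0.9]
phi = [0.05, 0.60]  # failure rate per skill for the candidate model

gate = p_pass(demand, phi)
blend = compensatory_score(demand, phi)
penalties = pin_penalties(demand, phi)
worst_pin = min(range(len(penalties)), key=lambda k: penalties[k])

# Hypothetical menu: (name, relative cost, per-skill failure rates).
menu = [
    ("small-model", 1.0, [0.05, 0.60]),
    ("large-model", 10.0, [0.02, 0.05]),
]
choice = route(demand, menu, floor=0.6, frontier="frontier-model")

print(f"noisy-AND fit: {gate:.4f}")
print(f"compensatory stand-in: {blend:.4f}")
print(f"blocking pin index: {worst_pin}")
print(f"routed to: {choice}")
```

With these numbers the gate gives roughly 0.44 while the averaged stand-in gives roughly 0.68, so any floor between the two passes the small model under the compensatory score and blocks it under the gate. The per-pin penalties also identify which skill did the blocking, which is the decomposition the gate makes available for logging.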
Bring your own model: mastery from metadata

The per-model mastery table has one flaw for a multi-tenant gateway: it requires labels per model, and different teams route different menus. Your stack might be Claude plus Qwen plus a fine-tuned Mistral behind vLLM. A fit engine that needs ten thousand labeled outcomes before it can score your model is useless on day one.

So the production variant, TUMBLER-I, does not look the mastery table up. It predicts it. A small hypernetwork maps model metadata to the mastery vector:

    phi(m) = sigmoid( MLP(z_m) + residual_m )

where z_m is derived entirely from observed data: log mean cost per call, provider family, a reasoning-model flag. The residual is an L2-shrunk per-model correction that is exactly zero for a model we have never seen, and can be fit later from a small probe battery. For a known menu the hypernetwork collapses to per-model constants at export time, so serving code never changes.

The honest test of model-agnosticism is leave-one-model-out: hide a model's entire outcome column from training and predict its per-prompt fit from metadata alone. We ran it on a pooled corpus of 451,352 labeled (prompt, model, outcome) cells covering 37 models from 10 provider families: Anthropic, OpenAI, Google, Qwen, DeepSeek, xAI, Meta, Mistral, and others. Across every held-out model, TUMBLER's family-aware metadata beat both a prompt-only baseline and naive cost-only scoring, with zero labels for that model.

The sharpest case is the one that breaks price-based routing: Mixtral, whose price implies one capability tier and whose behavior is another. Cost-only scoring misreads it. TUMBLER's provider-family features recover about 7 points of AUC over cost-only with zero Mixtral labels, and a 64-call probe battery, a few cents of compute for an open model, adds several more.
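The metadata-to-mastery mapping, phi(m) = sigmoid( MLP(z_m) + residual_m ), can be sketched in a few lines of standard-library Python. This is an illustration under loud assumptions, not the production hypernetwork: the provider-family vocabulary, the hidden width, the tanh activation, the absence of bias terms, and the random untrained weights are all hypothetical, so the numbers it produces mean nothing. Only the shape of the computation follows the text: metadata in, K = 32 failure probabilities out, with a residual that defaults to zero for an unseen model.

```python
import math
import random

K = 32        # latent skills, as in the text
HIDDEN = 16   # hypothetical hidden width
FAMILIES = ["anthropic", "openai", "google", "other"]  # hypothetical vocabulary

def encode_metadata(mean_cost_per_call, family, is_reasoning):
    """z_m = [log mean cost, one-hot provider family, reasoning flag]."""
    one_hot = [1.0 if family == f else 0.0 for f in FAMILIES]
    flag = 1.0 if is_reasoning else 0.0
    return [math.log(mean_cost_per_call)] + one_hot + [flag]

def init_mlp(n_in, seed=0):
    """Random, untrained weights; a real system would learn these."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0.0, 0.5) for _ in range(n_in)] for _ in range(HIDDEN)]
    w2 = [[rng.gauss(0.0, 0.5) for _ in range(HIDDEN)] for _ in range(K)]
    return w1, w2

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def mastery(z, mlp, residual=None):
    """phi(m) = sigmoid(MLP(z_m) + residual_m); zero residual for an unseen model."""
    w1, w2 = mlp
    if residual is None:
        residual = [0.0] * K
    hidden = [math.tanh(sum(w * x for w, x in zip(row, z))) for row in w1]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    return [sigmoid(logit + r) for logit, r in zip(logits, residual)]

# Cold start: a never-seen model scored from metadata alone.
z = encode_metadata(mean_cost_per_call=0.002, family="other", is_reasoning=False)
mlp = init_mlp(len(z))
phi_cold = mastery(z, mlp)

# After probing: a hypothetical residual that raises the failure rate on skill 0.
residual = [0.0] * K
residual[0] = 1.0
phi_probed = mastery(z, mlp, residual)

print(len(phi_cold), round(phi_cold[0], 3), round(phi_probed[0], 3))
```

For a fixed menu, `mastery` would only need to run once per model, with the resulting 32 numbers stored as constants, which is one way to read the export-time collapse described above.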
Metadata-predicted mastery is the "different users use different models" problem answered with a mechanism instead of a promise: any provider's model gets a working fit score on day one, and cheap probes sharpen it from there.

One scope note we insist on: leave-one-model-out measures a new model on a known prompt distribution, which is the question a gateway actually faces, since it always has cross-model history on its own traffic. Generalization to entirely unseen prompt families is a stricter standard, reads lower, and we hold ourselves to it.

Where the labels come from

Fit predictors starve without labels, and per-model labels are the scarcest resource in routing. TUMBLER attacks this twice.

First, the pooled corpus itself: because the demand extractor and the hypernetwork are shared, every labeled outcome of every model improves the representation every other model is scored against. The hardest, most label-starved model no longer learns from its own handful of examples. It learns from all 451,352 outcomes in the corpus, and only its 34 mastery numbers need model-specific evidence.

Second, agreement labels. On any prompt where many models answered, the pattern of who was right is itself evidence about the model that did not answer. We fit a conditional model, P(target model correct, given which of the other 36 models answered and which were right), on rows where the target is labeled, then emit soft, down-weighted pseudo-labels where it is not. This is the Dawid-Skene idea from crowdsourcing with TUMBLER as its own annotator-confusion model, and it is fit fold-pure: the conditional never sees a test row.

The effect on the hardest model was the largest single jump in this project: agreement labels lifted its fit score 17 percent under our strict held-out protocol, enough to clear the quality gate we set for it. The model DEADRECKON scored worst is the one TUMBLER improved most.
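The agreement-label step can be illustrated with a toy version. The sketch below is an assumption-heavy stand-in, not the fold-pure pipeline described above: it uses a plain logistic regression as the conditional model (the text does not say which model class is used), three other models instead of 36, hand-made labeled rows, and a hypothetical pseudo-label weight of 0.3. Each other model's result is encoded as +1 (answered and right), -1 (answered and wrong), or 0 (did not answer).

```python
import math

def encode(votes):
    """+1 if the other model was right, -1 if wrong, 0 if it did not answer."""
    return [1.0 if v is True else -1.0 if v is False else 0.0 for v in votes]

def predict(model, x):
    """Logistic probability that the target model is correct."""
    w, b = model
    logit = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-logit))

def fit_logistic(rows, labels, steps=500, lr=0.1):
    """Gradient-ascent logistic regression: an assumed stand-in conditional model."""
    n = len(rows)
    w, b = [0.0] * len(rows[0]), 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, y in zip(rows, labels):
            err = y - predict((w, b), x)
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi + lr * g / n for wi, g in zip(w, grad_w)]
        b += lr * grad_b / n
    return w, b

def pseudo_label(model, votes, weight=0.3):
    """Soft label plus a down-weight for a row where the target is unlabeled."""
    return predict(model, encode(votes)), weight

# Toy rows where the target IS labeled: (other models' results, target correct?).
labeled = [
    ([True, True, True], 1),
    ([True, True, False], 1),
    ([True, None, True], 1),
    ([True, False, True], 1),
    ([False, True, False], 0),
    ([False, False, None], 0),
    ([False, False, False], 0),
    ([None, False, False], 0),
]
rows = [encode(votes) for votes, _ in labeled]
labels = [y for _, y in labeled]
model = fit_logistic(rows, labels)

# Rows where the target is NOT labeled get soft, down-weighted pseudo-labels.
p_easy, w_easy = pseudo_label(model, [True, True, True])
p_hard, w_hard = pseudo_label(model, [False, False, False])
print(round(p_easy, 3), round(p_hard, 3), w_easy)
```

In a real pipeline the fit would happen inside each training fold so the conditional never sees a test row, and the pseudo-labeled rows would enter training with their reduced weight alongside the true labels.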
The results

Same benchmarks, three angles, and the comparison is always TUMBLER against DEADRECKON at identical capacity.

The aggregation itself carries the win.
On RouterBench (401,467 labeled outcomes, 11 models, cross-validated so no dataset family appears in both train and test), TUMBLER scores model fit 2.8 points of AUC better than DEADRECKON, and 1.8 points better than the dot-product scorer the rest of the field uses, with the trunk and the parameter budget held fixed so the only thing that changed is the veto. And at the strict floor of 98 percent of always-frontier quality, TUMBLER was the only shared scorer that could hold the floor at any threshold. The compensatory scorers overpredicted fit and slipped under it everywhere.

It improved every model on our production ladder.
Swapping DEADRECKON for TUMBLER on the deployed Claude ladder, under the strict unseen-family protocol, moved every head up, and moved the hardest, most label-starved model up 17 percent. That is the whole point of the shared design: the models with the least data of their own gain the most.

It is steadier, and it downgrades less.
Head to head against our previous router on held-out traffic, TUMBLER cuts costly model downgrades by about 23 percent at matched cost, and it is four times more stable when the same prompt is reworded: DEADRECKON flips 9 percent of its routing decisions under paraphrase, TUMBLER flips 2.2 percent.

Nadir's router already delivers roughly 60 percent lower cost at 98 percent of always-frontier quality on the RouterBench held-out set; TUMBLER is the fit engine that pushes that frontier out and holds the risk budget while doing it. The quality floor stays a knob you set, not a number we pick for you.

Interpretability came free

A dot product gives you a score. A noisy-AND gives you a reason.
Every routing decision decomposes into which skills the prompt demanded and which pin blocked which model: this prompt demands code generation at 0.83 and long-context integration at 0.61; Haiku fails the long-context pin at 0.35, blocked; Sonnet clears every pin at 0.91, routed.

That decomposition is logged per decision, and the same structure gives per-tenant adaptation a surface: correcting a tenant's routing for their traffic means updating 34 numbers per model, shrunk toward the global prior, from roughly 50 to 200 labeled outcomes. No retraining, no new serving code.

Straight talk

We publish evals the way we wish vendors published theirs, so here is the honest edge of what we know.

TUMBLER's advantage should be largest on hard, multi-skill prompts. On the multiple-choice-heavy academic benchmarks we have, we see it as better ranking and much safer behavior at strict quality floors, but the clean slice-level proof needs production agentic traffic. That is exactly why TUMBLER is rolling out in shadow first: it scores every live request next to the model actually serving it, and real traffic, not a benchmark, decides when it takes over.

Under our strictest unseen-prompt standard the new-model scores read lower than the headline numbers, and we hold ourselves to that standard on purpose.

And the line we will not cross: routed traffic has a nonzero downgrade rate at any useful operating point. Anyone quoting zero is measuring wrong.

Where this ships

TUMBLER ships inside Nadir's plan-space router as the fit scorer behind the conformal floor: same API, same decision rule, sharper head. It runs live in shadow today and takes the wheel per model, only where it beats DEADRECKON on your traffic. The probe-battery cold start and per-tenant mastery adaptation are the pieces that make it work for whatever menu you actually run, closed frontier models next to open weights behind your own vLLM.
If you route meaningful traffic and want the downgrade-rate knob measured on your own mix rather than a benchmark, our design partner program is open. We will run the probe battery on your menu and show you your frontier, pin by pin.