Research Note

A 1B Model Should Route, Not Remember

Tetracta AI Teams · 15 June 2026

A small-model engineering note from our 1-billion-parameter assistant work. It shares a design lesson and honestly-stated results. No architecture, training recipe, dataset, or competitor comparison is disclosed.

There is a persistent temptation, when you build a small language model, to treat it like a small encyclopedia — to cram as much world-knowledge into the weights as they will hold and hope it answers like a big model on a budget. We spent real effort on a 1-billion-parameter assistant, and the most useful thing we learned is that this instinct is backwards. The temptation is understandable: the large frontier models do carry an astonishing amount of world-knowledge in their parameters, and the easiest mental model is to imagine a small model as the same thing, scaled down. But that analogy is exactly where a small-model project goes wrong. A 1B model is not a small encyclopedia. It is something more useful — if you ask it to do the right job.

Small models are bad encyclopedias — and that's fine.

Be honest about what a 1B model is. Its internal, parametric recall of specific facts is weak; pushed on open-ended factual questions, a model this size is not a reliable knowledge store, and no amount of wishful prompting changes that. Pretending otherwise is how you ship a confident, wrong assistant. We measured this on our own model and did not flinch from it: raw, internal knowledge at this scale is not competitive, and we do not sell it as if it were.

It helps to understand why this is structural rather than a deficiency you can prompt your way out of. World-knowledge lives in parameters, and parameters are finite. A 1B model has a fixed, modest budget of weights, and those weights are spent on everything at once — grammar, syntax, reasoning patterns, instruction-following, and facts — all competing for the same capacity. The long tail of human facts is enormous and mostly low-frequency: any single rare fact appears so seldom in training that there is little pressure to store it precisely, and storing it precisely would cost capacity that the model needs for things it sees constantly. The predictable result is that a small model learns the shape of language and reasoning well, but its grip on specific, low-frequency facts is loose and unreliable. It will often produce something fact-shaped and fluent that is simply wrong, and — worse for a product — it has no native sense of which of its recollections are solid and which are confabulated. That last property is the dangerous one. A wrong answer delivered with the same fluent confidence as a right one is a liability, not a feature.

The mistake is treating that as a failure. It isn't — it's a design constraint that points at a better architecture. Once you accept that the parameters are not where the facts should live, the question changes from "how do I make this model know more?" to "where should the knowing happen, and what is the model uniquely good at instead?" That reframing is the whole game.

Route, don't remember.

The useful role for a small model is not knowing; it is orchestrating. A 1B model can be very good at the things that don't require a huge memory: understanding what the user wants, deciding which tool to call, forming a clean query, and composing a grounded answer from what the tools return. Arithmetic goes to a calculator, not the weights. Dates and computations are resolved exactly, by code. Facts come from a live search with the source attached, not from a hazy parametric memory. The model's job is to be the fast, cheap conductor — not the orchestra.

It is worth being concrete about what "routing" actually decomposes into, because the phrase can sound like a slogan until you see the moving parts. In practice the model is doing a sequence of small, bounded jobs, none of which depends on it holding much world-knowledge:

Intent and decomposition. Read the user's request and work out what is actually being asked — and, for a compound request, break it into the sub-questions that each need answering. "What's the population of the city where the next summit is being held, and is that bigger than my hometown?" is three jobs, not one, and the first competence is noticing that.
Tool selection. Decide which capability each sub-question needs — a calculator, a live web search, a date/time computation, a code execution step, or simply a direct answer — and recognise the increasingly common case where the honest move is to use a tool rather than answer from memory. A model that reaches for the weights when it should reach for a tool is the failure mode this whole design exists to prevent.
Query formation. Turn a messy human question into a clean, well-scoped query or a correct snippet of code. A good search query is its own small skill; so is writing arithmetic that a calculator can evaluate exactly instead of approximating in-weights.
Grounded composition. Take what the tools return — search snippets, a computed number, a date — and assemble an answer that stays faithful to those returns and carries the sources with it, rather than smoothing them over with plausible-sounding parametric filler.

None of those four jobs is a memory task. They are language, structure, and judgment tasks — and those are exactly the things a small model can be trained to do well. The contrast is the useful intuition: asking a 1B model to recite a rarely-seen fact from its weights is asking the part of the system that is structurally weakest to carry the load; asking it to recognise that the question needs a lookup, form the lookup, and faithfully report the result plays to the part that is structurally strong. Same model, completely different reliability, because you moved the hard requirement off the weights and onto the tools.

When we rebuilt our assistant around that idea, the honest results were the ones that actually matter for a product. Tool selection landed reliably on the right tool. Answers that are exactness-checkable — math and dates — came out exact, because they were delegated to code rather than guessed in-weights; "what is 3,847 times 219" and "how many days until the end of the quarter" are computed, not recalled, and so they are simply correct rather than confidently approximate. Web-grounded answers carried their sources, so a reader can check the claim against the citation instead of trusting the model. And responses were fast — on the order of a second or two when served from cache, which for an interactive assistant is the difference between something people use and something they abandon. These are qualitative, preliminary observations from our own product work, not benchmark claims; the point is only that none of them depend on the model knowing much. They depend on it routing well.

A worked feel for the difference.

It is easy to nod along with "route, don't remember" in the abstract, so consider how the two postures handle the same question. Ask an ungrounded small model "who won the regional election last week and by how much?" and the parametric instinct produces a fluent, specific, and quite possibly fabricated answer — a plausible name, a plausible margin, delivered with no signal that any of it was invented, and no chance of being right about an event after the model's training. The routing posture treats the same question entirely differently: it recognises a recent-event factual query, forms a search, reads back what the sources actually say, and composes an answer anchored to those sources — or, if the search returns nothing usable, says so instead of filling the gap. The model in the second case "knows" no more than the model in the first. What changed is where the answer is allowed to come from. That is the whole thesis in one example: the competence you want is not a bigger memory, it is the discipline to look things up and the faithfulness to report what was found.

Why this is the right trade for small teams.

Grounding-over-parameters is not just an accuracy story; it is an economics one. A small routing model is cheap to run, cheap to host, and small enough to deploy on-premise — which is exactly what budget-bound teams, sovereign-AI efforts, and data-resident (regulated) deployments need. You get an assistant whose facts are as current as your search index and as auditable as the sources it cites, without renting a large model's inference bill. The capability you care about per dollar spent is high precisely because you stopped asking the parameters to do a job retrieval does better.

There are three further advantages that fall out of this design almost for free, and they tend to matter more in production than the headline accuracy. The first is freshness without retraining. A model whose facts live in its weights is frozen at its training cut-off, and the only way to update it is the expensive, slow business of retraining. A routing model is as current as the index it queries; yesterday's event is answerable today with no model change at all, because the knowledge was never the model's to hold. The second is auditability. When the answer carries its source, a wrong answer becomes diagnosable — you can see whether the model misread a good source or faithfully reported a bad one, which is the difference between a debuggable system and a black box you can only shrug at. For regulated and sovereign deployments, "show me where this came from" is often not a nice-to-have but a requirement, and a grounded routing assistant answers it natively. The third is a smaller, cheaper footprint that you actually control. A 1B conductor can live on modest, on-premise hardware where the data already sits, instead of shipping sensitive queries off to a large hosted model. For a small team without a large-model inference budget, this is not a compromise — it is frequently the only shape of assistant that is economically and legally deployable at all, and it happens to be a good one.

The honest limits.

We state the ceiling plainly. A 1B assistant is strongest in its primary language and weaker in others; its unaided, open-domain language quality is not at the frontier, and we don't claim it is. On raw open-domain quality it is a competent small model, not a frontier one, and we would mislead you to imply otherwise.

Grounding reduces hallucination relative to ungrounded small-model output, but it does not abolish it — a bad source or a bad query still produces a bad answer, so the tool layer and the sources matter as much as the model. Worse, grounding introduces its own characteristic failure modes that an honest team has to own. A search can return confidently-wrong material, and a faithful model will faithfully relay it; garbage in, grounded garbage out. A poorly-formed query retrieves the wrong thing, and the model composes a fluent answer to the wrong question. And the model can still drift from its sources during composition, papering over a gap in the retrieved material with a plausible sentence of its own — the very parametric habit the whole design is trying to suppress. None of these are solved by routing; they are relocated by it, from the weights to the tool layer, where at least they are visible and fixable. That relocation is a genuine improvement, but calling it a cure would be the kind of overclaim this note exists to avoid.

And "route, don't remember" is a design philosophy, not a magic switch: the routing itself has to be good, and making it good is most of the work. A model that selects the wrong tool, forms a sloppy query, or composes unfaithfully turns the whole architecture into an expensive way to be wrong with citations attached. The competence simply moves; it does not disappear. The hard engineering is no longer "make the model know everything" — it is "make the model a reliable conductor, build a tool layer worth conducting, and keep the answer honest to what the tools returned." That is a more tractable problem than cramming the world into a billion parameters, but it is not a free one, and any honest account has to say so.

What we keep closed.

The lesson in this note is general, and we are glad to see it used. What we do not discuss here is anything about how the underlying model itself is built — its internals, how it is trained, what it learns from, or how its outputs are produced are all out of scope, and deliberately so. This is a note about an architectural posture for small-model products, not a disclosure of the model behind ours.

None of the caveats above undercut the lesson, which we think generalizes well beyond our own assistant: if your model is small, spend your effort on what it routes to, not on what it memorizes. Be honest about the ceiling, build the tool layer with the same care you would give the model, keep every answer faithful to its sources — and you end up with the cheapest, most honest assistant there is: the one that knows when to look something up, and then actually does.

— Tetracta AI Teams · for humans, like humans.

← All research notes Talk methodology