The Quality Threshold: Why Capability Doesn't Scale Linearly
On emergent abilities, the limits of parallelism, and the minimum viable model.
The Intuition That Fails
When a model is too small for a task, the instinct is to compensate: use more models, run more passes, add more structure. If one 3B model can't reliably produce a tool call, perhaps three of them — voting, checking each other, retrying — can approximate the reliability of a single capable model.
This intuition is wrong. It fails not for engineering reasons, but for reasons intrinsic to how language models develop capability. Understanding why requires looking at what "capability" actually means at these scales.
Emergent Abilities and the Threshold Effect
Large language models do not scale smoothly. Certain abilities are absent at small scale and appear abruptly as scale increases — not as gradual improvements, but as phase transitions.
Wei et al. (2022) documented this across a range of tasks: few-shot arithmetic, logical reasoning, multi-step inference, instruction following. At small scales, performance is near chance. At a threshold — different for each capability, typically in the range of 10–100 billion parameters — performance jumps discontinuously. The models are not gradually getting better at these tasks; they are not doing them at all, and then they are.
The threshold that matters for outheis is the one for structured tool dispatch: producing a valid JSON tool call, with correct argument names and types, from a natural-language query. This is not pattern matching against training examples. It requires:
1. Understanding the query semantically
2. Identifying which tool satisfies the intent
3. Extracting the correct arguments from context
4. Producing syntactically valid structured output
Models below the threshold fail somewhere in steps 1–3. They produce plausible-looking output, often structurally valid JSON, but with wrong tool names, hallucinated arguments, or complete non sequiturs. The failures are not graceful degradations; they are category errors.
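These category errors are detectable mechanically, even though they cannot be repaired mechanically. A minimal validator sketch, with a hypothetical tool registry (names and schemas here are illustrative, not outheis's actual tool set):

```python
import json

# Hypothetical tool registry (illustrative, not outheis's actual schema):
# tool name -> expected argument names and types.
TOOLS = {
    "agenda_query": {"date": str, "scope": str},
    "file_read": {"path": str},
}

def validate_tool_call(raw: str) -> tuple[bool, str]:
    """Reject the category errors described above: unknown tool names,
    hallucinated arguments, wrong argument types, missing arguments.
    Syntactically valid JSON passes none of these checks for free."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if not isinstance(call, dict) or not isinstance(call.get("args"), dict):
        return False, "not a tool-call object"
    schema = TOOLS.get(call.get("tool"))
    if schema is None:
        return False, f"unknown tool: {call.get('tool')!r}"
    for name, value in call["args"].items():
        if name not in schema:
            return False, f"hallucinated argument: {name!r}"
        if not isinstance(value, schema[name]):
            return False, f"wrong type for argument: {name!r}"
    missing = sorted(set(schema) - set(call["args"]))
    if missing:
        return False, f"missing arguments: {missing}"
    return True, "ok"
```

Note what the validator can and cannot do: it can reject a hallucinated argument, but it cannot recover the correct one. Rejection is cheap; the capability to produce a correct call is what lives, or does not live, in the weights.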
Why Parallelism Does Not Help
If capability is absent below the threshold, adding more below-threshold models does not produce it. This is not a resource argument. It is an argument about the nature of what is missing.
Consider what it would mean for three 3B models to "vote" on a tool call. Each produces its best guess. Voting requires comparing outputs and finding a majority — but if none of the outputs is correct, majority vote returns the most common error, not a correct result. Worse: the errors from similar small models are correlated. They fail in the same ways because they have the same limitations. Ensemble methods gain from diverse failures, not from identical failures.
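The voting failure can be made concrete in a few lines. The candidate outputs below are illustrative, not real model transcripts; they show correlated errors of the kind similar small models produce:

```python
from collections import Counter

def majority_vote(candidates: list[str]) -> str:
    """Return the most common candidate. If no candidate is correct,
    this returns the most common error: voting cannot create a
    capability that none of the voters has."""
    return Counter(candidates).most_common(1)[0][0]

# Three similar below-threshold models fail in correlated ways for
# "check my agenda": two pick the same wrong tool, one picks another.
# (Illustrative outputs; the correct call would be agenda_query.)
candidates = [
    '{"tool": "file_read", "args": {"path": "agenda"}}',
    '{"tool": "file_read", "args": {"path": "agenda"}}',
    '{"tool": "memory_lookup", "args": {"key": "agenda"}}',
]
winner = majority_vote(candidates)  # the most common wrong answer wins
```

The mechanism works exactly as designed; the inputs are what fail it. An ensemble of correct-but-diverse voters converges on the truth; an ensemble of identically limited voters converges on the modal mistake.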
More passes don't help either. A model that cannot derive the correct argument from the query on the first pass will not derive it on the third. The pass is not where the capability lives; the capability would need to live in the weights, and the weights don't have it.
The engineering parallel: you cannot implement a feature in software by running buggy code faster. The bug must be fixed. For language models, below the threshold, there is no equivalent of "fixing the bug" without retraining. The model is what it is.
The Structural Argument
There is a deeper reason why the threshold cannot be bypassed. A transformer produces each output token in a single forward pass through a fixed stack of layers. Each layer refines the representation of the query, building increasingly abstract features; the final layer produces the output distribution.
The capability to follow a complex instruction depends on having enough layers, enough heads, and enough parameters to represent the necessary intermediate abstractions. A small model does not have a degraded version of this ability; it has a different architecture that reaches a lower ceiling. You can run it more times, but the ceiling does not move.
Larger models also have more capacity to represent what is not said — inference from context, disambiguation, recognition of unstated intent. This is exactly what tool dispatch requires: "check my agenda" must be resolved to an agenda_query call, not a file_read or memory_lookup. The resolution requires contextual inference that a small model cannot reliably perform.
The Practical Threshold for outheis
Empirically, the 7B boundary holds reliably for instruction-following in outheis's context. Below it:
- Tool call syntax is correct; tool selection is wrong
- Arguments are plausible-sounding but don't correspond to actual parameters
- Multi-step reasoning (which tool, then which arguments) collapses
At 7B with instruction tuning, the threshold is crossed for routing-class tasks (relay). At 12–14B, it holds for tool-use agents (zeno, cato) if verified. Below 7B, no amount of prompt engineering, retrying, or structural scaffolding produces reliable results — because the capability the scaffold assumes is not there.
This means the relevant question when selecting a model is not "how small can I go?" but "is this model past the threshold for the task?" It is a binary, not a gradient. Testing confirms this: test_ollama_tool_use.py produces ✓ or ✗, not a score. A ✗ model cannot be provisionally used with more care.
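The shape of such a check matters more than its contents: it aggregates to a single pass/fail verdict, not a score. A sketch of that structure, with a stub dispatcher standing in for a real model call (the actual test drives an Ollama model; the cases and names here are assumptions):

```python
def passes_tool_use_check(dispatch, cases) -> bool:
    """Binary verdict: the model either crosses the threshold or it
    does not. Any failed case means below-threshold; there is no
    partial credit, because the scaffold above assumes reliability."""
    return all(dispatch(query) == expected for query, expected in cases)

# Stub standing in for a real model invocation (assumption: the real
# test queries a model, not a lookup table).
def stub_dispatch(query: str) -> str:
    return {"check my agenda": "agenda_query"}.get(query, "file_read")

cases = [
    ("check my agenda", "agenda_query"),
    ("what did I note yesterday?", "memory_lookup"),
]
verdict = passes_tool_use_check(stub_dispatch, cases)  # False: one case fails
```

Collapsing to `all(...)` rather than a ratio is the design choice: a model that gets 80% of dispatches right still corrupts one run in five, which is not a usable routing layer.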
What This Means for outheis Design
One capable agent beats many weak ones. outheis is built around a small number of specialized agents, each with a focused tool set and a model that has crossed the relevant capability threshold. This is not a concession to practical limitations; it is a structural property of how these systems work.
The boundary is a quality threshold, not a resource threshold. The decision to require a minimum model size is not about running costs or latency. It is about whether the capability required by the task exists in the model at all. A model that fails the tool-use test is not a slower or cheaper version of one that passes; it is a different category of tool.
Verification is mandatory before deployment. Because capability is threshold-based and not inferrable from parameter count alone (different architectures, different training regimes, different fine-tuning), empirical testing is required. The parameter count is a heuristic for where to start looking; the test result is the ground truth.
Disable rather than degrade. When no model past the threshold is available for a given agent role, the correct response is to disable that agent's scheduled tasks, not to run them with a below-threshold model. A nightly pattern inference run that produces hallucinated memories is worse than no run: it injects false data into the knowledge base. A failed run leaves the system unchanged. A degraded run corrupts it.
Further Reading
- Wei, J. et al. "Emergent Abilities of Large Language Models." Transactions on Machine Learning Research, 2022. — The systematic documentation of threshold effects across capability dimensions.
- Kaplan, J. et al. "Scaling Laws for Neural Language Models." arXiv, 2020. — Power-law relationships in loss; the baseline from which emergence departures are measured.
- Schaeffer, R. et al. "Are Emergent Abilities of Large Language Models a Mirage?" NeurIPS, 2023. — The challenge: some apparent emergence may be an artifact of metric choice. Worth reading alongside Wei et al. for an accurate picture.
- Srivastava, A. et al. "Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models." Transactions on Machine Learning Research, 2023. — The BIG-Bench benchmark; large-scale empirical evidence for capability emergence across 200+ tasks.