Research Note

Put Your Gradient-Norm on a Control Chart: A Six-Sigma View of Training Stability

Tetracta AI Teams · 15 June 2026

A methodology note. It describes how we measure training stability — not the mechanism that produces it. No formula, threshold value, learning rate, or recipe is disclosed.

Ask ten engineers whether a training run was "stable" and you will get ten judgments, each formed by squinting at a loss curve. Stability, in practice, is usually a vibe. Someone says the run "looked healthy," someone else points at a bump near the middle, a third person shrugs because it converged anyway. None of these statements survives contact with a second observer, and none of them can be written down, audited, or promised to a customer. We think that is a mistake — and that a discipline manufacturing solved decades ago fixes it.

Treat the gradient-norm as a process.

On a factory line you do not certify a part by glancing at it. You measure a characteristic, you set control limits, and you ask a quantitative question: is this process capable of staying inside its limits, and how often does it stray outside them? That is statistical process control (SPC), and the Six-Sigma vocabulary built on top of it, from process capability (Cpk) to defect rates and control charts, is exactly the language a training run needs. A machinist does not say a shaft "looks round." They say the diameter holds tolerance with a capability index of such-and-such, across so many samples. We want a training run to be describable with the same precision.

The gradient-norm is a natural candidate for the controlled characteristic. It is cheap to log — you already compute it, or nearly do, on every step. It is sensitive to the onset of instability, often twitching before the loss curve shows anything a human would notice. And — crucially — it is a process that unfolds over thousands of steps, which is precisely what SPC was designed for. SPC is not a tool for judging single measurements; it is a tool for judging the behavior of a stream of measurements over time. A training run is exactly such a stream. The fit is almost embarrassingly natural; the surprise is that the field does not already do this by default.

What the chart actually looks like.

Concretely: you take the gradient-norm trace, slice it into consecutive windows of steps, and for each window you compute two summary statistics: a central tendency and a measure of spread. Those windows become the points on the chart. A control limit is drawn from the run's own early, well-behaved stretch, so the chart is calibrated to this run's natural variation rather than to an arbitrary global number. Then you watch. A point that pokes above the limit is an excursion, that is, a spike. A long stretch of points hugging the limit, never quite breaching but never relaxing, is a process living dangerously. A tight band low on the chart is a process with margin to spare. The eye that has learned to read a control chart sees the difference between "noisy but in control" and "quietly heading for the cliff" instantly. More importantly, so does an automated monitor that never gets tired at 3 a.m.
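
The windowing described above can be sketched in a few lines of Python. This is a generic, textbook-style illustration and not the calibration this note withholds: the function names, the window size, the number of baseline windows, and the multiplier k are all placeholders chosen for the example, and the "centre plus k times spread" rule is the standard Shewhart-style assumption rather than anything taken from our runs.

```python
from statistics import mean, pstdev


def window_stats(trace, window):
    """Slice a gradient-norm trace into consecutive, non-overlapping windows
    and return one (mean, std) pair per full window. A trailing partial
    window is dropped."""
    points = []
    for start in range(0, len(trace) - window + 1, window):
        chunk = trace[start:start + window]
        points.append((mean(chunk), pstdev(chunk)))
    return points


def upper_control_limit(points, n_baseline, k):
    """Upper limit drawn from the first n_baseline window means:
    baseline centre + k * baseline spread.
    n_baseline and k are illustrative placeholders, not disclosed values."""
    baseline = [m for m, _ in points[:n_baseline]]
    return mean(baseline) + k * pstdev(baseline)


def excursions(points, ucl):
    """Indices of windows whose mean sits above the control limit."""
    return [i for i, (m, _) in enumerate(points) if m > ucl]


# Toy trace: quiet at first, then one spiky window, then quiet again.
trace = [1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 1.0, 1.0, 5.0, 5.2, 1.0, 1.0]
points = window_stats(trace, window=2)
ucl = upper_control_limit(points, n_baseline=4, k=3.0)
spikes = excursions(points, ucl)
```

On this toy trace the only window flagged is the fifth one, which holds the two large values. How many baseline windows to trust and how wide to set the band are exactly the judgment calls the sketch leaves open.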

What you get.

Once the gradient-norm lives on a control chart, three things stop being subjective:

Capability (Cpk). A single number for how comfortably the run stays within its control band. A high Cpk is a process with margin to spare; a low Cpk is one living on the edge of divergence even if it never quite blows up. This is the distinction the loss curve hides: two runs can land on nearly the same final loss while one of them spent the whole time one bad batch away from diverging. Cpk separates the run that was robustly fine from the one that was luckily fine.
Excursion (defect) rate. The fraction of windows that breach the control limit — the spikes. "A little spiky" becomes "this run is out of control on N% of its windows," a statement a reader can audit. A defect rate is a number you can put in a table next to another number and let a stranger draw their own conclusion.
A fair, early signal. You can see a run getting unhealthy long before the loss visibly diverges, and you can compare two configurations on stability as a measured quantity, not an impression. The early-warning property is worth dwelling on: by the time a loss curve has visibly turned upward, you have usually already wasted the compute. A capability index that degrades window-over-window gives you the chance to intervene — or, at minimum, to record honestly that the run was deteriorating before it died.
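
For readers who want to see the three quantities as arithmetic, here is a minimal sketch using the textbook one-sided capability index, (upper limit minus mean) divided by three standard deviations. It assumes only the upper limit matters for a gradient-norm, uses the population standard deviation for simplicity, and takes the limit as an input; none of the names or values below come from our calibration.

```python
import math
from statistics import mean, pstdev


def upper_capability(values, upper_limit):
    """Textbook one-sided capability: (limit - mean) / (3 * std).
    With zero spread the index is +inf or -inf depending on which side of
    the limit the values sit, and 0.0 if they sit exactly on it."""
    margin = upper_limit - mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return 0.0 if margin == 0 else math.copysign(math.inf, margin)
    return margin / (3 * sigma)


def excursion_rate(window_values, upper_limit):
    """Fraction of windows whose value breaches the limit."""
    breaches = sum(1 for v in window_values if v > upper_limit)
    return breaches / len(window_values)


def rolling_capability(values, span, upper_limit):
    """Capability over a sliding span of windows, so a downward drift
    shows up window-over-window as an early signal."""
    return [
        upper_capability(values[i - span:i], upper_limit)
        for i in range(span, len(values) + 1)
    ]


cpk = upper_capability([1.0, 3.0], upper_limit=5.0)            # mean 2, std 1
rate = excursion_rate([1.0, 1.0, 6.0, 1.0], upper_limit=5.0)   # one breach in four
trend = rolling_capability([1.0, 3.0, 1.0, 3.0, 3.0, 5.0], span=2, upper_limit=5.0)
```

In the toy trend the last span has its mean one unit below the limit with a spread of one, so the index falls to about a third: the kind of window-over-window degradation the third point describes.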

That third point matters more than it looks. A very common way to manufacture a fake "win" for a new method is to compare it against a baseline that quietly destabilized — the baseline's worse number is an artifact of its instability, not the new method's quality. We have watched this trap catch careful people, including ourselves on earlier work: a headline edge that looked like architectural superiority turned out, on inspection of the raw traces, to be the baseline collapsing under a learning rate it could not tolerate. The "win" evaporated the moment the baseline was stabilized and re-run. If you are scoring stability on a control chart, that collapse is visible and quantified instead of hidden. You cannot accidentally bank a comparison against a broken baseline when the chart is shouting that the baseline was out of control on a large fraction of its windows.

In our own runs, this stopped being hypothetical. A stabilizer-free softmax pushed to a high learning rate fell to a low capability, with excursions on a sizable fraction of its windows, while a stabilized variant stayed spike-free at high capability. We saw the same qualitative pattern in preliminary, diagnostic-scale runs at more than one model size, which is the part that makes us take it seriously rather than treating it as a single lucky trace. The point is emphatically not the control-limit setting, and not the operator behind it; it is that the gap was measured, not asserted. Two configurations stood next to each other on the same chart, with the same calibration, and one of them was demonstrably in control where the other was demonstrably not. That sentence is auditable. "It felt more stable" is not.

A note on what these numbers are and are not. Cpk values are a familiar manufacturing dialect: indices near and above the comfortable threshold describe a process with real headroom, while low indices describe a process that is producing defects at a rate no factory would ship. We are borrowing the dialect deliberately, because it lets a reader from outside machine learning calibrate the claim against an intuition they already trust. But we tag every one of these figures as preliminary and diagnostic-scale for a reason: they come from runs sized to diagnose a phenomenon, not from a production training campaign, and we would rather under-claim now than walk back a headline later.

Why this is nearly free.

A reasonable objection: instrumentation usually costs something, and a method that taxes every training run to watch itself is a hard sell. Here the accounting is friendly. The gradient-norm is already a byproduct of the optimizer step; turning it into windowed statistics is arithmetic on a scalar stream, not a second forward pass. In our measurements the overhead of carrying this monitoring sat well under one percent of training cost (preliminary, diagnostic-scale). That is close enough to free that the question flips: not "can we afford to chart stability?" but "why would we run blind when watching costs almost nothing?" Compute you did not spend on monitoring, but lost to a divergence you did not see coming, is far more expensive than the monitoring would ever have been.
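
To make "arithmetic on a scalar stream" concrete, the sketch below keeps a running mean and spread with Welford's online update and emits one chart point each time a window closes. It is an illustration under assumptions: the class name and window size are hypothetical, and the gradient-norm is taken as a plain float handed in once per step by whatever training loop is in use.

```python
import math


class WindowAccumulator:
    """Streaming (mean, std) per window using Welford's online update.
    Holds three numbers of state regardless of window length."""

    def __init__(self, window):
        self.window = window
        self._reset()

    def _reset(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, grad_norm):
        """Feed one scalar per training step. Returns a (mean, std) chart
        point when the window closes, otherwise None."""
        self.n += 1
        delta = grad_norm - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (grad_norm - self.mean)
        if self.n < self.window:
            return None
        point = (self.mean, math.sqrt(self.m2 / self.n))
        self._reset()
        return point


acc = WindowAccumulator(window=2)
first = acc.update(1.0)    # window still open
second = acc.update(3.0)   # window closes here
```

Each step costs a handful of floating-point operations and no extra pass over the model, which is why the bookkeeping itself is negligible next to a training step.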

Stability you can put in a contract.

The quiet payoff is commercial. "Stable training" is usually an aspiration — a thing a vendor gestures at in a sales call and cannot back with a number. A capability number is a specification. A team that monitors gradient-norm capability can write training stability into an SLA — promise a Cpk floor, alert on an excursion-rate ceiling — the same way a supplier guarantees a tolerance on a machined part. This is the difference between a craft and an industry. Crafts produce good outcomes when a skilled person is paying attention; industries produce specified outcomes because the specification is measured and enforced.

Consider the customer's side of that contract. Instability is GPU budget set on fire: a run that diverges is compute you paid for and threw away, plus the engineer-hours spent babysitting learning-rate sweeps, plus the schedule slip while someone figures out which knob to nudge. A buyer commissioning a large training run is, today, buying a probability — "we think this will converge" — with no instrument on it. Turning stability into a measured, contract-able number is how you stop paying that tax blindly. It lets a provider say: here is the capability floor we hold to, here is the excursion ceiling that triggers an alert and a pause, and here is the chart you can audit afterward. That is a sellable promise precisely because it is a falsifiable one.
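
A contract of this shape reduces to a small, checkable rule. The sketch below is hypothetical throughout: the class and field names are invented for illustration, and the floor and ceiling in the usage lines are arbitrary placeholders, not figures anyone has promised.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StabilitySpec:
    """Hypothetical stability clause: a capability floor and an
    excursion-rate ceiling."""
    cpk_floor: float
    excursion_ceiling: float


def check_spec(spec, cpk, excursion_rate):
    """Return the list of clauses violated; an empty list means in spec."""
    violations = []
    if cpk < spec.cpk_floor:
        violations.append("capability below floor")
    if excursion_rate > spec.excursion_ceiling:
        violations.append("excursion rate above ceiling")
    return violations


spec = StabilitySpec(cpk_floor=1.0, excursion_ceiling=0.05)  # placeholder values
healthy = check_spec(spec, cpk=1.5, excursion_rate=0.0)
unhealthy = check_spec(spec, cpk=0.5, excursion_rate=0.2)
```

A monitor could page someone or pause the run whenever the returned list is non-empty, and the same check can be replayed against the logged chart after the fact.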

There is an internal dividend too, separate from any customer. When stability is a logged number rather than a memory, post-mortems get honest. A run that died at step N has a chart showing exactly when its capability started to slip, which makes the question "what changed?" tractable instead of forensic. Teams that chart this build an institutional memory of which configurations were robust and which were merely lucky — and that memory compounds. The cheapest experiment is the one you do not have to re-run because you already know, from the chart, that the configuration was living on the edge.
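
One way a post-mortem can read "when did capability start to slip" off a logged series is to look for the first sustained run of windows below a floor. This is a sketch under assumptions: the function name, the floor, and the patience parameter (how many consecutive low windows count as a real slip rather than noise) are illustrative choices, not part of our tooling.

```python
def first_sustained_slip(capability, floor, patience):
    """Index of the first window that starts a run of `patience`
    consecutive capability values below `floor`, or None if no such
    run exists. Isolated dips shorter than `patience` are ignored."""
    run = 0
    for i, value in enumerate(capability):
        run = run + 1 if value < floor else 0
        if run == patience:
            return i - patience + 1
    return None


# Toy series: one isolated dip at index 2, then a sustained decline from index 4.
series = [2.0, 1.8, 0.9, 1.5, 0.8, 0.7, 0.6]
onset = first_sustained_slip(series, floor=1.0, patience=3)
```

On this toy series the isolated dip is skipped and the onset lands on index 4, the window where the sustained decline begins.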

What we deliberately do not share.

The calibration of the control limit — where the line sits, and how it is set for a given architecture and budget — is the part that takes judgment, and it is the part we keep. Anyone can window a gradient-norm and draw a control chart; the technique above is textbook SPC pointed at a new target, and we are glad to see it used widely. What is not textbook, and not in this note, is the specific calibration that makes the chart sharp for the kind of training we do — tight enough to catch trouble early without crying wolf on ordinary noise. That tuning is the difference between a chart that is decorative and a chart that is decisive, and it is earned, not free.

Two things we are careful to keep distinct. First, the control-limit threshold value itself is not disclosed here; a borrowed threshold is worse than no threshold, because it carries false authority. Second — and separately — the attention mechanism that produces the stability we measure is not the subject of this note at all. That is a different body of work, and there is a method patent application in preparation around it. This note is strictly about the instrument, not the engine it happens to be pointed at.

The honest caveats.

We would rather state the limits plainly than have a reader find them.

SPC measures stability, not quality. A perfectly capable run can still be a mediocre model — a process can hold its tolerance flawlessly and still be machining the wrong part. Nothing on this chart tells you whether the thing you trained is any good; it tells you whether the training behaved. Those are different questions and we do not conflate them.

Control limits must be calibrated per setup. A threshold borrowed from someone else's run, or even from your own run at a different scale or budget, will mislead — sometimes by missing real instability, sometimes by flagging healthy noise as a defect. The calibration is not a one-time constant; it is part of the experimental design.

Capability is one signal among several, not a verdict on its own. We read it alongside the loss, the eval metrics, and the raw traces, and we are suspicious of any story that rests on a single number — including this one. The diagnostic-scale figures above are evidence, not proof, and the preliminary cross-scale reproduction is what gives us confidence, not any single Cpk in isolation.

And the gap we measured is a gap in stability, full stop. It is not a claim that one approach produces a better model, or a faster one, or that the advantage widens with scale. Those are separate questions with their own evidence, and this note does not adjudicate them. We are making the narrow claim, and only the narrow claim, that stability can be measured rather than asserted — and that when we measured it, the difference between a controlled and an uncontrolled process was stark and reproducible.

None of that diminishes the core point: if you are going to claim a run was stable, you should be able to put a number on it — and gradient-norm SPC is how. Stability stops being a vibe and becomes a measurement. That is a small change in vocabulary and a large change in what you can honestly promise.

— Tetracta AI Teams · for humans, like humans.
