# Tetracta Rational Attention — Training-Stability Benchmark (sanitized summary)

**Status:** diagnostic-scale, pre-registered, reproducible. **Provisional** (sub-Chinchilla, ≤236M
tokens, single-seed). Effects and measured outcomes only — **no formula, kernel, recipe, or
learning-rate schedule is disclosed in this document.**

---

## What was measured

A controlled A/B comparison: identical model body, optimizer, seed, and training data; **only the
attention nonlinearity differs** (standard softmax vs. Tetracta rational attention). Training-process
health was measured by applying Statistical Process Control (SPC, Six-Sigma style) to the gradient
norm: process capability (Cpk), excursion (defect) rate, peak gradient norm, and spike count.
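
For readers unfamiliar with SPC, the four metrics can be illustrated on a gradient-norm trace. This is a minimal sketch using the textbook one-sided (upper) capability index, Cpk = (USL − mean) / (3σ). It is not the benchmark's own procedure: the function name `spc_summary`, the upper spec limit `usl`, and the spike rule (a step above `spike_factor` times the trace median) are all assumptions made for illustration.

```python
import statistics


def spc_summary(grad_norms, usl, spike_factor=3.0):
    """Summarize a gradient-norm trace with simple SPC-style metrics.

    grad_norms   -- per-step gradient norms (at least two values)
    usl          -- upper spec limit; a hypothetical threshold, not from the benchmark
    spike_factor -- hypothetical spike rule: norm > spike_factor * median
    """
    mean = statistics.fmean(grad_norms)
    sigma = statistics.stdev(grad_norms)  # sample standard deviation

    # One-sided capability: gradient norms only have an upper limit of interest.
    if sigma == 0.0:
        cpk = float("inf") if usl > mean else float("-inf")
    else:
        cpk = (usl - mean) / (3.0 * sigma)

    # Excursion (defect) rate: fraction of steps above the spec limit.
    excursion_rate = sum(g > usl for g in grad_norms) / len(grad_norms)

    # Spike count under the assumed median-based rule.
    median = statistics.median(grad_norms)
    spikes = sum(g > spike_factor * median for g in grad_norms)

    return {
        "cpk": cpk,
        "excursion_rate": excursion_rate,
        "peak": max(grad_norms),
        "spikes": spikes,
    }


# Usage with made-up traces (not benchmark data).
calm_trace = [1.0, 1.1, 0.9, 1.0, 1.0, 1.1, 0.9, 1.0]
spiky_trace = [1.0, 1.0, 1.0, 5.0]
print(spc_summary(calm_trace, usl=2.0))
print(spc_summary(spiky_trace, usl=2.0))
```

Under this definition, a higher Cpk means the trace sits further below the limit relative to its own spread, which is why a drifting run scores low even before it diverges.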

## Headline result — stabilizer-free, high-learning-rate regime

Without any external stabilizer (e.g. qk-norm) and at an aggressive learning rate, bare softmax
**drifts and collapses**, while rational attention stays **bounded and calm** under identical conditions.

| Metric (stabilizer-free, high-LR)            | Bare softmax | Tetracta rational | Better |
|----------------------------------------------|:------------:|:-----------------:|:------:|
| Process capability — Cpk (↑)                 | 0.30         | ~1.59             | rational |
| Excursion / defect rate (↓)                  | ~18%         | ~0%               | rational |
| Peak gradient norm (↓)                       | ~42.9        | ~6.6              | rational |
| Gradient-norm spikes (↓)                     | 29           | 0                 | rational |

The collapse is a **monotonic drift, not a single blip**, i.e. it is intrinsic to the operator under
this regime. The pattern is consistent across multiple scales (≈601M, 1B, 3B).

## With an external stabilizer, both are stable (honest framing)

When softmax is given a modern stabilizer (qk-norm), **both** operators train cleanly — the
dramatic gap above is specifically the **stabilizer-free** regime. We do **not** claim "softmax
collapses" in general; modern stacks train it routinely.

| Metric (with stabilizer, both arms)          | Softmax | Rational |
|----------------------------------------------|:-------:|:--------:|
| Process capability — Cpk                      | 1.39    | 1.41     |
| Gradient-norm spikes                          | 0       | 0        |

## Compute & integration (measured)

| Property                                            | Value |
|-----------------------------------------------------|:-----:|
| Extra parameters vs. softmax                        | **0** (identical parameter count) |
| Op-behaviour preservation (bf16 cosine)             | ~0.99999 (drop-in equivalent) |
| Model-FLOPs Utilization (MFU) vs. softmax           | parity (~equal) |
| Measured compute tax vs. softmax                    | **≤ 1%** |
| Memory (fused kernel vs. our own naïve O(N²) ref)   | ~−86% |
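
The cosine figure in the table is a similarity between two output tensors; this summary does not say which pair was compared or how. As a generic illustration only, cosine similarity over two flattened outputs can be sketched as follows. The function name `cosine_similarity` and the sample vectors are hypothetical, and the sketch runs in Python floats rather than bf16.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length flattened vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for two operator outputs (not benchmark data).
reference_out = [0.12, -0.40, 0.88, 0.05]
candidate_out = [0.12, -0.41, 0.88, 0.05]
print(cosine_similarity(reference_out, candidate_out))
```

A value near 1 says the two outputs point in almost the same direction; it does not by itself bound element-wise error, so it is best read alongside the stability metrics rather than in isolation.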

## Honest limits (read before citing)

- All numbers are **sub-Chinchilla / provisional**, single-seed, single width-step — not
  production-validated.
- The largest stability gap is measured against **stabilizer-free** softmax. A
  fair-baseline test against a fully stabilized softmax (qk-norm + z-loss + µP) is on the
  validation roadmap and **has not yet been run**.
- We sell **stability and capability-per-dollar**, not a quality leap. (See the scale-curve summary
  for the modest, direction-consistent loss edge.)

> Reproducible from sealed, pre-registered result files (deterministic recompute, md5-sealed).
> Mechanism, kernel internals, and training recipe are trade secrets and are **not** part of this
> document. © 2026 Tetracta — Rational Attention™, method patent application in preparation.