# Tetracta Rational Attention — Results & Evidence Report

**Scope:** a drop-in replacement for softmax attention, evaluated in a controlled A/B at diagnostic
scale. **Everything here is a measured RESULT.** The mechanism, formula, kernel internals, exact
learning rates, schedule, gradient-norm SPC thresholds and tokenizer internals are trade secret and
are **deliberately not in this document.** Results are abundant; the recipe is not.

> **Status:** provisional — sub-Chinchilla (≤236M training tokens, a single shared corpus, seed 42),
> single width-step per scale. Pre-registered, deterministically recomputed from raw per-step logs,
> md5-sealed. Downstream task accuracy (MMLU / GSM8K) and a fully-stabilized fair-baseline are on the
> validation roadmap and **not yet measured.**

---

## 1. How it was measured (clean isolation)

Every comparison fixes the **model body, optimizer (Muon + AdamW), random seed, and training data**,
and changes **only the attention nonlinearity** (standard softmax vs. Tetracta rational). Paired runs
start from a byte-identical initialization, are compared at the **same step** on a **fixed,
deterministic held-out evaluation** (zero eval noise), and every number is regenerated from raw
per-step JSONL by a single extractor (re-running reproduces the tables bit-for-bit). A deliberately
**corrupted control** is included in the kernel test and is correctly rejected — the tests are not a
rubber stamp.

- **3B configuration:** d3072 / L24 / h24 / kv8 / ff12288 ≈ 3.519B total / 2.16B active, head_dim 128.
- **7B configuration:** d4096 / L28 / h32 / kv8 / ff16384 ≈ 7.07B total / 4.36B active, head_dim 128, MoE.
- **Hardware:** H100 / H200. **Token budget:** ≤236M (sub-Chinchilla) — provisional by design.

Full per-experiment ledger: **`tetracta-experiments.csv`** (17 runs). Full per-step curve:
**`tetracta-convergence-3b.csv`** (36 points). Stability pairs: **`tetracta-stability-pairs.csv`**.
Kernel equivalence: **`tetracta-kernel-equivalence.csv`**. Scale: **`tetracta-scale-curve.csv`**.

---

## 2. Parameterless stability ⭐ (strongest, most-measured result)

Without any external stabilizer, at the **same** aggressive learning rate, bare softmax **drifts and
collapses** while rational stays **bounded**. Gradient-norm SPC (Six-Sigma) on the cleanest pair:

| Metric (stabilizer-free, same LR) | Bare softmax | Tetracta rational | Ratio |
|---|---:|---:|---:|
| Process capability — **Cpk** (↑) | **0.297** | **1.587** | 5.3× |
| Sigma level | 0.89σ | 4.76σ | — |
| Defect rate (excursions) | **~18.6%** | **~0%** | — |
| Peak gradient norm (headline) | **~42.9** | **~6.6** | ~6.5× |
| Large gradient-norm excursions | **29** | **0** | — |
| Final BPB (this regime) | 1.6671 (collapsing) | 1.3698 (clean) | — |

A second pair at a **lower** LR: softmax still spikes (gn-max 22.84, Cpk 0.676, 2.13% defect) while
rational stays calm (gn-max 5.09, Cpk 1.429, ~0%). The collapse is a **monotonic drift, not a single
blip** — i.e. intrinsic to the operator in this regime. Consistent across ≈601M / 1B / 3B.

> **Honest framing:** this is the **stabilizer-free** regime. We do **not** claim "softmax collapses"
> in general — with qk-norm / z-loss / µP it trains fine. The fair-baseline test against a
> fully-stabilized softmax is on the roadmap and not yet run.

---

## 3. Quality edge — modest, direction-consistent (best-vs-best)

Each operator at its own optimal LR, both stabilized:

| Scale | Rational BPB | Softmax (tuned) BPB | **Edge** | Relative |
|---|---:|---:|---:|---:|
| **3B** | 1.1368 | 1.1514 | **−0.0146** | ~−1.3% PPL |
| **7B** | 1.1223 | 1.1395 | **−0.0172** | ~−1.5% PPL |

At 3B the full matched-step trajectory (`tetracta-convergence-3b.csv`) shows rational **ahead at every
converged step**, with the gap stable around −0.014 / −0.015 throughout (it does not appear only at the
endpoint). At 7B the converged-phase edge is stable around **−0.017** (−0.0178 … −0.0172).

> **Honest framing (important):** the edge is best attributed to **learning-rate headroom** — rational
> tolerates **~2× the stable LR** of softmax. At **equal LR**, stabilized softmax is actually +0.0153
> ahead (parity; "no separate nonlinearity-magic quality edge"). Two width points (3B, 7B) establish a
> **direction, not a scaling law.** We sell stability + capability-per-dollar, not a quality leap.

---

## 4. Drop-in, zero extra parameters, ~0 compute tax

- **0 extra parameters** — identical parameter count and state-dict structure.
- **Op-behaviour preserved** — fused-kernel forward/backward cosine ≈ **0.99999** vs. our naive
  reference (see `tetracta-kernel-equivalence.csv`); a corrupted control is correctly rejected.
- **Compute tax ≤ 1%** vs. softmax (7B MFU at parity); **MFU rises with scale** (paired 3B ~11.7% →
  7B ~20.7% on a comparable reference, ≈+77%).
- **Memory:** the fused kernel uses ~**−86%** VRAM vs. our own naive O(N²) reference for the same
  workload. (Speedups like "14.5×" are vs. that naive reference only — against softmax it is parity.)
- Compatible with GQA, RoPE and MoE.

---

## 5. Methodology & provenance (why the numbers survive scrutiny)

- **Pre-registration** — hypotheses + expected bands sealed before the measuring run (no HARKing);
  the 3B/7B results landed inside the pre-registered bands.
- **Deterministic recompute** — every figure regenerated from raw JSONL; re-running reproduces it
  bit-for-bit.
- **Gradient-norm SPC / Six-Sigma** — process capability (Cpk) + defect rate as auditable statements
  (the control-limit calibration itself is withheld).
- **md5-sealed evidence chain** — raw logs checksummed; a manifest pins each public artifact to its
  hash (`tetracta-reproducibility-manifest.txt`).
- **Negative control** — a deliberately-corrupted variant correctly fails the equivalence test.

---

## 6. Honest limits (read before citing)

1. **Provisional / sub-Chinchilla** (≤236M tokens), single seed, single width-step per scale.
2. The dramatic stability gap is vs. **stabilizer-free** softmax; the fully-stabilized fair-baseline
   is **not yet run.**
3. The edge is **direction, not a law** (two points); best attributed to LR-headroom, not nonlinearity.
4. **Downstream** capability (MMLU / GSM8K) and behavioural hypotheses (calibration / abstain /
   anti-hallucination) are **not yet measured** — exactly what the 3B validation is for.

---

## 7. What is deliberately withheld (the IP)

The formula and sink design, the kernel internals (tiling / blocking), the exact learning rates and
schedule, the SPC threshold calibration, and tokenizer internals. None of these appear in any
downloadable file. **Results are shared in full; the recipe is not.**

*© 2026 Tetracta — Rational Attention™, method patent application in preparation. Provisional, diagnostic-scale evidence.*
