# Tetracta Rational Attention — Loss Edge & Scale Direction (sanitized summary)

**Status:** diagnostic-scale, pre-registered, reproducible. **Provisional** (sub-Chinchilla,
≤236M tokens, single-seed). Results only — **no learning rates, schedules, or recipe disclosed.**

---

## Best-vs-best held-out loss (BPB, lower is better)

Each configuration was trained at **its own optimal learning rate** (values withheld as a trade secret),
with identical body, optimizer, seed, and data; only the attention nonlinearity differs.
"Edge" = rational − softmax, so a negative edge favours rational.

| Scale | Rational BPB | Softmax (tuned) BPB | Edge (rational − softmax) | Relative to softmax BPB |
|:-----:|:------------:|:-------------------:|:-------------------------:|:-----------------------:|
| 3B    | 1.1368       | 1.1514              | **−0.0146**               | ~ −1.3%                 |
| 7B    | 1.1223       | 1.1395              | **−0.0172**               | ~ −1.5%                 |

The edge is **small but direction-consistent**, and slightly larger at 7B than at 3B
(−0.0172 vs. −0.0146 BPB, ≈ +18% in magnitude). Pre-registered prediction band held; raw logs md5-sealed.
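
The edge and relative figures can be re-derived from the four reported BPB values. The sketch below is only a sanity check on the arithmetic, not part of the evaluation pipeline; the function names are illustrative, and the per-byte perplexity conversion assumes perplexity = 2^BPB, which this summary does not itself state.

```python
# Reported held-out BPB values, copied from the table above.
results = {
    "3B": {"rational": 1.1368, "softmax": 1.1514},
    "7B": {"rational": 1.1223, "softmax": 1.1395},
}


def edge(row):
    """Absolute edge in BPB: rational minus softmax (negative favours rational)."""
    return row["rational"] - row["softmax"]


def relative_bpb(row):
    """Edge expressed as a fraction of the softmax baseline BPB."""
    return edge(row) / row["softmax"]


def per_byte_ppl_change(row):
    """Fractional change in per-byte perplexity.

    ASSUMPTION: per-byte perplexity = 2 ** BPB (not stated in the source).
    """
    return 2.0 ** edge(row) - 1.0


for scale, row in results.items():
    print(
        f"{scale}: edge={edge(row):+.4f} BPB, "
        f"relative BPB={relative_bpb(row):+.2%}, "
        f"per-byte PPL={per_byte_ppl_change(row):+.2%}"
    )

# How much larger the edge is at 7B than at 3B, as a fraction of the 3B edge.
growth = edge(results["7B"]) / edge(results["3B"]) - 1.0
print(f"edge growth 3B -> 7B: {growth:+.1%}")
```

Measured against the softmax BPB baseline, the edge works out to roughly −1.3% (3B) and −1.5% (7B). Under the 2^BPB assumption the per-byte perplexity change is somewhat smaller, about −1.0% and −1.2%.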

## How to read this (honest framing — important)

- **Direction, not a law.** Two width points (3B, 7B) establish a *direction*, not a scaling law.
  An earlier 1B point sits in a different regime and is **excluded** from this clean comparison.
- The edge is best attributed to **learning-rate headroom** (rational tolerates a higher stable
  LR), not a separate "nonlinearity magic." At *equal* LR with both arms stabilized, the two are
  at parity. We therefore sell **stability + capability-per-dollar**, not a quality leap.
- **Provisional.** All points are sub-Chinchilla (≤236M tokens), single-seed. The 2T-token regime
  may differ. **Downstream capability (MMLU / GSM8K / long-context) has not been measured.**

## The decisive next step (the ask)

Compute + capital to run a **3B validation at a Chinchilla-sufficient token budget**. We expect it to
(i) confirm the edge over softmax and show whether it comes from the nonlinearity or from LR headroom,
(ii) test our downstream and behavioural hypotheses (calibration / abstain / anti-hallucination) for the
first time, where we expect a difference but have not yet been able to measure one, and
(iii) turn provisional signal into production-grade evidence.
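
For a sense of the gap between the current runs and a Chinchilla-sufficient budget, here is a back-of-the-envelope sketch. It assumes the commonly cited rule of thumb of roughly 20 training tokens per parameter and treats "3B" and "7B" as exactly 3e9 and 7e9 parameters. Neither figure comes from this summary, and the actual target budget for the validation run is not stated here.

```python
# ASSUMPTION: ~20 tokens per parameter as the compute-optimal rule of thumb.
# The source does not specify the ratio or the exact parameter counts.
TOKENS_PER_PARAM = 20


def chinchilla_tokens(n_params, tokens_per_param=TOKENS_PER_PARAM):
    """Approximate compute-optimal token budget for a model of n_params."""
    return n_params * tokens_per_param


def fraction_of_budget(tokens_seen, n_params):
    """Share of the rule-of-thumb budget covered by tokens_seen."""
    return tokens_seen / chinchilla_tokens(n_params)


current_tokens = 236e6  # upper bound on tokens seen, as reported in this summary
nominal_params = {"3B": 3e9, "7B": 7e9}  # nominal sizes, assumed exact

for name, n_params in nominal_params.items():
    budget = chinchilla_tokens(n_params)
    share = fraction_of_budget(current_tokens, n_params)
    print(f"{name}: budget ~{budget / 1e9:.0f}B tokens, current runs cover {share:.2%}")
```

Under these assumptions a 3B run would need on the order of 60B tokens, so the ≤236M-token runs cover well under 1% of that budget.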

> A representative convergence curve (3B, held-out BPB vs. step) is provided in
> `tetracta-scale-curve.csv`. Mechanism / recipe not disclosed.
> © 2026 Tetracta — Rational Attention™, method patent application in preparation.