# Tetracta — Evaluation Methodology (one-pager)

> Why our numbers survive due diligence. This describes **how we measure**, not what the mechanism
> is. No formula, kernel, recipe, threshold value, or learning rate is disclosed here.

---

### 1. Clean isolation (apples-to-apples)
Every comparison fixes the model body, optimizer, random seed, and training data, and changes
**only the attention nonlinearity**. Paired runs start from a byte-identical initialization. Any
measured difference is therefore attributable to the operator under test, not to confounds.
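
As a hedged illustration of how a byte-identical initialization could be checked before a paired run, the sketch below fingerprints two sets of initial weights and compares the digests. The parameter names, the toy values, and the choice of SHA-256 over float64 encodings are assumptions made for the example, not a description of Tetracta's tooling.

```python
import hashlib
import struct


def fingerprint(params: dict[str, list[float]]) -> str:
    """Digest over parameter names and values, visited in sorted name order."""
    h = hashlib.sha256()
    for name in sorted(params):
        values = params[name]
        h.update(name.encode("utf-8"))
        h.update(b"\x00")  # separator between the name and its payload
        # Little-endian float64 encoding, so the digest is platform independent.
        h.update(struct.pack(f"<{len(values)}d", *values))
    return h.hexdigest()


# Hypothetical initial weights for the two arms of a paired run.
baseline_init = {"attn.w": [0.1, -0.2, 0.3], "mlp.w": [0.5, 0.25]}
variant_init = {"mlp.w": [0.5, 0.25], "attn.w": [0.1, -0.2, 0.3]}
perturbed_init = {"attn.w": [0.1, -0.2, 0.3000001], "mlp.w": [0.5, 0.25]}

same = fingerprint(baseline_init) == fingerprint(variant_init)
differs = fingerprint(baseline_init) != fingerprint(perturbed_init)
print("paired init identical:", same)
print("perturbed init detected:", differs)
```

A paired run would only be launched when the two digests match; any mismatch means the comparison is no longer isolating the operator.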

### 2. Pre-registration (no HARKing)
Hypotheses and expected outcome bands are **sealed before** the measuring run. We report the
prediction next to the result, so a reader can see we did not retrofit the story to the data.
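
One way to make "sealed before" checkable is a hash commitment: serialize the prediction canonically, publish only its digest ahead of the run, and reveal the full prediction afterwards. The sketch below assumes a hypothetical prediction record with made-up field names and band values; it illustrates the idea rather than the sealing procedure actually used.

```python
import hashlib
import json


def seal(prediction: dict) -> str:
    """Digest of a canonical JSON encoding (sorted keys, fixed separators)."""
    canonical = json.dumps(prediction, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(prediction: dict, sealed_digest: str) -> bool:
    """True only if the revealed prediction matches what was sealed."""
    return seal(prediction) == sealed_digest


def in_band(result: float, band: list[float]) -> bool:
    """True if the measured result falls inside the pre-registered band."""
    low, high = band
    return low <= result <= high


# Hypothetical pre-registered hypothesis; every value here is illustrative.
prediction = {"metric": "eval_loss_delta", "band": [-0.05, -0.01]}
sealed_digest = seal(prediction)  # published before the measuring run

# After the run: reveal the prediction and report it next to the result.
honest = verify(prediction, sealed_digest)
retrofitted = verify({"metric": "eval_loss_delta", "band": [-0.2, 0.0]}, sealed_digest)
print("prediction matches seal:", honest)
print("retrofitted band matches seal:", retrofitted)
```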

### 3. Matched-step, fixed evaluation
Curves are compared at the same training step on a fixed, deterministic held-out evaluation set,
so the evaluation itself adds no run-to-run noise. This avoids the classic "train/eval mismatch"
artifact, in which differing steps or evaluation conditions manufacture a fake edge.
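
A minimal sketch of a matched-step comparison, assuming each run's evaluation log has already been parsed into a hypothetical `step -> eval loss` mapping: only steps present in both runs are compared, so neither curve is credited with a checkpoint the other lacks. The loss values are invented for illustration.

```python
def matched_step_deltas(
    baseline: dict[int, float], variant: dict[int, float]
) -> dict[int, float]:
    """Variant-minus-baseline eval loss at every step both runs evaluated."""
    shared_steps = sorted(baseline.keys() & variant.keys())
    return {step: variant[step] - baseline[step] for step in shared_steps}


# Hypothetical eval losses; step 300 and step 250 are unmatched and are dropped.
baseline_eval = {100: 3.0, 200: 2.5, 300: 2.0}
variant_eval = {100: 2.5, 200: 2.0, 250: 1.9}

deltas = matched_step_deltas(baseline_eval, variant_eval)
for step, delta in deltas.items():
    print(f"step {step}: delta = {delta:+.3f}")
```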

### 4. Gradient-norm SPC (Six-Sigma)
Training-process health is treated like a manufacturing line under statistical process control
(SPC): we compute **process capability (Cpk)** and **excursion (defect) rate** on the gradient-norm
signal, against a fixed control limit.
This turns "did the run stay healthy?" into an auditable, quantitative statement. (The control-limit
calibration itself is withheld.)
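
For readers unfamiliar with the two statistics, the sketch below computes them under stated assumptions: a one-sided upper limit, the textbook upper capability index Cpk = (USL - mean) / (3 * sigma), and an excursion rate defined as the fraction of steps above the limit. The gradient-norm samples and the limit of 3.0 are placeholder values chosen for the example; they are not the withheld calibration.

```python
import statistics


def cpk_upper(samples: list[float], upper_limit: float) -> float:
    """One-sided capability index: distance from mean to limit in 3-sigma units."""
    mean = statistics.fmean(samples)
    sigma = statistics.stdev(samples)  # sample standard deviation
    return (upper_limit - mean) / (3 * sigma)


def excursion_rate(samples: list[float], upper_limit: float) -> float:
    """Fraction of samples that exceed the control limit."""
    return sum(x > upper_limit for x in samples) / len(samples)


# Hypothetical per-step gradient norms and a placeholder control limit.
grad_norms = [1.0, 1.2, 0.9, 1.1, 1.0, 3.5, 1.0, 0.8]
CONTROL_LIMIT = 3.0  # illustrative only

print("Cpk:", round(cpk_upper(grad_norms, CONTROL_LIMIT), 3))
print("excursion rate:", excursion_rate(grad_norms, CONTROL_LIMIT))
```

A higher Cpk means the signal sits further below the limit relative to its spread; the excursion rate counts how often it actually crossed.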

### 5. Deterministic recompute
Every headline number is regenerated from the raw training logs by a single extractor script;
re-running it reproduces the table **bit-for-bit**. No number is copied by hand or typed from memory.
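
The property above depends on the extractor being a pure function of the log bytes: fixed parsing, fixed row order, fixed number formatting. The sketch below shows that shape for a hypothetical `step=... eval_loss=...` log format, which is an assumption for illustration and not the real log schema.

```python
import hashlib
import re

# Hypothetical log line format, assumed for this example.
LINE = re.compile(r"step=(\d+)\s+eval_loss=([0-9.]+)")


def extract_table(log_text: str) -> str:
    """Parse matching lines, sort by step, and render with fixed precision."""
    rows = []
    for line in log_text.splitlines():
        match = LINE.search(line)
        if match:
            rows.append((int(match.group(1)), float(match.group(2))))
    rows.sort()  # row order never depends on the order lines were logged
    return "\n".join(f"{step}\t{loss:.6f}" for step, loss in rows) + "\n"


log = "step=200 eval_loss=2.31\nunrelated line\nstep=100 eval_loss=2.75\n"
table = extract_table(log)

# Two independent extractions must yield byte-identical output.
digest_1 = hashlib.md5(table.encode("utf-8")).hexdigest()
digest_2 = hashlib.md5(extract_table(log).encode("utf-8")).hexdigest()
print(table, end="")
print("reproducible:", digest_1 == digest_2)
```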

### 6. md5-sealed evidence chain
Raw logs are checksummed and archived; a manifest pins each artifact to its hash. The evidence
survives even if the original compute is torn down. A deliberately wrong control (a corrupted
variant) is included and correctly **fails** the equivalence test, which shows the test is not a
rubber stamp.
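
A manifest of this kind can be sketched in a few lines. The example below builds an md5 manifest over a temporary directory, verifies it, then alters one file to show that verification flags the change. The file name and log contents are invented for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path: Path) -> str:
    """md5 of a file, read in chunks so large logs do not need to fit in memory."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its md5 digest."""
    return {
        p.relative_to(root).as_posix(): md5_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return artifacts that are missing or whose digest no longer matches."""
    failures = []
    for rel, digest in manifest.items():
        path = root / rel
        if not path.is_file() or md5_of(path) != digest:
            failures.append(rel)
    return failures


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "run_a.log").write_text("step=1 loss=2.0\n")  # hypothetical raw log
    manifest = build_manifest(root)
    clean = verify_manifest(root, manifest)

    (root / "run_a.log").write_text("step=1 loss=1.0\n")  # simulate an altered log
    tampered = verify_manifest(root, manifest)

print("failures before edit:", clean)
print("failures after edit:", tampered)
```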

---

### Honest limits we state up front
- Numbers are **sub-Chinchilla / provisional**, single-seed, single width-step.
- The dramatic stability result is vs. **stabilizer-free** softmax; a fully-stabilized fair-baseline
  test is on the roadmap and not yet run.
- Downstream capability (MMLU / GSM8K) and behavioural hypotheses (calibration / abstention) are
  **not yet measured**; they are exactly what we are raising compute to test.

> This methodology is itself part of the product: the same gradient-norm SPC gate can live-monitor a
> customer's own training run. © 2026 Tetracta.