All (54)Axioms (13)Measurements (32)Derivations (9) directly from source observed/measured derived from other factsCompetition rules(10)−A1Artifact size = code bytes + compressed model bytes. All counted code must live in train_gpt.py. Cap is decimal 16MB (16,000,000 bytes, not 16 MiB).+A2Submissions must reproducibly run in under 10 minutes on 8x H100 SXM GPUs.+A3Evaluation must complete in under 10 minutes on 8x H100. No external downloads, training dataset access, or network calls allowed during evaluation, unless you pay for those bits in the <16MB limit.+A3bIssue #1017 (unofficial community field guide, not OpenAI rules) defines: exactly one left-to-right pass, no rescoring, no retrospective revision of earlier probabilities.+A4Metric: bits per byte (BPB) on the fixed first-50k-document FineWeb validation set. Lower is better.+A5"You are only allowed to test-time train on validation set tokens you've already evaluated your model on, since those tokens have already been graded!"+A6Issue #1017 Condition 3: "The current symbol may not influence its own assigned probability, whether directly or indirectly through same-symbol adaptation, self-exclusion, or any equivalent mechanism." The FAQ's A5 quote partially covers this officially; the full formulation is unofficial.+A7New SOTA must beat the existing by at least 0.005 nats, with enough run logs to show p < 0.01. This requirement is waived for submissions that improve speed through systems optimization without changing the ML.+A8"You're free to import any package or library you want." Any architecture, any method.+A9"The challenge runs from March 18th to April 30th."+OpenAI wishlist(3)−A10Wishlist (unchecked items): JEPA, text diffusion, H-net tokenization, universal transformer, megakernels, state-space models / E2E TTT / super long context for evaluation or training, learning adapters on random linear maps.+A11"We encourage competitors to push the bounds of evaluation methods as aggressively as with training methods." (This sentence appears immediately after the 10-minute and no-external-access eval constraints.)+A12"We strongly encourage participants to submit implementations for weird or out-of-the-box ideas, in-progress or unoptimized solutions."+Metric(3)−M1BPB averages the encoding cost per byte using the model's predictions. The baseline code computes it as (val_loss / ln(2)) * (token_count / byte_count).+M2Training loss is nats per token. 1 nat = 1/ln(2) = 1.4427 bits. Baseline BPB conversion: bits_per_token = val_loss / ln(2), then BPB = bits_per_token * (token_count / byte_count).+M3Leaderboard ranks by BPB (tokenizer-agnostic). Merge threshold is in nats (0.005 nats at p < 0.01).+Baseline(3)−M4Naive baseline: ~17M params, 9 layers, 512d, SP1024, int8 quantization, zlib compression. Scores 1.2244 BPB.+M5Baseline trains ~7.2B of ~12.8B available tokens (about 56%). 13,780 steps of 524,288 tokens each.+M6Baseline eval (non-sliding, int8+zlib roundtrip) takes 1401ms (~1.4 seconds). The baseline does not use sliding-window eval.+External references(5)−M7Deletang et al.: Chinchilla 70B achieved 8.3% raw compression rate on enwik9 (1GB Wikipedia). Model weights excluded from compressed size. (Table 1, page 6)+M8Nacrith: SmolLM2-135M neural network combined with an online predictor ensemble (including N-gram model orders 1-4, adaptive context mixer, adaptive log-space bias head, confidence-based LLM skip). Scored 0.9389 bpb on enwik8 (100MB Wikipedia). Model weights (~500MB) excluded. (Table 1, page 3)+M9Kumar et al.: lower precision reduces effective parameter count. Gains from more bits saturate around 6-7 bits per weight. Authors state: "We emphasize our numerical constants are unlikely to be useful... rather, the trends we identify are the key findings." (Figure 1, Appendix K, page 31)+M10ParetoQ: ternary, 2-bit, and 3-bit quantization "generally exceeds" 4-bit on their accuracy-size Pareto frontier. Tested on MobileLLM (125M-1.5B) and LLaMA-3 (1B-8B). (Abstract, page 1)+M11GPTQ post-training quantization at 3 bits: OPT-125M perplexity goes from 27.65 (full precision) to 53.85. (Table 3, page 7)+Current SOTA (Apr 25)(4)−M12Merged: PR #1184 — Scylla tokenizer (998 tokens, TokenMonster-derived per PR #1143) + Full Hessian GPTQ (Cholesky error compensation) + XSA on all 11 layers + FlashAttention 3. 0.9485 BPB (3-seed mean, std 0.0008). Merged 2026-04-23.+M13Open: PR #1795 — SP4096 base + byte-level PPM-D mixture at eval. Original body: order 5, 0.95165 BPB (illegal gate). Updated in comment (commit cb5ad95): order 4, 1.01252 BPB (legal gate, strict-legal). Legality of per-byte online predictor pending.+M14Open: PR #1791 — GatedDeltaNet with flash linear attention, KV sharing stride=2, no TTT. 1.0339 BPB (3-seed mean, std 0.0012). Reproduction of PR #1687.+M15Previous merged SOTA: PR #1493 — SP8192, 3-layer depth recurrence (L3-5, each 3x, 17 virtual from 11 physical), parallel residuals L7+, QK-Gain 5.25, legal score-first TTT (SGD lr=0.005, 3 epochs). 1.0810 BPB (3-seed mean, std 0.0002). Merged 2026-04-09.+Architecture(5)−M16PR #1394: 11 transformer layers, 512d, 8 heads / 4 KV heads (GQA), 4x MLP, 8192-token vocabulary, tied embeddings. Parameter count is approximately 35.9M (inference from architecture dimensions, not directly stated in PR).+M17PR #1394 quantization: matrix weights int6 (GPTQ with Hessian, SDClip k=12.85), embeddings int8 (k=20). Brotli compression. Artifacts range 15,983,318 to 15,988,983 bytes across seeds.+M18PR #1394 analysis: H(q) ≈ b - log2(k) + constant. "The standard deviation of a matrix correlated very strongly (R²=0.995) with the compression ratio of that matrix under a fixed clip width." Wider clip = lower entropy.+M19Optimizers: Muon handles 2D matrix parameters (attention, MLP) using Newton-Schulz iteration on the momentum buffer. AdamW handles embeddings and scalar parameters.+M20PR #1493: depth recurrence — layers 3-5 each run 3x (17 virtual layers from 11 physical). Activates at 35% of training.+Eval approaches(4)−M21PR #1241: text diffusion scored val_var_bpb 0.9901 (sub-1.0). "Non-record reason: Trained on 1x AWS A10G (1267 min)." Far over the 10-minute limit.+M22PR #1222: eval-time adapters gave -0.079 BPB on a 1.3696 BPB base. PR #1272 tested the same approach on a ~1.11 BPB base: improvement was -0.00009 (effectively zero). PR #1272 title: "Comprehensive Negative Results."+M23PR #1493 TTT: SGD 3 epochs (lr=0.005, momentum 0.9, cosine decay) on already-scored chunks. Sliding BPB 1.0827 → TTT BPB 1.0810 (delta -0.0017).+M24PR #1795 legal version (comment cb5ad95, order 4): NN token-BPB 1.09764, mix byte-BPB 1.01252, delta -0.07435. Original body (order 5, illegal gate): NN token-BPB 1.09776, NN byte-BPB 1.08699, mix byte-BPB 0.95165, delta -0.13535.+Dead PRs(4)−M25PR #614: min-NLL epoch selection — tracks minimum NLL across TTT epochs and uses the best epoch's scores. Reviewer: "min-NLL scheme leaks information... same as training on the val set."+M26PR #720: hashed n-gram caches. Reviewer comment: "look ahead to the target token to mix probabilities and therefore leak eval tokens."+M27PR #518: adapt-then-score TTT. Reviewer: "trains on the validation set by reporting the score on a doc after its weights have adapted to it."+M28PR #606: GPTQ calibration done during eval time (after full 10-min training) using training data. Reviewer comment: "it's disallowed to use training data in any way during evaluation, which your GPTQ calibration is currently doing."+Code lineage(2)−M29Code lineage: nanoGPT (Karpathy) → llm.c (GPT-2 reproduced in ~45 min on 8xH100) → modded-nanogpt (Keller Jordan, speedrun to 1.426 min) → Parameter Golf baseline (OpenAI adapted modded-nanogpt).+M30Muon was introduced as world record #3 in the modded-nanogpt speedrun. It orthogonalizes the momentum buffer (not raw gradients) via Newton-Schulz iteration.+Tokenizer(1)−M31PR #1184 merged with Scylla (998 tokens). PR #1143 body describes Scylla as "a custom TokenMonster-derived tokenizer" with retokenized FineWeb and per-token byte-length metadata.+Complementary training(1)−M32PR #803 (open): complementary training — downweight tokens predictable by bigram during training (COMPLEMENT_ALPHA=0.5). Model specializes on statistically-hard tokens. Backoff N-gram mixer orders 2-10 at eval. Reports 0.4416 BPB (3-seed).+Derivations(7)−D1Sub-1.0 BPB has been achieved and merged under competition constraints (PR #1184: 0.9485 BPB, merged).+D2PR #1184 (Scylla, 998 tokens, 0.9485 BPB) and PR #1493 (SP8192, 1.0810 BPB) differ in both tokenizer and training stack, so the 0.13 BPB gap cannot be attributed to the tokenizer alone.+D3GatedDeltaNet (PR #1791, 1.0339 BPB, no TTT) scores lower BPB than the previous merged transformer-based SOTA (PR #1493, 1.0810 BPB, with TTT). GDN is a linear-attention architecture, not a standard transformer.+D4PPM eval-time mixture (PR #1795) drops NN byte-BPB from 1.087 to 1.013 (delta -0.074). Complementary training (PRs #803/#1033) goes further but their N-gram scorer violates Condition 2 (evaluates probability at realized target only, no full distribution).+D5M4: baseline ~17M params, int8+zlib. M17: PR #1394 ~35.9M params (inferred), int6+brotli, artifact ~15.98MB. Both fit under the 16MB cap (A1) with roughly double the parameters in the later PR.+D6PR #1394 found R²=0.995 between per-matrix standard deviation and compression ratio (M18). Higher weight decay reduces std, which should improve compression. PR #1218 confirmed: WD 0.04 → 0.085 improved compression.+D7Baseline eval takes ~1.4 seconds (non-sliding, M6). SOTA stacks add sliding-window eval and TTT which use more of the 10-minute budget (M23). OpenAI encourages pushing eval methods (A11).+Moonshot components(2)−D8Competition eval code recomputes all 2048 tokens on every stride-64 step. Standard KV-caching (computing only the 64 new tokens against cached K,V states) is not implemented. This is a known optimization used in vLLM/TGI but not applied here.+D9These untested combinations exist: Scylla + GDN, Scylla + PPM, GDN + PPM, legal complementary training + any base, CERWU + any base. Verified by scanning all competition PRs (100+ checked, zero matches for these combinations).+