Brainstorming page. Ideas I came up with, found in papers, or spotted in other people's competition PRs. None of these are tested unless the legend says otherwise. If anything here actually works, it gets its own blog post with a proper write-up.
Legend: 0 already in our code · 0 implemented, not yet tested · 0 tested, results unclear · 0 tested, made things worse · 89 not yet tested by us

Architecture: 19 untried
JEPA untried
Text diffusion untried
H-net tokenisation untried
Universal transformer untried
State-space models / Mamba untried
HGRN2 hybrid untried
GDN (Gated DeltaNet) untried
384d x 20L untried
Fractal architecture untried
Tree architecture untried
DEQ untried
Linear attention / SLA untried
Differential attention untried
SmearGate untried
Attention output gate untried
Megakernels untried
Depth recurrence untried
FlashAttention 3 untried
Recursive transformers untried

Training: 17 untried
Multi-token prediction untried
Progressive growing untried
Curriculum learning untried
Smart data selection untried
MiLe loss untried
z-loss untried
Hebbian learning untried
Self-distillation untried
Population-based training untried
LeakyReLU slope tune untried
Sequence warmup untried
max-autotune compile untried
CUDA graphs + dataset preload untried
Compute-optimal QAT timing untried
Coprime multi-shard loader untried
FP8 training untried
Newton-Muon untried

Quantisation & compression: 14 untried
Water-filling bit allocation untried
CERWU untried
SeedLM untried
Codebook + Huffman untried
rANS entropy coding untried
GPTQ block size B=64 untried
Byte-shuffle stride 3-4 untried
Higher weight decay untried
Fisher-weighted quantisation untried
Prune-then-quantize untried
Working QAT via forward hooks untried
EfficientQAT untried
Fourier/wavelet compression untried
BitNet/ternary at 30M untried

Eval-time: 11 untried
Continuous pretraining during eval untried
Nacrith logit bias untried
qTTT with momentum untried
Three-predictor system untried
State propagation untried
Adaptive stride eval untried
Prequential eval untried
CPU n-gram during eval untried
Multi-pass scoring untried
Prime MLPs untried
SLOT untried

Parameter efficiency: 10 untried
Monarch matrices untried
Kronecker factorisation untried
TT embedding untried
Weight folding untried
DNA encoding untried
Hypernetwork untried
Sparse strategic weights untried
BatchEnsemble untried
Random linear map adapters untried
Low-rank / LoRA untried

Other ideas: 18 untried
MoE untried
Knowledge distillation untried
TrigramHash untried
Model soup untried
Two separate models untried
Pipeline parallelism untried
N-gram cache untried
Anti-layer removal untried
Complementary training untried
GLU on attention values untried
Catalytic residuals untried
PEER untried
Engram n-gram hashing untried
Resonance initialisation untried
Entropy-penalised training untried
Meta-learn for eval adaptation untried
Casefold tokenizer untried
WaveletGPT untried