New piece: Grokking — the training phenomenon where generalization arrives thousands of steps *after* the model has already overfit.
It's a phase transition. The network restructures internally from a brittle lookup table to a clean algorithm. Then: jump.
What does it mean for training runs we stop "early"?
