On grokking: “probably the most important phenomenon in AI that almost nobody talks about”
How I caught a Transformer cheating: grokking, math, and Mechanistic Interpretability
The grokking phenomenon and Mechanistic Interpretability are leading research trends at labs on the level of OpenAI and Anthropic. I decided to get hands-on with these concepts at the tensor level. The goal seemed trivial: make a custom micro-Transformer (just 1M parameters) learn basic arithmetic from scratch. Instead of a mathematical genius, however, I got a lazy cheater. This article is an engineering detective story about how neural networks try to fool us (Specification Gaming) and how dissecting the attention matrices helps catch them red-handed.
https://habr.com/ru/articles/1008656/
#machine_learning #transformers #grokking #mechanistic_interpretability #pytorch #specification_gaming #ai_alignment
Provable Scaling Laws of Feature Emergence from Learning Dynamics of Grokking
https://arxiv.org/abs/2509.21519
#HackerNews #ProvableScalingLaws #FeatureEmergence #LearningDynamics #Grokking #AIResearch
While the phenomenon of grokking, i.e., delayed generalization, has been studied extensively, it remains an open question whether there is a mathematical framework to characterize what kind of features emerge, and how and under which conditions they emerge during training, for complex structured inputs. We propose a novel framework, named $\mathbf{Li_2}$, that captures three key stages for the grokking behavior of 2-layer nonlinear networks: (I) Lazy learning, (II) independent feature learning and (III) interactive feature learning, characterized by the structure of backpropagated gradient $G_F$ across layers. In (I), $G_F$ is random, and top layer overfits to random hidden representation. In (II), the gradient of each node (column of $G_F$) only depends on its own activation, and thus each hidden node learns its representation independently from $G_F$, which now carries information about target labels, thanks to weight decay. Interestingly, the independent dynamics follows exactly the gradient ascent of an energy function $E$, and its local maxima are precisely the emerging features. We study whether these local-optima induced features are generalizable, their representation power, and how they change with sample size, in group arithmetic tasks. Finally, in (III), we provably show how hidden nodes interact, and how $G_F$ changes to focus on missing features that need to be learned. Our study sheds light on the roles played by key hyperparameters such as weight decay, learning rate and sample sizes in grokking, leads to provable scaling laws of memorization and generalization, and reveals the underlying reason why recent optimizers such as Muon can be effective, from the first principles of gradient dynamics. Our analysis can be extended to multi-layer architectures.
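For readers who want something concrete behind "group arithmetic tasks": a common example of such a task is modular addition, where inputs are pairs (a, b) and the label is (a + b) mod p. The sketch below is only an illustration under that assumption. The modulus 97, the 30% train fraction, and the function name `modular_addition_dataset` are hypothetical choices, not taken from the paper.

```python
import random

def modular_addition_dataset(p, train_fraction, seed=0):
    """Enumerate every pair (a, b) with label (a + b) mod p,
    shuffle deterministically, and split into train / held-out sets."""
    examples = [((a, b), (a + b) % p) for a in range(p) for b in range(p)]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(examples)
    n_train = int(train_fraction * len(examples))
    return examples[:n_train], examples[n_train:]

# Hypothetical settings: p = 97, 30% of all pairs used for training.
train, test = modular_addition_dataset(p=97, train_fraction=0.3)
print(len(train), len(test))
```

Because the full input space is enumerated, the held-out set measures generalization to pairs the model has never seen, which is the quantity that jumps late in a grokking run.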
e509 — Maverick and Marbles
e509 with Michael and Michael – stories and discussion all around #AI, #LLMs, #llamas, generated #Quake, #grokking, #generalization and much more.
Audio: https://media.blubrry.com/gamesatwork/op3.dev/e,pg=6e00562f-0386-5985-9c2c-26822923720d/gamesatwork.biz/wp-content/uploads/2025/04/E509.mp3 (Duration: 32:10, 44.8MB)
https://gamesatwork.biz/2025/04/14/e509-maverick-and-marbles/
Grokking at the Edge of Numerical Stability
https://arxiv.org/abs/2501.04697
https://old.reddit.com/r/MachineLearning/comments/1i34keg/grokking_at_the_edge_of_numerical_stability
https://en.wikipedia.org/wiki/Grokking_(machine_learning)
* sudden generalization after prolonged overfitting
* a massively overtrained neural network can acquire "emergent", above-expected performance and unexpected abilities
* an unexpected, accidental finding
* the mechanisms are starting to be unraveled
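To make "sudden generalization after prolonged overfitting" measurable, one simple option is to compare when training accuracy and validation accuracy first cross a threshold. This is a minimal sketch, not a standard definition: the 0.99 threshold, the helper names, and the accuracy curves are all made up for illustration and are not measured results.

```python
def first_step_reaching(accuracies, threshold):
    """Return the index of the first logged accuracy >= threshold, or None."""
    for step, acc in enumerate(accuracies):
        if acc >= threshold:
            return step
    return None

def generalization_delay(train_acc, val_acc, threshold=0.99):
    """Number of logged checkpoints between fitting the training set
    and reaching the same accuracy on validation data."""
    fit_step = first_step_reaching(train_acc, threshold)
    gen_step = first_step_reaching(val_acc, threshold)
    if fit_step is None or gen_step is None:
        return None  # one of the curves never reached the threshold
    return gen_step - fit_step

# Illustrative, invented curves: training saturates early, validation much later.
train_acc = [0.2, 0.7, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
val_acc = [0.1, 0.1, 0.1, 0.1, 0.2, 0.6, 1.0, 1.0]
delay = generalization_delay(train_acc, val_acc)
print(delay)
```

The indices are positions in the logged curve, so the delay is in units of logging checkpoints rather than optimizer steps.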
Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
https://arxiv.org/abs/2405.15071
https://news.ycombinator.com/item?id=40495149
Abstract of "Grokking at the Edge of Numerical Stability" (https://arxiv.org/abs/2501.04697):
Grokking, the sudden generalization that occurs after prolonged overfitting, is a surprising phenomenon challenging our understanding of deep learning. Although significant progress has been made in understanding grokking, the reasons behind the delayed generalization and its dependence on regularization remain unclear. In this work, we argue that without regularization, grokking tasks push models to the edge of numerical stability, introducing floating point errors in the Softmax function, which we refer to as Softmax Collapse (SC). We demonstrate that SC prevents grokking and that mitigating SC enables grokking without regularization. Investigating the root cause of SC, we find that beyond the point of overfitting, the gradients strongly align with what we call the naïve loss minimization (NLM) direction. This component of the gradient does not alter the model's predictions but decreases the loss by scaling the logits, typically by scaling the weights along their current direction. We show that this scaling of the logits explains the delay in generalization characteristic of grokking and eventually leads to SC, halting further learning. To validate our hypotheses, we introduce two key contributions that address the challenges in grokking tasks: StableMax, a new activation function that prevents SC and enables grokking without regularization, and $\perp$Grad, a training algorithm that promotes quick generalization in grokking tasks by preventing NLM altogether. These contributions provide new insights into grokking, elucidating its delayed generalization, reliance on regularization, and the effectiveness of existing grokking-inducing methods. Code for this paper is available at https://github.com/LucasPrietoAl/grokking-at-the-edge-of-numerical-stability.
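The two effects described in the abstract above can be reproduced in a few lines of plain Python: scaling the logits lowers the cross-entropy loss without changing the prediction, and at a large enough scale the small exponentials are absorbed by floating-point rounding, so the gradient for the correct class becomes exactly zero. This is a rough sketch using Python's float64 and invented logits, not the paper's experiment. The paper's precise definition of Softmax Collapse should be taken from the paper and its repository.

```python
import math

def softmax(logits):
    """Max-shifted softmax computed in float64 (Python floats)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def loss_and_correct_class_grad(logits, target):
    """Cross-entropy loss and its gradient w.r.t. the target logit,
    which for softmax + cross-entropy equals p[target] - 1."""
    p = softmax(logits)
    return -math.log(p[target]), p[target] - 1.0

# Invented logits; class 0 is already the predicted class at every scale.
base_logits = [2.0, 1.0, 0.5]
results = {}
for scale in (1, 10, 100):
    scaled = [scale * z for z in base_logits]
    results[scale] = loss_and_correct_class_grad(scaled, target=0)
    print(scale, results[scale])
```

At scale 100 the terms exp(-100) and exp(-150) vanish when added to 1.0, so the probability of the correct class rounds to exactly 1.0 and its gradient is exactly 0.0. In lower precision such as float32, or in a real model, the scale at which this happens will differ.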
#Grokking: #researchers have identified a strange phenomenon: after a long period of #unsuccessful #training, #artificial #intelligence (#IA #AI) suddenly produces results.