analysis of coupled dynamical systems to study learning #cybernetics #learningdynamics
Ger, Y., & Barak, O. (2025). Learning dynamics of RNNs in closed-loop environments. arXiv:2505.13567 [cs.LG]. http://arxiv.org/abs/2505.13567
Learning Dynamics of RNNs in Closed-Loop Environments

Recurrent neural networks (RNNs) trained on neuroscience-inspired tasks offer powerful models of brain computation. However, typical training paradigms rely on open-loop, supervised settings, whereas real-world learning unfolds in closed-loop environments. Here, we develop a mathematical theory describing the learning dynamics of linear RNNs trained in closed-loop contexts. We first demonstrate that two otherwise identical RNNs, trained in either closed- or open-loop modes, follow markedly different learning trajectories. To probe this divergence, we analytically characterize the closed-loop case, revealing distinct stages aligned with the evolution of the training loss. Specifically, we show that the learning dynamics of closed-loop RNNs, in contrast to open-loop ones, are governed by an interplay between two competing objectives: short-term policy improvement and long-term stability of the agent-environment interaction. Finally, we apply our framework to a realistic motor control task, highlighting its broader applicability. Taken together, our results underscore the importance of modeling closed-loop dynamics in a biologically plausible setting.
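To make the open- vs closed-loop distinction concrete, here is a minimal sketch of a linear RNN controller trained in the two modes: closed-loop, where its own actions drive a linear environment, and open-loop, where it only imitates a fixed teacher on a recorded trajectory. The environment, teacher gain, dimensions, and loss here are toy assumptions for illustration, not the paper's setup.

```python
# Sketch (assumed toy setup): linear RNN trained closed-loop vs open-loop.
import torch

torch.manual_seed(0)
n_env, n_hid, T, steps, lr = 2, 8, 25, 300, 1e-3

A = 0.98 * torch.eye(n_env)                          # environment: x_{t+1} = A x_t + B u_t
B = torch.tensor([[1.0], [0.5]])
K = torch.tensor([[0.4, 0.2]])                       # hypothetical "expert" feedback used open-loop

def make_params():
    W = (0.05 * torch.randn(n_hid, n_hid)).requires_grad_()
    U = (0.10 * torch.randn(n_hid, n_env)).requires_grad_()
    C = (0.10 * torch.randn(1, n_hid)).requires_grad_()
    return W, U, C

def closed_loop_loss(W, U, C):
    """Actions feed back into the state, so the loss sees the coupled agent-environment rollout."""
    x, h, loss = torch.ones(n_env), torch.zeros(n_hid), 0.0
    for _ in range(T):
        h = W @ h + U @ x                            # linear RNN
        u = C @ h                                    # action
        x = A @ x + (B @ u.unsqueeze(1)).squeeze(1)  # environment responds to the RNN's action
        loss = loss + (x ** 2).sum() + 0.01 * (u ** 2).sum()
    return loss / T

def open_loop_loss(W, U, C):
    """States come from a fixed teacher-driven rollout; the RNN only imitates the teacher's actions."""
    with torch.no_grad():                            # recorded trajectory, no feedback from the RNN
        xs, x = [], torch.ones(n_env)
        for _ in range(T):
            xs.append(x)
            u_star = -(K @ x)
            x = A @ x + (B @ u_star.unsqueeze(1)).squeeze(1)
    h, loss = torch.zeros(n_hid), 0.0
    for x in xs:
        h = W @ h + U @ x
        u = C @ h
        loss = loss + ((u + (K @ x)) ** 2).sum()     # supervised target: match the teacher's action
    return loss / T

for name, loss_fn in [("closed-loop", closed_loop_loss), ("open-loop", open_loop_loss)]:
    W, U, C = make_params()
    for _ in range(steps):
        loss = loss_fn(W, U, C)
        loss.backward()
        with torch.no_grad():
            for p in (W, U, C):
                p -= lr * p.grad
                p.grad.zero_()
    print(f"{name}: final loss {loss.item():.4f}")
```

In the closed-loop rollout the gradient passes through the coupled agent-environment dynamics, so destabilizing the interaction is penalized directly; in the open-loop case the states are frozen and only the imitation error matters. That is the structural difference behind the diverging learning trajectories the paper analyzes for linear RNNs.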


Provable Scaling Laws of Feature Emergence from Learning Dynamics of Grokking

https://arxiv.org/abs/2509.21519

#HackerNews #ProvableScalingLaws #FeatureEmergence #LearningDynamics #Grokking #AIResearch

$\mathbf{Li_2}$: A Framework on Dynamics of Feature Emergence and Delayed Generalization

While the phenomenon of grokking, i.e., delayed generalization, has been studied extensively, it remains an open question whether there is a mathematical framework characterizing what kind of features emerge, how, and under which conditions this happens during training, for complex structured inputs. We propose a novel framework, named $\mathbf{Li_2}$, that captures three key stages of the grokking behavior of 2-layer nonlinear networks: (I) lazy learning, (II) independent feature learning, and (III) interactive feature learning, characterized by the structure of the backpropagated gradient $G_F$ across layers. In (I), $G_F$ is random, and the top layer overfits to the random hidden representation. In (II), the gradient of each node (a column of $G_F$) depends only on its own activation, so each hidden node learns its representation independently from $G_F$, which, thanks to weight decay, now carries information about the target labels. Interestingly, the independent dynamics follows exactly the gradient ascent of an energy function $E$, and its local maxima are precisely the emerging features. We study whether these local-optima-induced features are generalizable, their representational power, and how they change with sample size, in group arithmetic tasks. Finally, in (III), we provably show how hidden nodes interact, and how $G_F$ changes to focus on missing features that still need to be learned. Our study sheds light on the roles played by key hyperparameters such as weight decay, learning rate, and sample size in grokking, leads to provable scaling laws of memorization and generalization, and reveals, from first principles of gradient dynamics, the underlying reason why recent optimizers such as Muon can be effective. Our analysis can be extended to multi-layer architectures.
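For orientation, here is a minimal sketch of the kind of setup this analysis targets: a 2-layer nonlinear network on a group-arithmetic task (modular addition) with weight decay. The modulus, width, optimizer, and hyperparameters below are illustrative assumptions, not the paper's experiments.

```python
# Sketch (assumed toy setup): 2-layer network on modular addition with weight decay.
import torch

torch.manual_seed(0)
p, hidden, train_frac = 23, 256, 0.5

# All pairs (a, b) labeled (a + b) mod p, split into train / test.
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p
X = torch.cat([torch.nn.functional.one_hot(pairs[:, 0], p),
               torch.nn.functional.one_hot(pairs[:, 1], p)], dim=1).float()
perm = torch.randperm(len(X))
n_train = int(train_frac * len(X))
tr, te = perm[:n_train], perm[n_train:]

# 2-layer nonlinear network: hidden features, then a linear readout (top layer).
model = torch.nn.Sequential(
    torch.nn.Linear(2 * p, hidden),
    torch.nn.ReLU(),
    torch.nn.Linear(hidden, p),
)
# Weight decay is the ingredient that, per the framework, lets label information reach G_F.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = torch.nn.CrossEntropyLoss()

def accuracy(idx):
    with torch.no_grad():
        return (model(X[idx]).argmax(-1) == labels[idx]).float().mean().item()

for step in range(20000):
    opt.zero_grad()
    loss_fn(model(X[tr]), labels[tr]).backward()
    opt.step()
    if step % 2000 == 0:
        # Memorization (train acc) typically saturates long before generalization (test acc).
        print(f"step {step:5d}  train acc {accuracy(tr):.2f}  test acc {accuracy(te):.2f}")
```

With a small training fraction and strong weight decay, train accuracy usually saturates long before test accuracy; that gap is the delayed-generalization window the three stages above are meant to describe.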
