Why Do You Grok? A Theoretical Analysis of Grokking Modular Addition

(2407.12332)
Published Jul 17, 2024 in cs.LG and stat.ML

Abstract

We present a theoretical explanation of the "grokking" phenomenon, where a model generalizes long after overfitting, for the originally-studied problem of modular addition. First, we show that early in gradient descent, when the "kernel regime" approximately holds, no permutation-equivariant model can achieve small population error on modular addition unless it sees at least a constant fraction of all possible data points. Eventually, however, models escape the kernel regime. We show that two-layer quadratic networks that achieve zero training loss with bounded $\ell_\infty$ norm generalize well with substantially fewer training points, and further show such networks exist and can be found by gradient descent with small $\ell_\infty$ regularization. We further provide empirical evidence that these networks, as well as simple Transformers, leave the kernel regime only after initially overfitting. Taken together, our results strongly support the case for grokking as a consequence of the transition from kernel-like behavior to limiting behavior of gradient descent on deep networks.

Figure: Empirical study of grokking in classification with tiny $\ell_\infty$ regularization across different problem dimensions.

Overview

  • The paper investigates the grokking phenomenon in neural networks, specifically focusing on the task of modular addition, and explains how neural networks transition from the kernel regime to rich feature learning.

  • It presents theoretical and empirical evidence that weak $\ell_\infty$ regularization helps neural networks generalize more effectively after initially achieving low training error.

  • The study offers theoretical bounds on network generalization, explores the practical implications of these findings, and suggests future research directions to optimize neural network training.

A Theoretical Analysis of Grokking in Modular Addition

The paper, titled "Why Do You Grok? A Theoretical Analysis of Grokking Modular Addition," provides a theoretical explanation for the phenomenon of grokking in neural networks, first observed by Power et al. (2022). Grokking describes a situation in which a neural network's generalization performance improves drastically long after it has achieved low training error. The paper focuses on the task of modular addition, a setting whose analysis carries broader implications for understanding similar behavior in other algorithmic tasks.
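
To make the setting concrete, the sketch below (a minimal illustration, not the authors' code) constructs the modular addition dataset: all $p^2$ pairs $(a, b)$ labeled with $(a + b) \bmod p$, one-hot encoded, with only a fraction of the pairs used for training. The choice $p = 97$, the 30% training fraction, and the concatenated one-hot encoding are assumptions made for illustration.

```python
import numpy as np

def modular_addition_dataset(p=97, train_frac=0.3, seed=0):
    """Build the full (a, b) -> (a + b) mod p table and split it.

    Grokking is typically studied with only a fraction of the p * p
    possible pairs available for training.
    """
    rng = np.random.default_rng(seed)
    pairs = np.array([(a, b) for a in range(p) for b in range(p)])
    labels = (pairs[:, 0] + pairs[:, 1]) % p

    # One-hot encode both inputs and concatenate them (2p-dimensional inputs).
    x = np.zeros((len(pairs), 2 * p), dtype=np.float32)
    x[np.arange(len(pairs)), pairs[:, 0]] = 1.0
    x[np.arange(len(pairs)), p + pairs[:, 1]] = 1.0

    perm = rng.permutation(len(pairs))
    n_train = int(train_frac * len(pairs))
    train_idx, test_idx = perm[:n_train], perm[n_train:]
    return (x[train_idx], labels[train_idx]), (x[test_idx], labels[test_idx])

(train_x, train_y), (test_x, test_y) = modular_addition_dataset()
print(train_x.shape, test_x.shape)  # (2822, 194) (6587, 194) for p = 97
```

A grokking experiment then fits a small network to the training split and tracks train and test accuracy over many steps; the characteristic signature is that test accuracy jumps long after train accuracy has saturated.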

Main Contributions

The authors present multiple contributions aimed at delineating why grokking occurs:

  1. Kernel Regime and Early Training Phase: The paper shows that in the early stages of gradient descent, neural networks operate near the kernel regime, where their capacity to generalize on this task is sharply limited: no permutation-equivariant model can achieve small population error on modular addition unless it sees at least a constant fraction of all possible data points. An empirical neural tangent kernel (eNTK) analysis confirms that kernel methods reach zero training error yet initially fail to generalize.
  2. Transition to Rich Feature Learning: The study posits that networks eventually escape the kernel regime and begin learning richer features, a transition driven by small $\ell_\infty$ regularization. This is demonstrated with two-layer quadratic networks, which generalize from far fewer training points once they leave the kernel regime. This feature learning aligns with the implicit bias of optimizers such as Adam toward small $\ell_\infty$ norm.
  3. Generalization Bound Analysis: Focusing on $\ell_\infty$ norm-bounded networks in the regression setting, the authors provide theoretical bounds showing that these networks generalize with substantially fewer samples than required in the kernel regime. They further prove that such networks exist and verify empirically that gradient descent with small $\ell_\infty$ regularization finds them (a minimal training sketch in this spirit follows the list). The argument rests on novel techniques for analyzing Rademacher complexity and empirical risk minimization with smooth losses.
  4. Margin-Based Generalization in Classification: Beyond the regression setting, the paper extends its findings to classification, showing that normalized-margin maximization under an $\ell_\infty$ norm bias can explain delayed generalization. It establishes sample complexity bounds showing that $\tilde{O}(p^{5/3})$ samples suffice for accurate generalization.
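
Contributions 2 and 3 suggest a simple experiment, sketched below under stated assumptions rather than as the paper's exact setup: a two-layer network with quadratic activations is trained on the modular addition data from the earlier sketch using Adam plus a tiny $\ell_\infty$ penalty on the weights. The width, learning rate, penalty strength, full-batch updates, and cross-entropy loss are illustrative choices.

```python
import torch
import torch.nn as nn

class TwoLayerQuadratic(nn.Module):
    """Two-layer network with quadratic activations; width and output head are illustrative."""

    def __init__(self, in_dim, width, num_classes):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, width, bias=False)
        self.fc2 = nn.Linear(width, num_classes, bias=False)

    def forward(self, x):
        return self.fc2(self.fc1(x) ** 2)  # quadratic activation

def linf_norm(model):
    """Largest absolute weight in the model -- the quantity bounded in the paper's theory."""
    return max(w.abs().max() for w in model.parameters())

p = 97
model = TwoLayerQuadratic(in_dim=2 * p, width=256, num_classes=p)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
lam = 1e-4  # "tiny" l_infinity regularization strength (illustrative)

# Reuses train_x, train_y from the dataset sketch above.
x = torch.tensor(train_x)
y = torch.tensor(train_y, dtype=torch.long)

for step in range(10_000):
    opt.zero_grad()
    loss = loss_fn(model(x), y) + lam * linf_norm(model)
    loss.backward()
    opt.step()
```

The `linf_norm` penalty is the point of contact with the theory: the generalization bounds are stated for networks whose weights have bounded $\ell_\infty$ norm, and either an explicit penalty of this kind or Adam's implicit bias toward small $\ell_\infty$ norm steers training toward such solutions.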

Implications and Future Directions

The implications of these findings are multi-faceted:

  • Practical Implications: Understanding the transition from kernel to rich regimes provides actionable insights into training neural networks more efficiently. It suggests why initialization scale matters and offers practical considerations for regularization techniques to hasten generalization.
  • Theoretical Implications: The dichotomy between early kernel-like behavior and eventual feature learning offers deep insights into the dynamics of gradient-based optimization. Theoretical bounds on generalization further anchor the observed phenomena in a rigorous mathematical framework.
  • Future Research: The separation between kernel and rich regimes invites further exploration of other algorithmic tasks that exhibit grokking. Additionally, refining initialization and regularization techniques to better harness early feature learning, and thereby reduce training time, remains an open direction for practical optimization.

Conclusion

This paper rigorously analyzes the grokking phenomenon in neural networks, particularly for modular addition tasks. By establishing a clear dichotomy between the kernel regime and rich feature learning, and providing theoretical and empirical evidence for weak regularization driving effective feature learning, it offers comprehensive insights into why and how grokking occurs. This theoretical analysis sheds light on fundamental aspects of deep learning generalization, contributing significantly to the understanding and optimization of neural networks under various learning conditions.

References:

  • Power et al. (2022). "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets."
  • Jacot et al. (2018). "Neural Tangent Kernel: Convergence and Generalization in Neural Networks."
  • Wainwright (2019). "High-Dimensional Statistics: A Non-Asymptotic Viewpoint."

These contributions illuminate the intricate dynamics of neural network training and generalization, providing a valuable framework for further research and practical advancements.
