
Progress measures for grokking via mechanistic interpretability (2301.05217v3)

Published 12 Jan 2023 in cs.LG and cs.AI

Abstract: Neural networks often exhibit emergent behavior, where qualitatively new capabilities arise from scaling up the amount of parameters, training data, or training steps. One approach to understanding emergence is to find continuous *progress measures* that underlie the seemingly discontinuous qualitative changes. We argue that progress measures can be found via mechanistic interpretability: reverse-engineering learned behaviors into their individual components. As a case study, we investigate the recently-discovered phenomenon of "grokking" exhibited by small transformers trained on modular addition tasks. We fully reverse engineer the algorithm learned by these networks, which uses discrete Fourier transforms and trigonometric identities to convert addition to rotation about a circle. We confirm the algorithm by analyzing the activations and weights and by performing ablations in Fourier space. Based on this understanding, we define progress measures that allow us to study the dynamics of training and split training into three continuous phases: memorization, circuit formation, and cleanup. Our results show that grokking, rather than being a sudden shift, arises from the gradual amplification of structured mechanisms encoded in the weights, followed by the later removal of memorizing components.


Summary

  • The paper demonstrates that grokking emerges by gradually amplifying structured Fourier-based mechanisms through distinct training phases.
  • It reverse-engineers small transformers on modular addition tasks using detailed analyses of weights, activations, and Fourier space ablations.
  • The study introduces restricted and excluded loss as progress measures to track the network's transition from memorization to a generalized solution.

Mechanistic Interpretability and Emergence in Neural Networks

Neural networks often exhibit emergent behaviors where qualitatively new capabilities arise as a result of scaling parameters, training data, or training steps. This paper presents an approach to understanding such emergent behaviors through mechanistic interpretability, focusing on the phenomenon of "grokking" observed in small transformers trained on modular addition tasks. The authors provide a comprehensive reverse engineering of the learned algorithm, confirming it via analysis of activations, weights, and Fourier space ablations.

The paper investigates the dynamics of training and divides them into three continuous phases: memorization, circuit formation, and cleanup. The authors argue that grokking results from the gradual amplification of structured mechanisms encoded in the weights, followed by the removal of memorizing components.

Detailed Analysis of Grokking

Grokking is defined as the abrupt transition of a model to a generalizing solution after extensive training, even when the model initially overfits. The authors examine this phenomenon on a modular addition task: given inputs $a, b \in \{0, \ldots, P-1\}$ for a prime $P$, the model must predict $c = (a + b) \bmod P$. Small transformers trained with weight decay are observed to exhibit grokking consistently on this task. Through mechanistic interpretability, the authors reverse-engineer the learned algorithm and establish that these networks perform addition by converting the task into rotations about a circle, leveraging discrete Fourier transforms and trigonometric identities.
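
The recovered algorithm is compact enough to execute directly. The following NumPy sketch reproduces its logic under stated assumptions: the key-frequency set `ks` is a placeholder (the learned set varies across random seeds), and the real network realizes each step implicitly in its embeddings, attention, and MLP rather than as explicit trigonometry.

```python
import numpy as np

P = 113                     # prime modulus used in the paper
ks = [3, 7, 20, 31, 45]     # placeholder key frequencies; the learned set varies by seed

def fourier_mod_add(a: int, b: int) -> int:
    """Compute (a + b) mod P the way the reverse-engineered network does:
    represent a and b as rotations, compose them via trig identities, and
    read the answer off as the argmax of constructive interference."""
    c = np.arange(P)
    logits = np.zeros(P)
    for k in ks:
        w = 2 * np.pi * k / P
        # cos/sin of w(a+b) via the product identities the MLP implements
        cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
        sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)
        # logit(c) accumulates cos(w(a+b-c)), which is maximal exactly
        # when c = (a + b) mod P, since P is prime and k != 0
        logits += cos_ab * np.cos(w * c) + sin_ab * np.sin(w * c)
    return int(np.argmax(logits))

assert all(fourier_mod_add(a, b) == (a + b) % P
           for a, b in [(0, 0), (5, 7), (100, 50), (112, 112)])
```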

The principal findings rest on four lines of evidence:

  1. Consistent Periodic Structures in Weights and Activations: The weights and activations exhibit a periodic structure, with the embedding matrix $W_E$ being sparse in the Fourier basis and concentrated on a handful of key frequencies $w_k$ (a check of this sparsity is sketched after this list).
  2. Mechanistic Evidence: The neuron-logit map $W_L$ is well approximated by a combination of sine and cosine terms of the key frequencies, verifying that the model utilizes trigonometric identities.
  3. Approximation of Neuron Activations: Most neurons in the multi-layer perceptron (MLP) layers are well-approximated by degree-2 polynomials of sines and cosines of key frequencies.
  4. Faithful Component Ablations: Replacing components of the model with their approximations generally does not harm and sometimes even improves performance, validating the accuracy of the mechanistic model.
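
As a concrete illustration of the first line of evidence, the sketch below projects an embedding matrix onto a Fourier basis over the inputs and ranks frequencies by energy. It is a minimal reconstruction rather than the paper's exact analysis code: `W_E` here is a random stand-in for a trained model's embedding, whereas a grokked model would concentrate most of its energy in a few cos/sin row pairs.

```python
import torch

def fourier_basis(P: int) -> torch.Tensor:
    """Orthonormal Fourier basis over Z_P: one constant row, then
    cos/sin row pairs for k = 1 .. P // 2. Shape (P, P)."""
    x = torch.arange(P).float()
    rows = [torch.ones(P)]
    for k in range(1, P // 2 + 1):
        rows.append(torch.cos(2 * torch.pi * k * x / P))
        rows.append(torch.sin(2 * torch.pi * k * x / P))
    B = torch.stack(rows[:P])
    return B / B.norm(dim=1, keepdim=True)

# W_E: (P, d_model) embedding matrix; a random stand-in here,
# substitute the weights of a trained model in practice.
W_E = torch.randn(113, 128)
coeffs = fourier_basis(113) @ W_E   # rewrite embeddings in the Fourier basis
norms = coeffs.norm(dim=1)          # energy per frequency component
# Row r > 0 corresponds to frequency k = (r + 1) // 2 (cos for odd r, sin for even).
print(norms.topk(10).indices)       # a grokked W_E is dominated by a few key k
```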

Progress Measures for Grokking

The authors use their mechanistic understanding to define two progress measures, restricted loss and excluded loss, which track the model's evolution toward a generalizing solution. Both metrics improve continuously well before grokking becomes visible in the test loss, making the training dynamics legible (a simplified implementation is sketched after the list below).

  1. Restricted Loss: measures performance when all components except the key frequencies are ablated, isolating the generalizing circuit.
  2. Excluded Loss: measures performance when only the key frequencies are ablated, differentiating memorization from generalization.
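
One simplified way to realize both measures is to express each example's logit vector in a Fourier basis over the output dimension and zero out components. This sketch makes several assumptions: the paper ablates components of the logits derived from the model's internals, the treatment of the constant (DC) term here is a simplification, and `key_freqs` must come from the specific run being analyzed. It reuses `fourier_basis` from the earlier sketch.

```python
import torch
import torch.nn.functional as F

def fourier_losses(logits, labels, key_freqs, P=113):
    """Simplified restricted/excluded losses: write each example's logit
    vector (over the P output classes) in the Fourier basis, then keep only
    (restricted) or remove only (excluded) the key-frequency components."""
    B = fourier_basis(P).to(logits.device)
    coeffs = logits @ B.T                         # (batch, P) Fourier coefficients
    key_rows = [0] + [r for k in key_freqs for r in (2 * k - 1, 2 * k)]
    keep = torch.zeros(P, dtype=torch.bool)
    keep[key_rows] = True                         # DC + cos/sin rows of key freqs
    restricted = (coeffs * keep) @ B              # ablate all non-key frequencies
    excluded = (coeffs * ~keep) @ B               # ablate only the key frequencies
    return (F.cross_entropy(restricted, labels),
            F.cross_entropy(excluded, labels))
```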

Phases of Training

The training process is divided into three distinct phases (the sketch after this list shows a setup in which they can be observed):

  1. Memorization Phase: The network memorizes training data without leveraging the key frequencies.
  2. Circuit Formation Phase: The network starts forming the Fourier multiplication circuit, aided by weight decay, showing continuous improvement in restricted loss.
  3. Cleanup Phase: Weight decay significantly reduces non-key frequency components, transitioning the network to a simplified form that generalizes well.
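
The three phases can be watched directly by logging train and test loss (together with the progress measures above) over a long run with weight decay. The sketch below is a self-contained approximation of the setup, not the paper's exact configuration: the paper trains a one-layer transformer full-batch with AdamW on roughly 30% of all input pairs; here a small MLP stands in so the snippet runs on its own (grokking-like dynamics have been reported for MLPs on this task, though the timing differs), and the hyperparameters are approximate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
P, frac_train = 113, 0.3                     # ~30% of pairs used for training
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
perm = torch.randperm(len(pairs))
n_train = int(frac_train * len(pairs))
train, test = pairs[perm[:n_train]], pairs[perm[n_train:]]

# MLP stand-in for the paper's one-layer transformer, to keep this runnable.
model = nn.Sequential(nn.Linear(2 * P, 512), nn.ReLU(), nn.Linear(512, P))
# Weight decay is the ingredient that drives circuit formation and cleanup.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

def batch_loss(data):
    x = torch.cat([F.one_hot(data[:, 0], P), F.one_hot(data[:, 1], P)], -1).float()
    return F.cross_entropy(model(x), (data[:, 0] + data[:, 1]) % P)

for step in range(20_000):                   # full-batch training, as in the paper
    loss = batch_loss(train)
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            # Memorization: train loss falls while test loss stays high.
            # Circuit formation: restricted loss improves; test loss is flat.
            # Cleanup: test loss drops sharply as memorization is pruned.
            print(step, float(loss), float(batch_loss(test)))
```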

Implications and Future Work

The paper's findings have significant practical and theoretical implications. They not only elucidate how emergent behaviors and grokking manifest at a mechanistic level but also highlight the critical role of weight decay in promoting generalized solutions. For future work, the authors suggest scaling mechanistic interpretability to larger, more complex models, and defining task-independent progress measures. They also advocate for developing a theory to predict the timing of phase transitions in emergent behaviors.

Conclusion

In summary, this paper successfully demonstrates the use of mechanistic interpretability to uncover the underlying dynamics of emergent behavior in neural networks. Through a detailed case study of small transformers trained on modular addition tasks, the authors provide clear evidence of the structured mechanisms that lead to grokking. This approach offers a promising direction for understanding and potentially predicting emergent behaviors in more complex and larger-scale models.
