
Progress measures for grokking via mechanistic interpretability

(2301.05217)
Published Jan 12, 2023 in cs.LG and cs.AI

Abstract

Neural networks often exhibit emergent behavior, where qualitatively new capabilities arise from scaling up the number of parameters, training data, or training steps. One approach to understanding emergence is to find continuous progress measures that underlie the seemingly discontinuous qualitative changes. We argue that progress measures can be found via mechanistic interpretability: reverse-engineering learned behaviors into their individual components. As a case study, we investigate the recently discovered phenomenon of "grokking" exhibited by small transformers trained on modular addition tasks. We fully reverse-engineer the algorithm learned by these networks, which uses discrete Fourier transforms and trigonometric identities to convert addition to rotation about a circle. We confirm the algorithm by analyzing the activations and weights and by performing ablations in Fourier space. Based on this understanding, we define progress measures that allow us to study the dynamics of training and split training into three continuous phases: memorization, circuit formation, and cleanup. Our results show that grokking, rather than being a sudden shift, arises from the gradual amplification of structured mechanisms encoded in the weights, followed by the later removal of memorizing components.

Figure: Training and test losses on the modular addition task for varying fractions of training data.

Overview

  • The paper uses mechanistic interpretability to understand emergent behaviors in neural networks, focusing on 'grokking' in small transformers trained on modular addition tasks.

  • The authors divide the training process into three phases: memorization, circuit formation, and cleanup, emphasizing the role of weight decay in promoting generalized solutions.

  • The study defines two novel progress measures, restricted loss and excluded loss, to track the model's evolution towards generalization and explores the implications for future research on larger models.

Mechanistic Interpretability and Emergence in Neural Networks

Neural networks often exhibit emergent behaviors where qualitatively new capabilities arise as a result of scaling parameters, training data, or training steps. This paper presents an approach to understanding such emergent behaviors through mechanistic interpretability, focusing on the phenomenon of "grokking" observed in small transformers trained on modular addition tasks. The authors provide a comprehensive reverse engineering of the learned algorithm, confirming it via analysis of activations, weights, and Fourier space ablations.

The study investigates the training dynamics and divides them into three continuous phases: memorization, circuit formation, and cleanup. The authors argue that grokking results from the gradual amplification of structured mechanisms encoded in the weights, followed by the removal of the memorizing components.

Detailed Analysis of Grokking

Grokking is the abrupt transition to a generalizing solution that occurs long after a model has already overfit its training data. The authors examine this phenomenon on a modular addition task: given inputs \(a, b \in \{0, \ldots, P-1\}\) for a prime \(P\) (the paper uses \(P = 113\)), predict \(c = (a + b) \bmod P\). Small transformers trained with weight decay on this task exhibit grokking consistently. Through mechanistic interpretability, the authors reverse-engineer the learned algorithm and establish that these networks perform addition by converting it into rotation about a circle, leveraging discrete Fourier transforms and trigonometric identities.
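To make the reverse-engineered algorithm concrete, the following is a minimal NumPy sketch of the computation the paper describes (not of the transformer itself): each input is mapped onto a circle at a few key frequencies, angle-sum identities combine the two inputs into \(\cos(w_k(a+b))\) and \(\sin(w_k(a+b))\), and the logit for each candidate \(c\) is \(\cos(w_k(a+b-c))\), which is maximal exactly at \(c = (a+b) \bmod P\). The specific frequencies below are illustrative stand-ins; each trained network converges on its own small set.

```python
import numpy as np

P = 113                       # modulus used in the paper
key_freqs = [14, 35, 41, 52]  # illustrative; a trained run selects its own handful

def fourier_addition(a, b, P=P, freqs=key_freqs):
    """Sketch of the reverse-engineered algorithm: addition as rotation about a circle."""
    c = np.arange(P)
    logits = np.zeros(P)
    for k in freqs:
        w = 2 * np.pi * k / P
        # Angle-sum identities give cos(w(a+b)) and sin(w(a+b)) from per-input terms.
        cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
        sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)
        # The logit for candidate c is cos(w(a+b-c)), maximal at c = (a+b) mod P.
        logits += cos_ab * np.cos(w * c) + sin_ab * np.sin(w * c)
    return int(np.argmax(logits))

assert fourier_addition(57, 83) == (57 + 83) % P
```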

The principal findings rest on four lines of evidence:

  1. Consistent Periodic Structures in Weights and Activations: The weights and activations exhibit a periodic structure, with the embedding matrix \(W_E\) being sparse in the Fourier basis and concentrated on a few key frequencies \(w_k\) (see the spectrum sketch after this list).
  2. Mechanistic Evidence: The neuron-logit map \(W_L\) is well approximated by a combination of sine and cosine terms at the key frequencies, confirming that the model uses the trigonometric identities.
  3. Approximation of Neuron Activations: Most neurons in the MLP (multi-layer perceptron) layer are well approximated by degree-2 polynomials of sines and cosines of the key frequencies.
  4. Faithful Component Ablations: Replacing components of the model with their approximations generally does not harm performance, and sometimes even improves it, validating the accuracy of the mechanistic model.
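As a concrete illustration of the first line of evidence, the sketch below (assuming access to the embedding matrix \(W_E\) of a trained run, which is not constructed here) computes the norm of the embedding along each Fourier frequency over the token dimension; a spectrum dominated by a handful of frequencies is what "sparse in the Fourier basis" means here.

```python
import numpy as np

def embedding_fourier_norms(W_E):
    """Norm of the token embedding along each Fourier frequency.

    W_E: (P, d_model) embedding matrix from a trained modular-addition
    transformer (assumed available; not constructed here).
    """
    coeffs = np.fft.rfft(W_E, axis=0)        # DFT over the P input tokens
    return np.linalg.norm(coeffs, axis=1)    # one norm per frequency 0..P//2

# The key frequencies show up as the dominant entries of this spectrum, e.g.:
# key_freqs = np.argsort(embedding_fourier_norms(W_E))[-4:]
```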

Progress Measures for Grokking

The authors use their mechanistic understanding to define two progress measures, restricted loss and excluded loss, that track the model's evolution towards the generalizing solution. Both measures improve continuously well before the apparent grokking transition, making the underlying training dynamics legible.

  1. Restricted Loss: Performance when all but the key frequencies are ablated, isolating the generalizing circuit.
  2. Excluded Loss: Performance on the training data when only the key frequencies are ablated, which separates memorization from generalization (a sketch of both Fourier-space ablations follows this list).
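The following is a hedged sketch of the shared Fourier-space ablation behind both measures, assuming per-example logits (or a single component's contribution to them) from a trained run are available; the paper applies the ablation to specific components of the network rather than to raw logits.

```python
import numpy as np

def fourier_ablate(logits, key_freqs, keep_key=True):
    """Filter a (num_examples, P) logit array in the Fourier basis over the
    output-class dimension.

    keep_key=True  keeps only the key frequencies  -> used for restricted loss
    keep_key=False removes the key frequencies     -> used for excluded loss
    """
    P = logits.shape[-1]
    coeffs = np.fft.rfft(logits, axis=-1)
    mask = np.zeros(coeffs.shape[-1], dtype=bool)
    mask[list(key_freqs)] = True
    if not keep_key:
        mask = ~mask
    return np.fft.irfft(np.where(mask, coeffs, 0), n=P, axis=-1)

# restricted loss: cross-entropy of fourier_ablate(logits, key_freqs, keep_key=True)
# excluded loss:   cross-entropy on the training set of fourier_ablate(..., keep_key=False)
```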

Phases of Training

The training process is divided into three distinct phases:

  1. Memorization Phase: The network memorizes training data without leveraging the key frequencies.
  2. Circuit Formation Phase: The network starts forming the Fourier multiplication circuit, aided by weight decay, showing continuous improvement in restricted loss.
  3. Cleanup Phase: Weight decay removes the remaining non-key-frequency components, transitioning the network to a simplified form that generalizes well (see the sketch after this list).
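A minimal sketch of the kind of training setup involved, assuming PyTorch and a placeholder model: the paper trains a small one-layer transformer with AdamW, and the decoupled weight-decay term is what drives the cleanup phase. The linear stand-in model and the specific hyperparameters here are purely illustrative.

```python
import torch
import torch.nn as nn

P = 113
model = nn.Linear(2 * P, P)  # placeholder, not the paper's architecture
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

# One illustrative training step on random modular-addition examples.
a = torch.randint(0, P, (256,))
b = torch.randint(0, P, (256,))
x = torch.cat([nn.functional.one_hot(a, P), nn.functional.one_hot(b, P)], dim=-1).float()
loss = nn.functional.cross_entropy(model(x), (a + b) % P)

opt.zero_grad()
loss.backward()
opt.step()  # decoupled weight decay steadily shrinks weights the loss does not need
```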

Implications and Future Work

The paper's findings have significant practical and theoretical implications. They not only elucidate how emergent behaviors and grokking manifest at a mechanistic level but also highlight the critical role of weight decay in promoting generalized solutions. For future work, the authors suggest scaling mechanistic interpretability to larger, more complex models, and defining task-independent progress measures. They also advocate for developing a theory to predict the timing of phase transitions in emergent behaviors.

Conclusion

In summary, this paper successfully demonstrates the use of mechanistic interpretability to uncover the underlying dynamics of emergent behavior in neural networks. Through a detailed case study on small transformers trained for modular addition tasks, the authors provide clear evidence of the structured mechanisms that lead to grokking. This approach offers a promising direction for understanding and potentially predicting emergent behaviors in more complex and larger-scale models.
