Emergence in non-neural models: grokking modular arithmetic via average gradient outer product (2407.20199v3)

Published 29 Jul 2024 in stat.ML and cs.LG

Abstract: Neural networks trained to solve modular arithmetic tasks exhibit grokking, a phenomenon where the test accuracy starts improving long after the model achieves 100% training accuracy in the training process. It is often taken as an example of "emergence", where model ability manifests sharply through a phase transition. In this work, we show that the phenomenon of grokking is not specific to neural networks nor to gradient descent-based optimization. Specifically, we show that this phenomenon occurs when learning modular arithmetic with Recursive Feature Machines (RFM), an iterative algorithm that uses the Average Gradient Outer Product (AGOP) to enable task-specific feature learning with general machine learning models. When used in conjunction with kernel machines, iterating RFM results in a fast transition from random, near zero, test accuracy to perfect test accuracy. This transition cannot be predicted from the training loss, which is identically zero, nor from the test loss, which remains constant in initial iterations. Instead, as we show, the transition is completely determined by feature learning: RFM gradually learns block-circulant features to solve modular arithmetic. Paralleling the results for RFM, we show that neural networks that solve modular arithmetic also learn block-circulant features. Furthermore, we present theoretical evidence that RFM uses such block-circulant features to implement the Fourier Multiplication Algorithm, which prior work posited as the generalizing solution neural networks learn on these tasks. Our results demonstrate that emergence can result purely from learning task-relevant features and is not specific to neural architectures nor gradient descent-based optimization methods. Furthermore, our work provides more evidence for AGOP as a key mechanism for feature learning in neural networks.

Summary

  • The paper demonstrates that task-specific feature learning via the Average Gradient Outer Product enables grokking in modular arithmetic tasks.
  • It introduces an iterative RFM algorithm that leverages kernel machines to reveal a sharp transition from random to perfect test accuracy.
  • The study bridges insights between non-neural models and neural networks by highlighting the role of block-circulant features in emergent behaviors.

Emergence in non-neural models: grokking modular arithmetic via average gradient outer product

Introduction

The paper investigates grokking in Recursive Feature Machines (RFM) to understand emergence in modular arithmetic tasks. Traditionally associated with neural networks, grokking describes a scenario in which test accuracy begins to improve long after training accuracy has reached 100%. By extending the phenomenon beyond neural networks, the authors demonstrate that it results from task-specific feature learning rather than from intrinsic properties of neural architectures or gradient descent-based optimization.

Recursive Feature Machines and Grokking

RFM is an iterative algorithm that uses the Average Gradient Outer Product (AGOP) to enable task-specific feature learning in general machine learning models, including kernel machines. The paper shows that RFM undergoes a sharp transition from random to perfect test accuracy, a transition that cannot be predicted from the training loss (which is identically zero) or from the test loss (which remains constant in the initial iterations) (Figure 1).

Figure 1: Recursive Feature Machines grok the modular arithmetic task $f^*(x, y) = (x + y) \bmod 59$.

To implement RFM for modular arithmetic, training iterates three steps: fitting a kernel machine to the data, computing the AGOP of the fitted predictor, and transforming the input data with the learned features (Figures 2 and 3).
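A minimal sketch of this loop, assuming kernel ridge regression with a quadratic kernel as in Figure 2; the function names, regularization value, and trace rescaling are illustrative choices rather than the authors' implementation, and variants may feed a matrix power of the AGOP (rather than the AGOP itself) back into the kernel:

```python
import numpy as np

def quad_kernel(X, Z, M):
    """Quadratic kernel k_M(x, z) = (x^T M z + 1)^2 (an illustrative choice)."""
    return (X @ M @ Z.T + 1.0) ** 2

def fit_kernel_machine(X, Y, M, reg=1e-8):
    """Kernel ridge regression: solve (K + reg * I) alpha = Y for alpha (n x classes)."""
    K = quad_kernel(X, X, M)
    return np.linalg.solve(K + reg * np.eye(len(X)), Y)

def agop(X, alpha, M):
    """Average Gradient Outer Product of the fitted predictor
    f(x) = sum_j alpha_j k_M(x, x_j).  For the quadratic kernel,
    grad_x k_M(x, z) = 2 (x^T M z + 1) M z."""
    n, d = X.shape
    S = X @ M @ X.T + 1.0        # S[i, j] = x_i^T M x_j + 1
    MZ = X @ M.T                 # row j is (M x_j)^T
    G = np.zeros((d, d))
    for i in range(n):
        J = (2.0 * S[i][:, None] * MZ).T @ alpha   # d x classes (transposed Jacobian)
        G += J @ J.T
    return G / n

def rfm(X, Y, n_iters=10, reg=1e-8):
    """Minimal RFM loop: fit a kernel machine, compute its AGOP,
    and feed the AGOP back in as the kernel's feature matrix."""
    d = X.shape[1]
    M = np.eye(d)
    for _ in range(n_iters):
        alpha = fit_kernel_machine(X, Y, M, reg)
        M = agop(X, alpha, M)
        M = M / (np.trace(M) / d + 1e-12)   # optional rescaling for numerical stability
    return M, alpha
```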

Feature Learning and Circulant Structures

RFM learns block-circulant features that are critical for solving modular arithmetic tasks. These structures enable the Fourier Multiplication Algorithm (FMA), which prior work posited as the generalizing solution to these tasks. Learning progress appears as gradual improvement in circulant-deviation and AGOP-alignment metrics, and this improvement precedes the sharp transition in test accuracy and test loss (Figure 2; a simple circulant-deviation measure is sketched below the figure).

Figure 2: RFM with the quadratic kernel on modular arithmetic with modulus p = 61 trained for 30 iterations...
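The block-circulant claim can be probed directly: in the usual one-hot encoding of these tasks, the AGOP over the concatenated input splits into $p \times p$ blocks, and each block should be close to circulant. Below is one simple way to quantify this, assuming a Frobenius-projection definition of circulant deviation; the paper's exact metric may differ in detail:

```python
import numpy as np

def nearest_circulant(A):
    """Frobenius projection of a p x p matrix onto circulant matrices:
    average each wrapped diagonal, then roll that row cyclically."""
    p = A.shape[0]
    first_row = np.array([np.mean([A[k, (k + s) % p] for k in range(p)])
                          for s in range(p)])
    return np.stack([np.roll(first_row, i) for i in range(p)])

def circulant_deviation(A):
    """Relative distance of A from its circulant projection (0 = exactly circulant)."""
    C = nearest_circulant(A)
    return np.linalg.norm(A - C) / np.linalg.norm(A)

# Example (hypothetical names): split a learned 2p x 2p AGOP into p x p blocks
# and check each block, e.g.
# deviation = circulant_deviation(M_learned[:p, p:])
```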

Parallels with Neural Networks

Emergence in fully connected networks parallels that in RFM. Neural networks that solve modular arithmetic also learn block-circulant features, and these features correspond to the AGOP of the trained network, supporting the Neural Feature Ansatz that feature learning in neural networks proceeds via the AGOP (Figure 3; a minimal check of this correspondence is sketched below the figure).

Figure 3: One hidden layer fully-connected networks with quadratic activations trained on modular arithmetic with modulus p = 61...
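A minimal numerical check of this correspondence, assuming the one-hidden-layer quadratic-activation architecture of Figure 3 with first-layer weights W (width x d) and output weights A (width x classes); the cosine-similarity alignment score is one simple choice and may differ from the paper's metric:

```python
import numpy as np

def net_agop(W, A, X):
    """AGOP of a one-hidden-layer net with quadratic activations,
    f_c(x) = sum_k A[k, c] * (w_k . x)^2, averaged over the rows of X."""
    d = X.shape[1]
    G = np.zeros((d, d))
    for x in X:
        h = W @ x                          # pre-activations, shape (width,)
        J = (2.0 * A * h[:, None]).T @ W   # Jacobian of the outputs, shape (classes, d)
        G += J.T @ J
    return G / len(X)

def alignment(P, Q):
    """Cosine similarity between two matrices, one simple 'AGOP alignment' measure."""
    return np.sum(P * Q) / (np.linalg.norm(P) * np.linalg.norm(Q))

# Neural Feature Ansatz check (illustrative, hypothetical variable names): the
# first-layer neural feature matrix W^T W should align with the network's AGOP.
# score = alignment(W.T @ W, net_agop(W, A, X_train))
```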

Neural networks trained on random circulant-transformed data generalize faster, further underscoring the role of circulant features (Figures 4 and 5).

Theoretical Insights: Fourier Multiplication Algorithm

The authors present theoretical evidence that block-circulant features equip kernel machines to implement the FMA, aligning the learning mechanisms of neural networks and RFM. The result points to a shared algorithmic solution to modular arithmetic, corroborated by the task-specific circulant structure of the AGOP features (Figure 4; the algorithm is restated below the figure).

Figure 4: AGOP evolution for quadratic RFM trained on modular multiplication with p=61 before reordering (top row) and after reordering by the logarithm base 2...
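For reference, the FMA for $f^*(x, y) = (x + y) \bmod p$ can be restated in standard DFT notation (the normalization conventions here are illustrative). With one-hot encodings $e_x, e_y \in \mathbb{R}^p$,

$$\hat{e}_x[k] \;=\; \sum_{j=0}^{p-1} e_x[j]\, e^{-2\pi i jk/p} \;=\; e^{-2\pi i xk/p}, \qquad \hat{e}_x[k]\,\hat{e}_y[k] \;=\; e^{-2\pi i (x+y)k/p} \;=\; \widehat{e_{(x+y)\bmod p}}[k],$$

so an elementwise product in Fourier space followed by an inverse DFT recovers $e_{(x+y)\bmod p}$. Since circulant matrices are exactly the matrices diagonalized by the DFT basis, block-circulant features amount to a change of basis into Fourier coordinates, which, per the paper's analysis, is how they enable a kernel machine to carry out this computation.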

Conclusions

Emergence in RFM and neural networks is driven by feature learning through AGOP, independent of architecture specifics or gradient-based optimization. This understanding challenges conventional measures of progress and generalization prediction, emphasizing hidden progress and task-specific feature structures. The insights could inform the development of efficient algorithms capable of grokking complex tasks beyond modular arithmetic by leveraging similar feature learning mechanisms.

In summary, this research extends our understanding of emergent phenomena in machine learning, demonstrating the pivotal role of learned features and suggesting a unified underlying mechanism across different model types for modular arithmetic tasks. Future work could explore the broader applicability of AGOP in feature learning, contributing significantly to the development of efficient machine learning systems.
