
Grokfast: Accelerated Grokking by Amplifying Slow Gradients

(2405.20233)
Published May 30, 2024 in cs.LG and cs.AI

Abstract

Grokking is a puzzling artifact in machine learning in which delayed generalization is achieved tens of times more iterations after near-perfect overfitting to the training data. Focusing on this long delay on behalf of machine learning practitioners, our goal is to accelerate the generalization of a model under the grokking phenomenon. By regarding the series of gradients of a parameter over training iterations as a random signal over time, we can spectrally decompose the parameter trajectories under gradient descent into two components: a fast-varying, overfitting-yielding component and a slow-varying, generalization-inducing component. This analysis allows us to accelerate the grokking phenomenon more than $\times 50$ with only a few lines of code that amplify the slow-varying components of gradients. The experiments show that our algorithm applies to diverse tasks involving images, languages, and graphs, making this peculiar artifact of sudden generalization practically available. Our code is available at https://github.com/ironjr/grokfast.

The Grokfast algorithm accelerates model generalization under the grokking phenomenon, bringing forward the sudden generalization that otherwise follows long after overfitting.

Overview

  • The paper 'Grokfast: Accelerated Grokking by Amplifying Slow Gradients' by Jaerin Lee et al. proposes a method to expedite the grokking process in machine learning, where models generalize to unseen data only after extensive additional training.

  • The proposed method, Grokfast, utilizes gradient spectral decomposition to isolate and amplify the slow-varying gradient components, accelerating the transition from overfitting to generalization.

  • Empirical validations across various tasks, including modular arithmetic, MNIST classification, and sentiment analysis, demonstrate significant reductions in training iterations and improvements in validation accuracy using Grokfast.


The paper "Grokfast: Accelerated Grokking by Amplifying Slow Gradients" by Jaerin Lee et al. addresses a significant phenomenon in machine learning known as grokking. Grokking involves instances where models, having initially overfitted to training data, undergo delayed but sudden generalization after extensive additional training iterations. The authors propose a method to expedite this generalization process, leveraging gradient decomposition and spectral analysis.

Background and Motivation

The grokking phenomenon was first observed in training scenarios involving a two-layer Transformer using algorithmic datasets, such as modular arithmetic. Despite achieving near-perfect training accuracy early on, the model did not generalize well to unseen data until much later in the training process. Existing theories have related grokking to the double descent phenomenon but have not fully characterized its mechanisms.

Given the computational cost associated with grokking, the primary motivation of this work is to make models that exhibit grokking more practical by accelerating the generalization phase. The proposed method extends the usefulness of such models under the resource constraints common among machine learning practitioners.

Proposed Method: Gradient Spectral Decomposition

The authors introduce a novel approach by treating the series of gradients during training as a stochastic signal. This method spectrally decomposes parameter trajectories into fast-varying (overfitting) and slow-varying (generalization-inducing) components. The core hypothesis is that amplifying the slow-varying gradient components can expedite grokking.
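For intuition, here is a minimal, self-contained sketch (not the authors' analysis code) of how a single parameter's gradient history could be split into slow and fast components with a discrete Fourier transform; the function name `split_gradient_signal` and the cutoff index are illustrative assumptions.

```python
# Illustrative only: decompose one parameter's gradient history into
# slow (low-frequency) and fast (high-frequency) parts with an FFT.
import numpy as np

def split_gradient_signal(grad_history, cutoff=10):
    """grad_history: 1-D array of one scalar parameter's gradients over iterations."""
    spectrum = np.fft.rfft(grad_history)
    low = spectrum.copy()
    low[cutoff:] = 0.0              # keep only the slow-varying components
    high = spectrum - low           # the remaining fast-varying components
    slow = np.fft.irfft(low, n=len(grad_history))
    fast = np.fft.irfft(high, n=len(grad_history))
    return slow, fast

# Example: a slowly drifting trend buried under iteration-to-iteration noise.
t = np.arange(1000)
g = 0.05 * np.sin(2 * np.pi * t / 1000) + 0.5 * np.random.randn(1000)
slow, fast = split_gradient_signal(g)
```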

The primary algorithm, termed Grokfast, integrates the following steps (a minimal code sketch follows the list):

  1. Gradient Filtering: The gradients are processed through a low-pass filter, isolating the slow component.
  2. Gradient Amplification: The slow components are amplified and added back to the original gradients before being fed into the optimizer.
  3. Optimizer Application: This modified gradient is then applied using standard optimization algorithms (e.g., SGD or Adam).
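
A minimal sketch of these three steps in PyTorch, assuming a moving-average low-pass filter; the function `grokfast_ma` and the hyperparameters `window_size` and `lamb` are illustrative choices, not the reference implementation.

```python
from collections import deque
import torch

def grokfast_ma(model, grad_queues, window_size=100, lamb=5.0):
    """Steps 1-2: low-pass filter each parameter's gradient with a moving
    average over recent iterations, then add the amplified slow component
    back onto the raw gradient in place."""
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        queue = grad_queues.setdefault(name, deque(maxlen=window_size))
        queue.append(p.grad.detach().clone())
        slow = torch.stack(list(queue)).mean(dim=0)  # slow-varying component
        p.grad.add_(slow, alpha=lamb)                # amplify and add back
    return grad_queues

# Step 3: the modified gradients are consumed by a standard optimizer.
# grad_queues = {}
# for batch in loader:
#     loss = compute_loss(model, batch)
#     loss.backward()
#     grad_queues = grokfast_ma(model, grad_queues)
#     optimizer.step()
#     optimizer.zero_grad()
```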

Empirical Validation

The authors rigorously validate their hypothesis across various tasks:

  1. Algorithmic Data (Modular Multiplication): Using a Transformer model, Grokfast reduced the number of training iterations needed to reach 95% validation accuracy by approximately $\times 50$ compared to the baseline.
  2. MNIST Classification: Applied to a three-layer MLP, Grokfast reduced the grokking delay by $\times 22.0$ and improved final evaluation accuracy from 89.5% to 91.5%.
  3. QM9 Molecular Dataset: Training a graph convolutional neural network on a molecular polarity prediction task with Grokfast resulted in both faster and better convergence in validation loss.
  4. IMDb Sentiment Analysis: Training a two-layer LSTM, the Grokfast algorithm provided quicker generalization and better validation performance.

Discussion and Implications

Transience and Parameter Space Dynamics: The authors interpret grokking as a state transition in parameter space, with the model traversing three states: initialized, overfitted, and generalized. Grokfast effectively shortens the parameter-space traversal between the overfitted and generalized states, thus accelerating generalization.

Compatibility with Weight Decay: A synergistic effect was observed when combining Grokfast with weight decay, further accelerating the grokking process. This joint application also resulted in reduced training instability.
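
A brief usage sketch of that combination, assuming decoupled weight decay via PyTorch's AdamW together with the illustrative `grokfast_ma` filter above; the model and hyperparameter values are placeholders, not the paper's settings.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 128)  # placeholder model
# Decoupled weight decay via AdamW; values are placeholders, not tuned settings.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

grad_queues = {}
# Per iteration: backward pass, then gradient filtering, then the optimizer step.
# loss.backward()
# grad_queues = grokfast_ma(model, grad_queues)
# optimizer.step(); optimizer.zero_grad()
```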

Memory Efficiency: Using an exponential moving average (EMA) as the low-pass filter substantially reduces the memory footprint, which is critical for large models. This adaptation retains the $\times 50$ acceleration benefit within practical computational constraints.
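
A minimal sketch of such an EMA-based filter, keeping one running average per parameter instead of a window of past gradients; `grokfast_ema`, `alpha`, and `lamb` are illustrative names and values, not the reference implementation.

```python
import torch

def grokfast_ema(model, ema_grads, alpha=0.98, lamb=2.0):
    """Maintain an exponential moving average of each parameter's gradient
    (the slow component) and add it back, amplified, onto the raw gradient."""
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        g = p.grad.detach()
        if name not in ema_grads:
            ema_grads[name] = g.clone()
        else:
            ema_grads[name].mul_(alpha).add_(g, alpha=1 - alpha)  # running low-pass filter
        p.grad.add_(ema_grads[name], alpha=lamb)                  # amplify the slow component
    return ema_grads
```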

Theoretical Implications and Future Work: The work underlines the utility of frequency domain analyses in understanding and manipulating neural network training dynamics. Future research could explore adaptive filter designs, deeper theoretical investigations into model state transitions, and broader applications across different architectures and datasets.

Conclusion

The research by Jaerin Lee et al. delivers a compelling method to harness the grokking phenomenon more effectively. By amplifying slow gradients, their approach provides significant computational savings, enabling more practical deployment of models that otherwise experience delayed generalization. This contributes valuable insights and tools for both theoretical explorations and practical enhancement of machine learning training regimes.
