
Deconstructing What Makes a Good Optimizer for Language Models

(arXiv:2407.07972)
Published Jul 10, 2024 in cs.LG and cs.AI

Abstract

Training language models becomes increasingly expensive with scale, prompting numerous attempts to improve optimization efficiency. Despite these efforts, the Adam optimizer remains the most widely used, due to a prevailing view that it is the most effective approach. We aim to compare several optimization algorithms, including SGD, Adafactor, Adam, and Lion, in the context of autoregressive language modeling across a range of model sizes, hyperparameters, and architecture variants. Our findings indicate that, except for SGD, these algorithms all perform comparably both in their optimal performance and also in terms of how they fare across a wide range of hyperparameter choices. Our results suggest to practitioners that the choice of optimizer can be guided by practical considerations like memory constraints and ease of implementation, as no single algorithm emerged as a clear winner in terms of performance or stability to hyperparameter misspecification. Given our findings, we further dissect these approaches, examining two simplified versions of Adam: a) signed momentum (Signum) which we see recovers both the performance and hyperparameter stability of Adam and b) Adalayer, a layerwise variant of Adam which we introduce to study Adam's preconditioning. Examining Adalayer leads us to the conclusion that the largest impact of Adam's preconditioning is restricted to the last layer and LayerNorm parameters, and, perhaps surprisingly, the remaining layers can be trained with SGD.

Figure: Momentum sensitivity with a fixed learning rate in three scenarios using different optimization algorithms.

Overview

  • The paper provides an extensive evaluation of several optimization algorithms for language model training, challenging the prevailing preference for the Adam optimizer.

  • Key findings reveal that optimizers such as Adafactor and Lion perform comparably to Adam across model sizes and hyperparameter settings, suggesting that practical concerns such as memory constraints and computational efficiency can guide the choice of optimizer.

  • The authors introduce simplifications and variants of Adam to dissect its components, concluding that many parameters can be trained effectively with SGD if an adaptive mechanism is applied to critical layers.

Deconstructing What Makes a Good Optimizer for Language Models

The paper "Deconstructing What Makes a Good Optimizer for Language Models" by Rosie Zhao, Depen Morwani, David Brandfonbrener, Nikhil Vyas, and Sham Kakade provides an extensive evaluation of several optimization algorithms for language model training. The authors contend that despite the prevailing preference for the Adam optimizer in the community, there is a lack of rigorous comparison among various optimizers under diverse conditions, such as different model sizes, hyperparameters, and architectures. This study addresses that gap by performing comprehensive sweeps and presenting a granular analysis of the stability and performance of several optimizers.

Methodological Framework

The authors test several optimization algorithms, including SGD, Adafactor, Adam, and Lion, focusing on autoregressive language modeling. They conduct large-scale experiments across model sizes (150M to 1.2B parameters) and hyperparameter sweeps to assess each optimizer's best attainable performance and its robustness to hyperparameter choices.

Key findings demonstrate that SGD generally underperforms compared to other optimizers both in terms of stability and final validation loss. Other algorithms, including Adam, Adafactor, and Lion, show comparable performance, challenging the assumption that Adam is universally superior. This equivalence holds across multiple scales and two transformer architecture variants, suggesting that decisions about which optimizer to use can be influenced by practical concerns such as computational efficiency and ease of implementation rather than strict performance metrics.

Dissecting Optimizer Components

To understand the underlying factors contributing to the performance and stability of these optimizers, the authors introduce two variants of Adam—Signum and Adalayer.

  1. Signum: Signum is a simplified version of Adam that updates parameters using the sign of the momentum. The empirical results show that Signum recovers both the performance and the hyperparameter stability of Adam, indicating that much of Adam's advantage can be attributed to its sign-based, per-coordinate normalization of updates (see the sketch after this list).
  2. Adalayer: Adalayer is a layerwise variant of Adam designed to examine the impact of preconditioning. The investigation reveals that the benefits of preconditioning in Adam are most pronounced in the last layer and LayerNorm parameters. Surprisingly, other parameters can be trained effectively using vanilla SGD, provided that an adaptive mechanism is applied to the last layer and LayerNorms.
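
To make the sign-based update concrete, here is a minimal sketch of a Signum-style step in PyTorch. The class name `SignumSketch`, its default hyperparameters, and the decoupled weight decay are illustrative assumptions, not the authors' implementation.

```python
import torch


class SignumSketch(torch.optim.Optimizer):
    """Signed-momentum update: theta <- theta - lr * sign(momentum)."""

    def __init__(self, params, lr=1e-3, beta=0.9, weight_decay=0.0):
        defaults = dict(lr=lr, beta=beta, weight_decay=weight_decay)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if "momentum" not in state:
                    state["momentum"] = torch.zeros_like(p)
                m = state["momentum"]
                # Exponential moving average of the gradients.
                m.mul_(group["beta"]).add_(p.grad, alpha=1 - group["beta"])
                # Decoupled weight decay, then a step in the sign direction.
                p.mul_(1 - group["lr"] * group["weight_decay"])
                p.add_(m.sign(), alpha=-group["lr"])
```

Dropping the second-moment estimate and keeping only the sign of the momentum is what makes this a useful ablation: any remaining gap between this update and Adam can then be attributed to Adam's per-coordinate preconditioning.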

Implications

These findings have several crucial implications:

  • Practical Optimization: The results suggest that in practical settings, the choice of optimizers should consider computational and memory constraints rather than assuming Adam's superiority.
  • Optimizer Design: The comparable performance of Adafactor and Lion with Adam indicates that more efficient optimizers can be designed without significantly compromising performance. This aligns with the trend towards designing scalable and efficient training algorithms.
  • Layerwise Adaptivity: The insight that most layers in a transformer model can be effectively trained with SGD, except the last layer and LayerNorm parameters, opens up new avenues for hybrid optimizer strategies. Such strategies could offer a trade-off between stability, performance, and computational efficiency (a rough parameter-grouping sketch follows this list).
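
As a rough illustration of such a hybrid strategy, the sketch below routes the last layer and LayerNorm parameters to an adaptive optimizer while everything else is trained with SGD plus momentum. The routing heuristics (matching names like `lm_head`, `ln`, or `norm`) and the learning rates are assumptions about a typical transformer implementation, not the paper's exact recipe.

```python
import torch


def build_hybrid_optimizers(model, sgd_lr=0.1, adam_lr=1e-3):
    """Adaptive updates for the last layer and LayerNorms, plain SGD elsewhere."""
    adaptive_params, sgd_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Heuristic routing: names containing "lm_head", "ln", or "norm" are
        # assumed to identify the output layer and LayerNorm parameters.
        if "lm_head" in name or "ln" in name or "norm" in name:
            adaptive_params.append(param)
        else:
            sgd_params.append(param)
    opt_sgd = torch.optim.SGD(sgd_params, lr=sgd_lr, momentum=0.9)
    opt_adam = torch.optim.Adam(adaptive_params, lr=adam_lr, betas=(0.9, 0.95))
    return opt_sgd, opt_adam
```

In a training loop, both optimizers would be stepped and zeroed together each iteration; the memory saving comes from not storing Adam's second-moment statistics for the bulk of the parameters.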

Future Directions

Future research could explore the following avenues:

  • Broader Architecture Sweep: Extending this analysis to various architectures and tasks (e.g., masked language modeling, fine-tuning) would provide a more holistic view of optimizer performance.
  • 2D Hyperparameter Interactions: Investigating the interactions between multiple hyperparameters (e.g., batch size and learning rate) would yield deeper insights into effective hyperparameter tuning strategies.
  • Adaptive Metrics: Developing metrics to dynamically adjust hyperparameters based on training feedback could lead to more robust and adaptive optimization techniques.

In conclusion, by rigorously comparing multiple optimizers and dissecting their components, this paper challenges the prevailing notion about the superiority of Adam in language model training. The insights from this study can guide practical decisions in model training and stimulate future research in developing more efficient and adaptive optimization algorithms.
