
Deconstructing What Makes a Good Optimizer for Language Models

(arXiv:2407.07972)
Published Jul 10, 2024 in cs.LG and cs.AI

Abstract

Training language models becomes increasingly expensive with scale, prompting numerous attempts to improve optimization efficiency. Despite these efforts, the Adam optimizer remains the most widely used, due to a prevailing view that it is the most effective approach. We aim to compare several optimization algorithms, including SGD, Adafactor, Adam, and Lion, in the context of autoregressive language modeling across a range of model sizes, hyperparameters, and architecture variants. Our findings indicate that, except for SGD, these algorithms all perform comparably both in their optimal performance and also in terms of how they fare across a wide range of hyperparameter choices. Our results suggest to practitioners that the choice of optimizer can be guided by practical considerations like memory constraints and ease of implementation, as no single algorithm emerged as a clear winner in terms of performance or stability to hyperparameter misspecification. Given our findings, we further dissect these approaches, examining two simplified versions of Adam: a) signed momentum (Signum) which we see recovers both the performance and hyperparameter stability of Adam and b) Adalayer, a layerwise variant of Adam which we introduce to study Adam's preconditioning. Examining Adalayer leads us to the conclusion that the largest impact of Adam's preconditioning is restricted to the last layer and LayerNorm parameters, and, perhaps surprisingly, the remaining layers can be trained with SGD.

Figure: Momentum sensitivity with a fixed learning rate in three scenarios using different optimization algorithms.

Overview

  • The paper provides an extensive evaluation of several optimization algorithms for language model training, challenging the prevailing preference for the Adam optimizer.

  • Key findings reveal that optimizers such as Adafactor and Lion perform comparably to Adam across model sizes and hyperparameter settings, suggesting that practical concerns such as memory constraints and computational efficiency can guide the choice of optimizer.

  • The authors introduce simplifications and variants of Adam to dissect its components, concluding that many parameters can be trained effectively with SGD if an adaptive mechanism is applied to critical layers.

Deconstructing What Makes a Good Optimizer for Language Models

The paper "Deconstructing What Makes a Good Optimizer for Language Models" by Rosie Zhao, Depen Morwani, David Brandfonbrener, Nikhil Vyas, and Sham Kakade provides an extensive evaluation of several optimization algorithms for language model training. The authors contend that despite the prevailing preference for the Adam optimizer in the community, there is a lack of rigorous comparison among various optimizers under diverse conditions, such as different model sizes, hyperparameters, and architectures. This study addresses that gap by performing comprehensive sweeps and presenting a granular analysis of the stability and performance of several optimizers.

Methodological Framework

The authors test several optimization algorithms, including SGD, Adafactor, Adam, and Lion, focusing on autoregressive language modeling. They conduct large-scale experiments across model sizes (150M to 1.2B parameters) and hyperparameter sweeps to assess each optimizer's best attainable performance and its robustness to hyperparameter choices.

Key findings demonstrate that SGD generally underperforms compared to other optimizers both in terms of stability and final validation loss. Other algorithms, including Adam, Adafactor, and Lion, show comparable performance, challenging the assumption that Adam is universally superior. This equivalence holds across multiple scales and two transformer architecture variants, suggesting that decisions about which optimizer to use can be influenced by practical concerns such as computational efficiency and ease of implementation rather than strict performance metrics.

Dissecting Optimizer Components

To understand the underlying factors contributing to the performance and stability of these optimizers, the authors introduce two variants of Adam—Signum and Adalayer.

  1. Signum: Signum is a simplified version of Adam that updates parameters using the sign of the momentum. The empirical results show that Signum recovers both the performance and the hyperparameter stability of Adam, indicating that much of Adam's advantage can be attributed to its sign-based, per-coordinate normalization of updates (see the sketch after this list).
  2. Adalayer: Adalayer is a layerwise variant of Adam designed to examine the impact of preconditioning. The investigation reveals that the benefits of preconditioning in Adam are most pronounced in the last layer and LayerNorm parameters. Surprisingly, other parameters can be trained effectively using vanilla SGD, provided that an adaptive mechanism is applied to the last layer and LayerNorms.
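
To make the sign-based update concrete, here is a minimal sketch of a Signum-style step in PyTorch. The class name `SignumSketch`, its default hyperparameters, and the decoupled weight decay are illustrative assumptions, not the authors' implementation.

```python
import torch


class SignumSketch(torch.optim.Optimizer):
    """Signed-momentum update: theta <- theta - lr * sign(momentum)."""

    def __init__(self, params, lr=1e-3, beta=0.9, weight_decay=0.0):
        defaults = dict(lr=lr, beta=beta, weight_decay=weight_decay)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if "momentum" not in state:
                    state["momentum"] = torch.zeros_like(p)
                m = state["momentum"]
                # Exponential moving average of the gradients.
                m.mul_(group["beta"]).add_(p.grad, alpha=1 - group["beta"])
                # Decoupled weight decay, then a step in the sign direction.
                p.mul_(1 - group["lr"] * group["weight_decay"])
                p.add_(m.sign(), alpha=-group["lr"])
```

Dropping the second-moment estimate and keeping only the sign of the momentum is what makes this a useful ablation: any remaining gap between this update and Adam can then be attributed to Adam's per-coordinate preconditioning.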

Implications

These findings have several crucial implications:

  • Practical Optimization: The results suggest that in practical settings, the choice of optimizers should consider computational and memory constraints rather than assuming Adam's superiority.
  • Optimizer Design: The comparable performance of Adafactor and Lion with Adam indicates that more efficient optimizers can be designed without significantly compromising performance. This aligns with the trend towards designing scalable and efficient training algorithms.
  • Layerwise Adaptivity: The insight that most layers in a transformer model can be effectively trained with SGD, except the last layer and LayerNorm parameters, opens up new avenues for hybrid optimizer strategies. Such strategies could offer a trade-off between stability, performance, and computational efficiency (a rough parameter-grouping sketch follows this list).
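
As a rough illustration of such a hybrid strategy, the sketch below routes the last layer and LayerNorm parameters to an adaptive optimizer while everything else is trained with SGD plus momentum. The routing heuristics (matching names like `lm_head`, `ln`, or `norm`) and the learning rates are assumptions about a typical transformer implementation, not the paper's exact recipe.

```python
import torch


def build_hybrid_optimizers(model, sgd_lr=0.1, adam_lr=1e-3):
    """Adaptive updates for the last layer and LayerNorms, plain SGD elsewhere."""
    adaptive_params, sgd_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Heuristic routing: names containing "lm_head", "ln", or "norm" are
        # assumed to identify the output layer and LayerNorm parameters.
        if "lm_head" in name or "ln" in name or "norm" in name:
            adaptive_params.append(param)
        else:
            sgd_params.append(param)
    opt_sgd = torch.optim.SGD(sgd_params, lr=sgd_lr, momentum=0.9)
    opt_adam = torch.optim.Adam(adaptive_params, lr=adam_lr, betas=(0.9, 0.95))
    return opt_sgd, opt_adam
```

In a training loop, both optimizers would be stepped and zeroed together each iteration; the memory saving comes from not storing Adam's second-moment statistics for the bulk of the parameters.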

Future Directions

Future research could explore the following avenues:

  • Broader Architecture Sweep: Extending this analysis to various architectures and tasks (e.g., masked language modeling, fine-tuning) would provide a more holistic view of optimizer performance.
  • 2D Hyperparameter Interactions: Investigating the interactions between multiple hyperparameters (e.g., batch size and learning rate) would yield deeper insights into effective hyperparameter tuning strategies.
  • Adaptive Metrics: Developing metrics to dynamically adjust hyperparameters based on training feedback could lead to more robust and adaptive optimization techniques.

In conclusion, by rigorously comparing multiple optimizers and dissecting their components, this paper challenges the prevailing notion about the superiority of Adam in language model training. The insights from this study can guide practical decisions in model training and stimulate future research in developing more efficient and adaptive optimization algorithms.
