On Empirical Comparisons of Optimizers for Deep Learning (1910.05446v3)

Published 11 Oct 2019 in cs.LG and stat.ML

Abstract: Selecting an optimizer is a central step in the contemporary deep learning pipeline. In this paper, we demonstrate the sensitivity of optimizer comparisons to the hyperparameter tuning protocol. Our findings suggest that the hyperparameter search space may be the single most important factor explaining the rankings obtained by recent empirical comparisons in the literature. In fact, we show that these results can be contradicted when hyperparameter search spaces are changed. As tuning effort grows without bound, more general optimizers should never underperform the ones they can approximate (i.e., Adam should never perform worse than momentum), but recent attempts to compare optimizers either assume these inclusion relationships are not practically relevant or restrict the hyperparameters in ways that break the inclusions. In our experiments, we find that inclusion relationships between optimizers matter in practice and always predict optimizer comparisons. In particular, we find that the popular adaptive gradient methods never underperform momentum or gradient descent. We also report practical tips around tuning often ignored hyperparameters of adaptive gradient methods and raise concerns about fairly benchmarking optimizers for neural network training.

Citations (239)

Summary

  • The paper demonstrates that comprehensive hyperparameter tuning critically determines optimizer rankings in deep learning.
  • Extensive experiments on models like ResNet and Transformers reveal that optimally tuned adaptive methods can match or outperform SGD.
  • The study emphasizes that meticulous tuning of hyperparameters, including often overlooked ones like ε, is what makes the inclusion relationships among optimizers visible in practice.

On Empirical Comparisons of Optimizers for Deep Learning

This paper examines the critical role of hyperparameter tuning protocols in empirical comparisons of optimizers for deep learning. Through extensive experimentation, the authors demonstrate that the hyperparameter search space may be the single most important factor determining optimizer rankings, challenging prior claims about the relative performance of adaptive methods such as Adam and momentum-based methods.

Introduction

The selection of an optimizer in deep learning is pivotal, affecting both the speed of training and the quality of the final model. In the absence of decisive theoretical guidance, practitioners typically rely on empirical studies to inform this choice. This paper critically examines the methodology of such empirical evaluations, focusing on how different hyperparameter tuning protocols can drastically alter optimizer rankings. The authors contend that, under sufficiently thorough tuning, more general optimizers such as Adam should never underperform the optimizers they can approximate, such as SGD with momentum.

Optimizer Definitions and Inclusion Hierarchies

The paper lays out a taxonomy of first-order optimization algorithms, observing that popular methods form a natural inclusion hierarchy. Crucially, Adam, RMSProp, and other adaptive methods possess hyperparameters that allow them to (approximately) simulate momentum-based SGD under specific configurations. These inclusion relationships imply that, with sufficiently comprehensive tuning, a more general optimizer should perform at least as well as any of its specializations.
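
The following minimal numerical sketch (an illustration under stated assumptions, not code from the paper) makes the Adam-includes-momentum relationship concrete: when ε is taken very large and the learning rate is rescaled by ε, Adam's update collapses to a bias-corrected exponential-moving-average momentum step. The toy noisy quadratic objective and the specific hyperparameter values are assumptions chosen purely for illustration.

```python
# Sketch: Adam with a very large epsilon (and learning rate rescaled by epsilon)
# tracks an SGD variant with exponential-moving-average momentum almost exactly.
# The toy noisy quadratic gradient and all constants below are illustrative.
import numpy as np

def adam_step(theta, m, v, g, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)          # bias-corrected first moment
    v_hat = v / (1 - beta2**t)          # bias-corrected second moment
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def ema_momentum_step(theta, m, g, t, lr, beta1=0.9):
    # SGD with a bias-corrected exponential-moving-average momentum buffer.
    m = beta1 * m + (1 - beta1) * g
    return theta - lr * m / (1 - beta1**t), m

rng = np.random.default_rng(0)
grad = lambda x: 2.0 * x + 0.1 * rng.standard_normal(x.shape)  # noisy quadratic

big_eps, base_lr = 1e8, 0.1
x_adam = x_mom = np.ones(5)
m_a = v_a = m_m = np.zeros(5)
for t in range(1, 101):
    g = grad(x_adam)  # feed both optimizers the same gradient stream
    x_adam, m_a, v_a = adam_step(x_adam, m_a, v_a, g, t,
                                 lr=base_lr * big_eps, eps=big_eps)
    x_mom, m_m = ema_momentum_step(x_mom, m_m, g, t, lr=base_lr)

print(np.max(np.abs(x_adam - x_mom)))  # tiny (~1e-7 or less): the updates coincide
```

With ε this large, the denominator √v̂ + ε is effectively constant, so the adaptive rescaling disappears and only the momentum-style numerator remains, which is the sense in which Adam contains momentum SGD as a special case.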

Experimental Methodology

To validate the practical importance of this hierarchy, the authors performed numerous experiments across various models and datasets, such as ResNet on CIFAR-10 and Transformer models on language tasks. The experiments meticulously tuned not only standard hyperparameters like the learning rate but also those that are often overlooked, such as the ε parameter in adaptive methods. The results indicate that the inclusion relationships only become apparent with sufficiently extensive tuning: under well-tuned conditions, adaptive methods consistently matched or outperformed momentum and plain gradient descent.
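
As a concrete illustration of this kind of tuning protocol, the sketch below runs a simple random search over an Adam search space that samples every hyperparameter, including ε, on a log scale. The ranges, the trial budget, and the synthetic `train_and_eval` stand-in are assumptions made for illustration, not the paper's exact search spaces or workloads.

```python
# Sketch of a random-search tuning protocol over an Adam search space that
# includes the often-ignored epsilon, sampled on a log scale.
import math
import random

def sample_adam_config(rng):
    # Sample every hyperparameter on a log scale; the ranges are illustrative.
    return {
        "learning_rate": 10 ** rng.uniform(-6.0, 0.0),
        "beta1": 1.0 - 10 ** rng.uniform(-3.0, 0.0),  # tune 1 - beta1 log-uniformly
        "beta2": 1.0 - 10 ** rng.uniform(-5.0, 0.0),
        "epsilon": 10 ** rng.uniform(-10.0, 1.0),     # spans many orders of magnitude
    }

def train_and_eval(config):
    # Arbitrary synthetic stand-in so the sketch runs end to end; a real
    # protocol would train the workload with `config` and return validation error.
    return (math.log10(config["learning_rate"]) + 3.0) ** 2

def random_search(num_trials=100, seed=0):
    rng = random.Random(seed)
    best_config, best_error = None, math.inf
    for _ in range(num_trials):
        config = sample_adam_config(rng)
        error = train_and_eval(config)
        if error < best_error:
            best_config, best_error = config, error
    return best_config, best_error

if __name__ == "__main__":
    config, error = random_search()
    print(config, error)
```

The key point is that the search space itself, not just the budget, determines the comparison: restricting ε or the momentum parameters to defaults breaks the inclusion relationships the experiments rely on.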

Results and Analysis

Significant results include:

  • Adaptive Methods vs. Momentum: Adam and RMSProp, when extensively tuned, never underperformed SGD and its variants. This contrasts with earlier findings that relied on more limited hyperparameter exploration.
  • Sensitivity to Tuning Protocol: The paper demonstrates the dramatic effect of the hyperparameter search space on optimizer performance; even modest changes to the search space can invert optimizer rankings.
  • Workload Variability: Different workloads responded differently to tuning: some showed negligible differences across optimizers, while others revealed substantial gaps.

Reconciling Prior Outcomes

The authors address discrepancies with prior work, notably that of Wilson et al., by showing that limited hyperparameter tuning, especially of ε, led to underestimates of adaptive methods. Re-evaluating those results with more thorough tuning protocols brings them into line with the authors' own findings, upholding the practical importance of the inclusion hierarchy.
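
To see why ε matters so much, consider Adam's per-coordinate effective step size α / (√v̂ + ε). The hypothetical numbers below (not taken from the paper) show that with the default tiny ε the update is almost fully normalized per coordinate, whereas a large, tuned ε pushes it toward uniform, SGD-like scaling, so fixing ε at its default constrains the behaviors a tuned Adam can express.

```python
# Hypothetical illustration of how epsilon reshapes Adam's per-coordinate
# effective step size alpha / (sqrt(v_hat) + eps). The gradient magnitudes and
# epsilon values are arbitrary choices for illustration.
import numpy as np

alpha = 1e-3
grad_rms = np.array([1e-4, 1e-2, 1.0])   # sqrt(v_hat) for three coordinates

for eps in (1e-8, 1e-2, 1.0):
    effective_step = alpha / (grad_rms + eps)
    print(f"eps={eps:g}: per-coordinate steps = {effective_step}")
# eps=1e-8: steps span ~4 orders of magnitude (strongly adaptive behavior)
# eps=1.0 : steps agree within about a factor of 2 (close to SGD-like scaling)
```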

Practical Implications and Future Directions

This paper suggests that practitioners should prioritize comprehensive hyperparameter tuning to exploit the inclusion relationships among optimizers. The findings warrant skepticism towards empirical studies that claim optimizer superiority without detailing their tuning protocols. Future research could extend these findings to a broader range of models and scales, and potentially refine methods for automatic hyperparameter tuning of adaptive optimizers.

Conclusion

The paper concludes that empirical comparisons of optimizers must carefully account for hyperparameter search spaces and inclusion relationships. Only then can adaptive optimizers realize their theoretical advantages and deliver performance at least as good as that of their specialized counterparts, which is essential for benchmarking and training deep learning systems effectively.
