SGDR: Stochastic Gradient Descent with Warm Restarts (1608.03983v5)

Published 13 Aug 2016 in cs.LG, cs.NE, and math.OC

Abstract: Restart techniques are common in gradient-free optimization to deal with multimodal functions. Partial warm restarts are also gaining popularity in gradient-based optimization to improve the rate of convergence in accelerated gradient schemes to deal with ill-conditioned functions. In this paper, we propose a simple warm restart technique for stochastic gradient descent to improve its anytime performance when training deep neural networks. We empirically study its performance on the CIFAR-10 and CIFAR-100 datasets, where we demonstrate new state-of-the-art results at 3.14% and 16.21%, respectively. We also demonstrate its advantages on a dataset of EEG recordings and on a downsampled version of the ImageNet dataset. Our source code is available at https://github.com/loshchil/SGDR

Authors (2)
  1. Ilya Loshchilov (18 papers)
  2. Frank Hutter (177 papers)
Citations (7,294)

Summary

  • The paper presents SGDR, which implements periodic warm restarts combined with cosine annealing to enhance SGD performance for training deep neural networks.
  • SGDR dynamically adjusts the learning rate to escape local minima and achieve state-of-the-art error rates on CIFAR-10 and CIFAR-100.
  • The approach accelerates anytime performance and demonstrates broad applicability across image and EEG data, paving the way for further optimization research.

SGDR: Stochastic Gradient Descent with Warm Restarts

Introduction

The paper "SGDR: Stochastic Gradient Descent with Warm Restarts" by Ilya Loshchilov and Frank Hutter proposes an innovative technique for optimizing deep neural networks (DNNs) training efficiency. The authors present a simple yet effective method, Stochastic Gradient Descent with Warm Restarts (SGDR), to enhance the anytime performance of SGD by leveraging periodic warm restarts. The effectiveness of SGDR is empirically demonstrated across several datasets, including CIFAR-10, CIFAR-100, an electroencephalography (EEG) recordings dataset, and a downsampled version of ImageNet.

Methodology

SGDR simulates warm restarts by cyclically varying the learning rate $\eta_t$ over epochs. Within each restart period of length $T_i$, the learning rate is decreased with a cosine annealing schedule,
$$\eta_t = \eta^i_{min} + 0.5\,(\eta^i_{max} - \eta^i_{min})\left(1 + \cos\left(\frac{T_{cur}}{T_i}\pi\right)\right),$$
where $T_{cur}$ is the number of epochs elapsed since the last restart. The initial restart period $T_0$ can also be progressively increased by a factor $T_{mult}$ after each restart. This approach allows the model to explore different regions of the parameter space more effectively, potentially escaping local minima and improving generalization.
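To make the schedule concrete, the following minimal sketch computes $\eta_t$ per epoch for given $T_0$ and $T_{mult}$; the function and parameter names are illustrative and not taken from the authors' released code.

```python
import math

def sgdr_learning_rate(epoch, eta_min=0.0, eta_max=0.1, T_0=10, T_mult=2):
    """Cosine-annealed learning rate with warm restarts (SGDR-style schedule).

    eta_t = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * T_cur / T_i)),
    where T_i starts at T_0 and is multiplied by T_mult after each restart.
    `epoch` may be fractional to emulate per-batch updates.
    """
    T_i, T_cur = T_0, epoch
    # Skip past completed restart periods to locate the current one.
    while T_cur >= T_i:
        T_cur -= T_i
        T_i *= T_mult
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * T_cur / T_i))

# Example: the rate resets to eta_max at epochs 10 and 30 when T_0=10, T_mult=2.
for epoch in range(31):
    print(epoch, round(sgdr_learning_rate(epoch), 4))
```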

Empirical Evaluation

The empirical evaluation of SGDR on the CIFAR-10 and CIFAR-100 datasets using Wide Residual Networks (WRNs) demonstrates its advantage over standard learning rate schedules. SGDR achieved state-of-the-art results, with a test error of 3.14% on CIFAR-10 and 16.21% on CIFAR-100, by leveraging ensembles built from snapshots taken at the end of each restart phase. These results are particularly notable given that the comparison includes widely recognized architectures such as Residual Networks and WRNs.
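The ensembling step itself is straightforward: a model snapshot is saved at the end of each restart, and their predictions are combined at test time. The sketch below averages softmax outputs over snapshot checkpoints; it assumes PyTorch and checkpoints saved as state dicts, and averaging probabilities is a common choice rather than a detail prescribed by the paper.

```python
import torch
import torch.nn.functional as F

def snapshot_ensemble_predict(model, snapshot_paths, inputs):
    """Average softmax predictions over snapshots saved at the end of each restart.

    `snapshot_paths` lists checkpoint files (assumed to be state dicts),
    one per restart; the names here are illustrative.
    """
    avg_probs = None
    for path in snapshot_paths:
        model.load_state_dict(torch.load(path, map_location="cpu"))
        model.eval()
        with torch.no_grad():
            probs = F.softmax(model(inputs), dim=1)
        avg_probs = probs if avg_probs is None else avg_probs + probs
    return avg_probs / len(snapshot_paths)
```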

The paper also reports results from training on a dataset of EEG recordings, where SGDR significantly improved classification performance for decoded movements (right/left hand and foot) over traditional learning rate schedules, suggesting applicability beyond image data.

Implications and Future Work

The primary practical implication of SGDR is the reduction in training time required to reach comparable levels of accuracy compared to conventional learning rate schedules. This improvement in anytime performance can expedite the development and deployment of DNNs, particularly in resource-constrained scenarios.

On the theoretical side, the results validate the utility of restart mechanisms in gradient-based optimization, a strategy traditionally explored in gradient-free settings. The empirical results reinforce the versatility and robustness of warm restarts in navigating the optimization landscape of DNNs.

Looking forward, SGDR opens several avenues for further research. Future work could extend the utility of warm restarts to other advanced optimization methods like AdaDelta and Adam. Additionally, investigating the integration of warm restarts with more sophisticated network architectures such as DenseNets or exploring compression techniques to make ensemble models more computationally efficient could further enhance the empirical performance gains.
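As one concrete direction, cosine annealing with warm restarts can already be paired with adaptive methods such as Adam in standard frameworks. The snippet below is a minimal sketch using PyTorch's built-in CosineAnnealingWarmRestarts scheduler; the model and training loop are placeholders, and this illustrates the suggested combination rather than an experiment from the paper.

```python
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

model = nn.Linear(784, 10)  # placeholder model
optimizer = optim.Adam(model.parameters(), lr=1e-3)
# T_0: length of the first restart period (in epochs); T_mult: period growth factor.
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2, eta_min=1e-5)

for epoch in range(30):
    # ... one epoch of training with `optimizer` would go here ...
    scheduler.step()  # advance the cosine schedule and handle restarts
```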

Conclusion

The proposed SGDR mechanism offers a significant enhancement to the training process of DNNs, achieving state-of-the-art performance across multiple datasets. By leveraging periodic warm restarts, SGDR facilitates more efficient exploration of the optimization landscape, contributing to faster convergence and improved anytime performance. The empirical results demonstrate the potential of SGDR as a valuable tool for researchers and practitioners aiming to optimize the training of deep learning models.
