AdaPlus: Integrating Nesterov Momentum and Precise Stepsize Adjustment on AdamW Basis (2309.01966v2)
Abstract: This paper proposes an efficient optimizer called AdaPlus, which integrates Nesterov momentum and precise stepsize adjustment on an AdamW basis. AdaPlus combines the advantages of AdamW, Nadam, and AdaBelief and, in particular, introduces no extra hyper-parameters. We perform extensive experimental evaluations on three machine learning tasks to validate the effectiveness of AdaPlus. The results show that AdaPlus (i) performs the most comparably to (and even slightly better than) SGD with momentum among all evaluated adaptive methods on image classification tasks, and (ii) outperforms other state-of-the-art optimizers on language modeling tasks and exhibits high stability when training GANs. The experiment code of AdaPlus will be available at: https://github.com/guanleics/AdaPlus.
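The abstract names AdaPlus's three ingredients: Nadam-style Nesterov momentum, AdaBelief-style stepsize adjustment based on how far the gradient deviates from its momentum estimate, and AdamW-style decoupled weight decay. As a rough, hypothetical illustration of how these pieces could fit together (not the paper's exact update rule, which is defined in the full text), the Python sketch below performs one such combined update on a NumPy array; the function name `adaplus_step` and all default hyper-parameter values are assumptions for illustration only.

```python
import numpy as np

def adaplus_step(theta, g, m, s, t, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, weight_decay=1e-2):
    """One hypothetical AdaPlus-style update (a sketch, not the paper's exact rule)."""
    # Adam/Nadam first moment (momentum) of the gradient.
    m = beta1 * m + (1 - beta1) * g
    # AdaBelief-style second moment of the deviation (g - m):
    # a small deviation means high "belief" in the gradient, hence a larger step.
    s = beta2 * s + (1 - beta2) * (g - m) ** 2
    # Standard bias corrections for the exponential moving averages.
    m_hat = m / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    # Nadam-style Nesterov look-ahead: blend the corrected momentum with the
    # bias-corrected current gradient in the numerator.
    nesterov = beta1 * m_hat + (1 - beta1) * g / (1 - beta1 ** t)
    # AdamW-style decoupled weight decay, applied directly to the weights
    # rather than folded into the gradient.
    theta = theta - lr * (nesterov / (np.sqrt(s_hat) + eps) + weight_decay * theta)
    return theta, m, s

# Toy usage: minimize f(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta = np.ones(3)
m = np.zeros_like(theta)
s = np.zeros_like(theta)
for t in range(1, 201):
    g = theta                      # gradient of the toy objective
    theta, m, s = adaplus_step(theta, g, m, s, t, lr=1e-1)
print(theta)                       # should be close to the zero vector
```

Using `(g - m) ** 2` rather than `g ** 2` in the denominator is what distinguishes the AdaBelief-style stepsize from Adam's, while the decoupled weight-decay term keeps regularization independent of the adaptive scaling, mirroring AdamW.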
- “On the importance of initialization and momentum in deep learning,” in International Conference on Machine Learning. PMLR, 2013, pp. 1139–1147.
- “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
- Y. Nesterov, “A method of solving a convex programming problem with convergence rate O(1/k^2),” Sov. Math. Dokl., vol. 27, 1983.
- Timothy Dozat, “Incorporating Nesterov momentum into Adam,” 2016.
- “AdaBelief optimizer: Adapting stepsizes by the belief in observed gradients,” Advances in Neural Information Processing Systems, vol. 33, pp. 18795–18806, 2020.
- “On the variance of the adaptive learning rate and beyond,” arXiv preprint arXiv:1908.03265, 2019.
- “Win: Weight-decay-integrated Nesterov acceleration for adaptive gradient algorithms,” in The Eleventh International Conference on Learning Representations, 2023.
- “Symbolic discovery of optimization algorithms,” arXiv preprint arXiv:2302.06675, 2023.
- “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
- “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
- “Densely connected convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
- “Long short-term memory neural network for traffic speed prediction using remote microwave sensor data,” Transportation Research Part C: Emerging Technologies, vol. 54, pp. 187–197, 2015.
- “Wasserstein generative adversarial networks,” in International Conference on Machine Learning. PMLR, 2017, pp. 214–223.
- “Improved techniques for training GANs,” Advances in Neural Information Processing Systems, vol. 29, 2016.
- “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. 7, 2011.
- “Lecture 6.5 - RMSProp, COURSERA: Neural Networks for Machine Learning,” University of Toronto, Technical Report, vol. 6, 2012.
- “Adaptive methods for nonconvex optimization,” in Advances in Neural Information Processing Systems, 2018, vol. 31, pp. 9815–9825.
- “On the convergence of Adam and beyond,” arXiv preprint arXiv:1904.09237, 2019.
- “Rethinking Adam: A twofold exponential moving average approach,” arXiv preprint arXiv:2106.11514, 2021.
- “Adan: Adaptive Nesterov momentum algorithm for faster optimizing deep models,” arXiv preprint arXiv:2208.06677, 2022.
- “XGrad: Boosting gradient-based optimizers with weight prediction,” arXiv preprint arXiv:2305.18240, 2023.
- “XPipe: Efficient pipeline model parallelism for multi-GPU DNN training,” arXiv preprint arXiv:1911.04610, 2019.