Implicit Bias of AdamW: $\ell_\infty$ Norm Constrained Optimization (2404.04454v1)
Abstract: Adam with decoupled weight decay, also known as AdamW, is widely acclaimed for its superior performance in language modeling tasks, surpassing Adam with $\ell_2$ regularization in terms of generalization and optimization. However, this advantage is not theoretically well-understood. One challenge is that, while Adam with $\ell_2$ regularization intuitively optimizes the $\ell_2$-regularized loss, it is not clear whether AdamW optimizes any specific objective. In this work, we make progress toward understanding the benefit of AdamW by showing that it implicitly performs constrained optimization. More concretely, we show that in the full-batch setting, if AdamW converges with any non-increasing learning rate schedule whose partial sum diverges, it must converge to a KKT point of the original loss under the constraint that the $\ell_\infty$ norm of the parameter is bounded by the inverse of the weight decay factor. This result builds on the observation that Adam can be viewed as a smoothed version of SignGD, which is normalized steepest descent with respect to the $\ell_\infty$ norm, and on a surprising connection between normalized steepest descent with weight decay and Frank-Wolfe.
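To make the Frank-Wolfe connection concrete, below is a minimal numpy sketch (not from the paper) of SignGD with decoupled weight decay on a toy quadratic loss; the loss, the step-size schedule, and the names `target`, `lam`, and `loss_grad` are illustrative assumptions. It shows that each update is a convex combination of the current iterate and a vertex of the $\ell_\infty$ ball of radius $1/\lambda$, so the iterates remain inside that ball, consistent with the constrained-optimization view described above.

```python
import numpy as np

# Minimal sketch (illustrative, not from the paper): SignGD with decoupled
# weight decay on a toy quadratic loss. The unconstrained minimizer ("target")
# is placed partly outside the l_inf ball of radius 1/lam, so the implicit
# constraint becomes active on some coordinates.
np.random.seed(0)
d = 5
target = 1.5 * np.random.randn(d)        # illustrative unconstrained minimizer
loss_grad = lambda x: x - target         # gradient of 0.5 * ||x - target||^2

lam = 1.0                                # decoupled weight-decay factor
radius = 1.0 / lam                       # predicted l_inf constraint radius

x = np.zeros(d)
for t in range(1, 5001):
    eta = 0.1 / np.sqrt(t)               # non-increasing schedule, divergent sum
    g = loss_grad(x)
    # SignGD with decoupled weight decay:
    #     x <- x - eta * (sign(g) + lam * x)
    # which rearranges into a Frank-Wolfe step on {s : ||s||_inf <= 1/lam}:
    #     x <- (1 - eta*lam) * x + (eta*lam) * (-sign(g) / lam),
    # since -sign(g)/lam minimizes <g, s> over that ball.
    x = (1.0 - eta * lam) * x + (eta * lam) * (-np.sign(g) / lam)

print("||x||_inf after training :", np.max(np.abs(x)))   # stays <= 1/lam
print("constraint radius 1/lam  :", radius)
# Coordinates whose unconstrained optimum lies outside the ball end up pinned
# at the boundary +/- 1/lam; the others settle near their unconstrained value.
print("boundary coordinates     :", np.isclose(np.abs(x), radius, atol=1e-2))
print("|target| exceeds 1/lam   :", np.abs(target) > radius)
```

The same algebra underlies the connection in the abstract: whenever $\eta_t \lambda \in (0, 1]$, the update is a convex combination of the iterate and a vertex of the $\ell_\infty$ ball of radius $1/\lambda$, so the ball is invariant once entered. AdamW replaces the raw sign with a moment-smoothed direction, which is where the paper's analysis departs from this toy sketch.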