Implicit Bias of AdamW: $\ell_\infty$ Norm Constrained Optimization (2404.04454v1)
Abstract: Adam with decoupled weight decay, also known as AdamW, is widely acclaimed for its superior performance on language modeling tasks, surpassing Adam with $\ell_2$ regularization in both generalization and optimization. However, this advantage is not theoretically well understood. One challenge is that, although Adam with $\ell_2$ regularization intuitively optimizes the $\ell_2$-regularized loss, it is not clear whether AdamW optimizes any specific objective. In this work, we make progress toward understanding the benefit of AdamW by showing that it implicitly performs constrained optimization. More concretely, we show that in the full-batch setting, if AdamW converges with any non-increasing learning-rate schedule whose partial sums diverge, it must converge to a KKT point of the original loss under the constraint that the $\ell_\infty$ norm of the parameters is bounded by the inverse of the weight decay factor. This result builds on the observation that Adam can be viewed as a smoothed version of SignGD, the normalized steepest descent with respect to the $\ell_\infty$ norm, and on a surprising connection between normalized steepest descent with weight decay and the Frank-Wolfe algorithm.
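To make the constraint concrete, here is a minimal numerical sketch (not from the paper): on a toy quadratic loss $\frac{1}{2}\|x - b\|^2$, both sign descent with decoupled weight decay and AdamW settle near the projection of the unconstrained minimizer $b$ onto the $\ell_\infty$ ball of radius $1/\lambda$, which is exactly the KKT point the abstract describes. The toy loss, the vector `b`, and all hyperparameters below are illustrative assumptions.

```python
# Minimal sketch (assumptions: toy loss 0.5*||x - b||^2, illustrative hyperparameters).
# Prediction from the abstract: iterates settle near a KKT point of the loss under
# ||x||_inf <= 1/lambda, which for this loss is clip(b, -1/lambda, 1/lambda).

import numpy as np

b = np.array([4.0, -3.0, 0.5, 8.0, -1.0])   # unconstrained minimizer of the toy loss
wd = 0.5                                     # weight decay lambda; predicted bound 1/lambda = 2


def grad(x):
    # Gradient of the toy loss 0.5 * ||x - b||^2.
    return x - b


def signgd_wd(x, lr=0.01, steps=20000):
    # Sign descent (normalized steepest descent w.r.t. l_inf) with decoupled weight decay:
    # x <- (1 - lr*wd) * x - lr * sign(grad(x)).
    for _ in range(steps):
        x = (1.0 - lr * wd) * x - lr * np.sign(grad(x))
    return x


def adamw(x, lr=0.01, steps=20000, beta1=0.9, beta2=0.999, eps=1e-8):
    # Standard AdamW update with bias correction and decoupled weight decay.
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        x = (1.0 - lr * wd) * x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x


x0 = np.zeros_like(b)
print("constrained (KKT) point:", np.clip(b, -1 / wd, 1 / wd))
print("SignGD + weight decay  :", np.round(signgd_wd(x0.copy()), 2))
print("AdamW                  :", np.round(adamw(x0.copy()), 2))
```

On coordinates with $|b_i| < 1/\lambda$ the constraint is inactive and the iterates track $b_i$ (up to an oscillation of order the learning rate); on the remaining coordinates they saturate at $\pm 1/\lambda$, matching the constrained KKT characterization.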