Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency (2402.15926v2)
Abstract: We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data, where the constant stepsize $\eta$ is so large that the loss initially oscillates. We show that GD exits this initial oscillatory phase rapidly -- in $\mathcal{O}(\eta)$ steps -- and subsequently achieves an $\tilde{\mathcal{O}}(1 / (\eta t))$ convergence rate after $t$ additional steps. Our results imply that, given a budget of $T$ steps, GD can achieve an accelerated loss of $\tilde{\mathcal{O}}(1/T^2)$ with an aggressive stepsize $\eta := \Theta(T)$, without any use of momentum or variable stepsize schedulers. Our proof technique is versatile and also handles general classification loss functions (where exponential tails are needed for the $\tilde{\mathcal{O}}(1/T^2)$ acceleration), nonlinear predictors in the neural tangent kernel regime, and online stochastic gradient descent (SGD) with a large stepsize, under suitable separability conditions.
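The two-phase behavior described in the abstract is easy to reproduce numerically. Below is a minimal illustrative sketch, not taken from the paper: full-batch GD on the logistic loss with a deliberately large constant stepsize, applied to synthetic linearly separable data. The data-generation scheme, the stepsize value `eta = 50.0`, and the step budget `T = 200` are assumptions chosen for illustration only; with such a stepsize the loss typically oscillates for the first iterations before settling into a decreasing phase.

```python
# Illustrative sketch (assumptions: synthetic separable data, eta = 50.0, T = 200).
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

rng = np.random.default_rng(0)
n, d = 100, 5
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = np.sign(X @ w_star)                 # labels in {-1, +1}; separable by construction

def logistic_loss(w):
    # mean of log(1 + exp(-y_i <x_i, w>)), computed stably
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

def gradient(w):
    # gradient of the logistic loss: -(1/n) sum_i sigmoid(-y_i <x_i, w>) y_i x_i
    coeffs = expit(-y * (X @ w)) * y
    return -(X * coeffs[:, None]).mean(axis=0)

T = 200
eta = 50.0                              # aggressive constant stepsize (eta = Theta(T) in spirit)
w = np.zeros(d)
losses = []
for t in range(T):
    losses.append(logistic_loss(w))
    w -= eta * gradient(w)

print("first 10 losses (oscillatory phase):", np.round(losses[:10], 3))
print("final loss after the stable phase: ", losses[-1])
```

Plotting `losses` against the iteration count makes the initial non-monotone phase and the subsequent fast decay visible; tuning `eta` relative to `T` is what the abstract's $\eta := \Theta(T)$ choice formalizes.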