
Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency (2402.15926v2)

Published 24 Feb 2024 in cs.LG and stat.ML

Abstract: We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data, where the constant stepsize $\eta$ is so large that the loss initially oscillates. We show that GD exits this initial oscillatory phase rapidly -- in $\mathcal{O}(\eta)$ steps -- and subsequently achieves an $\tilde{\mathcal{O}}(1/(\eta t))$ convergence rate after $t$ additional steps. Our results imply that, given a budget of $T$ steps, GD can achieve an accelerated loss of $\tilde{\mathcal{O}}(1/T^2)$ with an aggressive stepsize $\eta := \Theta(T)$, without any use of momentum or variable stepsize schedulers. Our proof technique is versatile and also handles general classification loss functions (where exponential tails are needed for the $\tilde{\mathcal{O}}(1/T^2)$ acceleration), nonlinear predictors in the neural tangent kernel regime, and online stochastic gradient descent (SGD) with a large stepsize, under suitable separability conditions.


Summary

  • The paper demonstrates that using large step sizes in gradient descent induces an initial oscillatory loss phase that ultimately leads to rapid convergence.
  • It decomposes the optimization process into distinct phases, with a swift exit from the Edge of Stability enabling accelerated, stable descent.
  • The analysis extends to general loss functions and NTK-modeled nonlinear networks, revealing the universal benefits of non-monotonic optimization.

Large Stepsize Gradient Descent in Logistic Regression and Beyond: A Non-Monotonic Approach to Efficient Optimization

Introduction

Gradient descent (GD) and its variants are cornerstone algorithms for training modern machine learning models. Conventional wisdom in gradient-based optimization prescribes small, carefully tuned step sizes that ensure steady progress toward a minimum of the loss. This prescription overlooks the potential benefits of larger step sizes, which can, counterintuitively, expedite convergence despite inducing an initial period of non-monotonic, oscillatory loss. This paper rigorously studies such dynamics for logistic regression with linearly separable data, general classification loss functions, and two-layer neural networks in the Neural Tangent Kernel (NTK) regime. The central thesis is that an aggressive, constant step size can defy conventional norms, achieving accelerated loss minimization by navigating through an initial unstable phase of optimization.

Large Stepsize Gradient Descent for Logistic Regression

The paper first addresses logistic regression under a large constant step size. It dissects the optimization process into three distinct phases: an initial Edge of Stability (EoS) phase characterized by oscillatory loss values, a transitional phase leading to stabilization, and a final phase in which the loss decreases monotonically at a rate markedly faster than that obtained with conventional small step sizes. Through rigorous analysis supported by empirical evidence, the paper shows that the large step size drives a rapid exit from the EoS phase, and that this initially unstable behavior, when managed appropriately, accelerates overall convergence (a minimal simulation sketch follows the list of key findings below).

Key findings include:

  • During the EoS phase, the loss is non-monotonic from step to step, yet its average over iterations already exhibits a decreasing trend.
  • A precise characterization of the phase transition time, after which the algorithm enters a stable phase with monotonically decreasing loss.
  • Given a budget of T steps, a step size chosen proportional to T achieves an accelerated loss of order 1/T^2 (up to logarithmic factors), without momentum or variable step-size schedulers.
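
To make these dynamics concrete, here is a minimal, self-contained sketch (not taken from the paper; the synthetic separable data, step size, and horizon are illustrative assumptions) that runs constant-stepsize GD on the logistic loss. With a step size well above the stability threshold, the early losses are typically non-monotone before settling into a steady decrease.

```python
import numpy as np
from scipy.special import expit  # numerically stable logistic sigmoid

# Illustrative sketch: constant large-stepsize GD on the logistic loss with
# linearly separable data. Data, stepsize, and horizon are arbitrary choices.
rng = np.random.default_rng(0)

n, d = 32, 2
y = rng.choice([-1.0, 1.0], size=n)
X = rng.normal(size=(n, d))
X[:, 0] = y * (1.0 + rng.random(n))      # first coordinate carries the label: separable data

def logistic_loss(w):
    margins = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -margins))   # stable log(1 + exp(-margin))

def gradient(w):
    margins = y * (X @ w)
    coef = -y * expit(-margins)                    # derivative of log(1 + exp(-y x.w)) w.r.t. x.w
    return (coef[:, None] * X).mean(axis=0)

eta, T = 50.0, 1000                                # stepsize far above the stable regime
w = np.zeros(d)
losses = [logistic_loss(w)]
for _ in range(T):
    w -= eta * gradient(w)
    losses.append(logistic_loss(w))

# Early iterates are typically non-monotone; late iterates decrease steadily.
print("first 8 losses:", np.round(losses[:8], 3))
print("last 4 losses: ", np.round(losses[-4:], 6))
```

Sweeping the step size over several values and comparing the loss reached at a fixed budget T is a simple way to probe, empirically, the claimed advantage of step sizes that grow with T.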

Extension to General Losses and Nonlinear Models

The paper broadens the application of large step size GD beyond logistic regression to encompass general classification loss functions and nonlinear predictors modeled by two-layer networks in the NTK regime. This extension requires only mild regularity assumptions on the loss functions, such as Lipschitz continuity and self-bounded gradients. Importantly, the work establishes that non-monotonic optimization with large step sizes remains advantageous across a range of loss functions and model complexities, suggesting a universal principle at play.
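
As a small illustration of the kind of regularity conditions referenced above (the paper's exact assumptions may be stated differently), the following sketch numerically checks that the logistic and exponential losses have self-bounded derivatives, i.e. the magnitude of the derivative is bounded by the loss value itself, and that the logistic loss additionally has a bounded derivative.

```python
import numpy as np

# Sketch: check the self-bounded-derivative property |l'(z)| <= l(z) on a grid
# for two classification losses. This is illustrative; the paper's regularity
# assumptions may be phrased differently.
z = np.linspace(-10.0, 10.0, 2001)

# Logistic loss l(z) = log(1 + exp(-z)), computed stably, and its derivative.
logistic = np.logaddexp(0.0, -z)
logistic_grad = -1.0 / (1.0 + np.exp(z))          # l'(z) = -sigmoid(-z)

# Exponential loss l(z) = exp(-z) and its derivative.
exponential = np.exp(-z)
exponential_grad = -np.exp(-z)

# Self-boundedness: |l'(z)| <= l(z) everywhere on the grid.
print("logistic    |l'| <= l :", np.all(np.abs(logistic_grad) <= logistic))
print("exponential |l'| <= l :", np.all(np.abs(exponential_grad) <= exponential))

# The logistic loss additionally has a bounded (1-Lipschitz) derivative.
print("logistic    |l'| <= 1 :", np.all(np.abs(logistic_grad) <= 1.0))
```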

For the analysis in the NTK setting, several notable insights emerge (an illustrative training sketch follows the list):

  • The network width requirement scales polynomially with the inverse margin and the aggressiveness of the chosen step size, highlighting the synergy between model capacity and step size magnitude in achieving efficient optimization.
  • A nuanced understanding of how the loss function's properties—such as its growth rate, smoothness, and tail behavior—influence the optimization trajectory and convergence rates when employing large step sizes.
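
As a purely illustrative companion to these points (not the paper's construction; the width, data, step size, and parameterization below are assumptions), the following sketch trains a two-layer ReLU network with a frozen random second layer, an NTK-style parameterization, using constant large-stepsize GD on the logistic loss.

```python
import numpy as np
from scipy.special import expit

# Illustrative sketch: large constant-stepsize GD on the logistic loss for a
# two-layer ReLU network in an NTK-style parameterization (second layer frozen).
# Width, stepsize, data, and scaling are arbitrary choices, not the paper's.
rng = np.random.default_rng(1)

n, d, m = 32, 5, 512                        # samples, input dimension, hidden width
y = rng.choice([-1.0, 1.0], size=n)
X = rng.normal(size=(n, d))
X[:, 0] = y * (1.0 + rng.random(n))         # linearly separable data
X /= np.linalg.norm(X, axis=1, keepdims=True)

W = rng.normal(size=(m, d))                 # trainable first-layer weights
a = rng.choice([-1.0, 1.0], size=m)         # frozen second-layer signs

def loss_and_grad(W):
    pre = X @ W.T                                  # pre-activations, shape (n, m)
    f = (np.maximum(pre, 0.0) @ a) / np.sqrt(m)    # network output with 1/sqrt(m) scaling
    margins = y * f
    loss = np.mean(np.logaddexp(0.0, -margins))
    coef = -y * expit(-margins)                    # per-sample derivative of the logistic loss
    act = (pre > 0.0).astype(float)                # ReLU derivative
    grad = ((coef[:, None] * act) * (a / np.sqrt(m))).T @ X / n
    return loss, grad

eta, T = 20.0, 600                          # stepsize above the stability threshold at init
for t in range(T + 1):
    loss, grad = loss_and_grad(W)
    if t % 100 == 0:
        print(f"step {t:4d}  loss {loss:.4f}")
    W -= eta * grad
```

Varying the width m and the step size eta and tracking how long the initial oscillatory phase lasts gives a rough empirical feel for the width and step-size dependence discussed above.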

Theoretical Implications and Future Directions

This paper's theoretical contributions illuminate the mechanisms by which large step size GD circumvents traditional barriers to efficient optimization. In particular, it refutes the notion that a monotonically decreasing loss is a prerequisite for rapid convergence: an initial period of non-monotonicity, induced by a large step size, can set the stage for accelerated progress.

Looking ahead, several avenues for further research present themselves. These include relaxing the linear separability assumptions, exploring the interplay between step size magnitude and regularization, and extending the current analysis to encompass multi-layer networks and more complex data structures. Additionally, investigating the implicit biases induced by large step size GD and their implications for model generalization remains an area ripe for exploration.

Conclusion

In conclusion, this paper challenges prevailing optimization paradigms in machine learning, demonstrating through rigorous theory and empirical validation that large step size GD can decisively outperform its conservative counterparts. By embracing non-monotonicity, this work paves the way for more efficient and theoretically sound approaches to training machine learning models.