
Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency (2402.15926v2)

Published 24 Feb 2024 in cs.LG and stat.ML

Abstract: We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data, where the constant stepsize $\eta$ is so large that the loss initially oscillates. We show that GD exits this initial oscillatory phase rapidly -- in $\mathcal{O}(\eta)$ steps -- and subsequently achieves an $\tilde{\mathcal{O}}(1/(\eta t))$ convergence rate after $t$ additional steps. Our results imply that, given a budget of $T$ steps, GD can achieve an accelerated loss of $\tilde{\mathcal{O}}(1/T^2)$ with an aggressive stepsize $\eta := \Theta(T)$, without any use of momentum or variable stepsize schedulers. Our proof technique is versatile and also handles general classification loss functions (where exponential tails are needed for the $\tilde{\mathcal{O}}(1/T^2)$ acceleration), nonlinear predictors in the neural tangent kernel regime, and online stochastic gradient descent (SGD) with a large stepsize, under suitable separability conditions.


Summary

  • The paper demonstrates that using large step sizes in gradient descent induces an initial oscillatory loss phase that ultimately leads to rapid convergence.
  • It decomposes the optimization process into distinct phases, with a swift exit from the Edge of Stability enabling accelerated, stable descent.
  • The analysis extends to general loss functions and NTK-modeled nonlinear networks, revealing the universal benefits of non-monotonic optimization.

Large Stepsize Gradient Descent in Logistic Regression and Beyond: A Non-Monotonic Approach to Efficient Optimization

Introduction

Gradient descent (GD) and its variants are cornerstone algorithms for training modern machine learning models. Conventional wisdom in gradient-based optimization prescribes small, carefully tuned step sizes that ensure steady progress toward a minimum of the loss. This prescription overlooks the potential benefits of larger step sizes, which can, counterintuitively, expedite convergence despite inducing an initial period of non-monotonic, oscillatory loss. This paper rigorously studies such dynamics for logistic regression with linearly separable data, general classification loss functions, and two-layer neural networks in the Neural Tangent Kernel (NTK) regime. The central thesis is that an aggressive, constant step size can defy conventional norms, achieving accelerated loss minimization by navigating through an initial unstable phase of optimization.

Large Stepsize Gradient Descent for Logistic Regression

The paper first addresses logistic regression under a large constant step size. It dissects the optimization process into three distinct phases: an initial Edge of Stability (EoS) phase characterized by oscillatory loss values, a transitional phase leading to stabilization, and a final phase in which the loss decreases monotonically at a rate markedly faster than that obtained with conventional small step sizes. Through rigorous analysis supported by empirical evidence, the paper shows that the large step size drives a rapid exit from the EoS phase, and that this initially unstable behavior, when managed appropriately, accelerates overall convergence (a minimal simulation sketch follows the list of key findings below).

Key findings include:

  • During the EoS phase, the loss is non-monotonic from step to step, yet its average over iterations already exhibits a decreasing trend.
  • A precise characterization of the phase transition time, after which the algorithm enters a stable phase with monotonically decreasing loss.
  • Given a budget of T steps, a step size chosen proportional to T achieves an accelerated loss of order 1/T^2 (up to logarithmic factors), without momentum or variable step-size schedulers.
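
To make these dynamics concrete, here is a minimal, self-contained sketch (not taken from the paper; the synthetic separable data, step size, and horizon are illustrative assumptions) that runs constant-stepsize GD on the logistic loss. With a step size well above the stability threshold, the early losses are typically non-monotone before settling into a steady decrease.

```python
import numpy as np
from scipy.special import expit  # numerically stable logistic sigmoid

# Illustrative sketch: constant large-stepsize GD on the logistic loss with
# linearly separable data. Data, stepsize, and horizon are arbitrary choices.
rng = np.random.default_rng(0)

n, d = 32, 2
y = rng.choice([-1.0, 1.0], size=n)
X = rng.normal(size=(n, d))
X[:, 0] = y * (1.0 + rng.random(n))      # first coordinate carries the label: separable data

def logistic_loss(w):
    margins = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -margins))   # stable log(1 + exp(-margin))

def gradient(w):
    margins = y * (X @ w)
    coef = -y * expit(-margins)                    # derivative of log(1 + exp(-y x.w)) w.r.t. x.w
    return (coef[:, None] * X).mean(axis=0)

eta, T = 50.0, 1000                                # stepsize far above the stable regime
w = np.zeros(d)
losses = [logistic_loss(w)]
for _ in range(T):
    w -= eta * gradient(w)
    losses.append(logistic_loss(w))

# Early iterates are typically non-monotone; late iterates decrease steadily.
print("first 8 losses:", np.round(losses[:8], 3))
print("last 4 losses: ", np.round(losses[-4:], 6))
```

Sweeping the step size over several values and comparing the loss reached at a fixed budget T is a simple way to probe, empirically, the claimed advantage of step sizes that grow with T.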

Extension to General Losses and Nonlinear Models

The paper broadens the application of large step size GD beyond logistic regression to encompass general classification loss functions and nonlinear predictors modeled by two-layer networks in the NTK regime. This extension requires only mild regularity assumptions on the loss functions, such as Lipschitz continuity and self-bounded gradients. Importantly, the work establishes that non-monotonic optimization with large step sizes remains advantageous across a range of loss functions and model complexities, suggesting a universal principle at play.
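
As a small illustration of the kind of regularity conditions referenced above (the paper's exact assumptions may be stated differently), the following sketch numerically checks that the logistic and exponential losses have self-bounded derivatives, i.e. the magnitude of the derivative is bounded by the loss value itself, and that the logistic loss additionally has a bounded derivative.

```python
import numpy as np

# Sketch: check the self-bounded-derivative property |l'(z)| <= l(z) on a grid
# for two classification losses. This is illustrative; the paper's regularity
# assumptions may be phrased differently.
z = np.linspace(-10.0, 10.0, 2001)

# Logistic loss l(z) = log(1 + exp(-z)), computed stably, and its derivative.
logistic = np.logaddexp(0.0, -z)
logistic_grad = -1.0 / (1.0 + np.exp(z))          # l'(z) = -sigmoid(-z)

# Exponential loss l(z) = exp(-z) and its derivative.
exponential = np.exp(-z)
exponential_grad = -np.exp(-z)

# Self-boundedness: |l'(z)| <= l(z) everywhere on the grid.
print("logistic    |l'| <= l :", np.all(np.abs(logistic_grad) <= logistic))
print("exponential |l'| <= l :", np.all(np.abs(exponential_grad) <= exponential))

# The logistic loss additionally has a bounded (1-Lipschitz) derivative.
print("logistic    |l'| <= 1 :", np.all(np.abs(logistic_grad) <= 1.0))
```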

For the analysis in the NTK setting, several notable insights emerge (an illustrative training sketch follows the list):

  • The network width requirement scales polynomially with the inverse margin and the aggressiveness of the chosen step size, highlighting the synergy between model capacity and step size magnitude in achieving efficient optimization.
  • A nuanced understanding of how the loss function's properties—such as its growth rate, smoothness, and tail behavior—influence the optimization trajectory and convergence rates when employing large step sizes.
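
As a purely illustrative companion to these points (not the paper's construction; the width, data, step size, and parameterization below are assumptions), the following sketch trains a two-layer ReLU network with a frozen random second layer, an NTK-style parameterization, using constant large-stepsize GD on the logistic loss.

```python
import numpy as np
from scipy.special import expit

# Illustrative sketch: large constant-stepsize GD on the logistic loss for a
# two-layer ReLU network in an NTK-style parameterization (second layer frozen).
# Width, stepsize, data, and scaling are arbitrary choices, not the paper's.
rng = np.random.default_rng(1)

n, d, m = 32, 5, 512                        # samples, input dimension, hidden width
y = rng.choice([-1.0, 1.0], size=n)
X = rng.normal(size=(n, d))
X[:, 0] = y * (1.0 + rng.random(n))         # linearly separable data
X /= np.linalg.norm(X, axis=1, keepdims=True)

W = rng.normal(size=(m, d))                 # trainable first-layer weights
a = rng.choice([-1.0, 1.0], size=m)         # frozen second-layer signs

def loss_and_grad(W):
    pre = X @ W.T                                  # pre-activations, shape (n, m)
    f = (np.maximum(pre, 0.0) @ a) / np.sqrt(m)    # network output with 1/sqrt(m) scaling
    margins = y * f
    loss = np.mean(np.logaddexp(0.0, -margins))
    coef = -y * expit(-margins)                    # per-sample derivative of the logistic loss
    act = (pre > 0.0).astype(float)                # ReLU derivative
    grad = ((coef[:, None] * act) * (a / np.sqrt(m))).T @ X / n
    return loss, grad

eta, T = 20.0, 600                          # stepsize above the stability threshold at init
for t in range(T + 1):
    loss, grad = loss_and_grad(W)
    if t % 100 == 0:
        print(f"step {t:4d}  loss {loss:.4f}")
    W -= eta * grad
```

Varying the width m and the step size eta and tracking how long the initial oscillatory phase lasts gives a rough empirical feel for the width and step-size dependence discussed above.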

Theoretical Implications and Future Directions

This paper's theoretical contributions illuminate the mechanisms by which large step size GD circumvents traditional barriers to efficient optimization. In particular, it refutes the notion that a monotonically decreasing loss is a prerequisite for rapid convergence: an initial period of non-monotonicity, induced by a large step size, can set the stage for accelerated progress.

Looking ahead, several avenues for further research present themselves. These include relaxing the linear separability assumptions, exploring the interplay between step size magnitude and regularization, and extending the current analysis to encompass multi-layer networks and more complex data structures. Additionally, investigating the implicit biases induced by large step size GD and their implications for model generalization remains an area ripe for exploration.

Conclusion

In conclusion, this paper challenges prevailing optimization paradigms in machine learning, demonstrating through rigorous theory and empirical validation that large step size GD can decisively outperform its conservative counterparts. By embracing non-monotonicity, this work paves the way for more efficient and theoretically sound approaches to training machine learning models.