
Abstract

We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data, where the constant stepsize $\eta$ is so large that the loss initially oscillates. We show that GD exits this initial oscillatory phase rapidly -- in $\mathcal{O}(\eta)$ steps -- and subsequently achieves an $\tilde{\mathcal{O}}(1 / (\eta t))$ convergence rate after $t$ additional steps. Our results imply that, given a budget of $T$ steps, GD can achieve an accelerated loss of $\tilde{\mathcal{O}}(1/T^2)$ with an aggressive stepsize $\eta := \Theta(T)$, without any use of momentum or variable stepsize schedulers. Our proof technique is versatile and also handles general classification loss functions (where exponential tails are needed for the $\tilde{\mathcal{O}}(1/T^2)$ acceleration), nonlinear predictors in the neural tangent kernel regime, and online stochastic gradient descent (SGD) with a large stepsize, under suitable separability conditions.
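For concreteness, the setup can be written in its standard form (the paper's exact normalization and notation may differ): given linearly separable data $(x_i, y_i)_{i=1}^n$ with $y_i \in \{\pm 1\}$, GD with a constant stepsize $\eta$ minimizes the empirical logistic loss $L(w) = \frac{1}{n} \sum_{i=1}^{n} \log\bigl(1 + \exp(-y_i \langle w, x_i \rangle)\bigr)$ via the update $w_{t+1} = w_t - \eta \nabla L(w_t)$. When $\eta$ is large relative to the local curvature of $L$, the early iterates overshoot and $L(w_t)$ oscillates rather than decreasing monotonically; this is the regime analyzed here.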

Figure: Gradient descent behaviors in various computational scenarios.

Overview

  • This study examines how large constant step sizes in gradient descent (GD), notably for logistic regression and related models, can accelerate convergence by tolerating an initial phase of non-monotonic, oscillating losses.

  • For logistic regression under a large step size, the analysis divides the optimization into three phases: an initial phase with oscillatory losses, a transitional phase toward stability, and a final phase of steady, rapid loss reduction.

  • The paper expands its analysis from logistic regression to include general classification loss functions and two-layer neural networks within the Neural Tangent Kernel (NTK) regime, establishing the broad applicability and advantages of large step size GD across different settings.

  • Future research directions highlighted include examining the effects of step size and regularization, extending analyses to more complex models, and investigating the impact of large step size GD on model generalization.

Large Stepsize Gradient Descent in Logistic Regression and Beyond: A Non-Monotonic Approach to Efficient Optimization

Introduction

Modern machine learning optimization techniques, particularly gradient descent (GD) methods, are cornerstone algorithms for training various models. The traditional wisdom in gradient-based optimization prescribes small, carefully tuned step sizes to ensure steady progress toward a loss function minimum. However, this approach often overlooks the potential benefits of employing larger step sizes, which can, counterintuitively, expedite convergence despite inducing periods of non-monotonic, oscillatory loss progression. This paper rigorously explores such dynamics in the context of logistic regression with linearly separable data, general classification loss functions, and two-layer neural networks in the Neural Tangent Kernel (NTK) regime. The central thesis posits that an aggressive, constant step size can defy conventional norms, achieving accelerated loss minimization by navigating through an initial unstable phase of optimization.

Large Stepsize Gradient Descent for Logistic Regression

The research first addresses logistic regression with a constant, large step size. It dissects the optimization process into three distinct phases: an initial Edge of Stability (EoS) phase characterized by oscillatory loss values, a transitional phase leading to stabilization, and a final phase in which the loss decreases monotonically at a rate significantly faster than that of conventional small-stepsize GD. Through rigorous analysis supported by empirical evidence, the study shows that GD exits the EoS phase quickly, after a number of steps on the order of the step size $\eta$, so that the large step size accelerates overall convergence despite the initial instability. A minimal simulation of this behavior follows this paragraph.
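The sketch below assumes synthetic separable data and illustrative hyperparameters (the values of n, d, eta, and T are arbitrary choices, and this is not the paper's code); it runs full-batch GD on the logistic loss with a deliberately large constant stepsize and prints the early and late losses.

# Minimal sketch: full-batch GD on logistic regression with a large constant stepsize.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linearly separable data: label is the sign of the first coordinate,
# which is then shifted away from zero to create a positive margin.
n, d = 100, 5
X = rng.normal(size=(n, d))
y = np.sign(X[:, 0])
X[:, 0] += 0.5 * y

def logistic_loss(w):
    z = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -z))      # log(1 + exp(-z)), computed stably

def gradient(w):
    z = y * (X @ w)
    s = -0.5 * (1.0 - np.tanh(0.5 * z))        # ell'(z) = -1/(1 + exp(z)), stable
    return (X * (s * y)[:, None]).mean(axis=0)

T = 200                                        # step budget (illustrative)
eta = 50.0                                     # deliberately "large" constant stepsize
w = np.zeros(d)
losses = []
for _ in range(T):
    losses.append(logistic_loss(w))
    w -= eta * gradient(w)

print("first 10 losses:", np.round(losses[:10], 3))   # typically oscillatory
print("last 5 losses:  ", np.round(losses[-5:], 6))   # typically monotone and small

Rerunning the same loop with a small stepsize (e.g., eta = 1.0) typically gives a monotone but, within the same budget, noticeably slower decrease of the loss, consistent with the accelerated rate for large stepsizes described above.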

Key findings include:

  • During the EoS phase, the loss is non-monotonic from step to step, yet the loss averaged over the iterates shows a clear decreasing trend.
  • A precise characterization of the phase transition time, after which the algorithm enters a stable phase with monotonically decreasing loss.
  • Given a budget of $T$ steps, a step size chosen proportional to $T$ yields a substantially smaller final loss, of order $\tilde{\mathcal{O}}(1/T^2)$; a back-of-the-envelope calculation follows this list.
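To see where the $\tilde{\mathcal{O}}(1/T^2)$ figure comes from, one can combine the two rates quoted in the abstract. This is a heuristic calculation with constants and logarithmic factors suppressed, assuming the constant in $\eta = \Theta(T)$ is chosen so that the oscillatory phase consumes only a constant fraction of the budget:

$\eta = \Theta(T) \;\Longrightarrow\; \text{the oscillatory phase lasts } \mathcal{O}(\eta) = \mathcal{O}(T) \text{ steps, leaving } t = \Theta(T) \text{ stable steps, so the final loss is } \tilde{\mathcal{O}}\bigl(1/(\eta t)\bigr) = \tilde{\mathcal{O}}\bigl(1/T^2\bigr).$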

Extension to General Losses and Nonlinear Models

The paper broadens the application of large step size GD beyond logistic regression to general classification loss functions and to nonlinear predictors given by two-layer networks in the NTK regime. This extension requires only mild regularity assumptions on the loss, such as Lipschitz continuity and self-bounded gradients; an illustrative check for the logistic loss follows this paragraph. Importantly, the work establishes that non-monotonic optimization with large step sizes remains advantageous across a range of loss functions and model complexities, suggesting a universal principle at play.
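As an illustration of these regularity conditions (the paper's formal assumptions may be stated differently), consider the logistic loss $\ell(z) = \log(1 + e^{-z})$: its derivative is bounded in magnitude by $1$ (Lipschitz), and under one common formalization of self-boundedness, $|\ell'(z)| \le c\,\ell(z)$ holds with $c = 1$. A quick numerical sanity check of both properties:

# Numerical check of two regularity properties of the logistic loss
# ell(z) = log(1 + exp(-z)): |ell'(z)| <= 1 and |ell'(z)| <= ell(z).
# Illustrative only; the paper's formal assumptions may differ in detail.
import numpy as np

z = np.linspace(-20.0, 20.0, 10001)
ell = np.logaddexp(0.0, -z)                    # log(1 + exp(-z)), computed stably
ell_prime = -0.5 * (1.0 - np.tanh(0.5 * z))    # -1/(1 + exp(z)), computed stably

print("Lipschitz, |ell'| <= 1:     ", bool(np.all(np.abs(ell_prime) <= 1.0)))
print("self-bounded, |ell'| <= ell:", bool(np.all(np.abs(ell_prime) <= ell)))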

For the analysis in the NTK setting, several notable insights emerge:

  • The required network width scales polynomially with the inverse margin and with the magnitude of the step size, so more aggressive step sizes call for wider networks.
  • A characterization of how the loss function's properties (growth rate, smoothness, and tail behavior) influence the optimization trajectory and convergence rate under large step sizes; a toy two-layer simulation in this spirit follows this list.
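To make the NTK-regime setting concrete, the sketch below (a toy illustration under assumed hyperparameters, not the paper's construction) trains only the first layer of a two-layer ReLU network with a $1/\sqrt{m}$ output scaling and fixed random second-layer signs, again using full-batch GD with a large constant step size on separable data:

# Toy sketch in the spirit of the NTK-regime analysis (not the paper's construction).
# Width, stepsize, and data are illustrative; the admissible ranges in the paper
# are specified by its theorems.
import numpy as np

rng = np.random.default_rng(1)

n, d = 100, 5
X = rng.normal(size=(n, d))
y = np.sign(X[:, 0])
X[:, 0] += 0.5 * y                             # enforce a positive margin

m = 512                                        # network width
W = rng.normal(size=(m, d))                    # trained first-layer weights
a = rng.choice([-1.0, 1.0], size=m)            # fixed second-layer signs

def loss_and_grad(W):
    pre = X @ W.T                              # pre-activations, shape (n, m)
    f = (np.maximum(pre, 0.0) @ a) / np.sqrt(m)
    z = y * f
    loss = np.mean(np.logaddexp(0.0, -z))      # logistic loss, computed stably
    lp = -0.5 * (1.0 - np.tanh(0.5 * z))       # ell'(z) = -1/(1 + exp(z)), stable
    # dL/dW[j] = (1/n) * sum_i lp_i * y_i * (a_j / sqrt(m)) * 1[pre_ij > 0] * x_i
    coef = (lp * y)[:, None] * (pre > 0.0) * (a / np.sqrt(m))[None, :]
    return loss, coef.T @ X / n

T, eta = 200, 20.0                             # large constant stepsize (illustrative)
losses = []
for _ in range(T):
    loss, g = loss_and_grad(W)
    losses.append(loss)
    W -= eta * g

print("first 10 losses:", np.round(losses[:10], 3))
print("last 5 losses:  ", np.round(losses[-5:], 5))

Varying the width m together with the step size gives a qualitative feel for the width/step-size interplay noted above; the precise polynomial dependence is what the paper quantifies.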

Theoretical Implications and Future Directions

This study's theoretical contributions illuminate the mechanisms by which large step size GD circumvents traditional barriers to efficient optimization. Specifically, it challenges the notion that a monotonically decreasing loss is a prerequisite for rapid convergence, showing instead that an initial period of non-monotonicity induced by a large step size can set the stage for accelerated progress.

Looking ahead, several avenues for further research present themselves. These include relaxing the linear separability assumptions, exploring the interplay between step size magnitude and regularization, and extending the current analysis to encompass multi-layer networks and more complex data structures. Additionally, investigating the implicit biases induced by large step size GD and their implications for model generalization remains an area ripe for exploration.

Conclusion

In conclusion, this paper challenges prevailing optimization paradigms in machine learning, demonstrating through rigorous theory and empirical validation that large step size GD can decisively outperform its conservative counterparts. By embracing non-monotonicity, this work paves the way for more efficient and theoretically sound approaches to training machine learning models.
