Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability

Published 26 Feb 2021 in cs.LG and stat.ML | (2103.00065v3)

Abstract: We empirically demonstrate that full-batch gradient descent on neural network training objectives typically operates in a regime we call the Edge of Stability. In this regime, the maximum eigenvalue of the training loss Hessian hovers just above the numerical value $2 / \text{(step size)}$, and the training loss behaves non-monotonically over short timescales, yet consistently decreases over long timescales. Since this behavior is inconsistent with several widespread presumptions in the field of optimization, our findings raise questions as to whether these presumptions are relevant to neural network training. We hope that our findings will inspire future efforts aimed at rigorously understanding optimization at the Edge of Stability. Code is available at https://github.com/locuslab/edge-of-stability.


Authors (5)

Citations (225)


Summary

The paper demonstrates empirically that full-batch gradient descent typically operates at the Edge of Stability, with the Hessian’s maximum eigenvalue hovering near 2/η, where η is the step size.
It identifies a progressive sharpening phenomenon in which sharpness tends to rise during training, across various architectures, until it reaches this critical threshold.
The study challenges standard L-smoothness assumptions and common step-size heuristics, urging a reevaluation of theoretical models for neural network training.

Analyzing Gradient Descent on Neural Networks: Behavior at the Edge of Stability

The paper "Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability" presents an empirical study showing that when neural networks are trained with full-batch gradient descent, the optimization typically operates in a regime the authors call the "Edge of Stability." The study challenges several entrenched beliefs about neural network optimization, with implications for both practical training dynamics and theoretical analyses.

Core Findings

Edge of Stability Regime:
- At the Edge of Stability, the maximum eigenvalue of the Hessian of the training loss, referred to as "sharpness," stabilizes near $2/\eta$ where $\eta$ is the gradient descent step size.
- Even though the sharpness exceeds the classical stability threshold ($\text{sharpness} > 2/\eta$), gradient descent does not diverge. Instead, it enters a regime where the training loss behaves non-monotonically over short timescales but decreases consistently over longer timescales.
Progressive Sharpening Phenomenon:
- The sharpness tends to increase continuously during training until it approaches the critical value $2/\eta$ . This process, termed "progressive sharpening," occurs across various architectures and tasks.
Observed Across Architectures:
- The Edge of Stability regime is observed across a diverse set of neural network configurations, including fully-connected and convolutional networks on image classification tasks such as CIFAR-10, and a Transformer trained on WikiText-2.
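
The sharpness discussed above can be estimated without forming the full Hessian. The following is a minimal sketch, assuming NumPy and a hypothetical two-parameter toy loss (not taken from the paper): it runs power iteration on finite-difference Hessian-vector products and compares the result with the $2/\eta$ threshold. The names `toy_grad`, `sharpness`, and `exceeds_threshold` are illustrative only.

```python
import numpy as np

def toy_grad(w):
    # Gradient of a hypothetical toy loss (an assumption for illustration):
    # f(w) = 0.5*w0^2 + 2*w1^2 + 0.25*(w0*w1)^2
    g0 = w[0] + 0.5 * w[0] * w[1] ** 2
    g1 = 4.0 * w[1] + 0.5 * w[0] ** 2 * w[1]
    return np.array([g0, g1])

def sharpness(grad_fn, w, iters=100, eps=1e-4, seed=0):
    """Estimate the dominant Hessian eigenvalue at w by power iteration.

    Hessian-vector products are approximated with central differences of the
    gradient. Power iteration converges to the eigenvalue of largest
    magnitude, which equals the sharpness when that eigenvalue is positive.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(w.shape)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        hv = (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2 * eps)
        lam = float(v @ hv)  # Rayleigh quotient for the current unit vector
        norm = np.linalg.norm(hv)
        if norm == 0.0:
            break
        v = hv / norm
    return lam

def exceeds_threshold(lam, eta):
    # True when the sharpness is above the classical stability threshold 2/eta.
    return lam > 2.0 / eta

w = np.zeros(2)
lam = sharpness(toy_grad, w)
print("estimated sharpness:", lam)
print("above 2/eta for eta=0.6:", exceeds_threshold(lam, 0.6))
```

For real networks one would typically replace the finite-difference product with an automatic-differentiation Hessian-vector product; the finite-difference version is used here only to keep the sketch self-contained.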

Implications and Theoretical Challenges

Questioning Conventional Optimization Wisdom:

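Inapplicability of $L$-Smoothness:

Many classical analyses assume the training loss is $L$-smooth, meaning the largest Hessian eigenvalue is bounded by a constant $L$, and require a step size below $2/L$. The study finds that this condition does not hold in practical neural network training: the sharpness rises to $2/\eta$ and stays there, so it is not bounded below $2/\eta$ in the regions gradient descent actually visits. This challenges the applicability of theoretical analyses that rely on these assumptions.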
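
For context, the standard descent lemma shows where the $2/L$ threshold comes from. This is the textbook result, written here as a sketch rather than quoted from the paper; $f$ denotes the training loss and $x$ the weights, both notational assumptions. If $\nabla f$ is $L$-Lipschitz, one gradient descent step satisfies

$$
f\big(x - \eta \nabla f(x)\big) \le f(x) - \eta\left(1 - \frac{L\eta}{2}\right)\|\nabla f(x)\|^2 .
$$

The right-hand side guarantees a decrease only when $0 < \eta < 2/L$, equivalently $L < 2/\eta$. At the Edge of Stability the sharpness hovers at or just above $2/\eta$, so no such $L$ bounds the Hessian along the trajectory.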

Monotonic Convergence Assumptions:

The authors show that the non-monotonic behavior of the training loss at the Edge of Stability contradicts analyses that predict a monotonic decrease of the loss at every step.

Quadratic Local Models:

Quadratic Taylor approximations of the loss fail to capture the local behavior at the Edge of Stability. If the training dynamics adhered strictly to such a quadratic model with sharpness above $2/\eta$, gradient descent would diverge. Because it does not, the loss cannot be treated as a simple quadratic at these operating points.
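
The divergence predicted by the quadratic model can be checked directly. The following minimal sketch assumes a one-dimensional quadratic $f(x) = \tfrac{1}{2} a x^2$, whose curvature $a$ plays the role of the sharpness; the function name `gd_on_quadratic` and the chosen values are illustrative, not from the paper. Each gradient descent step multiplies $x$ by $(1 - \eta a)$, so the iterates shrink when $a < 2/\eta$ and grow when $a > 2/\eta$.

```python
def gd_on_quadratic(a, eta, x0=1.0, steps=50):
    """Run gradient descent on f(x) = 0.5 * a * x**2 and return all iterates."""
    x = x0
    trajectory = [x]
    for _ in range(steps):
        x = x - eta * a * x  # the gradient of 0.5*a*x^2 is a*x
        trajectory.append(x)
    return trajectory

eta = 0.1  # the stability threshold 2/eta is 20
below = gd_on_quadratic(a=19.0, eta=eta)  # curvature just below 2/eta
above = gd_on_quadratic(a=21.0, eta=eta)  # curvature just above 2/eta

print("final |x| with a < 2/eta:", abs(below[-1]))
print("final |x| with a > 2/eta:", abs(above[-1]))
```

In the second run the iterates oscillate in sign with growing magnitude, which is the behavior a purely quadratic model would predict once the sharpness crosses $2/\eta$.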

Step Size Selection Heuristics:

A conventional heuristic sets the step size from a local sharpness estimate $\lambda$ (e.g., $\eta = 1/\lambda$). However, such sharpness-based step sizes do not outperform fixed step sizes empirically, prompting a reevaluation of step size strategies.

Future Research Directions

The findings underscore several areas for future inquiry:

- Mechanisms Behind Edge of Stability: Understanding why gradient descent continues to make progress at the Edge of Stability could yield new insights into implicit regularization and into neural network convergence beyond traditional stability models.
- Extending to Stochastic Gradient Descent (SGD): Although this study focuses on full-batch gradient descent, the principles might extend to SGD, with modifications to account for stochasticity and batch size effects.
- Generalization Implications: Although sharpness has traditionally been linked to generalization in deep learning, this study concerns optimization only and draws no direct conclusions about generalization. How the Edge of Stability relates to generalization remains to be investigated.

This paper is a significant contribution to understanding how gradient descent actually behaves in neural network training. It argues for revisiting conventional optimization assumptions, highlights the gap between theory and observed behavior, and lays a foundation for realigning mathematical models with observed training dynamics.
