
Exponential escape efficiency of SGD from sharp minima in non-stationary regime (2111.04004v2)

Published 7 Nov 2021 in cs.LG and math.OC

Abstract: We show that stochastic gradient descent (SGD) escapes from sharp minima exponentially fast even before SGD reaches a stationary distribution. SGD has been the de facto standard training algorithm for various machine learning tasks. However, it remains an open question why SGD finds highly generalizable parameters of non-convex target functions, such as the loss functions of neural networks. "Escape efficiency," which measures how efficiently SGD escapes from sharp minima with potentially low generalization performance, has been an attractive notion for tackling this question. Despite its importance, the notion has the limitation that it applies only after SGD has reached a stationary distribution following sufficiently many updates. In this paper, we develop a new theory to investigate the escape efficiency of SGD with Gaussian noise by introducing Large Deviation Theory for dynamical systems. Based on this theory, we prove that fast escape from sharp minima, termed exponential escape, occurs in a non-stationary setting, and that it holds not only for continuous SGD but also for discrete SGD. A key notion for this result is a quantity called "steepness," which describes SGD's stochastic behavior throughout its training process. Our experiments are consistent with our theory.
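As a rough illustration of what "escape efficiency" measures, the following minimal NumPy sketch estimates the mean first-exit time of Gaussian-noise SGD from a toy quadratic basin. This is not the paper's construction; the toy loss, learning rate, basin radius, and noise scales are hypothetical choices made only for this demonstration, which shows the Large-Deviation-style trend that escape times grow roughly exponentially as the noise shrinks relative to the basin.

import numpy as np

rng = np.random.default_rng(0)

def mean_escape_steps(curvature, noise_std, lr=0.05, basin_radius=1.0,
                      n_runs=100, max_steps=100_000):
    """Average number of noisy-SGD steps before leaving the basin of the toy
    quadratic loss f(x) = 0.5 * curvature * x**2 (exit when |x| > basin_radius)."""
    steps = []
    for _ in range(n_runs):
        x = 0.0
        for t in range(1, max_steps + 1):
            grad = curvature * x  # exact gradient of the toy loss
            # Gaussian-noise SGD update; noise scaled by sqrt(lr) so the
            # discretization matches a diffusion with amplitude noise_std.
            x = x - lr * grad + noise_std * np.sqrt(lr) * rng.standard_normal()
            if abs(x) > basin_radius:
                steps.append(t)
                break
        else:
            steps.append(max_steps)  # censored run: no escape observed
    return float(np.mean(steps))

# The estimated escape time should grow roughly exponentially as the noise
# level shrinks, the kind of scaling the paper analyzes (here only sketched
# for isotropic Gaussian noise, not the paper's exact setting).
for sigma in (1.0, 0.8, 0.6):
    print(f"noise_std={sigma}: mean escape steps ~ {mean_escape_steps(2.0, sigma):.0f}")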

Citations (3)

