
Momentum-Based Variance Reduction in Non-Convex SGD

(1905.10018)
Published May 24, 2019 in cs.LG, math.OC, and stat.ML

Abstract

Variance reduction has emerged in recent years as a strong competitor to stochastic gradient descent in non-convex problems, providing the first algorithms to improve upon the convergence rate of stochastic gradient descent for finding first-order critical points. However, variance reduction techniques typically require carefully tuned learning rates and willingness to use excessively large "mega-batches" in order to achieve their improved results. We present a new algorithm, STORM, that does not require any batches and makes use of adaptive learning rates, enabling simpler implementation and less hyperparameter tuning. Our technique for removing the batches uses a variant of momentum to achieve variance reduction in non-convex optimization. On smooth losses $F$, STORM finds a point $\boldsymbol{x}$ with $\mathbb{E}[\|\nabla F(\boldsymbol{x})\|]\le O(1/\sqrt{T}+\sigma^{1/3}/T^{1/3})$ in $T$ iterations with $\sigma^2$ variance in the gradients, matching the optimal rate but without requiring knowledge of $\sigma$.
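A minimal numerical sketch may help make the momentum-based correction concrete. The recursion below follows the STORM-style idea of evaluating each fresh stochastic gradient at both the new and the previous iterate with the same sample, combined with a cube-root adaptive step size; the function name `storm`, the constants `k`, `w`, `c`, and the toy problem are illustrative assumptions, not the authors' code or tuned settings.

```python
import numpy as np

def storm(grad_fn, x0, T=1000, k=0.1, w=0.1, c=10.0, seed=0):
    """Sketch of a STORM-style loop: momentum-based variance reduction
    with an adaptive learning rate. `grad_fn(x, rng)` returns a stochastic
    gradient at x using randomness drawn from `rng`."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    sample = int(rng.integers(1 << 31))          # shared sample index ("xi_t")
    g = grad_fn(x, np.random.default_rng(sample))
    d = g.copy()                                 # variance-reduced direction d_t
    grad_sq_sum = float(np.dot(g, g))            # running sum of ||grad||^2

    for _ in range(T):
        eta = k / (w + grad_sq_sum) ** (1.0 / 3.0)   # adaptive step size
        x_prev = x
        x = x - eta * d                              # descent step

        # Draw ONE new sample and evaluate the gradient at BOTH the new and
        # the previous iterate with that same sample (the momentum correction).
        sample = int(rng.integers(1 << 31))
        g_new = grad_fn(x, np.random.default_rng(sample))
        g_prev = grad_fn(x_prev, np.random.default_rng(sample))

        a = min(1.0, c * eta ** 2)                   # momentum weight a_{t+1}
        d = g_new + (1.0 - a) * (d - g_prev)

        grad_sq_sum += float(np.dot(g_new, g_new))
    return x

# Toy usage (hypothetical problem): minimize a quadratic with noisy gradients.
def noisy_grad(x, rng):
    return 2.0 * x + 0.1 * rng.standard_normal(x.shape)

x_out = storm(noisy_grad, x0=np.ones(5), T=2000)
```

Note that no mega-batch appears anywhere: each iteration uses a single stochastic gradient sample, and the correction term `(d - g_prev)` is what keeps the variance of the search direction under control.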

