Momentum via Primal Averaging: Theoretical Insights and Learning Rate Schedules for Non-Convex Optimization (2010.00406v4)
Abstract: Momentum methods are now used pervasively within the machine learning community for training non-convex models such as deep neural networks. Empirically, they outperform traditional stochastic gradient descent (SGD) approaches. In this work we develop a Lyapunov analysis of SGD with momentum (SGD+M), by utilizing an equivalent rewriting of the method known as the stochastic primal averaging (SPA) form. This analysis is much tighter than previous theory in the non-convex case, and as a result we are able to give precise insights into when SGD+M may outperform SGD, and which hyper-parameter schedules will work and why.
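The abstract refers to rewriting SGD+M in the stochastic primal averaging (SPA) form. Below is a minimal numerical sketch of that equivalence, assuming constant hyperparameters and the standard heavy-ball parameterization of SGD+M; the mapping `c = 1 - beta`, `eta = alpha / (1 - beta)` is the well-known constant-step correspondence and is not taken verbatim from the paper, which also covers time-varying schedules.

```python
import numpy as np

def sgd_momentum(grad, x0, alpha, beta, steps):
    """Heavy-ball SGD+M: b_{k+1} = beta*b_k + g_k; x_{k+1} = x_k - alpha*b_{k+1}."""
    x, b = x0.copy(), np.zeros_like(x0)
    for _ in range(steps):
        b = beta * b + grad(x)
        x = x - alpha * b
    return x

def spa(grad, x0, eta, c, steps):
    """Stochastic primal averaging form:
    z_{k+1} = z_k - eta*g(x_k); x_{k+1} = (1-c)*x_k + c*z_{k+1}."""
    x, z = x0.copy(), x0.copy()
    for _ in range(steps):
        z = z - eta * grad(x)
        x = (1.0 - c) * x + c * z
    return x

if __name__ == "__main__":
    # Simple quadratic objective f(x) = 0.5 * ||A x||^2, gradient A^T A x.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 5))
    grad = lambda x: A.T @ A @ x
    x0 = rng.standard_normal(5)

    alpha, beta = 0.01, 0.9
    c, eta = 1.0 - beta, alpha / (1.0 - beta)  # assumed constant-step mapping

    x_m = sgd_momentum(grad, x0, alpha, beta, steps=50)
    x_s = spa(grad, x0, eta, c, steps=50)
    print(np.max(np.abs(x_m - x_s)))  # ~1e-15: the two forms generate identical iterates
```

Running the script shows the two parameterizations produce the same iterate sequence up to floating-point error, which is the structural fact the paper's Lyapunov analysis exploits.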