Papers
Topics
Authors
Recent
Assistant
AI Research Assistant
Well-researched responses based on relevant abstracts and paper content.
Custom Instructions Pro
Preferences or requirements that you'd like Emergent Mind to consider when generating responses.
Gemini 2.5 Flash
Gemini 2.5 Flash 168 tok/s
Gemini 2.5 Pro 47 tok/s Pro
GPT-5 Medium 35 tok/s Pro
GPT-5 High 34 tok/s Pro
GPT-4o 130 tok/s Pro
Kimi K2 170 tok/s Pro
GPT OSS 120B 437 tok/s Pro
Claude Sonnet 4.5 37 tok/s Pro
2000 character limit reached

Stochastic Mirror Descent on Overparameterized Nonlinear Models: Convergence, Implicit Regularization, and Generalization (1906.03830v1)

Published 10 Jun 2019 in cs.LG, math.OC, and stat.ML

Abstract: Most modern learning problems are highly overparameterized, meaning that there are many more parameters than the number of training data points, and as a result, the training loss may have infinitely many global minima (parameter vectors that perfectly interpolate the training data). Therefore, it is important to understand which interpolating solutions we converge to, how they depend on the initialization point and the learning algorithm, and whether they lead to different generalization performances. In this paper, we study these questions for the family of stochastic mirror descent (SMD) algorithms, of which the popular stochastic gradient descent (SGD) is a special case. Our contributions are both theoretical and experimental. On the theory side, we show that in the overparameterized nonlinear setting, if the initialization is close enough to the manifold of global minima (something that comes for free in the highly overparameterized case), SMD with sufficiently small step size converges to a global minimum that is approximately the closest one in Bregman divergence. On the experimental side, our extensive experiments on standard datasets and models, using various initializations, various mirror descents, and various Bregman divergences, consistently confirms that this phenomenon happens in deep learning. Our experiments further indicate that there is a clear difference in the generalization performance of the solutions obtained by different SMD algorithms. Experimenting on a standard image dataset and network architecture with SMD with different kinds of implicit regularization, $\ell_1$ to encourage sparsity, $\ell_2$ yielding SGD, and $\ell_{10}$ to discourage large components in the parameter vector, consistently and definitively shows that $\ell_{10}$-SMD has better generalization performance than SGD, which in turn has better generalization performance than $\ell_1$-SMD.

Citations (66)

Summary

We haven't generated a summary for this paper yet.

Dice Question Streamline Icon: https://streamlinehq.com

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Lightbulb Streamline Icon: https://streamlinehq.com

Continue Learning

We haven't generated follow-up questions for this paper yet.

List To Do Tasks Checklist Streamline Icon: https://streamlinehq.com

Collections

Sign up for free to add this paper to one or more collections.