Stochastic Gradient Descent outperforms Gradient Descent in recovering a high-dimensional signal in a glassy energy landscape (2309.04788v2)

Published 9 Sep 2023 in cs.LG and cond-mat.dis-nn

Abstract: Stochastic Gradient Descent (SGD) is an out-of-equilibrium algorithm used extensively to train artificial neural networks. However, very little is known about the extent to which SGD is crucial to the success of this technology and, in particular, how effective it is at optimizing high-dimensional non-convex cost functions compared to other optimization algorithms such as Gradient Descent (GD). In this work we leverage dynamical mean field theory to benchmark the performance of SGD in the high-dimensional limit. To do so, we consider the problem of recovering a hidden high-dimensional non-linearly encrypted signal, a prototypical hard high-dimensional non-convex optimization problem. We compare the performance of SGD to that of GD and show that SGD largely outperforms GD for sufficiently small batch sizes. In particular, a power-law fit of the relaxation times of the two algorithms shows that the recovery threshold for SGD with small batch size is smaller than the corresponding threshold for GD.
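
The comparison described in the abstract can be illustrated with a small numerical experiment. The sketch below is a minimal toy version, not the paper's dynamical mean field theory calculation: it uses a phase-retrieval-style quadratic cost as a stand-in for the non-linearly encrypted signal model, it resamples a plain mini-batch at every step (the paper's SGD variant may differ), and the dimension `N`, sample ratio `alpha`, learning rate, step count, and batch fraction are all hypothetical choices. The recovery diagnostic is the overlap between the running estimate and the hidden signal.

```python
# Toy comparison of full-batch GD and mini-batch SGD on a non-convex
# signal-recovery loss. Illustrative sketch only: the phase-retrieval-style
# cost and all hyperparameters below are assumptions, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)

N = 200                          # signal dimension (hypothetical)
alpha = 4.0                      # sample ratio M / N (hypothetical)
M = int(alpha * N)

# Hidden unit-norm signal and quadratic ("encrypted") measurements.
x_star = rng.normal(size=N)
x_star /= np.linalg.norm(x_star)
A = rng.normal(size=(M, N))
y = (A @ x_star) ** 2

def grad(x, idx):
    # Gradient of L(x) = (1 / 4|idx|) * sum_mu ((a_mu . x)^2 - y_mu)^2
    # restricted to the mini-batch `idx`.
    z = A[idx] @ x
    return A[idx].T @ ((z ** 2 - y[idx]) * z) / len(idx)

def run(batch_frac, lr=0.02, steps=3000):
    # Descend from a random start (near-zero initial overlap) and return
    # the final overlap |x . x*| / |x| with the hidden signal.
    x = rng.normal(size=N)
    x /= np.linalg.norm(x)
    bs = max(1, int(batch_frac * M))
    for _ in range(steps):
        # Full batch for GD; a fresh random mini-batch each step for SGD.
        idx = np.arange(M) if bs == M else rng.choice(M, size=bs, replace=False)
        x -= lr * grad(x, idx)
    return abs(x @ x_star) / np.linalg.norm(x)

print("GD  overlap:", run(batch_frac=1.0))   # full-batch gradient descent
print("SGD overlap:", run(batch_frac=0.1))   # mini-batch SGD, b = 0.1
```

An overlap near 1 means the signal was recovered. In the paper, the comparison is made quantitative by fitting the relaxation time of each algorithm with a power law; the extracted divergence point gives the recovery threshold, which is smaller for SGD with small batch size than for GD.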

Authors (2)
  1. Persia Jana Kamali (2 papers)
  2. Pierfrancesco Urbani (55 papers)
Citations (6)
