Role of Momentum in Smoothing Objective Function and Generalizability of Deep Neural Networks (2402.02325v3)

Published 4 Feb 2024 in cs.LG and math.OC

Abstract: For nonconvex objective functions, including those of deep neural networks, stochastic gradient descent (SGD) with momentum has fast convergence and excellent generalizability, but a theoretical explanation for this is lacking. In contrast to previous studies that defined the stochastic noise that occurs during optimization as the variance of the stochastic gradient, we define it as the gap between the search direction of the optimizer and the steepest descent direction and show that its level dominates the generalizability of the model. We also show that the stochastic noise in SGD with momentum smooths the objective function, the degree of which is determined by the learning rate, the batch size, the momentum factor, the variance of the stochastic gradient, and the upper bound of the gradient norm. By numerically deriving the stochastic noise level in SGD and SGD with momentum, we provide theoretical findings that help explain the training dynamics of SGD with momentum, which were not explained by previous studies on convergence and stability. We also provide experimental results supporting our assertion that model generalizability depends on the stochastic noise level.
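The abstract's notion of stochastic noise can be made concrete with a small numerical sketch. The snippet below is a minimal NumPy illustration, not the paper's implementation or setting: it runs heavy-ball SGD with momentum on a toy least-squares problem and reports, at each step, the gap between the momentum search direction and the steepest descent step computed from the full gradient. The toy objective, hyperparameters (lr, beta, batch_size), and helper names (full_grad, minibatch_grad) are all assumptions made purely for illustration.

```python
# Rough illustration (not the paper's exact formulation): measure the
# "stochastic noise" as the gap between the optimizer's search direction
# and the steepest descent (full-gradient) direction at the same point.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))          # toy data (assumption: simple least squares)
w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=1000)

def full_grad(w):
    # Full-batch gradient; the steepest descent direction is its negative.
    return 2 * X.T @ (X @ w - y) / len(y)

def minibatch_grad(w, batch_size):
    # Mini-batch gradient estimate used by the optimizer.
    idx = rng.choice(len(y), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / batch_size

lr, beta, batch_size = 0.01, 0.9, 32     # learning rate, momentum factor, batch size (arbitrary)
w = np.zeros(10)
m = np.zeros(10)

for step in range(200):
    g = minibatch_grad(w, batch_size)
    m = beta * m + g                     # heavy-ball style momentum buffer
    direction = -lr * m                  # optimizer's actual search step
    steepest = -lr * full_grad(w)        # steepest descent step at the same point
    noise = np.linalg.norm(direction - steepest)  # gap between the two directions
    w += direction
    if step % 50 == 0:
        print(f"step {step:3d}  noise level {noise:.4f}")
```

In this toy setting, raising batch_size or lowering lr and beta visibly shrinks the reported gap, which loosely mirrors the abstract's claim that the noise level is governed by the learning rate, batch size, momentum factor, and gradient variance; the paper's actual analysis concerns nonconvex deep network objectives rather than this convex example.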

Authors (2)
  1. Naoki Sato (45 papers)
  2. Hideaki Iiduka (34 papers)
Citations (1)
