
SGD with AdaGrad Stepsizes: Full Adaptivity with High Probability to Unknown Parameters, Unbounded Gradients and Affine Variance (2302.08783v2)

Published 17 Feb 2023 in cs.LG, math.OC, and stat.ML

Abstract: We study Stochastic Gradient Descent with AdaGrad stepsizes: a popular adaptive (self-tuning) method for first-order stochastic optimization. Despite being well studied, existing analyses of this method suffer from various shortcomings: they either assume some knowledge of the problem parameters, impose strong global Lipschitz conditions, or fail to give bounds that hold with high probability. We provide a comprehensive analysis of this basic method without any of these limitations, in both the convex and non-convex (smooth) cases, that additionally supports a general "affine variance" noise model and provides sharp rates of convergence in both the low-noise and high-noise regimes.
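The stepsize rule analyzed in the paper is the scalar ("norm") version of AdaGrad: the base stepsize is divided by the square root of the accumulated squared stochastic-gradient norms, so no knowledge of the smoothness constant or the noise level is required. The sketch below is a minimal illustration of that rule, not code from the paper; the quadratic test objective, the oracle noisy_grad, and the parameters sigma0 and sigma1 (chosen to mimic an affine-variance noise model, where the noise magnitude may grow with the gradient norm) are assumptions made for this example.

```python
import numpy as np

def sgd_adagrad_norm(grad_fn, x0, eta=1.0, b0=1e-8, n_steps=1000, rng=None):
    """SGD with the scalar AdaGrad ("AdaGrad-norm") stepsize:
        x_{t+1} = x_t - eta / sqrt(b0^2 + sum_{s<=t} ||g_s||^2) * g_t,
    where g_t = grad_fn(x_t, rng) is a stochastic gradient at x_t."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    accum = b0 ** 2                    # running sum of squared gradient norms
    for _ in range(n_steps):
        g = grad_fn(x, rng)            # stochastic gradient oracle
        accum += float(np.dot(g, g))   # accumulate ||g_t||^2
        x -= eta / np.sqrt(accum) * g  # self-tuning step: no tuning of eta to L or sigma
    return x

# Illustrative oracle: noisy gradient of f(x) = 0.5 * ||x||^2, where the noise
# scale grows affinely with the true gradient norm (affine-variance-style noise).
def noisy_grad(x, rng, sigma0=0.1, sigma1=0.5):
    g = x  # exact gradient of 0.5 * ||x||^2
    scale = sigma0 + sigma1 * np.linalg.norm(g)
    return g + scale * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
x_final = sgd_adagrad_norm(noisy_grad, x0=np.ones(10), eta=1.0, n_steps=5000, rng=rng)
print(np.linalg.norm(x_final))  # should be small: the iterates approach the minimizer at 0
```

Under affine-variance noise the per-step error can scale with the gradient itself, so analyses that assume uniformly bounded gradients or bounded noise do not cover this regime; the growing denominator in the update is what allows the stepsize to self-tune to it without any problem-dependent constants.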

