
SGD with AdaGrad Stepsizes: Full Adaptivity with High Probability to Unknown Parameters, Unbounded Gradients and Affine Variance (2302.08783v2)

Published 17 Feb 2023 in cs.LG, math.OC, and stat.ML

Abstract: We study Stochastic Gradient Descent with AdaGrad stepsizes: a popular adaptive (self-tuning) method for first-order stochastic optimization. Despite being well studied, existing analyses of this method suffer from various shortcomings: they either assume some knowledge of the problem parameters, impose strong global Lipschitz conditions, or fail to give bounds that hold with high probability. We provide a comprehensive analysis of this basic method without any of these limitations, in both the convex and non-convex (smooth) cases, that additionally supports a general "affine variance" noise model and provides sharp rates of convergence in both the low-noise and high-noise regimes.
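As a rough illustration of the method the abstract refers to (not code from the paper), the scalar "AdaGrad-norm" variant of SGD sets the stepsize from the running sum of squared stochastic-gradient norms, so no smoothness or noise parameters need to be supplied in advance. The sketch below is a minimal NumPy implementation under assumed defaults; the function name `sgd_adagrad_norm` and parameters `eta`, `b0`, and `n_steps` are illustrative choices, and the affine-variance noise model and high-probability guarantees analyzed in the paper are not reproduced here.

```python
import numpy as np

def sgd_adagrad_norm(grad_oracle, x0, eta=1.0, b0=1e-3, n_steps=1000):
    """SGD with scalar AdaGrad-norm stepsizes (illustrative sketch).

    Update: x_{t+1} = x_t - eta / sqrt(b0^2 + sum_{s<=t} ||g_s||^2) * g_t,
    where g_t is a stochastic gradient at x_t. The stepsize self-tunes
    from the observed gradient magnitudes.
    """
    x = np.asarray(x0, dtype=float).copy()
    accum = b0 ** 2                      # running sum of squared gradient norms
    for _ in range(n_steps):
        g = grad_oracle(x)               # stochastic gradient estimate
        accum += float(np.dot(g, g))     # accumulate ||g_t||^2
        x -= (eta / np.sqrt(accum)) * g  # AdaGrad-norm step
    return x

# Toy usage: f(x) = 0.5 * ||x||^2 with additive Gaussian gradient noise.
rng = np.random.default_rng(0)
noisy_grad = lambda x: x + 0.1 * rng.standard_normal(x.shape)
x_out = sgd_adagrad_norm(noisy_grad, x0=np.ones(10), n_steps=5000)
print(np.linalg.norm(x_out))  # should be small after many steps
```

For context, the "affine variance" condition mentioned in the abstract is commonly stated as allowing the gradient-noise variance to grow affinely with the squared gradient norm, roughly E[||g_t - ∇f(x_t)||^2] ≤ σ0^2 + σ1^2 ||∇f(x_t)||^2, which strictly generalizes the usual bounded-variance assumption.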
