
Ordering for Non-Replacement SGD (2306.15848v1)

Published 28 Jun 2023 in cs.LG and math.OC

Abstract: One approach to reducing the run time and improving the efficiency of machine learning is to improve the convergence rate of the underlying optimization algorithm. Shuffling is an algorithmic technique that is widely used in machine learning, but it has only recently begun to receive theoretical attention. With different convergence rates established for random reshuffling and incremental gradient descent, we seek an ordering that improves the convergence rate of the non-replacement (without-replacement) form of the algorithm. Starting from existing bounds on the distance between the current iterate and the optimum, we derive an upper bound that depends on the per-example gradients at the beginning of the epoch. By analyzing this bound, we develop optimal orderings for constant and decreasing step sizes on strongly convex and convex functions. We test and verify our results through experiments on synthetic and real data sets. In addition, we combine the ordering with mini-batching and apply it to more complex neural networks, with promising results.
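The abstract does not spell out the exact ordering rule, which the paper derives from an upper bound involving the per-example gradients at the start of each epoch. The sketch below is only an illustration of that general idea under stated assumptions, not the paper's algorithm: it assumes a least-squares objective and orders examples by the norm of their epoch-start gradients before a single without-replacement pass. The function names (per_example_gradients, ordered_nonreplacement_sgd) and the descending-norm heuristic are hypothetical choices for illustration.

```python
# Sketch of gradient-based ordering for without-replacement SGD.
# Assumption (not stated verbatim in the abstract): examples are ranked by
# the norm of their individual gradients computed at the start of each epoch.
import numpy as np

def per_example_gradients(w, X, y):
    # Least-squares loss per example: 0.5 * (x.w - y)^2, gradient = (x.w - y) * x
    residuals = X @ w - y                      # shape (n,)
    return residuals[:, None] * X              # shape (n, d), one gradient per row

def ordered_nonreplacement_sgd(X, y, n_epochs=20, lr=0.01, descending=True):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        # Per-example gradients at the current (epoch-start) iterate.
        grads = per_example_gradients(w, X, y)
        norms = np.linalg.norm(grads, axis=1)
        # Order examples by gradient norm; each example is visited exactly once
        # per epoch (non-replacement), unlike with-replacement SGD sampling.
        order = np.argsort(-norms if descending else norms)
        for i in order:
            w -= lr * per_example_gradients(w, X[i:i+1], y[i:i+1])[0]
    return w

# Usage on synthetic data:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.01 * rng.normal(size=200)
w_hat = ordered_nonreplacement_sgd(X, y)
print(np.linalg.norm(w_hat - w_true))
```

The mini-batch variant mentioned in the abstract would, under the same assumption, amount to grouping the ordered indices into consecutive blocks and averaging their gradients before each update.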

