Convergence of mean-field Langevin dynamics: Time and space discretization, stochastic gradient, and variance reduction (2306.07221v1)
Abstract: The mean-field Langevin dynamics (MFLD) is a nonlinear generalization of the Langevin dynamics that incorporates a distribution-dependent drift, and it naturally arises from the optimization of two-layer neural networks via (noisy) gradient descent. Recent works have shown that MFLD globally minimizes an entropy-regularized convex functional in the space of measures. However, all prior analyses assume the infinite-particle or continuous-time limit and cannot handle stochastic gradient updates. We provide a general framework to prove a uniform-in-time propagation of chaos for MFLD that takes into account the errors due to finite-particle approximation, time discretization, and stochastic gradient approximation. To demonstrate the wide applicability of this framework, we establish quantitative convergence rate guarantees to the regularized global optimal solution under (i) a wide range of learning problems such as neural networks in the mean-field regime and MMD minimization, and (ii) different gradient estimators including SGD and SVRG. Despite the generality of our results, we achieve an improved convergence rate in both the SGD and SVRG settings when specialized to the standard Langevin dynamics.
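The finite-particle, time-discretized MFLD described in the abstract amounts to noisy (stochastic) gradient descent on a system of interacting particles. The sketch below illustrates that update for a two-layer network in the mean-field regime; the tanh activation, squared loss, ridge penalty, and all hyperparameter values are illustrative assumptions, not the paper's exact setting.

```python
# Minimal illustrative sketch (assumptions, not the paper's code): the
# finite-particle, Euler-Maruyama discretization of mean-field Langevin
# dynamics with minibatch (stochastic) gradients, applied to a two-layer
# network  f(x) = (1/N) * sum_i a_i * tanh(w_i . x)  with squared loss.
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (illustrative).
n, d = 200, 5
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0])

# N particles; each particle is one neuron theta_i = (a_i, w_i).
N = 512
theta = 0.5 * rng.standard_normal((N, 1 + d))

eta = 0.05      # step size (time discretization)
lam = 1e-3      # entropic regularization strength (temperature)
lam_reg = 1e-3  # L2 penalty on the parameters
batch = 32      # minibatch size for the stochastic gradient

def predict(theta, X):
    a, w = theta[:, 0], theta[:, 1:]
    return np.tanh(X @ w.T) @ a / len(theta)    # mean-field 1/N scaling

for step in range(2000):
    idx = rng.choice(n, size=batch, replace=False)
    Xb, yb = X[idx], y[idx]

    a, w = theta[:, 0], theta[:, 1:]
    act = np.tanh(Xb @ w.T)                     # (batch, N)
    resid = act @ a / N - yb                    # (batch,)

    # Gradient of the first variation of the regularized objective,
    # evaluated at each particle (squared loss + L2 penalty).
    grad_a = act.T @ resid / batch + lam_reg * a
    grad_w = (resid[:, None] * (1.0 - act**2) * a).T @ Xb / batch + lam_reg * w
    grad = np.concatenate([grad_a[:, None], grad_w], axis=1)

    # Noisy stochastic gradient step: Euler-Maruyama update of MFLD.
    theta = theta - eta * grad + np.sqrt(2.0 * eta * lam) * rng.standard_normal(theta.shape)

    if step % 500 == 0:
        print(step, np.mean((predict(theta, X) - y) ** 2))
```

The injected noise scale sqrt(2 * eta * lam) is tied to the entropic regularization strength lam; sending lam to zero recovers plain (noiseless) particle gradient descent.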
- Taiji Suzuki
- Denny Wu
- Atsushi Nitanda