Mean-field Analysis on Two-layer Neural Networks from a Kernel Perspective (2403.14917v2)
Abstract: In this paper, we study the feature learning ability of two-layer neural networks in the mean-field regime through the lens of kernel methods. To focus on the dynamics of the kernel induced by the first layer, we utilize a two-timescale limit, where the second layer moves much faster than the first layer. In this limit, the learning problem is reduced to a minimization problem over the intrinsic kernel. We then show the global convergence of the mean-field Langevin dynamics and derive time and particle discretization errors. We also demonstrate that two-layer neural networks can learn a union of multiple reproducing kernel Hilbert spaces more efficiently than any kernel method, and that neural networks acquire a data-dependent kernel which aligns with the target function. In addition, we develop a label noise procedure, prove its convergence to the global optimum, and show that the degrees of freedom appear as an implicit regularization.
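To make the two-timescale picture concrete, the following is a minimal numerical sketch (plain NumPy) of mean-field Langevin training of a two-layer network: at every outer step the fast second layer is re-solved in closed form by ridge regression over the features induced by the first-layer particles, and the slow first layer then takes a noisy (Langevin) gradient step. The data, activation, step sizes, and regularization constants are illustrative assumptions and are not taken from the paper.

```python
# Hedged sketch: two-timescale mean-field Langevin training of a two-layer net.
# Second layer treated as "infinitely fast": solved in closed form per step.
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (assumed for illustration).
n, d, m = 200, 5, 100            # samples, input dim, number of neurons (particles)
X = rng.standard_normal((n, d))
y = np.tanh(X @ rng.standard_normal(d))   # arbitrary target

lam_a = 1e-2                     # ridge penalty on the second layer
lam_w = 1e-3                     # weight decay on first-layer particles
eta, temp = 1e-2, 1e-4           # Langevin step size and temperature
W = rng.standard_normal((m, d))  # first-layer particles

def features(W, X):
    """Feature map induced by the first layer: Phi[i, j] = tanh(<w_j, x_i>) / sqrt(m)."""
    return np.tanh(X @ W.T) / np.sqrt(m)

for step in range(500):
    Phi = features(W, X)
    # Fast second layer: closed-form ridge regression a = (Phi'Phi + n*lam_a I)^{-1} Phi'y.
    a = np.linalg.solve(Phi.T @ Phi + n * lam_a * np.eye(m), Phi.T @ y)
    resid = Phi @ a - y
    # Gradient of the squared loss w.r.t. each particle w_j:
    #   (1/n) * a_j/sqrt(m) * sum_i resid_i * (1 - tanh^2(<w_j, x_i>)) * x_i
    sech2 = 1.0 - m * Phi**2
    grad = ((resid[:, None] * sech2).T @ X) * (a[:, None] / (np.sqrt(m) * n))
    grad += lam_w * W
    # Slow first layer: mean-field Langevin update (gradient step plus Gaussian noise).
    W = W - eta * grad + np.sqrt(2 * eta * temp) * rng.standard_normal(W.shape)

Phi = features(W, X)
a = np.linalg.solve(Phi.T @ Phi + n * lam_a * np.eye(m), Phi.T @ y)
print("final training MSE:", float(np.mean((Phi @ a - y) ** 2)))
```

The closed-form inner solve is what the two-timescale limit buys: the objective seen by the first-layer particles depends only on the kernel their features induce, so the outer Langevin dynamics can be studied as a minimization over that intrinsic kernel.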