Rethinking Gauss-Newton for learning over-parameterized models (2302.02904v3)
Abstract: This work studies the global convergence and implicit bias of the Gauss-Newton (GN) method when optimizing over-parameterized one-hidden-layer networks in the mean-field regime. We first establish a global convergence result for GN in the continuous-time limit, exhibiting a faster convergence rate than gradient descent (GD) due to improved conditioning. We then perform an empirical study on a synthetic regression task to investigate GN's implicit bias. While GN is consistently faster than GD at finding a global optimum, the learned model generalizes well on test data when it is started from random initial weights with small variance and trained with a small step size that slows down convergence. Specifically, our study shows that this setting produces a hidden learning phenomenon, in which the dynamics recover features with good generalization properties even though the model has sub-optimal training and test performance due to an under-optimized linear layer. This study exhibits a trade-off between the convergence speed of GN and the generalization ability of the learned solution.
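The abstract refers to Gauss-Newton updates for a one-hidden-layer network under mean-field scaling on a synthetic regression task. The sketch below is a minimal NumPy illustration of a damped (Levenberg-Marquardt-style) GN step in its kernel (dual) form, which is convenient in the over-parameterized regime; the tanh activation, damping `lam`, step size `eta`, initialization scale `sigma_w`, and the synthetic target are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (illustrative, not the paper's exact task).
n, d, m = 64, 5, 512                       # samples, input dim, hidden width (over-parameterized)
X = rng.normal(size=(n, d))
y = np.sin(X @ rng.normal(size=d))         # arbitrary smooth scalar target

# One-hidden-layer network with mean-field 1/m scaling.
sigma_w = 0.1                              # small initialization variance (cf. the abstract)
W = sigma_w * rng.normal(size=(m, d))      # hidden-layer weights
a = sigma_w * rng.normal(size=m)           # output (linear) layer weights

def forward(W, a, X):
    return (1.0 / m) * np.tanh(X @ W.T) @ a

def jacobian(W, a, X):
    """Jacobian of the predictions w.r.t. all parameters, shape (n, m*d + m)."""
    H = np.tanh(X @ W.T)                   # (n, m) hidden activations
    dH = 1.0 - H**2                        # tanh derivative
    J_W = (a * dH)[:, :, None] * X[:, None, :] / m   # (n, m, d): d f_i / d W_jk
    J_a = H / m                            # (n, m):    d f_i / d a_j
    return np.concatenate([J_W.reshape(n, -1), J_a], axis=1)

eta, lam = 0.1, 1e-3                       # step size and damping (assumed values)
for step in range(200):
    r = forward(W, a, X) - y               # residuals
    J = jacobian(W, a, X)
    # Damped GN step in kernel (dual) form: solve an n x n system instead of a p x p one.
    delta = J.T @ np.linalg.solve(J @ J.T + lam * np.eye(n), r)
    W -= eta * delta[: m * d].reshape(m, d)
    a -= eta * delta[m * d:]

print("final train MSE:", np.mean((forward(W, a, X) - y) ** 2))
```

The dual form relies on the identity J^T (J J^T + lam I)^{-1} = (J^T J + lam I)^{-1} J^T, so each step only requires solving an n x n linear system, which is cheap when the number of parameters far exceeds the number of samples.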