Fast Convergence in Learning Two-Layer Neural Networks with Separable Data (2305.13471v2)

Published 22 May 2023 in cs.LG

Abstract: Normalized gradient descent has shown substantial success in speeding up the convergence of exponentially-tailed loss functions (which include the exponential and logistic losses) on linear classifiers with separable data. In this paper, we go beyond linear models by studying normalized GD on two-layer neural nets. For exponentially-tailed losses, we prove that normalized GD leads to a linear rate of convergence of the training loss to the global optimum if the iterates find an interpolating model. This is made possible by showing certain gradient self-boundedness conditions and a log-Lipschitzness property. We also study the generalization of normalized GD for convex objectives via an algorithmic-stability analysis. In particular, we show that normalized GD does not overfit during training by establishing finite-time generalization bounds.
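
As a rough illustration of the setting described in the abstract, below is a minimal NumPy sketch of normalized gradient descent on a two-layer tanh network trained with the logistic loss on linearly separable data. The architecture, initialization, step size, and toy data are illustrative assumptions, not the paper's setup; the only point is the normalized update, which divides the full gradient by its norm at every step.

```python
# Minimal sketch (not the authors' code) of normalized GD on a two-layer network.
import numpy as np

rng = np.random.default_rng(0)

# Toy linearly separable data with labels in {-1, +1} (assumed for illustration).
n, d, m = 200, 10, 64                  # samples, input dim, hidden width
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = np.sign(X @ w_star)

# Two-layer net f(x) = a^T tanh(W x); both layers are trained here.
W = rng.normal(size=(m, d)) / np.sqrt(d)
a = rng.normal(size=m) / np.sqrt(m)

def loss_and_grads(W, a):
    H = np.tanh(X @ W.T)                       # (n, m) hidden activations
    f = H @ a                                  # (n,) network outputs
    margins = y * f
    # Logistic loss log(1 + exp(-y f)), averaged over the sample.
    loss = np.mean(np.log1p(np.exp(-margins)))
    coef = -y / (1.0 + np.exp(margins)) / n    # d(loss)/d(f_i)
    grad_a = H.T @ coef
    grad_W = (coef[:, None] * (1 - H**2) * a).T @ X   # chain rule through tanh
    return loss, grad_W, grad_a

eta = 0.5                                      # assumed step size
for t in range(500):
    loss, gW, ga = loss_and_grads(W, a)
    gnorm = np.sqrt(np.sum(gW**2) + np.sum(ga**2)) + 1e-12
    # Normalized GD: scale the step by the inverse of the full gradient norm.
    W -= eta * gW / gnorm
    a -= eta * ga / gnorm

final_loss, _, _ = loss_and_grads(W, a)
print(f"final training loss: {final_loss:.4e}")
```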

Authors (2)
  1. Hossein Taheri (22 papers)
  2. Christos Thrampoulidis (79 papers)
Citations (2)
