Depth Separation in Norm-Bounded Infinite-Width Neural Networks (2402.08808v1)
Abstract: We study depth separation in infinite-width neural networks, where complexity is controlled by the overall squared $\ell_2$-norm of the weights (the sum of squares of all weights in the network). Previous depth separation results focused on separation in terms of width; however, such results do not indicate whether depth determines if a network that generalizes well can be learned even when its width is unbounded. Here, we study separation in terms of the sample complexity required for learnability. Specifically, we show that there are functions that are learnable with sample complexity polynomial in the input dimension by norm-controlled depth-3 ReLU networks, yet are not learnable with sub-exponential sample complexity by norm-controlled depth-2 ReLU networks (with any value for the norm). We also show that a similar statement in the reverse direction is not possible: any function learnable with polynomial sample complexity by a norm-controlled depth-2 ReLU network with infinite width is also learnable with polynomial sample complexity by a norm-controlled depth-3 ReLU network.
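To make the complexity measure concrete, consider a depth-2 (one-hidden-layer) ReLU network of width $k$, where $k$ may be unbounded. In a standard parameterization (a sketch only; the paper's formal setup, e.g. its treatment of bias terms, may differ), the network and the squared $\ell_2$-norm that controls its complexity can be written as

$$ f(x) \;=\; \sum_{j=1}^{k} v_j\,\big[\langle w_j, x\rangle + b_j\big]_+ , \qquad C(f) \;=\; \sum_{j=1}^{k} \big( \|w_j\|_2^2 + v_j^2 \big), $$

where $[z]_+ = \max\{z,0\}$ is the ReLU activation, $w_j$ and $v_j$ are the hidden and output weights, and $C(f)$ is the sum of squares of all weights in the network, as described in the abstract. Learning with a norm-controlled network then means bounding $C(f)$ rather than the width $k$; the depth-3 case adds one more hidden ReLU layer and includes its weights in the same sum.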