Nonparametric regression using over-parameterized shallow ReLU neural networks (2306.08321v2)
Abstract: It is shown that over-parameterized neural networks can achieve minimax optimal rates of convergence (up to logarithmic factors) for learning functions from certain smooth function classes, if the weights are suitably constrained or regularized. Specifically, we consider the nonparametric regression problem of estimating an unknown $d$-variate function using shallow ReLU neural networks. The regression function is assumed to lie in a Hölder space with smoothness $\alpha<(d+3)/2$ or in a variation space corresponding to shallow neural networks, which can be viewed as the class of functions represented by infinitely wide shallow networks. In this setting, we prove that least squares estimators based on shallow neural networks with certain norm constraints on the weights are minimax optimal, provided the network width is sufficiently large. As a byproduct, we derive a new size-independent bound on the local Rademacher complexity of shallow ReLU neural networks, which may be of independent interest.
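To make the setting concrete, the display below sketches the standard form of estimator the abstract refers to: a width-$N$ shallow ReLU network fitted by constrained least squares on $n$ samples $(X_j, Y_j)$. This is a minimal sketch; the notation ($N$, $M$, the norm $\|\cdot\|_{\mathrm{path}}$, written here with $\ell_2$ on the inner weights) is introduced for illustration, and the paper's exact choice of weight norm may differ in its details.

$$
f_\theta(x)=\sum_{i=1}^{N} a_i\,\max\bigl(w_i^\top x+b_i,\,0\bigr),
\qquad
\|\theta\|_{\mathrm{path}}:=\sum_{i=1}^{N}|a_i|\bigl(\|w_i\|_2+|b_i|\bigr),
$$

$$
\widehat f_n\in\operatorname*{arg\,min}_{\theta:\ \|\theta\|_{\mathrm{path}}\le M}\ \frac{1}{n}\sum_{j=1}^{n}\bigl(Y_j-f_\theta(X_j)\bigr)^2 .
$$

For a Hölder ball of smoothness $\alpha$, the classical minimax rate for the squared $L^2$ error is $n^{-2\alpha/(2\alpha+d)}$; "minimax optimal up to logarithmic factors" means that $\widehat f_n$ attains this rate up to $\log n$ factors once the width $N$ is large enough and the constraint level $M$ is chosen appropriately.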