
On the Asymptotic Learning Curves of Kernel Ridge Regression under Power-law Decay (2309.13337v1)

Published 23 Sep 2023 in cs.LG, math.ST, and stat.TH

Abstract: The widely observed 'benign overfitting' phenomenon in the neural network literature challenges the 'bias-variance trade-off' doctrine of statistical learning theory. Since the generalization ability of 'lazy-trained' over-parametrized neural networks can be well approximated by that of neural tangent kernel regression, the curve of the excess risk (namely, the learning curve) of kernel ridge regression has recently attracted increasing attention. However, most recent arguments about the learning curve are heuristic and rely on the 'Gaussian design' assumption. In this paper, under mild and more realistic assumptions, we rigorously provide a full characterization of the learning curve, elaborating the effect and interplay of the regularization parameter, the source condition, and the noise. In particular, our results suggest that the 'benign overfitting' phenomenon exists in very wide neural networks only when the noise level is small.
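As a rough illustration of this setting (not the paper's analysis), the sketch below fits kernel ridge regression on a one-dimensional toy problem and Monte-Carlo estimates its excess risk for a few sample sizes, noise levels, and ridge parameters. The RBF kernel, the sinusoidal target, and all parameter values are illustrative assumptions rather than the power-law / source-condition setup studied in the paper; the point is only to show how an empirical learning curve can be probed numerically.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(X, Y, gamma=10.0):
    # Gaussian (RBF) kernel matrix k(x, y) = exp(-gamma * |x - y|^2); illustrative choice only.
    return np.exp(-gamma * (X[:, None] - Y[None, :]) ** 2)

def f_star(x):
    # Smooth toy target (a stand-in for the regression function under a source condition).
    return np.sin(2 * np.pi * x)

def krr_excess_risk(n, lam, sigma, n_test=2000, n_rep=20):
    """Monte-Carlo estimate of the excess risk E[(f_hat - f*)^2] of KRR trained on
    n noisy samples with ridge parameter lam and noise standard deviation sigma."""
    x_test = rng.uniform(0.0, 1.0, n_test)
    risks = []
    for _ in range(n_rep):
        x = rng.uniform(0.0, 1.0, n)
        y = f_star(x) + sigma * rng.standard_normal(n)
        K = rbf_kernel(x, x)
        # KRR coefficients: alpha = (K + n * lam * I)^{-1} y
        alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
        y_pred = rbf_kernel(x_test, x) @ alpha
        risks.append(np.mean((y_pred - f_star(x_test)) ** 2))
    return float(np.mean(risks))

if __name__ == "__main__":
    for sigma in (0.0, 0.5):          # noiseless vs. noisy regime
        for lam in (1e-8, 1e-3):      # near-interpolation vs. tuned ridge
            curve = [krr_excess_risk(n, lam, sigma) for n in (32, 128, 512)]
            print(f"sigma={sigma}, lambda={lam}: excess risk vs. n -> {curve}")
```

Comparing the noiseless runs with the noisy ones at a near-zero ridge parameter gives a numerical feel for the paper's conclusion that near-interpolating kernel ridge regression can only behave benignly when the noise level is small.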

Authors (3)
  1. Yicheng Li (38 papers)
  2. Haobo Zhang (31 papers)
  3. Qian Lin (79 papers)
Citations (9)
