Error Exponent in Agnostic PAC Learning (2405.00792v1)
Abstract: Statistical learning theory and the Probably Approximately Correct (PAC) criterion are the common approach to mathematical learning theory. PAC is widely used to analyze learning problems and algorithms and has been studied thoroughly. Uniform worst-case bounds on the convergence rate have been well established using, e.g., VC theory or Rademacher complexity. However, in a typical scenario the performance can be much better. In this paper, we consider PAC learning with a somewhat different tradeoff, the error exponent, a well-established analysis tool in Information Theory that describes the exponential behavior of the probability that the risk exceeds a certain threshold as a function of the sample size. We focus on binary classification and, under some stability assumptions, find an improved distribution-dependent error exponent for a wide range of problems, establishing the exponential behavior of the PAC error probability in agnostic learning. Interestingly, under these assumptions, agnostic learning may have the same error exponent as realizable learning. The error exponent criterion can also be applied to analyze knowledge distillation, a problem that so far lacks a theoretical analysis.
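For concreteness, the error exponent criterion described in the abstract can be stated along the following lines. This is a minimal large-deviations-style sketch; the notation ($L_D(h)$ for the risk of hypothesis $h$ under distribution $D$, $L_D^{*}$ for the best achievable risk in the class, $h_n$ for the hypothesis learned from $n$ i.i.d. samples, and $\epsilon > 0$ for the excess-risk threshold) is introduced here for illustration and may differ from the paper's exact definitions.

```latex
% Hedged sketch of a distribution-dependent error exponent for PAC learning,
% written from the abstract's description (notation is ours, not the paper's).
% L_D(h): risk of hypothesis h under distribution D
% L_D^*:  best achievable risk in the hypothesis class under D
% h_n:    hypothesis returned by the learner from n i.i.d. samples
% eps:    fixed excess-risk threshold
\[
  E_D(\epsilon) \;=\; \lim_{n \to \infty} \,-\frac{1}{n}\,
    \log \Pr\!\left[\, L_D(h_n) > L_D^{*} + \epsilon \,\right].
\]
```

When this limit exists, the PAC error probability decays roughly as $e^{-n E_D(\epsilon)}$, which is the exponential behavior in the sample size that the abstract refers to.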
Authors: Adi Hendel, Meir Feder