Learning Algorithm Generalization Error Bounds via Auxiliary Distributions (2210.00483v2)
Abstract: Generalization error bounds are essential for understanding how well machine learning models generalize. In this work, we propose a novel method, the Auxiliary Distribution Method, that yields new upper bounds on the expected generalization error in supervised learning scenarios. We show that our general upper bounds can be specialized, under certain conditions, to new bounds involving the $\alpha$-Jensen-Shannon and $\alpha$-R\'enyi ($0< \alpha < 1$) information between a random variable modeling the set of training samples and another random variable modeling the set of hypotheses. Our upper bounds based on $\alpha$-Jensen-Shannon information are also finite. Additionally, we demonstrate how the auxiliary distribution method can be used to derive upper bounds on the excess risk of some learning algorithms in the supervised learning context, as well as on the generalization error under distribution mismatch, where the mismatch is modeled by the $\alpha$-Jensen-Shannon or $\alpha$-R\'enyi divergence between the distributions of the test and training data samples. We also outline the conditions under which our proposed upper bounds can be tighter than earlier upper bounds.
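To make the two information measures named in the abstract concrete, the sketch below evaluates the standard $\alpha$-Jensen-Shannon and $\alpha$-R\'enyi divergences for explicit discrete distributions. This is a minimal numerical illustration, not the paper's derivation: the bounds in the paper involve these measures between the joint distribution of training samples and hypotheses and the product of their marginals, whereas the helper functions here (`alpha_jensen_shannon` and `alpha_renyi`, hypothetical names) only compute the divergences for given finite probability vectors, using one common skewed-mixture parameterization of the $\alpha$-Jensen-Shannon divergence.

```python
# Illustrative sketch only: standard alpha-Jensen-Shannon and alpha-Renyi
# divergences for discrete distributions, not the paper's auxiliary-distribution
# bounds themselves. Inputs are assumed to be valid probability vectors.
import numpy as np


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


def alpha_jensen_shannon(p, q, alpha):
    """alpha-JS divergence with mixture m = alpha*p + (1-alpha)*q (a common convention)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = alpha * p + (1.0 - alpha) * q
    return alpha * kl_divergence(p, m) + (1.0 - alpha) * kl_divergence(q, m)


def alpha_renyi(p, q, alpha):
    """Renyi divergence of order alpha in (0, 1): log(sum p^a q^(1-a)) / (a - 1)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.log(np.sum(p ** alpha * q ** (1.0 - alpha))) / (alpha - 1.0))


if __name__ == "__main__":
    p = [0.7, 0.2, 0.1]
    q = [0.3, 0.4, 0.3]
    for a in (0.25, 0.5, 0.75):
        print(f"alpha={a}: JS_a={alpha_jensen_shannon(p, q, a):.4f}, "
              f"Renyi_a={alpha_renyi(p, q, a):.4f}")
```

Because the $\alpha$-Jensen-Shannon divergence compares each distribution to a bounded mixture, it remains finite even when the supports differ, which is consistent with the abstract's remark that the $\alpha$-Jensen-Shannon-based bounds are finite.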