Uniform Generalization Bounds on Data-Dependent Hypothesis Sets via PAC-Bayesian Theory on Random Sets (2404.17442v2)
Abstract: We propose data-dependent uniform generalization bounds by approaching the problem from a PAC-Bayesian perspective. We first apply the PAC-Bayesian framework on "random sets" in a rigorous way, where the training algorithm is assumed to output a data-dependent hypothesis set after observing the training data. This approach allows us to prove data-dependent bounds that are applicable in numerous contexts. To highlight the power of our approach, we consider two main applications. First, we propose a PAC-Bayesian formulation of the recently developed fractal-dimension-based generalization bounds. The derived results are shown to be tighter, and they unify the existing results around one simple proof technique. Second, we prove uniform bounds over the trajectories of continuous Langevin dynamics and stochastic gradient Langevin dynamics. These results provide novel information about the generalization properties of noisy algorithms.
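For background, the sketch below recalls two classical objects the abstract builds on: a standard PAC-Bayesian bound (the McAllester/Maurer form for losses in $[0,1]$, stated for a prior $P$ fixed before seeing the sample) and the usual stochastic gradient Langevin dynamics (SGLD) update. The notation ($R$, $\widehat{R}_S$, $\eta$, $\beta$) is illustrative and not taken from the paper; the paper's actual bounds are stated uniformly over data-dependent random hypothesis sets rather than in expectation over a single posterior $Q$.

```latex
% Classical PAC-Bayesian bound (McAllester/Maurer form) for a loss in [0,1]:
% for any prior P chosen before observing the sample S of size n, with
% probability at least 1 - delta over S, simultaneously for all posteriors Q,
\[
  \mathop{\mathbb{E}}_{h \sim Q}\!\big[R(h)\big]
  \;\le\;
  \mathop{\mathbb{E}}_{h \sim Q}\!\big[\widehat{R}_S(h)\big]
  \;+\;
  \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}} .
\]
% Standard SGLD update: step size eta, inverse temperature beta, minibatch
% estimate of the empirical-risk gradient, and fresh Gaussian noise each step.
\[
  w_{k+1}
  \;=\;
  w_k \;-\; \eta\, \widehat{\nabla} \widehat{R}_S(w_k)
  \;+\; \sqrt{\tfrac{2\eta}{\beta}}\;\epsilon_k,
  \qquad \epsilon_k \sim \mathcal{N}(0, I_d).
\]
```

In this notation, the paper's second application can be read as bounding the generalization gap simultaneously for every iterate $w_k$ along the (continuous or discrete) Langevin trajectory, i.e., over the random set of hypotheses the algorithm visits, rather than for a single randomized predictor.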