Can semi-supervised learning use all the data effectively? A lower bound perspective (2311.18557v1)
Abstract: Prior works have shown that semi-supervised learning (SSL) algorithms can leverage unlabeled data to improve over the labeled sample complexity of supervised learning (SL) algorithms. However, existing theoretical analyses focus on regimes where the unlabeled data is sufficient to learn a good decision boundary using unsupervised learning (UL) alone. This raises the question: Can SSL algorithms simultaneously improve upon both UL and SL? To this end, we derive a tight lower bound for 2-Gaussian mixture models that explicitly depends on the labeled and the unlabeled dataset size as well as the signal-to-noise ratio of the mixture distribution. Surprisingly, our result implies that no SSL algorithm can improve upon the minimax-optimal statistical error rates of SL or UL algorithms for these distributions. Nevertheless, we show empirically on real-world data that SSL algorithms can still outperform UL and SL methods. Therefore, our work suggests that, while proving performance gains for SSL algorithms is possible, it requires careful tracking of constants.
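To make the setting concrete, below is a minimal simulation sketch of the symmetric two-component Gaussian mixture referenced in the abstract, comparing toy SL, UL, and SSL estimators of the mean direction. The specific estimators (labeled-sample mean, spectral method on unlabeled data, pseudo-labeling) and all parameter values are illustrative assumptions, not the algorithms or bounds analyzed in the paper.

```python
# Illustrative sketch (not from the paper): X ~ 0.5*N(+theta, I) + 0.5*N(-theta, I),
# label Y is the sign of the component, and the signal-to-noise ratio is ||theta||.
import numpy as np

rng = np.random.default_rng(0)
d, n_l, n_u, snr = 20, 50, 2000, 1.5
theta = np.zeros(d); theta[0] = snr          # true mean direction with ||theta|| = snr

def sample(n):
    y = rng.choice([-1.0, 1.0], size=n)
    x = y[:, None] * theta + rng.standard_normal((n, d))
    return x, y

x_l, y_l = sample(n_l)                       # labeled data
x_u, _ = sample(n_u)                         # unlabeled data (labels discarded)

# SL estimate: average of y_i * x_i over the labeled set.
theta_sl = (y_l[:, None] * x_l).mean(axis=0)

# UL estimate: leading eigenvector of the empirical second-moment matrix
# (spectral method); its sign is ambiguous, so align it with the labeled estimate.
evals, evecs = np.linalg.eigh(x_u.T @ x_u / n_u)
theta_ul = evecs[:, -1] * np.sqrt(max(evals[-1] - 1.0, 0.0))
theta_ul *= np.sign(theta_ul @ theta_sl)

# Naive SSL estimate (illustrative): pseudo-label the unlabeled points with the
# UL direction, then refit on labeled plus pseudo-labeled data.
y_pseudo = np.sign(x_u @ theta_ul)
theta_ssl = np.concatenate([y_l[:, None] * x_l,
                            y_pseudo[:, None] * x_u]).mean(axis=0)

def excess_error(theta_hat, n_test=100_000):
    # Excess misclassification error of sign(<theta_hat, x>) over the Bayes rule.
    x, y = sample(n_test)
    err = np.mean(np.sign(x @ theta_hat) != y)
    bayes = np.mean(np.sign(x @ theta) != y)
    return err - bayes

for name, t in [("SL", theta_sl), ("UL", theta_ul), ("SSL", theta_ssl)]:
    print(f"{name}: excess error ~ {excess_error(t):.4f}")
```

Varying n_l, n_u, and snr in such a simulation mirrors the quantities the paper's lower bound depends on; the toy SSL estimator above is only meant to show where constant-factor (rather than rate) improvements could appear.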