Out-Of-Domain Unlabeled Data Improves Generalization (2310.00027v2)

Published 29 Sep 2023 in stat.ML and cs.LG

Abstract: We propose a novel framework for incorporating unlabeled data into semi-supervised classification problems, covering the minimization of both i) adversarially robust and ii) non-robust loss functions. Notably, we allow the unlabeled samples to deviate slightly (in the total variation sense) from the in-domain distribution. The core idea behind our framework is to combine Distributionally Robust Optimization (DRO) with self-supervised training. As a result, we also leverage efficient polynomial-time algorithms for the training stage. From a theoretical standpoint, we apply our framework to the classification problem of a mixture of two Gaussians in $\mathbb{R}^d$, where in addition to the $m$ independent and labeled samples from the true distribution, a set of $n$ (usually with $n\gg m$) out-of-domain and unlabeled samples are given as well. Using only the labeled data, the generalization error is known to be bounded by a term proportional to $\left(d/m\right)^{1/2}$. However, using our method on both isotropic and non-isotropic Gaussian mixture models, one can derive a new set of analytically explicit and non-asymptotic bounds which show substantial improvement on the generalization error compared to ERM. Our results underscore two significant insights: 1) out-of-domain samples, even when unlabeled, can be harnessed to narrow the generalization gap, provided that the true data distribution adheres to a form of the "cluster assumption", and 2) the semi-supervised learning paradigm can be regarded as a special case of our framework when there are no distributional shifts. We validate our claims through experiments conducted on a variety of synthetic and real-world datasets.

Authors (6)

Amir Hossein Saberi
Amir Najafi
Alireza Heidari
Mohammad Hosein Movasaghinia
Abolfazl Motahari
Babak H. Khalaj
