3 research outputs found
The information-theoretic value of unlabeled data in semi-supervised learning
We quantify the separation between the numbers of labeled examples required
to learn in two settings: with and without knowledge of the distribution of
the unlabeled data. More specifically, we prove a separation by a
multiplicative factor of Θ(log n) for the class of projections over the
Boolean hypercube of dimension n. We prove that there is no separation for
the class of all functions on a domain of any size.
Learning with the knowledge of the distribution (a.k.a. fixed-distribution
learning) can be viewed as an idealized scenario of semi-supervised learning
where the number of unlabeled data points is so great that the unlabeled
distribution is known exactly. For this reason, we call the separation the
value of unlabeled data.
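As a toy illustration of the hypothesis class discussed above (not code from the paper), the sketch below learns a projection over the Boolean hypercube: each hypothesis outputs a single hidden coordinate of the input, and labeled examples eliminate inconsistent coordinates.

```python
import random

def learn_projection(samples, n):
    """Given labeled samples (x, y) with x in {0,1}^n and y = x[i] for some
    hidden coordinate i, return the set of coordinates still consistent
    with every labeled example."""
    consistent = set(range(n))
    for x, y in samples:
        consistent = {i for i in consistent if x[i] == y}
    return consistent

# Hidden target: projection onto coordinate 3 of the 8-dimensional hypercube.
n, target = 8, 3
random.seed(0)
samples = []
for _ in range(20):
    x = [random.randint(0, 1) for _ in range(n)]
    samples.append((x, x[target]))

remaining = learn_projection(samples, n)
assert target in remaining  # the true coordinate always survives elimination
```

Each random labeled example rules out a wrong coordinate with probability 1/2, so roughly log n examples suffice to isolate the target; the paper's question is how many fewer labels are needed when the unlabeled distribution is known.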
When can unlabeled data improve the learning rate?
In semi-supervised classification, one is given access both to labeled and
unlabeled data. As unlabeled data is typically cheaper to acquire than labeled
data, this setup becomes advantageous as soon as one can exploit the unlabeled
data in order to produce a better classifier than with labeled data alone.
However, the conditions under which such an improvement is possible are not
fully understood yet. Our analysis focuses on improvements in the minimax
learning rate in terms of the number of labeled examples (with the number of
unlabeled examples being allowed to depend on the number of labeled ones). We
argue that for such improvements to be realistic and indisputable, certain
specific conditions should be satisfied and previous analyses have failed to
meet those conditions. We then demonstrate examples where these conditions can
be met, in particular showing rate changes from 1/√ℓ to e^(−ℓ) and
from 1/√ℓ to 1/ℓ, where ℓ is the number of labeled examples. These results
improve our understanding of what is and isn't possible in semi-supervised
learning.
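To make the notion of a minimax learning rate in the number of labeled examples ℓ concrete, the snippet below compares a slow 1/√ℓ rate against a faster 1/ℓ rate; these are representative rates for illustration, not results taken from the paper.

```python
import math

# Toy comparison of two decay rates in the number of labeled examples ell:
# a slow 1/sqrt(ell) rate versus a faster 1/ell rate. Their ratio grows
# like sqrt(ell), so the gap widens as more labels become available.
def slow_rate(ell):
    return 1 / math.sqrt(ell)

def fast_rate(ell):
    return 1 / ell

for ell in (10, 100, 1000):
    print(f"ell={ell:5d}  slow={slow_rate(ell):.4f}  fast={fast_rate(ell):.4f}")
```

A semi-supervised improvement in this sense means the minimax risk, as a function of ℓ alone, moves from the slower curve to the faster one once unlabeled data is added.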
Improvability Through Semi-Supervised Learning: A Survey of Theoretical Results
Semi-supervised learning is a setting in which one has labeled and unlabeled
data available. In this survey we explore different types of theoretical
results when one uses unlabeled data in classification and regression tasks.
Most methods that use unlabeled data rely on certain assumptions about the data
distribution. When those assumptions are not met in reality, including
unlabeled data may actually decrease performance. Studying such methods, it
therefore is particularly important to have an understanding of the underlying
theory. In this review we gather results about the possible gains one can
achieve when using semi-supervised learning as well as results about the limits
of such methods. More precisely, this review collects the answers to the
following questions: What are, in terms of improving supervised methods, the
limits of semi-supervised learning? What are the assumptions of different
methods? What can we achieve if the assumptions are true? Finally, we also
discuss the biggest bottleneck of semi-supervised learning, namely the
assumptions these methods make.