
    Unlabeled Data Does Provably Help

    A fully supervised learner needs access to correctly labeled examples, whereas a semi-supervised learner has access to examples only part of which are labeled. The hope is that a large collection of unlabeled examples significantly reduces the need for labeled ones. It is widely believed that this reduction of "label complexity" is marginal unless the hidden target concept and the domain distribution satisfy some "compatibility assumptions". Some recent papers support this belief. In this paper, we revitalize the discussion by presenting a result that goes in the other direction. To this end, we consider the PAC-learning model in two settings: the (classical) fully supervised setting and the semi-supervised setting. We show that the "label-complexity gap" between the semi-supervised and the fully supervised setting can become arbitrarily large for concept classes of infinite VC-dimension (or for sequences of classes whose VC-dimensions are finite but become arbitrarily large). On the other hand, this gap is bounded by O(ln |C|) for every finite concept class C that contains the constant-zero and the constant-one function. A similar statement holds for all classes C of finite VC-dimension.
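    For orientation, the classical realizable-case PAC bound for a finite concept class C (a textbook fact, not a result of this paper) says that any learner returning a hypothesis consistent with

        m(\varepsilon,\delta) \;\ge\; \frac{1}{\varepsilon}\Bigl(\ln|C| + \ln\tfrac{1}{\delta}\Bigr)

    labeled examples achieves error at most \varepsilon with probability at least 1-\delta; the ln|C| term here is the same quantity that appears in the label-complexity gap bound stated above.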

    Density-sensitive semisupervised inference

    Semisupervised methods are techniques for using labeled data (X_1,Y_1),\ldots,(X_n,Y_n) together with unlabeled data X_{n+1},\ldots,X_N to make predictions. These methods invoke some assumptions that link the marginal distribution P_X of X to the regression function f(x). For example, it is common to assume that f is very smooth over high density regions of P_X. Many of the methods are ad-hoc and have been shown to work in specific examples but are lacking a theoretical foundation. We provide a minimax framework for analyzing semisupervised methods. In particular, we study methods based on metrics that are sensitive to the distribution P_X. Our model includes a parameter \alpha that controls the strength of the semisupervised assumption. We then use the data to adapt to \alpha. Comment: Published in at http://dx.doi.org/10.1214/13-AOS1092 the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)
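    As a concrete toy example of a density-sensitive method (an illustrative sketch under assumed names and a simplified weighting, not the estimator studied in the paper), the snippet below reweights a kernel smoother for f(x) by an estimate of P_X computed from both labeled and unlabeled points, with an exponent alpha controlling how strongly the unlabeled-data density enters; alpha = 0 reduces to an ordinary supervised smoother.

        import numpy as np

        def kde(points, X, h):
            # Unnormalized Gaussian kernel density estimate of P_X at `points`,
            # built from all inputs X (labeled and unlabeled).
            d2 = ((points[:, None, :] - X[None, :, :]) ** 2).sum(-1)
            return np.exp(-d2 / (2 * h ** 2)).mean(axis=1)

        def density_sensitive_predict(x_query, X_lab, y_lab, X_all, h=0.5, alpha=1.0):
            # Toy density-sensitive smoother: kernel weights on labeled points are
            # multiplied by (estimated density at the query/point midpoint)**alpha,
            # so predictions lean more on neighbors reached through high-density
            # regions of P_X when alpha > 0.
            mid = (x_query[None, :] + X_lab) / 2.0
            dens = kde(mid, X_all, h)
            d2 = ((x_query[None, :] - X_lab) ** 2).sum(-1)
            w = np.exp(-d2 / (2 * h ** 2)) * dens ** alpha
            return float(w @ y_lab / w.sum())

        # Example: 20 labeled and 180 unlabeled one-dimensional inputs.
        rng = np.random.default_rng(0)
        X_all = rng.normal(size=(200, 1))
        X_lab, y_lab = X_all[:20], np.sin(3 * X_all[:20, 0])
        print(density_sensitive_predict(np.array([0.3]), X_lab, y_lab, X_all))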

    Asymptotic Analysis of Generative Semi-Supervised Learning

    Semisupervised learning has emerged as a popular framework for improving modeling accuracy while controlling labeling cost. Based on an extension of stochastic composite likelihood, we quantify the asymptotic accuracy of generative semi-supervised learning. In doing so, we complement distribution-free analysis by providing an alternative framework to measure the value associated with different labeling policies and resolve the fundamental question of how much data to label and in what manner. We demonstrate our approach with both simulation studies and real world experiments using naive Bayes for text classification and MRFs and CRFs for structured prediction in NLP. Comment: 12 pages, 9 figures
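    For a concrete instance of the generative semi-supervised setting, the sketch below implements the standard EM recipe for semi-supervised multinomial naive Bayes on word-count data (the textbook construction this kind of asymptotic analysis applies to; function and variable names are illustrative, and this is not code from the paper).

        import numpy as np

        def semisup_naive_bayes_em(X_lab, y_lab, X_unlab, n_classes, n_iter=20, smooth=1.0):
            # X_lab, X_unlab: document-term count matrices; y_lab: integer class labels.
            # Hard responsibilities for labeled docs, uniform soft labels to start with.
            R_lab = np.eye(n_classes)[y_lab]
            R_unlab = np.full((X_unlab.shape[0], n_classes), 1.0 / n_classes)
            for _ in range(n_iter):
                R = np.vstack([R_lab, R_unlab])
                X = np.vstack([X_lab, X_unlab])
                # M-step: class priors and smoothed per-class word distributions
                # from labeled plus soft-labeled documents.
                prior = R.sum(axis=0) / R.sum()
                word = R.T @ X + smooth
                word /= word.sum(axis=1, keepdims=True)
                # E-step: posterior class responsibilities for the unlabeled documents.
                log_post = X_unlab @ np.log(word.T) + np.log(prior)
                log_post -= log_post.max(axis=1, keepdims=True)
                R_unlab = np.exp(log_post)
                R_unlab /= R_unlab.sum(axis=1, keepdims=True)
            return prior, word

    Sweeping the fraction of labeled documents in such a toy setup is one empirical way to see the accuracy-versus-labeling-cost trade-off that the paper quantifies asymptotically.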

    Semi-Supervised Learning, Causality and the Conditional Cluster Assumption

    While the success of semi-supervised learning (SSL) is still not fully understood, Sch\"olkopf et al. (2012) have established a link to the principle of independent causal mechanisms. They conclude that SSL should be impossible when predicting a target variable from its causes, but possible when predicting it from its effects. Since both these cases are somewhat restrictive, we extend their work by considering classification using cause and effect features at the same time, such as predicting disease from both risk factors and symptoms. While standard SSL exploits information contained in the marginal distribution of all inputs (to improve the estimate of the conditional distribution of the target given inputs), we argue that in our more general setting we should use information in the conditional distribution of effect features given causal features. We explore how this insight generalises the previous understanding, and how it relates to and can be exploited algorithmically for SSL. Comment: 36th Conference on Uncertainty in Artificial Intelligence (2020). (Previously presented at the NeurIPS 2019 workshop "Do the right thing": machine learning and causal inference for improved decision making, Vancouver, Canada.)
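    To make the cause/effect split concrete, the toy sketch below predicts a binary target from causal features Xc (e.g. risk factors) and effect features Xe (e.g. symptoms), letting unlabeled (Xc, Xe) pairs enter through the factorization p(Xe | Xc) = sum_y p(y | Xc) p(Xe | y). It assumes, purely for illustration, that p(y | Xc) is logistic and that Xe is class-conditionally Gaussian and independent of Xc given y; this is not the algorithm proposed in the paper, only a minimal instance of exploiting the conditional distribution of effects given causes.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def cause_effect_ssl(Xc_lab, Xe_lab, y_lab, Xc_unlab, Xe_unlab, n_iter=10):
            # Class-conditional Gaussian model for the effect features, estimated
            # from the labeled data (a fuller version would also re-estimate these
            # using the responsibilities).
            mu = [Xe_lab[y_lab == k].mean(axis=0) for k in (0, 1)]
            var = [Xe_lab[y_lab == k].var(axis=0) + 1e-6 for k in (0, 1)]

            def log_p_xe(Xe, k):
                # log p(Xe | y=k) under a diagonal Gaussian.
                return -0.5 * (((Xe - mu[k]) ** 2) / var[k] + np.log(2 * np.pi * var[k])).sum(axis=1)

            clf = LogisticRegression().fit(Xc_lab, y_lab)
            for _ in range(n_iter):
                # E-step: p(y=1 | Xc, Xe) proportional to p(y=1 | Xc) p(Xe | y=1) on unlabeled pairs.
                p1 = clf.predict_proba(Xc_unlab)[:, 1]
                l1 = np.log(p1 + 1e-12) + log_p_xe(Xe_unlab, 1)
                l0 = np.log(1 - p1 + 1e-12) + log_p_xe(Xe_unlab, 0)
                resp = 1.0 / (1.0 + np.exp(l0 - l1))
                # M-step: refit p(y | Xc) on labeled data plus soft-labeled unlabeled data.
                Xc_aug = np.vstack([Xc_lab, Xc_unlab, Xc_unlab])
                y_aug = np.concatenate([y_lab, np.ones(len(Xc_unlab)), np.zeros(len(Xc_unlab))])
                w_aug = np.concatenate([np.ones(len(y_lab)), resp, 1.0 - resp])
                clf = LogisticRegression().fit(Xc_aug, y_aug, sample_weight=w_aug)
            return clf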