Explicit Learning Curves for Transduction and Application to Clustering and Compression Algorithms
Inductive learning is based on inferring a general rule from a finite data
set and using it to label new data. In transduction one attempts to solve the
problem of using a labeled training set to label a set of unlabeled points,
which are given to the learner prior to learning. Although transduction seems
at the outset to be an easier task than induction, there have not been many
provably useful algorithms for transduction. Moreover, the precise relation
between induction and transduction has not yet been determined. The main
theoretical developments related to transduction were presented by Vapnik more
than twenty years ago. One of Vapnik's basic results is a rather tight error
bound for transductive classification based on an exact computation of the
hypergeometric tail. While tight, this bound is given implicitly via a
computational routine. Our first contribution is a somewhat looser but explicit
characterization of a slightly extended PAC-Bayesian version of Vapnik's
transductive bound. This characterization is obtained using concentration
inequalities for the tail of sums of random variables obtained by sampling
without replacement. We then derive error bounds for compression schemes such
as (transductive) support vector machines and for transduction algorithms based
on clustering. The main observation used for deriving these new error bounds
and algorithms is that the unlabeled test points, which in the transductive
setting are known in advance, can be used to construct useful data-dependent
prior distributions over the hypothesis space.
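To give the flavor of the tool used here (a classical statement, not the paper's exact inequality), a Serfling-type bound for sampling without replacement says that if X_1, ..., X_n are drawn without replacement from a finite population of N values in [0, 1] with mean \mu, then for every t > 0,

    \Pr\Big[ \tfrac{1}{n} \textstyle\sum_{i=1}^{n} X_i - \mu \ge t \Big] \;\le\; \exp\Big( \tfrac{-2 n t^2}{1 - (n-1)/N} \Big),

which is never worse than the i.i.d. Hoeffding bound and tightens as the sample exhausts the population. Bounds of this explicit form stand in for the exact, implicitly computed hypergeometric tail.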
PAC-Bayesian Learning and Domain Adaptation
In machine learning, Domain Adaptation (DA) arises when the distribution
generating the test (target) data differs from the one generating the learning
(source) data. It is well known that DA is a hard task even under strong
assumptions, among which is covariate shift, where the source and target
distributions diverge only in their marginals, i.e., they share the same
labeling function. Another popular approach is to consider a hypothesis class
that brings the two distributions closer while ensuring a low error on both
tasks. This is a VC-dimension approach that restricts the complexity of a
hypothesis class in order to obtain good generalization. Instead, we propose a
PAC-Bayesian approach that seeks suitable weights for each hypothesis in
order to build a majority vote. We prove a new DA bound in the PAC-Bayesian
context. This leads us to design the first DA-PAC-Bayesian algorithm based on
the minimization of the proposed bound. In doing so, we seek a \rho-weighted
majority vote that takes into account a trade-off between three quantities. The
first two quantities are, as usual in the PAC-Bayesian approach, (a) the
complexity of the majority vote (measured by a Kullback-Leibler divergence) and
(b) its empirical risk (measured by the \rho-average errors on the source
sample). The third quantity is (c) the capacity of the majority vote to
distinguish some structural difference between the source and target samples.
Comment: https://sites.google.com/site/multitradeoffs2012
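As a rough schematic of such a trade-off (our illustration, not the paper's exact bound), the minimized objective balances

    \min_{\rho}\; \mathrm{KL}(\rho \,\|\, \pi)
        \;+\; C_1\, \widehat{R}_S(\rho)
        \;+\; C_2\, \mathrm{dis}_{\rho}(S, T),

where \pi is a prior over hypotheses, \widehat{R}_S(\rho) is the \rho-average empirical error on the source sample S, \mathrm{dis}_{\rho}(S, T) measures how differently the \rho-weighted vote behaves on the source and target samples, and C_1, C_2 are trade-off constants; all of these symbols are illustrative placeholders.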
Domain Adaptation of Majority Votes via Perturbed Variation-based Label Transfer
We tackle the PAC-Bayesian Domain Adaptation (DA) problem. This arises when
one desires to learn, from a source distribution, a good weighted majority vote
(over a set of classifiers) on a different target distribution. In this
context, the disagreement between classifiers is known to be crucial to
control. In the non-DA supervised setting, a theoretical bound - the C-bound - involves this
disagreement and leads to a majority vote learning algorithm: MinCq. In this
work, we extend MinCq to DA by taking advantage of an elegant divergence
between distributions called the Perturbed Variation (PV). Firstly, justified by
a new formulation of the C-bound, we provide MinCq with a target sample labeled
via a PV-based self-labeling focused on regions where the source and
target marginal distributions are closer. Secondly, we propose an original
process for tuning the hyperparameters. Our framework shows very promising
results on a toy problem.
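For reference, the perturbed variation of Harel and Mannor (2012) can be sketched as follows (stated from memory; see the original paper for the precise definition): for distributions P and Q over a metric space with metric d and a tolerance \epsilon > 0,

    \mathrm{PV}(P, Q; \epsilon) \;=\; \inf_{\mu \in \Pi(P, Q)} \; \Pr_{(X, Y) \sim \mu}\big[\, d(X, Y) > \epsilon \,\big],

where \Pi(P, Q) is the set of couplings of P and Q. Two distributions are close in PV if most of their mass can be matched up to perturbations of size \epsilon, and the empirical version can be computed from two samples via maximum matching of points within distance \epsilon.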
Improved Vapnik Cervonenkis bounds
We give a new proof of VC bounds where we avoid the use of symmetrization and
use a shadow sample of arbitrary size. We also improve on the variance term.
This results in better constants, as shown on numerical examples. Moreover, our
bounds still hold for non-identically distributed independent random variables.
Keywords: Statistical learning theory, PAC-Bayesian theorems, VC dimension
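For context (a classical statement, not this paper's improved bound), a standard VC bound says that for a hypothesis class of VC dimension d and an i.i.d. sample of size n, with probability at least 1 - \delta every hypothesis h satisfies

    R(h) \;\le\; \widehat{R}_n(h) \;+\; \sqrt{\frac{d\big(\ln(2n/d) + 1\big) + \ln(4/\delta)}{n}},

up to constants that vary across proofs; symmetrization via a 'ghost' sample of the same size n is the classical ingredient that this work replaces with a shadow sample of arbitrary size.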
Domain adaptation of weighted majority votes via perturbed variation-based self-labeling
In machine learning, the domain adaptation problem arises when the test
(target) and the training (source) data are generated from different
distributions. A key applied issue is thus the design of algorithms able to
generalize on a new distribution, for which we have no label information. We
focus on learning classification models defined as a weighted majority vote
over a set of real-valued functions. In this context, Germain et al. (2013)
have shown that a measure of disagreement between these functions is crucial to
control. The core of this measure is a theoretical bound--the C-bound (Lacasse
et al., 2007)--which involves the disagreement and leads to a well performing
majority vote learning algorithm in the usual non-adaptive supervised setting:
MinCq. In this work, we propose a framework to extend MinCq to a domain
adaptation scenario. This procedure takes advantage of the recent perturbed
variation divergence between distributions proposed by Harel and Mannor (2012).
Justified by a theoretical bound on the target risk of the vote, we provide
MinCq with a target sample labeled via a perturbed variation-based
self-labeling focused on the regions where the source and target marginals
appear similar. We also study the influence of our self-labeling, from which we
deduce an original process for tuning the hyperparameters. Finally, our
framework called PV-MinCq shows very promising results on a rotation and
translation synthetic problem.
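To make the self-labeling idea concrete, here is a minimal sketch in Python, assuming target points are labeled by a matched source point within distance eps, in the spirit of an empirical PV matching; the function name, the matching routine, and the parameter eps are our illustrative choices, not the authors' implementation:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def pv_self_label(X_src, y_src, X_tgt, eps):
        # Pairwise Euclidean distances between source and target points.
        dists = np.linalg.norm(X_src[:, None, :] - X_tgt[None, :, :], axis=2)
        # Forbid matches farther than eps with a prohibitive cost.
        cost = np.where(dists <= eps, dists, 1e9)
        # Min-cost bipartite matching (rectangular matrices are allowed).
        rows, cols = linear_sum_assignment(cost)
        y_tgt = np.full(len(X_tgt), -1)  # -1 marks "left unlabeled"
        for i, j in zip(rows, cols):
            if dists[i, j] <= eps:       # keep only genuine eps-matches
                y_tgt[j] = y_src[i]
        return y_tgt

Target points left at -1 would be discarded before handing the labeled sample to MinCq; the actual PV-MinCq procedure, including the hyperparameter tuning process mentioned above, is more involved.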
Validation of Matching
We introduce a technique to compute probably approximately correct (PAC)
bounds on precision and recall for matching algorithms. The bounds require some
verified matches, but those matches may be used to develop the algorithms. The
bounds can be applied to network reconciliation or entity resolution
algorithms, which identify nodes in different networks or values in a data set
that correspond to the same entity. For network reconciliation, the bounds do
not require knowledge of the network generation process.
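As an illustrative sketch (our example, not the paper's actual bounds, which notably allow the verified matches to be reused for algorithm development), a naive Hoeffding-style PAC lower bound on precision from n verified, i.i.d.-sampled proposed matches could be computed as:

    import math

    def precision_lower_bound(n_verified, n_correct, delta):
        # With probability >= 1 - delta, true precision is at least the
        # empirical precision minus a Hoeffding deviation term.
        p_hat = n_correct / n_verified
        slack = math.sqrt(math.log(1.0 / delta) / (2.0 * n_verified))
        return max(0.0, p_hat - slack)

    # e.g. 90 correct among 100 verified matches, at 95% confidence:
    print(precision_lower_bound(100, 90, 0.05))  # ~0.78

An analogous bound applies to recall given a verified sample of true matches.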
A review of domain adaptation without target labels
Domain adaptation has become a prominent problem setting in machine learning
and related fields. This review asks the question: how can a classifier learn
from a source domain and generalize to a target domain? We present a
categorization of approaches, divided into what we refer to as sample-based,
feature-based and inference-based methods. Sample-based methods focus on
weighting individual observations during training based on their importance to
the target domain. Feature-based methods revolve around mapping, projecting
and representing features such that a source classifier performs well on the
target domain. Inference-based methods incorporate adaptation into the
parameter estimation procedure, for instance through constraints on the
optimization procedure. Additionally, we review a number of conditions that
allow for formulating bounds on the cross-domain generalization error. Our
categorization highlights recurring ideas and raises questions important to
further research.
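As a small illustration of the sample-based family (our sketch, not an algorithm from the review), importance weights can be estimated with a logistic domain discriminator and passed to a standard classifier:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def importance_weights(X_src, X_tgt):
        # Fit a domain classifier: source = 0, target = 1.
        X = np.vstack([X_src, X_tgt])
        d = np.concatenate([np.zeros(len(X_src)), np.ones(len(X_tgt))])
        disc = LogisticRegression(max_iter=1000).fit(X, d)
        # p / (1 - p) estimates p_target(x) / p_source(x) on source points
        # (assuming equally sized source and target samples).
        p = disc.predict_proba(X_src)[:, 1]
        return p / np.clip(1.0 - p, 1e-12, None)

    # Reweight source examples toward the target during training:
    # clf = LogisticRegression().fit(
    #     X_src, y_src, sample_weight=importance_weights(X_src, X_tgt))

Feature-based and inference-based methods would instead transform X before fitting or modify the estimation objective itself.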