Fast learning rates for plug-in classifiers
It has been recently shown that, under the margin (or low noise) assumption, there exist classifiers attaining fast rates of convergence of the excess Bayes risk, that is, rates faster than $n^{-1/2}$. The work on this subject has suggested the following two conjectures: (i) the best achievable fast rate is of the order $n^{-1}$, and (ii) plug-in classifiers generally converge more slowly than classifiers based on empirical risk minimization. We show that neither conjecture is correct. In particular, we construct plug-in classifiers that can achieve not only fast, but also super-fast rates, that is, rates faster than $n^{-1}$. We establish minimax lower bounds showing that the obtained rates cannot be improved.
Comment: Published at http://dx.doi.org/10.1214/009053606000001217 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org).
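To make the plug-in principle concrete: estimate the regression function $\eta(x) = P(Y = 1 \mid X = x)$ nonparametrically, then classify by thresholding the estimate at 1/2. The paper obtains its rates with local polynomial estimators of $\eta$; the sketch below substitutes a simple Nadaraya-Watson kernel regressor purely for illustration, with labels assumed to lie in {0, 1} and all names and parameters ours, not the paper's.

    import numpy as np

    def plug_in_predict(X_train, y_train, x, bandwidth=0.5):
        """Plug-in rule: estimate eta(x) = P(Y=1 | X=x) with a
        Nadaraya-Watson kernel regressor, then mimic the Bayes
        classifier by thresholding the estimate at 1/2."""
        sq_dists = np.sum((X_train - x) ** 2, axis=1)
        weights = np.exp(-sq_dists / (2 * bandwidth ** 2))  # Gaussian kernel
        eta_hat = weights @ y_train / weights.sum()         # estimate of eta(x)
        return int(eta_hat >= 0.5)                          # plug-in decision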
Anisotropic oracle inequalities in noisy quantization
The effect of errors in variables in quantization is investigated. We prove
general exact and non-exact oracle inequalities with fast rates for an
empirical minimization based on a noisy sample $Z_i = X_i + \epsilon_i$, $i = 1, \ldots, n$, where the $X_i$ are i.i.d. with density $f$ and the $\epsilon_i$ are i.i.d. with density $\eta$. These rates depend on the geometry of the density $f$ and the asymptotic behaviour of the characteristic function of $\eta$.
This general study can be applied to the problem of $k$-means clustering with noisy data. For this purpose, we introduce a deconvolution $k$-means stochastic minimization which reaches fast rates of convergence under Pollard's standard regularity assumptions.
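The setting is easy to reproduce. The sketch below only sets up the observation model $Z_i = X_i + \epsilon_i$ and runs a plain Lloyd iteration on the noisy sample, i.e. the naive baseline whose bias the deconvolution $k$-means procedure is designed to correct; the mixture, the Laplace noise and all parameters are illustrative assumptions, not the paper's.

    import numpy as np

    rng = np.random.default_rng(0)

    # Observation model from the abstract: Z_i = X_i + eps_i, where the
    # X_i have density f and the eps_i have density eta (Laplace here).
    n, k = 500, 2
    X = np.concatenate([rng.normal(-2.0, 0.5, n // 2),
                        rng.normal(+2.0, 0.5, n // 2)])[:, None]
    Z = X + rng.laplace(scale=0.8, size=(n, 1))

    def lloyd(Z, k, n_iter=50):
        """Plain Lloyd (k-means) iteration on the noisy sample Z: the
        naive approach that ignores the noise and clusters Z directly."""
        centers = Z[rng.choice(len(Z), k, replace=False)]
        for _ in range(n_iter):
            labels = np.argmin(np.linalg.norm(Z[:, None] - centers[None], axis=2), axis=1)
            centers = np.stack([Z[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        return centers

    print(lloyd(Z, k))  # centroids of Z, biased relative to those of X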
Gibbs Max-margin Topic Models with Data Augmentation
Max-margin learning is a powerful approach to building classifiers and
structured output predictors. Recent work on max-margin supervised topic models
has successfully integrated it with Bayesian topic models to discover
discriminative latent semantic structures and make accurate predictions for
unseen testing data. However, the resulting learning problems are usually hard
to solve because of the non-smoothness of the margin loss. Existing approaches
to building max-margin supervised topic models rely on an iterative procedure
to solve multiple latent SVM subproblems with additional mean-field assumptions
on the desired posterior distributions. This paper presents an alternative
approach by defining a new max-margin loss. Namely, we present Gibbs max-margin
supervised topic models, a latent variable Gibbs classifier to discover hidden
topic representations for various tasks, including classification, regression
and multi-task learning. Gibbs max-margin supervised topic models minimize an expected margin loss, which is an upper bound on the margin loss of the expected prediction rule. By introducing augmented variables
and integrating out the Dirichlet variables analytically by conjugacy, we
develop simple Gibbs sampling algorithms with no restrictive assumptions and no
need to solve SVM subproblems. Furthermore, each step of the
"augment-and-collapse" Gibbs sampling algorithms has an analytical conditional
distribution, from which samples can be easily drawn. Experimental results
demonstrate significant improvements on time efficiency. The classification
performance is also significantly improved over competitors on binary,
multi-class and multi-label classification tasks.
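The pivotal observation is that the margin (hinge) loss is convex, so by Jensen's inequality the expected margin loss dominates the margin loss of the expected prediction rule; minimizing the former therefore minimizes an upper bound on the latter. A tiny numerical check of that inequality, with the posterior draws of the discriminant score simulated rather than taken from the paper's model:

    import numpy as np

    rng = np.random.default_rng(1)

    # Stand-in posterior draws of the discriminant score for one document
    # with label y = +1 (illustrative values, not the paper's model).
    y = 1.0
    scores = rng.normal(loc=0.3, scale=1.0, size=10_000)

    def hinge(margin):
        return np.maximum(0.0, 1.0 - margin)

    expected_loss = hinge(y * scores).mean()   # E[max(0, 1 - y * f)]
    loss_of_mean = hinge(y * scores.mean())    # max(0, 1 - y * E[f])

    # Jensen: expected margin loss >= margin loss of the expected rule.
    assert expected_loss >= loss_of_mean
    print(expected_loss, loss_of_mean)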
Faster Rates for Policy Learning
This article improves the existing proven rates of regret decay in optimal
policy estimation. We give a margin-free result showing that the regret decay
for estimating a within-class optimal policy is second-order for empirical risk
minimizers over Donsker classes, with regret decaying at a faster rate than the
standard error of an efficient estimator of the value of an optimal policy. We
also give a result from the classification literature that shows that faster
regret decay is possible via plug-in estimation provided a margin condition
holds. Four examples are considered. In these examples, the regret is expressed
in terms of either the mean value or the median value; the number of possible
actions is either two or finitely many; and the sampling scheme is either
independent and identically distributed or sequential, where the latter
represents a contextual bandit sampling scheme.
Generalization error for multi-class margin classification
In this article, we study rates of convergence of the generalization error of
multi-class margin classifiers. In particular, we develop an upper bound theory
quantifying the generalization error of various large margin classifiers. The
theory permits a treatment of general margin losses, convex or nonconvex, in the presence or absence of a dominating class. Three main results are established.
First, for any fixed margin loss, there may be a trade-off between the ideal
and actual generalization performances with respect to the choice of the class
of candidate decision functions, which is governed by the trade-off between the
approximation and estimation errors. In fact, different margin losses lead to
different ideal or actual performances in specific cases. Second, we
demonstrate, in a problem of linear learning, that the convergence rate can be arbitrarily fast in the sample size $n$ depending on the joint distribution of the input/output pair. This goes beyond the anticipated rate of $O(n^{-1})$. Third, we establish rates of convergence of several margin classifiers in feature selection with the number of candidate variables $p$ allowed to greatly exceed the sample size $n$ but no faster than $\exp(n)$.
Comment: Published at http://dx.doi.org/10.1214/07-EJS069 in the Electronic Journal of Statistics (http://www.i-journals.org/ejs/) by the Institute of Mathematical Statistics (http://www.imstat.org).
Fast rates in statistical and online learning
The speed with which a learning algorithm converges as it is presented with
more data is a central problem in machine learning: a fast rate of convergence means less data is needed for the same level of performance. The
pursuit of fast rates in online and statistical learning has led to the
discovery of many conditions in learning theory under which fast learning is
possible. We show that most of these conditions are special cases of a single, unifying condition that comes in two forms: the central condition for 'proper'
learning algorithms that always output a hypothesis in the given model, and
stochastic mixability for online algorithms that may make predictions outside
of the model. We show that under surprisingly weak assumptions both conditions
are, in a certain sense, equivalent. The central condition has a
re-interpretation in terms of convexity of a set of pseudoprobabilities,
linking it to density estimation under misspecification. For bounded losses, we
show how the central condition enables a direct proof of fast rates and we
prove its equivalence to the Bernstein condition, itself a generalization of
the Tsybakov margin condition, both of which have played a central role in
obtaining fast rates in statistical learning. Yet, while the Bernstein
condition is two-sided, the central condition is one-sided, making it more
suitable to deal with unbounded losses. In its stochastic mixability form, our
condition generalizes both a stochastic exp-concavity condition identified by Juditsky, Rigollet and Tsybakov, and Vovk's notion of mixability. Our unifying
conditions thus provide a substantial step towards a characterization of fast
rates in statistical learning, similar to how classical mixability
characterizes constant regret in the sequential prediction with expert advice
setting.
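For reference, the Bernstein condition mentioned above is commonly stated as follows (a standard form; $\beta = 1$ is the classical case): for some constants $B > 0$ and $\beta \in (0, 1]$,

$$\mathbb{E}\bigl[(\ell_f - \ell_{f^*})^2\bigr] \;\le\; B\,\bigl(\mathbb{E}[\ell_f - \ell_{f^*}]\bigr)^{\beta} \qquad \text{for all } f \in \mathcal{F},$$

where $\ell_f$ denotes the loss of hypothesis $f$ and $f^*$ minimizes risk over the model $\mathcal{F}$. The squared term controls deviations on both sides of the mean, which is the two-sidedness contrasted with the one-sided central condition above.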
Robust classification via MOM minimization
We present an extension of Vapnik's classical empirical risk minimizer (ERM)
where the empirical risk is replaced by a median-of-means (MOM) estimator; the resulting estimators are called MOM minimizers. While ERM is sensitive to corruption of the dataset for many classical loss functions used in classification, we show that MOM minimizers behave well in theory, in the sense that they achieve Vapnik's (slow) rates of convergence under weak assumptions: the data are only required to have a finite second moment, and some outliers may have corrupted the dataset.
We propose an algorithm inspired by MOM minimizers. It can be analyzed using arguments quite similar to those used for stochastic block gradient descent. As a proof of concept, we show how to modify a proof of consistency for a descent algorithm to prove consistency of its MOM version. As MOM algorithms perform a smart subsampling, our procedure can also substantially reduce computation time and memory requirements when applied to nonlinear algorithms. These empirical performances are illustrated on both simulated and real datasets.
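The median-of-means device at the core of these estimators is simple to state: split the sample into blocks, average within each block, and return the median of the block averages. A minimal numpy sketch follows; the block count and the data are illustrative, and in a MOM minimizer this estimator replaces the empirical mean of the losses of each candidate classifier.

    import numpy as np

    def median_of_means(values, n_blocks=11):
        """Median-of-means: partition the sample into blocks, average
        each block, return the median of the block means. A minority
        of corrupted blocks cannot move the median far, unlike the
        plain empirical mean."""
        values = np.asarray(values, dtype=float)
        perm = np.random.default_rng(0).permutation(len(values))
        blocks = np.array_split(values[perm], n_blocks)
        return np.median([block.mean() for block in blocks])

    # Heavy-tailed sample (finite second moment) with gross outliers.
    rng = np.random.default_rng(42)
    x = rng.standard_t(df=2.5, size=1000)
    x[:5] = 1e6                               # corrupted observations
    print(np.mean(x), median_of_means(x))     # MOM stays near 0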
Classification with the nearest neighbor rule in general finite dimensional spaces: necessary and sufficient conditions
Given an $n$-sample $(X_i, Y_i)_{1 \le i \le n}$ of random vectors whose joint law is unknown, the long-standing problem of supervised classification aims to \textit{optimally} predict the label $Y$ of a given new observation $X$. In this context, the nearest neighbor rule is a popular, flexible and intuitive method in non-parametric situations.
Even if this algorithm is commonly used in the machine learning and
statistics communities, less is known about its prediction ability in general
finite dimensional spaces, especially when the support of the density of the observations is $\mathbb{R}^d$. This paper is devoted to the study of the
statistical properties of the nearest neighbor rule in various situations. In
particular, attention is paid to the marginal law of $X$, as well as the smoothness and margin properties of the \textit{regression function} $\eta(x) = \mathbb{E}[Y \mid X = x]$. We identify two necessary and sufficient conditions to obtain uniform consistency rates of classification and to derive sharp estimates in the case of the nearest neighbor rule. Some numerical experiments are proposed at the end of the paper to help illustrate the discussion.
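For concreteness, here is a minimal sketch of the rule under study, assuming labels in {0, 1} and the Euclidean distance; the paper's contribution concerns when and how fast this rule is uniformly consistent, notably when the support of $X$ is all of $\mathbb{R}^d$.

    import numpy as np

    def nearest_neighbor_classify(X_train, y_train, x, k=5):
        """k-nearest neighbor rule: majority vote among the labels of
        the k training points closest to the query point x."""
        dists = np.linalg.norm(X_train - x, axis=1)
        nearest = np.argsort(dists)[:k]
        return int(y_train[nearest].mean() >= 0.5)  # majority vote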