Sharp generalization error bounds for randomly-projected classifiers
We derive sharp bounds on the generalization error of a generic linear classifier trained by empirical risk minimization on randomly projected data. We make no restrictive assumptions (such as sparsity or separability) on the data; instead we use the fact that, in a classification setting, the question of interest is really "what is the effect of random projection on the predicted class labels?", and we therefore derive the exact probability of "label flipping" under Gaussian random projection in order to quantify this effect precisely in our bounds.
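The label-flipping event at the centre of this analysis is easy to probe empirically. The sketch below is my own Monte Carlo illustration, not the paper's derivation: it estimates how often a fresh Gaussian random projection flips the predicted label of a fixed linear classifier on a fixed input. All dimensions and the classifier itself are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n_trials = 100, 10, 5000  # ambient dim, projected dim, Monte Carlo trials

# A fixed linear classifier w and a fixed input x (x is correlated with w so
# the prediction has a clear margin; both are arbitrary illustrative choices).
w = rng.standard_normal(d)
x = w + 0.3 * rng.standard_normal(d)
label = np.sign(w @ x)

# Count how often projecting both w and x with a fresh Gaussian matrix R
# changes the sign of the inner product, i.e. flips the predicted label.
flips = 0
for _ in range(n_trials):
    R = rng.standard_normal((k, d)) / np.sqrt(k)
    if np.sign((R @ w) @ (R @ x)) != label:
        flips += 1
print(f"estimated label-flip probability: {flips / n_trials:.4f}")
```

Because this x is strongly correlated with w, the margin is large and flips are rare; for near-orthogonal pairs the flip probability approaches 1/2, which is the regime the paper's exact formula captures.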
Generalization Error in Deep Learning
Deep learning models have lately shown great performance in various fields
such as computer vision, speech recognition, speech translation, and natural
language processing. However, alongside their state-of-the-art performance, the
source of their generalization ability remains generally unclear.
Thus, an important question is what makes deep neural networks able to
generalize well from the training set to new data. In this article, we provide
an overview of the existing theory and bounds for the characterization of the
generalization error of deep neural networks, combining both classical and more
recent theoretical and empirical results.
COMET: A Recipe for Learning and Using Large Ensembles on Massive Data
COMET is a single-pass MapReduce algorithm for learning on large-scale data.
It builds multiple random forest ensembles on distributed blocks of data and
merges them into a mega-ensemble. This approach is appropriate when learning
from massive-scale data that is too large to fit on a single machine. To get
the best accuracy, IVoting should be used instead of bagging to generate the
training subset for each decision tree in the random forest. Experiments with
two large datasets (5GB and 50GB compressed) show that COMET compares favorably
(in both accuracy and training time) to learning on a subsample of data using a
serial algorithm. Finally, we propose a new Gaussian approach for lazy ensemble
evaluation which dynamically decides how many ensemble members to evaluate per
data point; this can reduce evaluation cost by 100X or more.
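The lazy-evaluation idea can be sketched in a few lines: evaluate ensemble members one at a time and stop once a Gaussian confidence interval around the running mean vote no longer straddles the decision threshold. This is my own simplified sketch of that idea on a hypothetical ensemble, not COMET's exact stopping rule.

```python
import numpy as np

def lazy_vote(members, x, z=2.58, min_votes=10):
    """Evaluate ensemble members one at a time and stop early once a Gaussian
    confidence interval around the running mean vote no longer straddles the
    0.5 decision threshold (binary votes in {0, 1}). The 1e-12 floor guards
    the degenerate zero-variance case."""
    votes = []
    for member in members:
        votes.append(member(x))
        n = len(votes)
        if n >= min_votes:
            mean = sum(votes) / n
            half = z * np.sqrt(max(mean * (1 - mean), 1e-12) / n)
            if mean - half > 0.5 or mean + half < 0.5:
                break  # confident enough: skip the remaining members
    mean = sum(votes) / len(votes)
    return int(mean > 0.5), len(votes)

# Hypothetical ensemble: 500 fixed voters, about 80% of which predict class 1.
rng = np.random.default_rng(1)
members = [(lambda x, b=int(rng.random() < 0.8): b) for _ in range(500)]
pred, used = lazy_vote(members, x=None)
print(f"prediction {pred} after evaluating {used} of {len(members)} members")
```

With an 80/20 vote split, the interval pulls clear of 0.5 after a few dozen members, so most of the ensemble is never evaluated for this point.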
Investigating Randomised Sphere Covers in Supervised Learning
In this thesis, we thoroughly investigate a simple Instance Based Learning (IBL) classifier known as Sphere Cover. We propose a simple Randomized Sphere Cover Classifier (αRSC) and use several datasets to evaluate its classification performance. In addition, we analyse the generalization error of the proposed classifier using bias/variance decomposition. A sphere cover classifier may be described within the sample compression framework, which attributes high generalization performance to data compression. We investigate the compression capacity of αRSC using a sample compression bound. The compression scheme prompted us to search for new compressibility methods for αRSC; to this end, we used a Gaussian kernel to investigate further data compression.
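A randomised sphere cover can be sketched greedily: visit training points in random order, centre a sphere on each uncovered point, and keep the sphere only if every training point inside it shares the centre's class. The following is a simplified sketch of that idea on toy data; the thesis's αRSC construction differs in detail. Storing only the accepted centres (far fewer than the training points) is the compression that a sample compression bound exploits.

```python
import numpy as np

def fit_sphere_cover(X, y, radius, rng):
    """Greedy randomised sphere cover (simplified sketch): keep a sphere
    centred on an uncovered point only if it is class-pure."""
    centres, labels = [], []
    covered = np.zeros(len(X), dtype=bool)
    for i in rng.permutation(len(X)):
        if covered[i]:
            continue
        inside = np.linalg.norm(X - X[i], axis=1) <= radius
        if np.all(y[inside] == y[i]):  # pure sphere: accept it and mark covered
            centres.append(X[i])
            labels.append(y[i])
            covered |= inside
    return np.array(centres), np.array(labels)

def predict(centres, labels, X):
    # Classify each point by the class of its nearest sphere centre.
    d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    return labels[d.argmin(axis=1)]

# Two well-separated Gaussian blobs as toy data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
centres, labels = fit_sphere_cover(X, y, radius=1.0, rng=rng)
acc = (predict(centres, labels, X) == y).mean()
print(len(centres), "spheres; train accuracy:", acc)
```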
A PAC-Bayesian bound for Lifelong Learning
Transfer learning has received a lot of attention in the machine learning
community over the last years, and several effective algorithms have been
developed. However, relatively little is known about their theoretical
properties, especially in the setting of lifelong learning, where the goal is
to transfer information to tasks for which no data have been observed so far.
In this work we study lifelong learning from a theoretical perspective. Our
main result is a PAC-Bayesian generalization bound that offers a unified view
on existing paradigms for transfer learning, such as the transfer of parameters
or the transfer of low-dimensional representations. We also use the bound to
derive two principled lifelong learning algorithms, and we show that these
yield results comparable with existing methods.
Comment: to appear at ICML 201
Efficient Learning with Partially Observed Attributes
We describe and analyze efficient algorithms for learning a linear predictor
from examples when the learner can only view a few attributes of each training
example. This is the case, for instance, in medical research, where each
patient participating in the experiment is only willing to go through a small
number of tests. Our analysis bounds the number of additional examples
sufficient to compensate for the lack of full information on each training
example. We demonstrate the efficiency of our algorithms by showing that when
running on digit recognition data, they obtain a high prediction accuracy even
when the learner gets to see only four pixels of each image.
Comment: This is a full version of the paper appearing in The 27th International Conference on Machine Learning (ICML 2010)
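The mechanism behind this kind of attribute-efficient learning can be sketched: sample a few attributes of each example, rescale them so the sparse vector is an unbiased estimate of the full example, and feed the result to stochastic gradient descent. The code below is my own simplification on synthetic linear-regression data, not the paper's algorithm; it draws two independent attribute samples per example so that the squared-loss gradient estimate stays unbiased.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, T, lr = 10, 4, 20000, 0.002  # dim, attributes seen, rounds, step size

w_true = rng.standard_normal(d)  # unknown target predictor (synthetic)
w = np.zeros(d)

for _ in range(T):
    x = rng.standard_normal(d)
    y = w_true @ x
    # The learner sees only k attributes per sample, chosen uniformly at
    # random; rescaling by d/k makes each sparse vector an unbiased estimate
    # of the full example x. Two independent samples are drawn so the
    # product below is an unbiased gradient of the squared loss.
    x1 = np.zeros(d)
    idx1 = rng.choice(d, size=k, replace=False)
    x1[idx1] = x[idx1] * (d / k)
    x2 = np.zeros(d)
    idx2 = rng.choice(d, size=k, replace=False)
    x2[idx2] = x[idx2] * (d / k)
    w -= lr * (w @ x1 - y) * x2  # SGD step from partial observations only

print("distance to target:", np.linalg.norm(w - w_true))
```

The sampling noise inflates the gradient variance by roughly a d/k factor, which is why more examples are needed than in the fully observed setting, matching the trade-off the analysis above bounds.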
PAC-Bayesian Bounds on Rate-Efficient Classifiers
We derive analytic bounds on the noise invariance of majority vote classifiers operating on compressed inputs. Specifically, starting from recent bounds on the true risk of majority vote classifiers, we extend the applicability of PAC-Bayesian theory to quantify the resilience of majority votes to input noise stemming from compression. The derived bounds are intuitive in binary classification settings, where they can be measured as expressions of voter differentials and voter pair agreement. By combining measures of input distortion with analytic guarantees on noise invariance, we prescribe rate-efficient machines to compress inputs without affecting subsequent classification. Our validation shows how bounding noise invariance can inform the compression stage for any majority vote classifier such that worst-case implications of bad input reconstructions are known, and inputs can be compressed to the minimum amount of information needed prior to inference.
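The voter-agreement quantities this line of work builds on are the moments of the majority vote's margin. As a toy illustration (independent voters of my own construction, not the paper's setting), the first two empirical margin moments give the classic PAC-Bayesian C-bound on majority-vote risk, which follows from the Cantelli inequality:

```python
import numpy as np

rng = np.random.default_rng(0)
n_voters, n_points = 25, 1000

# Toy setting: each voter is independently correct with probability 0.6.
y = rng.choice([-1, 1], size=n_points)
correct = rng.random((n_voters, n_points)) < 0.6
H = np.where(correct, y, -y)            # voters' +/-1 predictions

margin = (H * y).mean(axis=0)           # per-point margin of the majority vote
m1 = margin.mean()                      # first moment  E[M]
m2 = (margin ** 2).mean()               # second moment E[M^2]
c_bound = 1.0 - m1 ** 2 / m2            # C-bound: R(MV) <= 1 - (E M)^2 / E[M^2]
mv_risk = (margin <= 0).mean()          # empirical majority-vote risk

print(f"majority-vote risk {mv_risk:.3f} <= C-bound {c_bound:.3f}")
```

Input compression perturbs the individual votes and hence these margin moments; bounding how much they can move is one way to read the noise-invariance guarantees described above.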
Convergence of Online Mirror Descent
In this paper we consider online mirror descent (OMD) algorithms, a class of
scalable online learning algorithms exploiting data geometric structures
through mirror maps. Necessary and sufficient conditions on the step size
sequence are presented for the convergence of an OMD algorithm with respect to
the expected Bregman distance induced by the mirror map, covering both the case
of positive variances and the case of zero variances; in the latter case linear
convergence may be achieved by taking a constant step size sequence. A
sufficient condition for almost sure convergence is also given.
We establish tight error bounds under mild conditions on the mirror map, the
loss function, and the regularizer. Our results are achieved by a novel
analysis of the one-step progress of the OMD algorithm using the smoothness and
strong convexity of the mirror map and the loss function.
Comment: Published in Applied and Computational Harmonic Analysis, 202