Learning Stochastic Majority Votes by Minimizing a PAC-Bayes Generalization Bound
We investigate a stochastic counterpart of majority votes over finite ensembles of classifiers, and study its generalization properties. While our approach holds for arbitrary distributions, we instantiate it with Dirichlet distributions: this allows for a closed-form and differentiable expression for the expected risk, which then turns the generalization bound into a tractable training objective. In a series of numerical experiments, the resulting stochastic majority vote learning algorithm achieves state-of-the-art accuracy and benefits from tight (non-vacuous) generalization bounds when compared to competing algorithms that also minimize PAC-Bayes objectives, with both uninformed (data-independent) and informed (data-dependent) priors.
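As a rough illustration of the object studied in this abstract, the sketch below estimates the expected 0-1 risk of a stochastic majority vote whose weights are drawn from a Dirichlet distribution. It uses plain Monte Carlo sampling, which is an assumption made here for simplicity and is not the closed-form expression the abstract refers to. The function names and the toy vote matrix are hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one weight vector from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def stochastic_mv_risk(alpha, votes, labels, n_samples=2000, seed=0):
    """Monte Carlo estimate of the expected 0-1 risk of a stochastic majority vote.

    votes[i][j] is the prediction (+1 or -1) of base classifier j on example i;
    labels[i] is the true label (+1 or -1). Hypothetical helper, not from the paper.
    """
    rng = random.Random(seed)
    errors = 0
    for _ in range(n_samples):
        weights = sample_dirichlet(alpha, rng)
        for example_votes, y in zip(votes, labels):
            # Weighted vote margin; a non-positive margin counts as an error.
            margin = y * sum(w * v for w, v in zip(weights, example_votes))
            errors += margin <= 0
    return errors / (n_samples * len(labels))

# Toy data: classifiers 0 and 1 are always right, classifier 2 is always wrong.
votes = [[1, 1, -1], [-1, -1, 1]]
labels = [1, -1]
risk_good = stochastic_mv_risk([50.0, 50.0, 1.0], votes, labels)
risk_bad = stochastic_mv_risk([1.0, 1.0, 50.0], votes, labels)
```

Concentrating the Dirichlet parameters on the accurate voters lowers the estimated risk, which is the quantity a bound-minimizing learner would trade off against the divergence from the prior.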
Minimax risk classifiers with 0-1 loss
Supervised classification techniques use training samples to learn a
classification rule with small expected 0-1 loss (error probability).
Conventional methods enable tractable learning and provide out-of-sample
generalization by using surrogate losses instead of the 0-1 loss and
considering specific families of rules (hypothesis classes). This paper
presents minimax risk classifiers (MRCs) that minimize the worst-case 0-1 loss
over general classification rules and provide tight performance guarantees at
learning. We show that MRCs are strongly universally consistent using feature
mappings given by characteristic kernels. The paper also proposes efficient
optimization techniques for MRC learning and shows that the methods presented
can provide accurate classification together with tight performance guarantees
in practice.
Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
PAC-Bayes bounds are among the most accurate generalization bounds for
classifiers learned from independently and identically distributed (IID) data,
particularly for margin classifiers: there have been recent
contributions showing how practical these bounds can be either to perform model
selection (Ambroladze et al., 2007) or even to directly guide the learning of
linear classifiers (Germain et al., 2009). However, there are many practical
situations where the training data show some dependencies and where the
traditional IID assumption does not hold. Stating generalization bounds for
such frameworks is therefore of the utmost interest, both from theoretical and
practical standpoints. In this work, we propose what are, to the best of our
knowledge, the first PAC-Bayes generalization bounds for classifiers trained on
data exhibiting interdependencies. Our results are established by decomposing
a so-called dependency graph, which encodes the dependencies within the data,
into sets of independent data by means of fractional graph covers. Our bounds
are very general: an upper bound on the fractional chromatic number of the
dependency graph suffices to obtain new PAC-Bayes bounds for specific
settings. We show how our results can be used to derive bounds for ranking
statistics (such as AUC) and classifiers trained on data distributed according
to a stationary β-mixing process. Along the way, we show how our approach
seamlessly allows us to deal with U-processes. As a side note, we also provide
a PAC-Bayes generalization bound
for classifiers learned on data from stationary φ-mixing distributions. Comment: Long version of the AISTATS 09 paper:
http://jmlr.csail.mit.edu/proceedings/papers/v5/ralaivola09a/ralaivola09a.pd
Generalization Error in Deep Learning
Deep learning models have lately shown great performance in various fields
such as computer vision, speech recognition, speech translation, and natural
language processing. However, alongside their state-of-the-art performance, the
source of their generalization ability remains generally unclear.
Thus, an important question is what makes deep neural networks able to
generalize well from the training set to new data. In this article, we provide
an overview of the existing theory and bounds for the characterization of the
generalization error of deep neural networks, combining both classical and more
recent theoretical and empirical results.
PAC-Bayesian Computation
Risk bounds, which are also called generalisation bounds in the statistical learning literature, are important objects of study because they give some information on the expected error that a predictor may incur on randomly chosen data points. In classical statistical learning, the analyses focus on individual hypotheses, and the aim is to derive risk bounds that are valid for the data-dependent hypothesis output by some learning method. Often, however, such risk bounds are valid uniformly over a hypothesis class, which is a consequence of the methods used to derive them, namely the theory of uniform convergence of empirical processes. This is a source of looseness of these classical kinds of bounds, which has led to debates and criticisms and motivated the search for alternative methods to derive tighter bounds.
The PAC-Bayes analysis focuses on distributions over hypotheses and randomised predictors defined by such distributions. Other prediction schemes can be devised based on a distribution over hypotheses; however, the randomised predictor is a typical starting point. Lifting the analysis to distributions over hypotheses, rather than individual hypotheses, makes available sharp analysis tools, which arguably account for the tightness of PAC-Bayes bounds. Two main uses of PAC-Bayes bounds are (1) risk certification, and (2) cost function derivation. The first consists of evaluating numerical risk certificates for the distributions over hypotheses learned by some method, while the second consists of turning a PAC-Bayes bound into a training objective, to learn a distribution by minimising the bound. This thesis revisits both kinds of uses of PAC-Bayes bounds. We contribute results on certifying the risk of randomised kernel and neural network classifiers, adding evidence to the success of PAC-Bayes bounds at delivering tight certificates. This thesis proposes the name “PAC-Bayesian Computation” as a generic name to encompass the class of methods that learn a distribution over hypotheses by minimising a PAC-Bayes bound (i.e. the second use case described above: cost function derivation), and reports an interesting case of PAC-Bayesian Computation leading to self-certified learning: we develop a learning and certification strategy that uses all the available data to produce a predictor together with a tight risk certificate, as demonstrated with randomised neural network classifiers on two benchmark data sets (MNIST, CIFAR-10).
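To make the second use (cost function derivation) concrete, here is a minimal sketch of a bound that can serve as a training objective. It assumes the classical McAllester-style PAC-Bayes bound for losses in [0, 1], in the form empirical risk plus sqrt((KL(Q||P) + ln(2 sqrt(n) / delta)) / (2n)); the abstract does not say which bound the thesis uses, so this choice, the function name, and the example numbers are illustrative assumptions.

```python
import math

def mcallester_bound(emp_risk, kl_div, n, delta=0.05):
    """Upper bound on the expected risk of a randomised predictor.

    emp_risk: empirical risk of the posterior Q on n IID examples (loss in [0, 1]).
    kl_div:   KL(Q || P) between posterior Q and a data-independent prior P.
    delta:    the bound holds with probability at least 1 - delta.
    Assumed form: classical McAllester-style bound (hypothetical helper).
    """
    complexity = (kl_div + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n)
    return emp_risk + math.sqrt(complexity)

# Illustrative numbers only: a posterior with 5% empirical error.
bound_small_kl = mcallester_bound(emp_risk=0.05, kl_div=100.0, n=60000)
bound_large_kl = mcallester_bound(emp_risk=0.05, kl_div=5000.0, n=60000)
```

Minimising such an expression over the posterior trades empirical fit against divergence from the prior: a posterior that drifts far from the prior pays a larger complexity term.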
PAC-Bayesian Analysis of the Exploration-Exploitation Trade-off
We develop a coherent framework for integrative simultaneous analysis of the
exploration-exploitation and model order selection trade-offs. We improve over
our preceding results on the same subject (Seldin et al., 2011) by combining
PAC-Bayesian analysis with a Bernstein-type inequality for martingales. Such a
combination is also of independent interest for studies of multiple
simultaneously evolving martingales. Comment: On-line Trading of Exploration and Exploitation 2 - ICML-2011
workshop. http://explo.cs.ucl.ac.uk/workshop
PAC-Bayesian Learning of Optimization Algorithms
We apply the PAC-Bayes theory to the setting of learning-to-optimize. To the
best of our knowledge, we present the first framework to learn optimization
algorithms with provable generalization guarantees (PAC-bounds) and an explicit
trade-off between a high probability of convergence and a high convergence
speed. Even in the limit case, where convergence is guaranteed, our learned
optimization algorithms provably outperform related algorithms based on a
(deterministic) worst-case analysis. Our results rely on PAC-Bayes bounds for
general, unbounded loss functions based on exponential families. By
generalizing existing ideas, we reformulate the learning procedure into a
one-dimensional minimization problem and study the possibility of finding a global
minimum, which enables the algorithmic realization of the learning procedure.
As a proof-of-concept, we learn hyperparameters of standard optimization
algorithms to empirically underline our theory. Comment: Accepted to AISTATS 202
Tighter risk certificates for neural networks
This paper presents an empirical study regarding training probabilistic
neural networks using training objectives derived from PAC-Bayes bounds. In the
context of probabilistic neural networks, the output of training is a
probability distribution over network weights. We present two training
objectives, used here for the first time in connection with training neural
networks. These two training objectives are derived from tight PAC-Bayes
bounds. We also re-implement a previously used training objective based on a
classical PAC-Bayes bound, to compare the properties of the predictors learned
using the different training objectives. We compute risk certificates that are
valid on any unseen examples for the learnt predictors. We further experiment
with different types of priors on the weights (both data-free and
data-dependent priors) and neural network architectures. Our experiments on
MNIST and CIFAR-10 show that our training methods produce competitive test set
errors and non-vacuous risk bounds with much tighter values than previous
results in the literature, showing promise not only to guide the learning
algorithm through bounding the risk but also for model selection. These
observations suggest that the methods studied here might be good candidates for
self-certified learning, in the sense of certifying the risk on any unseen data
without the need for data-splitting protocols. Comment: Preprint under revie
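A risk certificate of the kind described in this abstract is commonly obtained by numerically inverting the binary KL divergence in a PAC-Bayes-kl bound. The sketch below assumes the bound kl(empirical risk || true risk) <= (KL(Q||P) + ln(2 sqrt(n) / delta)) / n and inverts it by bisection. The abstract does not spell out the exact bounds or the inversion procedure, so the helper names, the tolerance, and the example numbers are assumptions.

```python
import math

def binary_kl(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p), with 0*log(0) = 0."""
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep the logarithms finite
    total = 0.0
    if q > 0.0:
        total += q * math.log(q / p)
    if q < 1.0:
        total += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return total

def kl_inverse_upper(emp_risk, budget, tol=1e-9):
    """Largest p in [emp_risk, 1] with kl(emp_risk || p) <= budget, up to tol."""
    lo, hi = emp_risk, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binary_kl(emp_risk, mid) > budget:
            hi = mid  # mid violates the constraint, so the answer lies below it
        else:
            lo = mid  # mid satisfies the constraint, so the answer lies at or above it
    return hi

def risk_certificate(emp_risk, kl_div, n, delta=0.05):
    """Risk certificate from the assumed PAC-Bayes-kl bound (hypothetical helper)."""
    budget = (kl_div + math.log(2.0 * math.sqrt(n) / delta)) / n
    return kl_inverse_upper(emp_risk, budget)

# Illustrative numbers only: 2% empirical error, KL of 5000 nats, 60000 examples.
certificate = risk_certificate(emp_risk=0.02, kl_div=5000.0, n=60000)
```

Because the binary KL is inverted exactly rather than relaxed through Pinsker's inequality, the resulting certificate is never looser than the square-root form of the same bound, and the gap is largest when the empirical risk is small.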