Tighter risk certificates for neural networks
This paper presents an empirical study regarding training probabilistic
neural networks using training objectives derived from PAC-Bayes bounds. In the
context of probabilistic neural networks, the output of training is a
probability distribution over network weights. We present two training
objectives, used here for the first time in connection with training neural
networks. These two training objectives are derived from tight PAC-Bayes
bounds. We also re-implement a previously used training objective based on a
classical PAC-Bayes bound, to compare the properties of the predictors learned
using the different training objectives. We compute risk certificates that are
valid on any unseen examples for the learnt predictors. We further experiment
with different types of priors on the weights (both data-free and
data-dependent priors) and neural network architectures. Our experiments on
MNIST and CIFAR-10 show that our training methods produce competitive test set
errors and non-vacuous risk bounds with much tighter values than previous
results in the literature, showing promise not only to guide the learning
algorithm through bounding the risk but also for model selection. These
observations suggest that the methods studied here might be good candidates for
self-certified learning, in the sense of certifying the risk on any unseen data
without the need for data-splitting protocols. Comment: Preprint under review
Tighter risk certificates for neural networks
This paper presents an empirical study regarding training probabilistic neural networks using training objectives derived from PAC-Bayes bounds. In the context of probabilistic neural networks, the output of training is a probability distribution over network weights. We present two training objectives, used here for the first time in connection with training neural networks. These two training objectives are derived from tight PAC-Bayes bounds. We also re-implement a previously used training objective based on a classical PAC-Bayes bound, to compare the properties of the predictors learned using the different training objectives. We compute risk certificates for the learnt predictors, based on part of the data used to learn the predictors. We further experiment with different types of priors on the weights (both data-free and data-dependent priors) and neural network architectures. Our experiments on MNIST and CIFAR-10 show that our training methods produce competitive test set errors and non-vacuous risk bounds with much tighter values than previous results in the literature, showing promise not only to guide the learning algorithm through bounding the risk but also for model selection. These observations suggest that the methods studied here might be good candidates for self-certified learning, in the sense of using the whole data set for learning a predictor and certifying its risk on any unseen data (from the same distribution as the training data) potentially without the need for holding out test data.
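For orientation, the kind of bound that such training objectives start from can be written down explicitly. The statement below is the standard PAC-Bayes-kl bound from the general literature, not a formula quoted from this abstract; the symbols $P$ (prior), $Q$ (posterior), $n$ (sample size), $\delta$ (confidence parameter), $\hat{R}_S$ (empirical risk) and $R$ (population risk) are notation introduced here. It assumes a loss with values in $[0,1]$, an i.i.d. sample $S$ of size $n \ge 8$, and a prior chosen independently of $S$. With probability at least $1-\delta$ over $S$, simultaneously for all $Q$:

```latex
\mathrm{kl}\!\left(\hat{R}_S(Q)\,\middle\|\,R(Q)\right)
  \;\le\; \frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac{2\sqrt{n}}{\delta}}{n},
\qquad
\mathrm{kl}(q\,\|\,p) = q\ln\frac{q}{p} + (1-q)\ln\frac{1-q}{1-p}.
```

Applying Pinsker's inequality, $\mathrm{kl}(q\,\|\,p) \ge 2(p-q)^2$, gives the looser but explicit square-root form that is often used as a training objective:

```latex
R(Q) \;\le\; \hat{R}_S(Q) + \sqrt{\frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}.
```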
PAC-Bayesian Computation
Risk bounds, which are also called generalisation bounds in the statistical learning literature, are important objects of study because they give some information on the expected error that a predictor may incur on randomly chosen data points. In classical statistical learning, the analyses focus on individual hypotheses, and the aim is deriving risk bounds that are valid for the data-dependent hypothesis output by some learning method. Often, however, such risk bounds are valid uniformly over a hypothesis class, which is a consequence of the methods used to derive them, namely the theory of uniform convergence of empirical processes. This is a source of looseness of these classical kinds of bounds which has led to debates and criticisms, and motivated the search for alternative methods to derive tighter bounds.
The PAC-Bayes analysis focuses on distributions over hypotheses and randomised predictors defined by such distributions. Other prediction schemes can be devised based on a distribution over hypotheses; however, the randomised predictor is a typical starting point. Lifting the analysis to distributions over hypotheses, rather than individual hypotheses, makes available sharp analysis tools, which arguably account for the tightness of PAC-Bayes bounds. Two main uses of PAC-Bayes bounds are (1) risk certification, and (2) cost function derivation. The first consists of evaluating numerical risk certificates for the distributions over hypotheses learned by some method, while the second consists of turning a PAC-Bayes bound into a training objective, to learn a distribution by minimising the bound. This thesis revisits both kinds of uses of PAC-Bayes bounds. We contribute results on certifying the risk of randomised kernel and neural network classifiers, adding evidence to the success of PAC-Bayes bounds at delivering tight certificates. This thesis proposes the name “PAC-Bayesian Computation” as a generic name to encompass the class of methods that learn a distribution over hypotheses by minimising a PAC-Bayes bound (i.e. the second use case described above: cost function derivation), and reports an interesting case of PAC-Bayesian Computation leading to self-certified learning: we develop a learning and certification strategy that uses all the available data to produce a predictor together with a tight risk certificate, as demonstrated with randomised neural network classifiers on two benchmark data sets (MNIST, CIFAR-10).
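The first use case, risk certification, can be illustrated with a minimal sketch. It assumes the standard PAC-Bayes-kl bound with complexity term (KL(Q||P) + ln(2√n/δ))/n and obtains the certificate by numerically inverting the binary KL divergence; this is a common approach in the literature, not necessarily the exact procedure of the thesis. The function names and the example numbers are hypothetical.

```python
import math


def binary_kl(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)  # keep the logarithms finite
    total = 0.0
    if q > 0:
        total += q * math.log(q / p)
    if q < 1:
        total += (1 - q) * math.log((1 - q) / (1 - p))
    return total


def kl_inverse_upper(q, c, tol=1e-10):
    """Upper end of {p in [q, 1] : binary_kl(q, p) <= c}, found by bisection."""
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binary_kl(q, mid) <= c:
            lo = mid
        else:
            hi = mid
    return hi  # return the conservative end of the bracket


def risk_certificate(emp_risk, kl_qp, n, delta):
    """Certificate on the population risk of the randomised predictor.

    Assumes emp_risk is the exact empirical risk of Q under a [0, 1] loss
    and that the prior does not depend on the n examples used here.
    """
    c = (kl_qp + math.log(2 * math.sqrt(n) / delta)) / n
    return kl_inverse_upper(emp_risk, c)


# Hypothetical numbers: 2% empirical error, KL of 5000 nats, 60000 examples.
print(risk_certificate(0.02, 5000.0, 60000, 0.05))
```

In practice the empirical risk of the randomised predictor is itself usually estimated by sampling weights, which calls for an additional confidence term that this sketch leaves out.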
Non-Vacuous Generalization Bounds at the ImageNet Scale: A PAC-Bayesian Compression Approach
Modern neural networks are highly overparameterized, with capacity to
substantially overfit to training data. Nevertheless, these networks often
generalize well in practice. It has also been observed that trained networks
can often be "compressed" to much smaller representations. The purpose of this
paper is to connect these two empirical observations. Our main technical result
is a generalization bound for compressed networks based on the compressed size.
Combined with off-the-shelf compression algorithms, the bound leads to state of
the art generalization guarantees; in particular, we provide the first
non-vacuous generalization guarantees for realistic architectures applied to
the ImageNet classification problem. As additional evidence connecting
compression and generalization, we show that compressibility of models that
tend to overfit is limited: We establish an absolute limit on expected
compressibility as a function of expected generalization error, where the
expectations are over the random choice of training examples. The bounds are
complemented by empirical results that show an increase in overfitting implies
an increase in the number of bits required to describe a trained network. Comment: 16 pages, 1 figure. Accepted at ICLR 201
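The link between compressed size and generalisation can be illustrated with a classical Occam-style bound, which is simpler than the PAC-Bayesian bound the paper derives and is shown here only as a sketch. It assumes a 0-1 loss, n i.i.d. training examples, and a prefix-free code that describes the network in `compressed_bits` bits; with probability at least 1 - δ, the population error is then at most the training error plus sqrt((bits · ln 2 + ln(1/δ)) / (2n)). The function name and the example numbers are hypothetical.

```python
import math


def occam_bound(emp_error, compressed_bits, n, delta=0.05):
    """Occam-style upper bound on the population error of a compressed model.

    Follows from Hoeffding's inequality and a union bound weighted by
    2 ** (-compressed_bits), which is a valid prior for a prefix-free code.
    """
    complexity = compressed_bits * math.log(2) + math.log(1 / delta)
    return emp_error + math.sqrt(complexity / (2 * n))


# Hypothetical example: the same training error with two description lengths.
n = 1_280_000
print(occam_bound(0.30, 8 * 100_000, n))      # network stored in 100 kB
print(occam_bound(0.30, 8 * 10_000_000, n))   # network stored in 10 MB
```

A bound above 1 is vacuous for a 0-1 loss, which is why the description length has to be small relative to the number of training examples for a guarantee of this type to say anything.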
PAC-tuning: Fine-tuning Pretrained Language Models with PAC-driven Perturbed Gradient Descent
Fine-tuning pretrained language models (PLMs) for downstream tasks is a
large-scale optimization problem, in which the choice of the training algorithm
critically determines how well the trained model can generalize to unseen test
data, especially in the context of few-shot learning. To achieve good
generalization performance and avoid overfitting, techniques such as data
augmentation and pruning are often applied. However, adding these
regularizations necessitates heavy tuning of the hyperparameters of
optimization algorithms, such as the popular Adam optimizer. In this paper, we
propose a two-stage fine-tuning method, PAC-tuning, to address this
optimization challenge. First, based on PAC-Bayes training, PAC-tuning directly
minimizes the PAC-Bayes generalization bound to learn proper parameter
distribution. Second, PAC-tuning modifies the gradient by injecting noise with
the variance learned in the first stage into the model parameters during
training, resulting in a variant of perturbed gradient descent (PGD). In the
past, the few-shot scenario posed difficulties for PAC-Bayes training because
the PAC-Bayes bound, when applied to large models with limited training data,
might not be stringent. Our experimental results across 5 GLUE benchmark tasks
demonstrate that PAC-tuning successfully handles the challenges of fine-tuning
tasks and outperforms strong baseline methods by a visible margin, further
confirming the potential to apply PAC training for any other settings where the
Adam optimizer is currently used for training. Comment: Accepted to EMNLP23 mai
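The second stage described in this abstract, gradient descent with noise injected into the parameters, can be sketched on a toy problem. This is an illustration of perturbed gradient descent in general, not the PAC-tuning implementation: the quadratic loss, the fixed per-parameter standard deviations in `sigma` (standing in for the values learned in the first stage), and all names are assumptions made for the example.

```python
import random


def grad(w, target):
    """Gradient of the toy loss 0.5 * sum((w_i - target_i) ** 2)."""
    return [wi - ti for wi, ti in zip(w, target)]


def perturbed_gd(w0, target, sigma, lr=0.1, steps=200, seed=0):
    """Gradient descent where each gradient is taken at noise-perturbed weights."""
    rng = random.Random(seed)
    w = list(w0)
    for _ in range(steps):
        # Inject Gaussian noise with per-parameter standard deviation sigma_i.
        noisy = [wi + rng.gauss(0.0, si) for wi, si in zip(w, sigma)]
        # Evaluate the gradient at the perturbed point, update the clean weights.
        g = grad(noisy, target)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w


target = [1.0, -2.0, 0.5]
print(perturbed_gd([0.0, 0.0, 0.0], target, sigma=[0.1, 0.1, 0.1]))
```

Setting every entry of `sigma` to zero recovers plain gradient descent, which makes the effect of the injected noise easy to isolate.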
Learning via Wasserstein-Based High Probability Generalisation Bounds
Minimising upper bounds on the population risk or the generalisation gap has
been widely used in structural risk minimisation (SRM) -- this is in particular
at the core of PAC-Bayesian learning. Despite its successes and unfailing surge
of interest in recent years, a limitation of the PAC-Bayesian framework is that
most bounds involve a Kullback-Leibler (KL) divergence term (or its
variations), which might exhibit erratic behavior and fail to capture the
underlying geometric structure of the learning problem -- hence restricting its
use in practical applications. As a remedy, recent studies have attempted to
replace the KL divergence in the PAC-Bayesian bounds with the Wasserstein
distance. Even though these bounds alleviated the aforementioned issues to a
certain extent, they either hold in expectation, are for bounded losses, or are
nontrivial to minimize in an SRM framework. In this work, we contribute to this
line of research and prove novel Wasserstein distance-based PAC-Bayesian
generalisation bounds for both batch learning with independent and identically
distributed (i.i.d.) data, and online learning with potentially non-i.i.d.
data. Contrary to previous art, our bounds are stronger in the sense that (i)
they hold with high probability, (ii) they apply to unbounded (potentially
heavy-tailed) losses, and (iii) they lead to optimizable training objectives
that can be used in SRM. As a result we derive novel Wasserstein-based
PAC-Bayesian learning algorithms and we illustrate their empirical advantage on
a variety of experiments. Comment: Accepted to NeurIPS 202