Improved Vapnik Cervonenkis bounds
We give a new proof of VC bounds where we avoid the use of symmetrization and
use a shadow sample of arbitrary size. We also improve on the variance term.
This results in better constants, as shown on numerical examples. Moreover our
bounds still hold for non identically distributed independent random variables.
Keywords: Statistical learning theory, PAC-Bayesian theorems, VC dimension
Differentiable PAC-Bayes Objectives with Partially Aggregated Neural Networks
We make three related contributions motivated by the challenge of training
stochastic neural networks, particularly in a PAC-Bayesian setting: (1) we show
how averaging over an ensemble of stochastic neural networks enables a new
class of partially-aggregated estimators; (2) we show that these lead to
provably lower-variance gradient estimates for non-differentiable signed-output
networks; (3) we reformulate a PAC-Bayesian bound for these networks to derive
a directly optimisable, differentiable objective and a generalisation
guarantee, without using a surrogate loss or loosening the bound. This bound is
twice as tight as that of Letarte et al. (2019) on a similar network type. We
show empirically that these innovations make training easier and lead to
competitive guarantees.
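The aggregation idea in contribution (1) can be illustrated in its simplest case. The sketch below assumes a single linear unit with isotropic Gaussian weights w ~ N(mu, sigma^2 I) followed by a sign activation; under that assumption the pre-activation w . x is Gaussian, so the expected output has the closed form erf(mu . x / (sqrt(2) sigma ||x||)). The function names are hypothetical, and this is not necessarily the estimator constructed in the paper; it only shows why averaging over the weights turns a non-differentiable sign output into a smooth function of mu.

```python
import math
import random


def aggregated_sign_output(mean_w, sigma, x):
    """Closed-form E[sign(w . x)] for w ~ N(mean_w, sigma^2 I) (assumed model)."""
    dot = sum(m * xi for m, xi in zip(mean_w, x))
    norm = math.sqrt(sum(xi * xi for xi in x))
    if norm == 0.0:
        # w . x is identically zero; use the convention sign(0) = 0.
        return 0.0
    return math.erf(dot / (math.sqrt(2.0) * sigma * norm))


def monte_carlo_sign_output(mean_w, sigma, x, n_samples, rng):
    """Sampling estimate of the same expectation, for comparison."""
    total = 0
    for _ in range(n_samples):
        w = [rng.gauss(m, sigma) for m in mean_w]
        s = sum(wi * xi for wi, xi in zip(w, x))
        total += (s > 0) - (s < 0)  # sign of s, with sign(0) = 0
    return total / n_samples


if __name__ == "__main__":
    rng = random.Random(0)
    mu, sigma, x = [1.0, -0.5], 1.0, [0.8, 0.3]
    print("closed form:", aggregated_sign_output(mu, sigma, x))
    print("monte carlo:", monte_carlo_sign_output(mu, sigma, x, 20000, rng))
```

The closed form is deterministic and differentiable in mu, whereas the sampling estimate is noisy and piecewise constant in each sampled w.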
Differentiable PAC–Bayes Objectives with Partially Aggregated Neural Networks
We make two related contributions motivated by the challenge of training stochastic neural networks, particularly in a PAC–Bayesian setting: (1) we show how averaging over an ensemble of stochastic neural networks enables a new class of partially-aggregated estimators, proving that these lead to unbiased lower-variance output and gradient estimators; (2) we reformulate a PAC–Bayesian bound for signed-output networks to derive, in combination with the above, a directly optimisable, differentiable objective and a generalisation guarantee, without using a surrogate loss or loosening the bound. We show empirically that this leads to competitive generalisation guarantees and compares favourably to other methods for training such networks. Finally, we note that the above leads to a simpler PAC–Bayesian training scheme for sign-activation networks than previous work.
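To make the notion of a generalisation guarantee concrete, here is a minimal sketch of how a PAC-Bayesian certificate is typically evaluated numerically. It assumes the widely used "PAC-Bayes-kl" form, kl(empirical risk || true risk) <= (KL(posterior || prior) + ln(2 sqrt(n) / delta)) / n, usually stated for sample size n >= 8, and inverts the binary kl by bisection. This is an assumption made for illustration; the abstract does not say which bound the paper reformulates, and all function names are hypothetical.

```python
import math


def binary_kl(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
    total = 0.0
    if q > 0.0:
        total += q * math.log(q / p)
    if q < 1.0:
        total += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return total


def kl_upper_inverse(q, budget, tol=1e-10):
    """Upper end of the set {p >= q : binary_kl(q, p) <= budget}, by bisection."""
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binary_kl(q, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return hi  # return the conservative end of the bracket


def pac_bayes_kl_bound(emp_risk, kl_posterior_prior, n, delta):
    """Risk certificate under the assumed PAC-Bayes-kl form."""
    budget = (kl_posterior_prior + math.log(2.0 * math.sqrt(n) / delta)) / n
    return kl_upper_inverse(emp_risk, budget)


if __name__ == "__main__":
    print(pac_bayes_kl_bound(emp_risk=0.1, kl_posterior_prior=5.0, n=1000, delta=0.05))
```

The certificate grows with the KL term and shrinks with the sample size, which is the trade-off a directly optimisable PAC-Bayesian objective balances during training.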
PAC-Bayesian inductive and transductive learning
We present here a PAC-Bayesian point of view on adaptive supervised
classification. Using convex analysis, we show how to get local measures of the
complexity of the classification model involving the relative entropy of
posterior distributions with respect to Gibbs posterior measures. We discuss
relative bounds, comparing two classification rules, to show how the margin
assumption of Mammen and Tsybakov can be replaced with some empirical measure
of the covariance structure of the classification model. We also show how to
associate to any posterior distribution an effective temperature relating
it to the Gibbs prior distribution with the same level of expected error rate,
and how to estimate this effective temperature from data, resulting in an
estimator whose expected error rate adaptively converges according to the best
possible power of the sample size. Then we introduce a PAC-Bayesian point of
view on transductive learning and use it to improve on Vapnik's known
generalization bounds, extending them to the case when the sample is
independent but not identically distributed. Finally, we briefly review the
construction of Support Vector Machines and show how to derive generalization
bounds for them, measuring the complexity either through the number of support
vectors or through transductive or inductive margin estimates.
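The Gibbs measures mentioned in this abstract can be illustrated on a finite set of classifiers. The sketch below assumes the standard exponential-weights form, rho(theta) proportional to pi(theta) exp(-beta r(theta)), where pi is a prior with strictly positive weights, r is the empirical error rate and beta is an inverse temperature. The function names and the toy numbers are hypothetical, and the monograph's actual constructions (localized priors, effective temperature estimation) are not reproduced here.

```python
import math


def gibbs_posterior(prior, emp_errors, beta):
    """Weights proportional to prior[i] * exp(-beta * emp_errors[i])."""
    log_w = [math.log(p) - beta * r for p, r in zip(prior, emp_errors)]
    shift = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(v - shift) for v in log_w]
    z = sum(w)
    return [v / z for v in w]


def relative_entropy(rho, pi):
    """KL(rho || pi) for two distributions on the same finite set."""
    return sum(r * math.log(r / p) for r, p in zip(rho, pi) if r > 0.0)


if __name__ == "__main__":
    prior = [1.0 / 3.0] * 3
    errors = [0.30, 0.10, 0.20]  # toy empirical error rates
    for beta in (0.0, 10.0, 100.0):
        rho = gibbs_posterior(prior, errors, beta)
        print(beta, rho, relative_entropy(rho, prior))
```

At beta = 0 the posterior equals the prior; as beta grows it concentrates on the classifier with the smallest empirical error, and its relative entropy with respect to the prior increases accordingly.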
Pac-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning
This monograph deals with adaptive supervised classification, using tools
borrowed from statistical mechanics and information theory, stemming from the
PAC-Bayesian approach pioneered by David McAllester and applied to a conception
of statistical learning theory forged by Vladimir Vapnik. Using convex analysis
on the set of posterior probability measures, we show how to get local measures
of the complexity of the classification model involving the relative entropy of
posterior distributions with respect to Gibbs posterior measures. We then
discuss relative bounds, comparing the generalization error of two
classification rules, showing how the margin assumption of Mammen and Tsybakov
can be replaced with some empirical measure of the covariance structure of the
classification model. We show how to associate to any posterior distribution an
effective temperature relating it to the Gibbs prior distribution with the same
level of expected error rate, and how to estimate this effective temperature
from data, resulting in an estimator whose expected error rate converges
according to the best possible power of the sample size adaptively under any
margin and parametric complexity assumptions. We describe and study an
alternative selection scheme based on relative bounds between estimators, and
present a two step localization technique which can handle the selection of a
parametric model from a family of those. We show how to extend systematically
all the results obtained in the inductive setting to transductive learning, and
use this to improve Vapnik's generalization bounds, extending them to the case
when the sample is made of independent non-identically distributed pairs of
patterns and labels. Finally we review briefly the construction of Support
Vector Machines and show how to derive generalization bounds for them,
measuring the complexity either through the number of support vectors or
through the value of the transductive or inductive margin.
Comment: Published at http://dx.doi.org/10.1214/074921707000000391 in the IMS
Lecture Notes Monograph Series
(http://www.imstat.org/publications/lecnotes.htm) by the Institute of
Mathematical Statistics (http://www.imstat.org).
An improved predictive accuracy bound for averaging classifiers
We present an improved bound on the difference between training and test errors for voting classifiers. This improved averaging bound provides a theoretical justification for popular averaging techniques such as Bayesian classification, Maximum Entropy discrimination, Winnow and Bayes point machines and has implications for learning algorithm design.
Hash kernels and structured learning
With vast amounts of data being generated, processing massive data remains a challenge for machine learning algorithms. We propose hash kernels to facilitate efficient kernels which can deal with massive multi-class problems. We show a principled way to compute the kernel matrix for data streams and sparse feature spaces. We further generalise it via sampling to graphs. We then exploit the connection between hash kernels and compressed sensing, and apply hashing to face recognition, which yields a significant speed-up over the state of the art with competitive accuracy. We also give a recovery rate on the sparse representation and a bounded recognition rate.
As hash kernels can deal with data with structures in the input such as graphs and face images, the second part of the thesis moves on to an even more challenging task - dealing with data with structures in the output.
Recent advances in machine learning exploit the dependency among data outputs, so that dealing with complex, structured data becomes possible. We study the most popular structured learning algorithms and categorise them into two categories - probabilistic approaches and Max Margin approaches. We show the connections between different algorithms, reformulate them in the empirical risk minimisation framework, and compare their advantages and disadvantages, which helps in choosing suitable algorithms according to the characteristics of the application.
We have made practical and theoretical contributions in this thesis.
We show some real-world applications using structured learning as follows: a) We propose a novel approach for automatic paragraph segmentation, namely training Semi-Markov models discriminatively using a Max-Margin method. This method allows us to model the sequential nature of the problem and to incorporate features of a whole paragraph, such as paragraph coherence, which cannot be used in previous models. b) We jointly segment and recognise actions in video sequences with a discriminative semi-Markov model framework, which incorporates features that capture the characteristics of boundary frames, action segments and neighbouring action segments. A Viterbi-like algorithm is devised to help efficiently solve the induced optimisation problem. c) We propose a novel hybrid loss of Conditional Random Fields (CRFs) and Support Vector Machines (SVMs). We apply the hybrid loss to various applications such as Text chunking, Named Entity Recognition and Joint Image Categorisation.
We have made the following theoretical contributions: a) We study recent advances in PAC-Bayes bounds, and apply them to structured learning. b) We propose a more refined notion of Fisher consistency, namely Conditional Fisher Consistency for Classification (CFCC), that conditions on the knowledge of the true distribution of class labels. c) We show that the hybrid loss has the advantages of both CRFs and SVMs - it is consistent and has a tight PAC-Bayes bound which shrinks as the margin increases. d) We also introduce Probabilistic margins which take the label distribution into account, and we show that many existing algorithms can be viewed as special cases of the new margin concept, which may help in understanding existing algorithms as well as in designing new ones.
Finally, we discuss some future directions such as tightening PAC-Bayes bounds, adaptive hybrid losses and graphical model inference via Compressed Sensing.
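The hash kernel idea from the first part of this abstract can be sketched as follows. The example assumes the basic unsigned variant: every feature name is hashed into one of a fixed number of buckets, colliding features are summed, and the kernel is the inner product of the two hashed vectors. The choice of MD5 as the hash function, the bucket count and all function names are illustrative assumptions, not details taken from the thesis.

```python
import hashlib


def bucket(feature, num_buckets):
    """Map a feature name to a bucket index with a deterministic hash."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def hashed_features(sparse_x, num_buckets):
    """Dense hashed representation of a sparse {feature name: value} input."""
    phi = [0.0] * num_buckets
    for feature, value in sparse_x.items():
        phi[bucket(feature, num_buckets)] += value  # collisions are summed
    return phi


def hash_kernel(x, y, num_buckets):
    """Inner product of the hashed representations of x and y."""
    px = hashed_features(x, num_buckets)
    py = hashed_features(y, num_buckets)
    return sum(a * b for a, b in zip(px, py))


if __name__ == "__main__":
    doc_a = {"pac": 1.0, "bayes": 2.0, "bound": 1.0}
    doc_b = {"bayes": 1.0, "kernel": 3.0}
    print(hash_kernel(doc_a, doc_b, num_buckets=1024))
```

Memory use depends only on the number of buckets, not on the size of the vocabulary, which is what makes the approach attractive for data streams and very sparse feature spaces; the price is an approximation error whenever distinct features collide.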