PAC-Bayesian Theory Meets Bayesian Inference
We exhibit a strong link between frequentist PAC-Bayesian risk bounds and the
Bayesian marginal likelihood. That is, for the negative log-likelihood loss
function, we show that the minimization of PAC-Bayesian generalization risk
bounds maximizes the Bayesian marginal likelihood. This provides an alternative
explanation for the Bayesian Occam's razor criterion, under the assumption that
the data are generated by an i.i.d. distribution. Moreover, as the negative
log-likelihood is an unbounded loss function, we motivate and propose a
PAC-Bayesian theorem tailored for the sub-gamma loss family, and we show that
our approach is sound on classical Bayesian linear regression tasks.
Comment: Published at NIPS 2016 (http://papers.nips.cc/paper/6569-pac-bayesian-theory-meets-bayesian-inference)
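A minimal sketch of the claimed link, written in generic Alquier-style PAC-Bayes notation (the empirical risk, prior, posterior, trade-off parameter and confidence level below are standard shorthand, not symbols taken from the paper):

    % Generic PAC-Bayes bound: expected risk <= empirical risk + complexity term.
    \[
      \mathbb{E}_{h \sim \rho}\, L(h)
      \;\le\;
      \mathbb{E}_{h \sim \rho}\, \hat{L}(h)
      + \frac{1}{\lambda}\Big( \mathrm{KL}(\rho \,\|\, \pi) + \ln\tfrac{1}{\delta} + \Psi(\lambda, n) \Big).
    \]
    % The right-hand side is minimised by the Gibbs posterior
    % \rho^*(h) \propto \pi(h)\, e^{-\lambda \hat{L}(h)}, and at that optimum the first two
    % terms collapse to -\tfrac{1}{\lambda} \ln \mathbb{E}_{h \sim \pi}\, e^{-\lambda \hat{L}(h)}.
    % With the negative log-likelihood loss \hat{L}(h) = -\tfrac{1}{n} \sum_i \ln p(z_i \mid h)
    % and \lambda = n, that expectation equals \int \pi(h) \prod_i p(z_i \mid h)\, dh,
    % the Bayesian marginal likelihood; minimising the bound therefore maximises it.

In this reading, the complexity term of the bound plays the role usually attributed to the Bayesian Occam's razor.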
Random deep neural networks are biased towards simple functions
We prove that the binary classifiers of bit strings generated by random wide
deep neural networks with ReLU activation function are biased towards simple
functions. The simplicity is captured by the following two properties. For any
given input bit string, the average Hamming distance of the closest input bit
string with a different classification is at least $\sqrt{n / (2\pi \log n)}$,
where n is the length of the string. Moreover, if the bits of the initial
string are flipped randomly, the average number of flips required to change the
classification grows linearly with n. These results are confirmed by numerical
experiments on deep neural networks with two hidden layers, and settle the
conjecture stating that random deep neural networks are biased towards simple
functions. This conjecture was proposed and numerically explored in [Valle
Pérez et al., ICLR 2019] to explain the unreasonably good generalization
properties of deep learning algorithms. The probability distribution of the
functions generated by random deep neural networks is a good choice for the
prior probability distribution in the PAC-Bayesian generalization bounds. Our
results constitute a fundamental step forward in the characterization of this
distribution, therefore contributing to the understanding of the generalization
properties of deep learning algorithms.
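The second result lends itself to a small numerical check. The sketch below is my own code, not the authors' experiment: the width of 512 and the He-style Gaussian initialisation are assumptions, as the abstract does not specify them. It samples a random two-hidden-layer ReLU network on n-bit strings and counts how many random bit flips are needed, on average, to change its classification:

    # Hedged sketch, not the authors' experiment: estimate the average number of
    # random bit flips needed to change the output of a random wide ReLU classifier.
    import numpy as np

    rng = np.random.default_rng(0)

    def random_relu_net(n, width=512):
        """Random two-hidden-layer ReLU network on n-bit inputs (He-style init, an assumption)."""
        W1 = rng.normal(0.0, np.sqrt(2.0 / n), (width, n))
        W2 = rng.normal(0.0, np.sqrt(2.0 / width), (width, width))
        w3 = rng.normal(0.0, np.sqrt(1.0 / width), width)
        def classify(x):
            h = np.maximum(W1 @ x, 0.0)
            h = np.maximum(W2 @ h, 0.0)
            return int(w3 @ h > 0.0)
        return classify

    def flips_to_change(classify, x):
        """Flip distinct, uniformly random bits until the predicted class changes."""
        x = x.copy()
        y0 = classify(x)
        for k, i in enumerate(rng.permutation(len(x)), start=1):
            x[i] = 1.0 - x[i]
            if classify(x) != y0:
                return k
        return len(x)

    for n in (64, 128, 256):
        classify = random_relu_net(n)
        counts = [flips_to_change(classify, rng.integers(0, 2, n).astype(float))
                  for _ in range(50)]
        print(n, np.mean(counts))   # the abstract predicts roughly linear growth in n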
PAC-Bayes Analysis of Multi-view Learning
This paper presents eight PAC-Bayes bounds to analyze the generalization
performance of multi-view classifiers. These bounds adopt data-dependent
Gaussian priors that emphasize classifiers with high view agreement. The
center of the prior for the first two bounds is the origin, while the center of
the prior for the third and fourth bounds is given by a data-dependent vector.
A key technique for obtaining these bounds is a pair of derived logarithmic
determinant inequalities, which differ in whether the dimensionality of the
data is involved. The centers of the fifth and sixth bounds are calculated on a
separate subset of the training set. The last two bounds use unlabeled data to
represent view agreement and are thus applicable to semi-supervised multi-view
learning. We evaluate all the presented multi-view PAC-Bayes bounds on
benchmark data and compare them with previous single-view PAC-Bayes bounds. The
usefulness and performance of the multi-view bounds are discussed.
Comment: 35 pages
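To make the role of a data-dependent Gaussian prior concrete, here is an illustrative sketch (my own construction, not the paper's bounds) of the KL term such a bound would pay when the posterior is an isotropic Gaussian around the learned weights and the prior is a Gaussian centred at a data-dependent vector, for example a point on which the two views agree:

    # Illustrative sketch only, not the paper's bounds: closed-form KL term between an
    # isotropic Gaussian posterior N(w, s2*I) and a data-dependent Gaussian prior
    # N(w_prior, Sigma), as it would enter a PAC-Bayes bound.
    import numpy as np

    def kl_gaussian(w, s2, w_prior, Sigma):
        """KL( N(w, s2*I) || N(w_prior, Sigma) )."""
        d = len(w)
        Sigma_inv = np.linalg.inv(Sigma)
        diff = w - w_prior
        _, logdet_Sigma = np.linalg.slogdet(Sigma)
        return 0.5 * (s2 * np.trace(Sigma_inv)          # trace term
                      + diff @ Sigma_inv @ diff         # distance of posterior mean from prior centre
                      - d
                      + logdet_Sigma - d * np.log(s2))  # log-determinant ratio

    # Toy example: centre the prior at the average of two per-view classifiers, so
    # posteriors that agree with both views pay a smaller complexity price.
    rng = np.random.default_rng(0)
    d = 5
    w_view1, w_view2 = rng.normal(size=d), rng.normal(size=d)
    w_prior = 0.5 * (w_view1 + w_view2)
    print(kl_gaussian(rng.normal(size=d), 0.1, w_prior, np.eye(d)))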
A Primer on PAC-Bayesian Learning
PAC-Bayes unleashed: generalisation bounds with unbounded losses
We present new PAC-Bayesian generalisation bounds for learning problems with
unbounded loss functions. This extends the relevance and applicability of the
PAC-Bayes learning framework, where most of the existing literature focuses on
supervised learning problems with a bounded loss function (typically assumed to
take values in the interval [0,1]). In order to relax this assumption, we
propose a new notion called HYPE (standing for HYPothesis-dependent rangE),
which effectively allows the range of the loss to depend on each
predictor. Based on this new notion we derive a novel PAC-Bayesian
generalisation bound for unbounded loss functions, and we instantiate it on a
linear regression problem. To make our theory usable by the largest audience
possible, we include discussions on actual computation, practicality and
limitations of our assumptions.
Comment: 24 pages
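As an illustration of a hypothesis-dependent range in the spirit of HYPE (the bound K(w) below and the boundedness assumptions on the data are my own, not the paper's definitions), consider squared-loss linear regression with bounded inputs and targets, where the loss range grows with the norm of the predictor instead of being a single global constant:

    # Hedged sketch, not the paper's construction: a per-predictor loss range K(w) for
    # squared-loss linear regression with ||x|| <= x_max_norm and |y| <= y_max_abs.
    import numpy as np

    def hype_range(w, x_max_norm, y_max_abs):
        """Upper bound on sup_{(x, y)} (y - w @ x)^2 under the stated boundedness assumptions."""
        return (np.linalg.norm(w) * x_max_norm + y_max_abs) ** 2

    rng = np.random.default_rng(0)
    X = rng.uniform(-1.0, 1.0, size=(100, 3))   # every row has ||x|| <= sqrt(3)
    y = rng.uniform(-1.0, 1.0, size=100)        # |y| <= 1
    for scale in (0.1, 1.0, 10.0):
        w = scale * rng.normal(size=3)
        observed = ((y - X @ w) ** 2).max()     # empirical worst-case loss
        print(scale, observed, hype_range(w, np.sqrt(3.0), 1.0))   # K(w) grows with ||w||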