57 research outputs found
Emergence of Invariance and Disentanglement in Deep Representations
Using established principles from Statistics and Information Theory, we show
that invariance to nuisance factors in a deep neural network is equivalent to
information minimality of the learned representation, and that stacking layers
and injecting noise during training naturally bias the network towards learning
invariant representations. We then decompose the cross-entropy loss used during
training and highlight the presence of an inherent overfitting term. We propose
regularizing the loss by bounding such a term in two equivalent ways: one with
a Kullback-Leibler term, which relates to a PAC-Bayes perspective; the other
using the information in the weights as a measure of complexity of a learned
model, yielding a novel Information Bottleneck for the weights. Finally, we
show that invariance and independence of the components of the representation
learned by the network are bounded above and below by the information in the
weights, and therefore are implicitly optimized during training. The theory
enables us to quantify and predict sharp phase transitions between underfitting
and overfitting of random labels when using our regularized loss, which we
verify in experiments, and sheds light on the relation between the geometry of
the loss function, invariance properties of the learned representation, and
generalization error.
Comment: Deep learning, neural network, representation, flat minima,
information bottleneck, overfitting, generalization, sufficiency, minimality,
sensitivity, information complexity, stochastic gradient descent,
regularization, total correlation, PAC-Bayes
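To make the regularized objective concrete, here is a minimal sketch of a cross-entropy loss plus a Kullback-Leibler penalty on a Gaussian distribution over the weights, the generic PAC-Bayes-style form the abstract describes. It is not the authors' exact construction; `GaussianLinear`, `beta`, and the standard-normal prior are illustrative choices.

```python
# Sketch: cross-entropy plus a KL(q(w) || p(w)) penalty on the weights,
# a PAC-Bayes-style bound on the "overfitting term" described above.
# `GaussianLinear` and `beta` are illustrative names, not from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianLinear(nn.Module):
    """Linear layer with a factorized Gaussian posterior q(w) = N(mu, sigma^2)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.mu = nn.Parameter(torch.randn(d_out, d_in) * 0.01)
        self.log_sigma = nn.Parameter(torch.full((d_out, d_in), -3.0))

    def forward(self, x):
        sigma = self.log_sigma.exp()
        w = self.mu + sigma * torch.randn_like(sigma)  # reparameterization trick
        return F.linear(x, w)

    def kl_to_std_normal(self):
        # KL(N(mu, sigma^2) || N(0, 1)), summed over all weights
        sigma2 = (2 * self.log_sigma).exp()
        return 0.5 * (sigma2 + self.mu**2 - 1 - 2 * self.log_sigma).sum()

layer = GaussianLinear(784, 10)
opt = torch.optim.Adam(layer.parameters(), lr=1e-3)
beta = 1e-3  # trade-off coefficient between data fit and weight information

x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
loss = F.cross_entropy(layer(x), y) + beta * layer.kl_to_std_normal()
loss.backward()
opt.step()
```

Sweeping `beta` from small to large is the knob along which the sharp underfitting/overfitting phase transition predicted by the abstract would be probed.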
Relative Flatness and Generalization
Flatness of the loss curve is conjectured to be connected to the generalization
ability of machine learning models, in particular neural networks. While it has
been empirically observed that flatness measures consistently correlate strongly
with generalization, it is still an open theoretical problem why and under which
circumstances flatness is connected to generalization, in particular in light of
reparameterizations that change certain flatness measures but leave generalization
unchanged. We investigate the connection between flatness and generalization
by relating it to the interpolation from representative data, deriving notions of
representativeness, and feature robustness. The notions allow us to rigorously
connect flatness and generalization and to identify conditions under which the
connection holds. Moreover, they give rise to a novel, but natural relative flatness
measure that correlates strongly with generalization, simplifies to ridge regression
for ordinary least squares, and solves the reparameterization issue.
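To illustrate, the sketch below computes a simplified, single-layer scalar variant of such a relative flatness measure: the squared norm of a layer's weights times the trace of the loss Hessian with respect to those weights. The paper's actual measure also involves cross-neuron Hessian blocks; the tiny network and all names here are illustrative assumptions.

```python
# Sketch: a simplified, scalar relative-flatness-style measure for one layer:
# squared weight norm times the trace of the loss Hessian w.r.t. that layer.
# This is a diagonal simplification of the paper's measure; names are ours.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(64, 5)
y = torch.randint(0, 3, (64,))
W2 = torch.randn(3, 4)            # fixed top layer

def loss_of(W1_flat):
    W1 = W1_flat.view(4, 5)       # the layer whose flatness we measure
    h = torch.relu(x @ W1.T)
    return F.cross_entropy(h @ W2.T, y)

W1 = torch.randn(4 * 5)
H = torch.autograd.functional.hessian(loss_of, W1)   # (20, 20) exact Hessian
relative_flatness = W1.pow(2).sum() * torch.trace(H)

# Rescaling the layer (W1 -> a*W1, with the next layer absorbing 1/a) divides
# trace(H) by a^2 but multiplies ||W1||^2 by a^2, leaving the product
# unchanged -- the reparameterization robustness the abstract refers to.
print(relative_flatness.item())
```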
The Geometry of Neural Nets' Parameter Spaces Under Reparametrization
Model reparametrization -- transforming the parameter space via a bijective
differentiable map -- is a popular way to improve the training of neural
networks. But reparametrizations have also been problematic since they induce
inconsistencies in, e.g., Hessian-based flatness measures, optimization
trajectories, and modes of probability density functions. This complicates
downstream analyses; e.g., one cannot make a definitive statement about the
connection between flatness and generalization. In this work, we study the
invariance of neural-net quantities under reparametrization from the
perspective of Riemannian geometry. We show that such invariance is
an inherent property of any neural net, as long as one makes explicit the
metric that is always present, albeit often only implicitly assumed, and uses
the correct transformation rules under reparametrization. We present
discussions on measuring the flatness of minima, in optimization, and in
probability-density maximization, along with applications in studying the
biases of optimizers and in Bayesian inference.
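The core invariance argument can be checked numerically. The sketch below assumes a quadratic loss at a minimum, a linear reparametrization theta = J @ psi, and a Euclidean metric in the original coordinates: the naive Hessian trace changes under reparametrization, while the metric-corrected trace Tr(G^{-1} H) does not, because at a minimum both the Hessian and the metric transform as (0,2)-tensors.

```python
# Sketch: why a metric-aware flatness measure is reparametrization-invariant,
# in the spirit of the abstract. Quadratic loss, linear reparametrization
# theta = J @ psi, Euclidean metric in theta-coordinates. All names are
# illustrative; the paper treats general Riemannian metrics.
import numpy as np

rng = np.random.default_rng(0)
d = 4
A = rng.standard_normal((d, d)); A = A @ A.T   # loss Hessian at the minimum
J = rng.standard_normal((d, d))                # Jacobian d(theta)/d(psi)

H_theta = A                                    # Hessian in theta-coordinates
H_psi = J.T @ A @ J                            # transforms as a (0,2)-tensor

print(np.trace(H_theta), np.trace(H_psi))      # naive traces disagree

G_theta = np.eye(d)                            # metric in theta-coordinates
G_psi = J.T @ G_theta @ J                      # transforms the same way

inv_theta = np.trace(np.linalg.solve(G_theta, H_theta))
inv_psi = np.trace(np.linalg.solve(G_psi, H_psi))
print(inv_theta, inv_psi)                      # metric-corrected traces agree
```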
Flat Seeking Bayesian Neural Networks
Bayesian Neural Networks (BNNs) provide a probabilistic interpretation for
deep learning models by imposing a prior distribution over model parameters and
inferring a posterior distribution based on observed data. Models sampled
from the posterior can be used to provide ensemble predictions and to quantify
prediction uncertainty. It is well known that deep learning models with lower
sharpness generalize better. However, existing posterior inference
formulations are not sharpness-aware, so the models sampled from them may have
high sharpness. In this paper, we develop the theory, the Bayesian setting,
and the variational inference approach for a sharpness-aware posterior.
Specifically, both the models sampled from our sharpness-aware posterior and
the optimal approximate posterior estimating it exhibit better flatness, and
hence potentially better generalization. We conduct
experiments by leveraging the sharpness-aware posterior with state-of-the-art
Bayesian Neural Networks, showing that the flat-seeking counterparts outperform
their baselines in all metrics of interest.
Comment: Accepted at NeurIPS 2023
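As a rough illustration of what a flat-seeking variational step could look like, the sketch below scores each posterior sample at a SAM-style worst-case weight perturbation before adding the usual Gaussian KL term. This is one plausible reading of the idea, not the paper's exact algorithm; `rho`, `beta`, and the logistic-regression model are illustrative.

```python
# Sketch: one training step of a SAM-style "flat-seeking" variational
# posterior, assuming the sharpness-aware posterior is approximated by
# scoring each sampled weight at a worst-case perturbation of radius rho.
# `rho`, `beta`, and the model are illustrative, not the paper's recipe.
import torch
import torch.nn.functional as F

d_in, d_out = 20, 3
mu = torch.zeros(d_out, d_in, requires_grad=True)
log_sigma = torch.full((d_out, d_in), -3.0, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=1e-3)
rho, beta = 0.05, 1e-3

x, y = torch.randn(128, d_in), torch.randint(0, d_out, (128,))

# 1) Sample a model from the variational posterior (reparameterization trick).
eps = torch.randn_like(mu)
w = mu + log_sigma.exp() * eps

# 2) SAM-style inner step: perturb the sample toward higher loss (radius rho).
#    As in standard SAM, we do not differentiate through the perturbation.
g = torch.autograd.grad(F.cross_entropy(x @ w.T, y), w)[0]
w_adv = w + rho * g / (g.norm() + 1e-12)

# 3) Score the perturbed model; add the usual Gaussian KL to a N(0, 1) prior.
sigma2 = (2 * log_sigma).exp()
kl = 0.5 * (sigma2 + mu**2 - 1 - 2 * log_sigma).sum()
loss = F.cross_entropy(x @ w_adv.T, y) + beta * kl
opt.zero_grad(); loss.backward(); opt.step()
```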