How Does Information Bottleneck Help Deep Learning?
Numerous deep learning algorithms have been inspired by and understood via
the notion of information bottleneck, where unnecessary information is (often
implicitly) minimized while task-relevant information is maximized. However, a
rigorous argument for justifying why it is desirable to control information
bottlenecks has been elusive. In this paper, we provide the first rigorous
learning theory for justifying the benefit of information bottleneck in deep
learning by mathematically relating information bottleneck to generalization
errors. Our theory proves that controlling information bottleneck is one way to
control generalization errors in deep learning, although it is neither the only
nor a necessary way to do so. We investigate the merit of our new mathematical findings with
experiments across a range of architectures and learning settings. In many
cases, generalization errors are shown to correlate with the degree of
information bottleneck, i.e., the amount of unnecessary information retained at
hidden layers. This paper provides a theoretical foundation for current and
future methods through the lens of information bottleneck. Our new
generalization bounds scale with the degree of information bottleneck, unlike
the previous bounds that scale with the number of parameters, VC dimension,
Rademacher complexity, stability or robustness. Our code is publicly available
at: https://github.com/xu-ji/information-bottleneck
Comment: Accepted at ICML 2023. Code is available at
https://github.com/xu-ji/information-bottleneck
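For reference, the classical information bottleneck objective underlying this
line of work can be sketched as follows; this is the standard
Tishby-Pereira-Bialek formulation in our notation, not the paper's specific
generalization bound:

% Information bottleneck Lagrangian: learn a representation Z of the
% input X that discards input information while keeping label information.
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Y; Z)
% I(X;Z) penalizes retained ("unnecessary") input information,
% I(Y;Z) rewards task-relevant information, and \beta > 0 sets the trade-off.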
Emergence of Invariance and Disentanglement in Deep Representations
Using established principles from Statistics and Information Theory, we show
that invariance to nuisance factors in a deep neural network is equivalent to
information minimality of the learned representation, and that stacking layers
and injecting noise during training naturally bias the network towards learning
invariant representations. We then decompose the cross-entropy loss used during
training and highlight the presence of an inherent overfitting term. We propose
regularizing the loss by bounding such a term in two equivalent ways: one with
a Kullback-Leibler term, which relates to a PAC-Bayes perspective; the other
using the information in the weights as a measure of complexity of a learned
model, yielding a novel Information Bottleneck for the weights. Finally, we
show that invariance and independence of the components of the representation
learned by the network are bounded above and below by the information in the
weights, and therefore are implicitly optimized during training. The theory
enables us to quantify and predict sharp phase transitions between underfitting
and overfitting of random labels when using our regularized loss, which we
verify in experiments, and sheds light on the relation between the geometry of
the loss function, invariance properties of the learned representation, and
generalization error.
Comment: Deep learning, neural network, representation, flat minima,
information bottleneck, overfitting, generalization, sufficiency, minimality,
sensitivity, information complexity, stochastic gradient descent,
regularization, total correlation, PAC-Bayes
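As a rough sketch in our own notation (the paper derives the exact form), the
resulting regularized loss penalizes the information the weights w carry about
the training set \mathcal{D}:

% Cross-entropy loss plus an information-in-the-weights penalty:
L(w) \;=\; H_{p,q}(y \mid x, w) \;+\; \beta \, I(w; \mathcal{D})
% For any prior p(w), the PAC-Bayes-style term
% \mathbb{E}\,\mathrm{KL}\big(q(w \mid \mathcal{D}) \,\|\, p(w)\big)
% upper-bounds I(w; \mathcal{D}), which is why the two regularizers
% mentioned in the abstract are equivalent ways of bounding overfitting.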
The Conditional Entropy Bottleneck
Much of the field of Machine Learning exhibits a prominent set of failure
modes, including vulnerability to adversarial examples, poor
out-of-distribution (OoD) detection, miscalibration, and willingness to
memorize random labelings of datasets. We characterize these as failures of
robust generalization, which extends the traditional measure of generalization
as accuracy or related metrics on a held-out set. We hypothesize that these
failures to robustly generalize are due to the learning systems retaining too
much information about the training data. To test this hypothesis, we propose
the Minimum Necessary Information (MNI) criterion for evaluating the quality of
a model. In order to train models that perform well with respect to the MNI
criterion, we present a new objective function, the Conditional Entropy
Bottleneck (CEB), which is closely related to the Information Bottleneck (IB).
We experimentally test our hypothesis by comparing the performance of CEB
models with deterministic models and Variational Information Bottleneck (VIB)
models on a variety of different datasets and robustness challenges. We find
strong empirical evidence supporting our hypothesis that MNI models improve on
these problems of robust generalization.
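Schematically, and in our notation rather than the paper's exact statement,
the MNI criterion and the CEB objective can be summarized as:

% MNI point: Z captures exactly the label-relevant content of X,
%   I(X; Z) = I(Y; Z) = I(X; Y).
% CEB penalizes the residual information I(X; Z | Y), which under the
% Markov chain Y -- X -- Z equals I(X; Z) - I(Y; Z):
\min_{p(z \mid x)} \; I(X; Z \mid Y) \;-\; \gamma \, I(Y; Z)
% In practice the paper optimizes a variational bound on this objective.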
On Neural Networks Fitting, Compression, and Generalization Behavior via Information-Bottleneck-like Approaches
It is well known that the neural network learning process, along with its connections to fitting, compression, and generalization, is not yet well understood. In this paper, we propose a novel approach to capturing such neural network dynamics using information-bottleneck-type techniques, replacing mutual information measures (which are notoriously difficult to estimate in high-dimensional spaces) with more tractable ones, including (1) the minimum mean-squared error associated with reconstructing the network input from some intermediate network representation and (2) the cross-entropy of a class label given some network representation. We then conduct an empirical study to ascertain how different network models, learning algorithms, and datasets affect the learning dynamics. Our experiments show that our proposed approach appears more reliable than classical information bottleneck ones in capturing network dynamics during both the training and testing phases. They also reveal that the fitting and compression phases exist regardless of the choice of activation function, and that model architectures, training algorithms, and datasets leading to better generalization tend to exhibit more pronounced fitting and compression phases.
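To make the two proxies concrete, the following is a minimal illustrative
sketch (not the paper's code) of how one might measure them for a frozen
intermediate representation; the probe modules and toy data are our own
assumptions, and the paper would evaluate the probes on held-out data rather
than the fitting data:

import torch
import torch.nn as nn

def fit_probe(probe, zs, targets, loss_fn, epochs=200, lr=1e-2):
    # Train a small probe on frozen representations; return its final loss.
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(zs), targets)
        loss.backward()
        opt.step()
    return loss.item()

# Toy data and a frozen "network layer" f standing in for a trained model.
x = torch.randn(512, 20)
y = (x[:, 0] > 0).long()
f = nn.Sequential(nn.Linear(20, 8), nn.Tanh())
with torch.no_grad():
    z = f(x)  # intermediate representation Z = f(X)

# Proxy (1): MMSE of reconstructing X from Z (higher = more compression).
probe_decoder = nn.Linear(8, 20)
mmse_proxy = fit_probe(probe_decoder, z, x, nn.MSELoss())

# Proxy (2): cross-entropy of predicting Y from Z (lower = better fitting).
probe_head = nn.Linear(8, 2)
ce_proxy = fit_probe(probe_head, z, y, nn.CrossEntropyLoss())

print(f"MMSE proxy: {mmse_proxy:.4f}  CE proxy: {ce_proxy:.4f}")

Tracking these two numbers across training epochs of the underlying network is
what yields the fitting and compression curves discussed above.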
Learning to Learn with Variational Information Bottleneck for Domain Generalization
Domain generalization models learn to generalize to previously unseen
domains, but suffer from prediction uncertainty and domain shift. In this
paper, we address both problems. We introduce a probabilistic meta-learning
model for domain generalization, in which classifier parameters shared across
domains are modeled as distributions. This enables better handling of
prediction uncertainty on unseen domains. To deal with domain shift, we learn
domain-invariant representations via our proposed principle of meta variational
information bottleneck, which we call MetaVIB. MetaVIB is derived from novel
variational bounds of mutual information, by leveraging the meta-learning
setting of domain generalization. Through episodic training, MetaVIB learns to
gradually narrow domain gaps to establish domain-invariant representations,
while simultaneously maximizing prediction accuracy. We conduct experiments on
three benchmarks for cross-domain visual recognition. Comprehensive ablation
studies validate the benefits of MetaVIB for domain generalization. The
comparison results demonstrate that our method consistently outperforms
previous approaches.
Comment: 15 pages, 4 figures, ECCV 2020
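For context, the variational information bottleneck building block that
MetaVIB extends looks roughly like the sketch below; this is plain VIB with a
standard Gaussian prior, not the paper's episodic meta-learning objective, and
all names are illustrative:

import torch
import torch.nn as nn
import torch.nn.functional as F

class VIBClassifier(nn.Module):
    def __init__(self, in_dim=20, z_dim=8, n_classes=2):
        super().__init__()
        self.encoder = nn.Linear(in_dim, 2 * z_dim)  # outputs (mu, log_var)
        self.head = nn.Linear(z_dim, n_classes)

    def forward(self, x):
        mu, log_var = self.encoder(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparameterize
        # KL(q(z|x) || N(0, I)) is a variational upper bound on the
        # I(X; Z) compression term of the bottleneck objective.
        kl = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1).sum(-1).mean()
        return self.head(z), kl

model = VIBClassifier()
x = torch.randn(64, 20)
y = (x[:, 0] > 0).long()
logits, kl = model(x)
loss = F.cross_entropy(logits, y) + 1e-3 * kl  # beta = 1e-3, illustrative
loss.backward()

MetaVIB replaces this single-task objective with variational bounds computed
across meta-learning episodes, but the stochastic-encoder-plus-KL structure is
the same.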
The Role of Mutual Information in Variational Classifiers
Overfitting is a well-known phenomenon in which a model mimics too closely (or
exactly) a particular instance of data and may therefore fail to predict future
observations reliably. In practice, this behaviour is controlled by various
(sometimes heuristic) regularization
techniques, which are motivated by developing upper bounds to the
generalization error. In this work, we study the generalization error of
classifiers relying on stochastic encodings trained on the cross-entropy loss,
which is often used in deep learning for classification problems. We derive
bounds to the generalization error showing that there exists a regime where the
generalization error is bounded by the mutual information between input
features and the corresponding representations in the latent space, which are
randomly generated according to the encoding distribution. Our bounds provide
an information-theoretic understanding of generalization in the so-called class
of variational classifiers, which are regularized by a Kullback-Leibler (KL)
divergence term. These results give theoretical grounds for the highly popular
KL term in variational inference methods that was already recognized to act
effectively as a regularization penalty. We further observe connections with
well-studied notions such as Variational Autoencoders, Information Dropout,
Information Bottleneck and Boltzmann Machines. Finally, we perform numerical
experiments on the MNIST and CIFAR datasets and show that mutual information is
indeed highly representative of the behaviour of the generalization error.
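The connection between the KL regularizer and mutual information rests on a
standard identity, sketched here in our notation:

% For any fixed variational marginal q(z):
I(X; Z) \;=\; \mathbb{E}_{x}\,\mathrm{KL}\big(p(z \mid x) \,\|\, p(z)\big)
        \;\le\; \mathbb{E}_{x}\,\mathrm{KL}\big(p(z \mid x) \,\|\, q(z)\big)
% Hence minimizing the usual cross-entropy + \beta KL objective of a
% variational classifier also controls the I(X; Z) term that appears in
% the generalization bounds above.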