62 research outputs found
Information-theoretic analysis of generalization capability of learning algorithms
We derive upper bounds on the generalization error of a learning algorithm in
terms of the mutual information between its input and output. The bounds
provide an information-theoretic understanding of generalization in learning
problems, and give theoretical guidelines for striking the right balance
between data fit and generalization by controlling the input-output mutual
information. We propose a number of methods for this purpose, among which are
algorithms that regularize the ERM algorithm with relative entropy or with
random noise. Our work extends and leads to nontrivial improvements on the
recent results of Russo and Zou.
Comment: Final version, accepted to NIPS 201
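As a rough illustration of the kind of bound described in this abstract, the sketch below evaluates the well-known form sqrt(2 σ² I(S;W) / n), which holds for a σ-sub-Gaussian loss. The function name and the numeric inputs are hypothetical, and the mutual information is treated as a given quantity (in nats) rather than something estimated here.

```python
import math

def mi_generalization_bound(mutual_info_nats, n_samples, sigma):
    """Evaluate sqrt(2 * sigma^2 * I(S;W) / n).

    Assumes a sigma-sub-Gaussian loss; mutual_info_nats is I(S;W) in nats
    and is supplied by the caller (it is not estimated in this sketch).
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    if mutual_info_nats < 0:
        raise ValueError("mutual information cannot be negative")
    return math.sqrt(2.0 * sigma ** 2 * mutual_info_nats / n_samples)

# Hypothetical numbers: a loss bounded in [0, 1] is 1/2-sub-Gaussian.
example_bound = mi_generalization_bound(mutual_info_nats=5.0, n_samples=1000, sigma=0.5)
```

The bound shrinks when either the sample size grows or the algorithm's output depends less on its input, which is the trade-off between data fit and generalization that the abstract refers to.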
Average-Case Information Complexity of Learning
How many bits of information are revealed by a learning algorithm for a
concept class of VC-dimension $d$? Previous works have shown that even for
$d=1$ the amount of information may be unbounded (tend to $\infty$ with the
universe size). Can it be that all concepts in the class require leaking a
large amount of information? We show that typically concepts do not require
leakage. There exists a proper learning algorithm that reveals $O(d)$ bits of
information for most concepts in the class. This result is a special case of a
more general phenomenon we explore. If there is a low information learner when
the algorithm {\em knows} the underlying distribution on inputs, then there is
a learner that reveals little information on an average concept {\em without
knowing} the distribution on inputs.
Quantization-Based Regularization for Autoencoders
Autoencoders and their variations provide unsupervised models for learning
low-dimensional representations for downstream tasks. Without proper
regularization, autoencoder models are susceptible to the overfitting problem
and the so-called posterior collapse phenomenon. In this paper, we introduce a
quantization-based regularizer in the bottleneck stage of autoencoder models to
learn meaningful latent representations. We combine both perspectives of Vector
Quantized-Variational AutoEncoders (VQ-VAE) and classical denoising
regularization methods of neural networks. We interpret quantizers as
regularizers that constrain latent representations while fostering a
similarity-preserving mapping at the encoder. Before quantization, we impose
noise on the latent codes and use a Bayesian estimator to optimize the
quantizer-based representation. The introduced bottleneck Bayesian estimator
outputs the posterior mean of the centroids to the decoder, and thus, is
performing soft quantization of the noisy latent codes. We show that our
proposed regularization method results in improved latent representations for
both supervised learning and clustering downstream tasks when compared to
autoencoders using other bottleneck structures.
Comment: AAAI 202
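The soft quantization step described in this abstract (a posterior mean over centroids given a noisy latent code) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes isotropic Gaussian noise with known standard deviation and a uniform prior over centroids, and the function and variable names are hypothetical.

```python
import math

def soft_quantize(noisy_code, centroids, noise_std):
    """Return the posterior mean of the centroids given a noisy latent code.

    Assumptions (not taken from the source): the noise is isotropic Gaussian
    with standard deviation noise_std, and all centroids are equally likely
    a priori.
    """
    # Log-likelihood of the noisy code under each centroid, up to a constant.
    logits = [
        -sum((z - c) ** 2 for z, c in zip(noisy_code, centroid)) / (2.0 * noise_std ** 2)
        for centroid in centroids
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(logits)
    weights = [math.exp(value - top) for value in logits]
    total = sum(weights)
    posterior = [w / total for w in weights]
    # Posterior-weighted average of the centroids, coordinate by coordinate.
    return [
        sum(p * centroid[d] for p, centroid in zip(posterior, centroids))
        for d in range(len(noisy_code))
    ]

codebook = [[0.0, 0.0], [2.0, 0.0]]
# A code equidistant from both centroids is mapped to their midpoint.
midpoint = soft_quantize([1.0, 0.0], codebook, noise_std=1.0)
# With small noise, a code near one centroid is pulled almost onto it.
near_first = soft_quantize([0.1, 0.0], codebook, noise_std=0.1)
```

As the noise level goes to zero the posterior concentrates on the nearest centroid, so this estimator reduces to ordinary hard quantization in that limit.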
SHADE: Information Based Regularization for Deep Learning
Regularization is a central concern in training deep neural networks. In this
paper, we propose a new information-theory-based regularization scheme named
SHADE for SHAnnon DEcay. The originality of the approach is to define a prior
based on conditional entropy, which explicitly decouples the learning of
invariant representations in the regularizer and the learning of correlations
between inputs and labels in the data fitting term. Our second contribution is
to derive a stochastic version of the regularizer compatible with deep
learning, resulting in a tractable training scheme. We empirically validate that
our approach improves classification performance compared to
common regularization schemes on several standard architectures.
Strengthened Information-theoretic Bounds on the Generalization Error
The following problem is considered: given a joint distribution $P_{XY}$ and
an event $E$, bound $P_{XY}(E)$ in terms of $P_XP_Y(E)$ (where $P_XP_Y$ is the
product of the marginals of $P_{XY}$) and a measure of dependence of $X$ and
$Y$. Such bounds have direct applications in the analysis of the generalization
error of learning algorithms, where $E$ represents a large error event and the
measure of dependence controls the degree of overfitting. Herein, bounds are
demonstrated using several information-theoretic metrics, in particular: mutual
information, lautum information, maximal leakage, and $J_\infty$. The mutual
information bound can outperform comparable bounds in the literature by an
arbitrarily large factor.
Comment: Submitted to ISIT 201
Chaining Mutual Information and Tightening Generalization Bounds
Bounding the generalization error of learning algorithms has a long history,
yet existing bounds fall short of explaining various generalization successes, including
those of deep learning. Two important difficulties are (i) exploiting the
dependencies between the hypotheses, and (ii) exploiting the dependence between the
algorithm's input and output. Progress on the first point was made with the
chaining method, originating from the work of Kolmogorov, and used in the
VC-dimension bound. More recently, progress on the second point was made with
the mutual information method by Russo and Zou '15. Yet, these two methods are
currently disjoint. In this paper, we introduce a technique to combine the
chaining and mutual information methods, to obtain a generalization bound that
is both algorithm-dependent and that exploits the dependencies between the
hypotheses. We provide an example in which our bound significantly outperforms
both the chaining and the mutual information bounds. As a corollary, we tighten
Dudley's inequality when the learning algorithm chooses its output from a small
subset of hypotheses with high probability.
Comment: 20 pages, 1 figure; published at the NeurIPS 2018 conference
An Optimal Transport View on Generalization
We derive upper bounds on the generalization error of learning algorithms
based on their \emph{algorithmic transport cost}: the expected Wasserstein
distance between the output hypothesis and the output hypothesis conditioned on
an input example. The bounds provide a novel approach to study the
generalization of learning algorithms from an optimal transport view and impose
fewer constraints on the loss function, such as sub-Gaussianity or boundedness. We
further provide several upper bounds on the algorithmic transport cost in terms
of total variation distance, relative entropy (or KL-divergence), and VC
dimension, thus further bridging optimal transport theory and information
theory with statistical learning theory. Moreover, we also study different
conditions for loss functions under which the generalization error of a
learning algorithm can be upper bounded by different probability metrics
between distributions relating to the output hypothesis and/or the input data.
Finally, under our established framework, we analyze the generalization in deep
learning and conclude that the generalization error in deep neural networks
(DNNs) decreases exponentially to zero as the number of layers increases. Our
analyses of generalization error in deep learning mainly exploit the
hierarchical structure in DNNs and the contraction property of $f$-divergence,
which may be of independent interest in analyzing other learning models with
hierarchical structure.
Comment: 27 pages, 2 figures, 1 table
An Information-Theoretic View for Deep Learning
Deep learning has transformed computer vision, natural language processing,
and speech recognition\cite{badrinarayanan2017segnet, dong2016image,
ren2017faster, ji20133d}. However, two critical questions remain obscure: (1)
why do deep neural networks generalize better than shallow networks; and (2)
does it always hold that a deeper network leads to better performance?
Specifically, letting $L$ be the number of convolutional and pooling layers in
a deep neural network, and $n$ be the size of the training sample, we derive an
upper bound on the expected generalization error for this network, i.e.,
\begin{eqnarray*}
\mathbb{E}[R(W)-R_S(W)] \leq
\exp\left(-\frac{L}{2}\log\frac{1}{\eta}\right)\sqrt{\frac{2\sigma^2}{n}I(S,W)}
\end{eqnarray*}
where $\sigma$ is a constant depending on the loss
function, $\eta$ is a constant depending on the information loss for each
convolutional or pooling layer, and $I(S,W)$ is the mutual information between
the training sample $S$ and the output hypothesis $W$. This upper bound shows
that as the number of convolutional and pooling layers increases in the
network, the expected generalization error will decrease exponentially to zero.
Layers with strict information loss, such as the convolutional layers, reduce
the generalization error for the whole network; this answers the first
question. However, zero expected generalization error does not
imply a small test error, i.e., a small $\mathbb{E}[R(W)]$. This is because
$\mathbb{E}[R_S(W)]$ is large when the information for fitting the data is lost
as the number of layers increases. This suggests that the claim `the deeper the
better' is conditioned on a small training error, i.e., a small $\mathbb{E}[R_S(W)]$.
Finally, we show that deep learning satisfies a weak notion of stability and
the sample complexity of deep neural networks will decrease as $L$ increases.
Comment: Add details in the proof of Theorem
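To make the depth dependence in the displayed bound concrete, the sketch below evaluates exp(-(L/2) log(1/η)) · sqrt(2 σ² I(S,W) / n) for given inputs. The function name and all numeric values are hypothetical, and restricting η to the interval (0, 1) is an assumption made here so that the prefactor actually decays with depth.

```python
import math

def depth_generalization_bound(num_layers, eta, sigma, n_samples, mutual_info_nats):
    """Evaluate exp(-(L/2) * log(1/eta)) * sqrt(2 * sigma^2 * I(S,W) / n).

    Assumption made for this sketch: 0 < eta < 1, so the prefactor
    (equal to eta ** (L / 2)) shrinks as the number of layers L grows.
    """
    if not 0.0 < eta < 1.0:
        raise ValueError("eta is assumed to lie strictly between 0 and 1")
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    contraction = math.exp(-(num_layers / 2.0) * math.log(1.0 / eta))
    return contraction * math.sqrt(2.0 * sigma ** 2 * mutual_info_nats / n_samples)

# Hypothetical inputs: the same data and loss, with increasing depth.
shallow = depth_generalization_bound(2, eta=0.5, sigma=0.5, n_samples=1000, mutual_info_nats=5.0)
deep = depth_generalization_bound(8, eta=0.5, sigma=0.5, n_samples=1000, mutual_info_nats=5.0)
```

Note that this only tracks the gap between test and training error. As the abstract points out, a small gap says nothing about the test error itself if the training error grows with depth.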
Tightening Mutual Information Based Bounds on Generalization Error
An information-theoretic upper bound on the generalization error of
supervised learning algorithms is derived. The bound is constructed in terms of
the mutual information between each individual training sample and the output
of the learning algorithm. The bound is derived under more general conditions
on the loss function than in existing studies; nevertheless, it provides a
tighter characterization of the generalization error. Examples of learning
algorithms are provided to demonstrate the tightness of the bound, and to
show that it has a broad range of applicability. Application to noisy and
iterative algorithms, e.g., stochastic gradient Langevin dynamics (SGLD), is
also studied, where the constructed bound provides a tighter characterization
of the generalization error than existing results. Finally, it is demonstrated
that, unlike existing bounds, which are difficult to compute and evaluate
empirically, the proposed bound can be estimated easily in practice.
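A bound built from per-sample mutual information terms, as this abstract describes, is commonly written as (1/n) Σ_i sqrt(2 σ² I(W; Z_i)) for a σ-sub-Gaussian loss. The sketch below simply evaluates that expression. The function name and the per-sample values are hypothetical, and the mutual information terms are assumed to be supplied in nats rather than estimated.

```python
import math

def individual_sample_bound(per_sample_mi_nats, sigma):
    """Evaluate (1/n) * sum_i sqrt(2 * sigma^2 * I(W; Z_i)).

    Assumes a sigma-sub-Gaussian loss; each entry of per_sample_mi_nats is
    the mutual information (in nats) between the output hypothesis W and
    one training sample Z_i, supplied by the caller.
    """
    n = len(per_sample_mi_nats)
    if n == 0:
        raise ValueError("at least one training sample is required")
    if any(mi < 0 for mi in per_sample_mi_nats):
        raise ValueError("mutual information cannot be negative")
    return sum(math.sqrt(2.0 * sigma ** 2 * mi) for mi in per_sample_mi_nats) / n

# Hypothetical values: four samples, each sharing 0.02 nats with the output.
per_sample_bound = individual_sample_bound([0.02, 0.02, 0.02, 0.02], sigma=0.5)
```

Because each term involves a single sample rather than the whole training set, the individual quantities can stay finite in cases where the mutual information with the full sample is infinite, for example with a deterministic algorithm on continuous data.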
Learning Invariant Feature Representation to Improve Generalization across Chest X-ray Datasets
Chest radiography is the most common medical image examination for screening
and diagnosis in hospitals. Automatic interpretation of chest X-rays at the
level of an entry-level radiologist can greatly benefit work prioritization and
assist in analyzing a larger population. Subsequently, several datasets and
deep learning-based solutions have been proposed to identify diseases based on
chest X-ray images. However, these methods have been shown to be vulnerable to shifts
in the source of data: a deep learning model that performs well when tested on the
same dataset as its training data starts to perform poorly when it is tested on a
dataset from a different source. In this work, we address this challenge of
generalization to a new source by forcing the network to learn a
source-invariant representation. By employing an adversarial training strategy,
we show that a network can be forced to learn a source-invariant
representation. Through pneumonia-classification experiments on multi-source
chest X-ray datasets, we show that this algorithm helps in improving
classification accuracy on an X-ray dataset from a new source.
Comment: Accepted to Machine Learning in Medical Imaging (MLMI 2020), in
conjunction with MICCAI 2020, Oct. 4, 2020