What do CNNs Learn in the First Layer and Why? A Linear Systems Perspective
It has previously been reported that the representation that is learned in
the first layer of deep Convolutional Neural Networks (CNNs) is highly
consistent across initializations and architectures. In this work, we quantify
this consistency by considering the first layer as a filter bank and measuring
its energy distribution. We find that the energy distribution is very different
from that of the initial weights and is remarkably consistent across random
initializations, datasets, architectures and even when the CNNs are trained
with random labels. In order to explain this consistency, we derive an
analytical formula for the energy profile of linear CNNs and show that this
profile is mostly dictated by the second-order statistics of image patches in
the training set and approaches a whitening transformation as the
number of iterations goes to infinity. Finally, we show that this formula for
linear CNNs also gives an excellent fit for the energy profiles learned by
commonly used nonlinear CNNs such as ResNet and VGG, and that the first layer
of these CNNs indeed performs approximate whitening of its inputs.
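A minimal NumPy sketch of this whitening prediction (illustrative only:
synthetic 1/f images stand in for a real training set, and all parameter
values are assumptions, not the paper's): estimate the second-moment matrix C
of image patches, form the whitening transform C^{-1/2}, and read off its
energy profile in the eigenbasis of C.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_images(n, size=32):
        # Synthetic stand-in for a training set: noise with a 1/f amplitude
        # spectrum, a crude model of natural-image statistics.
        noise = rng.standard_normal((n, size, size))
        f = np.fft.fftfreq(size)
        radial = np.sqrt(f[None, :]**2 + f[:, None]**2)
        radial[0, 0] = 1.0                     # avoid division by zero at DC
        return np.real(np.fft.ifft2(np.fft.fft2(noise) / radial))

    def extract_patches(images, k=5):
        # Non-overlapping k x k patches, flattened to vectors.
        return np.array([img[i:i+k, j:j+k].ravel()
                         for img in images
                         for i in range(0, images.shape[1] - k, k)
                         for j in range(0, images.shape[1] - k, k)])

    patches = extract_patches(sample_images(200))
    patches -= patches.mean(axis=0)
    C = patches.T @ patches / len(patches)   # second-order patch statistics

    # Predicted infinite-time limit for a linear CNN: the whitening
    # transform C^{-1/2}.
    evals, evecs = np.linalg.eigh(C)
    W_white = evecs @ np.diag(np.clip(evals, 1e-8, None)**-0.5) @ evecs.T

    # Energy profile: mean squared coefficient of each filter (row) in the
    # eigenbasis of C, ordered from largest to smallest eigenvalue.
    profile = ((W_white @ evecs)**2).mean(axis=0)[::-1]
    print(profile[:5])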
Neural networks trained with SGD learn distributions of increasing complexity
The ability of deep neural networks to generalise well even when they
interpolate their training data has been explained using various "simplicity
biases". These theories postulate that neural networks avoid overfitting by
first learning simple functions, say a linear classifier, before learning more
complex, non-linear functions. Meanwhile, data structure is also recognised as
a key ingredient for good generalisation, yet its role in simplicity biases is
not yet understood. Here, we show that neural networks trained using stochastic
gradient descent initially classify their inputs using lower-order input
statistics, like mean and covariance, and exploit higher-order statistics only
later during training. We first demonstrate this distributional simplicity bias
(DSB) in a solvable model of a neural network trained on synthetic data. We
empirically demonstrate DSB in a range of deep convolutional networks and
visual transformers trained on CIFAR10, and show that it even holds in networks
pre-trained on ImageNet. We discuss the relation of DSB to other simplicity
biases and consider its implications for the principle of Gaussian universality
in learning.
Comment: Source code available at https://github.com/sgoldt/dist_inc_com
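The key experimental control behind such a claim can be sketched in a few
lines of NumPy: build a "Gaussian clone" of a dataset that matches its mean
and covariance but erases all higher-order structure, then compare a
network's behaviour on the two. (A hedged sketch on assumed toy data; the
paper's experiments use CIFAR10 and ImageNet instead.)

    import numpy as np

    rng = np.random.default_rng(0)

    def gaussian_clone(X):
        # Resample from a Gaussian matching X's mean and covariance: a
        # control that preserves lower-order statistics but destroys
        # higher-order structure.
        mu = X.mean(axis=0)
        cov = np.cov(X, rowvar=False)
        return rng.multivariate_normal(mu, cov, size=len(X))

    # Toy class with non-Gaussian structure: points lie on the lines
    # y = +/- x, yet the mean is zero and the covariance is ~identity.
    z = rng.standard_normal((1000, 2))
    X = np.column_stack([z[:, 0], z[:, 0] * np.sign(z[:, 1])])
    X_clone = gaussian_clone(X)

    print(np.cov(X, rowvar=False).round(2))        # ~identity
    print(np.cov(X_clone, rowvar=False).round(2))  # matches, but the clone
                                                   # is an isotropic blob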
Gradient-trained Weights in Wide Neural Networks Align Layerwise to Error-scaled Input Correlations
Recent works have examined how deep neural networks, which can solve a
variety of difficult problems, incorporate the statistics of training data to
achieve their success. However, existing results have been established only in
limited settings. In this work, we derive the layerwise weight dynamics of
infinite-width neural networks with nonlinear activations trained by gradient
descent. We show theoretically that weight updates are aligned with input
correlations from intermediate layers weighted by error, and demonstrate
empirically that the result also holds in finite-width wide networks. The
alignment result allows us to formulate backpropagation-free learning rules,
named Align-zero and Align-ada, that theoretically achieve the same alignment
as backpropagation. Finally, we test these learning rules on benchmark problems
in feedforward and recurrent neural networks and demonstrate, in wide networks,
comparable performance to backpropagation.
Comment: 22 pages, 11 figures
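A hedged NumPy sketch of the alignment claim for a single wide hidden layer
(the sizes and the frozen-second-layer simplification are assumptions, not
the paper's exact setup): the first-layer gradient is exactly an error-scaled
input correlation, and in the wide regime the total weight change stays
aligned with that correlation evaluated at initialization.

    import numpy as np

    rng = np.random.default_rng(0)

    def cosine(A, B):
        return (A * B).sum() / (np.linalg.norm(A) * np.linalg.norm(B))

    n, d, h = 512, 16, 4096                  # wide hidden layer
    X = rng.standard_normal((n, d))
    y = np.tanh(X @ rng.standard_normal(d))  # arbitrary smooth target

    W1 = rng.standard_normal((h, d)) / np.sqrt(d)
    w2 = rng.standard_normal(h) / np.sqrt(h)  # frozen, for clarity

    def layer1_grad(W1):
        # Gradient of the MSE loss w.r.t. W1: the per-sample error,
        # backpropagated through tanh, correlated with the inputs --
        # an error-scaled input correlation.
        H = np.tanh(X @ W1.T)
        e = H @ w2 - y
        return (w2[:, None] * ((1 - H**2) * e[:, None]).T) @ X / n

    g0 = layer1_grad(W1)                     # correlation at initialization
    W1_init, lr = W1.copy(), 0.5
    for _ in range(100):
        W1 = W1 - lr * layer1_grad(W1)

    # In the wide (lazy) regime the total weight change stays aligned
    # with -g0.
    print(cosine(W1 - W1_init, -g0))         # close to 1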
Classification of Heavy-tailed Features in High Dimensions: a Superstatistical Approach
We characterise the learning of a mixture of two clouds of data points with
generic centroids via empirical risk minimisation in the high dimensional
regime, under the assumptions of generic convex loss and convex regularisation.
Each cloud of data points is generated via a double-stochastic process, where
each sample is drawn from a Gaussian distribution whose variance is itself a
random parameter sampled from a scalar distribution. As a result, our
analysis covers a large family of data distributions, including the case of
power-law-tailed distributions with no finite covariance, and allows us to test recent
"Gaussian universality" claims. We study the generalisation performance of the
obtained estimator, we analyse the role of regularisation, and we analytically
characterise the separability transition.
Comment: 25 pages, 8 figures
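The double-stochastic sampling step is easy to reproduce; the sketch below
(with an assumed inverse-Gamma variance distribution, one convenient choice
that yields Student-t, power-law-tailed samples) illustrates how such clouds
are generated.

    import numpy as np

    rng = np.random.default_rng(0)

    def superstatistical_cloud(mu, n, nu=3.0):
        # Double-stochastic process: draw a variance from a scalar
        # distribution, then draw the sample from a Gaussian with that
        # variance. An inverse-Gamma variance (assumed here; nu is an
        # illustrative shape parameter) makes x - mu Student-t with nu
        # degrees of freedom, i.e. power-law-tailed.
        var = 1.0 / rng.gamma(shape=nu / 2, scale=2.0 / nu, size=(n, 1))
        return mu + np.sqrt(var) * rng.standard_normal((n, len(mu)))

    # Two heavy-tailed clouds with generic centroids, as in the model.
    X_plus = superstatistical_cloud(np.full(10, +1.0), 5000)
    X_minus = superstatistical_cloud(np.full(10, -1.0), 5000)
    print(np.abs(X_plus - 1.0).max())  # occasional huge excursions:
                                       # heavy tails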
On information captured by neural networks: connections with memorization and generalization
Despite the popularity and success of deep learning, there is limited
understanding of when, how, and why neural networks generalize to unseen
examples. Since learning can be seen as extracting information from data, we
formally study information captured by neural networks during training.
Specifically, we start by viewing learning in the presence of noisy labels from
an information-theoretic perspective and derive a learning algorithm that
limits label noise information in weights. We then define a notion of unique
information that an individual sample provides to the training of a deep
network, shedding some light on the behavior of neural networks on examples
that are atypical, ambiguous, or belong to underrepresented subpopulations. We
relate example informativeness to generalization by deriving nonvacuous
generalization gap bounds. Finally, by studying knowledge distillation, we
highlight the important role of data and label complexity in generalization.
Overall, our findings contribute to a deeper understanding of the mechanisms
underlying neural network generalization.
Comment: PhD thesis
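As a crude, hedged stand-in for the notion of unique sample information (the
thesis develops a formal information-theoretic definition; this leave-one-out
proxy is only an illustration), one can measure how much a model's confidence
on an example drops when that example is removed from training:

    import numpy as np

    rng = np.random.default_rng(0)

    def fit_logreg(X, y, steps=500, lr=0.5):
        # Plain gradient descent on the logistic loss.
        w = np.zeros(X.shape[1])
        for _ in range(steps):
            p = 1 / (1 + np.exp(-X @ w))
            w -= lr * X.T @ (p - y) / len(y)
        return w

    def unique_info_proxy(X, y, i):
        # Confidence on (x_i, y_i) with i in training, minus confidence
        # with i held out: a leave-one-out proxy for unique information.
        mask = np.arange(len(y)) != i
        w_with = fit_logreg(X, y)
        w_without = fit_logreg(X[mask], y[mask])
        conf = lambda w: 1 / (1 + np.exp(-(2 * y[i] - 1) * (X[i] @ w)))
        return conf(w_with) - conf(w_without)

    X = rng.standard_normal((200, 5))
    y = (X[:, 0] + 0.3 * rng.standard_normal(200) > 0).astype(float)
    y[0] = 1 - y[0]                    # one mislabeled, atypical example
    print(unique_info_proxy(X, y, 0))  # positive: only this example
                                       # carries evidence for its own label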