176 research outputs found
Heavy-Tailed Universality Predicts Trends in Test Accuracies for Very Large Pre-Trained Deep Neural Networks
Given two or more Deep Neural Networks (DNNs) with the same or similar
architectures, and trained on the same dataset, but trained with different
solvers, parameters, hyper-parameters, regularization, etc., can we predict
which DNN will have the best test accuracy, and can we do so without peeking at
the test data? In this paper, we show how to use a new Theory of Heavy-Tailed
Self-Regularization (HT-SR) to answer this. HT-SR suggests, among other things,
that modern DNNs exhibit what we call Heavy-Tailed Mechanistic Universality
(HT-MU), meaning that the correlations in the layer weight matrices can be fit
to a power law (PL) with exponents that lie in common Universality classes from
Heavy-Tailed Random Matrix Theory (HT-RMT). From this, we develop a Universal
capacity control metric that is a weighted average of PL exponents. Rather than
considering small toy NNs, we examine over 50 different, large-scale
pre-trained DNNs, ranging over 15 different architectures, trained on
ImageNet, each of which has been reported to have different test accuracies.
We show that this new capacity metric correlates very well with the reported
test accuracies of these DNNs, looking across each architecture
(VGG16/.../VGG19, ResNet10/.../ResNet152, etc.). We also show how to
approximate the metric by the more familiar Product Norm capacity measure, as
the average of the log Frobenius norm of the layer weight matrices. Our
approach requires no changes to the underlying DNN or its loss function; it
does not require us to train a model (although it could be used to monitor
training); and it does not even require access to the ImageNet data.
Comment: Updated as will appear in SDM2
Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning
Random Matrix Theory (RMT) is applied to analyze weight matrices of Deep
Neural Networks (DNNs), including both production quality, pre-trained models
such as AlexNet and Inception, and smaller models trained from scratch, such as
LeNet5 and a miniature-AlexNet. Empirical and theoretical results clearly
indicate that the DNN training process itself implicitly implements a form of
Self-Regularization. The empirical spectral density (ESD) of DNN layer matrices
displays signatures of traditionally-regularized statistical models, even in
the absence of exogenously specifying traditional forms of explicit
regularization. Building on relatively recent results in RMT, most notably its
extension to Universality classes of Heavy-Tailed matrices, we develop a theory
to identify 5+1 Phases of Training, corresponding to increasing amounts of
Implicit Self-Regularization. These phases can be observed during the training
process as well as in the final learned DNNs. For smaller and/or older DNNs,
this Implicit Self-Regularization is like traditional Tikhonov regularization,
in that there is a "size scale" separating signal from noise. For
state-of-the-art DNNs, however, we identify a novel form of Heavy-Tailed
Self-Regularization, similar to the self-organization seen in the statistical
physics of disordered systems. This results from correlations arising at all
size scales, which arise implicitly due to the training process itself. This
implicit Self-Regularization can depend strongly on the many knobs of the
training process. By exploiting the generalization gap phenomenon, we
demonstrate that we can cause a small model to exhibit all 5+1 phases of
training simply by changing the batch size. This demonstrates that, all else
being equal, DNN optimization with larger batch sizes leads to less-well
implicitly-regularized models, and it provides an explanation for the
generalization gap phenomenon.
Comment: 59 pages, 31 figures
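For the bulk-plus-spikes phases described above, the basic diagnostic is
whether eigenvalues of a layer's correlation matrix escape the
Marchenko-Pastur (MP) bulk predicted for a purely random matrix. Below is a
minimal sketch of that check under assumed conventions (W of shape N x M with
N >= M and unit element variance); a well-trained layer typically shows spikes
or a heavy tail beyond the bulk edge, while this random stand-in should not.

```python
import numpy as np

def mp_bulk_edges(N, M, sigma2=1.0):
    """Marchenko-Pastur bulk edges for eigenvalues of W^T W / N, W of shape (N, M)."""
    Q = N / M  # aspect ratio, assuming N >= M
    lam_minus = sigma2 * (1.0 - 1.0 / np.sqrt(Q)) ** 2
    lam_plus = sigma2 * (1.0 + 1.0 / np.sqrt(Q)) ** 2
    return lam_minus, lam_plus

rng = np.random.default_rng(1)
N, M = 1000, 400
W = rng.standard_normal((N, M))  # substitute a trained layer's weight matrix here
eigs = np.linalg.eigvalsh(W.T @ W / N)
_, lam_plus = mp_bulk_edges(N, M)
spikes = eigs[eigs > lam_plus]
print(f"{spikes.size} eigenvalues above the MP bulk edge {lam_plus:.3f}")
```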
Traditional and Heavy-Tailed Self Regularization in Neural Network Models
Random Matrix Theory (RMT) is applied to analyze the weight matrices of Deep
Neural Networks (DNNs), including both production quality, pre-trained models
such as AlexNet and Inception, and smaller models trained from scratch, such as
LeNet5 and a miniature-AlexNet. Empirical and theoretical results clearly
indicate that the empirical spectral density (ESD) of DNN layer matrices
displays signatures of traditionally-regularized statistical models, even in
the absence of exogenously specifying traditional forms of regularization, such
as Dropout or Weight Norm constraints. Building on recent results in RMT, most
notably its extension to Universality classes of Heavy-Tailed matrices, we
develop a theory to identify \emph{5+1 Phases of Training}, corresponding to
increasing amounts of \emph{Implicit Self-Regularization}. For smaller and/or
older DNNs, this Implicit Self-Regularization is like traditional Tikhonov
regularization, in that there is a `size scale' separating signal from noise.
For state-of-the-art DNNs, however, we identify a novel form of
\emph{Heavy-Tailed Self-Regularization}, similar to the self-organization seen
in the statistical physics of disordered systems. This implicit
Self-Regularization can depend strongly on the many knobs of the training
process. By exploiting the generalization gap phenomenon, we demonstrate that we
can cause a small model to exhibit all 5+1 phases of training simply by
changing the batch size.
Comment: Very abridged version of arXiv:1810.0107
Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data
In many applications, one works with neural network models trained by someone
else. For such pretrained models, one may not have access to training data or
test data. Moreover, one may not know details about the model, e.g., the
specifics of the training data, the loss function, the hyperparameter values,
etc. Given one or many pretrained models, it is a challenge to say anything
about the expected performance or quality of the models. Here, we address this
challenge by providing a detailed meta-analysis of hundreds of
publicly-available pretrained models. We examine norm-based capacity control
metrics as well as power-law-based metrics from the recently-developed Theory
of Heavy-Tailed Self-Regularization. We find that norm-based metrics correlate
well with reported test accuracies for well-trained models, but that they often
cannot distinguish well-trained versus poorly-trained models. We also find that
power-law-based metrics can do much better: quantitatively better at
discriminating among series of well-trained models with a given architecture,
and qualitatively better at discriminating well-trained versus poorly-trained
models. These methods can be used to identify when a pretrained neural network
has problems that cannot be detected simply by examining training/test
accuracies.
Comment: 35 pages, 8 tables, 17 figures. To appear in Nature Communications
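One toy way to see how a norm-based metric and a power-law metric can
disagree, under my own assumptions (a Pareto-entry matrix standing in for a
heavy-tailed trained layer, a crude Hill-style exponent fit): trivially
rescaling a layer shifts its log Frobenius norm but leaves the fitted tail
exponent unchanged, because the exponent depends only on eigenvalue ratios.
This illustrates scale sensitivity; it is not the paper's experiment.

```python
import numpy as np

def hill_alpha(W, k=40):
    """Power-law tail exponent of the ESD of W^T W / N via a crude Hill fit."""
    eigs = np.sort(np.linalg.eigvalsh(W.T @ W / W.shape[0]))[-k:]
    return 1.0 + k / np.sum(np.log(eigs / eigs[0]))

rng = np.random.default_rng(2)
W = rng.pareto(2.5, size=(400, 400))  # heavy-tailed stand-in for a trained layer

for c in (1.0, 10.0):
    Wc = c * W  # rescaling shifts the norm but not the eigenvalue ratios
    print(f"scale={c:4.1f}  log||W||_F={np.log(np.linalg.norm(Wc)):6.2f}  "
          f"alpha={hill_alpha(Wc):.2f}")
```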
Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior
We describe an approach to understand the peculiar and counterintuitive
generalization properties of deep neural networks. The approach involves going
beyond worst-case theoretical capacity control frameworks that have been
popular in machine learning in recent years to revisit old ideas in the
statistical mechanics of neural networks. Within this approach, we present a
prototypical Very Simple Deep Learning (VSDL) model, whose behavior is
controlled by two control parameters, one describing an effective amount of
data, or load, on the network (that decreases when noise is added to the
input), and one with an effective temperature interpretation (that increases
when algorithms are early stopped). Using this model, we describe how a very
simple application of ideas from the statistical mechanics theory of
generalization provides a strong qualitative description of recently-observed
empirical results regarding the inability of deep neural networks to avoid
overfitting training data, discontinuous learning, and sharp transitions in the
generalization properties of learning algorithms, etc.
Comment: 31 pages; added brief discussion of recent papers that use/extend these ideas
Periodic Spectral Ergodicity: A Complexity Measure for Deep Neural Networks and Neural Architecture Search
Establishing associations between the structure and the generalisation
ability of deep neural networks (DNNs) is a challenging task in modern machine
learning. Producing solutions to this challenge will bring progress both in the
theoretical understanding of DNNs and in building new architectures
efficiently. In this work, we address this challenge by developing a new
complexity measure based on the concept of Periodic Spectral Ergodicity (PSE),
originating from quantum statistical mechanics. Based on this measure, a
technique is devised to quantify the complexity of a deep neural network from
its learned weights by traversing the network connectivity in a sequential
manner, hence the term cascading PSE (cPSE), an empirical complexity measure.
This measure captures both topological and internal neural processing
complexity simultaneously. Because of this cascading approach, i.e., a
symmetric divergence of PSE on consecutive layers, it is possible to use
this measure for Neural Architecture Search (NAS). We demonstrate the
usefulness of this measure in practice on two sets of vision models, ResNet and
VGG, and sketch the computation of cPSE for more complex network structures.
Comment: 9 pages, 5 figures
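The toy sketch below gives the flavor of a cascading, divergence-based
spectral measure; it is not the paper's cPSE. It assumes equal-width layers,
normalizes each layer's singular-value spectrum to a probability vector, and
sums a symmetric KL divergence over consecutive layers, whereas the actual
construction uses periodic spectral ergodicity over cumulative eigenvalue
sets, with periodic padding for unequal layer sizes.

```python
import numpy as np

def spectrum(W):
    """Normalized singular-value spectrum of a layer's weight matrix."""
    s = np.linalg.svd(W, compute_uv=False)
    return s / s.sum()

def sym_kl(p, q, eps=1e-12):
    """Symmetric KL divergence between two discrete spectra of equal length."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

rng = np.random.default_rng(3)
# Toy stand-ins for the weight matrices of six consecutive layers.
layers = [rng.standard_normal((256, 256)) for _ in range(6)]
spectra = [spectrum(W) for W in layers]
cpse = sum(sym_kl(spectra[i], spectra[i + 1]) for i in range(len(spectra) - 1))
print(f"cascading spectral divergence (toy cPSE): {cpse:.4f}")
```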
Machine learning identifies scale-free properties in disordered materials
The vast amount of design freedom in disordered systems expands the parameter
space for signal processing, allowing for unique signal flows that are
distinguished from those in regular systems. However, this large degree of
freedom has hindered the deterministic design of disordered systems for target
functionalities. Here, we employ a machine learning (ML) approach for
predicting and designing wave-matter interactions in disordered structures,
thereby identifying scale-free properties for waves. To abstract and map the
features of wave behaviours and disordered structures, we develop
disorder-to-localization and localization-to-disorder convolutional neural
networks (CNNs). Each CNN enables the instantaneous prediction of wave
localization in disordered structures and the instantaneous generation of
disordered structures from given localizations. We demonstrate that
CNN-generated disordered structures have scale-free properties with heavy tails
and hub atoms, which exhibit an increase of multiple orders of magnitude in
robustness to accidental defects, such as material or structural imperfections.
Our results verify the critical role of ML network structures in determining
ML-generated real-space structures, which can be used in the design of
defect-immune and efficiently tunable devices.
Comment: 44 pages, 15 figures
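A minimal skeleton of the disorder-to-localization mapping described above,
with every detail (1D refractive-index profiles, channel counts, kernel sizes)
a placeholder assumption rather than the paper's architecture:

```python
import torch
import torch.nn as nn

class DisorderToLocalization(nn.Module):
    """Maps a disordered-structure profile to a wave-localization profile."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 1, kernel_size=5, padding=2),
        )

    def forward(self, x):
        return self.net(x)

model = DisorderToLocalization()
structure = torch.rand(8, 1, 128)  # batch of random 1D disorder profiles
localization = model(structure)    # instantaneous localization prediction
print(localization.shape)          # torch.Size([8, 1, 128])
```

The inverse localization-to-disorder network would present the same interface,
mapping localization profiles back to candidate structures.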
Fractional Underdamped Langevin Dynamics: Retargeting SGD with Momentum under Heavy-Tailed Gradient Noise
Stochastic gradient descent with momentum (SGDm) is one of the most popular
optimization algorithms in deep learning. While there is a rich theory of SGDm
for convex problems, the theory is considerably less developed in the context
of deep learning where the problem is non-convex and the gradient noise might
exhibit a heavy-tailed behavior, as empirically observed in recent studies. In
this study, we consider a \emph{continuous-time} variant of SGDm, known as the
underdamped Langevin dynamics (ULD), and investigate its asymptotic properties
under heavy-tailed perturbations. Supported by recent studies from statistical
physics, we argue both theoretically and empirically that the heavy-tails of
such perturbations can result in a bias even when the step-size is small, in
the sense that \emph{the optima of the stationary distribution} of the dynamics
might not match \emph{the optima of the cost function to be optimized}. As a
remedy, we develop a novel framework, which we coin as \emph{fractional} ULD
(FULD), and prove that FULD targets the so-called Gibbs distribution, whose
optima exactly match the optima of the original cost. We observe that the Euler
discretization of FULD has noteworthy algorithmic similarities with
\emph{natural gradient} methods and \emph{gradient clipping}, bringing a new
perspective on understanding their role in deep learning. We support our theory
with experiments conducted on a synthetic model and neural networks.
Comment: 20 pages, published at the International Conference on Machine Learning 2020
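To make the heavy-tailed setting concrete, here is a small simulation of
underdamped Langevin dynamics driven by alpha-stable noise on a quadratic
cost, using scipy's levy_stable sampler. All constants are my own choices, and
the FULD drift correction itself (a fractional modification of the gradient
term) is not implemented; the point is only that the iterates inherit the
power-law tail of the perturbations, which a crude Hill estimate of the tail
index makes visible.

```python
import numpy as np
from scipy.stats import levy_stable

eta, gamma, alpha, n_steps = 1e-2, 1.0, 1.7, 200_000
# Pre-sample symmetric alpha-stable increments, scaled by eta**(1/alpha).
noise = levy_stable.rvs(alpha, 0.0, size=n_steps, random_state=42) * eta ** (1.0 / alpha)

x, v, xs = 0.0, 0.0, np.empty(n_steps)
for k in range(n_steps):
    v += -eta * (gamma * v + x) + noise[k]  # gradient of f(x) = x**2 / 2 is x
    x += eta * v
    xs[k] = x

# Hill estimate of the tail index of |x|: it lands near the noise's stability
# index alpha, i.e. the iterates are themselves heavy-tailed.
tail = np.sort(np.abs(xs))[-2000:]
print(f"estimated tail index: {1.0 / np.mean(np.log(tail / tail[0])):.2f}")
```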
Multiplicative noise and heavy tails in stochastic optimization
Although stochastic optimization is central to modern machine learning, the
precise mechanisms underlying its success, and in particular, the precise role
of the stochasticity, still remain unclear. Modelling stochastic optimization
algorithms as discrete random recurrence relations, we show that multiplicative
noise, as it commonly arises due to variance in local rates of convergence,
results in heavy-tailed stationary behaviour in the parameters. A detailed
analysis is conducted for SGD applied to a simple linear regression problem,
followed by theoretical results for a much larger class of models (including
non-linear and non-convex) and optimizers (including momentum, Adam, and
stochastic Newton), demonstrating that our qualitative results hold much more
generally. In each case, we describe dependence on key factors, including step
size, batch size, and data variability, all of which exhibit similar
qualitative behavior to recent empirical results on state-of-the-art neural
network models from computer vision and natural language processing.
Furthermore, we empirically demonstrate how multiplicative noise and
heavy-tailed structure improve capacity for basin hopping and exploration of
non-convex loss surfaces, over commonly-considered stochastic dynamics with
only additive noise and light-tailed structure.
Comment: 30 pages, 7 figures
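The core mechanism admits a compact illustration. For one-dimensional least
squares with random data, single-sample SGD is the random linear recursion
x_{k+1} = (1 - eta * a_k^2) * x_k + eta * a_k * b_k, a Kesten-type recurrence
with multiplicative noise; for a large enough step size the stationary
distribution has a power-law tail even though a_k and b_k are Gaussian. The
step size and sample sizes below are my own choices, picked to land in that
regime.

```python
import numpy as np

rng = np.random.default_rng(5)
eta, n_steps, burn = 0.7, 200_000, 1_000

# Data for min_x E[(a * x - b)**2 / 2]; both ingredients are light-tailed.
a = rng.standard_normal(n_steps)
b = rng.standard_normal(n_steps)

x, xs = 0.0, np.empty(n_steps)
for k in range(n_steps):
    # Multiplicative noise: the factor (1 - eta * a_k**2) fluctuates around 1.
    x = (1.0 - eta * a[k] ** 2) * x + eta * a[k] * b[k]
    xs[k] = x

# A finite Hill tail-index estimate over |x| signals a power-law stationary tail.
tail = np.sort(np.abs(xs[burn:]))[-2000:]
print(f"Hill tail-index estimate: {1.0 / np.mean(np.log(tail / tail[0])):.2f}")
```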
Knowledge Capture and Replay for Continual Learning
Deep neural networks have shown promise in several domains, and the learned
task-specific information is implicitly stored in the network parameters. It
will be vital to utilize representations from these networks for downstream
tasks such as continual learning. In this paper, we introduce the notion of
{\em flashcards} that are visual representations to {\em capture} the encoded
knowledge of a network, as a function of random image patterns. We demonstrate
the effectiveness of flashcards in capturing representations and show that they
are efficient replay methods for the general and task-agnostic continual
learning setting. Thus, while adapting to a new task, a limited number of
constructed flashcards help to prevent catastrophic forgetting of the
previously learned tasks. Most interestingly, such flashcards neither require
external memory storage nor need to be accumulated over multiple tasks; they
only need to be constructed just before learning the subsequent new task,
irrespective of the number of tasks trained before, and are hence task
agnostic. We first
demonstrate the efficacy of flashcards in capturing knowledge representation
from a trained network, and empirically validate the efficacy of flashcards on
a variety of continual learning tasks: continual unsupervised reconstruction,
continual denoising, and new-instance learning classification, using a number
of heterogeneous benchmark datasets. These studies also indicate that continual
learning algorithms with flashcards as the replay strategy perform better than
other state-of-the-art replay methods, and exhibit on-par performance with the
best possible baseline using coreset sampling, with the least additional
computational complexity and storage.
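A minimal sketch of the flashcard idea, under assumed details (a toy
autoencoder, a single forward pass, an MSE replay term); the paper's actual
construction passes random mosaic-like patterns through the trained network,
which is not reproduced here.

```python
import torch
import torch.nn as nn

# Toy autoencoder standing in for the network trained on the previous task.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 64), nn.ReLU(),
    nn.Linear(64, 784),
    nn.Unflatten(1, (1, 28, 28)),
)

def construct_flashcards(net, n_cards=32, shape=(1, 28, 28)):
    """Capture the network's knowledge as its responses to random image patterns."""
    with torch.no_grad():
        patterns = torch.rand(n_cards, *shape)
        return patterns, net(patterns)

# Just before learning the next task: build flashcards from the current model ...
inputs, targets = construct_flashcards(model)

# ... then, during new-task training, add a replay term that pins the model's
# responses on the flashcards, to mitigate catastrophic forgetting.
replay_loss = nn.functional.mse_loss(model(inputs), targets)
print(replay_loss.item())  # zero here, since the model has not yet been updated
```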