Traditional and Heavy-Tailed Self Regularization in Neural Network Models
Random Matrix Theory (RMT) is applied to analyze the weight matrices of Deep
Neural Networks (DNNs), including both production quality, pre-trained models
such as AlexNet and Inception, and smaller models trained from scratch, such as
LeNet5 and a miniature-AlexNet. Empirical and theoretical results clearly
indicate that the empirical spectral density (ESD) of DNN layer matrices
displays signatures of traditionally-regularized statistical models, even in
the absence of exogenously specifying traditional forms of regularization, such
as Dropout or Weight Norm constraints. Building on recent results in RMT, most
notably its extension to Universality classes of Heavy-Tailed matrices, we
develop a theory to identify \emph{5+1 Phases of Training}, corresponding to
increasing amounts of \emph{Implicit Self-Regularization}. For smaller and/or
older DNNs, this Implicit Self-Regularization is like traditional Tikhonov
regularization, in that there is a `size scale' separating signal from noise.
For state-of-the-art DNNs, however, we identify a novel form of
\emph{Heavy-Tailed Self-Regularization}, similar to the self-organization seen
in the statistical physics of disordered systems. This implicit
Self-Regularization can depend strongly on the many knobs of the training
process. By exploiting the generalization gap phenomenon, we demonstrate that we can cause a small model to exhibit all 5+1 phases of training simply by changing the batch size.
Comment: Very abridged version of arXiv:1810.0107
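As a companion to the abstract, here is a minimal sketch (Python with numpy/matplotlib, not the authors' code) of the basic measurement it describes: for an N x M layer weight matrix W, the ESD is the distribution of eigenvalues of the correlation matrix X = W^T W / N. The Gaussian matrix below is an illustrative stand-in for a real trained layer.

```python
import numpy as np
import matplotlib.pyplot as plt

def empirical_spectral_density(W):
    """Eigenvalues of the normalized correlation matrix X = W^T W / N."""
    N, M = W.shape
    return np.linalg.eigvalsh(W.T @ W / N)

# Illustrative stand-in for a trained layer: i.i.d. Gaussian noise, whose ESD
# follows the Marchenko-Pastur law (the random-like baseline phase) rather
# than a heavy-tailed one.
rng = np.random.default_rng(0)
W = rng.normal(0.0, 1.0, size=(1000, 500))
evals = empirical_spectral_density(W)

plt.hist(evals, bins=100, density=True)
plt.xlabel("eigenvalue")
plt.ylabel("empirical spectral density")
plt.show()
```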
Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning
Random Matrix Theory (RMT) is applied to analyze weight matrices of Deep
Neural Networks (DNNs), including both production quality, pre-trained models
such as AlexNet and Inception, and smaller models trained from scratch, such as
LeNet5 and a miniature-AlexNet. Empirical and theoretical results clearly
indicate that the DNN training process itself implicitly implements a form of
Self-Regularization. The empirical spectral density (ESD) of DNN layer matrices
displays signatures of traditionally-regularized statistical models, even in
the absence of exogenously specifying traditional forms of explicit
regularization. Building on relatively recent results in RMT, most notably its
extension to Universality classes of Heavy-Tailed matrices, we develop a theory
to identify 5+1 Phases of Training, corresponding to increasing amounts of
Implicit Self-Regularization. These phases can be observed during the training
process as well as in the final learned DNNs. For smaller and/or older DNNs,
this Implicit Self-Regularization is like traditional Tikhonov regularization,
in that there is a "size scale" separating signal from noise. For
state-of-the-art DNNs, however, we identify a novel form of Heavy-Tailed
Self-Regularization, similar to the self-organization seen in the statistical
physics of disordered systems. This results from correlations arising at all
size scales, which emerge implicitly from the training process itself. This
implicit Self-Regularization can depend strongly on the many knobs of the
training process. By exploiting the generalization gap phenomenon, we demonstrate that we can cause a small model to exhibit all 5+1 phases of training simply by changing the batch size. This demonstrates that, all else being equal, DNN optimization with larger batch sizes leads to less well implicitly regularized models, and it provides an explanation for the generalization gap phenomenon.
Comment: 59 pages, 31 figures
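A hedged sketch of how the heavy-tailed signature might be quantified: fit the tail exponent of the ESD, here with a simple Hill-type maximum-likelihood estimator. The estimator and the tail fraction are illustrative assumptions, not the paper's exact fitting procedure.

```python
import numpy as np

def hill_alpha(eigenvalues, tail_frac=0.1):
    """Hill-type ML estimate of alpha for a density with tail rho(x) ~ x^(-alpha)."""
    lam = np.sort(np.asarray(eigenvalues))[::-1]  # descending
    k = max(2, int(tail_frac * len(lam)))
    tail, x_min = lam[:k], lam[k - 1]
    return 1.0 + k / np.sum(np.log(tail / x_min))
```

In the 5+1-phase picture, a progressively smaller fitted exponent corresponds to a heavier tail and stronger implicit self-regularization.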
Heavy-Tailed Universality Predicts Trends in Test Accuracies for Very Large Pre-Trained Deep Neural Networks
Given two or more Deep Neural Networks (DNNs) with the same or similar
architectures, and trained on the same dataset, but trained with different
solvers, parameters, hyper-parameters, regularization, etc., can we predict
which DNN will have the best test accuracy, and can we do so without peeking at
the test data? In this paper, we show how to use a new Theory of Heavy-Tailed
Self-Regularization (HT-SR) to answer this. HT-SR suggests, among other things,
that modern DNNs exhibit what we call Heavy-Tailed Mechanistic Universality
(HT-MU), meaning that the correlations in the layer weight matrices can be fit
to a power law (PL) with exponents that lie in common Universality classes from
Heavy-Tailed Random Matrix Theory (HT-RMT). From this, we develop a Universal
capacity control metric that is a weighted average of PL exponents. Rather than
considering small toy NNs, we examine over 50 different, large-scale
pre-trained DNNs, ranging over 15 different architectures, trained on
ImageNet, each of which has a different reported test accuracy.
We show that this new capacity metric correlates very well with the reported
test accuracies of these DNNs, looking across each architecture
(VGG16/.../VGG19, ResNet10/.../ResNet152, etc.). We also show how to
approximate the metric by the more familiar Product Norm capacity measure, as
the average of the log Frobenius norm of the layer weight matrices. Our
approach requires no changes to the underlying DNN or its loss function, it
does not require us to train a model (although it could be used to monitor
training), and it does not even require access to the ImageNet data.
Comment: Updated as will appear in SDM2
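A hedged sketch of the two capacity measures the abstract relates, applied to a list of layer weight matrices (smaller values of either metric would predict better test accuracy under HT-SR). The Hill estimator and the log lambda_max weighting are illustrative assumptions, not necessarily the paper's exact choices.

```python
import numpy as np

def hill_alpha(eigenvalues, tail_frac=0.1):
    """Hill-type tail-exponent estimate (as sketched above)."""
    lam = np.sort(np.asarray(eigenvalues))[::-1]
    k = max(2, int(tail_frac * len(lam)))
    return 1.0 + k / np.sum(np.log(lam[:k] / lam[k - 1]))

def weighted_alpha(weight_matrices, tail_frac=0.1):
    """Weighted average of per-layer power-law exponents; the log lambda_max
    weighting is one plausible choice."""
    alphas, weights = [], []
    for W in weight_matrices:
        lam = np.linalg.eigvalsh(W.T @ W / W.shape[0])
        alphas.append(hill_alpha(lam, tail_frac))
        weights.append(np.log10(lam.max()))
    return float(np.average(alphas, weights=weights))

def avg_log_frobenius(weight_matrices):
    """Product-norm surrogate: average log Frobenius norm over layers."""
    return float(np.mean([np.log(np.linalg.norm(W)) for W in weight_matrices]))
```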
Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior
We describe an approach to understand the peculiar and counterintuitive
generalization properties of deep neural networks. The approach involves going
beyond worst-case theoretical capacity control frameworks that have been
popular in machine learning in recent years to revisit old ideas in the
statistical mechanics of neural networks. Within this approach, we present a
prototypical Very Simple Deep Learning (VSDL) model, whose behavior is
controlled by two control parameters, one describing an effective amount of
data, or load, on the network (that decreases when noise is added to the
input), and one with an effective temperature interpretation (that increases
when algorithms are early stopped). Using this model, we describe how a very
simple application of ideas from the statistical mechanics theory of
generalization provides a strong qualitative description of recently-observed
empirical results regarding the inability of deep neural networks to avoid overfitting training data, discontinuous learning, and sharp transitions in the
generalization properties of learning algorithms, etc.
Comment: 31 pages; added brief discussion of recent papers that use/extend these ideas
L1 regularization is better than L2 for learning and predicting chaotic systems
Emergent behaviors are the focus of much recent research interest. It is therefore of considerable importance to investigate which optimizations suit the learning
and prediction of chaotic systems, the putative candidates for emergence. We
have compared L1 and L2 regularizations on predicting chaotic time series using
linear recurrent neural networks. The internal representation and the weights
of the networks were optimized in a unifying framework. Computational tests on
different problems indicate considerable advantages for the L1 regularization:
it achieved considerably shorter learning times and better interpolation capabilities.
We argue that viewing optimization as maximum likelihood estimation justifies our results, because L1 regularization better fits heavy-tailed distributions -- an apparently general feature of emergent systems.
Comment: 13 pages, 4 figures
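A toy illustration (not the authors' recurrent-network setup) of the L1-versus-L2 comparison: one-step-ahead prediction of the chaotic logistic map from lagged values and their squares, so the true map y = 4x(1-x) is exactly representable and L1 can zero out the irrelevant features.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Chaotic logistic map: x_{t+1} = 4 x_t (1 - x_t)
x = np.empty(2000)
x[0] = 0.37
for t in range(1999):
    x[t + 1] = 4.0 * x[t] * (1.0 - x[t])

lags = 5
feats = np.column_stack([x[i:len(x) - lags + i] for i in range(lags)])
X = np.hstack([feats, feats ** 2])   # lagged values and their squares
y = x[lags:]
X_tr, X_te, y_tr, y_te = X[:1500], X[1500:], y[:1500], y[1500:]

for name, model in [("L1 (Lasso)", Lasso(alpha=1e-5, max_iter=50_000)),
                    ("L2 (Ridge)", Ridge(alpha=1e-5))]:
    model.fit(X_tr, y_tr)
    mse = np.mean((model.predict(X_te) - y_te) ** 2)
    nnz = np.sum(np.abs(model.coef_) > 1e-6)
    print(f"{name}: test MSE {mse:.3g}, nonzero coefficients {nnz}/{X.shape[1]}")
```

The sparsity count makes the L1 advantage visible: the true map depends on only the most recent lag and its square.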
A Generic Network Compression Framework for Sequential Recommender Systems
Sequential recommender systems (SRS) have become the key technology in
capturing users' dynamic interests and generating high-quality recommendations.
Current state-of-the-art sequential recommender models are typically based on a
sandwich-structured deep neural network, where one or more middle (hidden)
layers are placed between the input embedding layer and output softmax layer.
In general, these models require a large number of parameters (such as using a
large embedding dimension or a deep network architecture) to obtain their
optimal performance. Despite their effectiveness, at some point further increases in model size make deployment on resource-constrained devices harder, resulting in longer response times and a larger memory footprint. To resolve these issues, we propose a compressed sequential recommendation framework, termed CpRec, where two generic model-shrinking techniques are
employed. Specifically, we first propose a block-wise adaptive decomposition to
approximate the input and softmax matrices by exploiting the fact that items in
SRS obey a long-tailed distribution. To reduce the parameters of the middle
layers, we introduce three layer-wise parameter sharing schemes. We instantiate
CpRec using a deep convolutional neural network with dilated kernels, considering both recommendation accuracy and efficiency. Through extensive ablation studies, we demonstrate that the proposed CpRec can achieve up to 4-8 times compression rates on real-world SRS datasets. Meanwhile, CpRec is faster during training/inference and in most cases outperforms its uncompressed counterpart.
Comment: Accepted by SIGIR202
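A hedged sketch of the block-wise adaptive decomposition idea for the input embedding: items are partitioned by frequency rank, and rarer blocks get narrower embeddings projected up to the shared model dimension. Block sizes and dimensions below are illustrative assumptions, not CpRec's published configuration.

```python
import torch
import torch.nn as nn

class BlockwiseEmbedding(nn.Module):
    """Head (frequent) items get full-width embeddings; tail blocks get
    narrower ones projected up to the shared model dimension."""

    def __init__(self, block_sizes=(10_000, 90_000, 900_000),
                 block_dims=(128, 32, 8), model_dim=128):
        super().__init__()
        self.model_dim = model_dim
        self.bounds, self.embeds, self.projs = [], nn.ModuleList(), nn.ModuleList()
        start = 0
        for n, d in zip(block_sizes, block_dims):
            self.bounds.append((start, start + n))
            self.embeds.append(nn.Embedding(n, d))
            self.projs.append(nn.Identity() if d == model_dim
                              else nn.Linear(d, model_dim, bias=False))
            start += n

    def forward(self, item_ids):
        # item_ids are assumed ordered so that lower ids are more frequent items
        out = item_ids.new_zeros(*item_ids.shape, self.model_dim, dtype=torch.float)
        for (lo, hi), emb, proj in zip(self.bounds, self.embeds, self.projs):
            mask = (item_ids >= lo) & (item_ids < hi)
            if mask.any():
                out[mask] = proj(emb(item_ids[mask] - lo))
        return out

# Example: a batch of 32 sequences of length 20 over one million items
emb = BlockwiseEmbedding()
vectors = emb(torch.randint(0, 1_000_000, (32, 20)))  # -> (32, 20, 128)
```

With these illustrative sizes the table holds roughly an order of magnitude fewer parameters than a full 1M x 128 embedding.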
ShrinkTeaNet: Million-scale Lightweight Face Recognition via Shrinking Teacher-Student Networks
Large-scale face recognition in the wild has recently achieved mature performance in many real-world applications. However, such systems are built on GPU platforms and mostly deploy heavy deep network architectures. Given a
high-performance heavy network as a teacher, this work presents a simple and
elegant teacher-student learning paradigm, namely ShrinkTeaNet, to train a
portable student network that has significantly fewer parameters and
competitive accuracy against the teacher network. Unlike prior teacher-student frameworks that focus mainly on accuracy and compression ratios in closed-set problems, our proposed teacher-student network proves more robust on open-set problems, i.e., large-scale face recognition. In
addition, this work introduces a novel Angular Distillation Loss for distilling
the feature direction and the sample distributions of the teacher's hypersphere
to its student. The ShrinkTeaNet framework can then efficiently guide the student's
learning process with the teacher's knowledge presented in both intermediate
and last stages of the feature embedding. Evaluations on LFW, CFP-FP, AgeDB,
IJB-B and IJB-C Janus, and MegaFace with one million distractors have
demonstrated the efficiency of the proposed approach to learn robust student
networks with satisfying accuracy and compact sizes. Our ShrinkTeaNet enables a lightweight architecture to achieve high performance, with 99.77% on LFW and 95.64% on the large-scale MegaFace protocol.
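A hedged sketch of a direction-based distillation term in the spirit of the Angular Distillation Loss: penalize the angle between L2-normalized teacher and student embeddings rather than their raw feature distance. The exact ShrinkTeaNet formulation may differ.

```python
import torch
import torch.nn.functional as F

def angular_distillation_loss(student_feat, teacher_feat):
    """Mean (1 - cosine similarity) between unit-normalized embeddings,
    so only the feature direction on the hypersphere is distilled."""
    s = F.normalize(student_feat, dim=-1)
    t = F.normalize(teacher_feat, dim=-1).detach()  # teacher is frozen
    return (1.0 - (s * t).sum(dim=-1)).mean()
```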
Deep Learning for Energy Markets
Deep Learning is applied to energy markets to predict extreme loads observed
in energy grids. Forecasting energy loads and prices is challenging due to sharp peaks and troughs that arise from supply and demand fluctuations under intraday system constraints. We propose deep spatio-temporal models and extreme value theory (EVT) to capture these effects, and in particular the tail behavior of load spikes. Deep LSTM architectures with ReLU and $\tanh$ activation functions can model trends and temporal dependencies, while EVT
captures highly volatile load spikes above a pre-specified threshold. To
illustrate our methodology, we use hourly price and demand data from 4719 nodes
of the PJM interconnection, and we construct a deep predictor. We show that
DL-EVT outperforms traditional Fourier time series methods, both in- and out-of-sample, by capturing the observed nonlinearities in prices. Finally, we conclude with directions for future research.
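A hedged sketch of the EVT component: fit a generalized Pareto distribution to exceedances above a pre-specified threshold (peaks over threshold), as is standard in extreme value theory. The synthetic lognormal "load" series and the 95% threshold are illustrative assumptions; the deep LSTM trend model is omitted.

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(1)
loads = rng.lognormal(mean=3.0, sigma=0.6, size=20_000)  # stand-in for hourly loads

u = np.quantile(loads, 0.95)          # pre-specified high threshold
excess = loads[loads > u] - u         # peaks over threshold
shape, _, scale = genpareto.fit(excess, floc=0.0)

# Tail quantile implied by the fit: P(X > u + z) = zeta_u * (1 - F_GPD(z)),
# with zeta_u the empirical exceedance rate.
zeta_u = excess.size / loads.size
z = genpareto.ppf(1.0 - (1.0 - 0.999) / zeta_u, shape, loc=0.0, scale=scale)
print(f"threshold u = {u:.1f}, implied 99.9th-percentile load = {u + z:.1f}")
```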
Designing Accurate Emulators for Scientific Processes using Calibration-Driven Deep Models
Predictive models that accurately emulate complex scientific processes can
achieve exponential speed-ups over numerical simulators or experiments, and at
the same time provide surrogates for improving the subsequent analysis.
Consequently, there is a recent surge in utilizing modern machine learning (ML)
methods, such as deep neural networks, to build data-driven emulators. While
the majority of existing efforts have focused on tailoring off-the-shelf ML
solutions to better suit the scientific problem at hand, we study an often
overlooked, yet important, problem of choosing loss functions to measure the
discrepancy between observed data and the predictions from a model. Due to the lack of better priors on the expected residual structure, simple choices such as the mean squared error and the mean absolute error are made in practice.
However, the inherent symmetric noise assumption made by these loss functions
makes them inappropriate in cases where the data is heterogeneous or when the
noise distribution is asymmetric. We propose Learn-by-Calibrating (LbC), a
novel deep learning approach based on interval calibration for designing
emulators in scientific applications that are effective even with
heterogeneous data and are robust to outliers. Using a large suite of
use-cases, we show that LbC provides significant improvements in generalization
error over widely-adopted loss-function choices, achieves high-quality emulators even in small-data regimes, and, more importantly, recovers the inherent noise structure without any explicit priors.
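A hedged sketch of the interval ingredient behind interval calibration: a pair of quantile (pinball) losses yields a prediction interval whose empirical coverage can be checked against the target level. This is generic interval estimation for illustration only; the actual LbC objective is more involved.

```python
import torch

def pinball_loss(pred, target, tau):
    """Quantile (pinball) loss at quantile level tau."""
    err = target - pred
    return torch.mean(torch.maximum(tau * err, (tau - 1.0) * err))

def interval_loss(lo_pred, hi_pred, target, alpha=0.1):
    """Train two heads toward the alpha/2 and 1 - alpha/2 quantiles,
    forming a (1 - alpha) prediction interval."""
    return (pinball_loss(lo_pred, target, alpha / 2.0)
            + pinball_loss(hi_pred, target, 1.0 - alpha / 2.0))

def empirical_coverage(lo_pred, hi_pred, target):
    """Fraction of targets inside the predicted interval; a calibrated
    model matches this to 1 - alpha."""
    return ((target >= lo_pred) & (target <= hi_pred)).float().mean()
```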
Multiplicative noise and heavy tails in stochastic optimization
Although stochastic optimization is central to modern machine learning, the
precise mechanisms underlying its success, and in particular, the precise role
of the stochasticity, still remain unclear. Modelling stochastic optimization
algorithms as discrete random recurrence relations, we show that multiplicative
noise, as it commonly arises due to variance in local rates of convergence,
results in heavy-tailed stationary behaviour in the parameters. A detailed
analysis is conducted for SGD applied to a simple linear regression problem,
followed by theoretical results for a much larger class of models (including
non-linear and non-convex) and optimizers (including momentum, Adam, and
stochastic Newton), demonstrating that our qualitative results hold much more
generally. In each case, we describe dependence on key factors, including step
size, batch size, and data variability, all of which exhibit similar
qualitative behavior to recent empirical results on state-of-the-art neural
network models from computer vision and natural language processing.
Furthermore, we empirically demonstrate how multiplicative noise and
heavy-tailed structure improve capacity for basin hopping and exploration of
non-convex loss surfaces, over commonly considered stochastic dynamics with only additive noise and light-tailed structure.
Comment: 30 pages, 7 figures
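A hedged one-dimensional sketch of the random-recurrence view: SGD on least squares with data (a_k, b_k) gives x_{k+1} = (1 - eta a_k^2) x_k + eta a_k b_k, a multiplicative-plus-additive noise recursion of Kesten type whose stationary distribution can be heavy-tailed even though all noise sources are Gaussian. Step size and noise scales below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
eta, n_iters, burn_in = 0.5, 200_000, 1_000
x, samples = 0.0, []
for k in range(n_iters):
    a = rng.normal()            # regressor: source of multiplicative noise
    b = rng.normal(scale=0.1)   # observation noise: source of additive noise
    x = (1.0 - eta * a * a) * x + eta * a * b
    if k >= burn_in:
        samples.append(x)

# Heavy tails show up as a finite, small Hill tail-exponent estimate
tail = np.sort(np.abs(np.array(samples)))[-2000:]
alpha = 1.0 + tail.size / np.sum(np.log(tail / tail[0]))
print(f"estimated tail exponent alpha ~ {alpha:.2f}")
```

In this toy, increasing eta fattens the stationary tail (smaller alpha), mirroring the step-size dependence the abstract describes.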