Combining Machine Learning and Physics to Understand Glassy Systems
Our understanding of supercooled liquids and glasses has lagged significantly behind that of simple liquids and crystalline solids. This is due in part to the many possibly relevant degrees of freedom introduced by the disorder inherent to these systems, and in part to non-equilibrium effects that are difficult to treat in the standard framework of statistical physics. Together, these issues have produced a field whose theories are under-constrained by experiment and where fundamental questions remain unresolved. Mean field results, which assume uniform local structure, have been successful in infinite dimensions, but it is unclear to what extent they apply to realistic systems. At odds with this assumption are theories premised on the existence of structural defects; however, until recently it has been impossible to find structural signatures that are predictive of dynamics. Here we summarize and recast the results of several recent papers that offer a data-driven approach to building a phenomenological theory of disordered materials by combining machine learning with physical intuition.
Mean Field Residual Networks: On the Edge of Chaos
We study randomly initialized residual networks using mean field theory and
the theory of difference equations. Classical feedforward neural networks, such
as those with tanh activations, exhibit exponential behavior on the average
when propagating inputs forward or gradients backward. The exponential forward
dynamics causes rapid collapsing of the input space geometry, while the
exponential backward dynamics causes drastic vanishing or exploding gradients.
We show, in contrast, that by adding skip connections, the network will, depending on the nonlinearity, adopt subexponential, and in many cases polynomial, forward and backward dynamics. The exponents of these polynomials are obtained analytically and verified empirically. In terms of the "edge of chaos" hypothesis, these subexponential and polynomial laws allow residual networks to "hover over the boundary between stability and chaos," thus preserving the geometry of the input space and the flow of gradient information. In our experiments, for each activation function we study, we initialize residual networks with different hyperparameters and train them on MNIST. Remarkably, our initialization-time theory accurately predicts the test-time performance of these networks by tracking either the expected amount of gradient explosion or the expected squared distance between the images of two input vectors. Importantly, we show, theoretically as well as empirically, that common initializations such as the Xavier or He schemes are not optimal for residual networks, because the optimal initialization variances depend on depth. Finally, we make a mathematical contribution by deriving several new identities for the kernels of powers of ReLU functions, relating them to the zeroth Bessel function of the second kind.
Comment: NIPS 2017
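To make the forward-dynamics claim concrete, here is a minimal numpy sketch (our construction, not the authors' code) that propagates two nearby inputs through a random tanh network with and without skip connections and records the squared distance between their images; the width, depth, and weight scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth, sigma_w = 512, 50, 1.5

def pair_distance(residual):
    x1 = rng.normal(size=width)
    x2 = x1 + 1e-3 * rng.normal(size=width)   # a nearby second input
    dists = []
    for _ in range(depth):
        W = rng.normal(0.0, sigma_w / np.sqrt(width), (width, width))
        h1, h2 = np.tanh(W @ x1), np.tanh(W @ x2)
        x1 = x1 + h1 if residual else h1      # the skip connection toggles the dynamics
        x2 = x2 + h2 if residual else h2
        dists.append(np.sum((x1 - x2) ** 2))
    return np.array(dists)

# Per the abstract, the feedforward distances behave exponentially in depth,
# while the residual distances grow only subexponentially (often polynomially).
print(pair_distance(residual=False)[::10])
print(pair_distance(residual=True)[::10])
```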
The Emergence of Spectral Universality in Deep Networks
Recent work has shown that tight concentration of the entire spectrum of
singular values of a deep network's input-output Jacobian around one at
initialization can speed up learning by orders of magnitude. Therefore, to guide important design choices, it is essential to build a full theoretical
understanding of the spectra of Jacobians at initialization. To this end, we
leverage powerful tools from free probability theory to provide a detailed
analytic understanding of how a deep network's Jacobian spectrum depends on
various hyperparameters including the nonlinearity, the weight and bias
distributions, and the depth. For a variety of nonlinearities, our work reveals
the emergence of new universal limiting spectral distributions that remain
concentrated around one even as the depth goes to infinity.
Comment: 17 pages, 4 figures. Appearing at the 21st International Conference on Artificial Intelligence and Statistics (AISTATS) 2018
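The central object here is the input-output Jacobian at initialization. Below is a hedged numpy sketch (ours, not the paper's code) that builds this Jacobian layer by layer for a hard-tanh network and compares the spread of its singular values under Gaussian versus orthogonal weights; all sizes and scales are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, depth = 400, 50

def random_orthogonal(n):
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    return q * np.sign(np.diag(r))               # fix the QR sign ambiguity

def jacobian_singular_values(orthogonal):
    x = rng.normal(size=n)
    J = np.eye(n)
    for _ in range(depth):
        W = random_orthogonal(n) if orthogonal else rng.normal(0, 1 / np.sqrt(n), (n, n))
        pre = W @ x
        x = np.clip(pre, -1.0, 1.0)              # hard-tanh activation
        D = np.diag((np.abs(pre) < 1.0).astype(float))  # its pointwise derivative
        J = D @ W @ J                            # chain rule, one layer at a time
    return np.linalg.svd(J, compute_uv=False)

# Compare how tightly the two spectra concentrate; orthogonal weights
# typically keep the spectrum much closer to one at large depth.
for orth in (True, False):
    s = jacobian_singular_values(orth)
    print(orth, s.max(), s.min())
```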
Predicting plasticity with soft vibrational modes: from dislocations to glasses
We show that quasi-localized low-frequency modes in the vibrational spectrum
can be used to construct soft spots, or regions vulnerable to rearrangement,
which serve as a universal tool for the identification of flow defects in
solids. We show that soft spots not only encode spatial information, via their
location, but also directional information, via directors for particles within
each soft spot. Single crystals with isolated dislocations exhibit
low-frequency phonon modes that localize at the core, and their polarization
pattern predicts the motion of atoms during elementary dislocation glide in
exquisite detail. Even in polycrystals and disordered solids, we find that the
directors associated with particles in soft spots are highly correlated with
the direction of particle displacements in rearrangements.
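A minimal sketch of the pipeline the abstract describes, under our own simplifying assumptions: diagonalize a packing's dynamical matrix, keep the lowest-frequency non-trivial modes, and read off soft spots (the largest-amplitude particles) together with their directors (unit polarization vectors). The `hessian` argument is a hypothetical precomputed 2N x 2N matrix for N particles in two dimensions.

```python
import numpy as np

def soft_spots(hessian, n_modes=10, particles_per_spot=20):
    w2, modes = np.linalg.eigh(hessian)          # eigenvalues are omega^2
    spots = []
    for k in range(2, 2 + n_modes):              # skip the two translational zero modes
        e = modes[:, k].reshape(-1, 2)           # per-particle polarization vectors
        amp = np.linalg.norm(e, axis=1)
        idx = np.argsort(amp)[-particles_per_spot:]  # particles most involved in the mode
        directors = e[idx] / amp[idx, None]      # directional information per particle
        spots.append((idx, directors))
    return spots
```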
Deep equilibrium networks are sensitive to initialization statistics
Deep equilibrium networks (DEQs) are a promising way to construct models
which trade off memory for compute. However, theoretical understanding of these
models is still lacking compared to traditional networks, in part because of
the repeated application of a single set of weights. We show that DEQs are
sensitive to the higher order statistics of the matrix families from which they
are initialized. In particular, initializing with orthogonal or symmetric
matrices allows for greater stability in training. This gives us a practical prescription for initializations that allow training with a broader range of initial weight scales.
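A small numpy sketch of the kind of experiment the abstract suggests, with our own placeholder cell z* = tanh(W z* + U x) solved by damped fixed-point iteration; the only point illustrated is that the matrix family W is drawn from (orthogonal versus i.i.d. Gaussian) affects how reliably the iteration converges at a given weight scale.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 256

def orthogonal(scale):
    q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    return scale * q

def deq_forward(W, U, x, tol=1e-6, max_iter=1000, damping=0.5):
    z = np.zeros(n)
    for i in range(max_iter):
        z_new = (1 - damping) * z + damping * np.tanh(W @ z + U @ x)
        if np.linalg.norm(z_new - z) < tol:
            return z_new, i                      # converged to a fixed point
        z = z_new
    return z, max_iter                           # did not converge at this scale

x = rng.normal(size=n)
U = rng.normal(0, 1 / np.sqrt(n), (n, n))
scale = 0.9
_, it_orth = deq_forward(orthogonal(scale), U, x)
_, it_gauss = deq_forward(rng.normal(0, scale / np.sqrt(n), (n, n)), U, x)
print(it_orth, it_gauss)                          # iterations needed by each family
```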
Disentangling Trainability and Generalization in Deep Neural Networks
A longstanding goal in the theory of deep learning is to characterize the
conditions under which a given neural network architecture will be trainable,
and if so, how well it might generalize to unseen data. In this work, we
provide such a characterization in the limit of very wide and very deep
networks, for which the analysis simplifies considerably. For wide networks,
the trajectory under gradient descent is governed by the Neural Tangent Kernel
(NTK), and for deep networks the NTK itself maintains only weak data
dependence. By analyzing the spectrum of the NTK, we formulate necessary
conditions for trainability and generalization across a range of architectures,
including Fully Connected Networks (FCNs) and Convolutional Neural Networks
(CNNs). We identify large regions of hyperparameter space for which networks
can memorize the training set but completely fail to generalize. We find that
CNNs without global average pooling behave almost identically to FCNs, but that
CNNs with pooling have markedly different and often better generalization
performance. These theoretical results are corroborated experimentally on
CIFAR10 for a variety of network architectures and we include a colab notebook
that reproduces the essential results of the paper.
Comment: 22 pages, 3 figures, ICML 2020. Associated Colab notebook at https://colab.research.google.com/github/google/neural-tangents/blob/master/notebooks/Disentangling_Trainability_and_Generalization.ipynb
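To make the NTK analysis concrete, here is a self-contained numpy sketch (ours; the paper's own experiments use the neural-tangents library linked above) that computes the empirical NTK Gram matrix of a one-hidden-layer ReLU network with hand-written gradients and inspects its conditioning, which is what the trainability and generalization conditions constrain.

```python
import numpy as np

rng = np.random.default_rng(3)
d, width, n = 32, 1024, 64                       # illustrative sizes
W = rng.normal(0, np.sqrt(2.0 / d), (width, d))
a = rng.normal(0, np.sqrt(1.0 / width), width)
X = rng.normal(size=(n, d))

pre = X @ W.T                                    # (n, width) preactivations
act = np.maximum(pre, 0.0)                       # ReLU features; f(x) = a . relu(Wx)
grads = [np.concatenate([act[i],                 # df/da
                         np.outer(a * (pre[i] > 0), X[i]).ravel()])  # df/dW
         for i in range(n)]
G = np.stack(grads)
ntk = G @ G.T                                    # empirical NTK Gram matrix Theta_ij
eigs = np.linalg.eigvalsh(ntk)
print(eigs[-1] / max(eigs[0], 1e-12))            # conditioning of the kernel
```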
Dynamical Isometry and a Mean Field Theory of RNNs: Gating Enables Signal Propagation in Recurrent Neural Networks
Recurrent neural networks have gained widespread use in modeling sequence
data across various domains. While many successful recurrent architectures
employ a notion of gating, the exact mechanism that enables such remarkable
performance is not well understood. We develop a theory for signal propagation
in recurrent networks after random initialization using a combination of mean
field theory and random matrix theory. To simplify our discussion, we introduce
a new RNN cell with a simple gating mechanism that we call the minimalRNN and
compare it with vanilla RNNs. Our theory allows us to define a maximum
timescale over which RNNs can remember an input. We show that this theory
predicts trainability for both recurrent architectures. We show that gated
recurrent networks feature a much broader, more robust, trainable region than
vanilla RNNs, which corroborates recent experimental findings. We further develop a closed-form critical initialization scheme that achieves dynamical isometry in both vanilla RNNs and minimalRNNs, and show that this results in significantly improved training dynamics. Finally, we demonstrate that the minimalRNN achieves comparable performance to its more complex counterparts, such as LSTMs or GRUs, on a language modeling task.
Comment: ICML 2018 Conference Proceedings
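A rough numpy sketch of the signal-propagation experiment underlying this theory, with our own illustrative scales: follow two slightly different hidden states of a random vanilla RNN driven by a shared input sequence. The rate at which their gap grows or decays locates the ordered/chaotic boundary and bounds the memory timescale; gating, as in the paper's minimalRNN, is what flattens this curve near criticality.

```python
import numpy as np

rng = np.random.default_rng(4)
n, T, sigma_w = 512, 100, 1.2
W = rng.normal(0, sigma_w / np.sqrt(n), (n, n))
U = rng.normal(0, 1 / np.sqrt(n), (n, n))

x = rng.normal(size=(T, n))                      # one shared input sequence
h1 = rng.normal(size=n)
h2 = h1 + 1e-3 * rng.normal(size=n)              # slightly perturbed initial state
gaps = []
for t in range(T):
    h1 = np.tanh(W @ h1 + U @ x[t])
    h2 = np.tanh(W @ h2 + U @ x[t])
    gaps.append(np.linalg.norm(h1 - h2))
# Exponential growth or decay of `gaps` over t marks chaos or order; the
# decay rate bounds how long the network can remember an input.
print(gaps[::10])
```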
Neural Message Passing for Quantum Chemistry
Supervised learning on molecules has incredible potential to be useful in
chemistry, drug discovery, and materials science. Luckily, several promising
and closely related neural network models invariant to molecular symmetries
have already been described in the literature. These models learn a message
passing algorithm and aggregation procedure to compute a function of their
entire input graph. At this point, the next step is to find a particularly
effective variant of this general approach and apply it to chemical prediction
benchmarks until we either solve them or reach the limits of the approach. In
this paper, we reformulate existing models into a single common framework we
call Message Passing Neural Networks (MPNNs) and explore additional novel
variations within this framework. Using MPNNs we demonstrate state of the art
results on an important molecular property prediction benchmark; these results
are strong enough that we believe future work should focus on datasets with
larger molecules or more accurate ground truth labels.
Comment: 14 pages
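A schematic numpy sketch of the MPNN abstraction as the abstract describes it: several rounds of a message phase and an update phase over a graph, followed by a readout. The single-matrix messages, tanh updates, and sum-pooling readout are our placeholders for the learned functions M_t, U_t, and R of the paper.

```python
import numpy as np

rng = np.random.default_rng(7)
d, e_dim = 8, 3                                  # node / edge feature sizes
Wm = rng.normal(0, 0.5, (d, d + e_dim))          # placeholder message weights
Wu = rng.normal(0, 0.5, (d, 2 * d))              # placeholder update weights

def mpnn_forward(h, edges, edge_feat, steps=3):
    for _ in range(steps):
        m = np.zeros_like(h)                     # message phase: aggregate over neighbors
        for (v, w), e in zip(edges, edge_feat):
            m[v] += np.tanh(Wm @ np.concatenate([h[w], e]))
        h = np.tanh(np.concatenate([h, m], axis=1) @ Wu.T)  # update phase U_t
    return h.sum(axis=0)                         # readout R: sum over all nodes

# Toy molecule-like graph: 4 nodes, undirected edges listed in both directions.
h0 = rng.normal(size=(4, d))
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]
edge_feat = [rng.normal(size=e_dim) for _ in edges]
print(mpnn_forward(h0, edges, edge_feat))
```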
A structural approach to relaxation in glassy liquids
When a liquid freezes, a change in the local atomic structure marks the
transition to the crystal. When a liquid is cooled to form a glass, however, no
noticeable structural change marks the glass transition. Indeed, characteristic
features of glassy dynamics that appear below an onset temperature, T_0, are
qualitatively captured by mean field theory, which assumes uniform local
structure at all temperatures. Even studies of more realistic systems have
found only weak correlations between structure and dynamics. This raises the
question: is structure important to glassy dynamics in three dimensions? Here,
we answer this question affirmatively by using machine learning methods to identify a new field, which we call softness, that characterizes local structure and is strongly correlated with rearrangement dynamics. We find that the onset of glassy dynamics at T_0 is marked by the onset of correlations between softness (i.e. structure) and dynamics. Moreover, we use softness to construct a simple model of slow glassy relaxation that is in excellent agreement with our simulation results, showing that a theory of the evolution of softness in time would constitute a theory of glassy dynamics.
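As we read the abstract, the pipeline is roughly: describe each particle's local structure with a vector of structure functions, train a linear classifier to separate particles about to rearrange from those that are not, and take the signed distance to the decision boundary as the softness field. A hedged sklearn sketch with stand-in data follows; real inputs would be neighbor distances and rearrangement labels extracted from simulation.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(5)
# Stand-in data: per-particle neighbor distances and 0/1 labels marking
# whether the particle rearranged shortly afterward.
neighbor_distances = [rng.uniform(0.8, 3.0, size=40) for _ in range(200)]
rearranged = rng.integers(0, 2, size=200)

def radial_features(dists, mus, width=0.1):
    # G(mu) = sum_j exp(-(r_j - mu)^2 / width^2): density in shells around the particle
    return np.array([np.exp(-((dists - mu) / width) ** 2).sum() for mu in mus])

mus = np.arange(0.8, 3.0, 0.1)
X = np.stack([radial_features(d, mus) for d in neighbor_distances])
clf = LinearSVC(C=0.1).fit(X, rearranged)
softness = clf.decision_function(X)              # continuous structural field per particle
```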
Stability of jammed packings II: the transverse length scale
As a function of packing fraction at zero temperature and applied stress, an
amorphous packing of spheres exhibits a jamming transition where the system is
sensitive to boundary conditions even in the thermodynamic limit. Upon further
compression, the system should become insensitive to boundary conditions
provided it is sufficiently large. Here we explore the linear response to a
large class of boundary perturbations in 2 and 3 dimensions. We consider each
finite packing with periodic boundary conditions as the basis of an infinite
square or cubic lattice and study properties of vibrational modes at arbitrary
wave vector. We find that the stability of such modes can be understood in terms of
a competition between plane waves and the anomalous vibrational modes
associated with the jamming transition; infinitesimal boundary perturbations
become irrelevant for systems that are larger than a length scale that
characterizes the transverse excitations. This previously identified length diverges at the jamming transition.
Comment: 8 pages, 5 figures
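A toy version of the construction in the abstract, reduced from a sphere packing to a one-dimensional disordered spring chain for brevity (our simplification): treat the finite periodic cell as the unit cell of an infinite lattice and diagonalize the Bloch dynamical matrix D(k), in which the bond that wraps around the cell picks up a phase e^{ik}.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 16
k_spring = rng.uniform(0.5, 1.5, N)              # spring j couples sites j and j+1

D0 = np.zeros((N, N))
for j in range(N):
    a, b = j, (j + 1) % N
    D0[a, a] += k_spring[j]
    D0[b, b] += k_spring[j]
    if b == a + 1:                               # bond inside the cell
        D0[a, b] -= k_spring[j]
        D0[b, a] -= k_spring[j]

def mode_frequencies(k):
    Dk = D0.astype(complex)                      # add the bond wrapping into the next cell
    Dk[N - 1, 0] -= k_spring[-1] * np.exp(1j * k)
    Dk[0, N - 1] -= k_spring[-1] * np.exp(-1j * k)
    w2 = np.linalg.eigvalsh(Dk)                  # Hermitian, so omega^2 is real
    return np.sqrt(np.clip(w2, 0.0, None))

print(mode_frequencies(0.0)[:3])                 # lowest branch: zero mode at k = 0
print(mode_frequencies(np.pi)[:3])               # modes at the zone boundary
```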