Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
Despite the widespread practical success of deep learning methods, our
theoretical understanding of the dynamics of learning in deep neural networks
remains quite sparse. We attempt to bridge the gap between the theory and
practice of deep learning by systematically analyzing learning dynamics for the
restricted case of deep linear neural networks. Despite the linearity of their
input-output map, such networks have nonlinear gradient descent dynamics on
weights that change with the addition of each new hidden layer. We show that
deep linear networks exhibit nonlinear learning phenomena similar to those seen
in simulations of nonlinear networks, including long plateaus followed by rapid
transitions to lower error solutions, and faster convergence from greedy
unsupervised pretraining initial conditions than from random initial
conditions. We provide an analytical description of these phenomena by finding
new exact solutions to the nonlinear dynamics of deep learning. Our theoretical
analysis also reveals the surprising finding that as the depth of a network
approaches infinity, learning speed can nevertheless remain finite: for a
special class of initial conditions on the weights, very deep networks incur
only a finite, depth independent, delay in learning speed relative to shallow
networks. We show that, under certain conditions on the training data,
unsupervised pretraining can find this special class of initial conditions,
while scaled random Gaussian initializations cannot. We further exhibit a new
class of random orthogonal initial conditions on weights that, like
unsupervised pre-training, enjoys depth independent learning times. We further
show that these initial conditions also lead to faithful propagation of
gradients even in deep nonlinear networks, as long as they operate in a special
regime known as the edge of chaos.

Comment: Submission to ICLR 2014. Revised based on reviewer feedback.
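To make the initialization contrast concrete, the following is a minimal numpy sketch (not taken from the paper) comparing the composite map of a deep linear network built from random orthogonal layers with one built from scaled Gaussian layers; the width, depth, and the helper name haar_orthogonal are illustrative assumptions.

    import numpy as np

    def haar_orthogonal(n, rng):
        # Random orthogonal matrix via QR of a Gaussian, with sign correction
        # so that the result is Haar-distributed.
        q, r = np.linalg.qr(rng.standard_normal((n, n)))
        return q * np.sign(np.diag(r))

    rng = np.random.default_rng(0)
    n, depth = 100, 50  # illustrative width and depth

    # Composite input-output map W = W_depth ... W_1 of a deep linear network,
    # once with orthogonal layers and once with scaled Gaussian layers.
    W_orth, W_gauss = np.eye(n), np.eye(n)
    for _ in range(depth):
        W_orth = haar_orthogonal(n, rng) @ W_orth
        W_gauss = (rng.standard_normal((n, n)) / np.sqrt(n)) @ W_gauss

    # A product of orthogonal matrices is orthogonal: all singular values equal 1,
    # so signals and gradients propagate isometrically at any depth. The scaled
    # Gaussian product instead develops a widely spread singular value spectrum.
    print(np.linalg.svd(W_orth, compute_uv=False)[[0, -1]])
    print(np.linalg.svd(W_gauss, compute_uv=False)[[0, -1]])

The orthogonal product keeps every singular value at 1 regardless of depth, which is the kind of isometric propagation the abstract associates with depth-independent learning times, whereas the Gaussian product's spectrum spreads over orders of magnitude as depth grows.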
Dynamical Isometry is Achieved in Residual Networks in a Universal Way for any Activation Function
We demonstrate that in residual neural networks (ResNets) dynamical isometry
is achievable irrespective of the activation function used. We do so by
deriving, with the help of Free Probability and Random Matrix Theories, a
universal formula for the spectral density of the input-output Jacobian at
initialization, in the large network width and depth limit. The resulting
singular value spectrum depends on a single parameter, which we calculate for a
variety of popular activation functions, by analyzing the signal propagation in
the artificial neural network. We corroborate our results with numerical
simulations of both random matrices and ResNets applied to the CIFAR-10
classification problem. Moreover, we study the consequence of this universal
behavior for the initial and late phases of the learning process. We conclude
by drawing attention to the simple fact that initialization acts as a
confounding factor between the choice of activation function and the rate of
learning. We propose that, based on our results, this confound can be resolved
in ResNets by ensuring the same level of dynamical isometry at initialization.
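As a rough numerical illustration of the quantity being studied (a sketch, not the paper's free-probability calculation), one can accumulate the input-output Jacobian of a toy ResNet at initialization and inspect its singular values; the tanh branch, width, depth, and branch scaling below are assumptions chosen only for the example.

    import numpy as np

    rng = np.random.default_rng(1)
    n, depth, branch_scale = 200, 50, 0.1  # illustrative sizes, not from the paper

    # Propagate a random input through residual blocks x -> x + s * tanh(W x)
    # and accumulate the input-output Jacobian at Gaussian initialization.
    x = rng.standard_normal(n)
    J = np.eye(n)
    for _ in range(depth):
        W = rng.standard_normal((n, n)) / np.sqrt(n)
        h = W @ x
        # Jacobian of one block: I + s * diag(tanh'(h)) W
        J = (np.eye(n) + branch_scale * (1.0 - np.tanh(h) ** 2)[:, None] * W) @ J
        x = x + branch_scale * np.tanh(h)

    s = np.linalg.svd(J, compute_uv=False)
    # Dynamical isometry means the whole singular value spectrum of J stays
    # close to 1, so gradients are neither amplified nor attenuated with depth.
    print(s.max(), s.mean(), s.min())

In this toy setting the branch scaling controls how tightly the spectrum concentrates around 1, which is the knob one would tune to match the level of dynamical isometry across different activation functions.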