Landscape connectivity and dropout stability of SGD solutions for over-parameterized neural networks
The optimization of multilayer neural networks typically leads to a solution
with zero training error, yet the landscape can exhibit spurious local minima
and the minima can be disconnected. In this paper, we shed light on this
phenomenon: we show that the combination of stochastic gradient descent (SGD)
and over-parameterization makes the landscape of multilayer neural networks
approximately connected and thus more favorable to optimization. More
specifically, we prove that SGD solutions are connected via a piecewise linear
path, and the increase in loss along this path vanishes as the number of
neurons grows large. This result is a consequence of the fact that the
parameters found by SGD are increasingly dropout stable as the network becomes
wider. We show that, if we remove a subset of the neurons (and suitably rescale the
remaining ones), the change in loss is independent of the total number of
neurons and depends only on how many neurons are left. Our results exhibit
a mild dependence on the input dimension: they are dimension-free for two-layer
networks and depend linearly on the dimension for multilayer networks. We
validate our theoretical findings with numerical experiments for different
architectures and classification tasks.
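To make the dropout-stability check concrete, here is a minimal sketch in Python/NumPy (the toy two-layer model, its scalings, and all names are illustrative assumptions, not the authors' code): drop half of the hidden neurons, rescale the survivors to compensate, and measure the change in loss. On an SGD-trained, sufficiently wide network the result above says this gap is small; random weights are used here only to illustrate the measurement itself.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy over-parameterized two-layer network: f(x) = W2 @ relu(W1 @ x)
    d, n = 10, 1024                       # input dimension, hidden neurons
    W1 = rng.normal(size=(n, d)) / np.sqrt(d)
    W2 = rng.normal(size=(1, n)) / n      # mean-field output scaling

    X = rng.normal(size=(d, 256))         # a batch of inputs
    y = np.sin(X.sum(axis=0))             # arbitrary regression targets

    def mse(W1, W2):
        pred = (W2 @ np.maximum(W1 @ X, 0.0)).ravel()
        return float(np.mean((pred - y) ** 2))

    # Dropout-stability check: keep a random half of the neurons and
    # rescale the output weights by n/|kept| to preserve the mean output.
    keep = rng.choice(n, size=n // 2, replace=False)
    gap = abs(mse(W1[keep], W2[:, keep] * (n / keep.size)) - mse(W1, W2))
    print("loss change under dropout:", gap)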
Neural Network Parametrization of Deep-Inelastic Structure Functions
We construct a parametrization of deep-inelastic structure functions which
retains information on experimental errors and correlations, and which does not
introduce any theoretical bias while interpolating between existing data
points. We generate a Monte Carlo sample of pseudo-data configurations and we
train an ensemble of neural networks on them. This effectively provides us with
a probability measure in the space of structure functions, within the whole
kinematic region where data are available. This measure can then be used to
determine the value of the structure function, its error, its point-to-point
correlations and, more generally, the value and uncertainty of any function of the
structure function itself. We apply this technique to the determination of the
structure function F_2 of the proton and deuteron, and a precision
determination of the isotriplet combination F_2[p-d]. We discuss in detail
these results, check their stability and accuracy, and make them available in
various formats for applications.

Comment: LaTeX, 43 pages, 22 figures. (v2) Final version, published in JHEP; Sect. 5.2 and Fig. 9 improved, a few typos corrected, and other minor improvements. (v3) Some inconsequential typos in Tab. 1 and Tab. 5 corrected. Neural parametrization available at http://sophia.ecm.ub.es/f2neura
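The Monte Carlo ensemble procedure sketched above can be illustrated as follows (a toy version using scikit-learn; the stand-in data, errors, and network settings are assumptions for illustration, not the actual setup): fluctuate each data point within its experimental error to build pseudo-data replicas, train one network per replica, and read central values and uncertainties off the ensemble.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)

    # Stand-in "measurements" of a structure function at kinematic
    # points x_i, with experimental errors sigma_i (all values invented).
    x = np.linspace(0.01, 0.9, 30)[:, None]
    f2 = np.exp(-3 * x.ravel()) * (1 - x.ravel()) ** 3
    sigma = 0.05 * f2 + 0.01

    # Monte Carlo pseudo-data: fluctuate each point within its error,
    # then train one network per replica.
    ensemble = []
    for k in range(50):
        replica = f2 + sigma * rng.normal(size=f2.shape)
        net = MLPRegressor(hidden_layer_sizes=(20,), max_iter=5000,
                           random_state=k).fit(x, replica)
        ensemble.append(net)

    # The ensemble acts as a probability measure: its mean is the central
    # value, its spread the propagated uncertainty, at any kinematic point.
    grid = np.linspace(0.01, 0.9, 200)[:, None]
    preds = np.stack([net.predict(grid) for net in ensemble])
    central, error = preds.mean(axis=0), preds.std(axis=0)

Point-to-point correlations follow in the same way, e.g. from the covariance of the replica predictions at pairs of kinematic points.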
Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
Despite the widespread practical success of deep learning methods, our
theoretical understanding of the dynamics of learning in deep neural networks
remains quite sparse. We attempt to bridge the gap between the theory and
practice of deep learning by systematically analyzing learning dynamics for the
restricted case of deep linear neural networks. Despite the linearity of their
input-output map, such networks have nonlinear gradient descent dynamics on
weights that change with the addition of each new hidden layer. We show that
deep linear networks exhibit nonlinear learning phenomena similar to those seen
in simulations of nonlinear networks, including long plateaus followed by rapid
transitions to lower error solutions, and faster convergence from greedy
unsupervised pretraining initial conditions than from random initial
conditions. We provide an analytical description of these phenomena by finding
new exact solutions to the nonlinear dynamics of deep learning. Our theoretical
analysis also reveals the surprising finding that as the depth of a network
approaches infinity, learning speed can nevertheless remain finite: for a
special class of initial conditions on the weights, very deep networks incur
only a finite, depth-independent delay in learning speed relative to shallow
networks. We show that, under certain conditions on the training data,
unsupervised pretraining can find this special class of initial conditions,
while scaled random Gaussian initializations cannot. We further exhibit a new
class of random orthogonal initial conditions on weights that, like
unsupervised pretraining, enjoys depth-independent learning times. We further
show that these initial conditions also lead to faithful propagation of
gradients even in deep nonlinear networks, as long as they operate in a special
regime known as the edge of chaos.

Comment: Submission to ICLR 2014. Revised based on reviewer feedback.
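As a rough illustration of the experiment this abstract describes, the sketch below trains a deep linear chain by full-batch gradient descent under orthogonal versus scaled Gaussian initial conditions (the depth, learning rate, and random linear teacher are arbitrary assumptions; the snippet illustrates the setup, not the paper's exact solutions).

    import numpy as np

    rng = np.random.default_rng(0)

    def train_deep_linear(depth, init, d=10, steps=500, lr=0.01):
        """Full-batch gradient descent on y = W_L ... W_1 x, fit to a
        random linear teacher; returns the loss trajectory."""
        teacher = rng.normal(size=(d, d))
        X = rng.normal(size=(d, 200))
        Y = teacher @ X
        if init == "orthogonal":
            Ws = [np.linalg.qr(rng.normal(size=(d, d)))[0] for _ in range(depth)]
        else:                                  # scaled Gaussian
            Ws = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(depth)]
        losses = []
        for _ in range(steps):
            acts = [X]                         # forward pass, caching activations
            for W in Ws:
                acts.append(W @ acts[-1])
            err = acts[-1] - Y
            losses.append(float(np.mean(err ** 2)))
            grad = 2 * err / err.size          # dL/d(output) for the mean-square loss
            for i in reversed(range(depth)):   # backprop through the linear chain
                gW = grad @ acts[i].T          # gradient w.r.t. layer i's weights
                grad = Ws[i].T @ grad          # gradient w.r.t. layer i's input
                Ws[i] -= lr * gW
        return losses

    for init in ("orthogonal", "gaussian"):
        print(init, train_deep_linear(depth=8, init=init)[-1])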