Non-convergence of stochastic gradient descent in the training of deep neural networks
Deep neural networks have successfully been trained in various application
areas with stochastic gradient descent. However, there is no rigorous
mathematical explanation of why this works so well. The training of neural
networks with stochastic gradient descent has four different discretization
parameters: (i) the network architecture; (ii) the amount of training data;
(iii) the number of gradient steps; and (iv) the number of randomly initialized
gradient trajectories. While it can be shown that the approximation error
converges to zero if all four parameters are sent to infinity in the right
order, we demonstrate in this paper that stochastic gradient descent fails to
converge for ReLU networks if their depth is much larger than their width and
the number of random initializations does not increase to infinity fast enough.
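To make the abstract's four discretization parameters concrete, here is a minimal sketch of randomly restarted minibatch SGD training of a deep ReLU network. The toy regression task, the use of PyTorch, and all numerical values are illustrative assumptions, not the paper's construction.

# Illustrative only: task, library choice, and values are assumptions.
import torch
import torch.nn as nn

def make_relu_net(width, depth):
    layers, d_in = [], 1
    for _ in range(depth):                       # (i) network architecture
        layers += [nn.Linear(d_in, width), nn.ReLU()]
        d_in = width
    layers.append(nn.Linear(d_in, 1))
    return nn.Sequential(*layers)

def sgd_trajectory(width, depth, n_data, n_steps, lr=1e-2):
    X = torch.randn(n_data, 1)                   # (ii) amount of training data
    y = torch.sin(3.0 * X)
    net = make_relu_net(width, depth)
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    for _ in range(n_steps):                     # (iii) number of gradient steps
        idx = torch.randint(0, n_data, (32,))    # random minibatch
        loss = ((net(X[idx]) - y[idx]) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return ((net(X) - y) ** 2).mean().item()

# (iv) number of randomly initialized trajectories; the abstract's negative
# result concerns depth much larger than width with too few such restarts.
losses = [sgd_trajectory(width=5, depth=50, n_data=256, n_steps=2000)
          for _ in range(3)]
print(min(losses))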
Estimating Full Lipschitz Constants of Deep Neural Networks
We estimate the Lipschitz constants of the gradient of a deep neural network
and the network itself with respect to the full set of parameters. We first
develop estimates for a deep feed-forward densely connected network and then,
in a more general framework, for all neural networks that can be represented as
solutions of controlled ordinary differential equations, where time appears as
continuous depth. These estimates can be used to set the step size of
stochastic gradient descent methods, which is illustrated for one example
method.
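As a worked illustration of how a Lipschitz constant of the gradient fixes the step size, the sketch below uses plain least squares, where the constant L = lambda_max(X^T X) / n is exact and the classical rule is a step size of at most 1/L. This is only the standard smoothness argument under assumed toy data; the paper's estimates for deep and ODE-type networks are not reproduced here.

# Illustrative only: the data and dimensions are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

# Loss f(w) = 1/(2n) ||Xw - y||^2 has gradient (1/n) X^T (Xw - y),
# which is Lipschitz with constant L = lambda_max(X^T X) / n.
L = np.linalg.eigvalsh(X.T @ X / n).max()
step = 1.0 / L                                   # classical safe step size

w = np.zeros(d)
for _ in range(200):
    i = rng.integers(0, n, size=32)              # minibatch SGD step
    grad = X[i].T @ (X[i] @ w - y[i]) / len(i)
    w -= step * grad
print(np.linalg.norm(w - w_true))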
Law of Balance and Stationary Distribution of Stochastic Gradient Descent
The stochastic gradient descent (SGD) algorithm is the standard method for
training neural networks. However, it remains poorly understood how SGD
navigates the highly nonlinear and degenerate loss landscape of a neural
network. In this work, we prove that the minibatch noise of SGD regularizes the
solution towards a balanced one whenever the loss function contains a
rescaling symmetry. Because the difference between a simple diffusion process
and SGD dynamics is the most significant when symmetries are present, our
theory implies that the loss function symmetries constitute an essential probe
of how SGD works. We then apply this result to derive the stationary
distribution of stochastic gradient flow for a diagonal linear network with
arbitrary depth and width. The stationary distribution exhibits complicated
nonlinear phenomena such as phase transitions, broken ergodicity, and
fluctuation inversion. These phenomena are shown to exist uniquely in deep
networks, implying a fundamental difference between deep and shallow models.
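As a hedged numerical sketch (a toy under stated assumptions, not the paper's derivation), the snippet below shows a rescaling symmetry in a depth-2 diagonal linear network f(x) = (u * w) . x, whose loss is unchanged under u -> c*u, w -> w/c, and tracks how minibatch SGD shrinks the per-coordinate imbalance u_i^2 - w_i^2 during training.

# Illustrative only: data, initialization scale, and hyperparameters are assumptions.
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 5
X = rng.normal(size=(n, d))
beta = rng.normal(size=d)
y = X @ beta + 0.1 * rng.normal(size=n)          # label noise keeps gradient noise alive

u = 2.0 * rng.normal(size=d)                     # deliberately unbalanced start
w = rng.normal(size=d) / 2.0
lr, batch = 0.01, 8

for t in range(20001):
    i = rng.integers(0, n, size=batch)
    r = X[i] @ (u * w) - y[i]                    # minibatch residual
    g = X[i].T @ r / batch                       # gradient w.r.t. the product u * w
    u, w = u - lr * g * w, w - lr * g * u        # chain rule for the two factors
    if t % 5000 == 0:
        print(t, np.abs(u ** 2 - w ** 2).mean()) # imbalance decays over training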