Non-convergence of stochastic gradient descent in the training of deep neural networks
Deep neural networks have successfully been trained in various application
areas with stochastic gradient descent. However, there is no rigorous
mathematical explanation of why this works so well. The training of neural
networks with stochastic gradient descent has four different discretization
parameters: (i) the network architecture; (ii) the amount of training data;
(iii) the number of gradient steps; and (iv) the number of randomly initialized
gradient trajectories. While it can be shown that the approximation error
converges to zero if all four parameters are sent to infinity in the right
order, we demonstrate in this paper that stochastic gradient descent fails to
converge for ReLU networks if their depth is much larger than their width and
the number of random initializations does not increase to infinity fast enough.
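To make the abstract's four discretization parameters concrete, here is a minimal sketch of randomly restarted minibatch SGD training of a deep ReLU network. The toy regression task, the use of PyTorch, and all numerical values are illustrative assumptions, not the paper's construction.

# Illustrative only: task, library choice, and values are assumptions.
import torch
import torch.nn as nn

def make_relu_net(width, depth):
    layers, d_in = [], 1
    for _ in range(depth):                       # (i) network architecture
        layers += [nn.Linear(d_in, width), nn.ReLU()]
        d_in = width
    layers.append(nn.Linear(d_in, 1))
    return nn.Sequential(*layers)

def sgd_trajectory(width, depth, n_data, n_steps, lr=1e-2):
    X = torch.randn(n_data, 1)                   # (ii) amount of training data
    y = torch.sin(3.0 * X)
    net = make_relu_net(width, depth)
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    for _ in range(n_steps):                     # (iii) number of gradient steps
        idx = torch.randint(0, n_data, (32,))    # random minibatch
        loss = ((net(X[idx]) - y[idx]) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return ((net(X) - y) ** 2).mean().item()

# (iv) number of randomly initialized trajectories; the abstract's negative
# result concerns depth much larger than width with too few such restarts.
losses = [sgd_trajectory(width=5, depth=50, n_data=256, n_steps=2000)
          for _ in range(3)]
print(min(losses))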
Estimating Full Lipschitz Constants of Deep Neural Networks
We estimate the Lipschitz constants of the gradient of a deep neural network
and the network itself with respect to the full set of parameters. We first
develop estimates for a deep feed-forward densely connected network and then,
in a more general framework, for all neural networks that can be represented as
solutions of controlled ordinary differential equations, where time appears as
continuous depth. These estimates can be used to set the step size of
stochastic gradient descent methods, which is illustrated for one example
method.
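As a worked illustration of how a Lipschitz constant of the gradient fixes the step size, the sketch below uses plain least squares, where the constant L = lambda_max(X^T X) / n is exact and the classical rule is a step size of at most 1/L. This is only the standard smoothness argument under assumed toy data; the paper's estimates for deep and ODE-type networks are not reproduced here.

# Illustrative only: the data and dimensions are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

# Loss f(w) = 1/(2n) ||Xw - y||^2 has gradient (1/n) X^T (Xw - y),
# which is Lipschitz with constant L = lambda_max(X^T X) / n.
L = np.linalg.eigvalsh(X.T @ X / n).max()
step = 1.0 / L                                   # classical safe step size

w = np.zeros(d)
for _ in range(200):
    i = rng.integers(0, n, size=32)              # minibatch SGD step
    grad = X[i].T @ (X[i] @ w - y[i]) / len(i)
    w -= step * grad
print(np.linalg.norm(w - w_true))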
Law of Balance and Stationary Distribution of Stochastic Gradient Descent
The stochastic gradient descent (SGD) algorithm is the standard method for
training neural networks. However, it remains poorly understood how SGD
navigates the highly nonlinear and degenerate loss landscape of a neural
network. In this work, we prove that the minibatch noise of SGD regularizes the
solution towards a balanced one whenever the loss function contains a
rescaling symmetry. Because the difference between a simple diffusion process
and SGD dynamics is the most significant when symmetries are present, our
theory implies that the loss function symmetries constitute an essential probe
of how SGD works. We then apply this result to derive the stationary
distribution of stochastic gradient flow for a diagonal linear network with
arbitrary depth and width. The stationary distribution exhibits complicated
nonlinear phenomena such as phase transitions, broken ergodicity, and
fluctuation inversion. These phenomena are shown to exist uniquely in deep
networks, implying a fundamental difference between deep and shallow models.
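As a hedged numerical sketch (a toy under stated assumptions, not the paper's derivation), the snippet below shows a rescaling symmetry in a depth-2 diagonal linear network f(x) = (u * w) . x, whose loss is unchanged under u -> c*u, w -> w/c, and tracks how minibatch SGD shrinks the per-coordinate imbalance u_i^2 - w_i^2 during training.

# Illustrative only: data, initialization scale, and hyperparameters are assumptions.
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 5
X = rng.normal(size=(n, d))
beta = rng.normal(size=d)
y = X @ beta + 0.1 * rng.normal(size=n)          # label noise keeps gradient noise alive

u = 2.0 * rng.normal(size=d)                     # deliberately unbalanced start
w = rng.normal(size=d) / 2.0
lr, batch = 0.01, 8

for t in range(20001):
    i = rng.integers(0, n, size=batch)
    r = X[i] @ (u * w) - y[i]                    # minibatch residual
    g = X[i].T @ r / batch                       # gradient w.r.t. the product u * w
    u, w = u - lr * g * w, w - lr * g * u        # chain rule for the two factors
    if t % 5000 == 0:
        print(t, np.abs(u ** 2 - w ** 2).mean()) # imbalance decays over training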