14,580 research outputs found

    Asymptotic analysis of deep learning algorithms

    We investigate the asymptotic properties of deep residual networks as the number of layers increases. We first show the existence of scaling regimes for trained weights that are markedly different from those implicitly assumed in the neural ODE literature. We study the convergence of the hidden state dynamics in these scaling regimes, showing that one may obtain an ODE, a stochastic differential equation (SDE), or neither. Furthermore, we derive the corresponding scaling limits for the backpropagation dynamics. Finally, we prove that in the case of a smooth activation function, the scaling regime arises as a consequence of using gradient descent. In particular, we prove linear convergence of gradient descent to a global minimum for the training of deep residual networks. We also show that if the trained weights, as a function of the layer index, admit a scaling limit as the depth increases, then the limit has finite 2-variation. This work also investigates the mean-field limit of path-homogeneous neural architectures: we prove convergence of the Wasserstein gradient flow to a global minimum, and we derive a generalization bound, based on the stability of the optimization algorithm, for 2-layer neural networks with ReLU activation.
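    To make the scaling regimes concrete, here is a minimal NumPy sketch (our own illustration, not the paper's construction) of the hidden-state iteration $h_{k+1} = h_k + L^{-\beta}\tanh(\theta_k h_k)$ with i.i.d. Gaussian layer weights. The exponent $\beta = 1$ mimics the $1/L$ scaling implicitly assumed in the neural-ODE setting, while $\beta = 1/2$ corresponds to a diffusive regime; the function `final_hidden_state`, the width, and the weight distribution are assumptions of this sketch.

```python
# Sketch only: depth-L residual iteration under two weight-scaling regimes.
# The activation, weight law, and dimensions are illustrative assumptions.
import numpy as np

def final_hidden_state(L, beta, seed=0, d=4):
    """Run h_{k+1} = h_k + L^(-beta) * tanh(theta_k @ h_k) for k = 0..L-1."""
    rng = np.random.default_rng(seed)
    h = np.ones(d)
    for _ in range(L):
        theta_k = rng.standard_normal((d, d)) / np.sqrt(d)  # i.i.d. layer weights
        h = h + L ** (-beta) * np.tanh(theta_k @ h)          # scaled residual update
    return h

# Compare the two regimes as depth grows: beta = 1 (neural-ODE-type scaling)
# versus beta = 1/2 (diffusive scaling of the residual increments).
for L in (10, 100, 1000, 10000):
    h_ode = final_hidden_state(L, beta=1.0)
    h_dif = final_hidden_state(L, beta=0.5)
    print(f"L={L:5d}  |h|_beta=1: {np.linalg.norm(h_ode):.3f}"
          f"  |h|_beta=1/2: {np.linalg.norm(h_dif):.3f}")
```

    Running the loop shows how the terminal hidden state behaves very differently under the two scalings as $L$ grows, which is the kind of regime dependence the abstract refers to.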

    Convergence Theory of Learning Over-parameterized ResNet: A Full Characterization

    The ResNet architecture has achieved great empirical success since its debut. Recent work established the convergence of learning over-parameterized ResNet with a scaling factor $\tau = 1/L$ on the residual branch, where $L$ is the network depth. However, it is not clear how learning behaves for other values of $\tau$. In this paper, we fully characterize the convergence theory of gradient descent for learning over-parameterized ResNet with different values of $\tau$. Specifically, hiding logarithmic factors and constant coefficients, we show that for $\tau \le 1/\sqrt{L}$ gradient descent is guaranteed to converge to a global minimum, and, in particular, when $\tau \le 1/L$ the convergence is independent of the network depth. Conversely, we show that for $\tau > L^{-\frac{1}{2}+c}$ the forward output grows at least at rate $L^c$ in expectation, and learning then fails because of gradient explosion for large $L$. This means the bound $\tau \le 1/\sqrt{L}$ is sharp for learning ResNet of arbitrary depth. To the best of our knowledge, this is the first work that studies learning ResNet over the full range of $\tau$.
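    To illustrate the role of the threshold $\tau \le 1/\sqrt{L}$, here is a minimal NumPy sketch (an illustrative forward-pass simulation at random initialization, not the paper's construction): each residual block contributes $\tau\, W_k\,\mathrm{relu}(V_k h_k)$, and we track how much the output norm grows with depth for $\tau = 1/L$, $\tau = 1/\sqrt{L}$, and $\tau = L^{-1/2+c}$. The function `output_growth`, the width `m`, and the choice $c = 0.25$ are assumptions of this sketch.

```python
# Sketch only: forward pass of a width-m ResNet whose residual branch is
# scaled by tau, h_{k+1} = h_k + tau * W_k @ relu(V_k @ h_k), at Gaussian
# initialization. All architectural choices here are illustrative assumptions.
import numpy as np

def output_growth(L, tau, m=64, seed=0):
    """Return ||h_L|| / ||h_0|| for a depth-L, width-m residual forward pass."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(m)
    h0_norm = np.linalg.norm(h)
    for _ in range(L):
        V = rng.standard_normal((m, m)) * np.sqrt(2.0 / m)  # He-style init
        W = rng.standard_normal((m, m)) * np.sqrt(1.0 / m)
        h = h + tau * W @ np.maximum(V @ h, 0.0)            # scaled residual branch
    return np.linalg.norm(h) / h0_norm

c = 0.25
for L in (16, 64, 256, 1024):
    print(f"L={L:5d}"
          f"  tau=1/L: {output_growth(L, 1.0 / L):10.3g}"
          f"  tau=1/sqrt(L): {output_growth(L, L ** -0.5):10.3g}"
          f"  tau=L^(-1/2+c): {output_growth(L, L ** (-0.5 + c)):10.3g}")
```

    At this initialization the squared output norm grows roughly like $(1 + \tau^2)^L$, which stays bounded in $L$ exactly when $\tau^2 L$ is bounded, i.e. $\tau \lesssim 1/\sqrt{L}$, and blows up for $\tau = L^{-1/2+c}$; this is consistent with the sharp threshold and the gradient-explosion regime described in the abstract.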