3,479 research outputs found

    On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport

    Many tasks in machine learning and signal processing can be solved by minimizing a convex function of a measure. This includes sparse spikes deconvolution or training a neural network with a single hidden layer. For these problems, we study a simple minimization method: the unknown measure is discretized into a mixture of particles and a continuous-time gradient descent is performed on their weights and positions. This is an idealization of the usual way to train neural networks with a large hidden layer. We show that, when initialized correctly and in the many-particle limit, this gradient flow, although non-convex, converges to global minimizers. The proof involves Wasserstein gradient flows, a by-product of optimal transport theory. Numerical experiments show that this asymptotic behavior is already at play for a reasonable number of particles, even in high dimension. Comment: Advances in Neural Information Processing Systems (NIPS), Dec 2018, Montréal, Canada
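
    As a rough illustration of the method described in this abstract (a minimal sketch under stated assumptions, not the authors' code), the snippet below discretizes the unknown measure into m weighted particles and runs plain gradient descent jointly on their weights and positions, which amounts to training a one-hidden-layer ReLU network in the mean-field normalization. The toy target, the number of particles, and the step size (rescaled by m because per-particle gradients are O(1/m)) are illustrative assumptions.

```python
# Hypothetical sketch of particle gradient descent on a measure:
# the measure is discretized as (1/m) * sum_j c_j * delta_{a_j}, giving the
# one-hidden-layer ReLU network f(x) = (1/m) * sum_j c_j * relu(a_j . x),
# and gradient descent updates both weights c_j and positions a_j.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 200, 2, 500                           # samples, input dimension, number of particles
X = rng.normal(size=(n, d))
y = np.maximum(X @ np.array([1.0, -1.0]), 0.0)  # toy target: a single ReLU neuron (an assumption)

a = rng.normal(size=(m, d))                     # particle positions
c = 0.1 * rng.normal(size=m)                    # particle weights

lr, n_steps = 0.2 * m, 2000                     # step size rescaled by m (per-particle gradients are O(1/m))
for _ in range(n_steps):
    pre = X @ a.T                               # (n, m) pre-activations
    act = np.maximum(pre, 0.0)                  # ReLU features of each particle
    r = (act @ c / m - y) / n                   # scaled residuals of the mixture output
    grad_c = act.T @ r / m                      # gradient of the squared loss w.r.t. weights
    grad_a = ((pre > 0) * (r[:, None] * c[None, :] / m)).T @ X  # gradient w.r.t. positions
    c -= lr * grad_c
    a -= lr * grad_a

print("final squared loss:",
      0.5 * np.mean((np.maximum(X @ a.T, 0.0) @ c / m - y) ** 2))
```

    The abstract's result concerns the idealized continuous-time, many-particle limit of this non-convex dynamic; the sketch only mirrors the discretized, finite-m version.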

    Generalization Error Bounds of Gradient Descent for Learning Over-parameterized Deep ReLU Networks

    Empirical studies show that gradient-based methods can learn deep neural networks (DNNs) with very good generalization performance in the over-parameterization regime, where DNNs can easily fit a random labeling of the training data. Very recently, a line of work explains in theory that with over-parameterization and proper random initialization, gradient-based methods can find the global minima of the training loss for DNNs. However, existing generalization error bounds are unable to explain the good generalization performance of over-parameterized DNNs. The major limitation of most existing generalization bounds is that they are based on uniform convergence and are independent of the training algorithm. In this work, we derive an algorithm-dependent generalization error bound for deep ReLU networks, and show that under certain assumptions on the data distribution, gradient descent (GD) with proper random initialization is able to train a sufficiently over-parameterized DNN to achieve arbitrarily small generalization error. Our work sheds light on explaining the good generalization performance of over-parameterized deep neural networks. Comment: 27 pages. This version simplifies the proof and improves the presentation of Version 3. In AAAI 2020
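
    The bound discussed above is attached to a particular training scheme: a sufficiently wide deep ReLU network with Gaussian random initialization, trained by full-batch gradient descent. The sketch below is a hedged illustration of such a scheme, not the paper's code; the width, depth, initialization scale, loss, and step size are assumptions chosen only for the example.

```python
# Hypothetical setup: an over-parameterized deep ReLU network with Gaussian
# ("proper") random initialization, trained by full-batch gradient descent.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
n, d, width, depth = 128, 10, 1024, 3          # samples, input dim, hidden width, hidden layers
X = torch.randn(n, d)
y = (X[:, 0] > 0).float() * 2 - 1              # toy +/-1 labels (an assumption)

layers, fan_in = [], d
for _ in range(depth):
    lin = nn.Linear(fan_in, width, bias=False)
    nn.init.normal_(lin.weight, std=(2.0 / fan_in) ** 0.5)  # He-style Gaussian initialization
    layers += [lin, nn.ReLU()]
    fan_in = width
head = nn.Linear(width, 1, bias=False)
nn.init.normal_(head.weight, std=width ** -0.5)
net = nn.Sequential(*layers, head)

opt = torch.optim.SGD(net.parameters(), lr=0.05)  # plain GD: one step per full pass over the data
for _ in range(300):
    opt.zero_grad()
    loss = F.soft_margin_loss(net(X).squeeze(-1), y)  # logistic-type loss on +/-1 labels
    loss.backward()
    opt.step()
print("training loss after GD:", loss.item())
```

    The paper's bound is algorithm-dependent, i.e. it tracks this kind of initialization-plus-GD trajectory rather than applying a uniform-convergence argument over the whole hypothesis class.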

    Convergence Theory of Learning Over-parameterized ResNet: A Full Characterization

    The ResNet structure has achieved great empirical success since its debut. Recent work established the convergence of learning over-parameterized ResNet with a scaling factor $\tau = 1/L$ on the residual branch, where $L$ is the network depth. However, it is not clear how learning ResNet behaves for other values of $\tau$. In this paper, we fully characterize the convergence theory of gradient descent for learning over-parameterized ResNet with different values of $\tau$. Specifically, hiding logarithmic factors and constant coefficients, we show that for $\tau \le 1/\sqrt{L}$ gradient descent is guaranteed to converge to the global minima, and in particular, when $\tau \le 1/L$ the convergence is independent of the network depth. Conversely, we show that for $\tau > L^{-\frac{1}{2}+c}$ the forward output grows at least at rate $L^c$ in expectation, and learning then fails because of gradient explosion for large $L$. This means the bound $\tau \le 1/\sqrt{L}$ is sharp for learning ResNet with arbitrary depth. To the best of our knowledge, this is the first work that studies learning ResNet with the full range of $\tau$. Comment: 31 pages
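
    The threshold described above is easy to probe numerically. The toy sketch below (a simplification under stated assumptions, not the paper's experiment) stacks $L$ residual blocks $x \leftarrow x + \tau\,\mathrm{ReLU}(Wx)$ with random Gaussian weights and compares the forward output norm for $\tau = 1/L$, $\tau = 1/\sqrt{L}$, and a super-critical $\tau = L^{-0.2}$; the width, depth, and single-layer residual branch are illustrative choices.

```python
# Hypothetical forward-pass experiment: how the output norm of a depth-L chain of
# scaled residual blocks x <- x + tau * relu(W x) depends on the scaling factor tau.
import numpy as np

rng = np.random.default_rng(0)
d, L = 64, 1000                                   # width and depth (illustrative)
x0 = rng.normal(size=d) / np.sqrt(d)              # roughly unit-norm input

def forward_norm(tau):
    x = x0.copy()
    for _ in range(L):
        W = rng.normal(size=(d, d)) / np.sqrt(d)  # Gaussian weights with 1/sqrt(d) scaling
        x = x + tau * np.maximum(W @ x, 0.0)      # residual branch scaled by tau
    return np.linalg.norm(x)

for tau in (1.0 / L, 1.0 / np.sqrt(L), L ** -0.2):
    print(f"tau = {tau:.4f}  ->  ||x_L|| = {forward_norm(tau):.3e}")
```

    In this toy setting the output norm should stay of order one for the two sub-critical choices of $\tau$, while the super-critical choice grows rapidly with depth, mirroring the dichotomy stated in the abstract.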