3,479 research outputs found

    On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport

    Many tasks in machine learning and signal processing can be solved by minimizing a convex function of a measure. This includes sparse spikes deconvolution or training a neural network with a single hidden layer. For these problems, we study a simple minimization method: the unknown measure is discretized into a mixture of particles and a continuous-time gradient descent is performed on their weights and positions. This is an idealization of the usual way to train neural networks with a large hidden layer. We show that, when initialized correctly and in the many-particle limit, this gradient flow, although non-convex, converges to global minimizers. The proof involves Wasserstein gradient flows, a by-product of optimal transport theory. Numerical experiments show that this asymptotic behavior is already at play for a reasonable number of particles, even in high dimension. Comment: Advances in Neural Information Processing Systems (NIPS), Dec 2018, Montréal, Canada
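
    As a rough illustration of the method described in this abstract (a minimal sketch under stated assumptions, not the authors' code), the snippet below discretizes the unknown measure into m weighted particles and runs plain gradient descent jointly on their weights and positions, which amounts to training a one-hidden-layer ReLU network in the mean-field normalization. The toy target, the number of particles, and the step size (rescaled by m because per-particle gradients are O(1/m)) are illustrative assumptions.

```python
# Hypothetical sketch of particle gradient descent on a measure:
# the measure is discretized as (1/m) * sum_j c_j * delta_{a_j}, giving the
# one-hidden-layer ReLU network f(x) = (1/m) * sum_j c_j * relu(a_j . x),
# and gradient descent updates both weights c_j and positions a_j.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 200, 2, 500                           # samples, input dimension, number of particles
X = rng.normal(size=(n, d))
y = np.maximum(X @ np.array([1.0, -1.0]), 0.0)  # toy target: a single ReLU neuron (an assumption)

a = rng.normal(size=(m, d))                     # particle positions
c = 0.1 * rng.normal(size=m)                    # particle weights

lr, n_steps = 0.2 * m, 2000                     # step size rescaled by m (per-particle gradients are O(1/m))
for _ in range(n_steps):
    pre = X @ a.T                               # (n, m) pre-activations
    act = np.maximum(pre, 0.0)                  # ReLU features of each particle
    r = (act @ c / m - y) / n                   # scaled residuals of the mixture output
    grad_c = act.T @ r / m                      # gradient of the squared loss w.r.t. weights
    grad_a = ((pre > 0) * (r[:, None] * c[None, :] / m)).T @ X  # gradient w.r.t. positions
    c -= lr * grad_c
    a -= lr * grad_a

print("final squared loss:",
      0.5 * np.mean((np.maximum(X @ a.T, 0.0) @ c / m - y) ** 2))
```

    The abstract's result concerns the idealized continuous-time, many-particle limit of this non-convex dynamic; the sketch only mirrors the discretized, finite-m version.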

    Generalization Error Bounds of Gradient Descent for Learning Over-parameterized Deep ReLU Networks

    Empirical studies show that gradient-based methods can learn deep neural networks (DNNs) with very good generalization performance in the over-parameterization regime, where DNNs can easily fit a random labeling of the training data. Very recently, a line of work explains in theory that with over-parameterization and proper random initialization, gradient-based methods can find the global minima of the training loss for DNNs. However, existing generalization error bounds are unable to explain the good generalization performance of over-parameterized DNNs. The major limitation of most existing generalization bounds is that they are based on uniform convergence and are independent of the training algorithm. In this work, we derive an algorithm-dependent generalization error bound for deep ReLU networks, and show that under certain assumptions on the data distribution, gradient descent (GD) with proper random initialization is able to train a sufficiently over-parameterized DNN to achieve arbitrarily small generalization error. Our work sheds light on explaining the good generalization performance of over-parameterized deep neural networks. Comment: 27 pages. This version simplifies the proof and improves the presentation of Version 3. In AAAI 2020
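
    The bound discussed above is attached to a particular training scheme: a sufficiently wide deep ReLU network with Gaussian random initialization, trained by full-batch gradient descent. The sketch below is a hedged illustration of such a scheme, not the paper's code; the width, depth, initialization scale, loss, and step size are assumptions chosen only for the example.

```python
# Hypothetical setup: an over-parameterized deep ReLU network with Gaussian
# ("proper") random initialization, trained by full-batch gradient descent.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
n, d, width, depth = 128, 10, 1024, 3          # samples, input dim, hidden width, hidden layers
X = torch.randn(n, d)
y = (X[:, 0] > 0).float() * 2 - 1              # toy +/-1 labels (an assumption)

layers, fan_in = [], d
for _ in range(depth):
    lin = nn.Linear(fan_in, width, bias=False)
    nn.init.normal_(lin.weight, std=(2.0 / fan_in) ** 0.5)  # He-style Gaussian initialization
    layers += [lin, nn.ReLU()]
    fan_in = width
head = nn.Linear(width, 1, bias=False)
nn.init.normal_(head.weight, std=width ** -0.5)
net = nn.Sequential(*layers, head)

opt = torch.optim.SGD(net.parameters(), lr=0.05)  # plain GD: one step per full pass over the data
for _ in range(300):
    opt.zero_grad()
    loss = F.soft_margin_loss(net(X).squeeze(-1), y)  # logistic-type loss on +/-1 labels
    loss.backward()
    opt.step()
print("training loss after GD:", loss.item())
```

    The paper's bound is algorithm-dependent, i.e. it tracks this kind of initialization-plus-GD trajectory rather than applying a uniform-convergence argument over the whole hypothesis class.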

    Convergence Theory of Learning Over-parameterized ResNet: A Full Characterization

    The ResNet structure has achieved great empirical success since its debut. Recent work established the convergence of learning over-parameterized ResNet with a scaling factor $\tau = 1/L$ on the residual branch, where $L$ is the network depth. However, it is not clear how learning ResNet behaves for other values of $\tau$. In this paper, we fully characterize the convergence theory of gradient descent for learning over-parameterized ResNet with different values of $\tau$. Specifically, hiding logarithmic factors and constant coefficients, we show that for $\tau \le 1/\sqrt{L}$ gradient descent is guaranteed to converge to the global minima, and in particular, when $\tau \le 1/L$ the convergence is independent of the network depth. Conversely, we show that for $\tau > L^{-\frac{1}{2}+c}$ the forward output grows at least at rate $L^c$ in expectation, and learning then fails because of gradient explosion for large $L$. This means the bound $\tau \le 1/\sqrt{L}$ is sharp for learning ResNet with arbitrary depth. To the best of our knowledge, this is the first work that studies learning ResNet with the full range of $\tau$. Comment: 31 pages
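
    The threshold described above is easy to probe numerically. The toy sketch below (a simplification under stated assumptions, not the paper's experiment) stacks $L$ residual blocks $x \leftarrow x + \tau\,\mathrm{ReLU}(Wx)$ with random Gaussian weights and compares the forward output norm for $\tau = 1/L$, $\tau = 1/\sqrt{L}$, and a super-critical $\tau = L^{-0.2}$; the width, depth, and single-layer residual branch are illustrative choices.

```python
# Hypothetical forward-pass experiment: how the output norm of a depth-L chain of
# scaled residual blocks x <- x + tau * relu(W x) depends on the scaling factor tau.
import numpy as np

rng = np.random.default_rng(0)
d, L = 64, 1000                                   # width and depth (illustrative)
x0 = rng.normal(size=d) / np.sqrt(d)              # roughly unit-norm input

def forward_norm(tau):
    x = x0.copy()
    for _ in range(L):
        W = rng.normal(size=(d, d)) / np.sqrt(d)  # Gaussian weights with 1/sqrt(d) scaling
        x = x + tau * np.maximum(W @ x, 0.0)      # residual branch scaled by tau
    return np.linalg.norm(x)

for tau in (1.0 / L, 1.0 / np.sqrt(L), L ** -0.2):
    print(f"tau = {tau:.4f}  ->  ||x_L|| = {forward_norm(tau):.3e}")
```

    In this toy setting the output norm should stay of order one for the two sub-critical choices of $\tau$, while the super-critical choice grows rapidly with depth, mirroring the dichotomy stated in the abstract.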