
    Convergence Rates of Gradient Methods for Convex Optimization in the Space of Measures

    We study the convergence rate of Bregman gradient methods for convex optimization in the space of measures on a $d$-dimensional manifold. Under basic regularity assumptions, we show that the suboptimality gap at iteration $k$ is in $O(\log(k)\,k^{-1})$ for multiplicative updates, while it is in $O(k^{-q/(d+q)})$ for additive updates for some $q \in \{1, 2, 4\}$ determined by the structure of the objective function. Our flexible proof strategy, based on approximation arguments, allows us to painlessly cover all Bregman Proximal Gradient Methods (PGM) and their acceleration (APGM) under various geometries such as the hyperbolic entropy and $L^p$ divergences. We also prove the tightness of our analysis with matching lower bounds and confirm the theoretical results with numerical experiments on low-dimensional problems. Note that all these optimization methods must additionally pay the computational cost of discretization, which can be exponential in $d$.
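As a toy illustration of the multiplicative-update regime discussed in this abstract, the sketch below runs entropic mirror-descent (multiplicative-weights) steps on a measure discretized over a fixed grid, one common instance of a Bregman step with the entropy geometry. The quadratic objective, grid size, and step size `eta` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def multiplicative_update(weights, grad, eta):
    """One entropic mirror-descent step on the probability simplex:
    multiplicative update followed by renormalization."""
    w = weights * np.exp(-eta * grad)
    return w / w.sum()

# Toy example: minimize F(w) = 0.5 * ||A w - b||^2 over probability measures
# supported on a fixed grid (a crude discretization, whose cost grows
# exponentially with the dimension d, as noted at the end of the abstract).
rng = np.random.default_rng(0)
n_grid = 200
A = rng.standard_normal((50, n_grid))
b = rng.standard_normal(50)

w = np.full(n_grid, 1.0 / n_grid)   # uniform initialization on the grid
eta = 0.05
for k in range(1000):
    grad = A.T @ (A @ w - b)        # gradient of F at the current measure
    w = multiplicative_update(w, grad, eta)
```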

    On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport

    Many tasks in machine learning and signal processing can be solved by minimizing a convex function of a measure. This includes sparse spikes deconvolution or training a neural network with a single hidden layer. For these problems, we study a simple minimization method: the unknown measure is discretized into a mixture of particles and a continuous-time gradient descent is performed on their weights and positions. This is an idealization of the usual way to train neural networks with a large hidden layer. We show that, when initialized correctly and in the many-particle limit, this gradient flow, although non-convex, converges to global minimizers. The proof involves Wasserstein gradient flows, a by-product of optimal transport theory. Numerical experiments show that this asymptotic behavior is already at play for a reasonable number of particles, even in high dimension. (Comment: Advances in Neural Information Processing Systems (NIPS), Dec 2018, Montréal, Canada.)
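The following is a minimal sketch, under illustrative assumptions (two-layer ReLU regression with squared loss and mean-field 1/m scaling), of the particle discretization described above: the unknown measure is represented by m particles and the continuous-time gradient flow is approximated by plain gradient steps on their weights and positions. It is not the paper's code, only the general recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task (illustrative only): fit y with a two-layer ReLU
# network in mean-field scaling, f(x) = (1/m) * sum_j a_j * relu(w_j @ x).
n, d, m = 200, 5, 500
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0])

a = rng.standard_normal(m)          # particle weights
W = rng.standard_normal((m, d))     # particle positions

lr = 0.1
for step in range(2000):
    pre = X @ W.T                   # (n, m) pre-activations
    act = np.maximum(pre, 0.0)      # ReLU
    resid = act @ a / m - y         # residuals of the current predictor
    # Gradients of the mean squared error w.r.t. weights and positions.
    grad_a = act.T @ resid / (n * m)
    grad_W = ((resid[:, None] * (pre > 0)) * a[None, :]).T @ X / (n * m)
    a -= lr * m * grad_a            # the factor m keeps per-particle steps O(1)
    W -= lr * m * grad_W
```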

    Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss

    Neural networks trained to minimize the logistic (a.k.a. cross-entropy) loss with gradient-based methods are observed to perform well in many supervised classification tasks. Towards understanding this phenomenon, we analyze the training and generalization behavior of infinitely wide two-layer neural networks with homogeneous activations. We show that the limits of the gradient flow on exponentially tailed losses can be fully characterized as a max-margin classifier in a certain non-Hilbertian space of functions. In the presence of hidden low-dimensional structures, the resulting margin is independent of the ambient dimension, which leads to strong generalization bounds. In contrast, training only the output layer implicitly solves a kernel support vector machine, which a priori does not enjoy such adaptivity. Our analysis of training is non-quantitative in terms of running time, but we prove computational guarantees in simplified settings by showing equivalences with online mirror descent. Finally, numerical experiments suggest that our analysis describes well the practical behavior of two-layer neural networks with ReLU activation and confirm the statistical benefits of this implicit bias.
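As a schematic illustration (not a statement copied from the paper), the max-margin characterization mentioned above has the general form of margin maximization over the unit ball of a non-Hilbertian function norm, for instance a variation-type norm associated with the hidden units:

```latex
% Schematic max-margin problem over the unit ball of a function norm
% \|\cdot\|_{\mathcal{F}}; the specific non-Hilbertian norm analyzed in the
% paper is only alluded to here (assumption for illustration).
\max_{\|f\|_{\mathcal{F}} \le 1} \; \min_{1 \le i \le n} \; y_i \, f(x_i)
```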

    On the Effect of Initialization: The Scaling Path of 2-Layer Neural Networks

    In supervised learning, the regularization path is sometimes used as a convenient theoretical proxy for the optimization path of gradient descent initialized with zero. In this paper, we study a modification of the regularization path for infinite-width 2-layer ReLU neural networks with non-zero initial distribution of the weights at different scales. By exploiting a link with unbalanced optimal transport theory, we show that, despite the non-convexity of the 2-layer network training, this problem admits an infinite dimensional convex counterpart. We formulate the corresponding functional optimization problem and investigate its main properties. In particular, we show that as the scale of the initialization ranges between $0$ and $+\infty$, the associated path interpolates continuously between the so-called kernel and rich regimes. The numerical experiments confirm that, in our setting, the scaling path and the final states of the optimization path behave similarly even beyond these extreme points.
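As a toy numerical companion (not the paper's experiment), the sketch below trains the same two-layer ReLU network from initial weights multiplied by a scale tau and records how far the hidden weights travel relative to their starting point; features that barely move are the hallmark of the kernel (lazy) regime, while large relative movement signals the rich regime. The data, architecture, step size, and scales are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy classification-style regression data (illustrative only).
n, d, m = 100, 3, 256
X = rng.standard_normal((n, d))
y = np.sign(X[:, 0])

def relative_feature_movement(tau, lr=0.1, steps=5000):
    """Train f(x) = (1/m) * sum_j a_j * relu(w_j @ x) by gradient descent on
    the squared loss, starting from weights drawn at scale tau; return how far
    the hidden weights W move relative to their initialization."""
    W0 = tau * rng.standard_normal((m, d))
    a0 = tau * rng.standard_normal(m)
    W, a = W0.copy(), a0.copy()
    for _ in range(steps):
        pre = X @ W.T                       # (n, m) pre-activations
        act = np.maximum(pre, 0.0)          # ReLU
        resid = act @ a / m - y             # prediction residuals
        grad_a = act.T @ resid / (n * m)
        grad_W = ((resid[:, None] * (pre > 0)) * a[None, :]).T @ X / (n * m)
        a -= lr * grad_a
        W -= lr * grad_W
    return np.linalg.norm(W - W0) / np.linalg.norm(W0)

for tau in [0.01, 0.1, 1.0, 10.0]:
    print(f"init scale {tau:5.2f} -> relative movement of W: "
          f"{relative_feature_movement(tau):.3f}")
```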
