Convergence Rates of Gradient Methods for Convex Optimization in the Space of Measures
We study the convergence rate of Bregman gradient methods for convex
optimization in the space of measures on a d-dimensional manifold. Under
basic regularity assumptions, we show that the suboptimality gap at iteration
k decays at an explicit rate for multiplicative updates and at a different
explicit rate for additive updates, with an exponent determined by the
structure of the objective function. Our flexible proof strategy, based
on approximation arguments, allows us to painlessly cover all Bregman Proximal
Gradient Methods (PGM) and their acceleration (APGM) under various geometries
such as the hyperbolic entropy and related divergences. We also prove the
tightness of our analysis with matching lower bounds and confirm the
theoretical results with numerical experiments on low-dimensional problems.
Note that all of these optimization methods must additionally pay the
computational cost of discretization, which can be exponential in the dimension.
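A minimal numerical sketch of the multiplicative-update scheme described above, under assumptions not taken from the paper: the measure is discretized on a fixed grid, the objective is a least-squares functional, and the grid size, step size, and features are arbitrary illustrative choices.

```python
# Minimal sketch (illustrative choices throughout, not the paper's code):
# entropic mirror descent with multiplicative updates on a measure discretized
# over a fixed grid, minimizing the convex objective F(mu) = 0.5 * ||A mu - y||^2.
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 50                                  # grid points, observations
grid = np.linspace(0.0, 1.0, n)
A = np.exp(-((rng.uniform(size=(m, 1)) - grid[None, :]) ** 2) / 0.01)
target = np.exp(-((grid - 0.3) ** 2) / 0.005)
target /= target.sum()                          # ground-truth probability measure
y = A @ target                                  # synthetic observations

mu = np.full(n, 1.0 / n)                        # initial measure: uniform on the grid
eta = 0.5                                       # step size (arbitrary)
for k in range(1000):
    g = A.T @ (A @ mu - y)                      # first variation of F at mu, on the grid
    mu = mu * np.exp(-eta * (g - g.min()))      # multiplicative update; shift is for
    mu /= mu.sum()                              # numerical stability, then renormalize
print("objective:", 0.5 * np.linalg.norm(A @ mu - y) ** 2)
```

The grid discretization here is exactly the step whose cost the abstract warns can grow exponentially with the dimension.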
On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport
Many tasks in machine learning and signal processing can be solved by
minimizing a convex function of a measure. This includes sparse spikes
deconvolution or training a neural network with a single hidden layer. For
these problems, we study a simple minimization method: the unknown measure is
discretized into a mixture of particles and a continuous-time gradient descent
is performed on their weights and positions. This is an idealization of the
usual way to train neural networks with a large hidden layer. We show that,
when initialized correctly and in the many-particle limit, this gradient flow,
although non-convex, converges to global minimizers. The proof involves
Wasserstein gradient flows, a by-product of optimal transport theory. Numerical
experiments show that this asymptotic behavior is already at play for a
reasonable number of particles, even in high dimension.Comment: Advances in Neural Information Processing Systems (NIPS), Dec 2018,
Montr\'eal, Canad
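A minimal sketch of the particle scheme the abstract describes, with assumptions of my own: a toy sparse-deconvolution objective with a Gaussian feature map, discrete gradient descent in place of the continuous-time flow, and arbitrary choices of particle count, step size, and kernel width.

```python
# Minimal sketch (not the authors' implementation): the unknown measure is
# discretized into p weighted particles and gradient descent is run jointly on
# their weights and positions for F(mu) = 0.5 * ||Phi(mu) - y||^2.
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 64)                                    # observation grid
phi = lambda x: np.exp(-((t[:, None] - x[None, :]) ** 2) / 0.005)

true_pos = np.array([0.25, 0.7])                                 # two ground-truth spikes
true_w = np.array([1.0, -0.5])
y = phi(true_pos) @ true_w                                       # synthetic measurements

p = 100                                                          # over-parameterization
pos = rng.uniform(size=p)                                        # particle positions
w = np.zeros(p)                                                  # particle weights (signed)
lr = 0.005                                                       # discretizes the gradient flow
for k in range(20000):
    features = phi(pos)                                          # (64, p)
    residual = features @ w - y                                  # Phi(mu) - y
    grad_w = features.T @ residual                               # gradient w.r.t. weights
    dphi = features * (2.0 * (t[:, None] - pos[None, :]) / 0.005)
    grad_pos = w * (dphi.T @ residual)                           # gradient w.r.t. positions
    w -= lr * grad_w
    pos -= lr * grad_pos
print("final objective:", 0.5 * np.sum((phi(pos) @ w - y) ** 2))
```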
Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss
Neural networks trained to minimize the logistic (a.k.a. cross-entropy) loss with gradient-based methods are observed to perform well in many supervised classification tasks. Towards understanding this phenomenon, we analyze the training and generalization behavior of infinitely wide two-layer neural networks with homogeneous activations. We show that the limits of the gradient flow on exponentially tailed losses can be fully characterized as a max-margin classifier in a certain non-Hilbertian space of functions. In the presence of hidden low-dimensional structures, the resulting margin is independent of the ambient dimension, which leads to strong generalization bounds. In contrast, training only the output layer implicitly solves a kernel support vector machine, which a priori does not enjoy such adaptivity. Our analysis of training is non-quantitative in terms of running time but we prove computational guarantees in simplified settings by showing equivalences with online mirror descent. Finally, numerical experiments suggest that our analysis describes well the practical behavior of two-layer neural networks with ReLU activation and confirm the statistical benefits of this implicit bias.
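A minimal sketch of the kind of experiment the abstract refers to, under assumptions of my own (not the paper's setup): a wide two-layer ReLU network trained by full-batch gradient descent with the logistic loss on a separable toy problem, while monitoring the normalized margin that the analysis predicts should grow toward a max-margin solution. Width, step size, and data are arbitrary illustrative choices.

```python
# Minimal sketch (assumed toy setup): gradient descent on a wide two-layer ReLU
# network with the logistic loss, tracking the normalized margin over training.
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 2))
s = X[:, 0] + 0.5 * X[:, 1]
X, y = X[np.abs(s) > 0.3], np.sign(s[np.abs(s) > 0.3])    # separable labels with a gap
n, m = X.shape[0], 500                                     # samples, hidden width

W = rng.normal(size=(m, 2)) / np.sqrt(2)                   # hidden-layer weights
a = rng.normal(size=m) / np.sqrt(m)                        # output-layer weights
lr = 0.02
for k in range(5001):
    pre = X @ W.T                                          # (n, m) pre-activations
    h = np.maximum(pre, 0.0)                               # ReLU features
    f = h @ a                                              # network outputs
    if k % 1000 == 0:
        scale = np.linalg.norm(a) ** 2 + np.linalg.norm(W) ** 2   # 2-homogeneous network
        print(f"iter {k:5d}  normalized margin {np.min(y * f) / scale:+.4f}")
    g = -y * expit(-y * f)                                 # derivative of the logistic loss
    grad_a = (h.T @ g) / n
    grad_W = ((g[:, None] * (pre > 0)) * a[None, :]).T @ X / n
    a -= lr * grad_a
    W -= lr * grad_W
```

The margin is normalized by the squared parameter norm because the network is 2-homogeneous in its weights; the paper's characterization concerns the limit of this quantity along the gradient flow.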
On the Effect of Initialization: The Scaling Path of 2-Layer Neural Networks
In supervised learning, the regularization path is sometimes used as a
convenient theoretical proxy for the optimization path of gradient descent
initialized with zero. In this paper, we study a modification of the
regularization path for infinite-width 2-layer ReLU neural networks with
non-zero initial distribution of the weights at different scales. By exploiting
a link with unbalanced optimal transport theory, we show that, despite the
non-convexity of the 2-layer network training, this problem admits an infinite
dimensional convex counterpart. We formulate the corresponding functional
optimization problem and investigate its main properties. In particular, we
show that as the scale of the initialization ranges between 0 and +∞,
the associated path interpolates continuously between the so-called kernel and
rich regimes. The numerical experiments confirm that, in our setting, the
scaling path and the final states of the optimization path behave similarly
even beyond these extreme points.
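A minimal sketch illustrating the two regimes the abstract contrasts, with assumptions not taken from the paper: the same wide two-layer ReLU network is trained from initializations of different scales tau on a toy 1-D regression task, and we record how far the weights travel relative to their initial norm (little relative movement is characteristic of the kernel, or lazy, regime; large movement of the rich regime). Width, step sizes, and the target function are illustrative choices.

```python
# Minimal sketch (assumed setup): vary the initialization scale tau and measure
# the relative displacement of the weights after training with squared loss.
import numpy as np

rng = np.random.default_rng(3)
n, m = 50, 500
x = np.linspace(-1.0, 1.0, n)[:, None]
y = np.abs(x[:, 0]) - 0.5                          # simple 1-D regression target

for tau in [0.01, 0.1, 1.0, 10.0]:
    W0 = tau * rng.normal(size=(m, 1))             # initialization at scale tau
    a0 = tau * rng.normal(size=m) / np.sqrt(m)
    W, a = W0.copy(), a0.copy()
    lr = 0.01 / max(tau ** 2, 1.0)                 # crude step-size scaling for stability
    for k in range(20000):
        pre = x @ W.T
        h = np.maximum(pre, 0.0)
        r = h @ a - y                              # residual of the squared loss
        grad_a = h.T @ r / n
        grad_W = ((r[:, None] * (pre > 0)) * a[None, :]).T @ x / n
        a -= lr * grad_a
        W -= lr * grad_W
    move = np.sqrt(np.linalg.norm(W - W0) ** 2 + np.linalg.norm(a - a0) ** 2)
    init = np.sqrt(np.linalg.norm(W0) ** 2 + np.linalg.norm(a0) ** 2)
    print(f"tau = {tau:6.2f}   relative movement of the weights = {move / init:.3f}")
```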