30 research outputs found
Convergence analysis of a proximal Gauss-Newton method
An extension of the Gauss-Newton algorithm is proposed to find local
minimizers of penalized nonlinear least squares problems, under generalized
Lipschitz assumptions. Local convergence results are obtained, as well
as an estimate of the radius of the convergence ball. Some applications for
solving constrained nonlinear equations are discussed and the numerical
performance of the method is assessed on some significant test problems.
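The abstract gives no pseudocode; as a minimal sketch, the loop below specializes the proximal Gauss-Newton step to a smooth squared-l2 penalty, for which the penalized linearized subproblem has a closed-form solution. The residual `F`, Jacobian `Jac`, penalty weight `lam`, and test problem are illustrative choices, not the paper's general setting of nonsmooth penalties and generalized Lipschitz assumptions:

```python
import numpy as np

def prox_gauss_newton(F, J, x0, lam=1e-3, iters=20):
    """Sketch: minimize 0.5*||F(x)||^2 + (lam/2)*||x||^2.

    Each iteration linearizes F at x and solves the penalized linear
    least-squares subproblem exactly, which is the proximal Gauss-Newton
    step for this particular (smooth) penalty.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        r, Jx = F(x), J(x)
        # d minimizes 0.5*||r + Jx d||^2 + (lam/2)*||x + d||^2
        A = Jx.T @ Jx + lam * np.eye(x.size)
        d = np.linalg.solve(A, -(Jx.T @ r + lam * x))
        x = x + d
    return x

# toy zero-residual system: F(x) = (x0^2 + x1 - 1, x0 - x1^2)
F = lambda x: np.array([x[0]**2 + x[1] - 1.0, x[0] - x[1]**2])
Jac = lambda x: np.array([[2*x[0], 1.0], [1.0, -2*x[1]]])
x_hat = prox_gauss_newton(F, Jac, [2.0, 2.0])
print(np.linalg.norm(F(x_hat)))  # small residual near a penalized minimizer
```

For a nonsmooth penalty such as the l1 norm the subproblem no longer has a closed form and must itself be solved approximately, which is where the local analysis of the paper becomes relevant.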
Sinkhorn Barycenters with Free Support via Frank-Wolfe Algorithm
We present a novel algorithm to estimate the barycenter of arbitrary
probability distributions with respect to the Sinkhorn divergence. Based on a
Frank-Wolfe optimization strategy, our approach proceeds by populating the
support of the barycenter incrementally, without requiring any pre-allocation.
We consider discrete as well as continuous distributions, proving convergence
rates of the proposed algorithm in both settings. Key elements of our analysis
are a new result showing that the Sinkhorn divergence on compact domains has
Lipschitz continuous gradient with respect to the Total Variation and a
characterization of the sample complexity of Sinkhorn potentials. Experiments
validate the effectiveness of our method in practice.
Comment: 46 pages, 8 figures
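As background for the divergence used above, here is a minimal numpy sketch of the Sinkhorn fixed-point iterations for entropic-regularized optimal transport between two fixed discrete distributions. It is not the free-support Frank-Wolfe barycenter scheme itself, and the grid, weights, and regularization `eps` are illustrative:

```python
import numpy as np

def sinkhorn_plan(a, b, C, eps=0.5, iters=200):
    """Sinkhorn iterations: alternately rescale u and v so that the plan
    P = diag(u) K diag(v), with K = exp(-C/eps), matches the marginals a, b."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

# two discrete distributions supported on a grid in [0, 1]
x = np.linspace(0.0, 1.0, 5)
a = np.full(5, 0.2)
b = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
C = (x[:, None] - x[None, :]) ** 2        # squared-distance cost
P = sinkhorn_plan(a, b, C)
print(P.sum(axis=1), P.sum(axis=0))       # marginals approach (a, b)
```

The Sinkhorn divergence of the paper is a debiased combination of such entropic OT values; the barycenter algorithm then adds one support point per Frank-Wolfe iteration.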
Convergence Properties of Stochastic Hypergradients
Bilevel optimization problems are receiving increasing attention in machine learning as they provide a natural framework for hyperparameter optimization and meta-learning. A key step to tackle these problems is the efficient computation of the gradient of the upper-level objective (hypergradient). In this work, we study stochastic approximation schemes for the hypergradient, which are important when the lower-level problem is empirical risk minimization on a large dataset. The method that we propose is a stochastic variant of the approximate implicit differentiation approach in (Pedregosa, 2016). We provide bounds for the mean square error of the hypergradient approximation, under the assumption that the lower-level problem is accessible only through a stochastic mapping which is a contraction in expectation. In particular, our main bound is agnostic to the choice of the two stochastic solvers employed by the procedure. We provide numerical experiments to support our theoretical analysis and to show the advantage of using stochastic hypergradients in practice.
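A scalar toy version of this setup can be sketched as follows: the lower-level problem is solved through a stochastic map that is a contraction in expectation, the adjoint linear system is solved by noisy fixed-point iterations, and the resulting stochastic hypergradient is compared with the closed-form value. All constants, noise levels, and iteration counts are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, c, eta = 0.5, 2.0, 0.1                 # hyperparameter, data mean, stepsize
noise = lambda: rng.normal(0.0, 0.01)

# lower level: w(lam) = argmin_w 0.5*(w - c)^2 + (lam/2)*w^2, solved with the
# stochastic map T(w) = w - eta*((w - c_i) + lam*w), a contraction in expectation
w = 0.0
for _ in range(2000):
    w -= eta * ((w - (c + noise())) + lam * w)

# adjoint: v solves (1 - dT/dw) v = df/dw for the upper level f(w) = 0.5*w^2,
# again via stochastic fixed-point iterations
v = 0.0
for _ in range(2000):
    v = (1.0 - eta * (1.0 + lam + noise())) * v + w

# implicit-differentiation hypergradient: (dT/dlam) * v with dT/dlam = -eta*w
hg = -eta * w * v
print(hg, -c**2 / (1 + lam)**3)             # stochastic estimate vs exact value
```

Here the exact hypergradient is available in closed form (w(lam) = c/(1+lam), so df/dlam = -c^2/(1+lam)^3), which makes the quality of the stochastic approximation easy to check.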
Batch Greenkhorn Algorithm for Entropic-Regularized Multimarginal Optimal Transport: Linear Rate of Convergence and Iteration Complexity
In this work we propose a batch multimarginal version of the Greenkhorn algorithm for the entropic-regularized optimal transport problem. This framework is general enough to cover, as particular cases, the existing Sinkhorn and Greenkhorn algorithms for the bi-marginal setting, and greedy MultiSinkhorn for the general multimarginal case. We provide a comprehensive convergence analysis based on the properties of the iterative Bregman projections method with greedy control. A linear rate of convergence as well as explicit bounds on the iteration complexity are obtained. When specialized to the above-mentioned algorithms, our results give new convergence rates or provide key improvements over the state-of-the-art rates. We present numerical experiments showing that the flexibility of the batch size can be exploited to improve the performance of the Sinkhorn algorithm in both the bi-marginal and multimarginal settings.
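For the bi-marginal special case, the greedy scaling idea can be sketched as below. The greedy score used here is the plain marginal violation, a simplification of the greedy control analyzed in the paper, and the data are illustrative:

```python
import numpy as np

def greenkhorn(a, b, C, eps=0.5, iters=3000):
    """Greedy Sinkhorn (Greenkhorn): per iteration, rescale only the single
    row or column whose marginal currently violates its target the most."""
    K = np.exp(-C / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(iters):
        P = u[:, None] * K * v[None, :]
        row_gap = np.abs(P.sum(axis=1) - a)
        col_gap = np.abs(P.sum(axis=0) - b)
        i, j = row_gap.argmax(), col_gap.argmax()
        if row_gap[i] >= col_gap[j]:
            u[i] = a[i] / (K[i] @ v)      # restore row i's marginal exactly
        else:
            v[j] = b[j] / (K[:, j] @ u)   # restore column j's marginal exactly
    return u[:, None] * K * v[None, :]

x = np.linspace(0.0, 1.0, 5)
a = np.full(5, 0.2)
b = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
P = greenkhorn(a, b, (x[:, None] - x[None, :]) ** 2)
print(np.abs(P.sum(axis=1) - a).max(), np.abs(P.sum(axis=0) - b).max())
```

A batch version would rescale a whole block of rows or columns per iteration instead of a single one, which is the flexibility the abstract refers to.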
On the Iteration Complexity of Hypergradient Computation
We study a general class of bilevel problems, consisting in the minimization
of an upper-level objective which depends on the solution to a parametric
fixed-point equation. Important instances arising in machine learning include
hyperparameter optimization, meta-learning, and certain graph and recurrent
neural networks. Typically the gradient of the upper-level objective
(hypergradient) is hard or even impossible to compute exactly, which has raised
the interest in approximation methods. We investigate some popular approaches
to compute the hypergradient, based on reverse mode iterative differentiation
and approximate implicit differentiation. Under the hypothesis that the fixed
point equation is defined by a contraction mapping, we present a unified
analysis that, for the first time, allows a quantitative comparison of these
methods, providing explicit bounds on their iteration complexity. This
analysis suggests a hierarchy in terms of computational efficiency among the
above methods, with approximate implicit differentiation based on conjugate
gradient performing best. We present an extensive experimental comparison among
the methods, which confirms the theoretical findings.
Comment: accepted at ICML 2020; 19 pages, 4 figures; code at
https://github.com/prolearner/hypertorch (corrected typos and one reference)
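The two families of approximations can be sketched on a linear contraction fixed point, where the exact hypergradient is available in closed form. Below, ITD unrolls the fixed-point iterations and differentiates them in forward mode, while AID solves the adjoint linear system with conjugate gradient; the problem data and iteration counts are illustrative (see the repository above for the paper's actual implementations):

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 5, 0.3
Q = rng.normal(size=(n, n)); Q = Q @ Q.T / n        # PSD curvature
y, w_val = rng.normal(size=n), rng.normal(size=n)
H = Q + lam * np.eye(n)                             # lower-level Hessian
eta = 1.0 / np.linalg.eigvalsh(H)[-1]               # makes T a contraction

# ITD: unroll K steps of T(w) = w - eta*(H w - y), differentiating in lam
# (forward mode: dH/dlam = I, so dw_{k+1} = (I - eta*H) dw_k - eta*w_k)
K = 1000
w, dw = np.zeros(n), np.zeros(n)
for _ in range(K):
    w, dw = w - eta * (H @ w - y), (np.eye(n) - eta * H) @ dw - eta * w
g_itd = (w - w_val) @ dw              # upper level f(w) = 0.5*||w - w_val||^2

# AID: solve (I - dT/dw)^T v = grad f(w_K), i.e. eta*H v = w_K - w_val, by CG
def cg(A, b, iters=50, tol=1e-12):
    """Plain conjugate gradient for a symmetric positive definite A."""
    x = np.zeros_like(b); r = b - A @ x; p = r.copy(); rs = r @ r
    for _ in range(iters):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

v = cg(eta * H, w - w_val)
g_aid = (-eta * w) @ v                              # dT/dlam = -eta*w

ws = np.linalg.solve(H, y)                          # exact lower-level solution
g_exact = -(ws - w_val) @ np.linalg.solve(H, ws)    # closed-form hypergradient
print(g_itd, g_aid, g_exact)
```

In this linear setting both estimates converge to the exact value; the paper's bounds quantify how fast each does so for general contraction maps.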
Variance reduction techniques for stochastic proximal point algorithms
In the context of finite sums minimization, variance reduction techniques are
widely used to improve the performance of state-of-the-art stochastic gradient
methods. Their practical impact is clear, as well as their theoretical
properties. Stochastic proximal point algorithms have been studied as an
alternative to stochastic gradient algorithms since they are more stable with
respect to the choice of the stepsize, but a proper variance-reduced version is
missing. In this work, we propose the first study of variance reduction
techniques for stochastic proximal point algorithms. We introduce a stochastic
proximal version of SVRG, SAGA, and some of their variants for smooth and
convex functions. We provide several convergence results for the iterates and
the objective function values. In addition, under the Polyak-{\L}ojasiewicz
(PL) condition, we obtain linear convergence rates for the iterates and the
function values. Our numerical experiments demonstrate the advantages of the
proximal variance reduction methods over their gradient counterparts,
especially regarding stability with respect to the choice of the step size.
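As a rough illustration of the idea (not the paper's exact algorithm, stepsize rule, or rates), here is an SVRG-style variance-reduced stochastic proximal point sketch for a least-squares finite sum, where each per-sample proximal operator is available in closed form; the data, stepsize, and epoch lengths are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 3
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
b = A @ x_true                       # consistent system: every f_i vanishes at x_true

# finite sum F(x) = (1/n) * sum_i f_i(x), with f_i(x) = 0.5*(a_i^T x - b_i)^2
grad_i = lambda x, i: A[i] * (A[i] @ x - b[i])
full_grad = lambda x: A.T @ (A @ x - b) / n

def prox_fi(z, i, gamma):
    """Closed-form prox of gamma * f_i (a rank-one quadratic)."""
    return z - gamma * A[i] * (A[i] @ z - b[i]) / (1.0 + gamma * (A[i] @ A[i]))

gamma, x = 0.05, np.zeros(d)
for epoch in range(300):
    snap, g_snap = x.copy(), full_grad(x)           # SVRG-style snapshot
    for _ in range(n):
        i = rng.integers(n)
        # variance-reduced proximal point step: the current gradient of f_i
        # is handled implicitly by the prox; the snapshot terms correct the bias
        x = prox_fi(x - gamma * (g_snap - grad_i(snap, i)), i, gamma)
print(np.linalg.norm(x - x_true))    # distance to the minimizer
```

Compared with plain SVRG, the stochastic gradient of the sampled component is replaced by its proximal (implicit) step, which is what gives these methods their robustness to the stepsize choice.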