On the Iteration Complexity of Hypergradient Computation
We study a general class of bilevel problems, consisting in the minimization of an upper-level objective which depends on the solution to a parametric fixed-point equation. Important instances arising in machine learning include hyperparameter optimization, meta-learning, and certain graph and recurrent neural networks. Typically, the gradient of the upper-level objective (hypergradient) is hard or even impossible to compute exactly, which has raised interest in approximation methods. We investigate some popular approaches to compute the hypergradient, based on reverse-mode iterative differentiation and approximate implicit differentiation. Under the hypothesis that the fixed-point equation is defined by a contraction mapping, we present a unified analysis which, for the first time, allows these methods to be compared quantitatively, providing explicit bounds on their iteration complexity. This analysis suggests a hierarchy in terms of computational efficiency among the above methods, with approximate implicit differentiation based on conjugate gradient performing best. We present an extensive experimental comparison of the methods which confirms the theoretical findings.
Comment: accepted at ICML 2020; 19 pages, 4 figures; code at https://github.com/prolearner/hypertorch
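The abstract's best-performing method, approximate implicit differentiation with conjugate gradient (AID-CG), admits a compact implementation. The sketch below is illustrative only, not the hypertorch API: the names `phi` (the contraction map), `E` (the upper-level objective) and `hypergradient_aid_cg` are assumptions; parameters are assumed to be flat 1-D tensors, and the Jacobian of `phi` at the fixed point is assumed symmetric (as when `phi` is a gradient-descent step on the lower-level objective), which is what makes CG applicable.

```python
import torch

def hypergradient_aid_cg(phi, E, w_star, lam, K=20, tol=1e-8):
    """AID-CG sketch: solve (I - d(phi)/dw)^T v = grad_w E by CG,
    then return grad_lam E + (d(phi)/d(lam))^T v.  Here `w_star` is an
    (approximate) fixed point of `phi`; all names are illustrative."""
    lam = lam.detach().requires_grad_(True)
    w = w_star.detach().requires_grad_(True)

    grad_w_E, grad_lam_E = torch.autograd.grad(
        E(w, lam), [w, lam], allow_unused=True)
    if grad_lam_E is None:  # E may depend on lam only through w
        grad_lam_E = torch.zeros_like(lam)

    fp = phi(w, lam)  # one forward pass; every VJP below reuses its graph

    def Avp(u):
        # (I - d(phi)/dw)^T u via a vector-Jacobian product
        jtu = torch.autograd.grad(fp, w, grad_outputs=u, retain_graph=True)[0]
        return u - jtu

    # Conjugate gradient (assumes the symmetric case; substitute GMRES
    # when the Jacobian of phi is not symmetric)
    v = torch.zeros_like(grad_w_E)
    r = grad_w_E.clone()
    p = r.clone()
    rs = r.dot(r)
    for _ in range(K):
        Ap = Avp(p)
        alpha = rs / p.dot(Ap)
        v = v + alpha * p
        r = r - alpha * Ap
        rs_new = r.dot(r)
        if rs_new.sqrt() < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new

    # Indirect term: (d(phi)/d(lam))^T v
    indirect = torch.autograd.grad(fp, lam, grad_outputs=v)[0]
    return grad_lam_E + indirect
```

Note that both VJP calls reuse a single forward evaluation of `phi`, so each CG iteration costs one backward pass, which is what drives the favorable iteration complexity of this approach.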
Convergence Properties of Stochastic Hypergradients
Bilevel optimization problems are receiving increasing attention in machine learning as they provide a natural framework for hyperparameter optimization and meta-learning. A key step to tackle these problems is the efficient computation of the gradient of the upper-level objective (hypergradient). In this work, we study stochastic approximation schemes for the hypergradient, which are important when the lower-level problem is empirical risk minimization on a large dataset. The method that we propose is a stochastic variant of the approximate implicit differentiation approach in (Pedregosa, 2016). We provide bounds for the mean square error of the hypergradient approximation, under the assumption that the lower-level problem is accessible only through a stochastic mapping which is a contraction in expectation. In particular, our main bound is agnostic to the choice of the two stochastic solvers employed by the procedure. We provide numerical experiments to support our theoretical analysis and to show the advantage of using stochastic hypergradients in practice.
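A minimal sketch of this stochastic scheme, under the abstract's assumption that the lower level is accessed only through a minibatch map `phi(w, lam, batch)` that is a contraction in expectation (e.g. one SGD step on the lower-level empirical risk). The two loops play the role of the paper's two stochastic solvers; all names (`phi`, `E`, `batches_w`, `batches_v`) are illustrative assumptions, not the paper's notation.

```python
import torch

def stochastic_hypergradient(phi, E, w0, lam, batches_w, batches_v):
    """Stochastic approximate implicit differentiation (sketch).

    `phi(w, lam, batch)` is assumed to be a contraction in expectation;
    `batches_w` and `batches_v` feed the two stochastic solvers."""
    lam = lam.detach().requires_grad_(True)

    # Solver 1: stochastic fixed-point iterations towards w*(lam), no graph.
    w = w0.detach()
    with torch.no_grad():
        for b in batches_w:
            w = phi(w, lam, b)
    w = w.detach().requires_grad_(True)

    grad_w_E, grad_lam_E = torch.autograd.grad(
        E(w, lam), [w, lam], allow_unused=True)
    if grad_lam_E is None:
        grad_lam_E = torch.zeros_like(lam)

    # Solver 2: stochastic fixed-point iterations on the adjoint system
    # v = grad_w_E + (d(phi)/dw)^T v, one minibatch VJP per step.
    v = grad_w_E.clone()
    for b in batches_v:
        jtv = torch.autograd.grad(phi(w, lam, b), w, grad_outputs=v)[0]
        v = grad_w_E + jtv

    # Assemble grad_lam E + (d(phi)/d(lam))^T v on one more minibatch.
    indirect = torch.autograd.grad(phi(w, lam, batches_v[-1]), lam,
                                   grad_outputs=v)[0]
    return grad_lam_E + indirect
```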
Principled and Efficient Bilevel Optimization for Machine Learning
Automatic differentiation (AD) is a core element of most modern machine learning libraries; it allows derivatives of a function to be computed efficiently from the corresponding program. Thanks to AD, machine learning practitioners have tackled increasingly complex learning models, such as deep neural networks with up to hundreds of billions of parameters, which are learned using the derivative (or gradient) of a loss function with respect to those parameters. While in most cases gradients can be computed exactly and relatively cheaply, in others the exact computation is either impossible or too expensive, and AD must be used in combination with approximation methods. Some of these challenging scenarios, arising for example in meta-learning or hyperparameter optimization, can be framed as bilevel optimization problems, where the goal is to minimize an objective function that is evaluated by first solving another optimization problem, the lower-level problem. In this work, we study efficient gradient-based bilevel optimization algorithms for machine learning problems. In particular, we establish convergence rates for some simple approaches to approximate the gradient of the bilevel objective, namely the hypergradient, when the objective is smooth and the lower-level problem consists in finding the fixed point of a contraction map. Leveraging these results, we also prove that the projected inexact hypergradient method achieves a (near) optimal rate of convergence. We establish these results in both the deterministic and stochastic settings. Additionally, we provide an efficient implementation of the methods studied and perform several numerical experiments on hyperparameter optimization, meta-learning, data poisoning, and equilibrium models, which show that our theoretical results are good indicators of practical performance.
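The projected inexact hypergradient method mentioned above is, at its core, projected gradient descent driven by an approximate hypergradient. A minimal sketch, where `hypergrad(lam)` stands for any inexact hypergradient oracle (e.g. the AID-CG or stochastic routines sketched earlier), `project` is the Euclidean projection onto the feasible hyperparameter set, and the step size `eta` and all names are assumptions rather than the thesis' notation:

```python
import torch

def projected_hypergradient_descent(lam0, hypergrad, project,
                                    steps=100, eta=0.1):
    """Projected inexact hypergradient method (sketch)."""
    lam = lam0.detach().clone()
    for _ in range(steps):
        g = hypergrad(lam)                      # inexact hypergradient
        lam = project(lam - eta * g).detach()   # projected descent step
    return lam
```

For instance, keeping regularization strengths nonnegative corresponds to `project = lambda t: t.clamp_min(0.0)`.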
Analyzing Inexact Hypergradients for Bilevel Learning
Estimating hyperparameters has been a long-standing problem in machine learning. We consider the case where the task at hand is modeled as the solution to an optimization problem. Here the exact gradient with respect to the hyperparameters cannot be feasibly computed and approximate strategies are required. We introduce a unified framework for computing hypergradients that generalizes existing methods based on the implicit function theorem and automatic differentiation/backpropagation, showing that these two seemingly disparate approaches are actually tightly connected. Our framework is extremely flexible, allowing its subproblems to be solved with any suitable method, to any degree of accuracy. We derive a priori and computable a posteriori error bounds for all our methods, and numerically show that our a posteriori bounds are usually more accurate. Our numerical results also show that, surprisingly, for efficient bilevel optimization, the choice of hypergradient algorithm is at least as important as the choice of lower-level solver.
Comment: accepted to the IMA Journal of Applied Mathematics
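The automatic-differentiation side of this unified framework corresponds to reverse-mode iterative differentiation: unroll the lower-level iterations and backpropagate the upper-level objective through the whole trajectory. A minimal PyTorch sketch with illustrative names (`phi` is one lower-level iteration, `E` the upper-level objective, neither taken from the paper); the implicit-function-theorem side corresponds to the AID sketches above.

```python
import torch

def hypergradient_itd(phi, E, w0, lam, T=50):
    """Reverse-mode iterative differentiation (sketch).

    Unrolls T applications of the lower-level map `phi` and
    backpropagates `E` through the whole trajectory; memory
    therefore grows linearly with T."""
    lam = lam.detach().requires_grad_(True)
    w = w0.detach()
    for _ in range(T):
        w = phi(w, lam)  # the graph of the whole trajectory is retained
    return torch.autograd.grad(E(w, lam), lam)[0]
```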