93 research outputs found

    On the Iteration Complexity of Hypergradient Computation

    We study a general class of bilevel problems, consisting of the minimization of an upper-level objective which depends on the solution to a parametric fixed-point equation. Important instances arising in machine learning include hyperparameter optimization, meta-learning, and certain graph and recurrent neural networks. Typically the gradient of the upper-level objective (hypergradient) is hard or even impossible to compute exactly, which has raised interest in approximation methods. We investigate some popular approaches to compute the hypergradient, based on reverse-mode iterative differentiation and approximate implicit differentiation. Under the hypothesis that the fixed-point equation is defined by a contraction mapping, we present a unified analysis which, for the first time, allows these methods to be compared quantitatively, providing explicit bounds for their iteration complexity. This analysis suggests a hierarchy in terms of computational efficiency among the above methods, with approximate implicit differentiation based on conjugate gradient performing best. We present an extensive experimental comparison among the methods, which confirms the theoretical findings. Comment: accepted at ICML 2020; 19 pages, 4 figures; code at https://github.com/prolearner/hypertorch (corrected typos and one reference).
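
    To make the setting concrete, the sketch below computes an approximate hypergradient by approximate implicit differentiation with conjugate gradient (AID-CG) on a toy ridge-regression bilevel problem. The problem, data, step size and all variable names are illustrative assumptions for this listing, not the authors' implementation; their code is in the hypertorch repository linked above.

```python
# Minimal sketch of AID with conjugate gradient on an assumed toy problem:
# lower level = ridge regression solved by a contraction map (gradient steps),
# upper level = validation loss. Illustrative only, not the paper's code.
import numpy as np
from scipy.sparse.linalg import cg

rng = np.random.default_rng(0)
A_tr, b_tr = rng.normal(size=(50, 10)), rng.normal(size=50)    # training split
A_val, b_val = rng.normal(size=(30, 10)), rng.normal(size=30)  # validation split
lam = 0.5                                                      # hyperparameter
eta = 1.0 / (np.linalg.norm(A_tr, 2) ** 2 + lam)               # makes Phi a contraction

def phi(w):
    """One gradient step on the ridge objective: the fixed-point map Phi(w, lam)."""
    return w - eta * (A_tr.T @ (A_tr @ w - b_tr) + lam * w)

# 1) Approximate the lower-level fixed point w(lam) by iterating Phi.
w = np.zeros(10)
for _ in range(200):
    w = phi(w)

# 2) Gradient of the upper-level (validation) objective at the approximate fixed point.
grad_E = A_val.T @ (A_val @ w - b_val)

# 3) AID-CG: solve (I - d_w Phi)^T q = grad_E; here I - d_w Phi = eta * (A^T A + lam I).
M = eta * (A_tr.T @ A_tr + lam * np.eye(10))
q, _ = cg(M, grad_E, maxiter=100)

# 4) Hypergradient: grad_lam E + (d_lam Phi)^T q, with grad_lam E = 0 and d_lam Phi = -eta * w.
hypergrad = -eta * w @ q
print(hypergrad)
```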

    Convergence Properties of Stochastic Hypergradients

    Bilevel optimization problems are receiving increasing attention in machine learning as they provide a natural framework for hyperparameter optimization and meta-learning. A key step to tackle these problems is the efficient computation of the gradient of the upper-level objective (hypergradient). In this work, we study stochastic approximation schemes for the hypergradient, which are important when the lower-level problem is empirical risk minimization on a large dataset. The method that we propose is a stochastic variant of the approximate implicit differentiation approach of Pedregosa (2016). We provide bounds for the mean square error of the hypergradient approximation, under the assumption that the lower-level problem is accessible only through a stochastic mapping which is a contraction in expectation. In particular, our main bound is agnostic to the choice of the two stochastic solvers employed by the procedure. We provide numerical experiments to support our theoretical analysis and to show the advantage of using stochastic hypergradients in practice.
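
    A rough sketch of the idea, on the same kind of toy ridge problem as above: both the lower-level solver and the linear-system solver only see mini-batch estimates of a map that is a contraction in expectation, so the result is a noisy hypergradient estimate (its mean square error is the quantity the abstract refers to). Data, batch size, step size and iteration counts are illustrative assumptions, not the paper's algorithm or constants.

```python
# Rough sketch of a stochastic hypergradient: a stochastic lower-level solver plus a
# stochastic solver for the implicit linear system. Illustrative assumptions throughout.
import numpy as np

rng = np.random.default_rng(0)
n, d, batch = 500, 10, 32
w_true = rng.normal(size=d)
A = rng.normal(size=(n, d))
b = A @ w_true + 0.1 * rng.normal(size=n)                      # training set
A_val = rng.normal(size=(100, d))
b_val = A_val @ w_true + 0.1 * rng.normal(size=100)            # validation set
lam, eta = 0.5, 1e-3

def phi_batch(w, idx):
    """Mini-batch gradient step on the ridge objective: a contraction in expectation."""
    Ai, bi = A[idx], b[idx]
    return w - eta * ((n / batch) * Ai.T @ (Ai @ w - bi) + lam * w)

# 1) Stochastic lower-level solver (constant-step SGD).
w = np.zeros(d)
for _ in range(2000):
    w = phi_batch(w, rng.integers(0, n, size=batch))

# 2) Upper-level (validation) gradient at the approximate solution.
grad_E = A_val.T @ (A_val @ w - b_val)

# 3) Stochastic solver for the linear system (I - d_w Phi)^T v = grad_E:
#    iterate v <- (d_w Phi)^T v + grad_E using fresh mini-batch estimates of d_w Phi.
v = np.zeros(d)
for _ in range(2000):
    Ai = A[rng.integers(0, n, size=batch)]
    Jtv = v - eta * ((n / batch) * Ai.T @ (Ai @ v) + lam * v)  # estimate of (d_w Phi)^T v
    v = Jtv + grad_E

# 4) Noisy hypergradient estimate: (d_lam Phi)^T v with d_lam Phi = -eta * w.
print(-eta * w @ v)
```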

    Principled and Efficient Bilevel Optimization for Machine Learning

    Automatic differentiation (AD) is a core element of most modern machine learning libraries that makes it possible to efficiently compute derivatives of a function from the corresponding program. Thanks to AD, machine learning practitioners have tackled increasingly complex learning models, such as deep neural networks with up to hundreds of billions of parameters, which are learned using the derivative (or gradient) of a loss function with respect to those parameters. While in most cases gradients can be computed exactly and relatively cheaply, in others the exact computation is either impossible or too expensive and AD must be used in combination with approximation methods. Some of these challenging scenarios, arising for example in meta-learning or hyperparameter optimization, can be framed as bilevel optimization problems, where the goal is to minimize an objective function that is evaluated by first solving another optimization problem, the lower-level problem. In this work, we study efficient gradient-based bilevel optimization algorithms for machine learning problems. In particular, we establish convergence rates for some simple approaches to approximate the gradient of the bilevel objective, namely the hypergradient, when the objective is smooth and the lower-level problem consists of finding the fixed point of a contraction map. Leveraging such results, we also prove that the projected inexact hypergradient method achieves a (near) optimal rate of convergence. We establish these results for both the deterministic and stochastic settings. Additionally, we provide an efficient implementation of the methods studied and perform several numerical experiments on hyperparameter optimization, meta-learning, data poisoning and equilibrium models, which show that our theoretical results are good indicators of practical performance.
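
    As a concrete illustration of the kind of method analysed here, the sketch below runs a projected inexact hypergradient loop on a toy ridge-regression bilevel problem: each outer step solves the lower-level problem only approximately, forms an approximate hypergradient via implicit differentiation, and takes a projected gradient step on the non-negative hyperparameter. The problem, step sizes and iteration counts are illustrative assumptions, not this work's implementation or experiments.

```python
# Compact sketch of a projected inexact hypergradient method on an assumed toy problem.
import numpy as np

rng = np.random.default_rng(1)
w_true = rng.normal(size=10)
A = rng.normal(size=(50, 10))
b = A @ w_true + 0.5 * rng.normal(size=50)        # training split
A_v = rng.normal(size=(30, 10))
b_v = A_v @ w_true + 0.5 * rng.normal(size=30)    # validation split

def approx_hypergrad(lam, inner_steps=50):
    """Inexact hypergradient of the validation loss w.r.t. the ridge parameter lam."""
    eta = 1.0 / (np.linalg.norm(A, 2) ** 2 + lam)
    w = np.zeros(10)
    for _ in range(inner_steps):                  # inexact lower-level solve
        w = w - eta * (A.T @ (A @ w - b) + lam * w)
    grad_E = A_v.T @ (A_v @ w - b_v)              # upper-level gradient w.r.t. w
    q = np.linalg.solve(A.T @ A + lam * np.eye(10), grad_E)
    return -w @ q                                 # implicit-differentiation formula

lam, alpha = 1.0, 0.1
for _ in range(100):
    lam = max(lam - alpha * approx_hypergrad(lam), 0.0)  # projected hypergradient step
print(lam)
```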

    Analyzing Inexact Hypergradients for Bilevel Learning

    Estimating hyperparameters has been a long-standing problem in machine learning. We consider the case where the task at hand is modeled as the solution to an optimization problem. Here the exact gradient with respect to the hyperparameters cannot be feasibly computed and approximate strategies are required. We introduce a unified framework for computing hypergradients that generalizes existing methods based on the implicit function theorem and automatic differentiation/backpropagation, showing that these two seemingly disparate approaches are actually tightly connected. Our framework is extremely flexible, allowing its subproblems to be solved with any suitable method, to any degree of accuracy. We derive a priori and computable a posteriori error bounds for all our methods, and numerically show that our a posteriori bounds are usually more accurate. Our numerical results also show that, surprisingly, for efficient bilevel optimization, the choice of hypergradient algorithm is at least as important as the choice of lower-level solver. Comment: Accepted to IMA Journal of Applied Mathematics.
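
    The connection mentioned in the abstract can be seen on a small example: on a toy ridge problem, backpropagating through the lower-level iterations (iterative differentiation) and applying the implicit function theorem at the approximate solution give nearly identical hypergradients once the lower-level solve is accurate. The sketch below is an illustration under these assumed settings, not the paper's framework or error bounds.

```python
# Sketch: iterative differentiation (reverse mode through the solver) vs. the
# implicit-function-theorem hypergradient, on an assumed toy ridge problem.
import numpy as np

rng = np.random.default_rng(2)
A, b = rng.normal(size=(50, 10)), rng.normal(size=50)
A_v, b_v = rng.normal(size=(30, 10)), rng.normal(size=30)
lam = 0.5
eta = 1.0 / (np.linalg.norm(A, 2) ** 2 + lam)
H = A.T @ A + lam * np.eye(10)                    # lower-level Hessian

# Forward pass: T gradient-descent steps on the lower-level (ridge) objective.
T, traj = 300, [np.zeros(10)]
for _ in range(T):
    w = traj[-1]
    traj.append(w - eta * (A.T @ (A @ w - b) + lam * w))
w_T = traj[-1]

# Reverse mode / backpropagation through the iterations (iterative differentiation).
p = A_v.T @ (A_v @ w_T - b_v)                     # upper-level gradient at w_T
g_itd = 0.0
for w in reversed(traj[:-1]):
    g_itd += -eta * w @ p                         # contribution of d Phi / d lam at w
    p = p - eta * (H @ p)                         # p <- (d Phi / d w)^T p

# Implicit function theorem at the same approximate solution.
g_ift = -w_T @ np.linalg.solve(H, A_v.T @ (A_v @ w_T - b_v))

print(g_itd, g_ift)                               # nearly equal for large T
```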