Nonsmooth Implicit Differentiation for Machine Learning and Optimization
In view of training increasingly complex learning architectures, we establish a nonsmooth implicit function theorem with an operational calculus. Our result applies to most practical problems (i.e., definable problems) provided that a nonsmooth form of the classical invertibility condition is fulfilled. This approach allows for formal subdifferentiation: for instance, replacing derivatives by Clarke Jacobians in the usual differentiation formulas is fully justified for a wide class of nonsmooth problems. Moreover, this calculus is entirely compatible with algorithmic differentiation (e.g., backpropagation). We provide several applications, such as training deep equilibrium networks, training neural nets with conic optimization layers, and hyperparameter tuning for nonsmooth Lasso-type models. To show the sharpness of our assumptions, we present numerical experiments showcasing the extremely pathological gradient dynamics one can encounter when applying implicit algorithmic differentiation without any hypothesis.
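As an illustration of the recipe above, here is a minimal sketch (not the authors' code; the Lasso-type fixed point, step size, and data are assumed for the example) that differentiates a nonsmooth fixed point z = f(theta, z) by solving the linear system of the implicit function theorem, with JAX autodiff Jacobians standing in for elements of the Clarke Jacobian:

```python
import jax
import jax.numpy as jnp

def soft_threshold(x, tau):
    # Nonsmooth proximal map of the l1 norm.
    return jnp.sign(x) * jnp.maximum(jnp.abs(x) - tau, 0.0)

def f(theta, z, A, b):
    # One proximal-gradient step for a Lasso-type problem; theta is a
    # hypothetical scalar regularization hyperparameter.
    step = 1.0 / jnp.linalg.norm(A, ord=2) ** 2
    return soft_threshold(z - step * (A.T @ (A @ z - b)), step * theta)

def solve_fixed_point(theta, A, b, iters=500):
    z = jnp.zeros(A.shape[1])
    for _ in range(iters):
        z = f(theta, z, A, b)
    return z

def implicit_grad(theta, A, b):
    z_star = solve_fixed_point(theta, A, b)
    # Nonsmooth implicit function theorem: dz/dtheta = (I - d_z f)^{-1} d_theta f,
    # with autodiff Jacobians playing the role of conservative-Jacobian elements.
    J_z = jax.jacobian(f, argnums=1)(theta, z_star, A, b)
    J_theta = jax.jacobian(f, argnums=0)(theta, z_star, A, b)
    return jnp.linalg.solve(jnp.eye(z_star.shape[0]) - J_z, J_theta)

A = jax.random.normal(jax.random.PRNGKey(0), (20, 5))
b = jax.random.normal(jax.random.PRNGKey(1), (20,))
print(implicit_grad(0.5, A, b))  # sensitivity of the Lasso-type solution to theta
```

Invertibility of I - d_z f at the solution is exactly the nonsmooth form of the classical invertibility condition the abstract refers to.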
Efficient and Modular Implicit Differentiation
Automatic differentiation (autodiff) has revolutionized machine learning. It
allows expressing complex computations by composing elementary ones in creative
ways and removes the burden of computing their derivatives by hand. More
recently, differentiation of optimization problem solutions has attracted
widespread attention with applications such as optimization as a layer, and in
bi-level problems such as hyper-parameter optimization and meta-learning.
However, the formulas for these derivatives often involve case-by-case tedious
mathematical derivations. In this paper, we propose a unified, efficient and
modular approach for implicit differentiation of optimization problems. In our
approach, the user defines (in Python in the case of our implementation) a
function capturing the optimality conditions of the problem to be
differentiated. Once this is done, we leverage autodiff of the optimality
conditions and implicit differentiation to automatically differentiate the
optimization problem. Our
approach thus combines the benefits of implicit differentiation and autodiff.
It is efficient as it can be added on top of any state-of-the-art solver and
modular as the optimality condition specification is decoupled from the
implicit differentiation mechanism. We show that seemingly simple principles
allow us to recover many recently proposed implicit differentiation methods and
create new ones easily. We demonstrate the ease of formulating and solving
bi-level optimization problems using our framework. We also showcase an
application to the sensitivity analysis of molecular dynamics. Comment: V2: some corrections and link to software.
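To make the modular recipe concrete, here is a minimal sketch (assumed ridge-regression example; this is plain JAX with jax.custom_vjp rather than the paper's released software) in which the user only supplies an optimality condition F(x, theta) = 0 and a black-box solver, and a generic wrapper adds implicit differentiation on top:

```python
import jax
import jax.numpy as jnp

def make_implicit_solver(optimality_fun, solver):
    """Wrap a black-box solver with implicit differentiation via jax.custom_vjp."""

    @jax.custom_vjp
    def diff_solver(theta):
        return solver(theta)

    def fwd(theta):
        x_star = solver(theta)
        return x_star, (x_star, theta)

    def bwd(res, cotangent):
        x_star, theta = res
        # Implicit function theorem at F(x*, theta) = 0:
        # dx*/dtheta = -(dF/dx)^{-1} dF/dtheta, so the VJP is
        # -(dF/dtheta)^T (dF/dx)^{-T} cotangent.
        A = jax.jacobian(optimality_fun, argnums=0)(x_star, theta)
        B = jax.jacobian(optimality_fun, argnums=1)(x_star, theta)
        u = jnp.linalg.solve(A.T, cotangent)
        return (-B.T @ u,)

    diff_solver.defvjp(fwd, bwd)
    return diff_solver

# Assumed example: ridge regression x*(theta) = argmin ||X w - y||^2 + theta ||w||^2.
X = jnp.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = jnp.array([1.0, 2.0, 3.0])

def optimality_fun(w, theta):
    # Stationarity (optimality) condition of the ridge objective.
    return X.T @ (X @ w - y) + theta * w

def ridge_solver(theta):
    # Any solver could be used; a closed-form solve stands in for an iterative one.
    return jnp.linalg.solve(X.T @ X + theta * jnp.eye(X.shape[1]), X.T @ y)

diff_ridge = make_implicit_solver(optimality_fun, ridge_solver)
print(jax.grad(lambda t: jnp.sum(diff_ridge(t)))(0.1))  # d sum(x*) / d theta
```

Because the wrapper only touches the optimality condition, the solver can be swapped for any state-of-the-art routine without changing the differentiation mechanism.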
A Unified Framework for Gradient-based Hyperparameter Optimization and Meta-learning
Machine learning algorithms and systems are progressively becoming part of our societies, leading to a growing need to build a vast multitude of accurate, reliable and interpretable models which should possibly exploit similarities among tasks. Automating segments of machine learning itself seems to be a natural step to undertake to deliver increasingly capable systems able to perform well in both the big-data and the few-shot learning regimes. Hyperparameter optimization (HPO) and meta-learning (MTL) constitute two building blocks of this growing effort. We explore these two topics under a unifying perspective, presenting a mathematical framework linked to bilevel programming that captures existing similarities and translates into procedures of practical interest rooted in algorithmic differentiation. We discuss the derivation, applicability and computational complexity of these methods and establish several approximation properties for a class of objective functions of the underlying bilevel programs. In HPO, these algorithms generalize and extend previous work on gradient-based methods. In MTL, the resulting framework subsumes classic and emerging strategies and provides a starting basis from which to build and analyze novel techniques. A series of examples and numerical simulations offer insight and highlight some limitations of these approaches. Experiments on larger-scale problems show the potential gains of the proposed methods in real-world applications. Finally, we develop two extensions of the basic algorithms apt to optimize a class of discrete hyperparameters (graph edges) in an application to relational learning and to tune online learning rate schedules for training neural network models, an old but crucially important issue in machine learning.
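One of the procedures this framework covers is the reverse-mode (unrolled) hypergradient; a minimal sketch, assuming a toy ridge-regularized inner problem and synthetic data, differentiates the validation loss with respect to the regularization hyperparameter through T unrolled inner gradient steps:

```python
import jax
import jax.numpy as jnp

def inner_loss(w, lam, X, y):
    # Ridge-regularized training loss; lam is the hyperparameter.
    return jnp.mean((X @ w - y) ** 2) + lam * jnp.sum(w ** 2)

def outer_loss(w, Xval, yval):
    # Upper-level (validation) objective of the bilevel program.
    return jnp.mean((Xval @ w - yval) ** 2)

def hyper_objective(lam, X, y, Xval, yval, T=100, lr=0.05):
    w = jnp.zeros(X.shape[1])
    for _ in range(T):  # unrolled inner optimization dynamics
        w = w - lr * jax.grad(inner_loss)(w, lam, X, y)
    return outer_loss(w, Xval, yval)

key = jax.random.PRNGKey(0)
X = jax.random.normal(key, (30, 4)); y = X @ jnp.ones(4)
Xval = jax.random.normal(jax.random.PRNGKey(1), (10, 4)); yval = Xval @ jnp.ones(4)

hypergrad = jax.grad(hyper_objective)(0.1, X, y, Xval, yval)
print(hypergrad)  # d (validation loss) / d lambda
```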
Relax and penalize: a new bilevel approach to mixed-binary hyperparameter optimization
In recent years, bilevel approaches have become very popular to efficiently
estimate high-dimensional hyperparameters of machine learning models. However,
to date, binary parameters are handled by continuous relaxation and rounding
strategies, which could lead to inconsistent solutions. In this context, we
tackle the challenging optimization of mixed-binary hyperparameters by
resorting to an equivalent continuous bilevel reformulation based on an
appropriate penalty term. We propose an algorithmic framework that, under
suitable assumptions, is guaranteed to provide mixed-binary solutions.
Moreover, the generality of the method allows to safely use existing continuous
bilevel solvers within the proposed framework. We evaluate the performance of
our approach for a specific machine learning problem, i.e., the estimation of
the group-sparsity structure in regression problems. Reported results clearly
show that our method outperforms state-of-the-art approaches based on
relaxation and rounding.
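A minimal sketch of the relax-and-penalize idea (the exact penalty and solver in the paper may differ; the upper-level objective here is a toy placeholder): binary hyperparameters are relaxed to [0,1]^d and a concave penalty, zero exactly on {0,1}^d, is added to the upper-level objective with increasing weight rho:

```python
import jax
import jax.numpy as jnp

def binary_penalty(u):
    # Zero exactly on {0,1}^d, strictly positive in the interior of [0,1]^d.
    return jnp.sum(u * (1.0 - u))

def upper_value(u):
    # Placeholder for the validation loss of the lower-level problem solved at u.
    target = jnp.array([1.0, 0.0, 1.0])
    return jnp.sum((u - 0.7 * target) ** 2)

def solve_penalized(rho, steps=200, lr=0.1):
    # Projected gradient on the penalized, fully continuous upper-level problem.
    obj = lambda v: upper_value(v) + rho * binary_penalty(v)
    u = jnp.full(3, 0.5)
    for _ in range(steps):
        u = jnp.clip(u - lr * jax.grad(obj)(u), 0.0, 1.0)
    return u

for rho in [0.0, 0.5, 2.0]:  # increasing the penalty weight pushes u toward {0,1}
    print(rho, solve_penalized(rho))
```

As rho grows, the penalized problem pushes the relaxed variables to the vertices of the box, recovering mixed-binary solutions without ad-hoc rounding.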
GPSINDy: Data-Driven Discovery of Equations of Motion
In this paper, we consider the problem of discovering dynamical system models
from noisy data. The presence of noise is known to be a significant problem for
symbolic regression algorithms. We combine Gaussian process regression, a
nonparametric learning method, with SINDy, a parametric learning approach, to
identify nonlinear dynamical systems from data. The key advantages of our
proposed approach are its simplicity coupled with the fact that it demonstrates
improved robustness properties with noisy data over SINDy. We demonstrate our
proposed approach on a Lotka-Volterra model and a unicycle dynamic model in
simulation and on an NVIDIA JetRacer system using hardware data. We demonstrate
improved performance over SINDy for discovering the system dynamics and
predicting future trajectories. Comment: Submitted to ICRA 202
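A minimal sketch of the GP-then-SINDy pipeline this abstract describes (toy Lotka-Volterra data; the exact GPSINDy formulation may differ): smooth the noisy trajectories with Gaussian process regression, estimate derivatives from the smoothed states, then run sequentially thresholded least squares over a library of candidate terms:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Noisy Lotka-Volterra data: dx/dt = a*x - b*x*y, dy/dt = -c*y + d*x*y.
a, b, c, d = 1.1, 0.4, 0.4, 0.1
dt, T = 0.02, 400
t = np.arange(T) * dt
X = np.zeros((T, 2))
X[0] = [10.0, 5.0]
for k in range(T - 1):
    x, y = X[k]
    X[k + 1] = X[k] + dt * np.array([a * x - b * x * y, -c * y + d * x * y])
X_noisy = X + 0.1 * np.random.default_rng(0).normal(size=X.shape)

# 1) Smooth each state dimension over time with a Gaussian process.
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.01)
X_smooth = np.column_stack([
    GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    .fit(t[:, None], X_noisy[:, i])
    .predict(t[:, None])
    for i in range(2)
])

# 2) Estimate derivatives and build a library of candidate terms.
dX = np.gradient(X_smooth, dt, axis=0)
x, y = X_smooth[:, 0], X_smooth[:, 1]
Theta = np.column_stack([np.ones_like(x), x, y, x**2, x*y, y**2])

# 3) SINDy-style sequentially thresholded least squares.
def stlsq(Theta, dX, threshold=0.05, iters=10):
    Xi = np.linalg.lstsq(Theta, dX, rcond=None)[0]
    for _ in range(iters):
        small = np.abs(Xi) < threshold
        Xi[small] = 0.0
        for j in range(dX.shape[1]):
            keep = ~small[:, j]
            if keep.any():
                Xi[keep, j] = np.linalg.lstsq(Theta[:, keep], dX[:, j], rcond=None)[0]
    return Xi

print(stlsq(Theta, dX))  # rows: [1, x, y, x^2, x*y, y^2]; columns: [dx/dt, dy/dt]
```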
Transcending shift-invariance in the paraxial regime via end-to-end inverse design of freeform nanophotonics
Traditional optical elements and conventional metasurfaces obey
shift-invariance in the paraxial regime. For imaging systems obeying paraxial
shift-invariance, a small shift in input angle causes a corresponding shift in
the sensor image. Shift-invariance has deep implications for the design and
functionality of optical devices, such as the necessity of free space between
components (as in compound objectives made of several curved surfaces). We
present a method for nanophotonic inverse design of compact imaging systems
whose resolution is not constrained by paraxial shift-invariance. Our method is
end-to-end, in that it integrates density-based full-Maxwell topology
optimization with a fully iterative elastic-net reconstruction algorithm. By
designing nanophotonic structures that scatter light in a
non-shift-invariant manner, our optimized nanophotonic imaging system overcomes
the limitations of paraxial shift-invariance, achieving accurate, noise-robust
image reconstruction beyond shift-invariant resolution.
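The reconstruction half of this end-to-end pipeline can be illustrated with a minimal sketch (assumed linear forward model in place of the full-Maxwell simulation): the optic is summarized by a measurement matrix G, and the object is recovered from noisy sensor data by an iterative elastic-net solve; end-to-end design backpropagates the reconstruction error through this solver to the nanophotonic design variables that determine G:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n_pixels, n_sensors = 64, 48
G = rng.normal(size=(n_sensors, n_pixels))            # assumed forward model of the optic
u_true = np.zeros(n_pixels)
u_true[rng.choice(n_pixels, 6, replace=False)] = 1.0  # sparse test object
v = G @ u_true + 0.01 * rng.normal(size=n_sensors)    # noisy sensor image

# Iterative elastic-net reconstruction (coordinate descent under the hood).
recon = ElasticNet(alpha=0.01, l1_ratio=0.7, max_iter=5000, fit_intercept=False)
u_hat = recon.fit(G, v).coef_
print(np.linalg.norm(u_hat - u_true) / np.linalg.norm(u_true))  # relative error
```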
The Curse of Unrolling: Rate of Differentiating Through Optimization
Computing the Jacobian of the solution of an optimization problem is a
central problem in machine learning, with applications in hyperparameter
optimization, meta-learning, optimization as a layer, and dataset distillation,
to name a few. Unrolled differentiation is a popular heuristic that
approximates the solution using an iterative solver and differentiates it
through the computational path. This work provides a non-asymptotic
convergence-rate analysis of this approach on quadratic objectives for gradient
descent and the Chebyshev method. We show that to ensure convergence of the
Jacobian, we can either 1) choose a large learning rate leading to a fast
asymptotic convergence but accept that the algorithm may have an arbitrarily
long burn-in phase or 2) choose a smaller learning rate leading to an immediate
but slower convergence. We refer to this phenomenon as the curse of unrolling.
Finally, we discuss open problems related to this approach, such as deriving a
practical update rule for the optimal unrolling strategy and making novel
connections with the field of Sobolev orthogonal polynomials.
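A minimal sketch of unrolled differentiation on a quadratic (the test problem and step size are assumed for the example): differentiate the T-th gradient-descent iterate with respect to a parameter of the objective and compare with the Jacobian of the exact solution; with a large step size, the iterates converge quickly while the Jacobian error can pass through a burn-in phase:

```python
import jax
import jax.numpy as jnp

H = jnp.diag(jnp.array([1.0, 0.1]))   # ill-conditioned quadratic
b = jnp.array([1.0, 1.0])

def x_unrolled(theta, T, lr):
    # T steps of gradient descent on f(x, theta) = 0.5 * theta * x^T H x - b^T x,
    # differentiated through the computational path (unrolled differentiation).
    x = jnp.zeros(2)
    for _ in range(T):
        x = x - lr * (theta * (H @ x) - b)
    return x

def x_star(theta):
    return jnp.linalg.solve(theta * H, b)  # exact minimizer

J_star = jax.jacobian(x_star)(1.0)
for T in [5, 20, 100, 500]:
    # Large step size: fast asymptotic convergence of the iterates, but the
    # Jacobian error may first grow through a burn-in phase before it decays.
    J_T = jax.jacobian(x_unrolled)(1.0, T, 1.8)
    print(T, float(jnp.linalg.norm(J_T - J_star)))
```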