Bilevel Optimization without Lower-Level Strong Convexity from the Hyper-Objective Perspective
Bilevel optimization reveals the inner structure of otherwise oblique
optimization problems, such as hyperparameter tuning and meta-learning. A
common goal in bilevel optimization is to find stationary points of the
hyper-objective function. Although this hyper-objective approach is widely
used, its theoretical properties have not been thoroughly investigated in cases
where the lower-level functions lack strong convexity. In this work, we take a
step forward and study the hyper-objective approach without the typical
lower-level strong convexity assumption. Our hardness results show that the
hyper-objective of general convex lower-level functions can be intractable
either to evaluate or to optimize. To tackle this challenge, we introduce the
gradient dominant condition, which strictly relaxes the strong convexity
assumption by allowing the lower-level solution set to be non-singleton. Under
the gradient dominant condition, we propose the Inexact Gradient-Free Method
(IGFM), which uses the Switching Gradient Method (SGM) as the zeroth-order
oracle, to find an approximate stationary point of the hyper-objective. We also
extend our results to nonsmooth lower-level functions under the weak sharp
minimum condition.
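The hyper-objective viewpoint can be illustrated with a small sketch: solve the lower-level problem approximately, treat the resulting hyper-objective as a black box, and drive the outer variable with a two-point finite-difference (zeroth-order) gradient estimate. This is only a toy illustration of the general idea, not the paper's IGFM/SGM (whose guarantees rely on the gradient dominant condition); the quadratic objectives below are invented for readability.

```python
def inner_solve(x, y0=0.0, steps=200, lr=0.1):
    """Approximately minimize the lower-level objective g(x, y) = (y - x)^2
    by gradient descent (a crude stand-in for an inner solver)."""
    y = y0
    for _ in range(steps):
        y -= lr * 2.0 * (y - x)  # dg/dy
    return y

def hyper_objective(x):
    """phi(x) = f(x, y*(x)) with upper-level f(x, y) = (y - 1)^2.
    Here y*(x) = x, so phi(x) = (x - 1)^2 with minimizer x = 1."""
    return (inner_solve(x) - 1.0) ** 2

def zeroth_order_step(x, mu=1e-3, lr=0.2):
    """Two-point finite-difference estimate of phi'(x), then a descent step."""
    g = (hyper_objective(x + mu) - hyper_objective(x - mu)) / (2.0 * mu)
    return x - lr * g

x = 5.0
for _ in range(100):
    x = zeroth_order_step(x)
```

Note that every evaluation of the hyper-objective requires an inner solve, so the zeroth-order outer loop pays two inexact inner solves per step; controlling that inexactness is exactly what such analyses must address.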
On Implicit Bias in Overparameterized Bilevel Optimization
Many problems in machine learning involve bilevel optimization (BLO),
including hyperparameter optimization, meta-learning, and dataset distillation.
Bilevel problems consist of two nested sub-problems, called the outer and inner
problems, respectively. In practice, often at least one of these sub-problems
is overparameterized. In this case, there are many ways to choose among optima
that achieve equivalent objective values. Inspired by recent studies of the
implicit bias induced by optimization algorithms in single-level optimization,
we investigate the implicit bias of gradient-based algorithms for bilevel
optimization. We delineate two standard BLO methods -- cold-start and
warm-start -- and show that the converged solution or long-run behavior depends
to a large degree on these and other algorithmic choices, such as the
hypergradient approximation. We also show that the inner solutions obtained by
warm-start BLO can encode a surprising amount of information about the outer
objective, even when the outer parameters are low-dimensional. We believe that
implicit bias deserves as central a role in the study of bilevel optimization
as it has attained in the study of single-level neural net optimization.
Comment: ICML 202
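The cold-start/warm-start distinction is easy to reproduce on a toy problem where the two strategies converge to visibly different outer solutions. The sketch below is an invented one-dimensional example with a truncated inner solve and a finite-difference hypergradient, not the paper's experimental setup; it only illustrates how the warm-start choice biases the solution.

```python
def inner_gd(x, y_init, steps=5, lr=0.2):
    """Gradient descent on the lower-level objective g(x, y) = (y - x)^2."""
    y = y_init
    for _ in range(steps):
        y -= lr * 2.0 * (y - x)
    return y

def blo(x0, warm_start, outer_steps=50, outer_lr=0.1):
    """Outer loop minimizing f(x, y) = (y - 2)^2 via a finite-difference
    hypergradient through a truncated inner solve. Warm start reuses the
    previous inner solution; cold start restarts from 0 every outer step."""
    x, y = x0, 0.0
    eps = 1e-4
    for _ in range(outer_steps):
        y_init = y if warm_start else 0.0
        y = inner_gd(x, y_init)
        hyper = ((inner_gd(x + eps, y_init) - 2.0) ** 2
                 - (inner_gd(x - eps, y_init) - 2.0) ** 2) / (2.0 * eps)
        x -= outer_lr * hyper
    return x, y

x_cold, _ = blo(0.0, warm_start=False)
x_warm, y_warm = blo(0.0, warm_start=True)
```

With warm starts the inner iterate tracks the outer variable across iterations and the method settles at x = y = 2, while cold starts compensate for the always-truncated inner solve with a different outer solution: the same objective, two different long-run answers, purely from an algorithmic choice.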
Blockwise Stochastic Variance-Reduced Methods with Parallel Speedup for Multi-Block Bilevel Optimization
In this paper, we consider non-convex multi-block bilevel optimization (MBBO)
problems, which involve multiple lower-level problems and have important
applications in machine learning. Designing a stochastic gradient and
controlling its variance is more intricate due to the hierarchical sampling of
blocks and data and the unique challenge of estimating the hyper-gradient. We aim
to achieve three nice properties for our algorithm: (a) matching the
state-of-the-art complexity of standard BO problems with a single block; (b)
achieving parallel speedup by sampling multiple blocks and multiple samples for
each sampled block per iteration; (c) avoiding the computation of the inverse
of a high-dimensional Hessian matrix estimator. However, it is non-trivial to
achieve all of these properties at once: existing works achieve only one or two
of them. To address the challenges involved in achieving (a, b,
c), we propose two stochastic algorithms by using advanced blockwise
variance-reduction techniques for tracking the Hessian matrices (for
low-dimensional problems) or the Hessian-vector products (for high-dimensional
problems), and prove an iteration complexity for finding an ε-stationary point
under appropriate conditions. We also conduct experiments to verify the
effectiveness of the proposed algorithms compared with existing MBBO
algorithms.
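The Hessian-inverse-free property mentioned above is commonly obtained with a Neumann-series approximation that touches the inner Hessian only through Hessian-vector products. The snippet below is a generic illustration of that idea on a small quadratic, not the authors' blockwise variance-reduced algorithm; the matrix `A` and vector `v` are invented for the demo.

```python
import numpy as np

def neumann_inv_hvp(hvp, v, alpha=0.3, K=200):
    """Approximate H^{-1} v via the Neumann series alpha * sum_k (I - alpha*H)^k v,
    using only Hessian-vector products (valid when the eigenvalues of
    I - alpha*H lie strictly inside the unit interval)."""
    out = np.zeros_like(v)
    cur = v.copy()
    for _ in range(K):
        out += alpha * cur
        cur = cur - alpha * hvp(cur)  # cur <- (I - alpha*H) cur
    return out

# Toy SPD inner Hessian and right-hand side, both invented for the example.
A = np.array([[2.0, 0.3], [0.3, 1.5]])
v = np.array([1.0, -1.0])
approx = neumann_inv_hvp(lambda u: A @ u, v)
exact = np.linalg.solve(A, v)
```

In high dimensions the `hvp` callback would be an automatic-differentiation Hessian-vector product, so the d-by-d Hessian is never materialized, let alone inverted.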
BiERL: A Meta Evolutionary Reinforcement Learning Framework via Bilevel Optimization
Evolutionary reinforcement learning (ERL) algorithms have recently attracted
attention for tackling complex reinforcement learning (RL) problems thanks to
their high parallelism, but they are prone to insufficient exploration or model
collapse without careful tuning of hyperparameters (aka meta-parameters). In this paper,
we propose a general meta ERL framework via bilevel optimization (BiERL) to
jointly update hyperparameters in parallel to training the ERL model within a
single agent, which relieves the need for prior domain knowledge or costly
optimization procedures before model deployment. We design an elegant meta-level
architecture that embeds the inner-level's evolving experience into an
informative population representation and introduce a simple and feasible
evaluation of the meta-level fitness function to facilitate learning
efficiency. We perform extensive experiments in MuJoCo and Box2D tasks to
verify that as a general framework, BiERL outperforms various baselines and
consistently improves the learning performance for a diversity of ERL
algorithms.
Comment: Published as a conference paper at the European Conference on Artificial
Intelligence (ECAI) 202
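A stripped-down version of the bilevel structure here is a meta level that selects a hyperparameter by the fitness its inner evolutionary run achieves. The toy below uses a (1+1) evolution strategy on an invented scalar fitness and a brute-force meta selection; it is only a crude stand-in for BiERL's learned population representation and meta fitness, with all names and numbers made up.

```python
import random

def inner_es(sigma, steps=500, seed=0):
    """(1+1) evolution strategy minimizing the toy fitness f(w) = w^2 with
    mutation scale sigma; returns the best fitness found, which is the
    quantity the meta level evaluates."""
    rng = random.Random(seed)
    w, best = 5.0, 25.0
    for _ in range(steps):
        cand = w + rng.gauss(0.0, sigma)
        if cand * cand < best:
            w, best = cand, cand * cand
    return best

def meta_search(sigmas):
    """Meta level: keep the hyperparameter whose inner evolutionary run
    reaches the best fitness (a stand-in for a learned meta update)."""
    return min(sigmas, key=inner_es)

chosen = meta_search([0.001, 0.5])
```

A mutation scale that is far too small stalls the inner search, so the meta level rejects it; BiERL's contribution is doing this selection jointly with training, inside a single agent, rather than by separate full runs as here.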
Data Distillation: A Survey
The popularity of deep learning has led to the curation of a vast number of
massive and multifarious datasets. Despite having close-to-human performance on
individual tasks, training parameter-hungry models on large datasets poses
multi-faceted problems such as (a) high model-training time; (b) slow research
iteration; and (c) poor eco-sustainability. As an alternative, data
distillation approaches aim to synthesize terse data summaries, which can serve
as effective drop-in replacements of the original dataset for scenarios like
model training, inference, architecture search, etc. In this survey, we present
a formal framework for data distillation, along with providing a detailed
taxonomy of existing approaches. Additionally, we cover data distillation
approaches for different data modalities, namely images, graphs, and user-item
interactions (recommender systems), while also identifying current challenges
and future research directions.
Comment: Accepted at TMLR '23. 21 pages, 4 figures
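The "drop-in replacement" idea admits a minimal, idealized demonstration: for a linear model the training loss depends on the data only through a few sufficient statistics, so an entire dataset can be compressed into one synthetic point that trains to identical parameters. Real data distillation methods target models with no such closed form (e.g. via gradient matching or bilevel optimization); the model class and data below are invented for the demo.

```python
import math

def fit(pairs):
    """Least-squares slope through the origin: w = sum(x*y) / sum(x^2)."""
    sxx = sum(x * x for x, _ in pairs)
    sxy = sum(x * y for x, y in pairs)
    return sxy / sxx

def distill(pairs):
    """Compress the dataset into a single synthetic point that preserves the
    sufficient statistics (sum of x^2 and sum of x*y) of this model class,
    so training on the summary recovers the same parameters."""
    sxx = sum(x * x for x, _ in pairs)
    sxy = sum(x * y for x, y in pairs)
    xs = math.sqrt(sxx)
    return [(xs, sxy / xs)]

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
```

Training on `distill(data)` (one point) yields the same slope as training on `data` (three points), which is the property data distillation aims to approximate for deep models.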
Principled and Efficient Bilevel Optimization for Machine Learning
Automatic differentiation (AD) is a core element of most modern machine learning
libraries that allows one to efficiently compute derivatives of a function from the corresponding program. Thanks to AD, machine learning practitioners have tackled
increasingly complex learning models, such as deep neural networks with up to hundreds of billions of parameters, which are learned using the derivative (or gradient)
of a loss function with respect to those parameters. While in most cases gradients
can be computed exactly and relatively cheaply, in others the exact computation
is either impossible or too expensive and AD must be used in combination with
approximation methods. Some of these challenging scenarios, arising for example in
meta-learning or hyperparameter optimization, can be framed as bilevel optimization
problems, where the goal is to minimize an objective function that is evaluated by
first solving another optimization problem, the lower-level problem. In this work, we
study efficient gradient-based bilevel optimization algorithms for machine learning
problems. In particular, we establish convergence rates for some simple approaches
to approximate the gradient of the bilevel objective, namely the hypergradient, when
the objective is smooth and the lower-level problem consists in finding the fixed
point of a contraction map. Leveraging such results, we also prove that the projected
inexact hypergradient method achieves a (near) optimal rate of convergence. We
establish these results for both the deterministic and stochastic settings. Additionally, we provide an efficient implementation of the methods studied and perform
several numerical experiments on hyperparameter optimization, meta-learning, data
poisoning, and equilibrium models, which show that our theoretical results are good
indicators of the performance in practice.
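The fixed-point setting described above can be made concrete in one dimension: when the lower-level solution satisfies y* = T(x, y*) for a contraction T, differentiating the fixed-point equation shows the sensitivity s = dy*/dx solves a second fixed-point equation, s = (dT/dy)·s + dT/dx, which can itself be iterated. This is only a scalar sketch of implicit differentiation, not the thesis' algorithms; T and f are invented.

```python
def T(x, y):
    """A contraction map in y (Lipschitz constant 0.5 < 1); its fixed point
    is y*(x) = 2x."""
    return 0.5 * y + x

def fixed_point(x, y=0.0, iters=60):
    """Iterate the contraction to approximate y*(x)."""
    for _ in range(iters):
        y = T(x, y)
    return y

def hypergradient(x, iters=60):
    """Approximate d/dx f(x, y*(x)) for f(x, y) = (y - 3)^2. The sensitivity
    s = dy*/dx solves s = (dT/dy)*s + dT/dx, another contraction."""
    y = fixed_point(x, iters=iters)
    dT_dy, dT_dx = 0.5, 1.0
    s = 0.0
    for _ in range(iters):
        s = dT_dy * s + dT_dx  # converges to dT_dx / (1 - dT_dy) = 2
    return s * 2.0 * (y - 3.0)  # chain rule; grad_x f = 0 here
```

Since y*(x) = 2x, the exact hypergradient is d/dx (2x - 3)^2 = 4(2x - 3), and the iteration error of both fixed-point loops decays geometrically with the contraction rate, which is the mechanism behind the convergence rates mentioned above.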
BOME! Bilevel Optimization Made Easy: A Simple First-Order Approach
Bilevel optimization (BO) is useful for solving a variety of important
machine learning problems including but not limited to hyperparameter
optimization, meta-learning, continual learning, and reinforcement learning.
Conventional BO methods need to differentiate through the lower-level
optimization process with implicit differentiation, which requires expensive
calculations related to the Hessian matrix. There has been a recent quest for
first-order methods for BO, but the methods proposed to date tend to be
complicated and impractical for large-scale deep learning applications. In this
work, we propose a simple first-order BO algorithm that depends only on
first-order gradient information, requires no implicit differentiation, and is
practical and efficient for large-scale non-convex functions in deep learning.
We provide non-asymptotic convergence analysis of the proposed method to
stationary points for non-convex objectives and present empirical results that
show its superior practical performance.
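One generic way to obtain a first-order bilevel method in this spirit (though not the exact BOME update) is to jointly descend the penalized objective f + λ·g in (x, y): as λ grows, the iterates are pushed toward the lower-level solution set, and only gradients are needed, no Hessians or implicit differentiation. The quadratics below are invented; for them the true bilevel solution is x = y = 0.5, and the penalty method lands within O(1/λ) of it.

```python
def grad_f(x, y):
    """Upper-level objective f(x, y) = x^2 + (y - 1)^2."""
    return 2.0 * x, 2.0 * (y - 1.0)

def grad_g(x, y):
    """Lower-level objective g(x, y) = (y - x)^2, minimized at y = x."""
    return -2.0 * (y - x), 2.0 * (y - x)

def first_order_bo(x=0.0, y=2.0, steps=3000, lr=0.01, lam=20.0):
    """Penalty sketch: gradient descent on f + lam * g jointly in (x, y),
    using first-order information only."""
    for _ in range(steps):
        fx, fy = grad_f(x, y)
        gx, gy = grad_g(x, y)
        x -= lr * (fx + lam * gx)
        y -= lr * (fy + lam * gy)
    return x, y

x, y = first_order_bo()
```

The fixed penalty weight introduces a bias of order 1/λ toward the unconstrained minimum; methods like BOME are precisely about choosing the multiplier adaptively so that this simple first-order structure still reaches bilevel stationary points.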
Contextual Stochastic Bilevel Optimization
We introduce contextual stochastic bilevel optimization (CSBO) -- a
stochastic bilevel optimization framework with the lower-level problem
minimizing an expectation conditioned on some contextual information and the
upper-level decision variable. This framework extends classical stochastic
bilevel optimization when the lower-level decision maker responds optimally not
only to the decision of the upper-level decision maker but also to some side
information and when there are multiple or even infinitely many followers. It
captures important applications such as meta-learning, personalized federated
learning, end-to-end learning, and Wasserstein distributionally robust
optimization with side information (WDRO-SI). Due to the presence of contextual
information, existing single-loop methods for classical stochastic bilevel
optimization are unable to converge. To overcome this challenge, we introduce
an efficient double-loop gradient method based on the Multilevel Monte-Carlo
(MLMC) technique and establish its sample and computational complexities. When
specialized to stochastic nonconvex optimization, our method matches existing
lower bounds. For meta-learning, the complexity of our method does not depend
on the number of tasks. Numerical experiments further validate our theoretical
results.
Comment: The paper is accepted by NeurIPS 202
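The Multilevel Monte-Carlo principle behind the method can be shown on a minimal nested-expectation example: approximate a quantity at "level" l using 2^l inner samples and telescope the level corrections, which shrink geometrically so most of the sampling budget can go to the cheap levels. Below the level expectations are computed exactly (binary draws make the inner sample mean binomial), which lets us verify the telescoping identity and the decay deterministically; this illustrates the estimator's structure only, not the paper's CSBO algorithm, and the toy distribution is invented.

```python
from math import comb

def level_expectation(l):
    """Exact E[P_l], where P_l = (mean of 2**l i.i.d. draws from {1, 3})**2.
    The sample mean equals 1 + 2k/n with k ~ Binomial(n, 1/2), n = 2**l,
    so the expectation is a finite sum over k."""
    n = 2 ** l
    return sum(comb(n, k) / 2 ** n * (1 + 2 * k / n) ** 2 for k in range(n + 1))

L = 8
# MLMC telescoping: E[P_L] = E[P_0] + sum_{l=1}^{L} E[P_l - P_{l-1}]
corrections = [level_expectation(l) - level_expectation(l - 1) for l in range(1, L + 1)]
mlmc_total = level_expectation(0) + sum(corrections)
```

Here E[P_l] = 4 + 2^{-l} (bias from squaring a noisy mean), so each correction halves the previous one; an MLMC estimator exploits this by drawing many cheap level-0 samples and only a handful at the expensive high levels.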
A Lower Bound and a Near-Optimal Algorithm for Bilevel Empirical Risk Minimization
Bilevel optimization problems, which are problems where two optimization
problems are nested, have more and more applications in machine learning. In
many practical cases, the upper and the lower objectives correspond to
empirical risk minimization problems and therefore have a sum structure. In
this context, we propose a bilevel extension of the celebrated SARAH algorithm.
We demonstrate that the number of gradient computations the algorithm requires
to achieve ε-stationarity, expressed in terms of the total number of samples,
improves over all previous bilevel algorithms. Moreover, we provide a lower
bound on the number of oracle calls required to get an approximate stationary
point of the objective function of the bilevel problem. This lower bound is
attained by our algorithm, which is therefore optimal in terms of sample
complexity.
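For context, the single-level SARAH recursion that such bilevel extensions build on maintains a recursive gradient estimator refreshed by a periodic full gradient. The sketch below applies it to a toy scalar finite sum; the bilevel version, with its coupled upper- and lower-level estimators, is substantially more involved, and the data and step sizes here are invented.

```python
import random

def sarah(data, w=0.0, epochs=30, lr=0.05, seed=0):
    """SARAH on the finite sum (1/n) * sum_i (w - a_i)^2, whose minimizer is
    the mean of the data. Each epoch starts from a full gradient; inner steps
    update the estimator v with a single-sample correction."""
    rng = random.Random(seed)
    n = len(data)
    grad_i = lambda w, a: 2.0 * (w - a)  # gradient of one summand
    for _ in range(epochs):
        v = sum(grad_i(w, a) for a in data) / n  # full gradient at epoch start
        w_prev, w = w, w - lr * v
        for _ in range(n):
            a = rng.choice(data)
            v = grad_i(w, a) - grad_i(w_prev, a) + v  # SARAH recursion
            w_prev, w = w, w - lr * v
    return w

w = sarah([1.0, 2.0, 3.0, 6.0])
```

Unlike SVRG, the correction is anchored at the previous iterate rather than a fixed snapshot, which is what yields the improved sum-structure complexities that bilevel SARAH-type methods inherit.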