LIPIcs, Volume 251, ITCS 2023, Complete Volume
Learning and Control of Dynamical Systems
Despite the remarkable success of machine learning in various domains in recent years, our understanding of its fundamental limitations remains incomplete. This knowledge gap poses a grand challenge when deploying machine learning methods in critical decision-making tasks, where incorrect decisions can have catastrophic consequences. To effectively utilize these learning-based methods in such contexts, it is crucial to explicitly characterize their performance. Over the years, significant research efforts have been dedicated to learning and control of dynamical systems where the underlying dynamics are unknown or only partially known a priori, and must be inferred from collected data. However, many of these classical results have focused on asymptotic guarantees, providing limited insight into the amount of data required to achieve desired control performance while satisfying operational constraints such as safety and stability, especially in the presence of statistical noise.
In this thesis, we study the statistical complexity of learning and control of unknown dynamical systems. By utilizing recent advances in statistical learning theory, high-dimensional statistics, and control theoretic tools, we aim to establish a fundamental understanding of the number of samples required to achieve desired (i) accuracy in learning the unknown dynamics, (ii) performance in the control of the underlying system, and (iii) satisfaction of the operational constraints such as safety and stability. We provide finite-sample guarantees for these objectives and propose efficient learning and control algorithms that achieve the desired performance at these statistical limits in various dynamical systems. Our investigation covers a broad range of dynamical systems, starting from fully observable linear dynamical systems to partially observable linear dynamical systems, and ultimately, nonlinear systems.
We deploy our learning and control algorithms in various adaptive control tasks in real-world control systems and demonstrate their strong empirical performance along with their learning, robustness, and stability guarantees. In particular, we implement one of our proposed methods, Fourier Adaptive Learning and Control (FALCON), on an experimental aerodynamic testbed under extreme turbulent flow dynamics in a wind tunnel. The results show that FALCON achieves state-of-the-art stabilization performance and consistently outperforms conventional and other learning-based methods by at least 37%, despite using 8 times less data. The superior performance of FALCON arises from its physically and theoretically accurate modeling of the underlying nonlinear turbulent dynamics, which yields rigorous finite-sample learning and performance guarantees. These findings underscore the importance of characterizing the statistical complexity of learning and control of unknown dynamical systems.
Exploiting Problem Geometry in Safe Linear Bandits
The safe linear bandit problem is a version of the classic linear bandit
problem where the learner's actions must satisfy an uncertain linear constraint
at all rounds. Due to its applicability to many real-world settings, this problem
has received considerable attention in recent years. We find that by exploiting
the geometry of the specific problem setting, we can achieve improved regret
guarantees for both well-separated problem instances and action sets that are
finite star convex sets. Additionally, we propose a novel algorithm for this
setting that chooses problem parameters adaptively and enjoys regret guarantees
at least as good as those of existing algorithms. Lastly, we introduce a generalization
of the safe linear bandit setting where the constraints are convex and adapt
our algorithms and analyses to this setting by leveraging a novel
convex-analysis based approach. Simulation results show improved performance
over existing algorithms for a variety of randomly sampled settings.
Comment: 38 pages, 4 figures
Regret Lower Bounds in Multi-agent Multi-armed Bandit
The Multi-armed Bandit problem motivates methods with provable upper bounds on
regret, and the counterpart lower bounds have also been extensively studied in this
context. Recently, Multi-agent Multi-armed Bandit has gained significant
traction in various domains, where individual clients face bandit problems in a
distributed manner and the objective is the overall system performance,
typically measured by regret. While efficient algorithms with regret upper
bounds have emerged, limited attention has been given to the corresponding
regret lower bounds, except for a recent lower bound for adversarial settings,
which, however, has a gap with the best known upper bounds. To this end, we herein
provide the first comprehensive study on regret lower bounds across different
settings and establish their tightness. Specifically, when the graphs exhibit
good connectivity properties and the rewards are stochastically distributed, we
demonstrate a lower bound of order for instance-dependent bounds
and for mean-gap independent bounds, which are tight. Assuming
adversarial rewards, we establish a lower bound for
connected graphs, thereby bridging the gap between the lower and upper bound in
the prior work. We also show a linear regret lower bound when the graph is
disconnected. While previous works have explored these settings with upper
bounds, we provide a thorough study on tight lower bounds.
Comment: 10 pages
Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits
Motivated by concerns about making online decisions that incur an undue amount
of risk at each time step, in this paper, we formulate the probably
anytime-safe stochastic combinatorial semi-bandits problem. In this problem,
the agent is given the option to select a subset of size at most from a set
of ground items. Each item is associated with a certain mean reward as well
as a variance that represents its risk. To mitigate the risk that the agent
incurs, we require that with probability at least , over the entire
horizon of time , each of the choices that the agent makes should contain
items whose sum of variances does not exceed a certain variance budget. We call
this probably anytime-safe constraint. Under this constraint, we design and
analyze an algorithm {\sc PASCombUCB} that minimizes the regret over the
horizon of time . By developing accompanying information-theoretic lower
bounds, we show that under both the problem-dependent and problem-independent
paradigms, {\sc PASCombUCB} is almost asymptotically optimal. Experiments are
conducted to corroborate our theoretical findings. Our problem setup, the
proposed {\sc PASCombUCB} algorithm, and novel analyses are applicable to
domains such as recommendation systems and transportation in which an agent is
allowed to choose multiple items at a single time step and wishes to control
the risk over the whole time horizon.
Comment: To be presented at ICML 2023. 57 pages, 6 figures
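The probably anytime-safe constraint described above can be illustrated with a short check: every subset the agent plays over the horizon must keep its summed item variances within the variance budget. A minimal sketch (the function name and interface are hypothetical, not the paper's PASCombUCB implementation):

```python
import numpy as np

def is_anytime_safe(variances, chosen_subsets, variance_budget):
    """Check the anytime-safe condition: every chosen subset of items must
    have a total variance (risk) no larger than the budget.
    Hypothetical helper for illustration, not the paper's algorithm."""
    variances = np.asarray(variances, dtype=float)
    return all(variances[list(s)].sum() <= variance_budget
               for s in chosen_subsets)

# Three items with variances 0.1, 0.5, 0.3 and a budget of 0.6:
# {0, 2} sums to 0.4 and {0, 1} to 0.6, so both choices are safe;
# {1, 2} sums to 0.8 and violates the budget.
```

An algorithm like PASCombUCB must satisfy this condition with high probability simultaneously over all rounds, which is what makes it an "anytime" rather than a per-round-in-expectation constraint.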
Online Joint Assortment-Inventory Optimization under MNL Choices
We study an online joint assortment-inventory optimization problem, in which
we assume that the choice behavior of each customer follows the Multinomial
Logit (MNL) choice model, and the attraction parameters are unknown a priori.
The retailer makes periodic assortment and inventory decisions to dynamically
learn from the realized demands about the attraction parameters while
maximizing the expected total profit over time. In this paper, we propose a
novel algorithm that can effectively balance the exploration and exploitation
in the online decision-making of assortment and inventory. Our algorithm builds
on a new estimator for the MNL attraction parameters, a novel approach to
incentivize exploration by adaptively tuning certain known and unknown
parameters, and an optimization oracle for static single-cycle
assortment-inventory planning problems with given parameters. We establish a
regret upper bound for our algorithm and a lower bound for the online joint
assortment-inventory optimization problem, suggesting that our algorithm
achieves nearly optimal regret rate, provided that the static optimization
oracle is exact. Then we incorporate more practical approximate static
optimization oracles into our algorithm, and bound from above the impact of
static optimization errors on the regret of our algorithm. At last, we perform
numerical studies to demonstrate the effectiveness of our proposed algorithm.
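The MNL choice model referenced above has a standard closed form: given an offered assortment S with attraction parameters v_i, and the no-purchase option normalized to attraction 1, a customer picks item i in S with probability v_i / (1 + sum of v_j over S). A minimal sketch of that formula (names are illustrative):

```python
import numpy as np

def mnl_choice_probs(attractions, assortment):
    """Multinomial Logit choice probabilities for an offered assortment.
    attractions: dict of v_i > 0 per item; the no-purchase option has
    attraction normalized to 1. Returns (purchase probs, no-purchase prob)."""
    v = np.asarray([attractions[i] for i in assortment], dtype=float)
    denom = 1.0 + v.sum()
    return v / denom, 1.0 / denom

# Offering items {0, 1} with attractions 2.0 and 1.0:
# denominator = 1 + 3 = 4, so probs are [0.5, 0.25] and
# the customer walks away with probability 0.25.
```

In the online problem, the attraction parameters are unknown, and each realized purchase/no-purchase event provides the data from which they are estimated.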
Doubly-Optimistic Play for Safe Linear Bandits
The safe linear bandit problem (SLB) is an online approach to linear
programming with unknown objective and unknown round-wise constraints, under
stochastic bandit feedback of rewards and safety risks of actions. We study
aggressive \emph{doubly-optimistic play} in SLBs, and their role in avoiding
the strong assumptions and poor efficacy associated with extant
pessimistic-optimistic solutions.
We first elucidate an inherent hardness in SLBs due to the lack of knowledge of
constraints: there exist `easy' instances, for which suboptimal extreme points
have large `gaps', but on which SLB methods must still incur
regret and safety violations due to an inability to refine the location of
optimal actions to arbitrary precision. In a positive direction, we propose and
analyse a doubly-optimistic confidence-bound based strategy for the safe linear
bandit problem, DOSLB, which exploits supreme optimism by using optimistic
estimates of both reward and safety risks to select actions. Using a novel dual
analysis, we show that despite the lack of knowledge of constraints, DOSLB
rarely takes overly risky actions, and obtains tight instance-dependent
bounds on both efficacy regret and net safety violations up to
any finite precision, thus yielding large efficacy gains at a small safety cost
and without strong assumptions. Concretely, we argue that the algorithm activates
noisy versions of an `optimal' set of constraints at each round, and activation
of suboptimal sets of constraints is limited by the larger of a safety and
efficacy gap we define.
Comment: v2: extensive rewrite, with a much cleaner exposition of the theory, and improvements in key definitions
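The doubly-optimistic selection rule can be sketched for a finite action set: use an optimistic (upper) estimate of each action's reward and an optimistic (lower) estimate of its safety risk, then maximize the former among actions the latter deems plausibly safe. This is an illustrative sketch with simple norm-proportional confidence widths, not the paper's exact DOSLB:

```python
import numpy as np

def doubly_optimistic_action(actions, theta_hat, gamma_hat, beta, budget):
    """One round of a doubly-optimistic rule (illustrative sketch):
    theta_hat / gamma_hat are current estimates of the reward / risk
    vectors, and beta * ||a|| is an assumed confidence width."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    gamma_hat = np.asarray(gamma_hat, dtype=float)
    best, best_ucb = None, -np.inf
    for a in actions:
        a = np.asarray(a, dtype=float)
        width = beta * np.linalg.norm(a)
        risk_lcb = float(gamma_hat @ a) - width       # optimistic safety estimate
        if risk_lcb <= budget:                        # plausibly safe action
            reward_ucb = float(theta_hat @ a) + width # optimistic reward estimate
            if reward_ucb > best_ucb:
                best, best_ucb = a, reward_ucb
    return best, best_ucb
```

Being optimistic about safety (the lower confidence bound on risk) is what distinguishes this from pessimistic-optimistic play, which would instead exclude any action whose risk could plausibly exceed the budget.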
Contextual Bandits with Budgeted Information Reveal
Contextual bandit algorithms are commonly used in digital health to recommend
personalized treatments. However, to ensure the effectiveness of the
treatments, patients are often requested to take actions that have no immediate
benefit to them, which we refer to as pro-treatment actions. In practice,
clinicians have a limited budget to encourage patients to take these actions
and collect additional information. We introduce a novel optimization and
learning algorithm to address this problem. This algorithm seamlessly combines
the strengths of two algorithmic approaches: 1)
an online primal-dual algorithm for deciding the optimal timing to reach out to
patients, and 2) a contextual bandit learning algorithm to deliver personalized
treatment to the patient. We prove that this algorithm admits a sub-linear
regret bound. We illustrate the usefulness of this algorithm on both synthetic
and real-world data.
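The online primal-dual component for deciding when to reach out can be illustrated with generic budget pacing: maintain a dual price on the limited budget, reach out when the estimated value of doing so exceeds the price, and adjust the price by a subgradient step. All names and the update rule here are illustrative, not the paper's algorithm:

```python
def primal_dual_reveal(values, budget, total_rounds, step=0.05):
    """Illustrative online primal-dual budget pacing: act when the
    estimated value v exceeds the dual price lam; lam rises when we
    spend and falls when we save, steering spend toward the per-round
    target budget / total_rounds."""
    lam, spent, decisions = 0.0, 0, []
    target = budget / total_rounds
    for v in values:
        act = v > lam and spent < budget
        if act:
            spent += 1
        decisions.append(act)
        # dual (price) update: subgradient step on the budget constraint
        lam = max(0.0, lam + step * ((1 if act else 0) - target))
    return decisions, spent
```

The appeal of this structure in the digital-health setting is that the timing decision decouples from the bandit learner: the contextual bandit supplies the value estimates, and the primal-dual layer enforces the clinician's budget.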
Dynamical Linear Bandits
In many real-world sequential decision-making problems, an action does not immediately reflect on the feedback and spreads its effects over a long time frame. For instance, in online advertising, investing in a platform produces an instantaneous increase in awareness, but the actual reward, i.e., a conversion, might occur far in the future. Furthermore, whether a conversion takes place depends on how fast the awareness grows, its vanishing effects, and the synergy or interference with other advertising platforms. Previous work has investigated the Multi-Armed Bandit framework with the possibility of delayed and aggregated feedback, without a particular structure on how an action propagates in the future, disregarding possible dynamical effects. In this paper, we introduce a novel setting, the Dynamical Linear Bandits (DLB), an extension of the linear bandits characterized by a hidden state. When an action is performed, the learner observes a noisy reward whose mean is a linear function of the hidden state and of the action. Then, the hidden state evolves according to linear dynamics, affected by the performed action too. We start by introducing the setting, discussing the notion of optimal policy, and deriving an expected regret lower bound. Then, we provide an optimistic regret minimization algorithm, Dynamical Linear Upper Confidence Bound (DynLin-UCB), that suffers an expected regret of order , where is a measure of the stability of the system, and is the dimension of the action vector. Finally, we conduct a numerical validation on a synthetic environment and on real-world data to show the effectiveness of DynLin-UCB in comparison with several baselines.
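The DLB model described above admits a compact sketch: the reward is linear in the hidden state and the action, and the state then evolves linearly, driven by the action. The names A, B, omega, and mu are assumed notation for illustration, not necessarily the paper's:

```python
import numpy as np

def dlb_step(x, action, A, B, omega, mu, rng=None, noise_std=0.0):
    """One transition of a Dynamical Linear Bandit (sketch under assumed
    names): observe a noisy reward linear in the hidden state x and the
    action, then evolve x with linear dynamics driven by the action."""
    rng = rng or np.random.default_rng(0)
    reward = float(omega @ x + mu @ action) + noise_std * rng.standard_normal()
    x_next = A @ x + B @ action          # hidden state carries the action's
    return reward, x_next                # effect into all future rounds
```

The hidden state is what encodes the delayed, propagating effect of past actions: unlike delayed-feedback bandits, the impact of an action here is structured by the dynamics matrix A rather than arriving as an unstructured lagged reward.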
Sample-Efficient Multi-Agent RL: An Optimization Perspective
We study multi-agent reinforcement learning (MARL) for the general-sum Markov
Games (MGs) under general function approximation. In order to find the
minimum assumption for sample-efficient learning, we introduce a novel
complexity measure called the Multi-Agent Decoupling Coefficient (MADC) for
general-sum MGs. Using this measure, we propose the first unified algorithmic
framework that ensures sample efficiency in learning Nash Equilibrium, Coarse
Correlated Equilibrium, and Correlated Equilibrium for both model-based and
model-free MARL problems with low MADC. We also show that our algorithm
provides comparable sublinear regret to the existing works. Moreover, our
algorithm combines an equilibrium-solving oracle with a single objective
optimization subprocedure that solves for the regularized payoff of each
deterministic joint policy, which avoids solving constrained optimization
problems within data-dependent constraints (Jin et al. 2020; Wang et al. 2023)
or executing sampling procedures with complex multi-objective optimization
problems (Foster et al. 2023), thus being more amenable to empirical
implementation.