LIPIcs, Volume 251, ITCS 2023, Complete Volume
Learning and Control of Dynamical Systems
Despite the remarkable success of machine learning in various domains in recent years, our understanding of its fundamental limitations remains incomplete. This knowledge gap poses a grand challenge when deploying machine learning methods in critical decision-making tasks, where incorrect decisions can have catastrophic consequences. To effectively utilize these learning-based methods in such contexts, it is crucial to explicitly characterize their performance. Over the years, significant research efforts have been dedicated to learning and control of dynamical systems where the underlying dynamics are unknown or only partially known a priori, and must be inferred from collected data. However, many of these classical results have focused on asymptotic guarantees, providing limited insights into the amount of data required to achieve desired control performance while satisfying operational constraints such as safety and stability, especially in the presence of statistical noise.
In this thesis, we study the statistical complexity of learning and control of unknown dynamical systems. By utilizing recent advances in statistical learning theory, high-dimensional statistics, and control theoretic tools, we aim to establish a fundamental understanding of the number of samples required to achieve desired (i) accuracy in learning the unknown dynamics, (ii) performance in the control of the underlying system, and (iii) satisfaction of the operational constraints such as safety and stability. We provide finite-sample guarantees for these objectives and propose efficient learning and control algorithms that achieve the desired performance at these statistical limits in various dynamical systems. Our investigation covers a broad range of dynamical systems, starting from fully observable linear dynamical systems to partially observable linear dynamical systems, and ultimately, nonlinear systems.
We deploy our learning and control algorithms in various adaptive control tasks in real-world control systems and demonstrate their strong empirical performance along with their learning, robustness, and stability guarantees. In particular, we implement one of our proposed methods, Fourier Adaptive Learning and Control (FALCON), on an experimental aerodynamic testbed under extreme turbulent flow dynamics in a wind tunnel. The results show that FALCON achieves state-of-the-art stabilization performance and consistently outperforms conventional and other learning-based methods by at least 37%, despite using 8 times less data. The superior performance of FALCON arises from its physically and theoretically accurate modeling of the underlying nonlinear turbulent dynamics, which yields rigorous finite-sample learning and performance guarantees. These findings underscore the importance of characterizing the statistical complexity of learning and control of unknown dynamical systems.
Contextual Bandits and Imitation Learning via Preference-Based Active Queries
We consider the problem of contextual bandits and imitation learning, where
the learner lacks direct knowledge of the executed action's reward. Instead,
the learner can actively query an expert at each round to compare two actions
and receive noisy preference feedback. The learner's objective is two-fold: to
minimize the regret associated with the executed actions, while simultaneously,
minimizing the number of comparison queries made to the expert. In this paper,
we assume that the learner has access to a function class that can represent
the expert's preference model under appropriate link functions, and provide an
algorithm that leverages an online regression oracle with respect to this
function class for choosing its actions and deciding when to query. For the
contextual bandit setting, our algorithm achieves a regret bound that combines
the best of both worlds, scaling as $O(\min\{\sqrt{T},\, d/\Delta\})$, where $T$
represents the number of interactions, $d$ represents the eluder dimension of
the function class, and $\Delta$ represents the minimum preference of the
optimal action over any suboptimal action under all contexts. Our algorithm
does not require the knowledge of $\Delta$, and the obtained regret bound is
comparable to what can be achieved in the standard contextual bandits setting
where the learner observes reward signals at each round. Additionally, our
algorithm makes only $O(\min\{T,\, d^2/\Delta^2\})$ queries to the expert. We
then extend our algorithm to the imitation learning setting, where the learning
agent engages with an unknown environment in episodes of length $H$ each, and
provide similar guarantees for regret and query complexity. Interestingly, our
algorithm for imitation learning can even learn to outperform the underlying
expert, when it is suboptimal, highlighting a practical benefit of
preference-based feedback in imitation learning.
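To make the query rule concrete, here is a minimal sketch (ours, not the authors' pseudocode) of uncertainty-gated expert querying in a contextual bandit with a linear preference model and a logistic link. The threshold and the least-squares update are illustrative simplifications of the online regression oracle.

```python
# A minimal sketch of uncertainty-gated preference queries; all constants
# (query_threshold, dimensions) are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, T, n_actions = 5, 2000, 10
theta_star = rng.normal(size=d)          # unknown expert preference vector
A, b = np.eye(d), np.zeros(d)            # online ridge-regression state
query_threshold, queries = 0.5, 0

for t in range(1, T + 1):
    X = rng.normal(size=(n_actions, d))  # context-dependent action features
    theta_hat = np.linalg.solve(A, b)
    scores = X @ theta_hat
    best, runner = np.argsort(scores)[-2:][::-1]
    z = X[best] - X[runner]              # comparison direction
    width = np.sqrt(z @ np.linalg.solve(A, z))  # confidence-ellipsoid width
    if width > query_threshold:          # uncertain: ask the expert to compare
        queries += 1
        p = 1 / (1 + np.exp(-(z @ theta_star)))  # logistic preference model
        y = rng.binomial(1, p)           # noisy preference feedback
        A += np.outer(z, z)              # least-squares surrogate update
        b += y * z
    # play `best` regardless; regret accrues only from the executed action

print(f"queries made: {queries} / {T}")
```

The point of the gate is visible here: once the regression is confident about the duel between the top two actions, the expert is no longer consulted, so queries grow much more slowly than rounds.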
Active Coverage for PAC Reinforcement Learning
Collecting and leveraging data with good coverage properties plays a crucial
role in different aspects of reinforcement learning (RL), including reward-free
exploration and offline learning. However, the notion of "good coverage" really
depends on the application at hand, as data suitable for one context may not be
so for another. In this paper, we formalize the problem of active coverage in
episodic Markov decision processes (MDPs), where the goal is to interact with
the environment so as to fulfill given sampling requirements. This framework is
sufficiently flexible to specify any desired coverage property, making it
applicable to any problem that involves online exploration. Our main
contribution is an instance-dependent lower bound on the sample complexity of
active coverage and a simple game-theoretic algorithm, CovGame, that nearly
matches it. We then show that CovGame can be used as a building block to solve
different PAC RL tasks. In particular, we obtain a simple algorithm for PAC
reward-free exploration with an instance-dependent sample complexity that, in
certain MDPs which are "easy to explore", is lower than the minimax one. By
further coupling this exploration algorithm with a new technique to do implicit
eliminations in policy space, we obtain a computationally-efficient algorithm
for best-policy identification whose instance-dependent sample complexity
scales with gaps between policy values.
Comment: Accepted at COLT 2023
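As a toy illustration of the active-coverage formulation (not the CovGame algorithm, and cheating by assuming known dynamics for planning), the sketch below repeatedly targets the currently most under-sampled state-action pair until every pair meets its requirement; all constants are made up.

```python
# Toy active coverage in a tabular episodic MDP with known dynamics.
import numpy as np

rng = np.random.default_rng(0)
S, A, H, target = 4, 2, 5, 10
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = next-state distribution
counts = np.zeros((S, A))

def reach_policy(goal):
    """Backward induction maximizing expected visits to state `goal`."""
    V = np.zeros((H + 1, S))
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        Q = (P * V[h + 1]).sum(axis=2)           # (S, A) continuation value
        Q += (np.arange(S) == goal)[:, None]     # reward 1 for being at goal
        pi[h], V[h] = Q.argmax(axis=1), Q.max(axis=1)
    return pi

episodes = 0
while counts.min() < target:
    g_s, g_a = np.unravel_index(np.argmin(counts), counts.shape)
    pi, s = reach_policy(g_s), 0
    for h in range(H):
        a = g_a if s == g_s else pi[h, s]        # take the needed action at goal
        counts[s, a] += 1
        s = rng.choice(S, p=P[s, a])
    episodes += 1

print(f"all sampling requirements met after {episodes} episodes")
```

The paper's setting is harder (dynamics unknown, requirements arbitrary), which is what the game-theoretic construction of CovGame addresses.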
Learning in Repeated Multi-Unit Pay-As-Bid Auctions
Motivated by Carbon Emissions Trading Schemes, Treasury Auctions, and
Procurement Auctions, which all involve the auctioning of homogeneous multiple
units, we consider the problem of learning how to bid in repeated multi-unit
pay-as-bid auctions. In each of these auctions, a large number of (identical)
items are to be allocated to the largest submitted bids, where the price of
each of the winning bids is equal to the bid itself. The problem of learning
how to bid in pay-as-bid auctions is challenging due to the combinatorial
nature of the action space. We overcome this challenge by focusing on the
offline setting, where the bidder optimizes their vector of bids while only
having access to the past submitted bids by other bidders. We show that the
optimal solution to the offline problem can be obtained using a polynomial time
dynamic programming (DP) scheme. We leverage the structure of the DP scheme to
design online learning algorithms with polynomial time and space complexity
under full information and bandit feedback settings. We achieve an upper bound
on regret of $O(M\sqrt{T \log |\mathcal{B}|})$ and $O(M\sqrt{|\mathcal{B}| T \log |\mathcal{B}|})$
respectively, where $M$ is the number of units demanded by the
bidder, $T$ is the total number of auctions, and $|\mathcal{B}|$ is the size of
the discretized bid space. We accompany these results with a regret lower
bound, which matches the linear dependency in $M$. Our numerical results suggest
that when all agents behave according to our proposed no-regret learning
algorithms, the resulting market dynamics mainly converge to a welfare
maximizing equilibrium where bidders submit uniform bids. Lastly, our
experiments demonstrate that the pay-as-bid auction consistently generates
significantly higher revenue compared to its popular alternative, the uniform
price auction.
Comment: 51 pages, 12 figures
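To illustrate the offline DP idea, here is a minimal sketch under simplifying assumptions (a single per-unit value v, i.i.d. samples of competitor bid profiles, and the standard pay-as-bid winning rule that the bidder's k-th highest bid wins iff it beats the (Q-k+1)-th highest competing bid). It is not the paper's exact scheme, and all constants are illustrative.

```python
# Offline pay-as-bid bid optimization via dynamic programming (simplified).
import numpy as np

rng = np.random.default_rng(0)
v, M, Q, n = 1.0, 3, 10, 500                 # value/unit, demand, supply, samples
B = np.linspace(1.0, 0.0, 21)                # discretized bid levels, descending
comp = np.sort(rng.uniform(0, 1, (n, 30)), axis=1)[:, ::-1]  # competitor bids

# P[k, i]: empirical prob. that level B[i] wins a unit as our (k+1)-th bid,
# i.e. beats the (Q-k)-th highest competing bid.
P = np.array([[np.mean(B[i] > comp[:, Q - k - 1]) for i in range(len(B))]
              for k in range(M)])

# dp[k, i]: best utility from bids k..M-1 when bid k sits at level i; the
# suffix maximum enforces a non-increasing (monotone) bid vector.
dp = np.zeros((M + 1, len(B)))
for k in range(M - 1, -1, -1):
    suffix = np.maximum.accumulate(dp[k + 1][::-1])[::-1]    # max over j >= i
    dp[k] = P[k] * (v - B) + suffix

bids, i = [], int(np.argmax(dp[0]))          # recover an optimal bid vector
for k in range(M):
    bids.append(B[i])
    if k < M - 1:
        i += int(np.argmax(dp[k + 1][i:]))   # next bid no higher than current

print("optimal offline bids:", np.round(bids, 2))
```

The combinatorial action space (all monotone bid vectors over the grid) is thus searched in time polynomial in M and the grid size, which is the structure the paper's online algorithms exploit.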
A Survey on Causal Reinforcement Learning
While Reinforcement Learning (RL) achieves tremendous success in sequential
decision-making problems of many domains, it still faces key challenges of data
inefficiency and the lack of interpretability. Recently, many researchers have
leveraged insights from the causality literature, producing a flourishing body
of work that combines the merits of causality with RL to address these
challenges. It is therefore both timely and important to collate these Causal
Reinforcement Learning (CRL) works, offer a review of CRL methods, and
investigate the potential benefits of causality for RL. In particular, we
divide existing CRL approaches into two categories according to whether their
causality-based information is given in advance or not. We further analyze each
category in terms of the formalization of different models, including the
Markov Decision Process (MDP), the Partially Observed Markov Decision Process
(POMDP), Multi-Armed Bandits (MAB), and the Dynamic Treatment Regime (DTR).
Moreover, we summarize the evaluation metrics and open-source resources, and we
discuss emerging applications along with promising prospects for the future
development of CRL.
Comment: 29 pages, 20 figures
Online Continuous Hyperparameter Optimization for Contextual Bandits
In stochastic contextual bandits, an agent sequentially makes actions from a
time-dependent action set based on past experience to minimize the cumulative
regret. Like many other machine learning algorithms, the performance of bandits
heavily depends on their multiple hyperparameters, and theoretically derived
parameter values may lead to unsatisfactory results in practice. Moreover, it
is infeasible to use offline tuning methods like cross-validation to choose
hyperparameters under the bandit environment, as the decisions should be made
in real time. To address this challenge, we propose the first online continuous
hyperparameter tuning framework for contextual bandits to learn the optimal
parameter configuration within a search space on the fly. Specifically, we use
a double-layer bandit framework named CDT (Continuous Dynamic Tuning) and
formulate the hyperparameter optimization as a non-stationary continuum-armed
bandit, where each arm represents a combination of hyperparameters, and the
corresponding reward is the algorithmic result. For the top layer, we propose
the Zooming TS algorithm that utilizes Thompson Sampling (TS) for exploration
and a restart technique to get around the switching environment. The proposed
CDT framework can be easily used to tune contextual bandit algorithms without
any pre-specified candidate set for hyperparameters. We further show that it
achieves sublinear regret in theory and performs consistently better on
both synthetic and real datasets in practice.
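A highly simplified, one-dimensional sketch of the top layer's flavor (illustrative only, not the paper's Zooming TS pseudocode): Gaussian Thompson Sampling over an adaptively refined set of hyperparameter values, with periodic restarts for the switching environment. Here `reward_of` stands in for running the tuned bandit algorithm with a given hyperparameter value.

```python
# Simplified zooming + Thompson Sampling with restarts (illustrative).
import numpy as np

rng = np.random.default_rng(0)
T, restart_every = 3000, 1000

def reward_of(x, t):                 # proxy for the tuned algorithm's reward
    peak = 0.3 if t < 1500 else 0.7  # environment switches mid-run
    return float(np.exp(-8 * (x - peak) ** 2) + 0.1 * rng.normal())

for t in range(T):
    if t % restart_every == 0:       # restart: drop statistics that went stale
        arms = list(np.linspace(0, 1, 5))
        stats = [[0.0, 0] for _ in arms]          # [reward sum, pull count]
    # Gaussian Thompson Sampling over the currently active arms
    draws = [rng.normal(s / max(c, 1), 1 / np.sqrt(c + 1)) for s, c in stats]
    i = int(np.argmax(draws))
    stats[i][0] += reward_of(arms[i], t)
    stats[i][1] += 1
    if stats[i][1] == 40:            # zooming step: refine near a promising arm
        arms.append(float(np.clip(arms[i] + rng.uniform(-0.1, 0.1), 0, 1)))
        stats.append([0.0, 0])

print("refined hyperparameter grid:", np.round(sorted(arms), 2))
```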
Recommended from our members
Improved Asymptotics for Multi-armed Bandit Experiments under Optimism-based Policies: Theory and Applications
The classical multi-armed bandit paradigm is a foundational framework for online decision making underlying a wide variety of important applications, e.g., clinical trials, advertising, sequential assignments, assortment optimization, etc. This work will examine two salient aspects of decision making that arise naturally in settings with large action spaces.
The first issue pertains to the division of samples across arms at the level of a trajectory (or sample-path). Traditional bounds at the ensemble level (or in expectation) only translate to meaningful pathwise guarantees (high-probability bounds) when the separation between mean rewards is "large," commonly referred to as the "well-separated" regime in the literature. On the other hand, applications with a large action space are intrinsically endowed with smaller separations between arm-means (e.g., multiple products of similar quality in e-retail). As a result, classical ensemble-level guarantees for such problems become vacuous at the sample-path level in several settings. This theoretical gap in the understanding of bandit algorithms in the "small gap" regime can be of significant consequence in applications where considerations such as fairness and post hoc inference play an important role. Our work provides the first systematic treatment and analysis of this aspect under the celebrated UCB class of optimism-based bandit algorithms, including a complete diffusion-limit characterization of its regret. The diffusion-scale lens also reveals profound insights and highlights distinctions between UCB and the popular posterior sampling-based method, Thompson Sampling, such as an "incomplete learning" phenomenon that is characteristic of the latter.
The second research question studied in this work concerns the complexity of decision making in problems where the action space is endowed with a large number of substitutable alternatives. For example, it is common in e-retail for multiple brands to offer similar products (in terms of quality-of-service) that compete for revenue within a given product segment. We model the platform's decision problem in this example as a bandit with countably many arms, and investigate limits of achievable performance under canonical bandit algorithms adapted to this setting. We also propose novel rate-optimal algorithms that leverage results for the "small gap" regime alluded to earlier, and show that these outperform aforementioned conventional adaptations. We extend the countable-armed bandit paradigm to also serve as a basal motif in sequential assignment and dynamic matching problems typical of settings such as online labor markets.
The last chapter of this thesis investigates achievable performance in the countable-armed bandit problem under non-stationarity that is attributable to vanishing arms. This characteristic abstracts away certain attrition and churn processes observable in online markets; for example, a popular brand may retract its product from a platform owing to under-exposure within its category, a potential negative externality of the exploration carried out by the platform's policy.
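The "small gap" regime is easy to visualize in simulation. The toy experiment below (our illustration, not the thesis's analysis) sets the arm-mean separation at the diffusion scaling 1/sqrt(T) and shows that the share of pulls UCB awards the better arm fluctuates widely from one sample path to the next, which is exactly why ensemble-level regret bounds say little pathwise here.

```python
# Sample-path variability of UCB under a diffusion-scale gap (illustrative).
import numpy as np

rng = np.random.default_rng(0)
T = 10_000
means = np.array([0.5, 0.5 + 1 / np.sqrt(T)])   # gap at the diffusion scale

def ucb_best_arm_share():
    n = np.ones(2)                               # one initial pull per arm
    s = rng.binomial(1, means).astype(float)
    for t in range(2, T):
        i = int(np.argmax(s / n + np.sqrt(2 * np.log(t) / n)))
        s[i] += rng.binomial(1, means[i])
        n[i] += 1
    return n[1] / T                              # fraction given to the best arm

shares = np.array([ucb_best_arm_share() for _ in range(20)])
print(f"best-arm share across 20 paths: mean {shares.mean():.2f}, "
      f"min {shares.min():.2f}, max {shares.max():.2f}")
```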
Anytime Model Selection in Linear Bandits
Model selection in the context of bandit optimization is a challenging
problem, as it requires balancing exploration and exploitation not only for
action selection, but also for model selection. One natural approach is to rely
on online learning algorithms that treat different models as experts. Existing
methods, however, scale poorly ($\mathrm{poly}(M)$) with the number of models $M$
in terms of their regret. Our key insight is that, for model selection in
linear bandits, we can emulate full-information feedback to the online learner
with a favorable bias-variance trade-off. This allows us to develop ALEXP,
which has an exponentially improved ($\log M$) dependence on $M$ for its
regret. ALEXP has anytime guarantees on its regret, and neither requires
knowledge of the horizon $n$, nor relies on an initial purely exploratory
stage. Our approach utilizes a novel time-uniform analysis of the Lasso,
establishing a new connection between online learning and high-dimensional
statistics.
Comment: 37 pages, 7 figures
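A condensed sketch of the key insight, in our own illustrative form rather than the ALEXP pseudocode: a single observed reward lets every candidate model be scored on its prediction error, so the model-selection layer can run exponential weights with full-information losses. The diagonal feature masks and all constants are made-up stand-ins (ALEXP itself uses the Lasso and time-uniform confidence sequences).

```python
# Emulated full-information model selection in a linear bandit (illustrative).
import numpy as np

rng = np.random.default_rng(0)
d, M, T, eta = 6, 8, 1000, 0.5
theta_star = rng.normal(size=d)

maps = []                                    # candidate feature maps (models)
for _ in range(M):
    mask = np.zeros(d)
    mask[rng.choice(d, 2, replace=False)] = 1.0
    maps.append(np.diag(mask))               # misspecified: sees 2 coordinates
true_m = 3
maps[true_m] = np.eye(d)                     # the well-specified model

A = [np.eye(d) for _ in range(M)]            # per-model ridge-regression state
b = [np.zeros(d) for _ in range(M)]
logw = np.zeros(M)                           # exponential-weights log weights

def probs(lw):
    w = np.exp(lw - lw.max())
    return w / w.sum()

for t in range(T):
    X = rng.normal(size=(20, d))             # this round's action features
    m = rng.choice(M, p=probs(logw))         # sample a model, act greedily
    th = np.linalg.solve(A[m], b[m])
    x = X[np.argmax(X @ maps[m] @ th)]
    r = x @ theta_star + 0.1 * rng.normal()  # one bandit observation...
    for k in range(M):                       # ...yet every model gets a loss
        z = maps[k] @ x
        pred = z @ np.linalg.solve(A[k], b[k])
        logw[k] -= eta * (pred - r) ** 2     # full-information exp-weights
        A[k] += np.outer(z, z)
        b[k] += r * z

print("weight on the well-specified model:", round(probs(logw)[true_m], 2))
```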
Linear Bandits with Memory: from Rotting to Rising
Nonstationary phenomena, such as satiation effects in recommendation, are a
common feature of sequential decision-making problems. While these phenomena
have been mostly studied in the framework of bandits with finitely many arms,
in many practically relevant cases linear bandits provide a more effective
modeling choice. In this work, we introduce a general framework for the study
of nonstationary linear bandits, where current rewards are influenced by the
learner's past actions in a fixed-size window. In particular, our model
includes stationary linear bandits as a special case. After showing that the
best sequence of actions is NP-hard to compute in our model, we focus on cyclic
policies and prove a regret bound for a variant of the OFUL algorithm that
balances approximation and estimation errors. Our theoretical findings are
supported by experiments (which also include misspecified settings) where our
algorithm is seen to perform well against natural baselines.
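For reference, a minimal sketch of the standard OFUL/LinUCB step that the paper's variant builds on; the memory window, cyclic policies, and approximation-error balancing are omitted, and the confidence radius beta is a fixed illustrative constant rather than the time-uniform one used in the analysis.

```python
# Standard OFUL/LinUCB step: optimism via a confidence-ellipsoid bonus.
import numpy as np

rng = np.random.default_rng(0)
d, T, lam, beta = 4, 2000, 1.0, 2.0
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)
V, bvec = lam * np.eye(d), np.zeros(d)       # regularized design and responses

for t in range(T):
    X = rng.normal(size=(15, d))             # this round's action set
    Vinv = np.linalg.inv(V)
    theta_hat = Vinv @ bvec
    # mean estimate plus exploration bonus x^T V^{-1} x per action
    ucb = X @ theta_hat + beta * np.sqrt(np.einsum('ij,jk,ik->i', X, Vinv, X))
    x = X[np.argmax(ucb)]                    # optimistic action
    r = x @ theta_star + 0.1 * rng.normal()
    V += np.outer(x, x)
    bvec += r * x

err = np.linalg.norm(np.linalg.solve(V, bvec) - theta_star)
print(f"parameter error after {T} rounds: {err:.3f}")
```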