423 research outputs found
Logarithmic Regret Bound in Partially Observable Linear Dynamical Systems
We study the problem of adaptive control in partially observable linear dynamical systems. We propose a novel algorithm, adaptive control online learning algorithm (AdaptOn), which efficiently explores the environment, estimates the system dynamics episodically and exploits these estimates to design effective controllers to minimize the cumulative costs. Through interaction with the environment, AdaptOn deploys online convex optimization to optimize the controller while simultaneously learning the system dynamics to improve the accuracy of controller updates. We show that when the cost functions are strongly convex, after T time steps of agent-environment interaction, AdaptOn achieves a regret upper bound of polylog(T). To the best of our knowledge, AdaptOn is the first algorithm that achieves polylog(T) regret in adaptive control of unknown partially observable linear dynamical systems, which includes linear quadratic Gaussian (LQG) control
Regret Minimization in Partially Observable Linear Quadratic Control
We study the problem of regret minimization in partially observable linear quadratic control systems when the model dynamics are unknown a priori. We propose ExpCommit, an explore-then-commit algorithm that learns the model Markov parameters and then follows the principle of optimism in the face of uncertainty to design a controller. We propose a novel way to decompose the regret and provide an end-to-end sublinear regret upper bound for partially observable linear quadratic control. Finally, we provide stability guarantees and establish a regret upper bound of O(T^(2/3)) for ExpCommit, where T is the time horizon of the problem
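The exploration phase of an explore-then-commit scheme can be illustrated with a minimal sketch. It assumes a scalar system x_{t+1} = a x_t + b u_t + w_t, y_t = c x_t + v_t driven by white Gaussian inputs, and recovers the first few Markov parameters c a^i b with a simple input-output correlation estimator. The function name, noise levels, and the estimator itself are illustrative assumptions, not the procedure used by ExpCommit, and the commit phase (optimistic controller design) is not shown.

```python
import random

def estimate_markov_parameters(a, b, c, horizon, num_params, seed=0):
    """Estimate c * a**i * b for i = 0..num_params-1 from exploration data.

    Hypothetical helper: excites the system with unit-variance white noise
    and correlates each output with lagged inputs.
    """
    rng = random.Random(seed)
    x = 0.0
    inputs, outputs = [], []
    for _ in range(horizon):
        u = rng.gauss(0.0, 1.0)                    # exploratory input
        x = a * x + b * u + rng.gauss(0.0, 0.1)    # hidden state update
        y = c * x + rng.gauss(0.0, 0.1)            # noisy observation of the new state
        inputs.append(u)
        outputs.append(y)
    estimates = []
    for lag in range(num_params):
        # For white unit-variance inputs, E[y_t * u_{t-lag}] = c * a**lag * b.
        products = [outputs[t] * inputs[t - lag] for t in range(lag, horizon)]
        estimates.append(sum(products) / len(products))
    return estimates

estimates = estimate_markov_parameters(a=0.5, b=1.0, c=1.0,
                                       horizon=20000, num_params=3)
```

With these assumed values the true parameters are 1, 0.5 and 0.25; the estimates approach them as the exploration horizon grows.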
Optimal Rates for Bandit Nonstochastic Control
Linear Quadratic Regulator (LQR) and Linear Quadratic Gaussian (LQG) control
are foundational and extensively researched problems in optimal control. We
investigate LQR and LQG problems with semi-adversarial perturbations and
time-varying adversarial bandit loss functions. The best-known sublinear regret
algorithm of Gradu et al. (2020) has a T^(3/4) time horizon
dependence, and its authors posed an open question about whether a tight rate
of sqrt(T) could be achieved. We answer in the affirmative, giving an
algorithm for bandit LQR and LQG which attains optimal regret (up to
logarithmic factors) for both known and unknown systems. A central component of
our method is a new scheme for bandit convex optimization with memory, which is
of independent interest
Dynamical Linear Bandits
In many real-world sequential decision-making problems, an action does not immediately reflect on the feedback and spreads its effects over a long time frame. For instance, in online advertising, investing in a platform produces an instantaneous increase of awareness, but the actual reward, i.e., a conversion, might occur far in the future. Furthermore, whether a conversion takes place depends on: how fast the awareness grows, its vanishing effects, and the synergy or interference with other advertising platforms. Previous work has investigated the Multi-Armed Bandit framework with the possibility of delayed and aggregated feedback, without a particular structure on how an action propagates in the future, disregarding possible dynamical effects. In this paper, we introduce a novel setting, the Dynamical Linear Bandits (DLB), an extension of the linear bandits characterized by a hidden state. When an action is performed, the learner observes a noisy reward whose mean is a linear function of the hidden state and of the action. Then, the hidden state evolves according to linear dynamics, affected by the performed action too. We start by introducing the setting, discussing the notion of optimal policy, and deriving an expected regret lower bound. Then, we provide an optimistic regret minimization algorithm, Dynamical Linear Upper Confidence Bound (DynLin-UCB), that suffers an expected regret of order Õ(d sqrt(T) / (1 - ρ̄)^(3/2)), where ρ̄ is a measure of the stability of the system, and d is the dimension of the action vector. Finally, we conduct a numerical validation on a synthetic environment and on real-world data to show the effectiveness of DynLin-UCB in comparison with several baselines
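The setting described above can be made concrete with a small simulation. This is a minimal sketch assuming a scalar hidden state and scalar action; the parameter names (a, b, omega, theta) and the helper functions are illustrative assumptions, and the code only rolls out the environment rather than implementing DynLin-UCB.

```python
import random

def simulate_dlb(a, b, omega, theta, actions, noise_std=0.0, seed=0):
    """Roll out a scalar dynamical linear bandit (illustrative only).

    Hidden state: x_{t+1} = a * x_t + b * u_t + state noise
    Reward:       y_t     = omega * x_t + theta * u_t + reward noise
    """
    rng = random.Random(seed)
    x = 0.0
    rewards = []
    for u in actions:
        # The learner sees only the noisy reward, never the hidden state x.
        y = omega * x + theta * u + rng.gauss(0.0, noise_std)
        rewards.append(y)
        # The action also drives the hidden state, so its effect persists.
        x = a * x + b * u + rng.gauss(0.0, noise_std)
    return rewards

def steady_state_reward(a, b, omega, theta, u):
    """Long-run noise-free reward of the constant action u (requires |a| < 1)."""
    return (theta + omega * b / (1.0 - a)) * u

# Noise-free rollout of a constant action with assumed parameter values.
rewards = simulate_dlb(a=0.5, b=1.0, omega=1.0, theta=0.2, actions=[1.0] * 60)
```

In this rollout the first reward reflects only the instantaneous effect theta * u, while later rewards climb toward the steady-state value as the hidden state accumulates, which is the delayed effect the abstract describes.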
Introduction to Online Nonstochastic Control
This text presents an introduction to an emerging paradigm in control of
dynamical systems and differentiable reinforcement learning called online
nonstochastic control. The new approach applies techniques from online convex
optimization and convex relaxations to obtain new methods with provable
guarantees for classical settings in optimal and robust control.
The primary distinction between online nonstochastic control and other
frameworks is the objective. In optimal control, robust control, and other
control methodologies that assume stochastic noise, the goal is to perform
comparably to an offline optimal strategy. In online nonstochastic control,
both the cost functions as well as the perturbations from the assumed dynamical
model are chosen by an adversary. Thus the optimal policy is not defined a
priori. Rather, the target is to attain low regret against the best policy in
hindsight from a benchmark class of policies.
This objective suggests the use of the decision making framework of online
convex optimization as an algorithmic methodology. The resulting methods are
based on iterative mathematical optimization algorithms, and are accompanied by
finite-time regret and computational complexity guarantees.
Comment: Draft; comments/suggestions welcome.
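The regret objective can be illustrated with a minimal sketch of projected online gradient descent, a standard online convex optimization method. It assumes scalar decisions on the interval [-1, 1] and squared losses f_t(x) = (x - z_t)^2 with an arbitrary target sequence; the loss family, step-size schedule, and function names are illustrative assumptions and are not taken from the text. No dynamical system is involved here, only the regret comparison against the best fixed decision in hindsight.

```python
import math

def online_gradient_descent(targets, radius=1.0):
    """Run projected OGD on losses f_t(x) = (x - z_t)^2 over [-radius, radius]."""
    x, total_loss = 0.0, 0.0
    for t, z in enumerate(targets, start=1):
        total_loss += (x - z) ** 2          # loss is revealed after committing to x
        grad = 2.0 * (x - z)
        x -= (0.5 / math.sqrt(t)) * grad    # decaying step size (assumed schedule)
        x = max(-radius, min(radius, x))    # project back onto the feasible interval
    return total_loss

def best_fixed_loss(targets):
    """Loss of the best single decision in hindsight (the mean, for squared loss)."""
    x_star = sum(targets) / len(targets)
    return sum((x_star - z) ** 2 for z in targets)

# An arbitrary periodic target sequence, not drawn from any distribution.
targets = [1.0 if t % 3 else -1.0 for t in range(300)]
regret = online_gradient_descent(targets) - best_fixed_loss(targets)
```

Here the regret is the learner's cumulative loss minus that of the best fixed decision; for this step size, with gradients bounded by 4 and a domain of diameter 2, the standard analysis of online gradient descent bounds it by 12 sqrt(T).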