Learning-based Control of Unknown Linear Systems with Thompson Sampling
We propose a Thompson sampling-based learning algorithm for the Linear
Quadratic (LQ) control problem with unknown system parameters. The algorithm is
called Thompson sampling with dynamic episodes (TSDE), in which two stopping
criteria determine the lengths of the dynamic episodes.
The first stopping criterion controls the growth rate of episode length. The
second stopping criterion is triggered when the determinant of the sample
covariance matrix is less than half of the previous value. We show under some
conditions on the prior distribution that the expected (Bayesian) regret of
TSDE accumulated up to time T is bounded by O(\sqrt{T}). Here O(\cdot) hides
constants and logarithmic factors. This is the first O(\sqrt{T}) bound on
expected regret of learning in LQ control. By introducing a reinitialization
schedule, we also show that the algorithm is robust to time-varying drift in
model parameters. Numerical simulations are provided to illustrate the
performance of TSDE.
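For concreteness, the two stopping criteria can be read off the abstract and expressed as a small predicate. The sketch below is our schematic reading, not the paper's pseudocode, and all variable names are ours.

```python
def tsde_should_restart(t, t_k, prev_episode_len, det_cov, det_cov_start):
    """Return True when either TSDE stopping criterion fires.

    t: current time; t_k: start time of the current episode;
    prev_episode_len: length of the previous episode;
    det_cov: determinant of the current sample covariance matrix;
    det_cov_start: that determinant at the start of the episode.
    """
    # Criterion 1: cap episode growth -- the current episode may exceed
    # the previous episode's length by at most one step.
    too_long = (t - t_k) > prev_episode_len
    # Criterion 2: the determinant of the sample covariance matrix has
    # fallen below half of its episode-start value.
    enough_new_information = det_cov < 0.5 * det_cov_start
    return too_long or enough_new_information
```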
Posterior Sampling for Large Scale Reinforcement Learning
We propose a practical non-episodic PSRL algorithm that, unlike recent
state-of-the-art PSRL algorithms, uses a deterministic, model-independent
episode-switching schedule. Our algorithm, termed deterministic-schedule PSRL
(DS-PSRL), is efficient in terms of time, sample, and space complexity. We prove
a Bayesian regret bound under mild assumptions. Our result is more generally
applicable to multiple parameters and continuous state action problems. We
compare our algorithm with state-of-the-art PSRL algorithms on standard
discrete and continuous problems from the literature. Finally, we show how the
assumptions of our algorithm are satisfied under a sensible parametrization for
a large class of sequential recommendation problems.
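A minimal sketch of a PSRL loop driven by a deterministic switching schedule is shown below. The doubling schedule is one illustrative model-independent choice, not necessarily the paper's exact schedule, and `env`, `posterior`, and `solve` are placeholder interfaces.

```python
def ds_psrl(env, posterior, solve, horizon):
    """Sketch of PSRL with a deterministic episode-switching schedule."""
    t, episode_len = 0, 1
    state = env.reset()
    while t < horizon:
        theta = posterior.sample()        # draw one model from the posterior
        policy = solve(theta)             # plan as if the sampled model were true
        for _ in range(min(episode_len, horizon - t)):
            action = policy(state)
            next_state, reward = env.step(action)
            posterior = posterior.update(state, action, reward, next_state)
            state = next_state
            t += 1
        episode_len *= 2                  # deterministic, model-independent switch
    return posterior
```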
Learning to Optimize via Information-Directed Sampling
We propose information-directed sampling -- a new approach to online
optimization problems in which a decision-maker must balance between
exploration and exploitation while learning from partial feedback. Each action
is sampled in a manner that minimizes the ratio between squared expected
single-period regret and a measure of information gain: the mutual information
between the optimal action and the next observation. We establish an expected
regret bound for information-directed sampling that applies across a very
general class of models and scales with the entropy of the optimal action
distribution. We illustrate through simple analytic examples how
information-directed sampling accounts for kinds of information that
alternative approaches do not adequately address and that this can lead to
dramatic performance gains. For the widely studied Bernoulli, Gaussian, and
linear bandit problems, we demonstrate state-of-the-art simulation performance.
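For a finite action set with precomputed per-action regret and information-gain estimates, the inner optimization can be solved by brute force, using the fact from Russo and Van Roy that some information-ratio-minimizing distribution is supported on at most two actions. A minimal sketch, with `delta` and `gain` assumed given:

```python
import numpy as np

def ids_distribution(delta, gain, grid=200):
    """Minimize the information ratio: (expected regret)^2 / information gain.

    delta[a]: expected single-period regret of action a;
    gain[a]: expected information gain about the optimal action from playing a.
    Searching over action pairs and mixing weights suffices because an
    optimal distribution is supported on at most two actions.
    """
    n, best_ratio, best_p = len(delta), np.inf, None
    for i in range(n):
        for j in range(n):
            for q in np.linspace(0.0, 1.0, grid):
                d = q * delta[i] + (1 - q) * delta[j]
                g = q * gain[i] + (1 - q) * gain[j]
                if g <= 0:
                    continue
                if d * d / g < best_ratio:
                    p = np.zeros(n)
                    p[i] += q
                    p[j] += 1 - q
                    best_ratio, best_p = d * d / g, p
    return best_p
```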
A Practical Method for Solving Contextual Bandit Problems Using Decision Trees
Many efficient algorithms with strong theoretical guarantees have been
proposed for the contextual multi-armed bandit problem. However, applying these
algorithms in practice can be difficult because they require domain expertise
to build appropriate features and to tune their parameters. We propose a new
method for the contextual bandit problem that is simple, practical, and can be
applied with little or no domain expertise. Our algorithm relies on decision
trees to model the context-reward relationship. Decision trees are
non-parametric, interpretable, and work well without hand-crafted features. To
guide the exploration-exploitation trade-off, we use a bootstrapping approach
that generalizes Thompson sampling to non-Bayesian settings. We also discuss
several computational heuristics and demonstrate the performance of our method
on several datasets.
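A minimal sketch of the recipe: refit a decision tree on a bootstrap resample of each arm's history and play the greedy arm, so that resampling stands in for a posterior draw. This assumes scikit-learn's DecisionTreeRegressor and omits the paper's computational heuristics.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def choose_arm(arm_data, context, rng):
    """Bootstrapped Thompson sampling with per-arm decision trees (sketch).

    arm_data[a] = (X_a, y_a): past contexts and rewards for arm a.
    """
    for a, (X, y) in enumerate(arm_data):
        if len(y) == 0:
            return a                              # play unobserved arms first
    scores = []
    for X, y in arm_data:
        idx = rng.integers(len(y), size=len(y))   # bootstrap resample
        tree = DecisionTreeRegressor().fit(np.asarray(X)[idx], np.asarray(y)[idx])
        scores.append(tree.predict(np.asarray(context).reshape(1, -1))[0])
    return int(np.argmax(scores))                 # greedy on the resampled fits
```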
Generalized Thompson Sampling for Contextual Bandits
Thompson Sampling, one of the oldest heuristics for solving multi-armed
bandits, has recently been shown to achieve state-of-the-art performance.
Its empirical success has led to great interest in the theoretical understanding
of this heuristic. In this paper, we approach this problem in a way very
different from existing efforts. In particular, motivated by the connection
between Thompson Sampling and exponentiated updates, we propose a new family of
algorithms called Generalized Thompson Sampling in the expert-learning
framework, which includes Thompson Sampling as a special case. Similar to most
expert-learning algorithms, Generalized Thompson Sampling uses a loss function
to adjust the experts' weights. General regret bounds are derived, which are
also instantiated to two important loss functions: square loss and logarithmic
loss. In contrast to existing bounds, our results apply to quite general
contextual bandits. More importantly, they quantify the effect of the "prior"
distribution on the regret bounds.
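A generic exponentiated-update step of the kind the abstract describes might look as follows; the paper's exact update and loss scaling may differ.

```python
import numpy as np

def exponentiated_update(weights, losses, eta):
    """One expert-weight update in the exponentiated-update family that
    Generalized Thompson Sampling belongs to: down-weight each expert
    according to its loss on the observed feedback (generic sketch).
    """
    w = weights * np.exp(-eta * losses)
    return w / w.sum()                    # keep a distribution over experts

# The two loss functions instantiated in the bounds, for an expert that
# predicts reward probability p and then observes binary reward r:
def square_loss(p, r):
    return (p - r) ** 2

def log_loss(p, r, eps=1e-12):
    return -(r * np.log(p + eps) + (1.0 - r) * np.log(1.0 - p + eps))
```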
Scalable Coordinated Exploration in Concurrent Reinforcement Learning
We consider a team of reinforcement learning agents that concurrently operate
in a common environment, and we develop an approach to efficient coordinated
exploration that is suitable for problems of practical scale. Our approach
builds on seed sampling (Dimakopoulou and Van Roy, 2018) and randomized value
function learning (Osband et al., 2016). We demonstrate that, for simple
tabular contexts, the approach is competitive with previously proposed tabular
model learning methods (Dimakopoulou and Van Roy, 2018). With a
higher-dimensional problem and a neural network value function representation,
the approach learns quickly with far fewer agents than alternative exploration
schemes.
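One way to picture seed sampling for linear value functions: each agent commits to a fixed random seed that deterministically perturbs the shared data and prior, yielding a sample that is diverse across agents yet self-consistent for each agent as data accumulate. A sketch under assumed Gaussian perturbations; all names and scales are illustrative, not the papers' notation.

```python
import numpy as np

def seeded_value_sample(agent_seed, X, y, prior_scale=1.0, noise_scale=1.0):
    """Seed-sampling sketch for one agent in a concurrent team.

    X: (n, d) shared features; y: (n,) shared targets. The agent's seed
    fixes a prior draw and a data perturbation, and the perturbed
    regularized least-squares solution acts like a posterior sample.
    """
    rng = np.random.default_rng(agent_seed)
    d, n = X.shape[1], len(y)
    w_prior = prior_scale * rng.normal(size=d)      # seed-fixed prior draw
    y_pert = y + noise_scale * rng.normal(size=n)   # seed-fixed target noise
    # Regularized least squares pulled toward the agent's prior draw:
    A = X.T @ X / noise_scale**2 + np.eye(d) / prior_scale**2
    b = X.T @ y_pert / noise_scale**2 + w_prior / prior_scale**2
    return np.linalg.solve(A, b)
```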
Bayesian model predictive control: Efficient model exploration and regret bounds using posterior sampling
Tight performance specifications in combination with operational constraints
make model predictive control (MPC) the method of choice in various industries.
As the performance of an MPC controller depends on a sufficiently accurate
objective and prediction model of the process, a significant effort in the MPC
design procedure is dedicated to modeling and identification. Driven by the
increasing amount of available system data and advances in the field of machine
learning, data-driven MPC techniques have been developed to facilitate the MPC
controller design. While these methods are able to leverage available data,
they typically do not provide principled mechanisms to automatically trade off
exploitation of available data and exploration to improve and update the
objective and prediction model. To this end, we present a learning-based MPC
formulation using posterior sampling techniques, which provides finite-time
regret bounds on the learning performance while being simple to implement using
off-the-shelf MPC software and algorithms. The performance analysis of the
method is based on posterior sampling theory and its practical efficiency is
illustrated using a numerical example of a highly nonlinear dynamical
car-trailer system.
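The resulting control loop is easy to state: sample one model per episode from the posterior and run MPC as if that sample were the true model. A sketch with placeholder interfaces (`posterior`, `mpc_solve`, `env`); this is not the paper's API.

```python
def psrl_mpc(posterior, mpc_solve, env, episodes, episode_len):
    """Posterior-sampling MPC loop (sketch of the general recipe)."""
    state = env.reset()
    for _ in range(episodes):
        theta = posterior.sample()          # one sampled model per episode
        for _ in range(episode_len):
            u = mpc_solve(theta, state)     # first input of the MPC plan under theta
            next_state, cost = env.step(u)
            posterior = posterior.update(state, u, next_state)
            state = next_state
    return posterior
```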
Estimation Considerations in Contextual Bandits
Contextual bandit algorithms are sensitive to the estimation method of the
outcome model as well as the exploration method used, particularly in the
presence of rich heterogeneity or complex outcome models, which can lead to
difficult estimation problems along the path of learning. We study a
consideration in the exploration vs. exploitation framework that does not
arise in multi-armed bandits but is crucial in contextual bandits: the way
exploration and exploitation are conducted in the present affects the bias and
variance of the potential-outcome model estimates in subsequent stages of
learning. We develop parametric and non-parametric contextual bandits that
integrate balancing methods from the causal inference literature in their
estimation, making them less prone to estimation bias. We provide the
first regret bound analyses for contextual bandits with balancing in the domain
of linear contextual bandits that match state-of-the-art regret bounds. We
demonstrate the strong practical advantage of balanced contextual bandits on a
large number of supervised learning datasets and on a synthetic example that
simulates model mis-specification and prejudice in the initial training data.
Additionally, we develop contextual bandits with simpler assignment policies by
leveraging sparse model estimation methods from the econometrics literature and
demonstrate empirically that in the early stages they can improve the rate of
learning and decrease regret.
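As an illustration of the balancing idea, an arm's outcome model can be fit with inverse-propensity weights so that adaptively collected data look more like randomized data. A sketch of the general idea, not the paper's exact estimator; the ridge model and clipping threshold are our choices.

```python
import numpy as np
from sklearn.linear_model import Ridge

def balanced_arm_model(X, rewards, propensities, clip=0.05):
    """Fit one arm's outcome model with balancing weights.

    propensities[i]: probability with which the logging policy assigned
    this arm at observation i. Reweighting by clipped inverse propensities
    is a standard causal-inference correction for adaptive assignment.
    """
    w = 1.0 / np.clip(propensities, clip, 1.0)   # inverse propensity weights
    model = Ridge(alpha=1.0)
    model.fit(X, rewards, sample_weight=w)
    return model
```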
Note on Thompson sampling for large decision problems
There is increasing interest in using streaming data to inform decision
making across a wide range of application domains including mobile health, food
safety, security, and resource management. A decision support system formalizes
online decision making as a map from up-to-date information to a recommended
decision. Online estimation of an optimal decision strategy from streaming data
requires simultaneous estimation of components of the underlying system
dynamics as well as the optimal decision strategy given these dynamics; thus,
there is an inherent trade-off between choosing decisions that lead to improved
estimates and choosing decisions that appear to be optimal based on current
estimates. Thompson (1933) was among the first to formalize this trade-off in
the context of choosing between two treatments for a stream of patients; he
proposed a simple heuristic wherein a treatment is selected randomly at each
time point with selection probability proportional to the posterior probability
that it is optimal. We consider a variant of Thompson sampling that is simple
to implement and can be applied to large and complex decision problems. We show
that the proposed Thompson sampling estimator is consistent for the optimal
decision support system and provide rates of convergence and finite sample
error bounds. The proposed algorithm is illustrated using an agent-based model
of the spread of influenza on a network and management of mallard populations
in the United States.
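Thompson's heuristic is short enough to state in full for Bernoulli rewards with Beta posteriors: draw once from each treatment's posterior and play the argmax, which selects each treatment with exactly its posterior probability of being optimal. A minimal sketch with uniform Beta(1, 1) priors:

```python
import numpy as np

def thompson_two_treatments(successes, failures, rng):
    """Thompson's (1933) heuristic via one posterior draw per treatment.

    successes[a], failures[a]: observed outcome counts for treatment a.
    """
    draws = [rng.beta(1 + s, 1 + f) for s, f in zip(successes, failures)]
    return int(np.argmax(draws))

# Example: after 3/10 vs. 6/10 successes, treatment 1 is chosen more often.
rng = np.random.default_rng(0)
picks = [thompson_two_treatments([3, 6], [7, 4], rng) for _ in range(1000)]
```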
A Tour of Reinforcement Learning: The View from Continuous Control
This manuscript surveys reinforcement learning from the perspective of
optimization and control with a focus on continuous control applications. It
surveys the general formulation, terminology, and typical experimental
implementations of reinforcement learning and reviews competing solution
paradigms. In order to compare the relative merits of various techniques, this
survey presents a case study of the Linear Quadratic Regulator (LQR) with
unknown dynamics, perhaps the simplest and best-studied problem in optimal
control. The manuscript describes how merging techniques from learning theory
and control can provide non-asymptotic characterizations of LQR performance and
shows that these characterizations tend to match experimental behavior. In
turn, when revisiting more complex applications, many of the observed phenomena
in LQR persist. In particular, theory and experiment demonstrate the role and
importance of models and the cost of generality in reinforcement learning
algorithms. This survey concludes with a discussion of some of the challenges
in designing learning systems that safely and reliably interact with complex
and uncertain environments and how tools from reinforcement learning and
control might be combined to approach these challenges.
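The survey's running example admits a compact baseline: fit (A, B) by least squares from one-step transitions, then apply certainty-equivalent LQR to the estimate. A sketch of that baseline (one of several methods such a case study compares, not the survey's sole approach), assuming SciPy's discrete-time Riccati solver:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def certainty_equivalent_lqr(states, inputs, Q, R):
    """LQR with unknown dynamics via least squares + certainty equivalence.

    states: (T+1, n) trajectory; inputs: (T, m) applied controls;
    Q, R: state and input cost matrices.
    """
    X, U = states[:-1], inputs
    Z = np.hstack([X, U])                          # regressors [x_t, u_t]
    Theta, *_ = np.linalg.lstsq(Z, states[1:], rcond=None)
    n = X.shape[1]
    A_hat, B_hat = Theta[:n].T, Theta[n:].T        # x_{t+1} ~ A x_t + B u_t
    P = solve_discrete_are(A_hat, B_hat, Q, R)     # Riccati solution for the estimate
    K = np.linalg.solve(R + B_hat.T @ P @ B_hat, B_hat.T @ P @ A_hat)
    return A_hat, B_hat, K                         # control law u = -K x
```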