Bayesian Optimal Control of Smoothly Parameterized Systems: The Lazy Posterior Sampling Algorithm
We study Bayesian optimal control of a general class of smoothly
parameterized Markov decision problems. Since computing the optimal control is
computationally expensive, we design an algorithm that trades off performance
for computational efficiency. The algorithm is a lazy posterior sampling method
that maintains a distribution over the unknown parameter. The algorithm changes
its policy only when the variance of the distribution is reduced sufficiently.
Importantly, we analyze the algorithm and show the precise nature of the
performance vs. computation tradeoff. Finally, we show the effectiveness of the
method on a web server control application.
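To make the lazy switching rule concrete, here is a minimal Python sketch of one decision step. The helpers posterior.variance, posterior.sample, and solve_optimal_control are hypothetical stand-ins, and the shrinkage factor 0.5 is purely illustrative, not the paper's constant:

    def lazy_psrl_step(posterior, policy, last_var, shrink=0.5):
        # Recompute the policy only when the posterior variance has
        # dropped enough since the last switch; otherwise keep acting
        # with the current (possibly stale) policy.
        var = posterior.variance()
        if policy is None or var <= shrink * last_var:
            theta = posterior.sample()             # draw parameters
            policy = solve_optimal_control(theta)  # expensive, done lazily
            last_var = var
        return policy, last_var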
Learning Unknown Markov Decision Processes: A Thompson Sampling Approach
We consider the problem of learning an unknown Markov Decision Process (MDP)
that is weakly communicating in the infinite horizon setting. We propose a
Thompson Sampling-based reinforcement learning algorithm with dynamic episodes
(TSDE). At the beginning of each episode, the algorithm generates a sample from
the posterior distribution over the unknown model parameters. It then follows
the optimal stationary policy for the sampled model for the rest of the
episode. The duration of each episode is dynamically determined by two stopping
criteria. The first stopping criterion controls the growth rate of episode
length. The second stopping criterion happens when the number of visits to any
state-action pair is doubled. We establish \tilde{O}(HS\sqrt{AT}) bounds on
expected regret under a Bayesian setting, where S and A are the sizes of
the state and action spaces, T is time, and H is the bound of the span.
This regret bound matches the best available bound for weakly communicating
MDPs. Numerical results show that it performs better than existing algorithms
for infinite horizon MDPs. Comment: Accepted to NIPS 2017
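As a rough illustration of the dynamic-episode logic described above, the following Python sketch implements the two stopping criteria; env, posterior, and solve_average_reward are hypothetical stand-ins rather than the paper's code:

    import numpy as np

    def tsde(env, posterior, T):
        visits = np.zeros((env.n_states, env.n_actions))  # counts N_t(s, a)
        prev_len = 0                                      # previous episode length
        t, s = 0, env.reset()
        while t < T:
            theta = posterior.sample()                # sample model parameters
            policy = solve_average_reward(theta)      # optimal stationary policy
            start = visits.copy()                     # counts at episode start
            ep_len = 0
            while t < T:
                a = policy[s]
                s_next, r = env.step(a)
                posterior.update(s, a, s_next)        # Bayesian model update
                visits[s, a] += 1
                t, ep_len, s = t + 1, ep_len + 1, s_next
                # Criterion 1: at most one step longer than the last episode.
                if ep_len > prev_len:
                    break
                # Criterion 2: some state-action count doubled (a first visit
                # to a previously unseen pair also triggers this).
                if np.any(visits > 2 * start):
                    break
            prev_len = ep_len
        return posterior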
Posterior Sampling for Large Scale Reinforcement Learning
We propose a practical non-episodic PSRL algorithm that, unlike recent
state-of-the-art PSRL algorithms, uses a deterministic, model-independent
episode switching schedule. Our algorithm termed deterministic schedule PSRL
(DS-PSRL) is efficient in terms of time, sample, and space complexity. We prove
a Bayesian regret bound under mild assumptions. Our result is more generally
applicable to multiple parameters and continuous state action problems. We
compare our algorithm with state-of-the-art PSRL algorithms on standard
discrete and continuous problems from the literature. Finally, we show how the
assumptions of our algorithm are satisfied by a sensible parametrization for a
large class of problems in sequential recommendations.
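For contrast with the data-dependent episodes above, a deterministic schedule fixes the switching times in advance. The sketch below uses doubling episode lengths purely for illustration; the paper specifies its own schedule, and solve_policy, env, and posterior are hypothetical helpers:

    def ds_psrl(env, posterior, T):
        # Deterministic, model-independent switching: episode lengths are
        # fixed in advance (the doubling schedule here is illustrative only).
        t, k, s = 0, 0, env.reset()
        while t < T:
            policy = solve_policy(posterior.sample())  # plan for the sample
            for _ in range(min(2 ** k, T - t)):        # pre-set episode length
                a = policy[s]
                s_next, r = env.step(a)
                posterior.update(s, a, s_next)         # learning never pauses
                s = s_next
                t += 1
            k += 1
        return posterior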
Learning-based Control of Unknown Linear Systems with Thompson Sampling
We propose a Thompson sampling-based learning algorithm for the Linear
Quadratic (LQ) control problem with unknown system parameters. The algorithm is
called Thompson sampling with dynamic episodes (TSDE), where two stopping
criteria determine the lengths of the dynamic episodes in Thompson sampling.
The first stopping criterion controls the growth rate of episode length. The
second stopping criterion is triggered when the determinant of the sample
covariance matrix is less than half of the previous value. We show under some
conditions on the prior distribution that the expected (Bayesian) regret of
TSDE accumulated up to time T is bounded by O(\sqrt{T}). Here O(.) hides
constants and logarithmic factors. This is the first O(\sqrt{T}) bound on
expected regret of learning in LQ control. By introducing a reinitialization
schedule, we also show that the algorithm is robust to time-varying drift in
model parameters. Numerical simulations are provided to illustrate the
performance of TSDE.
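A minimal sketch of the two stopping rules for the LQ case, assuming cov is the sample covariance matrix of the least-squares estimate of the system parameters (all names are illustrative):

    import numpy as np

    def should_end_episode(cov, det_at_start, ep_len, prev_len):
        # Criterion 1: episode runs at most one step past the previous one.
        if ep_len > prev_len:
            return True
        # Criterion 2: the determinant of the sample covariance matrix has
        # fallen below half its value at the start of the episode.
        if np.linalg.det(cov) < 0.5 * det_at_start:
            return True
        return False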
Context-Dependent Upper-Confidence Bounds for Directed Exploration
Directed exploration strategies for reinforcement learning are critical for
learning an optimal policy in a minimal number of interactions with the
environment. Many algorithms use optimism to direct exploration, either through
visitation estimates or upper confidence bounds, as opposed to data-inefficient
strategies like \epsilon-greedy that use random, undirected exploration. Most
data-efficient exploration methods require significant computation, typically
relying on a learned model to guide exploration. Least-squares methods have the
potential to provide some of the data-efficiency benefits of model-based
approaches -- because they summarize past interactions -- with the computation
closer to that of model-free approaches. In this work, we provide a novel,
computationally efficient, incremental exploration strategy, leveraging this
property of least-squares temporal difference learning (LSTD). We derive upper
confidence bounds on the action-values learned by LSTD, with context-dependent
(or state-dependent) noise variance. Such context-dependent noise focuses
exploration on a subset of variable states, and allows for reduced exploration
in other states. We empirically demonstrate that our algorithm can converge
more quickly than other incremental exploration strategies using confidence
estimates on action-values. Comment: Neural Information Processing Systems 2018
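As a sketch of the idea (not the paper's exact bound), an upper confidence bound on LSTD action-values can combine the point estimate with an elliptical confidence width scaled by the context-dependent noise variance; beta is an illustrative confidence parameter:

    import numpy as np

    def ucb_action(Phi, w, A_inv, noise_var, beta=1.0):
        # Phi: (n_actions, d) features for each action in the current state
        # w: LSTD weight vector; A_inv: inverse of the LSTD design matrix
        # noise_var: context-dependent noise variance estimate(s)
        q = Phi @ w                                    # LSTD action-values
        width = np.sqrt(np.einsum('ad,de,ae->a', Phi, A_inv, Phi))
        return int(np.argmax(q + beta * np.sqrt(noise_var) * width))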
A Tour of Reinforcement Learning: The View from Continuous Control
This manuscript surveys reinforcement learning from the perspective of
optimization and control with a focus on continuous control applications. It
surveys the general formulation, terminology, and typical experimental
implementations of reinforcement learning and reviews competing solution
paradigms. In order to compare the relative merits of various techniques, this
survey presents a case study of the Linear Quadratic Regulator (LQR) with
unknown dynamics, perhaps the simplest and best-studied problem in optimal
control. The manuscript describes how merging techniques from learning theory
and control can provide non-asymptotic characterizations of LQR performance and
shows that these characterizations tend to match experimental behavior. In
turn, when revisiting more complex applications, many of the observed phenomena
in LQR persist. In particular, theory and experiment demonstrate the role and
importance of models and the cost of generality in reinforcement learning
algorithms. This survey concludes with a discussion of some of the challenges
in designing learning systems that safely and reliably interact with complex
and uncertain environments and how tools from reinforcement learning and
control might be combined to approach these challenges. Comment: minor revision with a few clarifying passages and corrected typos
Posterior sampling for reinforcement learning: worst-case regret bounds
We present an algorithm based on posterior sampling (aka Thompson sampling)
that achieves near-optimal worst-case regret bounds when the underlying Markov
Decision Process (MDP) is communicating with a finite, though unknown,
diameter. Our main result is a high probability regret upper bound of
\tilde{O}(DS\sqrt{AT}) for any communicating MDP with S states, A actions
and diameter D. Here, regret compares the total reward achieved by the
algorithm to the total expected reward of an optimal infinite-horizon
undiscounted average reward policy, in time horizon T. This result closely
matches the known lower bound of \Omega(\sqrt{DSAT}). Our techniques involve
proving some novel results about the anti-concentration of Dirichlet
distribution, which may be of independent interest.Comment: This revision fixes an error due to use of some incorrect results
(Lemma C.1 and Lemma C.2) in the earlier version. The regret bounds in this
version are worse by a factor of sqrt(S) as compared to the previous versio
Efficient Exploration through Bayesian Deep Q-Networks
We study reinforcement learning (RL) in high dimensional episodic Markov
decision processes (MDP). We consider value-based RL when the optimal Q-value
is a linear function of d-dimensional state-action feature representation. For
instance, in deep-Q networks (DQN), the Q-value is a linear function of the
feature representation layer (output layer). We propose two algorithms, one
based on optimism, LINUCB, and another based on posterior sampling, LINPSRL. We
guarantee frequentist and Bayesian regret upper bounds of O(d\sqrt{T}) for
these two algorithms, where T is the number of episodes. We extend these
methods to deep RL and propose Bayesian deep Q-networks (BDQN), which uses an
efficient Thompson sampling algorithm for high dimensional RL. We deploy the
double DQN (DDQN) approach, and instead of learning the last layer of the
Q-network using linear regression, we use Bayesian linear regression,
resulting in an approximate posterior over the Q-function. This allows us to
directly incorporate the uncertainty over the Q-function and deploy Thompson
sampling on the learned posterior distribution, resulting in an efficient
exploration/exploitation trade-off. We empirically study the behavior of BDQN
on a wide range of Atari games. Since BDQN carries out more efficient
exploration and exploitation, it is able to reach higher returns
substantially faster than DDQN.
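The last-layer recipe can be sketched as follows: keep the network's feature layers fixed, fit a Gaussian posterior over the per-action output weights by Bayesian linear regression, and act greedily with respect to a Thompson sample. The hyperparameters sigma2 and prior_var are illustrative, and feat/targ stand in for replay data split by action:

    import numpy as np

    def bdqn_act(feat, targ, phi_s, sigma2=1.0, prior_var=10.0):
        # feat[a]: (n_a, d) last-layer features of states where action a was
        # taken; targ[a]: (n_a,) regression targets (e.g. double-DQN targets);
        # phi_s: (d,) last-layer features of the current state.
        d = phi_s.shape[0]
        q = []
        for X, y in zip(feat, targ):
            # Gaussian posterior over w_a under the prior N(0, prior_var * I)
            # and observation noise variance sigma2.
            cov = np.linalg.inv(X.T @ X / sigma2 + np.eye(d) / prior_var)
            mean = cov @ X.T @ y / sigma2
            w_a = np.random.multivariate_normal(mean, cov)  # Thompson sample
            q.append(phi_s @ w_a)
        return int(np.argmax(q))                 # greedy w.r.t. the samples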
On Online Learning in Kernelized Markov Decision Processes
We develop algorithms with low regret for learning episodic Markov decision
processes based on kernel approximation techniques. The algorithms are based on
both the Upper Confidence Bound (UCB) as well as Posterior or Thompson Sampling
(PSRL) philosophies, and work in the general setting of continuous state and
action spaces when the true unknown transition dynamics are assumed to have
smoothness induced by an appropriate Reproducing Kernel Hilbert Space (RKHS). Comment: arXiv admin note: text overlap with arXiv:1805.0805
Model-Free Linear Quadratic Control via Reduction to Expert Prediction
Model-free approaches for reinforcement learning (RL) and continuous control
find policies based only on past states and rewards, without fitting a model of
the system dynamics. They are appealing as they are general purpose and easy to
implement; however, they also come with fewer theoretical guarantees than
model-based RL. In this work, we present a new model-free algorithm for
controlling linear quadratic (LQ) systems, and show that its regret scales as
O(T^{2/3+\xi}) for any small \xi > 0 if the time horizon satisfies T > C^{1/\xi}
for a constant C. The algorithm is based on a reduction of control of Markov
decision processes to an expert prediction problem. In practice, it corresponds
to a variant of policy iteration with forced exploration, where the policy in
each phase is greedy with respect to the average of all previous value
functions. This is the first model-free algorithm for adaptive control of LQ
systems that provably achieves sublinear regret and has a polynomial
computation cost. Empirically, our algorithm dramatically outperforms standard
policy iteration, but performs worse than a model-based approach.
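The policy-iteration view in the last paragraph can be sketched as below, with evaluate and greedy as hypothetical helpers (returning, for example, the quadratic value matrix of a policy and the greedy policy for a value matrix); forced exploration is omitted for brevity:

    def averaged_policy_iteration(evaluate, greedy, pi0, n_phases):
        pi, avg_V = pi0, None
        for k in range(1, n_phases + 1):
            V = evaluate(pi)               # value function of current policy
            # Running average of all value functions seen so far.
            avg_V = V if avg_V is None else avg_V + (V - avg_V) / k
            pi = greedy(avg_V)             # next policy is greedy w.r.t. it
        return pi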