Incremental Truncated LSTD
Balancing between computational efficiency and sample efficiency is an
important goal in reinforcement learning. Temporal difference (TD) learning
algorithms stochastically update the value function, with a linear time
complexity in the number of features, whereas least-squares temporal difference
(LSTD) algorithms are sample efficient but can be quadratic in the number of
features. In this work, we develop an efficient incremental low-rank
LSTD({\lambda}) algorithm that progresses towards the goal of better balancing
computation and sample efficiency. The algorithm reduces the computation and
storage complexity to the number of features times the chosen rank parameter
while summarizing past samples efficiently to nearly obtain the sample
complexity of LSTD. We derive a simulation bound on the solution given by
truncated low-rank approximation, illustrating a bias-variance trade-off
dependent on the choice of rank. We demonstrate that the algorithm effectively
balances computational complexity and sample efficiency for policy evaluation
in a benchmark task and a high-dimensional energy allocation domain.
Comment: Accepted to IJCAI 2016
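For intuition, here is a deliberately non-incremental sketch of the rank-truncated solution (the transition format and the batch SVD are our assumptions; the paper's algorithm instead maintains the truncation incrementally, at cost proportional to the number of features times the rank):

```python
import numpy as np

def truncated_lstd(transitions, d, rank, gamma=0.99, lam=0.0):
    """Illustrative rank-truncated LSTD(lambda); not the incremental version.

    transitions: iterable of (phi, reward, phi_next, done) with phi in R^d.
    """
    A = np.zeros((d, d))
    b = np.zeros(d)
    z = np.zeros(d)                                  # eligibility trace
    for phi, r, phi_next, done in transitions:
        z = gamma * lam * z + phi
        A += np.outer(z, phi - (0.0 if done else gamma) * phi_next)
        b += r * z
        if done:
            z = np.zeros(d)
    U, s, Vt = np.linalg.svd(A)                      # batch SVD, for clarity only
    U, s, Vt = U[:, :rank], s[:rank], Vt[:rank]      # keep top-r directions
    # solve A_r w = b via the pseudo-inverse of the truncated factorization
    return Vt.T @ ((U.T @ b) / s)
```

The choice of `rank` realizes the bias-variance trade-off mentioned above: small ranks discard information (bias) but damp noise accumulated in A (variance).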
Dyna-Style Planning with Linear Function Approximation and Prioritized Sweeping
We consider the problem of efficiently learning optimal control policies and
value functions over large state spaces in an online setting in which estimates
must be available after each interaction with the world. This paper develops an
explicitly model-based approach extending the Dyna architecture to linear
function approximation. Dyna-style planning proceeds by generating imaginary
experience from the world model and then applying model-free reinforcement
learning algorithms to the imagined state transitions. Our main results are to
prove that linear Dyna-style planning converges to a unique solution
independent of the generating distribution, under natural conditions. In the
policy evaluation setting, we prove that the limit point is the least-squares
(LSTD) solution. An implication of our results is that prioritized-sweeping can
be soundly extended to the linear approximation case, backing up to preceding
features rather than to preceding states. We introduce two versions of
prioritized sweeping with linear Dyna and briefly illustrate their performance
empirically on the Mountain Car and Boyan Chain problems.
Comment: Appears in Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence (UAI 2008)
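A hedged sketch of what such a feature-level backup could look like, given a learned linear model (phi' ~ F phi, r ~ f . phi); the queue initialisation and priority rule here are our assumptions rather than the paper's exact scheme:

```python
import numpy as np
import heapq

def linear_dyna_planning(F, f, theta, gamma=0.95, alpha=0.1, steps=50):
    """Sketch of prioritized Dyna-style planning over features.

    F, f: learned linear model (phi' ~ F @ phi, r ~ f @ phi).
    Each planning step backs up one feature (a unit basis vector),
    i.e. preceding features rather than preceding states.
    """
    d = len(theta)
    # priority = magnitude of the change a backup of the feature would cause
    pq = [(-abs(f[i]), i) for i in range(d)]
    heapq.heapify(pq)
    for _ in range(steps):
        if not pq:
            break
        _, i = heapq.heappop(pq)
        phi = np.zeros(d); phi[i] = 1.0            # imaginary feature vector
        delta = f[i] + gamma * (F @ phi) @ theta - theta[i]
        theta[i] += alpha * delta                  # model-free TD backup
        # re-prioritize predecessor features j that predict feature i
        for j in np.nonzero(F[i])[0]:
            heapq.heappush(pq, (-abs(delta * F[i, j]), j))
    return theta
```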
Least Squares Policy Iteration with Instrumental Variables vs. Direct Policy Search: Comparison Against Optimal Benchmarks Using Energy Storage
This paper studies approximate policy iteration (API) methods which use
least-squares Bellman error minimization for policy evaluation. We address
several of its enhancements, namely, Bellman error minimization using
instrumental variables, least-squares projected Bellman error minimization, and
projected Bellman error minimization using instrumental variables. We prove
that for a general discrete-time stochastic control problem, Bellman error
minimization using instrumental variables is equivalent to both variants of
projected Bellman error minimization. An alternative to these API methods is
direct policy search based on knowledge gradient. The practical performance of
these three approximate dynamic programming methods is then investigated in
the context of an application in energy storage, integrated with an
intermittent wind energy supply to fully serve a stochastic time-varying
electricity demand. We create a library of test problems using real-world data
and apply value iteration to find their optimal policies. These benchmarks are
then used to compare the developed policies. Our analysis indicates that API
with instrumental variables Bellman error minimization markedly outperforms
API with least-squares Bellman error minimization. However, these approaches
underperform our direct policy search implementation.
Comment: 37 pages, 9 figures
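The instrumental-variables estimator is easy to state: ordinary least squares on the Bellman equation regresses r on (phi - gamma * phi'), which is biased because the noisy next-state features are correlated with the residual; using phi itself as the instrument removes that bias and, per the equivalence result above, recovers the projected-Bellman (LSTD-style) solution. A minimal sketch, with Phi, PhiNext, r as assumed sample matrices:

```python
import numpy as np

def bellman_ols(Phi, PhiNext, r, gamma=0.99):
    """Plain least-squares Bellman error minimization (biased under noise)."""
    X = Phi - gamma * PhiNext
    return np.linalg.lstsq(X, r, rcond=None)[0]

def bellman_iv(Phi, PhiNext, r, gamma=0.99):
    """Bellman error minimization with Phi as the instrument; this coincides
    with projected Bellman error minimization, as the paper proves."""
    X = Phi - gamma * PhiNext
    return np.linalg.solve(Phi.T @ X, Phi.T @ r)
```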
Reinforcement Learning
Reinforcement learning (RL) is a general framework for adaptive control,
which has proven to be efficient in many domains, e.g., board games, video
games or autonomous vehicles. In such problems, an agent faces a sequential
decision-making problem where, at every time step, it observes its state,
performs an action, receives a reward and moves to a new state. An RL agent
learns by trial and error a good policy (or controller) based on observations
and numeric reward feedback on the previously performed action. In this
chapter, we present the basic framework of RL and recall the two main families
of approaches that have been developed to learn a good policy. The first one,
which is value-based, consists in estimating the value function of an optimal
policy, from which a policy can be recovered, while the other, called policy
search, directly works in a policy space. Actor-critic methods can be seen as a
policy search technique where the policy value that is learned guides the
policy improvement. Besides, we give an overview of some extensions of the
standard RL framework, notably when risk-averse behavior needs to be taken into
account or when rewards are not available or not known.
Comment: Chapter in "A Guided Tour of Artificial Intelligence Research", Springer
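As a toy illustration of the value-based family described here (a tabular sketch under our own assumptions, not an excerpt from the chapter): learn an action-value estimate by trial and error, then read a policy off it:

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One value-based trial-and-error update on a tabular Q estimate."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

def greedy_policy(Q):
    """Recover a policy from the learned value function."""
    return Q.argmax(axis=1)
```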
Shallow Updates for Deep Reinforcement Learning
Deep reinforcement learning (DRL) methods such as the Deep Q-Network (DQN)
have achieved state-of-the-art results in a variety of challenging,
high-dimensional domains. This success is mainly attributed to the power of
deep neural networks to learn rich domain representations for approximating the
value function or policy. Batch reinforcement learning methods with linear
representations, on the other hand, are more stable and require less
hyperparameter tuning. Yet, substantial feature engineering is necessary to achieve
good results. In this work we propose a hybrid approach -- the Least Squares
Deep Q-Network (LS-DQN), which combines rich feature representations learned by
a DRL algorithm with the stability of a linear least squares method. We do this
by periodically re-training the last hidden layer of a DRL network with a batch
least squares update. Key to our approach is a Bayesian regularization term for
the least squares update, which prevents over-fitting to the more recent data.
We test LS-DQN on five Atari games and demonstrate significant improvement
over vanilla DQN and Double-DQN. We also investigate the reasons for the
superior performance of our method. Interestingly, we find that the
performance improvement can be attributed to the large batch size used by the
LS method when optimizing the last layer.
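A minimal sketch of the described periodic update for a single action head, rendering the Bayesian regularization as a quadratic penalty centered at the network's current last-layer weights (the variable names and the exact penalty form are our assumptions):

```python
import numpy as np

def ls_update_last_layer(features, targets, w_drl, reg=1.0):
    """Sketch of an LS-DQN-style batch update for one action's head.

    features: (n, k) last-hidden-layer activations over a large batch
    targets:  (n,)  bootstrapped Q-targets r + gamma * max_a' Q(s', a')
    w_drl:    (k,)  current last-layer weights, used as the prior mean
    Solves argmin_w ||features @ w - targets||^2 + reg * ||w - w_drl||^2,
    so the regularizer prevents over-fitting to the most recent data.
    """
    k = features.shape[1]
    A = features.T @ features + reg * np.eye(k)
    b = features.T @ targets + reg * w_drl
    return np.linalg.solve(A, b)
```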
Concentration bounds for temporal difference learning with linear function approximation: The case of batch data and uniform sampling
We propose a stochastic approximation (SA) based method with randomization of
samples for policy evaluation using the least squares temporal difference
(LSTD) algorithm. Our proposed scheme is equivalent to running regular temporal
difference learning with linear function approximation, albeit with samples
picked uniformly from a given dataset. Our method results in an O(d)
improvement in complexity in comparison to LSTD, where d is the dimension of
the data. We provide non-asymptotic bounds for our proposed method, both in
high probability and in expectation, under the assumption that the matrix
underlying the LSTD solution is positive definite. The latter assumption can be
easily satisfied for the pathwise LSTD variant proposed in [23]. Moreover, we
also establish that using our method in place of LSTD does not impact the rate
of convergence of the approximate value function to the true value function.
These rate results coupled with the low computational complexity of our method
make it attractive for implementation in big data settings, where d is large.
A similar low-complexity alternative for least squares regression is well-known
as the stochastic gradient descent (SGD) algorithm. We provide finite-time
bounds for SGD. We demonstrate the practicality of our method as an efficient
alternative for pathwise LSTD empirically by combining it with the least
squares policy iteration (LSPI) algorithm in a traffic signal control
application. We also conduct another set of experiments that combines the SA
based low-complexity variant for least squares regression with the LinUCB
algorithm for contextual bandits, using the large scale news recommendation
dataset from Yahoo.
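Each iterate of the proposed scheme touches only one transition, so the per-step cost is O(d). A sketch with an assumed 1/t step-size schedule and an assumed transition format:

```python
import numpy as np

def batch_td_uniform(transitions, d, n_iters=10_000, gamma=0.95, seed=0):
    """TD(0) on samples drawn uniformly from a fixed dataset: the
    O(d)-per-step stochastic approximation alternative to batch LSTD."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)
    for t in range(1, n_iters + 1):
        phi, r, phi_next = transitions[rng.integers(len(transitions))]
        delta = r + gamma * phi_next @ theta - phi @ theta
        theta += (1.0 / t) * delta * phi       # diminishing step size
    return theta
```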
Randomised Bayesian Least-Squares Policy Iteration
We introduce Bayesian least-squares policy iteration (BLSPI), an off-policy,
model-free, policy iteration algorithm that uses the Bayesian least-squares
temporal-difference (BLSTD) learning algorithm to evaluate policies. We also
propose an online variant of BLSPI, called randomised BLSPI (RBLSPI), which
improves its policy based on an incomplete policy evaluation step. In the
online setting, the exploration-exploitation dilemma must be addressed, as the
agent tries to discover the optimal policy using samples it collects itself. RBLSPI
exploits the advantage of BLSTD to quantify our uncertainty about the value
function. Inspired by Thompson sampling, RBLSPI first samples a value function
from a posterior distribution over value functions, and then selects actions
based on the sampled value function. The effectiveness and the exploration
abilities of RBLSPI are demonstrated experimentally in several environments.
Comment: European Workshop on Reinforcement Learning 14, October 2018, Lille, France
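The Thompson-sampling step can be sketched in a few lines, assuming BLSTD yields a Gaussian posterior N(mean_w, cov_w) over value-function weights (all names here are illustrative):

```python
import numpy as np

def thompson_action(mean_w, cov_w, phi_sa, rng):
    """RBLSPI-style action selection: sample a value function from the
    posterior, then act greedily under the sampled value function.

    phi_sa: (n_actions, d) features of each candidate (state, action) pair.
    """
    w = rng.multivariate_normal(mean_w, cov_w)   # posterior sample
    return int(np.argmax(phi_sa @ w))
```

How often w is resampled (per step or per episode) governs how persistent the exploration is, as in Thompson sampling generally.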
Lambda-Policy Iteration: A Review and a New Implementation
In this paper we discuss \lambda-policy iteration, a method for exact and
approximate dynamic programming. It is intermediate between the classical value
iteration (VI) and policy iteration (PI) methods, and it is closely related to
optimistic (also known as modified) PI, whereby each policy evaluation is done
approximately, using a finite number of value iterations. We review the theory of the method
and associated questions of bias and exploration arising in simulation-based
cost function approximation. We then discuss various implementations, which
offer advantages over well-established PI methods that use LSPE(\lambda),
LSTD(\lambda), or TD(\lambda) for policy evaluation with cost function approximation.
One of these implementations is based on a new simulation scheme, called
geometric sampling, which uses multiple short trajectories rather than a single
infinitely long trajectory.
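Under stated assumptions (env_step and reset are hypothetical environment hooks, and p sets the mean trajectory length), the geometric sampling scheme can be sketched as follows:

```python
import numpy as np

def geometric_sampling(reset, env_step, n_trajectories, p, seed=0):
    """Sketch of geometric sampling: many short trajectories whose
    lengths are Geometric(p), instead of one long trajectory."""
    rng = np.random.default_rng(seed)
    trajectories = []
    for _ in range(n_trajectories):
        s = reset()
        traj = []
        for _ in range(rng.geometric(p)):      # random horizon per trajectory
            s_next, r = env_step(s)
            traj.append((s, r, s_next))
            s = s_next
        trajectories.append(traj)
    return trajectories
```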
Vision-based reinforcement learning using approximate policy iteration
A major issue for reinforcement learning (RL) applied to robotics is the time required to learn a new skill. While RL has been used to learn mobile robot control in many simulated domains, applications involving learning on real
robots are still relatively rare. In this paper, the Least-Squares Policy Iteration (LSPI) reinforcement learning algorithm and a new model-based algorithm, Least-Squares Policy Iteration with Prioritized Sweeping (LSPI+), are implemented on a mobile robot to acquire new skills quickly and efficiently. LSPI+ combines the benefits of LSPI and prioritized sweeping, which uses all previous experience to focus the computational effort on the most "interesting" or dynamic parts of the state space.
The proposed algorithms are tested on a household vacuum
cleaner robot for learning a docking task using vision as the only sensor modality. In experiments these algorithms are compared to other model-based and model-free RL algorithms. The results show that the number of trials required to learn the docking task is significantly reduced using LSPI compared to the other RL algorithms investigated, and that LSPI+ further improves on the performance of LSPI.
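For reference, the LSPI core that LSPI+ builds on fits in a few lines; this is a generic sketch (phi and the sample format are our assumptions), iterating LSTDQ evaluation with greedy policy improvement:

```python
import numpy as np

def lspi(samples, phi, n_actions, d, gamma=0.99, n_iters=20):
    """Generic LSPI sketch: alternate LSTDQ evaluation and greedy improvement.

    samples: list of (s, a, r, s_next); phi(s, a) -> feature vector in R^d.
    """
    w = np.zeros(d)
    for _ in range(n_iters):
        A = 1e-6 * np.eye(d)                   # small ridge keeps A invertible
        b = np.zeros(d)
        for s, a, r, s_next in samples:
            # greedy action of the current policy at the next state
            a_next = max(range(n_actions), key=lambda ap: phi(s_next, ap) @ w)
            x = phi(s, a)
            A += np.outer(x, x - gamma * phi(s_next, a_next))
            b += r * x
        w = np.linalg.solve(A, b)              # LSTDQ solution for this policy
    return w
```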
Policy Evaluation with Variance Related Risk Criteria in Markov Decision Processes
In this paper we extend temporal difference policy evaluation algorithms to
performance criteria that include the variance of the cumulative reward. Such
criteria are useful for risk management, and are important in domains such as
finance and process control. We propose both TD(0) and LSTD(lambda) variants
with linear function approximation, prove their convergence, and demonstrate
their utility in a 4-dimensional continuous state space problem.
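A minimal sketch of such a TD(0) variant (names and step size are our assumptions): run coupled updates for the value J and the second moment M of the cumulative reward, using the second-moment Bellman relation M(s) = E[r^2 + 2*gamma*r*J(s') + gamma^2*M(s')], so that Var(s) ~ M(s) - J(s)^2:

```python
import numpy as np

def variance_td0(transitions, d, alpha=0.05, gamma=0.99):
    """TD(0) policy evaluation extended to the second moment of the
    cumulative reward, enabling variance-related risk criteria."""
    wJ = np.zeros(d)   # value-function weights
    wM = np.zeros(d)   # second-moment weights
    for phi, r, phi_next in transitions:
        J, J_next = phi @ wJ, phi_next @ wJ
        M, M_next = phi @ wM, phi_next @ wM
        wJ += alpha * (r + gamma * J_next - J) * phi
        # second-moment target: r^2 + 2*gamma*r*J' + gamma^2 * M'
        wM += alpha * (r**2 + 2*gamma*r*J_next + gamma**2*M_next - M) * phi
    return wJ, wM      # variance estimate at phi: phi@wM - (phi@wJ)**2
```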