Representation Policy Iteration
This paper addresses a fundamental issue central to approximation methods for
solving large Markov decision processes (MDPs): how to automatically learn the
underlying representation for value function approximation? A novel
theoretically rigorous framework is proposed that automatically generates
geometrically customized orthonormal sets of basis functions, which can be used
with any approximate MDP solver like least squares policy iteration (LSPI). The
key innovation is a coordinate-free representation of value functions, using
the theory of smooth functions on a Riemannian manifold. Hodge theory yields a
constructive method for generating basis functions for approximating value
functions based on the eigenfunctions of the self-adjoint (Laplace-Beltrami)
operator on manifolds. In effect, this approach performs a global Fourier
analysis on the state space graph to approximate value functions, where the
basis functions reflect the large-scale topology of the underlying state space.
A new class of algorithms called Representation Policy Iteration (RPI) are
presented that automatically learn both basis functions and approximately
optimal policies. Illustrative experiments compare the performance of RPI with
that of LSPI using two hand-coded basis functions (RBF and polynomial state
encodings). Comment: Appears in Proceedings of the Twenty-First Conference on
Uncertainty in Artificial Intelligence (UAI 2005).
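As a rough illustration of the basis construction described above, the sketch below builds the graph Laplacian of a toy chain-graph state space, takes its smoothest eigenvectors as Fourier-like basis functions, and fits a value function by least squares. The chain MDP, the basis size k, and the target value function are illustrative assumptions, not the paper's experiments.

```python
import numpy as np

# Illustrative chain-graph state space (an assumption, not the paper's domain).
n = 20
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0   # undirected chain: i <-> i+1

D = np.diag(A.sum(axis=1))
L = D - A                              # combinatorial graph Laplacian

# The smoothest eigenvectors (smallest eigenvalues) act as global,
# Fourier-like basis functions reflecting the chain's topology.
eigvals, eigvecs = np.linalg.eigh(L)
k = 5
Phi = eigvecs[:, :k]                   # n x k orthonormal basis matrix

# Least-squares fit of a smooth target value function in this basis.
V_target = np.linspace(0.0, 1.0, n) ** 2
w, *_ = np.linalg.lstsq(Phi, V_target, rcond=None)
V_hat = Phi @ w
```

With only five basis functions the smooth target is captured closely, which is the point of topology-aware bases: smooth value functions concentrate on the low-frequency eigenvectors.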
Representation Learning on Graphs: A Reinforcement Learning Application
In this work, we study value function approximation in reinforcement learning
(RL) problems with high dimensional state or action spaces via a generalized
version of representation policy iteration (RPI). We consider the limitations
of proto-value functions (PVFs) at accurately approximating the value function
in low dimensions and we highlight the importance of feature learning for an
improved low-dimensional value function approximation. Then, we adopt different
representation learning algorithms on graphs to learn the basis functions that
best represent the value function. We empirically show that node2vec, an
algorithm for scalable feature learning in networks, and the Variational Graph
Auto-Encoder consistently outperform the commonly used smooth proto-value
functions in low-dimensional feature space.
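A minimal sketch of the pipeline described above, assuming a crude random-walk co-occurrence factorization as a stand-in for node2vec or the Variational Graph Auto-Encoder (both out of scope for a few lines of code): learn node features on the state-space graph, then fit the value function in that feature space.

```python
import numpy as np

n = 20
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0    # chain-graph state space (toy)

# Short-random-walk co-occurrence signal (a crude node2vec-like proxy).
P = A / A.sum(axis=1, keepdims=True)   # random-walk transition matrix
C = P + P @ P + P @ P @ P              # 1- to 3-step visit probabilities

# Low-rank factorization of the co-occurrence matrix -> d-dim node features.
U, S, _ = np.linalg.svd(C)
d = 4
X = U[:, :d] * S[:d]                   # n x d embedding matrix

# Fit a value function by least squares on the learned-style features.
V_target = np.cos(np.linspace(0.0, np.pi, n))
w, *_ = np.linalg.lstsq(X, V_target, rcond=None)
V_hat = X @ w
```

The factorization step is the hypothetical part; the papers cited in the abstract learn embeddings by stochastic optimization rather than a single SVD.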
Approximation Benefits of Policy Gradient Methods with Aggregated States
Folklore suggests that policy gradient can be more robust to misspecification
than its relative, approximate policy iteration. This paper studies the case of
state-aggregation, where the state space is partitioned and either the policy
or value function approximation is held constant over partitions. This paper
shows a policy gradient method converges to a policy whose regret per-period is
bounded by ε, the largest difference between two elements of the
state-action value function belonging to a common partition. With the same
representation, both approximate policy iteration and approximate value
iteration can produce policies whose per-period regret scales as
ε/(1 − γ), where γ is a discount factor. Theoretical results
synthesize recent analysis of policy gradient methods with insights of Van Roy
(2006) into the critical role of state-relevance weights in approximate dynamic
programming.
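The quantity ε in the bounds above can be made concrete: it is the largest within-partition spread of state-action values. The Q-table, partition, and discount factor below are made-up numbers for illustration only.

```python
import numpy as np

gamma = 0.9
# Q[s, a] for 6 states and 2 actions (invented values).
Q = np.array([
    [1.00, 0.80],
    [1.05, 0.82],   # states 0-1 share a partition cell
    [0.50, 0.60],
    [0.48, 0.55],   # states 2-3 share a cell
    [0.20, 0.10],
    [0.30, 0.12],   # states 4-5 share a cell
])
cells = [[0, 1], [2, 3], [4, 5]]

# epsilon: max over cells and actions of the within-cell value spread.
eps = max(Q[c, a].max() - Q[c, a].min()
          for c in cells for a in range(Q.shape[1]))

pg_bound = eps                  # policy gradient regret scale (per the abstract)
api_bound = eps / (1 - gamma)   # approximate policy/value iteration scale
```

With γ = 0.9 the 1/(1 − γ) factor inflates the bound tenfold, which is the approximation benefit the title refers to.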
Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition
This paper presents the MAXQ approach to hierarchical reinforcement learning
based on decomposing the target Markov decision process (MDP) into a hierarchy
of smaller MDPs and decomposing the value function of the target MDP into an
additive combination of the value functions of the smaller MDPs. The paper
defines the MAXQ hierarchy, proves formal results on its representational
power, and establishes five conditions for the safe use of state abstractions.
The paper presents an online model-free learning algorithm, MAXQ-Q, and proves
that it converges with probability 1 to a kind of locally-optimal policy known
as a recursively optimal policy, even in the presence of the five kinds of
state abstraction. The paper evaluates the MAXQ representation and MAXQ-Q
through a series of experiments in three domains and shows experimentally that
MAXQ-Q (with state abstractions) converges to a recursively optimal policy much
faster than flat Q learning. The fact that MAXQ learns a representation of the
value function has an important benefit: it makes it possible to compute and
execute an improved, non-hierarchical policy via a procedure similar to the
policy improvement step of policy iteration. The paper demonstrates the
effectiveness of this non-hierarchical execution experimentally. Finally, the
paper concludes with a comparison to related work and a discussion of the
design tradeoffs in hierarchical reinforcement learning. Comment: 63 pages, 15 figures.
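The MAXQ decomposition described above can be sketched in a few lines: the value of invoking subtask a inside parent task p splits additively as Q(p, s, a) = V(a, s) + C(p, s, a), where V is the value earned inside the subtask and C is the completion value of the parent afterwards. The tables and task names below are invented stand-ins, not learned values.

```python
# V[subtask][state]: expected reward accumulated inside the subtask.
V = {
    "navigate": {"s0": 2.0, "s1": 1.5},
    "pickup":   {"s0": 0.5, "s1": 0.5},
}
# C[parent][state][subtask]: completion value of the parent task
# after the subtask terminates.
C = {
    "root": {
        "s0": {"navigate": 3.0, "pickup": 4.0},
        "s1": {"navigate": 2.0, "pickup": 4.5},
    },
}

def q(parent, state, subtask):
    """MAXQ decomposed action value: Q(p, s, a) = V(a, s) + C(p, s, a)."""
    return V[subtask][state] + C[parent][state][subtask]

def greedy_subtask(parent, state):
    """Improvement step in the spirit of policy iteration: argmax over Q."""
    return max(C[parent][state], key=lambda a: q(parent, state, a))
```

The greedy step mirrors the non-hierarchical execution idea in the abstract: because the value function is represented explicitly, an improved policy can be read off by maximizing the recomposed Q.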
Rollout Sampling Approximate Policy Iteration
Several researchers have recently investigated the connection between
reinforcement learning and classification. We are motivated by proposals of
approximate policy iteration schemes without value functions which focus on
policy representation using classifiers and address policy learning as a
supervised learning problem. This paper proposes variants of an improved policy
iteration scheme which addresses the core sampling problem in evaluating a
policy through simulation as a multi-armed bandit machine. The resulting
algorithm offers performance comparable to that of the previous algorithm,
achieved, however, with significantly less computational effort. An order of magnitude
improvement is demonstrated experimentally in two standard reinforcement
learning domains: inverted pendulum and mountain-car. Comment: 18 pages, 2 figures,
to appear in Machine Learning 72(3). Presented at EWRL08, to be presented at ECML 2008.
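The bandit view of rollout allocation described above can be sketched with a UCB rule over candidate actions, spending simulation budget where it is most informative instead of sampling every action equally. The toy reward model stands in for a rollout return estimate and is not the paper's setup.

```python
import math
import random

random.seed(0)
TRUE_MEANS = [0.3, 0.5, 0.7]   # hidden mean rollout return per action (toy)

def rollout(action):
    """Noisy simulated return for one rollout of an action (assumption)."""
    return TRUE_MEANS[action] + random.uniform(-0.1, 0.1)

counts = [0] * 3
sums = [0.0] * 3
for t in range(1, 301):
    if 0 in counts:
        a = counts.index(0)    # pull each arm (action) once first
    else:
        # UCB1: empirical mean plus an exploration bonus.
        a = max(range(3), key=lambda i: sums[i] / counts[i]
                + math.sqrt(2 * math.log(t) / counts[i]))
    sums[a] += rollout(a)
    counts[a] += 1

best = max(range(3), key=lambda i: sums[i] / counts[i])
```

Most of the 300 rollouts end up on the best action, which is the source of the computational savings: clearly dominated actions stop consuming simulation time early.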
Q-Learning and Enhanced Policy Iteration in Discounted Dynamic Programming
We consider the classical finite-state discounted Markovian decision problem, and we introduce a new policy-iteration-like algorithm for finding the optimal Q-factors. Instead of policy evaluation by solving
a linear system of equations, our algorithm requires (possibly inexact) solution of a nonlinear system of equations, involving estimates of state costs as well as Q-factors. This is Bellman's equation for an optimal
stopping problem that can be solved with simple Q-learning iterations, in the case where a lookup table representation is used; it can also be solved with the Q-learning algorithm of Tsitsiklis and Van Roy [TsV99], in the case where feature-based Q-factor approximations are used. In exact/lookup table representation form, our algorithm admits asynchronous and stochastic iterative implementations, in the spirit of
asynchronous/modified policy iteration, with lower overhead and more reliable convergence advantages over existing Q-learning schemes. Furthermore, for large-scale problems, where linear basis function approximations and simulation-based temporal difference implementations are used, our algorithm resolves effectively the inherent difficulties of existing schemes due to inadequate exploration.
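The optimal-stopping flavor of the evaluation step can be illustrated on a generic stopping problem solved by simple lookup-table fixed-point iterations, in the spirit the abstract mentions. This is a toy under invented assumptions (the chain, step cost, and stopping costs), not the paper's exact algorithm.

```python
import numpy as np

gamma = 0.9
n = 5
stop_cost = np.array([3.0, 2.0, 1.5, 1.0, 0.0])  # cost of stopping in state i
step_cost = 0.2                                   # cost of one more step
P = np.roll(np.eye(n), 1, axis=1)                 # deterministic shift i -> i+1
P[-1, :] = 0.0
P[-1, -1] = 1.0                                   # last state is absorbing

# Fixed-point (Q-learning-style) iteration on the stopping Bellman equation:
#   Q(i) = step_cost + gamma * sum_j P[i, j] * min(stop_cost[j], Q(j))
Q = np.zeros(n)
for _ in range(500):
    Q = step_cost + gamma * P @ np.minimum(stop_cost, Q)

# Stop wherever stopping is no costlier than continuing.
policy_stops = stop_cost <= Q
```

The min inside the update is what makes the system nonlinear, in contrast to the linear system solved in standard policy evaluation; with a lookup table the iteration is still a simple contraction.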