Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path
We consider the problem of finding a near-optimal policy in continuous-space, discounted Markovian Decision Problems given the trajectory of some behaviour policy. We study the policy iteration algorithm where in successive iterations the action-value functions of the intermediate policies are obtained by picking a function from some fixed function set (chosen by the user) that minimizes an unbiased finite-sample approximation to a novel loss function that upper-bounds the unmodified Bellman-residual criterion. The main result is a finite-sample, high-probability bound on the performance of the resulting policy that depends on the mixing rate of the trajectory, the capacity of the function set as measured by a novel capacity concept that we call the VC-crossing dimension, the approximation power of the function set, and the discounted-average concentrability of the future-state distribution. To the best of our knowledge, this is the first theoretical reinforcement learning result for off-policy control learning over continuous state-spaces using a single trajectory.
Is the Bellman residual a bad proxy?
This paper aims at theoretically and empirically comparing two standard
optimization criteria for Reinforcement Learning: i) maximization of the mean
value and ii) minimization of the Bellman residual. For that purpose, we place
ourselves in the framework of policy search algorithms, that are usually
designed to maximize the mean value, and derive a method that minimizes the
residual over policies. A theoretical analysis
shows how good this proxy is to policy optimization, and notably that it is
better than its value-based counterpart. We also propose experiments on
randomly generated generic Markov decision processes, specifically designed for
studying the influence of the involved concentrability coefficient. They show
that the Bellman residual is generally a bad proxy to policy optimization and
that directly maximizing the mean value is much better, despite the current
lack of deep theoretical analysis. This might seem obvious, as directly
addressing the problem of interest is usually better, but given the prevalence
of (projected) Bellman residual minimization in value-based reinforcement
learning, we believe that this question is worth considering. Comment: Final NIPS 2017 version (title, among other things, changed
Q-learning with Nearest Neighbors
We consider model-free reinforcement learning for infinite-horizon discounted
Markov Decision Processes (MDPs) with a continuous state space and unknown
transition kernel, when only a single sample path under an arbitrary policy of
the system is available. We consider the Nearest Neighbor Q-Learning (NNQL)
algorithm to learn the optimal Q function using a nearest neighbor regression
method. As the main contribution, we provide a tight finite-sample analysis of
the convergence rate. In particular, for MDPs with a $d$-dimensional state
space and discount factor $\gamma \in (0,1)$, given an arbitrary sample
path with "covering time" $L$, we establish that the algorithm is guaranteed
to output an $\varepsilon$-accurate estimate of the optimal Q-function using
$\tilde{O}\big(L/(\varepsilon^3(1-\gamma)^7)\big)$ samples. For instance, for a
well-behaved MDP, the covering time of the sample path under the purely random
policy scales as $\tilde{O}\big(1/\varepsilon^d\big)$, so the sample
complexity scales as $\tilde{O}\big(1/\varepsilon^{d+3}\big)$. Indeed, we
establish a lower bound that argues that the dependence of $\tilde{\Omega}\big(1/\varepsilon^{d+2}\big)$ is necessary. Comment: Accepted to NIPS 201
On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes
We consider infinite-horizon stationary $\gamma$-discounted Markov Decision
Processes, for which it is known that there exists a stationary optimal policy.
Using Value and Policy Iteration with some error $\epsilon$ at each iteration,
it is well-known that one can compute stationary policies that are
$\frac{2\gamma}{(1-\gamma)^2}\epsilon$-optimal. After arguing that this
guarantee is tight, we develop variations of Value and Policy Iteration for
computing non-stationary policies that can be up to
$\frac{2\gamma}{1-\gamma}\epsilon$-optimal, which constitutes a significant
improvement in the usual situation when $\gamma$ is close to 1. Surprisingly,
this shows that the problem of "computing near-optimal non-stationary policies"
is much simpler than that of "computing near-optimal stationary policies".
Batch Policy Learning under Constraints
When learning policies for real-world domains, two important questions arise:
(i) how to efficiently use pre-collected off-policy, non-optimal behavior data;
and (ii) how to mediate among different competing objectives and constraints.
We thus study the problem of batch policy learning under multiple constraints,
and offer a systematic solution. We first propose a flexible meta-algorithm
that admits any batch reinforcement learning and online learning procedure as
subroutines. We then present a specific algorithmic instantiation and provide
performance guarantees for the main objective and all constraints. To certify
constraint satisfaction, we propose a new and simple method for off-policy
policy evaluation (OPE) and derive PAC-style bounds. Our algorithm achieves
strong empirical results in different domains, including in a challenging
problem of simulated car driving subject to multiple constraints such as lane
keeping and smooth driving. We also show experimentally that our OPE method
outperforms other popular OPE techniques on a standalone basis, especially in a
high-dimensional setting.
Should one compute the Temporal Difference fix point or minimize the Bellman Residual? The unified oblique projection view
We investigate projection methods for evaluating a linear approximation of
the value function of a policy in a Markov Decision Process context. We
consider two popular approaches, the one-step Temporal Difference fix-point
computation (TD(0)) and the Bellman Residual (BR) minimization. We describe
examples where each method outperforms the other. We highlight a simple
relation between the objective functions they minimize, and show that while BR
enjoys a performance guarantee, TD(0) does not in general. We then propose a
unified view in terms of oblique projections of the Bellman equation, which
substantially simplifies and extends the characterization of (Schoknecht, 2002)
and the recent analysis of (Yu & Bertsekas, 2008). Finally, we describe some
simulations that suggest that, although the TD(0) solution is usually slightly better
than the BR solution, its inherent numerical instability makes it very bad in
some cases, and thus worse on average.
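To make the TD(0) versus BR comparison concrete, here is a minimal sketch for a linear approximation with a single feature, where both solutions have a closed form. The two-state chain, the feature values, and all function names are illustrative assumptions, not taken from the paper. With psi = phi - gamma * P * phi, TD(0) makes the Bellman residual orthogonal to phi, while BR makes it orthogonal to psi.

```python
def mat_vec(P, v):
    """Multiply a row-stochastic matrix (list of rows) by a vector."""
    return [sum(p * x for p, x in zip(row, v)) for row in P]


def weighted_dot(u, v, w):
    """Inner product of u and v weighted by the state distribution w."""
    return sum(wi * ui * vi for ui, vi, wi in zip(u, v, w))


def td0_and_br(P, r, phi, weights, gamma):
    """Return (theta_td, theta_br) for the approximation V = theta * phi."""
    p_phi = mat_vec(P, phi)
    psi = [f - gamma * pf for f, pf in zip(phi, p_phi)]
    # TD(0) fix point: <phi, theta * psi - r> = 0.
    # The denominator can be close to zero, which is one source of instability.
    theta_td = weighted_dot(phi, r, weights) / weighted_dot(phi, psi, weights)
    # BR minimizer: least squares on ||theta * psi - r||, so <psi, theta * psi - r> = 0.
    theta_br = weighted_dot(psi, r, weights) / weighted_dot(psi, psi, weights)
    return theta_td, theta_br


def bellman_residual_norm(P, r, phi, weights, gamma, theta):
    """Weighted L2 norm of the Bellman residual of V = theta * phi."""
    p_phi = mat_vec(P, phi)
    res = [theta * (f - gamma * pf) - ri for f, pf, ri in zip(phi, p_phi, r)]
    return weighted_dot(res, res, weights) ** 0.5


# Hypothetical two-state chain used only for illustration.
P = [[0.9, 0.1], [0.2, 0.8]]
r = [1.0, 0.0]
phi = [1.0, 2.0]
weights = [0.5, 0.5]
gamma = 0.9

if __name__ == "__main__":
    theta_td, theta_br = td0_and_br(P, r, phi, weights, gamma)
    print("TD(0):", theta_td, bellman_residual_norm(P, r, phi, weights, gamma, theta_td))
    print("BR:   ", theta_br, bellman_residual_norm(P, r, phi, weights, gamma, theta_br))
```

By construction the BR solution never has a larger Bellman residual than the TD(0) solution. Which one is closer to the true value function depends on the problem, which is the point of the comparison above.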
Finite-Sample Analysis of Bellman Residual Minimization
We consider the Bellman residual minimization approach for solving discounted Markov decision problems, where we assume that a generative model of the dynamics and rewards is available. At each policy iteration step, an approximation of the value function for the current policy is obtained by minimizing an empirical Bellman residual defined on a set of n states drawn i.i.d. from a distribution, the immediate rewards, and the next states sampled from the model. Our main result is a generalization bound for the Bellman residual in linear approximation spaces. In particular, we prove that the empirical Bellman residual approaches the true (quadratic) Bellman residual with a rate of order O(1/sqrt(n)). This result implies that minimizing the empirical residual is indeed a sound approach for the minimization of the true Bellman residual which guarantees a good approximation of the value function for each policy. Finally, we derive performance bounds for the resulting approximate policy iteration algorithm in terms of the number of samples n and a measure of how well the function space is able to approximate the sequence of value functions.
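As a rough illustration of estimating a quadratic Bellman residual from a generative model, the sketch below uses the standard double-sample device: two independent next states are drawn per sampled state, and the two one-sample residuals are multiplied. This is an assumption made for the example and may differ from the paper's exact estimator. The chain, the candidate value function f, and all names are hypothetical.

```python
import random


def true_bellman_residual(P, r, f, mu, gamma):
    """Exact quadratic Bellman residual of f under the state distribution mu."""
    total = 0.0
    for x, mu_x in enumerate(mu):
        expected_next = sum(p * f[y] for y, p in enumerate(P[x]))
        total += mu_x * (f[x] - r[x] - gamma * expected_next) ** 2
    return total


def empirical_bellman_residual(P, r, f, mu, gamma, n, rng):
    """Double-sample estimate from n states drawn i.i.d. from mu.

    For each state, two independent next states are drawn from the model;
    the product of the two one-sample residuals is an unbiased estimate of
    the squared expected residual at that state.
    """
    states = range(len(mu))
    total = 0.0
    for _ in range(n):
        x = rng.choices(states, weights=mu)[0]
        y1 = rng.choices(states, weights=P[x])[0]
        y2 = rng.choices(states, weights=P[x])[0]
        total += (f[x] - r[x] - gamma * f[y1]) * (f[x] - r[x] - gamma * f[y2])
    return total / n


if __name__ == "__main__":
    # Hypothetical two-state model and candidate value function.
    P = [[0.9, 0.1], [0.2, 0.8]]
    r = [1.0, 0.0]
    f = [1.0, 2.0]
    mu = [0.5, 0.5]
    gamma = 0.9
    rng = random.Random(0)
    exact = true_bellman_residual(P, r, f, mu, gamma)
    for n in (100, 1000, 10000):
        est = empirical_bellman_residual(P, r, f, mu, gamma, n, rng)
        print(n, abs(est - exact))
```

Squaring a single-sample residual instead would add a variance term and bias the estimate upward, which is why two next-state samples are used here.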