12,273 research outputs found
Should one compute the Temporal Difference fix point or minimize the Bellman Residual? The unified oblique projection view
We investigate projection methods, for evaluating a linear approximation of
the value function of a policy in a Markov Decision Process context. We
consider two popular approaches, the one-step Temporal Difference fix-point
computation (TD(0)) and the Bellman Residual (BR) minimization. We describe
examples, where each method outperforms the other. We highlight a simple
relation between the objective function they minimize, and show that while BR
enjoys a performance guarantee, TD(0) does not in general. We then propose a
unified view in terms of oblique projections of the Bellman equation, which
substantially simplifies and extends the characterization of (schoknecht,2002)
and the recent analysis of (Yu & Bertsekas, 2008). Eventually, we describe some
simulations that suggest that if the TD(0) solution is usually slightly better
than the BR solution, its inherent numerical instability makes it very bad in
some cases, and thus worse on average
Approximate Policy Iteration Schemes: A Comparison
We consider the infinite-horizon discounted optimal control problem
formalized by Markov Decision Processes. We focus on several approximate
variations of the Policy Iteration algorithm: Approximate Policy Iteration,
Conservative Policy Iteration (CPI), a natural adaptation of the Policy Search
by Dynamic Programming algorithm to the infinite-horizon case (PSDP),
and the recently proposed Non-Stationary Policy iteration (NSPI(m)). For all
algorithms, we describe performance bounds, and make a comparison by paying a
particular attention to the concentrability constants involved, the number of
iterations and the memory required. Our analysis highlights the following
points: 1) The performance guarantee of CPI can be arbitrarily better than that
of API/API(), but this comes at the cost of a relative---exponential in
---increase of the number of iterations. 2) PSDP
enjoys the best of both worlds: its performance guarantee is similar to that of
CPI, but within a number of iterations similar to that of API. 3) Contrary to
API that requires a constant memory, the memory needed by CPI and PSDP
is proportional to their number of iterations, which may be problematic when
the discount factor is close to 1 or the approximation error
is close to ; we show that the NSPI(m) algorithm allows to make
an overall trade-off between memory and performance. Simulations with these
schemes confirm our analysis.Comment: ICML (2014
On the Performance Bounds of some Policy Search Dynamic Programming Algorithms
We consider the infinite-horizon discounted optimal control problem
formalized by Markov Decision Processes. We focus on Policy Search algorithms,
that compute an approximately optimal policy by following the standard Policy
Iteration (PI) scheme via an -approximate greedy operator (Kakade and Langford,
2002; Lazaric et al., 2010). We describe existing and a few new performance
bounds for Direct Policy Iteration (DPI) (Lagoudakis and Parr, 2003; Fern et
al., 2006; Lazaric et al., 2010) and Conservative Policy Iteration (CPI)
(Kakade and Langford, 2002). By paying a particular attention to the
concentrability constants involved in such guarantees, we notably argue that
the guarantee of CPI is much better than that of DPI, but this comes at the
cost of a relative--exponential in -- increase of time
complexity. We then describe an algorithm, Non-Stationary Direct Policy
Iteration (NSDPI), that can either be seen as 1) a variation of Policy Search
by Dynamic Programming by Bagnell et al. (2003) to the infinite horizon
situation or 2) a simplified version of the Non-Stationary PI with growing
period of Scherrer and Lesner (2012). We provide an analysis of this algorithm,
that shows in particular that it enjoys the best of both worlds: its
performance guarantee is similar to that of CPI, but within a time complexity
similar to that of DPI
Policy Search: Any Local Optimum Enjoys a Global Performance Guarantee
Local Policy Search is a popular reinforcement learning approach for handling
large state spaces. Formally, it searches locally in a paramet erized policy
space in order to maximize the associated value function averaged over some
predefined distribution. It is probably commonly b elieved that the best one
can hope in general from such an approach is to get a local optimum of this
criterion. In this article, we show th e following surprising result:
\emph{any} (approximate) \emph{local optimum} enjoys a \emph{global performance
guarantee}. We compare this g uarantee with the one that is satisfied by Direct
Policy Iteration, an approximate dynamic programming algorithm that does some
form of Poli cy Search: if the approximation error of Local Policy Search may
generally be bigger (because local search requires to consider a space of s
tochastic policies), we argue that the concentrability coefficient that appears
in the performance bound is much nicer. Finally, we discuss several practical
and theoretical consequences of our analysis
- …