
    Improved and generalized upper bounds on the complexity of policy iteration

    Markov decision processes ; Dynamic Programming ; Analysis of Algorithms

    Given a Markov Decision Process (MDP) with $n$ states and a total number $m$ of actions, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal $\gamma$-discounted policy. We consider two variations of PI: Howard's PI, which changes the actions in all states with a positive advantage, and Simplex-PI, which only changes the action in the state with maximal advantage. We show that Howard's PI terminates after at most $O\left(\frac{m}{1-\gamma}\log\left(\frac{1}{1-\gamma}\right)\right)$ iterations, improving by a factor $O(\log n)$ a result by Hansen et al., while Simplex-PI terminates after at most $O\left(\frac{nm}{1-\gamma}\log\left(\frac{1}{1-\gamma}\right)\right)$ iterations, improving by a factor $O(\log n)$ a result by Ye. Under some structural properties of the MDP, we then consider bounds that are independent of the discount factor $\gamma$: the quantities of interest are bounds $\tau_t$ and $\tau_r$, uniform over all states and policies, respectively on the \emph{expected time spent in transient states} and \emph{the inverse of the frequency of visits in recurrent states}, given that the process starts from the uniform distribution. Indeed, we show that Simplex-PI terminates after at most $\tilde O\left(n^3 m^2 \tau_t \tau_r\right)$ iterations. This extends a recent result for deterministic MDPs by Post & Ye, in which $\tau_t \le 1$ and $\tau_r \le n$; in particular it shows that Simplex-PI is strongly polynomial for a much larger class of MDPs. We explain why similar results seem hard to derive for Howard's PI. Finally, under the additional (restrictive) assumption that the state space is partitioned into two sets of states that are respectively transient and recurrent for all policies, we show that both Howard's PI and Simplex-PI terminate after at most $\tilde O(m(n^2\tau_t + n\tau_r))$ iterations.
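
    The abstract contrasts two policy-improvement rules. As a rough illustration (not the paper's code), the sketch below runs both rules on a small randomly generated discounted MDP; the array shapes, the tolerance, and the helper names (evaluate_policy, policy_iteration) are assumptions made for this example.

```python
# Minimal sketch contrasting Howard's PI and Simplex-PI on a small
# gamma-discounted MDP. Illustrative only; names and shapes are assumed.
import numpy as np

def evaluate_policy(P, R, pi, gamma):
    """Exact policy evaluation: solve (I - gamma * P_pi) v = r_pi."""
    n = P.shape[1]
    P_pi = P[pi, np.arange(n)]      # n x n transition matrix under policy pi
    r_pi = R[pi, np.arange(n)]      # n-vector of rewards under policy pi
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def policy_iteration(P, R, gamma, variant="howard", tol=1e-10):
    n_actions, n_states, _ = P.shape
    pi = np.zeros(n_states, dtype=int)
    for it in range(10_000):
        v = evaluate_policy(P, R, pi, gamma)
        q = R + gamma * P @ v            # action values, shape (n_actions, n_states)
        adv = q - v[None, :]             # advantage of each action in each state
        if np.all(adv.max(axis=0) <= tol):
            return pi, v, it             # no positive advantage: policy is optimal
        if variant == "howard":
            # Howard's PI: switch the action in every state with positive advantage.
            improve = adv.max(axis=0) > tol
            pi[improve] = adv.argmax(axis=0)[improve]
        else:
            # Simplex-PI: switch only in the single state with maximal advantage.
            s = adv.max(axis=0).argmax()
            pi[s] = adv[:, s].argmax()
    return pi, v, it

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_states, n_actions, gamma = 5, 3, 0.9
    P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
    R = rng.standard_normal((n_actions, n_states))
    for variant in ("howard", "simplex"):
        pi, v, it = policy_iteration(P, R, gamma, variant)
        print(variant, "converged after", it, "iterations; policy:", pi)
```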

    Perseus: Randomized Point-based Value Iteration for POMDPs

    Partially observable Markov decision processes (POMDPs) form an attractive and principled framework for agent planning under uncertainty. Point-based approximate techniques for POMDPs compute a policy based on a finite set of points collected in advance from the agent's belief space. We present a randomized point-based value iteration algorithm called Perseus. The algorithm performs approximate value backup stages, ensuring that in each backup stage the value of each point in the belief set is improved; the key observation is that a single backup may improve the value of many belief points. In contrast to other point-based methods, Perseus backs up only a (randomly selected) subset of points in the belief set, sufficient for improving the value of each belief point in the set. We show how the same idea can be extended to deal with continuous action spaces. Experimental results show the potential of Perseus in large-scale POMDP problems.
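
    The core of Perseus is the randomized backup stage: only a sampled subset of beliefs is backed up, yet the value of every belief in the set is guaranteed not to decrease. The sketch below illustrates one such stage under assumed tensor shapes; T, O, R, the belief set B, and the alpha-vector set V are placeholders, not the authors' implementation.

```python
# Minimal sketch of a Perseus-style randomized backup stage (illustrative only).
# Assumed shapes: T[a, s, s'] = P(s'|s,a), O[a, s', o] = P(o|s',a),
# R[a, s] = reward, B = (n_beliefs, S) belief points, V = (|V|, S) alpha-vectors.
import numpy as np

def point_backup(b, V, T, O, R, gamma):
    """Point-based Bellman backup at belief b; returns the best alpha-vector."""
    n_actions, n_states, _ = T.shape
    n_obs = O.shape[2]
    best_val, best_alpha = -np.inf, None
    for a in range(n_actions):
        g_a = R[a].copy()
        for o in range(n_obs):
            # Project every alpha in V back through action a and observation o.
            proj = V @ (T[a] * O[a][:, o][None, :]).T      # shape (|V|, S)
            g_a += gamma * proj[np.argmax(proj @ b)]       # best projection at b
        if g_a @ b > best_val:
            best_val, best_alpha = g_a @ b, g_a
    return best_alpha

def perseus_backup_stage(B, V, T, O, R, gamma, rng):
    """One randomized backup stage: back up sampled beliefs until all improve."""
    old_vals = np.max(V @ B.T, axis=0)        # current value of every belief
    V_next = []
    todo = np.arange(len(B))                  # beliefs whose value has not improved yet
    while len(todo) > 0:
        b = B[rng.choice(todo)]
        alpha = point_backup(b, V, T, O, R, gamma)
        if alpha @ b < np.max(V @ b):
            alpha = V[np.argmax(V @ b)]       # keep the old best vector instead
        V_next.append(alpha)
        new_vals = np.max(np.array(V_next) @ B.T, axis=0)
        todo = np.flatnonzero(new_vals < old_vals)   # drop all improved beliefs
    return np.array(V_next)
```

    Repeating perseus_backup_stage, each time replacing V with its return value, approximates the overall value-iteration loop; a full implementation would also collect B by simulating the POMDP and add a convergence test.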