5 research outputs found

    Improved and generalized upper bounds on the complexity of policy iteration

    Markov decision processes; Dynamic Programming; Analysis of Algorithms
    Given a Markov Decision Process (MDP) with $n$ states and a total number $m$ of actions, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal $\gamma$-discounted policy. We consider two variations of PI: Howard's PI, which changes the actions in all states with a positive advantage, and Simplex-PI, which only changes the action in the state with maximal advantage. We show that Howard's PI terminates after at most $O\left(\frac{m}{1-\gamma}\log\left(\frac{1}{1-\gamma}\right)\right)$ iterations, improving by a factor $O(\log n)$ a result by Hansen et al., while Simplex-PI terminates after at most $O\left(\frac{nm}{1-\gamma}\log\left(\frac{1}{1-\gamma}\right)\right)$ iterations, improving by a factor $O(\log n)$ a result by Ye. Under some structural properties of the MDP, we then consider bounds that are independent of the discount factor $\gamma$: the quantities of interest are bounds $\tau_t$ and $\tau_r$, uniform on all states and policies, respectively on the expected time spent in transient states and the inverse of the frequency of visits in recurrent states, given that the process starts from the uniform distribution. Indeed, we show that Simplex-PI terminates after at most $\tilde O\left(n^3 m^2 \tau_t \tau_r\right)$ iterations. This extends a recent result for deterministic MDPs by Post & Ye, in which $\tau_t \le 1$ and $\tau_r \le n$; in particular it shows that Simplex-PI is strongly polynomial for a much larger class of MDPs. We explain why similar results seem hard to derive for Howard's PI. Finally, under the additional (restrictive) assumption that the state space is partitioned in two sets, respectively states that are transient and recurrent for all policies, we show that both Howard's PI and Simplex-PI terminate after at most $\tilde O\left(m(n^2\tau_t + n\tau_r)\right)$ iterations.
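
    The two update rules described in this abstract can be illustrated with a minimal sketch. This is not the authors' code: it assumes a dense tabular MDP in which every state has the same number of actions (the abstract only fixes a total number $m$ of actions), a hypothetical array layout `P[s, a, s']` for transitions and `R[s, a]` for rewards, and a greedy choice of the action to switch to. The names `evaluate` and `policy_iteration` are illustrative.

```python
import numpy as np


def evaluate(P, R, policy, gamma):
    """Value of a deterministic policy: solve (I - gamma * P_pi) v = r_pi."""
    n = R.shape[0]
    idx = np.arange(n)
    P_pi = P[idx, policy]  # shape (n, n): transition row of the chosen action
    r_pi = R[idx, policy]  # shape (n,): reward of the chosen action
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)


def policy_iteration(P, R, gamma, variant="howard", tol=1e-10):
    """Run PI and return (policy, value, number of policy changes).

    Assumed layout (not from the source): P[s, a, s'] and R[s, a].
    variant="howard": switch every state with a positive advantage.
    variant="simplex": switch only the state with maximal advantage.
    """
    n = R.shape[0]
    policy = np.zeros(n, dtype=int)
    iterations = 0
    while True:
        v = evaluate(P, R, policy, gamma)
        q = R + gamma * (P @ v)      # shape (n, A): one-step lookahead values
        adv = q - v[:, None]         # advantage of each action over the policy
        best = adv.argmax(axis=1)    # greedy action in each state
        gain = adv.max(axis=1)       # largest advantage in each state
        if gain.max() <= tol:        # no positive advantage: policy is optimal
            return policy, v, iterations
        if variant == "howard":
            policy = np.where(gain > tol, best, policy)
        else:
            s = int(gain.argmax())
            policy = policy.copy()
            policy[s] = best[s]
        iterations += 1


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_states, n_actions, gamma = 6, 3, 0.9
    P = rng.random((n_states, n_actions, n_states))
    P /= P.sum(axis=2, keepdims=True)  # normalize rows into distributions
    R = rng.random((n_states, n_actions))
    for variant in ("howard", "simplex"):
        _, _, iters = policy_iteration(P, R, gamma, variant)
        print(variant, iters)
```

    Both variants stop at a policy with no positive advantage, so they reach the same optimal value; they differ only in how many states change per iteration, which is the quantity the bounds above count.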

    Improved bound on the worst case complexity of Policy Iteration

    Solving Markov Decision Processes is a recurrent task in engineering which can be performed efficiently in practice using the Policy Iteration algorithm. Regarding its complexity, both lower and upper bounds are known to be exponential (but far apart) in the size of the problem. In this work, we provide the first improvement over the now standard upper bound from Mansour and Singh (1999). We also show that this bound is tight for a natural relaxation of the problem.