2 research outputs found

    Asymptotic Optimality of Finite Approximations to Markov Decision Processes with Borel Spaces

    Calculating optimal policies is known to be computationally difficult for Markov decision processes (MDPs) with Borel state and action spaces. This paper studies finite-state approximations of discrete-time Markov decision processes with Borel state and action spaces, for both the discounted and average cost criteria. The stationary policies thus obtained are shown to approximate the optimal stationary policy with arbitrary precision, under quite general conditions for the discounted cost and under more restrictive conditions for the average cost. For compact-state MDPs, we obtain explicit rate-of-convergence bounds quantifying how the approximation improves as the size of the approximating finite state space increases. Using information-theoretic arguments, the order optimality of the obtained convergence rates is established for a large class of problems. We also show that, as a pre-processing step, the action space can be finitely approximated with a sufficiently large number of points; thereby, well-known algorithms such as value or policy iteration, Q-learning, etc., can be used to calculate near-optimal policies.
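    The last point describes a concrete recipe: quantize the state and action spaces, build a finite MDP whose transitions are pushed to the nearest grid points, solve it with a standard algorithm such as value iteration, and lift the resulting policy back to the original space. Below is a minimal sketch of that recipe in Python; the dynamics, cost, grid sizes, and discount factor are illustrative assumptions, not the paper's construction or its approximation scheme.

    import numpy as np

    # Illustrative continuous MDP (assumed, not from the paper):
    # state x in [0, 1], action a in [0, 1], additive noise w,
    # dynamics x' = clip(x + a - w, 0, 1), cost c(x, a) = (x - 0.5)**2 + 0.1 * a.
    rng = np.random.default_rng(0)
    N_STATES, N_ACTIONS, N_NOISE = 50, 10, 20   # finite grid sizes
    BETA = 0.95                                  # discount factor

    x_grid = np.linspace(0.0, 1.0, N_STATES)     # quantized state space
    a_grid = np.linspace(0.0, 1.0, N_ACTIONS)    # quantized action space
    w_samples = rng.uniform(0.0, 0.5, N_NOISE)   # noise samples for the kernel

    def nearest(grid, values):
        """Map continuous values to indices of the nearest grid points."""
        return np.abs(np.asarray(values)[..., None] - grid).argmin(axis=-1)

    # Build the finite transition kernel P[s, a, s'] and cost c[s, a]
    # by pushing each sampled (x, a, w) transition to its nearest grid point.
    P = np.zeros((N_STATES, N_ACTIONS, N_STATES))
    cost = np.empty((N_STATES, N_ACTIONS))
    for i, x in enumerate(x_grid):
        for j, a in enumerate(a_grid):
            x_next = np.clip(x + a - w_samples, 0.0, 1.0)
            np.add.at(P[i, j], nearest(x_grid, x_next), 1.0 / N_NOISE)
            cost[i, j] = (x - 0.5) ** 2 + 0.1 * a

    # Standard value iteration on the finite MDP.
    V = np.zeros(N_STATES)
    for _ in range(1000):
        Q = cost + BETA * P @ V                  # Q[s, a]
        V_new = Q.min(axis=1)
        if np.max(np.abs(V_new - V)) < 1e-8:
            break
        V = V_new
    policy = Q.argmin(axis=1)                    # greedy finite-state policy

    # Lift to the original space: play the action of the nearest grid state.
    def act(x_continuous):
        return a_grid[policy[nearest(x_grid, x_continuous)]]

    print(act(0.37))

    The nearest-neighbour lifting at the end is the simplest choice; how good the lifted policy is as the grids are refined is roughly the question the paper's convergence-rate bounds address.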

    Simulation-Based Uniform Value Function Estimates of Markov Decision Processes

    The value function of a Markov decision process assigns to each policy its expected discounted reward. This expected reward can be estimated as the empirical average of the reward over many independent simulation runs. We derive bounds on the number of runs needed for the uniform convergence of the empirical average to the expected reward over a class of policies, in terms of the VC- or P-dimension of the policy class. Further, we show through a counterexample that whether uniform convergence holds for an MDP depends on the simulation method used. Uniform convergence results are also obtained for the average reward case and for partially observed Markov decision processes, and they extend easily to Markov games. The results can be viewed as a contribution to empirical process theory and as an extension of probably approximately correct (PAC) learning theory to partially observable MDPs and Markov games. Key words: Markov decision processes; Markov games; empirical process theory; PAC learning; value function estimation; uniform rate of convergence.
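    As a rough illustration of the estimation scheme described in the opening sentences, the sketch below simulates many independent runs of a fixed stationary policy on a small finite MDP, averages the truncated discounted rewards, and reports a Hoeffding-style run count for that single policy. The MDP, the policy, and all numerical choices are assumptions made for illustration, and the bound shown is not the paper's uniform (VC/P-dimension) bound over a policy class.

    import numpy as np

    rng = np.random.default_rng(1)

    # Illustrative finite MDP (assumed, not from the paper): 3 states, 2 actions.
    P = np.array([[[0.9, 0.1, 0.0], [0.2, 0.7, 0.1]],
                  [[0.1, 0.8, 0.1], [0.0, 0.3, 0.7]],
                  [[0.5, 0.0, 0.5], [0.3, 0.3, 0.4]]])   # P[s, a, s']
    R = np.array([[1.0, 0.5],
                  [0.0, 2.0],
                  [0.3, 0.1]])                            # R[s, a] in [0, R_MAX]
    BETA, R_MAX = 0.9, 2.0
    policy = np.array([0, 1, 1])                          # fixed stationary policy

    def simulate_return(s0, horizon):
        """One independent run: truncated discounted reward of `policy` from s0."""
        s, total, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            a = policy[s]
            total += discount * R[s, a]
            s = rng.choice(3, p=P[s, a])
            discount *= BETA
        return total

    # Truncate the horizon so the tail BETA**T * R_MAX / (1 - BETA) is below eps / 2.
    eps, delta = 0.05, 0.05
    horizon = int(np.ceil(np.log(eps * (1 - BETA) / (2 * R_MAX)) / np.log(BETA)))

    # Hoeffding-style run count for accuracy eps / 2: returns lie in
    # [0, R_MAX / (1 - BETA)], so n >= span**2 * ln(2/delta) / (2 * (eps/2)**2)
    # suffices for a single fixed policy (not a uniform bound over a class).
    span = R_MAX / (1 - BETA)
    n_hoeffding = int(np.ceil(span**2 * np.log(2 / delta) / (2 * (eps / 2) ** 2)))

    n_runs = 5000  # illustrative demo size; the Hoeffding count above is far larger
    estimate = np.mean([simulate_return(0, horizon) for _ in range(n_runs)])
    print(f"Hoeffding run count: {n_hoeffding}, horizon: {horizon}")
    print(f"Empirical value from state 0 over {n_runs} runs: {estimate:.3f}")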