
Simulation-based uniform value function estimates of discounted and average-reward MDPs

By Rahul Jain and Pravin P. Varaiya

Abstract

The value function of a Markov decision process (MDP) assigns to each policy its expected discounted reward. This expected reward can be estimated as the empirical average of the reward over many independent simulation runs. We derive bounds on the number of runs needed for the uniform convergence of the empirical average to the expected reward for a class of policies, in terms of the Vapnik-Chervonenkis or P-dimension of the policy class. Further, we show through a counterexample that whether or not we get uniform convergence for an MDP depends on the simulation method used. Uniform convergence results are also obtained for the average-reward case and for partially observed Markov decision processes, and can be easily extended to Markov games. The results can be viewed as a contribution to empirical process theory and as an extension of probably approximately correct (PAC) learning theory for partially observable MDPs and Markov games.

Key words. Markov decision processes, Markov games, empirical process theory, PAC learning, value function estimation, uniform rate of convergence
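
The estimation scheme the abstract refers to is plain Monte Carlo: simulate many independent trajectories under a fixed policy and average their discounted rewards. The sketch below is a minimal illustration of that idea only, under assumed names (toy_step, empirical_value) and a hypothetical two-state environment with a truncated horizon; the paper's contribution, which the sketch does not capture, is bounding how many runs (n_runs) are needed for such estimates to converge uniformly over an entire policy class, in terms of its VC or P-dimension.

    import random

    def simulate_discounted_reward(step, policy, s0, gamma, horizon):
        # One independent simulation run: discounted reward along a truncated trajectory.
        s, total, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            s, r = step(s, a)           # sample next state and one-step reward
            total += discount * r
            discount *= gamma
        return total

    def empirical_value(step, policy, s0, gamma, horizon, n_runs):
        # Empirical average of the discounted reward over n_runs independent runs.
        runs = [simulate_discounted_reward(step, policy, s0, gamma, horizon)
                for _ in range(n_runs)]
        return sum(runs) / len(runs)

    # Hypothetical toy two-state chain, for illustration only:
    # action 0 tries to stay, action 1 tries to flip the state; reward 1 in state 1.
    def toy_step(s, a):
        s_next = s if a == 0 else 1 - s
        if random.random() < 0.1:       # transition noise
            s_next = 1 - s_next
        return s_next, float(s_next)

    print(empirical_value(toy_step, policy=lambda s: 1, s0=0,
                          gamma=0.9, horizon=100, n_runs=10000))
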

Year: 2004
OAI identifier: oai:CiteSeerX.psu:10.1.1.122.7522
Provided by: CiteSeerX
Download PDF:
Sorry, we are unable to provide the full text but you may find it at the following location(s):
  • http://citeseerx.ist.psu.edu/v... (external link)
  • http://rjain31.googlepages.com... (external link)

