    On the Sample Complexity of Reinforcement Learning with a Generative Model

    We consider the problem of learning the optimal action-value function in discounted-reward Markov decision processes (MDPs). We prove a new PAC bound on the sample complexity of the model-based value iteration algorithm in the presence of a generative model, which indicates that for an MDP with N state-action pairs and discount factor \gamma\in[0,1), only O(N\log(N/\delta)/((1-\gamma)^3\epsilon^2)) samples are required to find an \epsilon-optimal estimate of the action-value function with probability 1-\delta. We also prove a matching lower bound of \Theta(N\log(N/\delta)/((1-\gamma)^3\epsilon^2)) on the sample complexity of estimating the optimal action-value function by any RL algorithm. To the best of our knowledge, this is the first result on the sample complexity of estimating the optimal (action-)value function in which the upper bound matches the lower bound for RL in terms of N, \epsilon, \delta and 1/(1-\gamma). Also, both our lower bound and our upper bound significantly improve on the state-of-the-art in terms of their dependence on 1/(1-\gamma).
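
    The bound above concerns model-based value iteration run on an empirical MDP estimated from generative-model samples. As a rough illustration of that pipeline (a minimal sketch, not the authors' exact procedure: the callable `sample(s, a) -> (next_state, reward)` and the parameters `samples_per_pair` and `n_iterations` are assumed names introduced here):

```python
import numpy as np

def model_based_q_iteration(sample, n_states, n_actions, gamma,
                            samples_per_pair, n_iterations):
    """Sketch: estimate Q* by (1) building an empirical tabular MDP from a
    fixed number of generative-model samples per state-action pair and
    (2) running value iteration on that empirical model."""
    # Empirical transition kernel and mean rewards.
    P_hat = np.zeros((n_states, n_actions, n_states))
    R_hat = np.zeros((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(samples_per_pair):
                s_next, r = sample(s, a)  # one call to the generative model
                P_hat[s, a, s_next] += 1.0
                R_hat[s, a] += r
            P_hat[s, a] /= samples_per_pair
            R_hat[s, a] /= samples_per_pair

    # Value iteration on the empirical MDP: Q <- R_hat + gamma * P_hat max_a' Q.
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iterations):
        Q = R_hat + gamma * P_hat @ Q.max(axis=1)
    return Q
```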

    Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model

    We consider the problem of learning the optimal action-value function in discounted-reward Markov decision processes (MDPs). We prove new PAC bounds on the sample complexity of two well-known model-based reinforcement learning (RL) algorithms in the presence of a generative model of the MDP: value iteration and policy iteration. The first result indicates that for an MDP with N state-action pairs and discount factor \gamma\in[0,1), only O(N\log(N/\delta)/((1-\gamma)^3\epsilon^2)) state-transition samples are required to find an \epsilon-optimal estimate of the action-value function with probability (w.p.) 1-\delta. Further, we prove that, for small values of \epsilon, an order of O(N\log(N/\delta)/((1-\gamma)^3\epsilon^2)) samples is required to find an \epsilon-optimal policy w.p. 1-\delta. We also prove a matching lower bound of \Omega(N\log(N/\delta)/((1-\gamma)^3\epsilon^2)) on the sample complexity of estimating the optimal action-value function. To the best of our knowledge, this is the first minimax result on the sample complexity of RL: the upper bound matches the lower bound in terms of N, \epsilon, \delta and 1/(1-\gamma) up to a constant factor. Also, both our lower bound and upper bound improve on the state-of-the-art in terms of their dependence on 1/(1-\gamma).
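
    To get a feel for the scale of the O(N\log(N/\delta)/((1-\gamma)^3\epsilon^2)) bound, here is a small numerical illustration; the values of N, \gamma, \epsilon and \delta below are chosen arbitrarily, and the constant hidden in the O(.)/\Omega(.) notation is not specified in the abstract:

```python
import math

def sample_bound(n_pairs, gamma, epsilon, delta):
    """Order of the minimax bound N*log(N/delta) / ((1-gamma)^3 * epsilon^2),
    ignoring the unspecified constant factor."""
    return n_pairs * math.log(n_pairs / delta) / ((1.0 - gamma) ** 3 * epsilon ** 2)

# Hypothetical values: N = 10_000 pairs, gamma = 0.99, epsilon = 0.1, delta = 0.05.
# The 1/(1-gamma)^3 factor alone contributes 10^6, giving roughly 1.2e13 samples.
print(f"{sample_bound(10_000, 0.99, 0.1, 0.05):.3e}")
```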

    The Sample-Complexity of General Reinforcement Learning

    We present a new algorithm for general reinforcement learning where the true environment is known to belong to a finite class of N arbitrary models. The algorithm is shown to be near-optimal for all but O(N log^2 N) time-steps with high probability. Infinite classes are also considered, where we show that compactness is a key criterion for determining the existence of uniform sample-complexity bounds. A matching lower bound is given for the finite case.