Abstract

We consider the problem of finding a near-optimal policy in continuous-space, discounted Markovian Decision Problems given the trajectory of some behaviour policy. We study the policy iteration algorithm in which, in successive iterations, the action-value functions of the intermediate policies are obtained by picking a function from some fixed function set (chosen by the user) that minimizes an unbiased finite-sample approximation to a novel loss function that upper-bounds the unmodified Bellman-residual criterion. The main result is a finite-sample, high-probability bound on the performance of the resulting policy that depends on the mixing rate of the trajectory, the capacity of the function set as measured by a novel capacity concept that we call the VC-crossing dimension, the approximation power of the function set, and the discounted-average concentrability of the future-state distribution. To the best of our knowledge, this is the first theoretical reinforcement learning result for off-policy control learning over continuous state spaces using a single trajectory.
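For orientation, the following is a sketch in standard notation of the unmodified Bellman-residual criterion and the fitted policy iteration step referred to above; the symbols (Bellman operator T^pi, function set F, weighting distribution nu) are conventional choices, and the paper's actual objective is a modified loss that upper-bounds this quantity rather than this criterion itself.

    (T^\pi Q)(x,a) = r(x,a) + \gamma \int Q\bigl(y, \pi(y)\bigr)\, P(dy \mid x,a)

    L(Q; \pi) = \bigl\| Q - T^\pi Q \bigr\|_\nu^2   \qquad \text{(Bellman-residual criterion)}

    Q_{k+1} \in \arg\min_{Q \in \mathcal{F}} \hat{L}(Q; \pi_k), \qquad \pi_{k+1}(x) \in \arg\max_{a} Q_{k+1}(x,a)

Here \hat{L} denotes the empirical (finite-sample) approximation of the loss computed from the single behaviour-policy trajectory, and each new policy is greedy with respect to the fitted action-value function.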