
    Approximate dynamic programming for two-player zero-sum Markov games

    This paper provides an analysis of error propagation in Approximate Dynamic Programming applied to zero-sum two-player Stochastic Games. We provide a novel and unified error propagation analysis in L_p-norm of three well-known algorithms adapted to Stochastic Games (namely Approximate Value Iteration, Approximate Policy Iteration and Approximate Generalized Policy Iteration). We show that we can achieve a stationary policy which is (2γε + ε′)/(1−γ)²-optimal, where ε is the value function approximation error and ε′ is the approximate greedy operator error. In addition, we provide a practical algorithm (AGPI-Q) to solve infinite-horizon γ-discounted two-player zero-sum Stochastic Games in a batch setting. It is an extension of the Fitted-Q algorithm (which solves Markov Decision Processes from data) and can be non-parametric. Finally, we demonstrate experimentally the performance of AGPI-Q on a simultaneous two-player game, namely Alesia.
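    As a rough illustration of the dynamic-programming operator these schemes approximate, the sketch below runs exact value iteration on a small, randomly generated turn-based zero-sum game, where the greedy step reduces to a max at the maximizer's states and a min at the minimizer's. This is only a didactic simplification under assumed shapes and names: AGPI-Q itself works from batch data with a fitted-Q-style regressor, and simultaneous-move games such as Alesia require solving a matrix game per state in the greedy step.

        import numpy as np

        # Tabular value iteration for a turn-based zero-sum game (didactic sketch,
        # not AGPI-Q itself; the sizes and the random game are assumptions).
        gamma = 0.9
        n_states, n_actions = 5, 3
        rng = np.random.default_rng(0)
        P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
        R = rng.standard_normal((n_states, n_actions))                    # reward to the maximizer
        maximizer = np.arange(n_states) % 2 == 0   # which player controls each state

        V = np.zeros(n_states)
        for _ in range(200):
            Q = R + gamma * P @ V                                   # one-step look-ahead values
            V = np.where(maximizer, Q.max(axis=1), Q.min(axis=1))   # max / min greedy step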

    Approximate Modified Policy Iteration

    Modified policy iteration (MPI) is a dynamic programming (DP) algorithm that contains the two celebrated policy and value iteration methods as special cases. Despite its generality, MPI has not been thoroughly studied, especially its approximation form, which is used when the state and/or action spaces are large or infinite. In this paper, we propose three implementations of approximate MPI (AMPI) that are extensions of well-known approximate DP algorithms: fitted-value iteration, fitted-Q iteration, and classification-based policy iteration. We provide error propagation analyses that unify those for approximate policy and value iteration. For the last, classification-based implementation, we develop a finite-sample analysis showing that MPI's main parameter allows us to control the balance between the estimation error of the classifier and the overall value function approximation error.
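    As a minimal illustration of the parameter in question, the sketch below runs tabular MPI on a small random MDP (assumed shapes, not one of the paper's three AMPI implementations): each iteration performs a greedy step followed by m applications of the Bellman operator of the current policy, so m = 1 recovers value iteration and m → ∞ recovers policy iteration.

        import numpy as np

        # Tabular modified policy iteration on a random MDP (illustrative sketch).
        gamma, m = 0.9, 5
        n_states, n_actions = 5, 3
        rng = np.random.default_rng(0)
        P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
        R = rng.standard_normal((n_states, n_actions))

        V = np.zeros(n_states)
        for _ in range(100):
            pi = (R + gamma * P @ V).argmax(axis=1)      # greedy improvement step
            P_pi = P[np.arange(n_states), pi]            # dynamics under the greedy policy
            R_pi = R[np.arange(n_states), pi]
            for _ in range(m):                           # m-step partial policy evaluation
                V = R_pi + gamma * P_pi @ V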

    Finite-Sample Analysis of Least-Squares Policy Iteration

    In this paper, we report a performance bound for the widely used least-squares policy iteration (LSPI) algorithm. We first consider the problem of policy evaluation in reinforcement learning, that is, learning the value function of a fixed policy, using the least-squares temporal-difference (LSTD) learning method, and report a finite-sample analysis for this algorithm. To do so, we first derive a bound on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function. This result is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain. We then derive generalization bounds in the case when the Markov chain possesses a stationary distribution and is β-mixing. Finally, we analyze how the error at each policy-evaluation step is propagated through the iterations of a policy iteration method, and derive a performance bound for the LSPI algorithm.
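    For concreteness, here is a minimal sketch of the LSTD solution computed from a batch of transitions represented by feature vectors; the function name, the regularization term, and the array layout are assumptions, not the paper's implementation.

        import numpy as np

        # LSTD: solve A theta = b with A = sum_t phi_t (phi_t - gamma * phi_{t+1})^T
        # and b = sum_t r_t phi_t; V(s) is then approximated by phi(s) . theta.
        def lstd(phis, rewards, next_phis, gamma=0.99, reg=1e-6):
            d = phis.shape[1]
            A = np.zeros((d, d))
            b = np.zeros(d)
            for phi, r, phi_next in zip(phis, rewards, next_phis):
                A += np.outer(phi, phi - gamma * phi_next)
                b += r * phi
            return np.linalg.solve(A + reg * np.eye(d), b)

        # LSPI repeats this on state-action features and re-derives the greedy
        # policy from the estimated Q-function at each iteration.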

    Analysis of Classification-based Policy Iteration Algorithms

    We introduce a variant of the classification-based approach to policy iteration which uses a cost-sensitive loss function weighting each classification mistake by its actual regret, that is, the difference between the action-value of the greedy action and of the action chosen by the classifier. For this algorithm, we provide a full finite-sample analysis. Our results state a performance bound in terms of the number of policy improvement steps, the number of rollouts used in each iteration, the capacity of the considered policy space (classifier), and a capacity measure which indicates how well the policy space can approximate policies that are greedy with respect to any of its members. The analysis reveals a tradeoff between the estimation and approximation errors in this classification-based policy iteration setting. Furthermore, it confirms the intuition that classification-based policy iteration algorithms could compare favorably to value-based approaches when the policies can be approximated more easily than their corresponding value functions. We also study the consistency of the algorithm when there exists a sequence of policy spaces with increasing capacity.
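    As a small illustration of the cost-sensitive loss described above, the sketch below scores a candidate policy on a rollout set by charging each state the gap between the greedy action-value and the value of the action the policy picks; the array Q_hat, the function name, and the interface are placeholder assumptions.

        import numpy as np

        # Regret-weighted empirical loss of a candidate policy over a rollout set.
        # Q_hat[i, a] is the rollout estimate of the action-value at the i-th state.
        def regret_loss(Q_hat, chosen_actions):
            greedy_value = Q_hat.max(axis=1)                            # value of the greedy action
            chosen_value = Q_hat[np.arange(len(chosen_actions)), chosen_actions]
            return np.mean(greedy_value - chosen_value)                 # average regret (the loss)

        # A classifier over the policy space is then trained to minimize this loss,
        # e.g. through a cost-sensitive surrogate with per-state weights.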

    Budgeted classification-based policy iteration algorithms

    This dissertation is motivated by the study of a class of reinforcement learning (RL) algorithms called classification-based policy iteration (CBPI). Contrary to standard RL methods, CBPI algorithms do not use an explicit representation of the value function. Instead, they use rollouts to estimate the action-value function of the current policy at a collection of states. Using a training set built from these rollout estimates, the greedy policy is learned as the output of a classifier. Thus, the policy generated at each iteration of the algorithm is no longer defined by an (approximated) value function, but by a classifier. In this thesis, we propose new algorithms that improve the performance of the existing CBPI methods, especially when they have a fixed budget of interaction with the environment. Our improvements address the following two shortcomings of the existing CBPI algorithms: 1) the rollouts used to estimate the action-value functions must be truncated and their number is limited, so we face a bias-variance tradeoff in these estimates, and 2) the rollouts are allocated uniformly over the states in the rollout set and the available actions, whereas a smarter allocation strategy could produce a more accurate training set for the classifier. We propose CBPI algorithms that address these issues, respectively, by: 1) using a value function approximation to improve the accuracy (balancing the bias and variance) of the rollout estimates, and 2) adaptively allocating the rollouts over the state-action pairs.
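    To make the rollout step concrete, here is a minimal sketch of the truncated Monte Carlo estimate of the current policy's action-value at one state-action pair, with the budget spread uniformly (the baseline the thesis improves on). The generative-model interface step(state, action) -> (next_state, reward), the function name, and the default parameters are assumptions.

        # Truncated rollout estimate of Q^pi(s, a): average of n_rollouts returns,
        # each following pi for `horizon` steps after the first action.
        def rollout_q(step, policy, state, action, gamma=0.99, horizon=20, n_rollouts=10):
            total = 0.0
            for _ in range(n_rollouts):
                s, a, ret, discount = state, action, 0.0, 1.0
                for _ in range(horizon):
                    s, r = step(s, a)          # simulate one transition
                    ret += discount * r
                    discount *= gamma
                    a = policy(s)              # follow the current policy pi afterwards
                total += ret
            return total / n_rollouts          # Monte Carlo estimate used in the training set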

    Classification-based policy iteration with a critic

    In this paper, we study the effect of adding a value function approximation component (critic) to rollout classification-based policy iteration (RCPI) algorithms. The idea is to use a critic to approximate the return after we truncate the rollout trajectories. This allows us to control the bias and variance of the rollout estimates of the action-value function. Therefore, the introduction of a critic can improve the accuracy of the rollout estimates and, as a result, enhance the performance of the RCPI algorithm. We present a new RCPI algorithm, called direct policy iteration with critic (DPI-Critic), and provide its finite-sample analysis when the critic is based on the LSTD method. We empirically evaluate the performance of DPI-Critic and compare it with DPI and LSPI in two benchmark reinforcement learning problems.
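    The change DPI-Critic makes to the plain rollout estimate can be sketched as below: truncate each trajectory after a short horizon and complete the return with an approximate value function (the critic, e.g. fitted by LSTD). The step/policy/V_hat interfaces and the parameter values are assumptions, not the paper's code; shortening the horizon lowers the variance of the estimate, while the critic's own error is what introduces bias.

        # Critic-bootstrapped rollout estimate of Q^pi(s, a): truncate after `horizon`
        # steps and add gamma^horizon * V_hat(last state) in place of the missing tail.
        def rollout_q_with_critic(step, policy, V_hat, state, action,
                                  gamma=0.99, horizon=10, n_rollouts=10):
            total = 0.0
            for _ in range(n_rollouts):
                s, a, ret, discount = state, action, 0.0, 1.0
                for _ in range(horizon):
                    s, r = step(s, a)
                    ret += discount * r
                    discount *= gamma
                    a = policy(s)
                total += ret + discount * V_hat(s)   # bootstrap the truncated tail with the critic
            return total / n_rollouts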