
    Approximate dynamic programming for two-player zero-sum Markov games

    This paper provides an analysis of error propagation in Approximate Dynamic Programming applied to zero-sum two-player Stochastic Games. We provide a novel and unified error propagation analysis in L_p-norm of three well-known algorithms adapted to Stochastic Games (namely Approximate Value Iteration, Approximate Policy Iteration and Approximate Generalized Policy Iteration). We show that we can achieve a stationary policy which is (2γε + ε′)/(1−γ)²-optimal, where ε is the value function approximation error and ε′ is the approximate greedy operator error. In addition, we provide a practical algorithm (AGPI-Q) to solve infinite-horizon γ-discounted two-player zero-sum Stochastic Games in a batch setting. It is an extension of the Fitted-Q algorithm (which solves Markov Decision Processes from data) and can be non-parametric. Finally, we demonstrate experimentally the performance of AGPI-Q on a simultaneous two-player game, namely Alesia.
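    As a rough illustration of the dynamic-programming operator these schemes approximate, the sketch below runs exact value iteration on a small, randomly generated turn-based zero-sum game, where the greedy step reduces to a max at the maximizer's states and a min at the minimizer's. This is only a didactic simplification under assumed shapes and names: AGPI-Q itself works from batch data with a fitted-Q-style regressor, and simultaneous-move games such as Alesia require solving a matrix game per state in the greedy step.

        import numpy as np

        # Tabular value iteration for a turn-based zero-sum game (didactic sketch,
        # not AGPI-Q itself; the sizes and the random game are assumptions).
        gamma = 0.9
        n_states, n_actions = 5, 3
        rng = np.random.default_rng(0)
        P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
        R = rng.standard_normal((n_states, n_actions))                    # reward to the maximizer
        maximizer = np.arange(n_states) % 2 == 0   # which player controls each state

        V = np.zeros(n_states)
        for _ in range(200):
            Q = R + gamma * P @ V                                   # one-step look-ahead values
            V = np.where(maximizer, Q.max(axis=1), Q.min(axis=1))   # max / min greedy step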

    Approximate Modified Policy Iteration

    Modified policy iteration (MPI) is a dynamic programming (DP) algorithm that contains the two celebrated policy and value iteration methods as special cases. Despite its generality, MPI has not been thoroughly studied, especially its approximation form, which is used when the state and/or action spaces are large or infinite. In this paper, we propose three implementations of approximate MPI (AMPI) that are extensions of well-known approximate DP algorithms: fitted-value iteration, fitted-Q iteration, and classification-based policy iteration. We provide error propagation analyses that unify those for approximate policy and value iteration. For the last, classification-based implementation, we develop a finite-sample analysis showing that MPI's main parameter allows us to control the balance between the estimation error of the classifier and the overall value function approximation error.
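    As a minimal illustration of the parameter in question, the sketch below runs tabular MPI on a small random MDP (assumed shapes, not one of the paper's three AMPI implementations): each iteration performs a greedy step followed by m applications of the Bellman operator of the current policy, so m = 1 recovers value iteration and m → ∞ recovers policy iteration.

        import numpy as np

        # Tabular modified policy iteration on a random MDP (illustrative sketch).
        gamma, m = 0.9, 5
        n_states, n_actions = 5, 3
        rng = np.random.default_rng(0)
        P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
        R = rng.standard_normal((n_states, n_actions))

        V = np.zeros(n_states)
        for _ in range(100):
            pi = (R + gamma * P @ V).argmax(axis=1)      # greedy improvement step
            P_pi = P[np.arange(n_states), pi]            # dynamics under the greedy policy
            R_pi = R[np.arange(n_states), pi]
            for _ in range(m):                           # m-step partial policy evaluation
                V = R_pi + gamma * P_pi @ V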

    Finite-Sample Analysis of Least-Squares Policy Iteration

    In this paper, we report a performance bound for the widely used least-squares policy iteration (LSPI) algorithm. We first consider the problem of policy evaluation in reinforcement learning, that is, learning the value function of a fixed policy, using the least-squares temporal-difference (LSTD) learning method, and report a finite-sample analysis for this algorithm. To do so, we first derive a bound on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function. This result is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain. We then derive generalization bounds in the case when the Markov chain possesses a stationary distribution and is β-mixing. Finally, we analyze how the error at each policy-evaluation step is propagated through the iterations of a policy iteration method, and derive a performance bound for the LSPI algorithm.
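    For concreteness, here is a minimal sketch of the LSTD solution computed from a batch of transitions represented by feature vectors; the function name, the regularization term, and the array layout are assumptions, not the paper's implementation.

        import numpy as np

        # LSTD: solve A theta = b with A = sum_t phi_t (phi_t - gamma * phi_{t+1})^T
        # and b = sum_t r_t phi_t; V(s) is then approximated by phi(s) . theta.
        def lstd(phis, rewards, next_phis, gamma=0.99, reg=1e-6):
            d = phis.shape[1]
            A = np.zeros((d, d))
            b = np.zeros(d)
            for phi, r, phi_next in zip(phis, rewards, next_phis):
                A += np.outer(phi, phi - gamma * phi_next)
                b += r * phi
            return np.linalg.solve(A + reg * np.eye(d), b)

        # LSPI repeats this on state-action features and re-derives the greedy
        # policy from the estimated Q-function at each iteration.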

    Analysis of Classification-based Policy Iteration Algorithms

    We introduce a variant of the classification-based approach to policy iteration which uses a cost-sensitive loss function weighting each classification mistake by its actual regret, that is, the difference between the action-value of the greedy action and of the action chosen by the classifier. For this algorithm, we provide a full finite-sample analysis. Our results state a performance bound in terms of the number of policy improvement steps, the number of rollouts used in each iteration, the capacity of the considered policy space (classifier), and a capacity measure which indicates how well the policy space can approximate policies that are greedy with respect to any of its members. The analysis reveals a tradeoff between the estimation and approximation errors in this classification-based policy iteration setting. Furthermore, it confirms the intuition that classification-based policy iteration algorithms could compare favorably to value-based approaches when the policies can be approximated more easily than their corresponding value functions. We also study the consistency of the algorithm when there exists a sequence of policy spaces with increasing capacity.
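    As a small illustration of the cost-sensitive loss described above, the sketch below scores a candidate policy on a rollout set by charging each state the gap between the greedy action-value and the value of the action the policy picks; the array Q_hat, the function name, and the interface are placeholder assumptions.

        import numpy as np

        # Regret-weighted empirical loss of a candidate policy over a rollout set.
        # Q_hat[i, a] is the rollout estimate of the action-value at the i-th state.
        def regret_loss(Q_hat, chosen_actions):
            greedy_value = Q_hat.max(axis=1)                            # value of the greedy action
            chosen_value = Q_hat[np.arange(len(chosen_actions)), chosen_actions]
            return np.mean(greedy_value - chosen_value)                 # average regret (the loss)

        # A classifier over the policy space is then trained to minimize this loss,
        # e.g. through a cost-sensitive surrogate with per-state weights.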

    Budgeted classification-based policy iteration algorithms

    This dissertation is motivated by the study of a class of reinforcement learning (RL) algorithms called classification-based policy iteration (CBPI). Contrary to standard RL methods, CBPI algorithms do not use an explicit representation of the value function. Instead, they use rollouts to estimate the action-value function of the current policy at a collection of states. Using a training set built from these rollout estimates, the greedy policy is learned as the output of a classifier. Thus, the policy generated at each iteration of the algorithm is no longer defined by an (approximated) value function, but by a classifier. In this thesis, we propose new algorithms that improve the performance of the existing CBPI methods, especially when they have a fixed budget of interaction with the environment. Our improvements address the following two shortcomings of the existing CBPI algorithms: 1) the rollouts used to estimate the action-value functions must be truncated and their number is limited, so we face a bias-variance tradeoff in these estimates, and 2) the rollouts are allocated uniformly over the states in the rollout set and the available actions, whereas a smarter allocation strategy could produce a more accurate training set for the classifier. We propose CBPI algorithms that address these issues, respectively, by: 1) using a value function approximation to improve the accuracy (balancing the bias and variance) of the rollout estimates, and 2) adaptively allocating the rollouts over the state-action pairs.
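    To make the rollout step concrete, here is a minimal sketch of the truncated Monte Carlo estimate of the current policy's action-value at one state-action pair, with the budget spread uniformly (the baseline the thesis improves on). The generative-model interface step(state, action) -> (next_state, reward), the function name, and the default parameters are assumptions.

        # Truncated rollout estimate of Q^pi(s, a): average of n_rollouts returns,
        # each following pi for `horizon` steps after the first action.
        def rollout_q(step, policy, state, action, gamma=0.99, horizon=20, n_rollouts=10):
            total = 0.0
            for _ in range(n_rollouts):
                s, a, ret, discount = state, action, 0.0, 1.0
                for _ in range(horizon):
                    s, r = step(s, a)          # simulate one transition
                    ret += discount * r
                    discount *= gamma
                    a = policy(s)              # follow the current policy pi afterwards
                total += ret
            return total / n_rollouts          # Monte Carlo estimate used in the training set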

    Classification-based policy iteration with a critic

    In this paper, we study the effect of adding a value function approximation component (critic) to rollout classification-based policy iteration (RCPI) algorithms. The idea is to use a critic to approximate the return after we truncate the rollout trajectories. This allows us to control the bias and variance of the rollout estimates of the action-value function. Therefore, the introduction of a critic can improve the accuracy of the rollout estimates and, as a result, enhance the performance of the RCPI algorithm. We present a new RCPI algorithm, called direct policy iteration with critic (DPI-Critic), and provide its finite-sample analysis when the critic is based on the LSTD method. We empirically evaluate the performance of DPI-Critic and compare it with DPI and LSPI in two benchmark reinforcement learning problems.
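    The change DPI-Critic makes to the plain rollout estimate can be sketched as below: truncate each trajectory after a short horizon and complete the return with an approximate value function (the critic, e.g. fitted by LSTD). The step/policy/V_hat interfaces and the parameter values are assumptions, not the paper's code; shortening the horizon lowers the variance of the estimate, while the critic's own error is what introduces bias.

        # Critic-bootstrapped rollout estimate of Q^pi(s, a): truncate after `horizon`
        # steps and add gamma^horizon * V_hat(last state) in place of the missing tail.
        def rollout_q_with_critic(step, policy, V_hat, state, action,
                                  gamma=0.99, horizon=10, n_rollouts=10):
            total = 0.0
            for _ in range(n_rollouts):
                s, a, ret, discount = state, action, 0.0, 1.0
                for _ in range(horizon):
                    s, r = step(s, a)
                    ret += discount * r
                    discount *= gamma
                    a = policy(s)
                total += ret + discount * V_hat(s)   # bootstrap the truncated tail with the critic
            return total / n_rollouts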