    Least-squares methods for policy iteration

    Approximate reinforcement learning deals with the essential problem of applying reinforcement learning in large and continuous state-action spaces by using function approximators to represent the solution. This chapter reviews least-squares methods for policy iteration, an important class of algorithms for approximate reinforcement learning. We discuss three techniques for solving the core policy evaluation component of policy iteration: least-squares temporal difference, least-squares policy evaluation, and Bellman residual minimization. We introduce these techniques starting from their general mathematical principles and detail them down to fully specified algorithms. We pay particular attention to online variants of policy iteration and provide a numerical example highlighting the behavior of representative offline and online methods. For the policy evaluation component, as well as for the overall resulting approximate policy iteration, we provide guarantees on the performance obtained asymptotically, as the number of samples processed and iterations executed grows to infinity. We also provide finite-sample results, which apply when a finite number of samples and iterations are considered. Finally, we outline several extensions and improvements to the techniques and methods reviewed.
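
    A minimal sketch of the first of these techniques, least-squares temporal difference (LSTD), may help fix ideas. The sample format, feature dimension, and regularization term below are illustrative assumptions, not details taken from the chapter.

        # Minimal LSTD(0) sketch for linear policy evaluation, assuming a batch of
        # transition samples (phi_s, reward, phi_next) collected under the policy
        # being evaluated. Names and defaults are illustrative only.
        import numpy as np

        def lstd(samples, n_features, gamma=0.95, reg=1e-6):
            """Estimate weights w so that V(s) is approximated by phi(s) @ w."""
            A = reg * np.eye(n_features)          # small regularization keeps A invertible
            b = np.zeros(n_features)
            for phi_s, reward, phi_next in samples:
                # Accumulate A = sum phi (phi - gamma * phi')^T and b = sum phi * r
                A += np.outer(phi_s, phi_s - gamma * phi_next)
                b += phi_s * reward
            return np.linalg.solve(A, b)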

    Finite-Sample Analysis of Least-Squares Policy Iteration

    In this paper, we report a performance bound for the widely used least-squares policy iteration (LSPI) algorithm. We first consider the problem of policy evaluation in reinforcement learning, i.e., learning the value function of a fixed policy, using the least-squares temporal-difference (LSTD) learning method, and report a finite-sample analysis for this algorithm. To do so, we first derive a bound on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function. This result is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain. We then derive generalization bounds in the case when the Markov chain possesses a stationary distribution and is β-mixing. Finally, we analyze how the error at each policy evaluation step is propagated through the iterations of a policy iteration method, and derive a performance bound for the LSPI algorithm.
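
    As a rough illustration of the LSPI scheme analyzed here, the sketch below alternates LSTD-Q policy evaluation with greedy policy improvement on a fixed batch of samples. The feature map phi, the action set, and the stopping tolerance are hypothetical choices for the example, not taken from the paper.

        # Illustrative LSPI loop built on LSTD-Q, assuming a fixed batch of samples
        # (s, a, r, s_next) and a state-action feature map phi(s, a). All names here
        # are assumptions for the sketch.
        import numpy as np

        def lstdq(samples, phi, policy, n_feat, gamma=0.95, reg=1e-6):
            """Evaluate Q^policy with linear features in one pass over the batch."""
            A = reg * np.eye(n_feat)
            b = np.zeros(n_feat)
            for s, a, r, s_next in samples:
                phi_sa = phi(s, a)
                phi_next = phi(s_next, policy(s_next))   # next action chosen by current policy
                A += np.outer(phi_sa, phi_sa - gamma * phi_next)
                b += phi_sa * r
            return np.linalg.solve(A, b)

        def lspi(samples, phi, actions, n_feat, gamma=0.95, n_iters=20):
            w = np.zeros(n_feat)
            for _ in range(n_iters):
                # Greedy policy with respect to the current Q estimate
                policy = lambda s, w=w: max(actions, key=lambda a: phi(s, a) @ w)
                w_new = lstdq(samples, phi, policy, n_feat, gamma)
                if np.linalg.norm(w_new - w) < 1e-6:     # stop when weights converge
                    return w_new
                w = w_new
            return w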

    On recursive temporal difference and eligibility traces

    This work studies a new reinforcement learning method in the framework of Recursive Least-Squares Temporal Difference (RLS-TD). In contrast to the standard mechanism of eligibility traces, which leads to RLS-TD(λ), we show that the forgetting factor commonly used in gradient-based estimation plays a role similar to that of eligibility traces. We adopt an instrumental-variable perspective to illustrate this point and propose a new algorithm, namely RLS-TD with forgetting factor (RLS-TD-f). We test the proposed algorithm in a policy iteration setting, i.e., when the performance of an initially stabilizing controller must be improved. We take the cart-pole benchmark as the experimental platform: extensive experiments show that the proposed RLS-TD algorithm exhibits larger performance improvements over the largest portion of the state space.
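
    A sketch of what a recursive least-squares TD update with a forgetting factor could look like is given below, written in the instrumental-variable spirit of the abstract. The symbols mu, P, and delta, and the TD(0) form of the instrument, are assumptions for illustration, not the paper's RLS-TD-f specification.

        # Sketch of recursive least-squares TD with a forgetting factor mu: old
        # information in the inverse correlation matrix P is discounted by 1/mu at
        # each step, so recent samples weigh more. Names are hypothetical.
        import numpy as np

        class RLSTDForgetting:
            def __init__(self, n_features, gamma=0.95, mu=0.99, delta=1.0):
                self.gamma = gamma
                self.mu = mu                           # forgetting factor in (0, 1]
                self.w = np.zeros(n_features)          # value-function weights
                self.P = np.eye(n_features) / delta    # inverse correlation matrix

            def update(self, phi_s, reward, phi_next):
                z = phi_s                              # instrumental variable (TD(0) case)
                d = phi_s - self.gamma * phi_next      # regression vector
                Pz = self.P @ z
                k = Pz / (self.mu + d @ Pz)            # gain vector
                td_error = reward - d @ self.w
                self.w += k * td_error
                self.P = (self.P - np.outer(k, d @ self.P)) / self.mu
                return td_error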