
    A Finite Time Analysis of Two Time-Scale Actor Critic Methods

    Actor-critic (AC) methods have exhibited great empirical success compared with other reinforcement learning algorithms, where the actor uses the policy gradient to improve the learning policy and the critic uses temporal difference learning to estimate the policy gradient. Under the two time-scale learning rate schedule, the asymptotic convergence of AC has been well studied in the literature. However, the non-asymptotic convergence and finite sample complexity of actor-critic methods remain largely open. In this work, we provide a non-asymptotic analysis for two time-scale actor-critic methods in the non-i.i.d. setting. We prove that the actor-critic method is guaranteed to find a first-order stationary point (i.e., $\|\nabla J(\boldsymbol{\theta})\|_2^2 \le \epsilon$) of the non-concave performance function $J(\boldsymbol{\theta})$ with $\tilde{\mathcal{O}}(\epsilon^{-2.5})$ sample complexity. To the best of our knowledge, this is the first work providing a finite-time analysis and sample complexity bound for two time-scale actor-critic methods.
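    As a rough illustration of the update structure described in this abstract, the following is a minimal sketch of a two time-scale actor-critic loop, with a faster-decaying critic step size and a slower actor step size. The environment interface, the linear value features phi, the tabular softmax policy, and the step-size exponents are all assumptions for the example, not the paper's exact algorithm.

```python
# Minimal two time-scale actor-critic sketch (illustrative assumptions:
# env.reset()/env.step(), feature map phi, tabular softmax policy).
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def two_timescale_ac(env, phi, n_states, n_actions, d, T=10_000, gamma=0.99, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((n_states, n_actions))  # actor (policy) parameters
    w = np.zeros(d)                          # critic (value) parameters
    s = env.reset()
    for t in range(1, T + 1):
        alpha = 1.0 / t ** 0.6   # critic step size: larger, runs on the faster time scale
        beta = 1.0 / t ** 0.9    # actor step size: smaller, runs on the slower time scale
        pi = softmax(theta[s])
        a = rng.choice(n_actions, p=pi)
        s_next, r, done = env.step(a)
        # Critic: one-step TD(0) update on the value estimate V(s) ~ phi(s) @ w.
        delta = r + gamma * (0.0 if done else phi(s_next) @ w) - phi(s) @ w
        w += alpha * delta * phi(s)
        # Actor: policy-gradient step using the TD error as an advantage estimate.
        grad_log = -pi
        grad_log[a] += 1.0
        theta[s] += beta * delta * grad_log
        s = env.reset() if done else s_next
    return theta, w
```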

    Improved Sample Complexity Analysis of Natural Policy Gradient Algorithm with General Parameterization for Infinite Horizon Discounted Reward Markov Decision Processes

    We consider the problem of designing sample-efficient learning algorithms for infinite horizon discounted reward Markov Decision Processes. Specifically, we propose the Accelerated Natural Policy Gradient (ANPG) algorithm, which utilizes an accelerated stochastic gradient descent process to obtain the natural policy gradient. ANPG achieves $\mathcal{O}(\epsilon^{-2})$ sample complexity and $\mathcal{O}(\epsilon^{-1})$ iteration complexity with general parameterization, where $\epsilon$ defines the optimality error. This improves the state-of-the-art sample complexity by a $\log(\frac{1}{\epsilon})$ factor. ANPG is a first-order algorithm and, unlike some existing literature, does not require the unverifiable assumption that the variance of importance sampling (IS) weights is upper bounded. In the class of Hessian-free and IS-free algorithms, ANPG beats the best-known sample complexity by a factor of $\mathcal{O}(\epsilon^{-\frac{1}{2}})$ and simultaneously matches their state-of-the-art iteration complexity.
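    For intuition on how a natural gradient direction can be obtained with first-order, Hessian-free machinery, here is a minimal sketch that solves the compatible least-squares problem $\min_w \mathbb{E}[(w^\top \nabla \log \pi - A)^2]$ by stochastic gradient descent with momentum over sampled (score, advantage) pairs. The sampler sample_grad_logpi_and_adv is hypothetical, plain momentum is used as a stand-in for the accelerated inner loop mentioned in the abstract, and all hyperparameters are illustrative.

```python
# Sketch: approximate the natural gradient F(theta)^{-1} grad J(theta)
# without forming the Fisher matrix, by solving a least-squares problem
# with stochastic gradient steps (momentum as an illustrative accelerator).
import numpy as np

def npg_direction(sample_grad_logpi_and_adv, d, iters=1000, lr=0.05, momentum=0.9, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(d)  # candidate natural-gradient direction
    v = np.zeros(d)  # momentum buffer (stand-in for the accelerated SGD inner loop)
    for _ in range(iters):
        g_logpi, adv = sample_grad_logpi_and_adv(rng)  # score vector and advantage sample
        # Stochastic gradient of 0.5 * (w . g_logpi - adv)^2 with respect to w.
        grad = (w @ g_logpi - adv) * g_logpi
        v = momentum * v - lr * grad
        w = w + v
    return w

# Outer policy update then steps along the estimated natural gradient:
#   theta <- theta + eta * npg_direction(...)
```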

    Distributed Reinforcement Learning in Multi-Agent Networked Systems

    We study distributed reinforcement learning (RL) for a network of agents. The objective is to find localized policies that maximize the (discounted) global reward. In general, scalability is a challenge in this setting because the size of the global state/action space can be exponential in the number of agents. Scalable algorithms are only known in cases where dependencies are local, e.g., between neighbors. In this work, we propose a Scalable Actor Critic framework that applies in settings where the dependencies are non-local and provide a finite-time error bound that shows how the convergence rate depends on the depth of the dependencies in the network. Additionally, as a byproduct of our analysis, we obtain novel finite-time convergence results for a general stochastic approximation scheme and for temporal difference learning with state aggregation that apply beyond the setting of RL in networked systems.
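    To make the scalability idea concrete, the sketch below shows truncated temporal difference learning in which each agent's value estimate is indexed only by the states of agents in its own neighborhood, so the table size grows with the neighborhood depth rather than with the total number of agents. The trajectory format, neighborhoods, and per-agent rewards are assumptions for the example; this is not the paper's exact Scalable Actor Critic framework.

```python
# Sketch: per-agent TD(0) with neighborhood truncation (illustrative, not the
# paper's algorithm).  Each agent i keeps a tabular value estimate over the
# joint state of the agents in neighborhoods[i] only.
import numpy as np

def truncated_td(trajectory, neighborhoods, n_agents, n_local_states,
                 gamma=0.99, alpha=0.1):
    # One table per agent; its shape depends on the neighborhood size,
    # not on the exponentially large global state space.
    V = [np.zeros(tuple([n_local_states] * len(neighborhoods[i])))
         for i in range(n_agents)]
    for state, rewards, next_state in trajectory:
        for i in range(n_agents):
            idx = tuple(state[j] for j in neighborhoods[i])
            idx_next = tuple(next_state[j] for j in neighborhoods[i])
            # TD(0) update driven only by agent i's local reward and local view.
            delta = rewards[i] + gamma * V[i][idx_next] - V[i][idx]
            V[i][idx] += alpha * delta
    return V
```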