
    A Finite Time Analysis of Two Time-Scale Actor Critic Methods

    Actor-critic (AC) methods have exhibited great empirical success compared with other reinforcement learning algorithms, where the actor uses the policy gradient to improve the learning policy and the critic uses temporal difference learning to estimate the policy gradient. Under the two time-scale learning rate schedule, the asymptotic convergence of AC has been well studied in the literature. However, the non-asymptotic convergence and finite sample complexity of actor-critic methods remain largely open. In this work, we provide a non-asymptotic analysis for two time-scale actor-critic methods in the non-i.i.d. setting. We prove that the actor-critic method is guaranteed to find a first-order stationary point (i.e., $\|\nabla J(\boldsymbol{\theta})\|_2^2 \le \epsilon$) of the non-concave performance function $J(\boldsymbol{\theta})$ with $\tilde{\mathcal{O}}(\epsilon^{-2.5})$ sample complexity. To the best of our knowledge, this is the first work providing a finite-time analysis and sample complexity bound for two time-scale actor-critic methods.
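    As a rough illustration of the update structure described in this abstract, the following is a minimal sketch of a two time-scale actor-critic loop, with a faster-decaying critic step size and a slower actor step size. The environment interface, the linear value features phi, the tabular softmax policy, and the step-size exponents are all assumptions for the example, not the paper's exact algorithm.

```python
# Minimal two time-scale actor-critic sketch (illustrative assumptions:
# env.reset()/env.step(), feature map phi, tabular softmax policy).
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def two_timescale_ac(env, phi, n_states, n_actions, d, T=10_000, gamma=0.99, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((n_states, n_actions))  # actor (policy) parameters
    w = np.zeros(d)                          # critic (value) parameters
    s = env.reset()
    for t in range(1, T + 1):
        alpha = 1.0 / t ** 0.6   # critic step size: larger, runs on the faster time scale
        beta = 1.0 / t ** 0.9    # actor step size: smaller, runs on the slower time scale
        pi = softmax(theta[s])
        a = rng.choice(n_actions, p=pi)
        s_next, r, done = env.step(a)
        # Critic: one-step TD(0) update on the value estimate V(s) ~ phi(s) @ w.
        delta = r + gamma * (0.0 if done else phi(s_next) @ w) - phi(s) @ w
        w += alpha * delta * phi(s)
        # Actor: policy-gradient step using the TD error as an advantage estimate.
        grad_log = -pi
        grad_log[a] += 1.0
        theta[s] += beta * delta * grad_log
        s = env.reset() if done else s_next
    return theta, w
```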

    Improved Sample Complexity Analysis of Natural Policy Gradient Algorithm with General Parameterization for Infinite Horizon Discounted Reward Markov Decision Processes

    We consider the problem of designing sample-efficient learning algorithms for infinite horizon discounted reward Markov Decision Processes. Specifically, we propose the Accelerated Natural Policy Gradient (ANPG) algorithm, which utilizes an accelerated stochastic gradient descent process to obtain the natural policy gradient. ANPG achieves $\mathcal{O}(\epsilon^{-2})$ sample complexity and $\mathcal{O}(\epsilon^{-1})$ iteration complexity with general parameterization, where $\epsilon$ defines the optimality error. This improves the state-of-the-art sample complexity by a $\log(\frac{1}{\epsilon})$ factor. ANPG is a first-order algorithm and, unlike some existing literature, does not require the unverifiable assumption that the variance of importance sampling (IS) weights is upper bounded. In the class of Hessian-free and IS-free algorithms, ANPG beats the best-known sample complexity by a factor of $\mathcal{O}(\epsilon^{-\frac{1}{2}})$ and simultaneously matches their state-of-the-art iteration complexity.
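    For intuition on how a natural gradient direction can be obtained with first-order, Hessian-free machinery, here is a minimal sketch that solves the compatible least-squares problem $\min_w \mathbb{E}[(w^\top \nabla \log \pi - A)^2]$ by stochastic gradient descent with momentum over sampled (score, advantage) pairs. The sampler sample_grad_logpi_and_adv is hypothetical, plain momentum is used as a stand-in for the accelerated inner loop mentioned in the abstract, and all hyperparameters are illustrative.

```python
# Sketch: approximate the natural gradient F(theta)^{-1} grad J(theta)
# without forming the Fisher matrix, by solving a least-squares problem
# with stochastic gradient steps (momentum as an illustrative accelerator).
import numpy as np

def npg_direction(sample_grad_logpi_and_adv, d, iters=1000, lr=0.05, momentum=0.9, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(d)  # candidate natural-gradient direction
    v = np.zeros(d)  # momentum buffer (stand-in for the accelerated SGD inner loop)
    for _ in range(iters):
        g_logpi, adv = sample_grad_logpi_and_adv(rng)  # score vector and advantage sample
        # Stochastic gradient of 0.5 * (w . g_logpi - adv)^2 with respect to w.
        grad = (w @ g_logpi - adv) * g_logpi
        v = momentum * v - lr * grad
        w = w + v
    return w

# Outer policy update then steps along the estimated natural gradient:
#   theta <- theta + eta * npg_direction(...)
```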

    Distributed Reinforcement Learning in Multi-Agent Networked Systems

    We study distributed reinforcement learning (RL) for a network of agents. The objective is to find localized policies that maximize the (discounted) global reward. In general, scalability is a challenge in this setting because the size of the global state/action space can be exponential in the number of agents. Scalable algorithms are only known in cases where dependencies are local, e.g., between neighbors. In this work, we propose a Scalable Actor Critic framework that applies in settings where the dependencies are non-local and provide a finite-time error bound that shows how the convergence rate depends on the depth of the dependencies in the network. Additionally, as a byproduct of our analysis, we obtain novel finite-time convergence results for a general stochastic approximation scheme and for temporal difference learning with state aggregation that apply beyond the setting of RL in networked systems.
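    To make the scalability idea concrete, the sketch below shows truncated temporal difference learning in which each agent's value estimate is indexed only by the states of agents in its own neighborhood, so the table size grows with the neighborhood depth rather than with the total number of agents. The trajectory format, neighborhoods, and per-agent rewards are assumptions for the example; this is not the paper's exact Scalable Actor Critic framework.

```python
# Sketch: per-agent TD(0) with neighborhood truncation (illustrative, not the
# paper's algorithm).  Each agent i keeps a tabular value estimate over the
# joint state of the agents in neighborhoods[i] only.
import numpy as np

def truncated_td(trajectory, neighborhoods, n_agents, n_local_states,
                 gamma=0.99, alpha=0.1):
    # One table per agent; its shape depends on the neighborhood size,
    # not on the exponentially large global state space.
    V = [np.zeros(tuple([n_local_states] * len(neighborhoods[i])))
         for i in range(n_agents)]
    for state, rewards, next_state in trajectory:
        for i in range(n_agents):
            idx = tuple(state[j] for j in neighborhoods[i])
            idx_next = tuple(next_state[j] for j in neighborhoods[i])
            # TD(0) update driven only by agent i's local reward and local view.
            delta = rewards[i] + gamma * V[i][idx_next] - V[i][idx]
            V[i][idx] += alpha * delta
    return V
```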