In this paper, we propose a novel reinforcement-learning algorithm consisting of a stochastic variance-reduced version of policy gradient for
solving Markov Decision Processes (MDPs). Stochastic variance-reduced gradient
(SVRG) methods have proven to be very successful in supervised learning.
However, their adaptation to policy gradient is not straightforward and needs
to account for I) a non-concave objective function; II) approximations in the full gradient computation; and III) a non-stationary sampling process. The result is SVRPG, a stochastic variance-reduced policy gradient algorithm that leverages importance weights to preserve the unbiasedness of the gradient estimate. Under standard assumptions on the MDP, we provide convergence
guarantees for SVRPG with a convergence rate that is linear under increasing
batch sizes. Finally, we suggest practical variants of SVRPG, and we
empirically evaluate them on continuous MDPs.
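
To make the variance-reduction idea concrete, the following is a minimal, illustrative sketch, not the paper's implementation, of an SVRG-style policy gradient update that applies importance weights to the snapshot term so the semi-stochastic gradient stays unbiased when trajectories are sampled from the current policy. It uses a toy one-step continuous problem with a Gaussian policy; the function names, the quadratic reward, and all hyperparameters (N, B, m, learning rate, SIGMA) are assumptions made for this example.

import numpy as np

rng = np.random.default_rng(0)
SIGMA = 1.0  # fixed policy standard deviation (assumption for this toy example)

def reward(a):
    # Toy one-step problem: quadratic reward peaked at a = 2.
    return -(a - 2.0) ** 2

def sample_actions(theta, n):
    # Sample n actions from the Gaussian policy N(theta, SIGMA^2).
    return theta + SIGMA * rng.standard_normal(n)

def pg_estimate(theta, a):
    # REINFORCE-style gradient term: reward times score function of the policy.
    return reward(a) * (a - theta) / SIGMA ** 2

def importance_weight(a, theta_cur, theta_snap):
    # pi_snapshot(a) / pi_current(a): corrects for actions sampled from theta_cur,
    # keeping the snapshot gradient term unbiased in expectation.
    log_w = (-(a - theta_snap) ** 2 + (a - theta_cur) ** 2) / (2 * SIGMA ** 2)
    return np.exp(log_w)

def svrpg_sketch(theta0, epochs=20, N=500, B=10, m=10, lr=0.05):
    theta = theta0
    for _ in range(epochs):
        theta_snap = theta
        # Large-batch ("full") gradient estimate at the snapshot parameters.
        a_full = sample_actions(theta_snap, N)
        mu = np.mean(pg_estimate(theta_snap, a_full))
        for _ in range(m):
            a = sample_actions(theta, B)  # sampled from the *current* policy
            w = importance_weight(a, theta, theta_snap)
            # Semi-stochastic gradient: current term minus reweighted snapshot term plus mu.
            v = np.mean(pg_estimate(theta, a) - w * pg_estimate(theta_snap, a)) + mu
            theta = theta + lr * v  # gradient ascent on expected return
    return theta

print(svrpg_sketch(theta0=-1.0))  # should approach 2.0, the maximizer of the toy objective

In this sketch the snapshot gradient mu is recomputed once per epoch, while each inner update pays only a small batch; the importance weight w is what allows the correction term evaluated at the snapshot parameters to be computed from samples drawn by the current policy without introducing bias.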