An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning
In this paper we introduce the idea of improving the performance of
parametric temporal-difference (TD) learning algorithms by selectively
emphasizing or de-emphasizing their updates on different time steps. In
particular, we show that varying the emphasis of linear TD()'s updates
in a particular way causes its expected update to become stable under
off-policy training. The only prior model-free TD methods to achieve this with
per-step computation linear in the number of function approximation parameters
are the gradient-TD family of methods including TDC, GTD(λ), and
GQ(λ). Compared to these methods, our _emphatic TD(λ)_ is
simpler and easier to use; it has only one learned parameter vector and one
step-size parameter. Our treatment includes general state-dependent discounting
and bootstrapping functions, and a way of specifying varying degrees of
interest in accurately valuing different states.

Comment: 29 pages. This is a significant revision based on the first set of
reviews. The most important change was to signal early that the main result
is about stability, not convergence.
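The update described in the abstract can be sketched in a few lines. The following is a minimal illustration of an emphatic TD(λ) step with linear function approximation, assuming the standard followon-trace/emphasis formulation; the class name, parameter names, and the toy data are illustrative, not taken from the paper.

```python
import numpy as np

class EmphaticTD:
    """Sketch of emphatic TD(lambda) with linear function approximation."""

    def __init__(self, n_features, alpha=0.05):
        self.w = np.zeros(n_features)   # the single learned parameter vector
        self.e = np.zeros(n_features)   # emphatic eligibility trace
        self.F = 0.0                    # followon trace
        self.alpha = alpha              # the single step-size parameter

    def update(self, x, x_next, reward, rho, gamma, lam, interest=1.0):
        """Process one off-policy transition.

        x, x_next : feature vectors of the current and next state
        rho       : importance-sampling ratio pi(a|s) / mu(a|s)
        gamma,lam : state-dependent discount and bootstrapping values
        interest  : degree of interest in accurately valuing this state
        """
        self.F = gamma * self.F + interest             # followon trace
        M = lam * interest + (1.0 - lam) * self.F      # emphasis
        self.e = rho * (gamma * lam * self.e + M * x)  # emphatic trace
        delta = reward + gamma * self.w @ x_next - self.w @ x  # TD error
        self.w = self.w + self.alpha * delta * self.e
        self.F = rho * self.F  # carry rho into the next followon update
        return delta


# Toy usage: 50 steps on random features with rho = 1 (on-policy case).
rng = np.random.default_rng(0)
agent = EmphaticTD(n_features=4)
x = rng.random(4)
for _ in range(50):
    x_next = rng.random(4)
    agent.update(x, x_next, reward=rng.random(),
                 rho=1.0, gamma=0.9, lam=0.8)
    x = x_next
```

Note that all per-step work is a handful of vector operations on `w` and `e`, i.e. linear in the number of function approximation parameters, matching the complexity claim in the abstract.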
Generalized Off-Policy Actor-Critic
We propose a new objective, the counterfactual objective, unifying existing
objectives for off-policy policy gradient algorithms in the continuing
reinforcement learning (RL) setting. Compared to the commonly used excursion
objective, which can be misleading about the performance of the target policy
when deployed, our new objective better predicts such performance. We prove the
Generalized Off-Policy Policy Gradient Theorem to compute the policy gradient
of the counterfactual objective and use an emphatic approach to get an unbiased
sample from this policy gradient, yielding the Generalized Off-Policy
Actor-Critic (Geoff-PAC) algorithm. We demonstrate the merits of Geoff-PAC over
existing algorithms in MuJoCo robot simulation tasks, which constitutes the
first empirical success of emphatic algorithms in prevailing deep RL
benchmarks.

Comment: NeurIPS 201