Generalized Emphatic Temporal Difference Learning: Bias-Variance Analysis

Authors: Hallak, Assaf; Mannor, Shie; Munos, Remi; Tamar, Aviv
Publication date: 27/11/2015

We consider the off-policy evaluation problem in Markov decision processes with function approximation. We propose a generalization of the recently introduced emphatic temporal differences (ETD) algorithm (Sutton et al., 2015), which encompasses the original ETD($\lambda$), as well as several other off-policy evaluation algorithms, as special cases. We call this framework ETD($\lambda$, $\beta$), where our introduced parameter $\beta$ controls the decay rate of an importance-sampling term. We study conditions under which the projected fixed-point equation underlying ETD($\lambda$, $\beta$) involves a contraction operator, allowing us to present the first asymptotic error bounds (bias) for ETD($\lambda$, $\beta$). Our results show that the original ETD algorithm always involves a contraction operator, and its bias is bounded. Moreover, by controlling $\beta$, our proposed generalization allows trading off bias for variance reduction, thereby achieving a lower total error.

Comment: arXiv admin note: text overlap with arXiv:1508.0341
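The role of the decay parameter can be illustrated with a small sketch. The abstract does not state the recursion, so the form used here, F_t = beta * rho_{t-1} * F_{t-1} + 1 with importance-sampling ratios rho, is an assumption, and the function name `followon_trace` is hypothetical.

```python
def followon_trace(rhos, beta):
    """Return [F_0, F_1, ...] for the ASSUMED recursion
    F_t = beta * rho_{t-1} * F_{t-1} + 1, starting from F_0 = 1.

    rhos: importance-sampling ratios rho_0, rho_1, ...
    beta: decay rate applied to the accumulated importance-sampling term.
    """
    F = 0.0
    rho_prev = 0.0  # no previous ratio at t = 0, so F_0 = 1
    out = []
    for rho in rhos:
        F = beta * rho_prev * F + 1.0
        out.append(F)
        rho_prev = rho
    return out


# With large ratios, a smaller beta keeps the trace from growing.
print(followon_trace([2.0, 2.0, 2.0], beta=0.0))
print(followon_trace([2.0, 2.0, 2.0], beta=0.5))
```

Under this assumed form, a smaller `beta` damps the running product of ratios (less variance) at the cost of weighting states differently (more bias), which is the trade-off the abstract describes.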

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning

Authors: Mahmood, A. Rupam; Sutton, Richard S.; White, Martha
Publication date: 20/04/2015

In this paper we introduce the idea of improving the performance of parametric temporal-difference (TD) learning algorithms by selectively emphasizing or de-emphasizing their updates on different time steps. In particular, we show that varying the emphasis of linear TD($\lambda$)'s updates in a particular way causes its expected update to become stable under off-policy training. The only prior model-free TD methods to achieve this with per-step computation linear in the number of function approximation parameters are the gradient-TD family of methods including TDC, GTD($\lambda$), and GQ($\lambda$). Compared to these methods, our emphatic TD($\lambda$) is simpler and easier to use; it has only one learned parameter vector and one step-size parameter. Our treatment includes general state-dependent discounting and bootstrapping functions, and a way of specifying varying degrees of interest in accurately valuing different states.

Comment: 29 pages. This is a significant revision based on the first set of reviews. The most important change was to signal early that the main result is about stability, not convergenc
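The emphasis mechanism summarized in this abstract can be sketched as follows. This is a minimal sketch, assuming linear features and constant discount `gamma`, bootstrapping `lam`, and interest. The function name, the argument layout, and the recursions for the follow-on trace `F`, emphasis `M`, and eligibility trace `e` are not given in the abstract. They follow the commonly cited form of the algorithm and should be treated as assumptions.

```python
import numpy as np


def emphatic_td_lambda(transitions, n_features, alpha=0.01,
                       gamma=0.9, lam=0.0, interest=1.0):
    """Linear emphatic TD(lambda) sketch (hypothetical helper).

    transitions: iterable of (phi, rho, reward, phi_next), where phi and
    phi_next are feature vectors and rho is the importance-sampling ratio
    of the action taken from the state with features phi.
    """
    theta = np.zeros(n_features)  # the single learned parameter vector
    e = np.zeros(n_features)      # eligibility trace
    F = 0.0                       # follow-on trace
    rho_prev = 0.0                # no previous ratio at the first step
    for phi, rho, reward, phi_next in transitions:
        F = rho_prev * gamma * F + interest        # follow-on trace
        M = lam * interest + (1.0 - lam) * F       # emphasis on this step
        e = rho * (gamma * lam * e + M * phi)      # emphasized trace
        delta = reward + gamma * (theta @ phi_next) - theta @ phi
        theta = theta + alpha * delta * e          # one step-size parameter
        rho_prev = rho
    return theta


# On-policy sanity check: one state with a self-loop and reward 1,
# so the true value is 1 / (1 - gamma) = 2 for gamma = 0.5.
phi = np.array([1.0])
theta = emphatic_td_lambda([(phi, 1.0, 1.0, phi)] * 2000,
                           n_features=1, alpha=0.05, gamma=0.5)
print(theta)
```

The sketch keeps only one weight vector and one step size, matching the simplicity claim in the abstract. State-dependent discounting, bootstrapping, and interest would replace the constants with per-step values.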

arXiv.org e-Print Archive

CiteSeerX

Provably Convergent Two-Timescale Off-Policy Actor-Critic with Function Approximation

Authors: Liu, Bo; Whiteson, Shimon; Yao, Hengshuai; Zhang, Shangtong
Publication date: 01/01/2020

We present the first provably convergent two-timescale off-policy actor-critic algorithm (COF-PAC) with function approximation. Key to COF-PAC is the introduction of a new critic, the emphasis critic, which is trained via Gradient Emphasis Learning (GEM), a novel combination of the key ideas of Gradient Temporal Difference Learning and Emphatic Temporal Difference Learning. With the help of the emphasis critic and the canonical value function critic, we show convergence for COF-PAC, where the critics are linear and the actor can be nonlinear.

Comment: ICML 202

arXiv.org e-Print Archive

Oxford University Research Archive