    Generalized Emphatic Temporal Difference Learning: Bias-Variance Analysis

    We consider the off-policy evaluation problem in Markov decision processes with function approximation. We propose a generalization of the recently introduced emphatic temporal differences (ETD) algorithm (Sutton et al., 2015), which encompasses the original ETD(λ), as well as several other off-policy evaluation algorithms, as special cases. We call this framework ETD(λ, β), where our introduced parameter β controls the decay rate of an importance-sampling term. We study conditions under which the projected fixed-point equation underlying ETD(λ, β) involves a contraction operator, allowing us to present the first asymptotic error bounds (bias) for ETD(λ, β). Our results show that the original ETD algorithm always involves a contraction operator, and its bias is bounded. Moreover, by controlling β, our proposed generalization allows trading off bias for variance reduction, thereby achieving a lower total error. Comment: arXiv admin note: text overlap with arXiv:1508.0341
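
    For orientation, the role of β can be written out as a short recursion. This is a sketch under stated assumptions, not a quotation from the paper: it takes the usual linear emphatic TD recursions with constant γ and λ and unit interest in every state, and lets β replace γ in the follow-on trace. The symbols ρ_t (importance-sampling ratio), F_t (follow-on trace), M_t (emphasis), e_t (eligibility trace), φ_t (features), θ_t (weights), α (step size) and δ_t (TD error) are not defined in the abstract and are introduced here for illustration.

```latex
\begin{align*}
F_t &= \beta\,\rho_{t-1}\,F_{t-1} + 1, \qquad F_0 = 1,\\
M_t &= \lambda + (1-\lambda)\,F_t,\\
e_t &= \rho_t\bigl(\gamma\lambda\,e_{t-1} + M_t\,\phi_t\bigr),\\
\delta_t &= R_{t+1} + \gamma\,\theta_t^{\top}\phi_{t+1} - \theta_t^{\top}\phi_t,\\
\theta_{t+1} &= \theta_t + \alpha\,\delta_t\,e_t .
\end{align*}
```

    Under these assumptions, β = γ gives the follow-on trace of the original ETD(λ), while β = 0 gives F_t = 1 and hence constant emphasis M_t = 1, which corresponds to ordinary importance-weighted off-policy TD(λ). Intermediate values shorten the product of importance-sampling ratios that F_t accumulates, which is where the bias-for-variance trade described above comes from.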

    An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning

    In this paper we introduce the idea of improving the performance of parametric temporal-difference (TD) learning algorithms by selectively emphasizing or de-emphasizing their updates on different time steps. In particular, we show that varying the emphasis of linear TD(λ)'s updates in a particular way causes its expected update to become stable under off-policy training. The only prior model-free TD methods to achieve this with per-step computation linear in the number of function approximation parameters are the gradient-TD family of methods including TDC, GTD(λ), and GQ(λ). Compared to these methods, our emphatic TD(λ) is simpler and easier to use; it has only one learned parameter vector and one step-size parameter. Our treatment includes general state-dependent discounting and bootstrapping functions, and a way of specifying varying degrees of interest in accurately valuing different states. Comment: 29 pages. This is a significant revision based on the first set of reviews. The most important change was to signal early that the main result is about stability, not convergenc
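
    The "one learned parameter vector and one step-size parameter" claim is easier to see in code. The following is a minimal sketch of linear emphatic TD(λ) over a single recorded trajectory, assuming constant discount `gamma`, constant bootstrapping `lam`, and a constant `interest` in every state (the abstract allows all three to depend on the state). The function name, argument layout, and default values are hypothetical and chosen only for illustration.

```python
import numpy as np


def emphatic_td_lambda(features, rewards, rhos, gamma=0.9, lam=0.0,
                       alpha=0.01, interest=1.0):
    """One pass of linear emphatic TD(lambda) over a single trajectory.

    features: (T+1, d) array; row t is the feature vector of state S_t
    rewards:  (T,) array; rewards[t] is R_{t+1}
    rhos:     (T,) array; rhos[t] = pi(A_t | S_t) / mu(A_t | S_t)
    """
    features = np.asarray(features, dtype=float)
    theta = np.zeros(features.shape[1])  # the single learned weight vector
    e = np.zeros_like(theta)             # eligibility trace
    F = 0.0                              # follow-on trace
    rho_prev = 1.0                       # unused at t = 0 because F starts at 0

    for t in range(len(rewards)):
        phi, phi_next = features[t], features[t + 1]

        # Follow-on trace: discounted, importance-weighted interest so far.
        F = rho_prev * gamma * F + interest
        # Emphasis placed on the update at time t.
        M = lam * interest + (1.0 - lam) * F
        # Emphasis-weighted eligibility trace.
        e = rhos[t] * (gamma * lam * e + M * phi)

        # Ordinary TD error and the single step-size update.
        delta = rewards[t] + gamma * theta @ phi_next - theta @ phi
        theta = theta + alpha * delta * e

        rho_prev = rhos[t]

    return theta
```

    With `rhos` fixed at 1 and `gamma * lam` equal to 0, the emphasis `M` simply counts discounted interest along the trajectory; off-policy, the product of ratios inside `F` is what re-weights the updates toward the target policy's state distribution.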

    Provably Convergent Two-Timescale Off-Policy Actor-Critic with Function Approximation

    We present the first provably convergent two-timescale off-policy actor-critic algorithm (COF-PAC) with function approximation. Key to COF-PAC is the introduction of a new critic, the emphasis critic, which is trained via Gradient Emphasis Learning (GEM), a novel combination of the key ideas of Gradient Temporal Difference Learning and Emphatic Temporal Difference Learning. With the help of the emphasis critic and the canonical value function critic, we show convergence for COF-PAC, where the critics are linear and the actor can be nonlinear. Comment: ICML 202