Generalized Emphatic Temporal Difference Learning: Bias-Variance Analysis
We consider the off-policy evaluation problem in Markov decision processes
with function approximation. We propose a generalization of the recently
introduced \emph{emphatic temporal differences} (ETD) algorithm
\citep{SuttonMW15}, which encompasses the original ETD(λ), as well as
several other off-policy evaluation algorithms as special cases. We call this
framework ETD(λ, β), where our introduced parameter β controls the decay rate
of an importance-sampling term. We study conditions under which the projected
fixed-point equation underlying ETD(λ, β) involves a contraction operator,
allowing us to present the first asymptotic error bounds (bias) for ETD(λ, β).
Our results show that the original ETD algorithm always involves a contraction
operator, and its bias is bounded. Moreover, by controlling β, our proposed
generalization allows trading off bias for variance reduction, thereby
achieving a lower total error.
Comment: arXiv admin note: text overlap with arXiv:1508.0341
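The decay parameter described above can be sketched as follows. This is our own reconstruction, assuming the followon-trace formulation of \citep{SuttonMW15}, with F_t the followon trace, ρ_t the importance-sampling ratio, and I_t the interest; it is a sketch of the idea, not the paper's exact recursion:

```latex
% Followon trace with a tunable decay parameter \beta:
F_t = \beta \, \rho_{t-1} F_{t-1} + I_t
% Setting \beta = \gamma recovers the original ETD trace; a smaller
% \beta shortens the effective horizon of the product of importance-
% sampling ratios, trading added bias for reduced variance.
```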
An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning
In this paper we introduce the idea of improving the performance of
parametric temporal-difference (TD) learning algorithms by selectively
emphasizing or de-emphasizing their updates on different time steps. In
particular, we show that varying the emphasis of linear TD(λ)'s updates
in a particular way causes its expected update to become stable under
off-policy training. The only prior model-free TD methods to achieve this with
per-step computation linear in the number of function approximation parameters
are the gradient-TD family of methods including TDC, GTD(λ), and
GQ(λ). Compared to these methods, our \emph{emphatic TD(λ)} is
simpler and easier to use; it has only one learned parameter vector and one
step-size parameter. Our treatment includes general state-dependent discounting
and bootstrapping functions, and a way of specifying varying degrees of
interest in accurately valuing different states.
Comment: 29 pages. This is a significant revision based on the first set of
reviews. The most important change was to signal early that the main result
is about stability, not convergence.
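The emphasis-weighted update this abstract describes can be sketched as a single linear TD(λ) step. This is a minimal, hypothetical implementation; the function name, argument layout, and the simplified one-ratio followon recursion are our assumptions, not the authors' code:

```python
import numpy as np

def emphatic_td_update(theta, e, F, phi, phi_next, reward,
                       rho, gamma, lam, interest, alpha):
    """One emphatic TD(lambda) step with linear function approximation.

    theta: weight vector; e: eligibility trace; F: scalar followon trace.
    rho is the importance-sampling ratio pi(a|s)/mu(a|s) of the taken action;
    interest is the user-specified degree of interest in the current state.
    """
    # Followon trace: discounted, importance-weighted accumulation of interest
    # (simplified here to use the current step's ratio).
    F = rho * gamma * F + interest
    # Emphasis: blend of instantaneous interest and the followon trace.
    M = lam * interest + (1.0 - lam) * F
    # Eligibility trace, scaled by the emphasis and the IS ratio.
    e = rho * (gamma * lam * e + M * phi)
    # Standard linear TD error and emphasis-weighted update.
    delta = reward + gamma * theta @ phi_next - theta @ phi
    theta = theta + alpha * delta * e
    return theta, e, F
```

Note the single learned parameter vector `theta` and single step size `alpha`, which is the simplicity the abstract contrasts with the gradient-TD family.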
Provably Convergent Two-Timescale Off-Policy Actor-Critic with Function Approximation
We present the first provably convergent two-timescale off-policy
actor-critic algorithm (COF-PAC) with function approximation. Key to COF-PAC is
the introduction of a new critic, the emphasis critic, which is trained via
Gradient Emphasis Learning (GEM), a novel combination of the key ideas of
Gradient Temporal Difference Learning and Emphatic Temporal Difference
Learning. With the help of the emphasis critic and the canonical value function
critic, we show convergence for COF-PAC, where the critics are linear and the
actor can be nonlinear.
Comment: ICML 2020