Breaking the Deadly Triad with a Target Network

Authors: Shimon Whiteson, Hengshuai Yao, Shangtong Zhang
Publication date: 21/07/2021

The deadly triad refers to the instability of a reinforcement learning algorithm when it employs off-policy learning, function approximation, and bootstrapping simultaneously. In this paper, we investigate the target network as a tool for breaking the deadly triad, providing theoretical support for the conventional wisdom that a target network stabilizes training. We first propose and analyze a novel target network update rule which augments the commonly used Polyak-averaging style update with two projections. We then apply the target network and ridge regularization in several divergent algorithms and show their convergence to regularized TD fixed points. Those algorithms are off-policy with linear function approximation and bootstrapping, spanning both policy evaluation and control, as well as both discounted and average-reward settings. In particular, we provide the first convergent linear Q-learning algorithms under nonrestrictive and changing behavior policies without bi-level optimization.

Comment: ICML 202
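The update rule described in the abstract above can be sketched roughly as follows. The abstract does not say which projections are used, so this sketch assumes both are Euclidean projections onto origin-centered balls; the function names, the radii, and the step size are hypothetical and chosen only for illustration.

```python
import math


def project_to_ball(vec, radius):
    """Project vec onto the origin-centered Euclidean ball of the given radius."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm <= radius:
        return list(vec)
    scale = radius / norm
    return [x * scale for x in vec]


def target_update(target, online, step, inner_radius, outer_radius):
    """One Polyak-style target update wrapped in two projections (assumed form).

    The online weights are projected first, mixed into the target weights
    with a Polyak-averaging step, and the result is projected again.
    """
    clipped_online = project_to_ball(online, inner_radius)
    mixed = [t + step * (w - t) for t, w in zip(target, clipped_online)]
    return project_to_ball(mixed, outer_radius)


# Hypothetical values: the online weights have norm 5 and are clipped to norm 1.
target = [0.0, 0.0]
online = [3.0, 4.0]
new_target = target_update(target, online, step=0.5,
                           inner_radius=1.0, outer_radius=2.0)
print(new_target)
```

Without the two `project_to_ball` calls this reduces to the plain Polyak-averaging update that the abstract calls the commonly used rule.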

arXiv.org e-Print Archive

Oxford University Research Archive

A Dantzig Selector Approach to Temporal Difference Learning

Authors: Matthieu Geist, Mohammad Ghavamzadeh, Alessandro Lazaric, Bruno Scherrer
Publication date: 26/06/2012

LSTD is a popular algorithm for value function approximation. Whenever the number of features is larger than the number of samples, it must be paired with some form of regularization. In particular, L1-regularization methods tend to perform feature selection by promoting sparsity, and thus are well-suited for high-dimensional problems. However, since LSTD is not a simple regression algorithm but solves a fixed-point problem, its integration with L1-regularization is not straightforward and might come with some drawbacks (e.g., the P-matrix assumption for LASSO-TD). In this paper, we introduce a novel algorithm obtained by integrating LSTD with the Dantzig Selector. We investigate the performance of the proposed algorithm and its relationship with the existing regularized approaches, and show how it addresses some of their drawbacks.

Comment: Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012)
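To illustrate why LSTD is a fixed-point solve rather than a plain regression, here is a minimal sketch of LSTD with a ridge (L2) penalty. This is not the Dantzig Selector method the paper proposes, and it is not an L1 method; it is the simplest regularized variant, shown only to make the structure concrete. The sample format, the function names, and the toy data are assumptions.

```python
def solve_linear(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting.

    Assumes the matrix is square and non-singular.
    """
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (aug[r][n] - tail) / aug[r][r]
    return sol


def lstd_ridge(samples, gamma, ridge):
    """Ridge-regularized LSTD: solve (A + ridge * I) theta = b.

    A = sum of phi (phi - gamma * phi_next)^T and b = sum of phi * reward,
    where each sample is a (phi, reward, phi_next) triple.
    """
    dim = len(samples[0][0])
    a = [[ridge if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    b = [0.0] * dim
    for phi, reward, phi_next in samples:
        for i in range(dim):
            b[i] += phi[i] * reward
            for j in range(dim):
                a[i][j] += phi[i] * (phi[j] - gamma * phi_next[j])
    return solve_linear(a, b)


# Toy data (hypothetical): two one-hot states, the second one terminal.
samples = [
    ([1.0, 0.0], 1.0, [0.0, 1.0]),
    ([0.0, 1.0], 0.0, [0.0, 0.0]),
]
theta = lstd_ridge(samples, gamma=0.9, ridge=1.0)
print(theta)
```

The matrix `A` is not symmetric because of the `gamma * phi_next` term, which is one way to see why penalties designed for ordinary least squares do not carry over directly.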

arXiv.org e-Print Archive

HAL-CentraleSupelec

HAL - Lille 3

INRIA a CCSD electronic archive server

Generalized Emphatic Temporal Difference Learning: Bias-Variance Analysis

Authors: Assaf Hallak, Shie Mannor, Remi Munos, Aviv Tamar
Publication date: 27/11/2015

We consider the off-policy evaluation problem in Markov decision processes with function approximation. We propose a generalization of the recently introduced emphatic temporal differences (ETD) algorithm (Sutton et al., 2015), which encompasses the original ETD($\lambda$), as well as several other off-policy evaluation algorithms as special cases. We call this framework ETD($\lambda$, $\beta$), where our introduced parameter $\beta$ controls the decay rate of an importance-sampling term. We study conditions under which the projected fixed-point equation underlying ETD($\lambda$, $\beta$) involves a contraction operator, allowing us to present the first asymptotic error bounds (bias) for ETD($\lambda$, $\beta$). Our results show that the original ETD algorithm always involves a contraction operator, and its bias is bounded. Moreover, by controlling $\beta$, our proposed generalization allows trading off bias for variance reduction, thereby achieving a lower total error.

Comment: arXiv admin note: text overlap with arXiv:1508.0341
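The role of the decay parameter can be sketched with a single update step. The abstract does not spell out the recursions, so the form below is an assumption: the standard emphatic TD recursions with linear features and unit interest, with the discount in the follow-on trace replaced by `beta`. All names and the numeric values are hypothetical.

```python
def etd_step(theta, trace, follow_on, phi, reward, phi_next,
             rho, rho_prev, gamma, lam, beta, alpha):
    """One emphatic TD update with a separate follow-on decay (assumed form).

    follow_on: F_t = beta * rho_prev * F_{t-1} + 1
    emphasis:  M_t = lam + (1 - lam) * F_t
    trace:     e_t = rho * (gamma * lam * e_{t-1} + M_t * phi)
    theta:     theta + alpha * td_error * e_t
    """
    follow_on = beta * rho_prev * follow_on + 1.0
    emphasis = lam + (1.0 - lam) * follow_on
    trace = [rho * (gamma * lam * e + emphasis * f) for e, f in zip(trace, phi)]
    value = sum(t * f for t, f in zip(theta, phi))
    value_next = sum(t * f for t, f in zip(theta, phi_next))
    td_error = reward + gamma * value_next - value
    theta = [t + alpha * td_error * e for t, e in zip(theta, trace)]
    return theta, trace, follow_on


# Hypothetical single-feature run with on-policy importance ratios (rho = 1).
theta, trace, follow_on = [0.0], [0.0], 0.0
for _ in range(2):
    theta, trace, follow_on = etd_step(
        theta, trace, follow_on, phi=[1.0], reward=1.0, phi_next=[1.0],
        rho=1.0, rho_prev=1.0, gamma=0.9, lam=0.0, beta=0.5, alpha=0.1)
print(theta, follow_on)
```

Under this assumed form, setting `beta` equal to `gamma` gives the usual ETD($\lambda$) follow-on trace, while a smaller `beta` makes the importance-sampling product decay faster, which is the variance lever the abstract refers to.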

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications