A Lyapunov Theory for Finite-Sample Guarantees of Asynchronous Q-Learning and TD-Learning Variants
This paper develops a unified framework to study finite-sample convergence
guarantees of a large class of value-based asynchronous Reinforcement Learning
(RL) algorithms. We do this by first reformulating the RL algorithms as
Markovian Stochastic Approximation (SA) algorithms to solve fixed-point
equations. We then develop a Lyapunov analysis and derive mean-square error
bounds on the convergence of the Markovian SA. Based on this central result, we
establish finite-sample mean-square convergence bounds for asynchronous RL
algorithms such as $Q$-learning, $n$-step TD, TD$(\lambda)$, and off-policy TD
algorithms including V-trace. As a by-product, by analyzing the performance
bounds of the TD$(\lambda)$ (and $n$-step TD) algorithm for general $\lambda$
(and $n$), we demonstrate a bias-variance trade-off, i.e., efficiency of
bootstrapping in RL. This was first posed as an open problem in [37].
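To make the stochastic-approximation viewpoint concrete, below is a minimal sketch (not the paper's code) of two of the analyzed algorithms written as asynchronous SA iterates: only the visited coordinate is updated at each step, and the noise is Markovian because samples come from a single trajectory. The random MDP, step sizes, exploration rate, and trace parameter are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small random MDP (illustrative): nS states, nA actions,
# transition kernel P, deterministic rewards R.
nS, nA, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a] is a distribution over s'
R = rng.uniform(0, 1, size=(nS, nA))

def q_learning(T=50_000, alpha=0.1, eps=0.2):
    """Asynchronous Q-learning: an SA iterate tracking the fixed point of
    the Bellman optimality operator; only the visited (s, a) is updated."""
    Q = np.zeros((nS, nA))
    s = 0
    for _ in range(T):
        a = rng.integers(nA) if rng.random() < eps else int(Q[s].argmax())
        s2 = rng.choice(nS, p=P[s, a])
        # Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
        Q[s, a] += alpha * (R[s, a] + gamma * Q[s2].max() - Q[s, a])
        s = s2
    return Q

def td_lambda(policy, T=50_000, alpha=0.05, lam=0.8):
    """TD(lambda) with accumulating eligibility traces: lam interpolates
    between TD(0) (heavy bootstrapping) and Monte Carlo (lam -> 1)."""
    V, z = np.zeros(nS), np.zeros(nS)
    s = 0
    for _ in range(T):
        a = rng.choice(nA, p=policy[s])
        s2 = rng.choice(nS, p=P[s, a])
        delta = R[s, a] + gamma * V[s2] - V[s]  # TD error
        z = gamma * lam * z
        z[s] += 1.0                             # accumulating trace
        V += alpha * delta * z
        s = s2
    return V

uniform = np.full((nS, nA), 1.0 / nA)
print(q_learning()[0], td_lambda(uniform)[:3])
```

Varying lam (or the number of bootstrap steps $n$ in $n$-step TD) is the knob behind the bias-variance trade-off mentioned above: small lam bootstraps aggressively (lower variance, more bias), while lam near 1 approaches Monte Carlo returns.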