11 research outputs found
Baird Counterexample is Solved: with an example of How to Debug a Two-time-scale Algorithm
The Baird counterexample was proposed by Leemon Baird in 1995, first used to show
that the Temporal Difference (TD(0)) algorithm diverges on this example. Since
then, it has often been used to test and compare off-policy learning algorithms.
Gradient TD algorithms solved the divergence issue of TD on the Baird
counterexample. However, their convergence on this example is still very slow,
and the nature of the slowness is not well understood; see, e.g., Sutton and
Barto (2018).
This note aims to understand, in particular, why TDC is slow on this example,
and provides a debugging analysis of this behavior. Our debugging
technique can be used to study the convergence behavior of two-time-scale
stochastic approximation algorithms. We also provide empirical results for the
recent Impression GTD algorithm on this example, showing that its convergence is
very fast, in fact at a linear rate. We conclude that the Baird counterexample is
solved by an algorithm with a convergence guarantee to the TD solution in
general and a fast convergence rate.
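The divergence of TD(0) mentioned in this abstract can be reproduced with a short sketch. It assumes the seven-state, eight-weight version of the counterexample as presented in Sutton and Barto (2018), and it uses expected (synchronous) semi-gradient updates with uniform state weighting rather than sampled transitions. The function names, the step size 0.01, the discount 0.99, and the initial weights are illustrative choices, not values taken from the note itself.

```python
import numpy as np

def baird_features():
    """Feature matrix for the 7-state counterexample (assumed setup)."""
    phi = np.zeros((7, 8))
    for s in range(6):
        phi[s, s] = 2.0   # upper states: v(s) = 2*w[s] + w[7]
        phi[s, 7] = 1.0
    phi[6, 6] = 1.0       # lower state: v = w[6] + 2*w[7]
    phi[6, 7] = 2.0
    return phi

def expected_td0(num_sweeps=1000, alpha=0.01, gamma=0.99):
    """Expected semi-gradient TD(0) under the target policy."""
    phi = baird_features()
    w = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 10.0, 1.0])
    norms = [np.linalg.norm(w)]
    for _ in range(num_sweeps):
        v = phi @ w
        # The target policy always moves to the lower state; rewards are zero.
        delta = gamma * v[6] - v
        # Uniform state weighting (assumption): average of delta(s) * phi(s).
        w = w + alpha * (phi.T @ delta) / 7.0
        norms.append(np.linalg.norm(w))
    return w, norms

w_final, norms = expected_td0()
print(f"||w|| initial = {norms[0]:.2f}, after 1000 sweeps = {norms[-1]:.2f}")
```

The weight norm keeps growing with the number of sweeps, which is the instability that gradient TD methods such as TDC were designed to remove.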
Convergence of Least Squares Temporal Difference Methods Under General Conditions
We consider approximate policy evaluation for finite state and action Markov decision processes (MDP) in the off-policy learning context and with the simulation-based least squares temporal difference algorithm, LSTD(λ). We establish for the discounted cost criterion that the off-policy LSTD(λ) converges almost surely under mild, minimal conditions. We also analyze other convergence and boundedness properties of the iterates involved in the algorithm, and based on them, we suggest a modification in its practical implementation. Our analysis uses theories of both finite space Markov chains and Markov chains on topological spaces, in particular, the e-chains.
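For readers unfamiliar with the algorithm, the following is a minimal sketch of off-policy LSTD(λ) with per-transition importance-sampling ratios. It is written under assumptions: the function name `off_policy_lstd`, its argument layout, and the toy two-state chain are hypothetical, and the sketch solves the accumulated linear system once at the end rather than reproducing the iterates or the implementation modification analyzed in the paper.

```python
import numpy as np

def off_policy_lstd(features, rewards, rhos, gamma, lam):
    """Sketch of off-policy LSTD(lambda).

    features: (T+1, d) array, row t is phi(s_t)
    rewards:  length-T sequence, reward on transition t -> t+1
    rhos:     length-T sequence, importance ratio of transition t -> t+1
    """
    d = features.shape[1]
    A = np.zeros((d, d))
    b = np.zeros(d)
    z = np.zeros(d)      # eligibility trace
    prev_rho = 1.0       # has no effect at t = 0 because z starts at zero
    for t in range(len(rewards)):
        z = lam * gamma * prev_rho * z + features[t]
        A += np.outer(z, features[t] - gamma * rhos[t] * features[t + 1])
        b += rhos[t] * rewards[t] * z
        prev_rho = rhos[t]
    return np.linalg.solve(A, b)

# Toy check: deterministic 2-state cycle, reward 1 when leaving state 0,
# tabular features, and rho = 1 (the on-policy special case).
T = 20
states = [t % 2 for t in range(T + 1)]
features = np.eye(2)[states]
rewards = [1.0 if s == 0 else 0.0 for s in states[:-1]]
rhos = np.ones(T)
w_lstd0 = off_policy_lstd(features, rewards, rhos, gamma=0.5, lam=0.0)
w_lstd_half = off_policy_lstd(features, rewards, rhos, gamma=0.5, lam=0.5)
print(w_lstd0, w_lstd_half)
```

On this toy chain the exact values are 4/3 and 2/3, and both choices of λ recover them because the tabular representation is exact.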
Direct Gradient Temporal Difference Learning
Off-policy learning enables a reinforcement learning (RL) agent to reason
counterfactually about policies that are not executed and is one of the most
important ideas in RL. It, however, can lead to instability when combined with
function approximation and bootstrapping, two arguably indispensable
ingredients for large-scale reinforcement learning. This is the notorious
deadly triad. Gradient Temporal Difference (GTD) is one powerful tool to solve
the deadly triad. Its success results from solving a double sampling issue
indirectly with weight duplication or Fenchel duality. In this paper, we
instead propose a direct method to solve the double sampling issue by simply
using two samples in a Markovian data stream with an increasing gap. The
resulting algorithm is as computationally efficient as GTD but gets rid of
GTD's extra weights. The only price we pay is a logarithmically increasing
memory as time progresses. We provide both asymptotic and finite sample
analysis, where the convergence rate is on par with the canonical on-policy
temporal difference learning. Key to our analysis is a novel refined
discretization of limiting ODEs. Comment: Submitted to JMLR in Apr 202
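The double sampling issue named in this abstract can be illustrated on a toy objective. The sketch below assumes the quantity of interest is a squared norm of an expectation, such as ‖E[δφ]‖², and enumerates a small set of equally likely samples exactly. Squaring one sample overestimates the target by the variance, while the product of two independent samples is unbiased. The function `neu_estimates` and the sample values are hypothetical. This is not the paper's algorithm, which obtains its second sample from a Markovian stream with an increasing gap.

```python
import numpy as np

def neu_estimates(g):
    """g: (n, d) array; row k is delta_k * phi_k for one equally likely sample."""
    mean_g = g.mean(axis=0)
    true_value = float(mean_g @ mean_g)              # ||E[g]||^2
    single = float(np.mean(np.sum(g * g, axis=1)))   # E[||g||^2], one sample reused
    double = float(np.mean(g @ g.T))                 # E[g_j . g_k], independent j and k
    return true_value, single, double

g = np.array([[1.0, 0.0],
              [-1.0, 2.0],
              [0.0, 1.0]])
true_value, single, double = neu_estimates(g)
print(true_value, single, double)
```

The gap between `single` and `true_value` is the total variance of the samples, which is why a single transition cannot be reused for both factors.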
Least Squares Temporal Difference Methods : An Analysis Under General Conditions
This technical report is a revised and extended version of the technical report C-2010-1. It contains simplified and improved proofs, as well as extensions of some of the earlier results. We consider approximate policy evaluation for finite state and action Markov decision processes (MDP) with the least squares temporal difference algorithm, LSTD(λ), in an exploration-enhanced off-policy learning context. We establish for the discounted cost criterion that the off-policy LSTD(λ) converges almost surely under mild, minimal conditions. We also analyze other convergence and boundedness properties of the iterates involved in the algorithm. Our analysis draws on theories of both finite space Markov chains and weak Feller Markov chains on topological spaces. Our results can be applied to other temporal difference algorithms and MDP models. As examples, we give a convergence analysis of an off-policy TD(λ) algorithm and extensions to MDP with compact action and state spaces.
A new Gradient TD Algorithm with only One Step-size: Convergence Rate Analysis using $L$-$\lambda$ Smoothness
Gradient Temporal Difference (GTD) algorithms (Sutton et al., 2008, 2009) are
the first $O(d)$ ($d$ is the number of features) algorithms that have convergence
guarantees for off-policy learning with linear function approximation. Liu et
al. (2015) and Dalal et al. (2018) proved that the convergence rates of GTD, GTD2
and TDC are $O(t^{-\alpha/2})$ for some $\alpha \in (0,1)$. This bound is tight
(Dalal et al., 2020), and slower than $O(1/\sqrt{t})$. GTD algorithms also have
two step-size parameters, which are difficult to tune. In the literature, there is
a "single-time-scale" formulation of GTD. However, this formulation still has
two step-size parameters.
This paper presents a truly single-time-scale GTD algorithm for minimizing
the Norm of Expected td Update (NEU) objective, and it has only one step-size
parameter. We prove that the new algorithm, called Impression GTD, converges at
least as fast as $O(1/t)$. Furthermore, based on a generalization of the
expected smoothness (Gower et al. 2019), called $L$-$\lambda$ smoothness, we
are able to prove that the new GTD converges even faster, in fact, with a
linear rate. Our rate actually also improves Gower et al.'s result with a
tighter bound under a weaker assumption. Besides Impression GTD, we also prove
the rates of three other GTD algorithms, one by Yao and Liu (2008), another
called A-transpose-TD (Sutton et al., 2008), and a counterpart of
A-transpose-TD. The convergence rates of all the four GTD algorithms are proved
in a single generic GTD framework to which $L$-$\lambda$ smoothness applies.
Empirical results on Random walks, Boyan chain, and Baird counterexample show
that Impression GTD converges much faster than existing GTD algorithms for both
on-policy and off-policy learning problems, with well-performing step-sizes
over a wide range.
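To make the NEU objective concrete, the sketch below evaluates ‖b − Aw‖² for a toy problem and minimizes it by plain expected gradient descent with a single step size. This corresponds to the expected form of an A-transpose-TD style update and is not Impression GTD itself. The definitions A = E[φ(φ − γφ')ᵀ] and b = E[rφ], the two-state chain, and the function names are assumptions made for illustration.

```python
import numpy as np

def neu(A, b, w):
    """Norm of Expected TD Update: ||E[delta * phi]||^2 = ||b - A w||^2."""
    residual = b - A @ w
    return float(residual @ residual)

def descend_neu(A, b, step, iters):
    """Expected gradient descent on NEU with one step size."""
    w = np.zeros(A.shape[1])
    for _ in range(iters):
        # The gradient of NEU is -2 A^T (b - A w); the factor 2 is absorbed in step.
        w = w + step * A.T @ (b - A @ w)
    return w

# Toy problem: deterministic 2-state cycle, tabular features, uniform state weighting.
gamma = 0.5
phi = np.eye(2)
phi_next = phi[[1, 0]]        # state 0 -> state 1, state 1 -> state 0
r = np.array([1.0, 0.0])      # reward 1 when leaving state 0
A = phi.T @ (phi - gamma * phi_next) / 2.0
b = phi.T @ r / 2.0
w_neu = descend_neu(A, b, step=1.0, iters=500)
print(w_neu, neu(A, b, w_neu))
```

On this example the iterate approaches the TD solution (4/3, 2/3) geometrically, a small-scale instance of a linear convergence rate.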
Approximate policy iteration: A survey and some new methods
We consider the classical policy iteration method of dynamic programming (DP), where approximations and simulation are used to deal with the curse of dimensionality. We survey a number of issues: convergence and rate of convergence of approximate policy evaluation methods, singularity and susceptibility to simulation noise of policy evaluation, exploration issues, constrained and enhanced policy iteration, policy oscillation and chattering, and optimistic and distributed policy iteration. Our discussion of policy evaluation is couched in general terms and aims to unify the available methods in the light of recent research developments and to compare the two main policy evaluation approaches: projected equations and temporal differences (TD), and aggregation. In the context of these approaches, we survey two different types of simulation-based algorithms: matrix inversion methods, such as least-squares temporal difference (LSTD), and iterative methods, such as least-squares policy evaluation (LSPE) and TD(λ), and their scaled variants. We discuss a recent method, based on regression and regularization, which rectifies the unreliability of LSTD for nearly singular projected Bellman equations. An iterative version of this method belongs to the LSPE class of methods and provides the connecting link between LSTD and LSPE. Our discussion of policy improvement focuses on the role of policy oscillation and its effect on performance guarantees. We illustrate that policy evaluation when done by the projected equation/TD approach may lead to policy oscillation, but when done by aggregation it does not. This implies better error bounds and more regular performance for aggregation, at the expense of some loss of generality in cost function representation capability. Hard aggregation provides the connecting link between projected equation/TD-based and aggregation-based policy evaluation, and is characterized by favorable error bounds.
Funding: National Science Foundation (U.S.) (No. ECCS-0801549); Los Alamos National Laboratory. Information Science and Technology Institute; United States. Air Force (No. FA9550-10-1-0412)