Direct Gradient Temporal Difference Learning
Off-policy learning enables a reinforcement learning (RL) agent to reason counterfactually about policies that are not executed and is one of the most important ideas in RL. It can, however, lead to instability when combined with function approximation and bootstrapping, two arguably indispensable ingredients for large-scale reinforcement learning. This is the notorious deadly triad. Gradient Temporal Difference (GTD) learning is a powerful tool for resolving the deadly triad. Its success results from solving the double sampling issue indirectly, via weight duplication or Fenchel duality. In this paper, we instead propose a direct method: we simply use two samples from a Markovian data stream separated by an increasing gap. The resulting algorithm is as computationally efficient as GTD but dispenses with GTD's extra weights. The only price we pay is memory that grows logarithmically as time progresses. We provide both asymptotic and finite sample analyses, with a convergence rate on par with canonical on-policy temporal difference learning. Key to our analysis is a novel refined discretization of the limiting ODEs.

Comment: Submitted to JMLR in Apr 202
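The core idea, pairing the current transition with a stored one whose temporal gap keeps growing so that the two factors of a product-of-expectations gradient decorrelate, can be sketched in a few lines. The Python sketch below is illustrative only, not the paper's pseudocode: the choice of the NEU objective, the doubling checkpoint schedule, and all names (`direct_gtd_sketch`, `toy_stream`) and hyperparameters are assumptions made for concreteness.

```python
import numpy as np


def direct_gtd_sketch(stream, d, alpha=0.01, gamma=0.99):
    """Hedged sketch of the 'two samples with an increasing gap' idea.

    For concreteness, take the NEU objective J(w) = ||E[delta_t phi_t]||^2,
    whose gradient is a *product* of two expectations,
        grad J(w) = 2 E[(gamma*phi' - phi) phi^T] E[delta phi],
    so an unbiased estimate needs two (near-)independent transitions --
    the double sampling issue.  Here each factor is estimated from a
    different transition: the current one, and a stored one whose
    temporal gap from the present keeps growing.
    """
    w = np.zeros(d)
    ckpts = []     # (time, transition) stored at times 1, 2, 4, 8, ...
    next_save = 1  # -> only O(log t) transitions kept, echoing the abstract
    for t, (phi, r, phi_next) in enumerate(stream, start=1):
        # pick the newest stored transition at least t/2 steps in the past,
        # so the gap between the paired samples grows with t
        old = next((tr for s, tr in reversed(ckpts) if s <= t // 2), None)
        if old is not None:
            phi_o, r_o, phi_next_o = old
            # TD error of the old transition under the *current* weights
            delta_o = r_o + gamma * w @ phi_next_o - w @ phi_o
            # one factor per sample: (gamma*phi' - phi) phi^T from time t,
            # delta*phi from the well-separated past time
            w -= alpha * (gamma * phi_next - phi) * (phi @ phi_o) * delta_o
        if t == next_save:
            ckpts.append((t, (phi, r, phi_next)))
            next_save *= 2  # exponentially spaced checkpoints
    return w


def toy_stream(num_steps, d=5, seed=0):
    """Toy slowly mixing Markovian feature stream, for illustration only."""
    rng = np.random.default_rng(seed)
    phi = rng.normal(size=d)
    for _ in range(num_steps):
        phi_next = 0.9 * phi + 0.1 * rng.normal(size=d)
        yield phi, float(phi[0]), phi_next
        phi = phi_next


w = direct_gtd_sketch(toy_stream(50_000), d=5)
```

Note how the sketch mirrors the abstract's trade-off: at time t only O(log t) checkpoints are stored, while the gap between the paired samples is at least t/2, so under standard mixing assumptions the two factors of the gradient estimate become asymptotically independent.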