On Convergence of Emphatic Temporal-Difference Learning
We consider emphatic temporal-difference learning algorithms for policy
evaluation in discounted Markov decision processes with finite spaces. Such
algorithms were recently proposed by Sutton, Mahmood, and White (2015) as an
improved solution to the problem of divergence of off-policy
temporal-difference learning with linear function approximation. We present in
this paper the first convergence proofs for two emphatic algorithms,
ETD(λ) and ELSTD(λ). We prove, under general off-policy conditions, the convergence in L1 for the ELSTD(λ) iterates, and the
almost sure convergence of the approximate value functions calculated by both
algorithms using a single infinitely long trajectory. Our analysis involves new
techniques with applications beyond emphatic algorithms leading, for example,
to the first proof that standard TD(λ) also converges under off-policy training for λ sufficiently large. Comment: A minor correction is made (see page 1 for details). 45 pages. A shorter 28-page article based on the first version appeared at the 28th Annual Conference on Learning Theory (COLT), 2015.
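For readers who want the update itself, here is a minimal sketch of the ETD(λ) recursion with linear features; constant discount gamma, bootstrapping parameter lam, and unit interest are simplifying assumptions, and the names are illustrative rather than the paper's notation.

```python
import numpy as np

def etd_lambda_step(theta, e, F, rho_prev, phi, phi_next, reward, rho,
                    alpha=0.01, gamma=0.95, lam=0.8, interest=1.0):
    """One ETD(lambda) update with linear features (sketch; constant gamma,
    lam, and interest stand in for the state-dependent quantities)."""
    F = gamma * rho_prev * F + interest            # follow-on trace
    M = lam * interest + (1.0 - lam) * F           # emphasis
    e = rho * (gamma * lam * e + M * phi)          # emphasis-weighted eligibility trace
    delta = reward + gamma * theta @ phi_next - theta @ phi  # TD error
    theta = theta + alpha * delta * e
    return theta, e, F, rho                        # returned rho becomes rho_prev next step
```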
Finite Sample Analyses for TD(0) with Function Approximation
TD(0) is one of the most commonly used algorithms in reinforcement learning.
Despite this, there is no existing finite sample analysis for TD(0) with
function approximation, even for the linear case. Our work is the first to
provide such results. Existing convergence rates for Temporal Difference (TD)
methods apply only to somewhat modified versions, e.g., projected variants or
ones where stepsizes depend on unknown problem parameters. Our analyses obviate
these artificial alterations by exploiting strong properties of TD(0). We
provide convergence rates both in expectation and with high-probability. The
two are obtained via different approaches that use relatively unknown, recently
developed stochastic approximation techniques.
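As a point of reference, the following is a minimal sketch of the unmodified algorithm the analysis targets: TD(0) with linear function approximation, a plain constant step size, and no projection step (array shapes and constants are illustrative).

```python
import numpy as np

def td0_linear(features, rewards, next_features, alpha=0.05, gamma=0.95):
    """Run TD(0) with linear value approximation V(s) = theta @ phi(s) along a
    recorded trajectory; no projection and no problem-dependent step sizes."""
    theta = np.zeros(features.shape[1])
    for phi, r, phi_next in zip(features, rewards, next_features):
        delta = r + gamma * theta @ phi_next - theta @ phi  # TD error
        theta += alpha * delta * phi                        # semi-gradient update
    return theta
```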
A Convergent Off-Policy Temporal Difference Algorithm
Learning the value function of a given policy (target policy) from the data
samples obtained from a different policy (behavior policy) is an important
problem in Reinforcement Learning (RL). This problem is studied under the
setting of off-policy prediction. Temporal Difference (TD) learning algorithms
are a popular class of algorithms for solving the prediction problem. TD
algorithms with linear function approximation are shown to be convergent when
the samples are generated from the target policy (known as on-policy
prediction). However, it has been well established in the literature that
off-policy TD algorithms under linear function approximation diverge. In this
work, we propose a convergent on-line off-policy TD algorithm under linear
function approximation. The main idea is to penalize the updates of the
algorithm in such a way as to ensure convergence of the iterates. We provide a
convergence analysis of our algorithm. Through numerical evaluations, we
further demonstrate the effectiveness of our algorithm.
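The abstract does not spell out the penalty, so the following sketch only illustrates the general idea of a penalized off-policy TD(0) step; the quadratic (ridge-style) penalty with weight eta is a stand-in assumption, not the authors' specific choice.

```python
import numpy as np

def penalized_offpolicy_td0_step(theta, phi, phi_next, reward, rho,
                                 alpha=0.01, gamma=0.95, eta=0.1):
    """Importance-weighted TD(0) step plus a penalty gradient that shrinks the
    iterates; eta controls the strength of the penalty (illustrative only)."""
    delta = reward + gamma * theta @ phi_next - theta @ phi  # off-policy TD error
    penalty_grad = eta * theta                               # gradient of (eta/2)*||theta||^2
    return theta + alpha * (rho * delta * phi - penalty_grad)
```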
Value Function Approximation in Zero-Sum Markov Games
This paper investigates value function approximation in the context of
zero-sum Markov games, which can be viewed as a generalization of the Markov
decision process (MDP) framework to the two-agent case. We generalize error
bounds from MDPs to Markov games and describe generalizations of reinforcement
learning algorithms to Markov games. We present a generalization of the optimal
stopping problem to a two-player simultaneous move Markov game. For this
special problem, we provide stronger bounds and can guarantee convergence for
LSTD and temporal difference learning with linear value function approximation.
We demonstrate the viability of value function approximation for Markov games
by using the least-squares policy iteration (LSPI) algorithm to learn good policies for a soccer domain and a flow control problem. Comment: Appears in Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI 2002).
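The building block behind the generalized algorithms is the minimax backup, which replaces the max of the MDP Bellman operator with the value of a stage matrix game. A small sketch of that stage-game computation via a linear program follows (function and variable names are illustrative, not taken from the paper).

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(Q):
    """Minimax value of a zero-sum matrix game with payoff matrix Q (row player
    maximizes): maximize v subject to Q^T p >= v, sum(p) = 1, p >= 0."""
    m, n = Q.shape
    c = np.concatenate(([-1.0], np.zeros(m)))             # minimize -v
    A_ub = np.hstack((np.ones((n, 1)), -Q.T))              # v - (Q^T p)_j <= 0 for each column j
    b_ub = np.zeros(n)
    A_eq = np.concatenate(([0.0], np.ones(m))).reshape(1, -1)  # probabilities sum to 1
    b_eq = np.array([1.0])
    bounds = [(None, None)] + [(0.0, 1.0)] * m
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return -res.fun                                        # game value v
```

In a value-iteration-style backup, the payoff matrix for state s would collect the one-step returns r(s, a1, a2) + gamma * E[V(s')] over the two players' joint actions.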
Geometric Insights into the Convergence of Nonlinear TD Learning
While there are convergence guarantees for temporal difference (TD) learning
when using linear function approximators, the situation for nonlinear models is
far less understood, and divergent examples are known. Here we take a first
step towards extending theoretical convergence guarantees to TD learning with
nonlinear function approximation. More precisely, we consider the expected
learning dynamics of the TD(0) algorithm for value estimation. As the step-size
converges to zero, these dynamics are defined by a nonlinear ODE which depends
on the geometry of the space of function approximators, the structure of the
underlying Markov chain, and their interaction. We find a set of function
approximators that includes ReLU networks and has geometry amenable to TD
learning regardless of environment, so that the solution performs about as well
as linear TD in the worst case. Then, we show how environments that are more
reversible induce dynamics that are better for TD learning and prove global
convergence to the true value function for well-conditioned function
approximators. Finally, we generalize a divergent counterexample to a family of
divergent problems to demonstrate how the interaction between approximator and
environment can go wrong and to motivate the assumptions needed to prove
convergence. Comment: ICLR 2020.
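Concretely, the expected dynamics studied here can be written as an ODE on the parameters. The sketch below evaluates that vector field for a finite Markov chain and an arbitrary differentiable value approximator; P, R, d, and the callables are assumed inputs, not the paper's notation.

```python
import numpy as np

def expected_td0_field(theta, value_fn, grad_fn, P, R, d, gamma=0.9):
    """Vector field g(theta) = sum_s d(s) * E[delta(s)] * grad V_theta(s), where
    E[delta(s)] = R(s) + gamma * sum_{s'} P(s, s') V_theta(s') - V_theta(s).
    In the small step-size limit, TD(0) follows the ODE d(theta)/dt = g(theta)."""
    n_states = len(d)
    V = np.array([value_fn(theta, s) for s in range(n_states)])
    g = np.zeros_like(theta)
    for s in range(n_states):
        delta = R[s] + gamma * P[s] @ V - V[s]   # expected TD error at state s
        g += d[s] * delta * grad_fn(theta, s)
    return g
```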
Non-Asymptotic Analysis for Two Time-scale TDC with General Smooth Function Approximation
Temporal-difference learning with gradient correction (TDC) is a two
time-scale algorithm for policy evaluation in reinforcement learning. This
algorithm was initially proposed with linear function approximation, and was
later extended to the one with general smooth function approximation. The
asymptotic convergence for the on-policy setting with general smooth function
approximation was established by Bhatnagar et al. (2009); however, the
finite-sample analysis remains unsolved due to challenges in the non-linear and
two-time-scale update structure, non-convex objective function and the
time-varying projection onto a tangent plane. In this paper, we develop novel
techniques to explicitly characterize the finite-sample error bound for the
general off-policy setting with i.i.d.\ or Markovian samples, and show that it
converges as fast as O(1/√T) (up to a factor of O(log T)). Our approach can be applied to a wide range of value-based
reinforcement learning algorithms with general smooth function approximation. Comment: Accepted by NeurIPS 2021.
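For orientation, here is a minimal sketch of the classical linear, two time-scale TDC update; the paper's subject is the harder case where the linear features are replaced by a general smooth approximator, so the sketch only conveys the update structure (step sizes and names are illustrative).

```python
import numpy as np

def tdc_linear_step(theta, w, phi, phi_next, reward, rho,
                    alpha=0.005, beta=0.05, gamma=0.95):
    """One linear TDC step: theta moves on the slow time scale (alpha) with a
    gradient-correction term, while the auxiliary weights w track the projected
    TD error on the fast time scale (beta); rho is the importance ratio."""
    delta = reward + gamma * theta @ phi_next - theta @ phi          # TD error
    theta_new = theta + alpha * rho * (delta * phi - gamma * (phi @ w) * phi_next)
    w_new = w + beta * rho * (delta - phi @ w) * phi
    return theta_new, w_new
```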
Byzantine-Resilient Decentralized TD Learning with Linear Function Approximation
This paper considers the policy evaluation problem in a multi-agent
reinforcement learning (MARL) environment over decentralized and directed
networks. The focus is on decentralized temporal difference (TD) learning with
linear function approximation in the presence of unreliable or even malicious
agents, termed Byzantine agents. In order to evaluate the quality of a fixed policy in a common environment, agents usually run decentralized TD learning collaboratively. However, when some Byzantine agents behave adversarially, decentralized TD is unable to learn an accurate linear approximation for the true value function. We propose a trimmed-mean based Byzantine-resilient decentralized TD algorithm to perform policy
evaluation in this setting. We establish the finite-time convergence rate, as
well as the asymptotic learning error in the presence of Byzantine agents.
Numerical experiments corroborate the robustness of the proposed algorithm.
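The screening rule at the heart of the algorithm is a coordinate-wise trimmed mean. A minimal sketch follows, assuming each agent receives parameter vectors from more than 2*b neighbors; names are illustrative.

```python
import numpy as np

def trimmed_mean(neighbor_params, b):
    """Coordinate-wise trimmed mean: in every coordinate, discard the b largest
    and b smallest received values and average the rest.  Requires more than
    2*b received vectors so that something is left after trimming."""
    X = np.sort(np.stack(neighbor_params), axis=0)   # sort each coordinate separately
    return X[b:X.shape[0] - b].mean(axis=0)
```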
Target-Based Temporal Difference Learning
The use of target networks has been a popular and key component of recent
deep Q-learning algorithms for reinforcement learning, yet little is known from
the theory side. In this work, we introduce a new family of target-based
temporal difference (TD) learning algorithms and provide theoretical analysis
on their convergences. In contrast to the standard TD-learning, target-based TD
algorithms maintain two separate learning parameters: the target variable and the online variable. In particular, we introduce three members of the family, called averaging TD, double TD, and periodic TD, where the target variable is updated in an averaging, symmetric, or periodic fashion, mirroring those
techniques used in deep Q-learning practice.
We establish asymptotic convergence analyses for both averaging TD and double
TD and a finite sample analysis for periodic TD. In addition, we also provide
some simulation results showing potentially superior convergence of these
target-based TD algorithms compared to the standard TD-learning. While this
work focuses on linear function approximation and policy evaluation setting, we
consider this as a meaningful step towards the theoretical understanding of
deep Q-learning variants with target networks.
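As an illustration of the two-variable structure, here is a sketch in the spirit of averaging TD: the online weights bootstrap from a separate target variable, and the target slowly tracks the online weights. The Polyak-style averaging constant tau and the other details are assumptions, not the paper's exact scheme.

```python
import numpy as np

def averaging_td_step(theta_online, theta_target, phi, phi_next, reward,
                      alpha=0.05, tau=0.01, gamma=0.95):
    """TD step that bootstraps from the target variable, followed by a slow
    averaging update of the target toward the online variable (illustrative)."""
    delta = reward + gamma * theta_target @ phi_next - theta_online @ phi
    theta_online = theta_online + alpha * delta * phi          # online TD update
    theta_target = theta_target + tau * (theta_online - theta_target)  # slow averaging
    return theta_online, theta_target
```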
Conditions on Features for Temporal Difference-Like Methods to Converge
The convergence of many reinforcement learning (RL) algorithms with linear
function approximation has been investigated extensively but most proofs assume
that these methods converge to a unique solution. In this paper, we provide a
complete characterization of non-uniqueness issues for a large class of
reinforcement learning algorithms, simultaneously unifying many
counter-examples to convergence in a theoretical framework. We achieve this by
proving a new condition on features that can determine whether the convergence
assumptions are valid or non-uniqueness holds. We consider a general class of
RL methods, which we call natural algorithms, whose solutions are characterized
as the fixed point of a projected Bellman equation (when it exists); notably,
bootstrapped temporal difference-based methods such as TD(λ) are natural algorithms. Our main result proves that natural
algorithms converge to the correct solution if and only if all the value
functions in the approximation space satisfy a certain shape. This implies that
natural algorithms are, in general, inherently prone to converge to the wrong
solution for most feature choices even if the value function can be represented
exactly. Given our results, we show that state aggregation based features are a
safe choice for natural algorithms and we also provide a condition for finding
convergent algorithms under other feature constructions. Comment: 13 pages, 6 figures.
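The fixed point these natural algorithms aim at can be written down directly for linear features. The sketch below solves the projected Bellman equation for a finite chain; with state-aggregation features (one-hot cluster indicators in Phi) the associated projection is a non-expansion, which is one way to see why such features are a safe choice. Inputs and names are illustrative.

```python
import numpy as np

def projected_bellman_fixed_point(Phi, P, R, d, gamma=0.95):
    """Solve A theta = b with A = Phi^T D (I - gamma P) Phi and b = Phi^T D R,
    where D = diag(d); the solution, when it exists, is the fixed point of the
    projected Bellman equation targeted by TD-like (natural) algorithms."""
    D = np.diag(d)
    A = Phi.T @ D @ (np.eye(len(d)) - gamma * P) @ Phi
    b = Phi.T @ D @ R
    return np.linalg.solve(A, b)
```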
A Multi-Agent Off-Policy Actor-Critic Algorithm for Distributed Reinforcement Learning
This paper extends off-policy reinforcement learning to the multi-agent case
in which a set of networked agents communicating with their neighbors according
to a time-varying graph collaboratively evaluates and improves a target policy
while following a distinct behavior policy. To this end, the paper develops a
multi-agent version of emphatic temporal difference learning for off-policy
policy evaluation, and proves convergence under linear function approximation.
The paper then leverages this result, in conjunction with a novel multi-agent
off-policy policy gradient theorem and recent work in both multi-agent
on-policy and single-agent off-policy actor-critic methods, to develop and give
convergence guarantees for a new multi-agent off-policy actor-critic algorithm.