A Lyapunov Theory for Finite-Sample Guarantees of Asynchronous Q-Learning and TD-Learning Variants
This paper develops a unified framework to study finite-sample convergence
guarantees of a large class of value-based asynchronous Reinforcement Learning
(RL) algorithms. We do this by first reformulating the RL algorithms as
Markovian Stochastic Approximation (SA) algorithms to solve fixed-point
equations. We then develop a Lyapunov analysis and derive mean-square error
bounds on the convergence of the Markovian SA. Based on this central result, we
establish finite-sample mean-square convergence bounds for asynchronous RL
algorithms such as Q-learning, n-step TD, TD(lambda), and off-policy TD
algorithms including V-trace. As a by-product, by analyzing the performance
bounds of the TD(lambda) (and n-step TD) algorithm for general lambda
(and n), we demonstrate a bias-variance trade-off, i.e., the efficiency of
bootstrapping in RL. This was first posed as an open problem in [37].
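The reformulation described in this abstract can be illustrated concretely: tabular Q-learning is a Markovian stochastic-approximation scheme for the Bellman fixed-point equation Q = T(Q). The sketch below is my own toy illustration, not the paper's analysis; the 2-state, 2-action MDP (P, R) and step-size schedule are invented for the example.

```python
import numpy as np

# Toy illustration (not from the paper): asynchronous tabular Q-learning
# as stochastic approximation for the Bellman fixed point Q = T(Q).
rng = np.random.default_rng(0)

n_states, n_actions, gamma = 2, 2, 0.9
# P[s, a] -> next-state distribution; R[s, a] -> reward (assumed toy values)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

Q = np.zeros((n_states, n_actions))
s = 0
for t in range(1, 200_001):
    a = rng.integers(n_actions)               # exploratory behavior policy
    s_next = rng.choice(n_states, p=P[s, a])  # Markovian sampling
    target = R[s, a] + gamma * Q[s_next].max()
    alpha = 1.0 / (1 + t ** 0.6)              # diminishing step size
    Q[s, a] += alpha * (target - Q[s, a])     # asynchronous SA update
    s = s_next

# Exact fixed point by value iteration, for comparison.
Q_star = np.zeros_like(Q)
for _ in range(2000):
    Q_star = R + gamma * (P @ Q_star.max(axis=1))
print(np.abs(Q - Q_star).max())               # small residual error
```

Only one (s, a) entry is updated per step, which is exactly the asynchronous setting the finite-sample bounds above address.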
Efficient Reinforcement Learning Using Recursive Least-Squares Methods
The recursive least-squares (RLS) algorithm is one of the most well-known
algorithms used in adaptive filtering, system identification and adaptive
control. Its popularity is mainly due to its fast convergence speed, which is
considered to be optimal in practice. In this paper, RLS methods are used to
solve reinforcement learning problems, where two new reinforcement learning
algorithms using linear value function approximators are proposed and analyzed.
The two algorithms are called RLS-TD(lambda) and Fast-AHC (Fast Adaptive
Heuristic Critic), respectively. RLS-TD(lambda) can be viewed as the extension
of RLS-TD(0) from lambda=0 to general lambda within interval [0,1], so it is a
multi-step temporal-difference (TD) learning algorithm using RLS methods. The
convergence with probability one and the limit of convergence of RLS-TD(lambda)
are proved for ergodic Markov chains. Compared to the existing LS-TD(lambda)
algorithm, RLS-TD(lambda) has advantages in computation and is more suitable
for online learning. The effectiveness of RLS-TD(lambda) is analyzed and
verified by learning prediction experiments of Markov chains with a wide range
of parameter settings. The Fast-AHC algorithm is derived by applying the
proposed RLS-TD(lambda) algorithm in the critic network of the adaptive
heuristic critic method. Unlike the conventional AHC algorithm, Fast-AHC makes use
of RLS methods to improve the learning-prediction efficiency in the critic.
Learning control experiments of the cart-pole balancing and the acrobot
swing-up problems are conducted to compare the data efficiency of Fast-AHC with
conventional AHC. From the experimental results, it is shown that the data
efficiency of learning control can also be improved by using RLS methods in the
learning-prediction process of the critic. The performance of Fast-AHC is also
compared with that of the AHC method using LS-TD(lambda). Furthermore, it is
demonstrated in the experiments that different initial values of the variance
matrix in RLS-TD(lambda) are required to get better performance not only in
learning prediction but also in learning control. The experimental results are
analyzed based on the existing theoretical work on the transient phase of
forgetting factor RLS methods.
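A minimal sketch of the RLS-TD(lambda) idea may help: eligibility traces combined with a recursive least-squares gain and variance-matrix update. The variable names, the 3-state chain, and the large initial variance matrix are my assumptions for illustration, not the paper's experimental setup; with one-hot (tabular) features the estimate should approach the true values V = (I - gamma*P)^{-1} r.

```python
import numpy as np

# Sketch of RLS-TD(lambda) with linear features (toy setup, my choices).
rng = np.random.default_rng(1)

P_chain = np.array([[0.5, 0.5, 0.0],
                    [0.1, 0.6, 0.3],
                    [0.3, 0.0, 0.7]])      # ergodic 3-state chain
r = np.array([1.0, 0.0, 2.0])              # reward per state
gamma, lam = 0.9, 0.5
phi = np.eye(3)                            # one-hot features

theta = np.zeros(3)                        # value-function weights
Pinv = np.eye(3) * 100.0                   # variance matrix; large init as in RLS
z = np.zeros(3)                            # eligibility trace
s = 0
for _ in range(100_000):
    s_next = rng.choice(3, p=P_chain[s])
    z = gamma * lam * z + phi[s]           # accumulate trace
    delta_phi = phi[s] - gamma * phi[s_next]
    Pz = Pinv @ z
    k = Pz / (1.0 + delta_phi @ Pz)        # RLS gain (forgetting factor 1)
    theta = theta + k * (r[s] - delta_phi @ theta)
    Pinv = Pinv - np.outer(k, delta_phi @ Pinv)
    s = s_next

V_true = np.linalg.solve(np.eye(3) - gamma * P_chain, r)
print(np.abs(theta - V_true).max())        # close to the true values
```

Each step costs O(d^2) for d features, avoiding the matrix solve that a batch LS-TD(lambda) pass would need, which is the computational advantage the abstract refers to.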
On the Convergence of Stochastic Iterative Dynamic Programming Algorithms
Recent developments in the area of reinforcement learning have yielded a number of new algorithms for the prediction and control of Markovian environments. These algorithms, including the TD(lambda) algorithm of Sutton (1988) and the Q-learning algorithm of Watkins (1989), can be motivated heuristically as approximations to dynamic programming (DP). In this paper we provide a rigorous proof of convergence of these DP-based learning algorithms by relating them to the powerful techniques of stochastic approximation theory via a new convergence theorem. The theorem establishes a general class of convergent algorithms to which both TD(lambda) and Q-learning belong.
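The stochastic-approximation machinery behind such convergence theorems can be seen in a one-line toy problem (my example, not the paper's): iterates x_{t+1} = x_t + a_t (F(x_t) - x_t + noise) converge to the fixed point of a contraction F when the step sizes satisfy the Robbins-Monro conditions (sum a_t = infinity, sum a_t^2 < infinity).

```python
import numpy as np

# Toy Robbins-Monro iteration converging to the fixed point of a
# contraction despite additive noise (illustrative example only).
rng = np.random.default_rng(2)

def F(x):
    return 0.5 * x + 1.0          # contraction with fixed point x* = 2

x = 10.0
for t in range(1, 100_001):
    a = 1.0 / t                   # steps: sum a_t = inf, sum a_t^2 < inf
    noise = rng.normal()
    x += a * (F(x) - x + noise)   # noisy fixed-point iteration
print(x)                          # close to the fixed point 2.0
```

TD(lambda) and Q-learning fit this template with F given by the (expected) Bellman operator and the noise given by the sampling error of a single transition.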
SMIX(lambda): Enhancing Centralized Value Functions for Cooperative Multi-Agent Reinforcement Learning
Learning a stable and generalizable centralized value function (CVF) is a
crucial but challenging task in multi-agent reinforcement learning (MARL), as
it has to deal with the issue that the joint action space increases
exponentially with the number of agents in such scenarios. This paper proposes
an approach, named SMIX(lambda), to address the issue using an efficient
off-policy centralized training method within a flexible learner search space.
As importance sampling for such off-policy training is both computationally
costly and numerically unstable, we propose to use the lambda-return as a
proxy to compute the TD error. With this new loss function objective, we adopt
a modified QMIX network structure as the base to train our model. By further
connecting it with the Q(lambda) approach from a unified expectation
correction viewpoint, we show that the proposed SMIX(lambda) is equivalent
to Q(lambda) and hence shares its convergence properties, without suffering
from the aforementioned curse of dimensionality problem inherent
in MARL. Experiments on the StarCraft Multi-Agent Challenge (SMAC) benchmark
demonstrate that our approach not only outperforms several state-of-the-art
MARL methods by a large margin, but also can be used as a general tool to
improve the overall performance of other CTDE-type algorithms by enhancing
their CVFs.
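The lambda-return used as a TD-target proxy above has a simple backward recursion, G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}). The sketch below shows this generic computation over a finite trajectory; it is the textbook definition, not SMIX's full training pipeline, and the trajectory values are toy numbers.

```python
import numpy as np

def lambda_return(rewards, values, gamma, lam):
    """Backward-recursion lambda-return for a finite trajectory.

    rewards: r_0..r_{T-1}; values: bootstrapped V(s_0)..V(s_T).
    """
    G = values[-1]                     # bootstrap from the final state
    out = np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * G)
        out[t] = G
    return out

# Toy trajectory (illustrative numbers).
rewards = np.array([1.0, 0.0, 0.5, 2.0])
values = np.array([0.3, 0.2, 0.4, 0.1, 0.0])
print(lambda_return(rewards, values, gamma=0.99, lam=0.8))
```

At lam=0 this reduces to the one-step TD target r_t + gamma*V(s_{t+1}); at lam=1 it is the Monte Carlo return, which is the bias-variance dial the abstract exploits in place of importance-sampling corrections.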
On the Convergence of Alternating Least Squares Optimisation in Tensor Format Representations
The approximation of tensors is important for the efficient numerical
treatment of high dimensional problems, but it remains an extremely challenging
task. One of the most popular approaches to tensor approximation is the
alternating least squares method. In our study, the convergence of the
alternating least squares algorithm is considered. The analysis is done for
arbitrary tensor format representations and is based on the multilinearity of the
tensor format. In tensor format representation techniques, tensors are
approximated by multilinear combinations of objects of lower dimensionality. The
resulting reduction of dimensionality not only reduces the amount of required
storage but also the computational effort.
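The multilinearity the abstract relies on is easiest to see in the simplest "tensor format", a rank-1 matrix T ~ u v^T: with one factor fixed, the other enters linearly, so each alternating step is an ordinary least-squares solve. The sketch below is a minimal illustration with toy data, not the paper's general format analysis.

```python
import numpy as np

# Alternating least squares for a rank-1 approximation T ~ u v^T.
# Each subproblem is linear because the format is multilinear.
rng = np.random.default_rng(3)
T = np.outer([1.0, 2.0, 3.0], [4.0, 5.0]) + 0.01 * rng.normal(size=(3, 2))

u = rng.normal(size=3)
v = rng.normal(size=2)
for _ in range(50):
    u = T @ v / (v @ v)       # least-squares update for u, v fixed
    v = T.T @ u / (u @ u)     # least-squares update for v, u fixed

err = np.linalg.norm(T - np.outer(u, v)) / np.linalg.norm(T)
print(err)                    # small relative residual
```

Higher-order formats (CP, tensor train, hierarchical Tucker) follow the same pattern: sweep over the factors, solving a linear least-squares problem for each while the others are held fixed.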