
    A Lyapunov Theory for Finite-Sample Guarantees of Asynchronous Q-Learning and TD-Learning Variants

    This paper develops a unified framework to study finite-sample convergence guarantees of a large class of value-based asynchronous Reinforcement Learning (RL) algorithms. We do this by first reformulating the RL algorithms as Markovian Stochastic Approximation (SA) algorithms to solve fixed-point equations. We then develop a Lyapunov analysis and derive mean-square error bounds on the convergence of the Markovian SA. Based on this central result, we establish finite-sample mean-square convergence bounds for asynchronous RL algorithms such as Q-learning, n-step TD, TD(λ), and off-policy TD algorithms including V-trace. As a by-product, by analyzing the performance bounds of the TD(λ) (and n-step TD) algorithm for general λ (and n), we demonstrate a bias-variance trade-off, i.e., the efficiency of bootstrapping in RL. This was first posed as an open problem in [37].
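
    For orientation, a minimal sketch of the kind of asynchronous value-based iteration analyzed here: tabular Q-learning along a single Markovian trajectory, i.e., a noisy fixed-point iteration for the Bellman optimality operator. The environment interface (reset/step) and all parameter values below are assumptions for illustration, not taken from the paper.
```python
# Minimal sketch: asynchronous tabular Q-learning as Markovian stochastic
# approximation to the fixed point Q = T(Q). Hypothetical env with
# reset() -> s and step(a) -> (s_next, r, done); parameters are illustrative.
import numpy as np

def async_q_learning(env, n_states, n_actions, gamma=0.99, alpha=0.1,
                     eps=0.1, num_steps=100_000, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    s = env.reset()
    for _ in range(num_steps):
        # Epsilon-greedy behaviour policy along one Markovian trajectory:
        # only the visited (s, a) entry is updated each step ("asynchronous").
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
        s_next, r, done = env.step(a)
        # Noisy Bellman-optimality (fixed-point) update.
        target = r + (0.0 if done else gamma * Q[s_next].max())
        Q[s, a] += alpha * (target - Q[s, a])
        s = env.reset() if done else s_next
    return Q
```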

    Efficient Reinforcement Learning Using Recursive Least-Squares Methods

    The recursive least-squares (RLS) algorithm is one of the most well-known algorithms used in adaptive filtering, system identification and adaptive control. Its popularity is mainly due to its fast convergence speed, which is considered to be optimal in practice. In this paper, RLS methods are used to solve reinforcement learning problems, and two new reinforcement learning algorithms using linear value function approximators are proposed and analyzed. The two algorithms are called RLS-TD(lambda) and Fast-AHC (Fast Adaptive Heuristic Critic), respectively. RLS-TD(lambda) can be viewed as the extension of RLS-TD(0) from lambda=0 to general lambda within the interval [0,1], so it is a multi-step temporal-difference (TD) learning algorithm using RLS methods. The convergence with probability one and the limit of convergence of RLS-TD(lambda) are proved for ergodic Markov chains. Compared to the existing LS-TD(lambda) algorithm, RLS-TD(lambda) has advantages in computation and is more suitable for online learning. The effectiveness of RLS-TD(lambda) is analyzed and verified by learning-prediction experiments on Markov chains with a wide range of parameter settings. The Fast-AHC algorithm is derived by applying the proposed RLS-TD(lambda) algorithm in the critic network of the adaptive heuristic critic method. Unlike the conventional AHC algorithm, Fast-AHC makes use of RLS methods to improve the learning-prediction efficiency of the critic. Learning control experiments on the cart-pole balancing and acrobot swing-up problems are conducted to compare the data efficiency of Fast-AHC with that of conventional AHC. The experimental results show that the data efficiency of learning control can also be improved by using RLS methods in the learning-prediction process of the critic. The performance of Fast-AHC is also compared with that of the AHC method using LS-TD(lambda). Furthermore, the experiments demonstrate that different initial values of the variance matrix in RLS-TD(lambda) are required to get better performance not only in learning prediction but also in learning control. The experimental results are analyzed based on the existing theoretical work on the transient phase of forgetting-factor RLS methods.
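
    A rough sketch of an RLS-style TD(lambda) update with linear function approximation: an eligibility trace combined with a Sherman-Morrison rank-one update of an inverse correlation matrix, avoiding the explicit matrix solve used by batch LS-TD(lambda). The class name, the initialization P = I/delta, and the feature interface are assumptions for illustration, not the authors' code.
```python
# Sketch of a recursive least-squares TD(lambda) update with linear features.
# P plays the role of the "variance matrix"; its initialization is illustrative.
import numpy as np

class RLSTDLambda:
    def __init__(self, n_features, lam=0.8, gamma=0.99, delta=1e-2):
        self.lam, self.gamma = lam, gamma
        self.theta = np.zeros(n_features)        # value-function weights
        self.z = np.zeros(n_features)            # eligibility trace
        self.P = np.eye(n_features) / delta      # inverse correlation ("variance") matrix

    def update(self, phi, reward, phi_next):
        # Accumulating eligibility trace over the feature vectors.
        self.z = self.gamma * self.lam * self.z + phi
        d = phi - self.gamma * phi_next          # TD feature difference
        Pz = self.P @ self.z
        k = Pz / (1.0 + d @ Pz)                  # RLS gain via Sherman-Morrison
        self.theta += k * (reward - d @ self.theta)
        self.P -= np.outer(k, d @ self.P)        # rank-one update of P
        return self.theta

    def value(self, phi):
        return phi @ self.theta
```
    The initialization P = I/delta here stands in for the "initial value of the variance matrix" whose choice the experiments above highlight for both learning prediction and learning control.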

    On the Convergence of Stochastic Iterative Dynamic Programming Algorithms

    Recent developments in the area of reinforcement learning have yielded a number of new algorithms for the prediction and control of Markovian environments. These algorithms, including the TD(lambda) algorithm of Sutton (1988) and the Q-learning algorithm of Watkins (1989), can be motivated heuristically as approximations to dynamic programming (DP). In this paper we provide a rigorous proof of convergence of these DP-based learning algorithms by relating them to the powerful techniques of stochastic approximation theory via a new convergence theorem. The theorem establishes a general class of convergent algorithms to which both TD(lambda) and Q-learning belong.
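
    For orientation, the style of convergence result referred to here can be summarized as follows. This is a paraphrase, not a verbatim statement of the paper's theorem; the notation (Delta_t, F_t, alpha_t, gamma, C) and the exact form of the conditions are illustrative.
```latex
% Paraphrased stochastic-approximation convergence conditions (illustrative
% notation): an iteration of the form below converges to zero with
% probability one when, roughly, the step sizes satisfy the Robbins-Monro
% conditions, the expected update is a contraction, and the noise variance
% grows at most quadratically.
\[
  \Delta_{t+1}(x) = \bigl(1-\alpha_t(x)\bigr)\,\Delta_t(x) + \alpha_t(x)\,F_t(x),
\]
\[
  \sum_t \alpha_t(x) = \infty, \qquad \sum_t \alpha_t^2(x) < \infty,
\]
\[
  \bigl\|\mathbb{E}\bigl[F_t(x)\mid \mathcal{F}_t\bigr]\bigr\|_\infty \le \gamma\,\|\Delta_t\|_\infty
  \quad\text{for some } 0 \le \gamma < 1,
\]
\[
  \operatorname{Var}\bigl[F_t(x)\mid \mathcal{F}_t\bigr] \le C\bigl(1+\|\Delta_t\|_\infty\bigr)^2 .
\]
% For Q-learning, Delta_t = Q_t - Q^* and F_t is the noisy Bellman-backup
% error, whose expectation is a gamma-contraction under the max norm.
```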

    SMIX(λ): Enhancing Centralized Value Functions for Cooperative Multi-Agent Reinforcement Learning

    Learning a stable and generalizable centralized value function (CVF) is a crucial but challenging task in multi-agent reinforcement learning (MARL), as it has to deal with the issue that the joint action space grows exponentially with the number of agents in such scenarios. This paper proposes an approach, named SMIX(λ), to address the issue using an efficient off-policy centralized training method within a flexible learner search space. As importance sampling for such off-policy training is both computationally costly and numerically unstable, we propose to use the λ-return as a proxy to compute the TD error. With this new loss objective, we adopt a modified QMIX network structure as the base to train our model. By further connecting it with the Q(λ) approach from a unified expectation-correction viewpoint, we show that the proposed SMIX(λ) is equivalent to Q(λ) and hence shares its convergence properties, while not suffering from the aforementioned curse-of-dimensionality problem inherent in MARL. Experiments on the StarCraft Multi-Agent Challenge (SMAC) benchmark demonstrate that our approach not only outperforms several state-of-the-art MARL methods by a large margin, but can also be used as a general tool to improve the overall performance of other CTDE-type algorithms by enhancing their CVFs.
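
    As a point of reference for the λ-return proxy mentioned above, a minimal sketch (not the SMIX(λ) implementation) of computing forward-view λ-return targets backward over a batched trajectory; the function name, inputs, and parameter values are illustrative assumptions.
```python
# Minimal sketch of the lambda-return recursion used as a TD training target:
#   G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}),
# with the bootstrap term zeroed at episode ends.
import numpy as np

def lambda_returns(rewards, values_next, dones, gamma=0.99, lam=0.8):
    """rewards[t], values_next[t] = V(s_{t+1}), dones[t] in {0, 1}."""
    T = len(rewards)
    G = np.zeros(T)
    next_return = values_next[-1]        # bootstrap at the truncation point
    for t in reversed(range(T)):
        blended = (1.0 - lam) * values_next[t] + lam * next_return
        G[t] = rewards[t] + gamma * (1.0 - dones[t]) * blended
        next_return = G[t]
    return G
```
    With lam=0 this reduces to one-step TD targets, and with lam=1 it approaches Monte Carlo returns, which is the trade-off the λ-return interpolates over.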

    On the Convergence of Alternating Least Squares Optimisation in Tensor Format Representations

    The approximation of tensors is important for the efficient numerical treatment of high-dimensional problems, but it remains an extremely challenging task. One of the most popular approaches to tensor approximation is the alternating least squares method. In our study, the convergence of the alternating least squares algorithm is considered. The analysis is done for arbitrary tensor format representations and is based on the multilinearity of the tensor format. In tensor format representation techniques, tensors are approximated by multilinear combinations of objects of lower dimensionality. The resulting reduction of dimensionality not only reduces the amount of required storage but also the computational effort.
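
    As one concrete instance of the alternating least squares idea in a multilinear tensor format, here is a minimal sketch of ALS for a rank-R CP (canonical polyadic) decomposition of a 3-way tensor. The choice of the CP format, the rank, the iteration count, and the unfolding convention are illustrative assumptions; the analysis above covers general tensor formats.
```python
# Minimal sketch of CP-ALS for a 3-way tensor T[i, j, k] ~ sum_r A[i,r] B[j,r] C[k,r]:
# each factor matrix is updated in turn by a linear least-squares solve with
# the other two factors held fixed.
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product of A (I x R) and B (J x R) -> (I*J x R)."""
    return np.einsum('ir,jr->ijr', A, B).reshape(-1, A.shape[1])

def cp_als(T, rank, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    I, J, K = T.shape
    A = rng.standard_normal((I, rank))
    B = rng.standard_normal((J, rank))
    C = rng.standard_normal((K, rank))
    # Mode-n unfoldings (rows indexed by mode n, remaining modes in C order).
    T1 = T.reshape(I, J * K)
    T2 = np.moveaxis(T, 1, 0).reshape(J, I * K)
    T3 = np.moveaxis(T, 2, 0).reshape(K, I * J)
    for _ in range(n_iter):
        A = np.linalg.lstsq(khatri_rao(B, C), T1.T, rcond=None)[0].T
        B = np.linalg.lstsq(khatri_rao(A, C), T2.T, rcond=None)[0].T
        C = np.linalg.lstsq(khatri_rao(A, B), T3.T, rcond=None)[0].T
    return A, B, C
```
    Each subproblem is an ordinary linear least-squares problem precisely because the format is multilinear in its factors, which is the structural property the convergence analysis above relies on.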