A Unified Lyapunov Framework for Finite-Sample Analysis of Reinforcement Learning Algorithms

Abstract

Reinforcement learning (RL) is a framework for solving sequential decision-making problems without requiring a model of the environment, and is viewed as a promising approach to achieving artificial intelligence. However, there is a large gap between the empirical successes of reinforcement learning and its theoretical understanding. In this thesis, we make an effort to bridge this gap. More formally, this thesis focuses on designing data-efficient reinforcement learning algorithms and establishing their finite-sample guarantees. Specifically, we aim to answer the following question: if we run a reinforcement learning algorithm with a finite number of samples (or a finite number of iterations), what can we say about the performance of the algorithm's output? The detailed motivation and research background are presented in Chapter 1. The main body of this thesis is divided into three parts.

Part I: Stochastic Approximation. In the first part of the thesis, we study the stochastic approximation (SA) method. Stochastic approximation is the major workhorse of large-scale optimization and machine learning, and is widely used in reinforcement learning for both algorithm design and algorithm analysis. Therefore, understanding the behavior of SA algorithms is of fundamental interest for the analysis of RL algorithms. In Chapter 2 and Chapter 3, we consider Markovian stochastic approximation under a contractive operator and under a strongly pseudo-monotone operator, and establish the corresponding finite-sample guarantees. These two results on stochastic approximation are used in later parts of the thesis to study reinforcement learning algorithms with a tabular representation and with linear function approximation. The main technique we use to analyze these stochastic approximation algorithms is the Lyapunov-drift method. Specifically, we construct novel Lyapunov functions (e.g., a generalized Moreau envelope in the case of stochastic approximation under a contraction assumption) to capture the dynamics of the corresponding stochastic approximation algorithms, and to control the discretization error and the stochastic error. This enables us to derive a one-step drift inequality, which can be applied recursively to establish finite-sample bounds. In Chapter 4, we switch our focus from finite-sample analysis to asymptotic analysis, and characterize the stationary distribution of the centered and scaled iterates of several popular stochastic approximation algorithms. Specifically, we show that for stochastic gradient descent, linear stochastic approximation, and contractive stochastic approximation, the stationary distribution of the centered iterates (after proper scaling) is a Gaussian distribution with mean zero and a covariance matrix given by the unique solution of an appropriate Lyapunov equation. For stochastic approximation beyond these three types, we numerically demonstrate that the stationary distribution may not be Gaussian in general. The main technique used for this asymptotic analysis is also a Lyapunov argument, with the characteristic function serving as the test function.
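To make the setting of Part I concrete, the following is a schematic sketch (not taken verbatim from the thesis) of a contractive stochastic approximation recursion and of the type of one-step drift inequality that a Lyapunov-drift argument produces; the symbols $x_k$, $\alpha_k$, $\bar{F}$, $w_k$, $M$, $x^*$, $c_1$, and $c_2$ are illustrative notation rather than the thesis's own.

% Schematic sketch only: a contractive SA recursion and a generic one-step drift inequality.
\begin{align*}
x_{k+1} &= x_k + \alpha_k\bigl(\bar{F}(x_k) - x_k + w_k\bigr), && \text{$\bar{F}$ a contraction, $\{w_k\}$ (possibly Markovian) noise,}\\
\mathbb{E}\bigl[M(x_{k+1} - x^*)\bigr] &\le (1 - c_1\alpha_k)\,\mathbb{E}\bigl[M(x_k - x^*)\bigr] + c_2\alpha_k^2, && \text{$M$ a Lyapunov function, $x^*$ the fixed point of $\bar{F}$,}
\end{align*}
and applying the second inequality recursively over $k$ yields a finite-sample bound on $\mathbb{E}[M(x_k - x^*)]$.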
Part II: Reinforcement Learning with a Tabular Representation. In the second part of this thesis, we focus on reinforcement learning with a tabular representation. The preliminaries of reinforcement learning are presented in Chapter 5. In Chapter 6 and Chapter 7, we consider TD-learning algorithms for solving the policy evaluation problem, that is, the problem of estimating the performance of a given policy. Solving the policy evaluation problem is an important intermediate step in the popular actor-critic framework for ultimately finding an optimal policy. More specifically, in Chapter 6 we consider on-policy TD-learning algorithms such as $n$-step TD and TD($\lambda$). By establishing finite-sample guarantees for $n$-step TD and TD($\lambda$) as explicit functions of the parameters $n$ and $\lambda$, we provide theoretical insight into the open problem of the efficiency of bootstrapping, which concerns how to choose the parameters $n$ and $\lambda$ so that $n$-step TD and TD($\lambda$) achieve their best performance. In Chapter 7, we study policy evaluation with off-policy sampling, where the policy used to collect samples and the policy whose value function we aim to estimate are different. We provide a finite-sample analysis of a generic off-policy multi-step TD-learning algorithm, which subsumes several popular existing algorithms, such as $Q^\pi(\lambda)$, Tree-Backup($\lambda$), Retrace($\lambda$), and $V$-trace, as special cases. In addition, our finite-sample bounds demonstrate a trade-off between the variance (which arises from the product of importance sampling ratios) and the bias in the limit point (which arises from various modifications of the importance sampling ratios). Understanding this bias-variance trade-off is at the heart of off-policy learning. In Chapter 8, we consider the $Q$-learning algorithm for directly finding an optimal policy and present its finite-sample guarantees. The finite-sample bounds imply an $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity, which is known to be optimal up to a logarithmic factor. In addition, our finite-sample bounds capture the dependence on other important parameters of the reinforcement learning problem, such as the size of the state-action space and the effective horizon.
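For illustration, the standard textbook forms of the $n$-step TD update and the tabular $Q$-learning update studied in Part II are sketched below; the notation ($V_k$, $Q_k$, $\alpha_k$, $\gamma$, $r_k$) is ours, and the precise variants analyzed in Chapters 6-8 may differ in details such as the sampling model and the step-size choice.

% Schematic sketch only: standard n-step TD and tabular Q-learning updates along a sampled trajectory.
\begin{align*}
V_{k+1}(s_k) &= V_k(s_k) + \alpha_k\Bigl(\textstyle\sum_{i=0}^{n-1}\gamma^i r_{k+i} + \gamma^n V_k(s_{k+n}) - V_k(s_k)\Bigr), && \text{($n$-step TD)}\\
Q_{k+1}(s_k, a_k) &= Q_k(s_k, a_k) + \alpha_k\Bigl(r_k + \gamma \max_{a'} Q_k(s_{k+1}, a') - Q_k(s_k, a_k)\Bigr). && \text{($Q$-learning)}
\end{align*}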
Part III: Reinforcement Learning with Linear Function Approximation. In the last part of this thesis, to overcome the curse of dimensionality in reinforcement learning, we consider reinforcement learning with linear function approximation. Specifically, we focus on the off-policy setting, where the deadly triad is present and can result in the instability of reinforcement learning algorithms. In Chapter 9, we consider off-policy TD-learning with linear function approximation, where the deadly triad appears. We design a single time-scale off-policy TD-learning algorithm using generalized importance sampling ratios and multi-step bootstrapping, and establish its finite-sample guarantees. The algorithm is provably convergent in the presence of the deadly triad and does not suffer from the high variance of existing off-policy learning algorithms. The TD-learning algorithm proposed in Chapter 9 is later used in Chapter 10 to solve the policy evaluation sub-problem in a general policy-based framework with various policy update rules, including approximate policy iteration and natural policy gradient. By exploiting only the contraction property and the monotonicity property of the Bellman operator, we establish an overall $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity for a wide class of policy-based methods using off-policy sampling and linear function approximation.

In Chapter 11, we focus on $Q$-learning with linear function approximation (where the deadly triad naturally appears), and establish its finite-sample bounds under an assumption on the discount factor of the problem. In particular, we show that when the discount factor is sufficiently small, the deadly triad can be overcome. In Chapter 12, we further remove the restriction on the discount factor by designing a convergent variant of $Q$-learning with linear function approximation that uses a target network and truncation. This is the first variant of $Q$-learning with linear function approximation that uses a single trajectory of Markovian samples and is provably stable without requiring strong assumptions. In addition, the algorithm achieves the optimal $\mathcal{O}(\epsilon^{-2})$ sample complexity (which matches that of $Q$-learning in the tabular setting) up to a function approximation error.
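As a rough illustration of the linear function approximation setting in Part III, one may parameterize the value function as $Q_\theta(s,a) = \phi(s,a)^\top \theta$ with a feature map $\phi$ and update $\theta$ with a semi-gradient rule that bootstraps from a slowly updated target parameter $\theta^-$. This is only a generic sketch of the standard ingredients, with illustrative notation of our own; the exact algorithm of Chapter 12, including the truncation step, is specified in the thesis.

% Schematic sketch only: semi-gradient Q-learning with linear features and a target parameter.
\[
\theta_{k+1} = \theta_k + \alpha_k\,\phi(s_k, a_k)\Bigl(r_k + \gamma \max_{a'} \phi(s_{k+1}, a')^\top \theta^- - \phi(s_k, a_k)^\top \theta_k\Bigr),
\]
where $\theta^-$ is periodically synchronized to the current iterate $\theta_k$.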
