    Finite-Time Analysis of Asynchronous Stochastic Approximation and Q-Learning

    We consider a general asynchronous Stochastic Approximation (SA) scheme featuring a weighted infinity-norm contractive operator, and prove a bound on its finite-time convergence rate on a single trajectory. Additionally, we specialize the result to asynchronou

    Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model

    We investigate the sample efficiency of reinforcement learning in a $\gamma$-discounted infinite-horizon Markov decision process (MDP) with state space $\mathcal{S}$ and action space $\mathcal{A}$, assuming access to a generative model. Despite a number of prior works tackling this problem, a complete picture of the trade-offs between sample complexity and statistical accuracy is yet to be determined. In particular, prior results suffer from a sample size barrier, in the sense that their claimed statistical guarantees hold only when the sample size exceeds at least $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}$ (up to some log factor). The current paper overcomes this barrier by certifying the minimax optimality of model-based reinforcement learning as soon as the sample size exceeds the order of $\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}$ (modulo some log factor). More specifically, a perturbed model-based planning algorithm provably finds an $\varepsilon$-optimal policy with an order of $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3\varepsilon^2}\log\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)\varepsilon}$ samples for any $\varepsilon \in (0, \frac{1}{1-\gamma}]$. Along the way, we derive improved (instance-dependent) guarantees for model-based policy evaluation. To the best of our knowledge, this work provides the first minimax-optimal guarantee in a generative model that accommodates the entire range of sample sizes (beyond which finding a meaningful policy is information theoretically impossible).
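
    As a rough illustration of model-based planning with a generative model, the sketch below draws a fixed number of next-state samples per state-action pair, builds an empirical transition kernel, slightly perturbs the rewards, and runs value iteration on the resulting empirical MDP. The abstract only names a "perturbed model-based planning algorithm"; the Uniform(0, xi) reward perturbation, the function name, and the `sample_next_state(s, a, rng)` interface are assumptions made for this example, not the paper's exact procedure.

```python
import random


def perturbed_model_based_planning(sample_next_state, rewards, gamma,
                                   n_samples, xi, n_iters=500, seed=0):
    """Plan in an empirical MDP built from generative-model samples.

    sample_next_state(s, a, rng) -> next state index (hypothetical interface).
    rewards[s][a] is the known deterministic reward.
    """
    rng = random.Random(seed)
    n_states, n_actions = len(rewards), len(rewards[0])

    # Empirical transition kernel: n_samples generative-model calls per (s, a).
    P_hat = [[[0.0] * n_states for _ in range(n_actions)]
             for _ in range(n_states)]
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(n_samples):
                P_hat[s][a][sample_next_state(s, a, rng)] += 1.0 / n_samples

    # Small random reward perturbation (assumed here to be Uniform(0, xi)).
    r_pert = [[rewards[s][a] + rng.uniform(0.0, xi) for a in range(n_actions)]
              for s in range(n_states)]

    def q_values(s, V):
        return [r_pert[s][a]
                + gamma * sum(p * v for p, v in zip(P_hat[s][a], V))
                for a in range(n_actions)]

    # Value iteration on the perturbed empirical MDP.
    V = [0.0] * n_states
    for _ in range(n_iters):
        V = [max(q_values(s, V)) for s in range(n_states)]

    # Greedy policy with respect to the final value estimate.
    policy = []
    for s in range(n_states):
        q = q_values(s, V)
        policy.append(max(range(n_actions), key=q.__getitem__))
    return policy, V
```

    The total number of generative-model calls in this sketch is `n_samples` times the number of state-action pairs, which is the quantity the sample complexity bounds above refer to.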

    Distributed Reinforcement Learning in Multi-Agent Networked Systems

    We study distributed reinforcement learning (RL) for a network of agents. The objective is to find localized policies that maximize the (discounted) global reward. In general, scalability is a challenge in this setting because the size of the global state/action space can be exponential in the number of agents. Scalable algorithms are only known in cases where dependencies are local, e.g., between neighbors. In this work, we propose a Scalable Actor Critic framework that applies in settings where the dependencies are non-local and provide a finite-time error bound that shows how the convergence rate depends on the depth of the dependencies in the network. Additionally, as a byproduct of our analysis, we obtain novel finite-time convergence results for a general stochastic approximation scheme and for temporal difference learning with state aggregation that apply beyond the setting of RL in networked systems.

    Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction

    Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP), based on a single trajectory of Markovian samples induced by a behavior policy. Focusing on a $\gamma$-discounted MDP with state space $\mathcal{S}$ and action space $\mathcal{A}$, we demonstrate that the $\ell_{\infty}$-based sample complexity of classical asynchronous Q-learning -- namely, the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function -- is at most on the order of \begin{equation*} \frac{1}{\mu_{\mathsf{min}}(1-\gamma)^5\varepsilon^2}+ \frac{t_{\mathsf{mix}}}{\mu_{\mathsf{min}}(1-\gamma)} \end{equation*} up to some logarithmic factor, provided that a proper constant learning rate is adopted. Here, $t_{\mathsf{mix}}$ and $\mu_{\mathsf{min}}$ denote respectively the mixing time and the minimum state-action occupancy probability of the sample trajectory. The first term of this bound matches the complexity in the case with independent samples drawn from the stationary distribution of the trajectory. The second term reflects the expense taken for the empirical distribution of the Markovian trajectory to reach a steady state, which is incurred at the very beginning and becomes amortized as the algorithm runs. Encouragingly, the above bound improves upon the state-of-the-art result by a factor of at least $|\mathcal{S}||\mathcal{A}|$. Further, the scaling on the discount complexity can be improved by means of variance reduction. Comment: accepted in part to Neural Information Processing Systems (NeurIPS) 202
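
    To make the setting concrete, here is a minimal sketch of classical asynchronous Q-learning with a constant learning rate: at each step only the entry of the Q-function for the state-action pair visited along the single trajectory is updated. The uniform-random behavior policy, the tabular `P`/`R` representation, and the function name are assumptions for illustration; the abstract does not fix them, and the variance-reduced variant is not shown.

```python
import random


def async_q_learning(P, R, gamma, eta, num_steps, seed=0):
    """Asynchronous Q-learning along one Markovian trajectory.

    P[s][a] is a list of next-state probabilities and R[s][a] a
    deterministic reward. The behavior policy is uniform over actions
    (an assumption made for this sketch).
    """
    rng = random.Random(seed)
    n_states, n_actions = len(P), len(P[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    s = 0  # arbitrary initial state
    for _ in range(num_steps):
        a = rng.randrange(n_actions)  # behavior policy
        s_next = rng.choices(range(n_states), weights=P[s][a])[0]
        # Only the visited entry (s, a) is updated; all others stay put.
        target = R[s][a] + gamma * max(Q[s_next])
        Q[s][a] = (1.0 - eta) * Q[s][a] + eta * target
        s = s_next  # continue along the same trajectory
    return Q
```

    In this sketch, how often each pair `(s, a)` gets updated is governed by the trajectory itself, which is where quantities like the minimum occupancy probability and the mixing time enter the bound.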

    Oracle-free Reinforcement Learning in Mean-Field Games along a Single Sample Path

    We consider online reinforcement learning in Mean-Field Games (MFGs). Unlike traditional approaches, we alleviate the need for a mean-field oracle by developing an algorithm that approximates the Mean-Field Equilibrium (MFE) using the single sample path of the generic agent. We call this {\it Sandbox Learning}, as it can be used as a warm-start for any agent learning in a multi-agent non-cooperative setting. We adopt a two time-scale approach in which an online fixed-point recursion for the mean-field operates on a slower time-scale, in tandem with a control policy update on a faster time-scale for the generic agent. Given that the underlying Markov Decision Process (MDP) of the agent is communicating, we provide finite sample convergence guarantees in terms of convergence of the mean-field and control policy to the mean-field equilibrium. The sample complexity of the Sandbox learning algorithm is $\tilde{\mathcal{O}}(\epsilon^{-4})$ where $\epsilon$ is the MFE approximation error. This is similar to works which assume access to an oracle. Finally, we empirically demonstrate the effectiveness of the sandbox learning algorithm in diverse scenarios, including those where the MDP does not necessarily have a single communicating class. Comment: Accepted for publication in AISTATS 202

    Is Pessimism Provably Efficient for Offline RL?

    We study offline reinforcement learning (RL), which aims to learn an optimal policy based on a dataset collected a priori. Due to the lack of further interactions with the environment, offline RL suffers from the insufficient coverage of the dataset, which eludes most existing theoretical analysis. In this paper, we propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function. Such a penalty function simply flips the sign of the bonus function for promoting exploration in online RL, which makes it easily implementable and compatible with general function approximators. Without assuming the sufficient coverage of the dataset, we establish a data-dependent upper bound on the suboptimality of PEVI for general Markov decision processes (MDPs). When specialized to linear MDPs, it matches the information-theoretic lower bound up to multiplicative factors of the dimension and horizon. In other words, pessimism is not only provably efficient but also minimax optimal. In particular, given the dataset, the learned policy serves as the ``best effort'' among all policies, as no other policies can do better. Our theoretical analysis identifies the critical role of pessimism in eliminating a notion of spurious correlation, which emerges from the ``irrelevant'' trajectories that are less covered by the dataset and not informative for the optimal policy. Comment: 53 pages, 3 figures
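
    The idea of subtracting an uncertainty penalty during value iteration can be sketched in a tabular, finite-horizon setting as follows. The count-based penalty `beta / sqrt(n)`, the dataset layout, and the function name are assumptions chosen for illustration; the paper works with a general uncertainty quantifier and function approximation, which this sketch does not attempt to reproduce.

```python
import math
from collections import defaultdict


def pevi_tabular(dataset, n_states, n_actions, H, beta):
    """Pessimistic value iteration on an offline dataset (tabular sketch).

    dataset: iterable of (h, s, a, r, s_next) tuples with h in 0..H-1.
    beta scales the assumed count-based penalty beta / sqrt(n).
    """
    counts = defaultdict(int)
    reward_sum = defaultdict(float)
    next_counts = defaultdict(lambda: defaultdict(int))
    for h, s, a, r, s_next in dataset:
        counts[h, s, a] += 1
        reward_sum[h, s, a] += r
        next_counts[h, s, a][s_next] += 1

    V = [[0.0] * n_states for _ in range(H + 1)]  # V[H] is all zeros
    policy = [[0] * n_states for _ in range(H)]
    for h in reversed(range(H)):
        for s in range(n_states):
            best_q, best_a = -1.0, 0
            for a in range(n_actions):
                n = counts[h, s, a]
                if n == 0:
                    q = 0.0  # unseen pair: fully pessimistic value
                else:
                    r_hat = reward_sum[h, s, a] / n
                    pv = sum(c * V[h + 1][sn]
                             for sn, c in next_counts[h, s, a].items()) / n
                    penalty = beta / math.sqrt(n)  # uncertainty quantifier
                    # Subtract the penalty, then clip to [0, H - h].
                    q = min(max(r_hat + pv - penalty, 0.0), float(H - h))
                if q > best_q:
                    best_q, best_a = q, a
            V[h][s] = best_q
            policy[h][s] = best_a
    return policy, V
```

    With `beta = 0` this reduces to plain value iteration on the empirical model, which may favor actions that look good only because they were observed a handful of times; a positive `beta` discounts such poorly covered actions.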

    A Lyapunov Theory for Finite-Sample Guarantees of Asynchronous Q-Learning and TD-Learning Variants

    This paper develops a unified framework to study finite-sample convergence guarantees of a large class of value-based asynchronous Reinforcement Learning (RL) algorithms. We do this by first reformulating the RL algorithms as Markovian Stochastic Approximation (SA) algorithms to solve fixed-point equations. We then develop a Lyapunov analysis and derive mean-square error bounds on the convergence of the Markovian SA. Based on this central result, we establish finite-sample mean-square convergence bounds for asynchronous RL algorithms such as $Q$-learning, $n$-step TD, TD$(\lambda)$, and off-policy TD algorithms including V-trace. As a by-product, by analyzing the performance bounds of the TD$(\lambda)$ (and $n$-step TD) algorithm for general $\lambda$ (and $n$), we demonstrate a bias-variance trade-off, i.e., efficiency of bootstrapping in RL. This was first posed as an open problem in [37].
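
    For readers less familiar with TD$(\lambda)$, the following is a minimal tabular sketch of on-policy TD$(\lambda)$ with accumulating eligibility traces on a Markov reward process, run along a single trajectory with a constant step size. The tabular representation, the accumulating-trace variant, and the function name are assumptions for illustration; the paper's framework also covers $n$-step TD, $Q$-learning, and off-policy variants that are not shown here.

```python
import random


def td_lambda(P, r, gamma, lam, alpha, num_steps, seed=0):
    """Tabular TD(lambda) with accumulating traces on one trajectory.

    P[s] is a list of next-state probabilities under the evaluated
    policy and r[s] is the reward received when leaving state s.
    """
    rng = random.Random(seed)
    n = len(P)
    V = [0.0] * n
    z = [0.0] * n  # eligibility traces
    s = 0
    for _ in range(num_steps):
        s_next = rng.choices(range(n), weights=P[s])[0]
        delta = r[s] + gamma * V[s_next] - V[s]  # one-step TD error
        # Decay all traces, then bump the trace of the current state.
        for i in range(n):
            z[i] *= gamma * lam
        z[s] += 1.0
        # Every state is updated in proportion to its trace.
        for i in range(n):
            V[i] += alpha * delta * z[i]
        s = s_next
    return V
```

    Setting `lam = 0` recovers one-step TD, while values closer to 1 spread each TD error further back along the trajectory; this is the knob behind the trade-off discussed above.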