27 research outputs found

    Optimal Sample Complexity for Average Reward Markov Decision Processes

    Full text link
    We settle the sample complexity of policy learning for the maximization of the long-run average reward associated with a uniformly ergodic Markov decision process (MDP), assuming a generative model. In this context, the existing literature provides a sample complexity upper bound of $\widetilde{O}(|S||A|t_{\text{mix}}^{2}\epsilon^{-2})$ and a lower bound of $\Omega(|S||A|t_{\text{mix}}\epsilon^{-2})$. In these expressions, $|S|$ and $|A|$ denote the cardinalities of the state and action spaces respectively, $t_{\text{mix}}$ serves as a uniform upper bound on the total variation mixing times, and $\epsilon$ is the error tolerance. A gap of a factor of $t_{\text{mix}}$ therefore remains to be bridged. Our primary contribution is an estimator for the optimal policy of average reward MDPs with a sample complexity of $\widetilde{O}(|S||A|t_{\text{mix}}\epsilon^{-2})$, matching the lower bound in the literature. This is achieved by combining algorithmic ideas of Jin and Sidford (2021) with those of Li et al. (2020).
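    For quick reference, the three bounds discussed in this abstract can be lined up as follows (a restatement of the quantities above, with logarithmic factors absorbed into the tilde notation):

        \[
        \begin{aligned}
        \text{prior upper bound:} &\quad \widetilde{O}\big(|S|\,|A|\,t_{\text{mix}}^{2}\,\epsilon^{-2}\big) \\
        \text{known lower bound:} &\quad \Omega\big(|S|\,|A|\,t_{\text{mix}}\,\epsilon^{-2}\big) \\
        \text{this work (upper bound):} &\quad \widetilde{O}\big(|S|\,|A|\,t_{\text{mix}}\,\epsilon^{-2}\big)
        \end{aligned}
        \]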

    Finite-Time Analysis of Asynchronous Stochastic Approximation and Q-Learning

    Get PDF
    We consider a general asynchronous Stochastic Approximation (SA) scheme featuring a weighted infinity-norm contractive operator, and prove a bound on its finite-time convergence rate on a single trajectory. Additionally, we specialize the result to asynchronous Q-learning.
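    The asynchronous regime referred to here is the single-trajectory setting in which only the currently visited state-action entry is updated at each step. Below is a minimal tabular Q-learning sketch of that regime; the function name, the toy-environment interface (reset() -> state, step(action) -> (next_state, reward, done)), the epsilon-greedy behaviour policy, and the 1/visit-count step-size are illustrative assumptions, not the scheme analysed in the paper.

        import numpy as np

        def async_q_learning(env, n_states, n_actions, gamma=0.99,
                             n_steps=100_000, epsilon=0.1, seed=0):
            """Single-trajectory (asynchronous) tabular Q-learning: at every step,
            only the (state, action) entry that was actually visited is updated."""
            rng = np.random.default_rng(seed)
            Q = np.zeros((n_states, n_actions))
            visits = np.zeros((n_states, n_actions))    # per-entry counts for the step-size
            s = env.reset()
            for _ in range(n_steps):
                # epsilon-greedy behaviour policy along the single trajectory
                a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
                s_next, r, done = env.step(a)           # assumed toy-environment interface
                visits[s, a] += 1
                alpha = 1.0 / visits[s, a]              # illustrative per-entry step-size
                target = r + gamma * np.max(Q[s_next]) * (not done)
                Q[s, a] += alpha * (target - Q[s, a])   # asynchronous: one entry updated
                s = env.reset() if done else s_next
            return Q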

    Momentum in Reinforcement Learning

    Get PDF
    We adapt the concept of momentum from optimization to reinforcement learning. Viewing state-action value functions as an analog of gradients in optimization, we interpret momentum as an average of consecutive q-functions. We derive Momentum Value Iteration (MoVI), a variation of Value Iteration that incorporates this momentum idea. Our analysis shows that this allows MoVI to average errors over successive iterations. We show that the proposed approach can be readily extended to deep learning. Specifically, we propose a simple improvement of DQN based on MoVI and evaluate it on Atari games.
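    A minimal sketch of the momentum-as-averaging idea for a tabular MDP with reward table R[s, a] and transition kernel P[s, a, s']. It follows the abstract's description (keep a running average of consecutive q-functions and act greedily with respect to that average); the exact MoVI recursion in the paper may differ in its details, and the function name and iteration counts are illustrative.

        import numpy as np

        def movi(P, R, gamma=0.9, n_iters=200):
            """Momentum Value Iteration sketch: iterate a Bellman-style update on q,
            maintain a running average h of the successive q-functions, and act
            greedily with respect to h (the "momentum") rather than the latest q."""
            n_states, n_actions = R.shape
            q = np.zeros((n_states, n_actions))
            h = np.zeros((n_states, n_actions))     # average of consecutive q-functions
            for k in range(1, n_iters + 1):
                pi = np.argmax(h, axis=1)           # greedy policy w.r.t. the average
                v = q[np.arange(n_states), pi]      # evaluate the latest q along pi
                q = R + gamma * P @ v               # Bellman update (P has shape S x A x S)
                h += (q - h) / k                    # incremental average of q_1, ..., q_k
            return np.argmax(h, axis=1), h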

    Optimal Convergence Rate for Exact Policy Mirror Descent in Discounted Markov Decision Processes

    Full text link
    Policy Mirror Descent (PMD) is a general family of algorithms that covers a wide range of novel and fundamental methods in reinforcement learning. Motivated by the instability of policy iteration (PI) with inexact policy evaluation, unregularised PMD algorithmically regularises the policy improvement step of PI without regularising the objective function. With exact policy evaluation, PI is known to converge linearly with a rate given by the discount factor $\gamma$ of a Markov Decision Process. In this work, we bridge the gap between PI and PMD with exact policy evaluation and show that the dimension-free $\gamma$-rate of PI can be achieved by the general family of unregularised PMD algorithms under an adaptive step-size. We show that both the rate and step-size are unimprovable for PMD: we provide matching lower bounds that demonstrate that the $\gamma$-rate is optimal for PMD methods as well as PI and that the adaptive step-size is necessary to achieve it. Our work is the first to relate PMD to rate-optimality and step-size necessity. Our study of the convergence of PMD avoids the use of the performance difference lemma, which leads to a direct analysis of independent interest. We also extend the analysis to the inexact setting and establish the first dimension-optimal sample complexity for unregularised PMD under a generative model, improving upon the best-known result.
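    As one illustration of the family, here is a sketch of exact PMD instantiated with the KL (negative-entropy) mirror map, which yields the familiar multiplicative-weights policy update. The policy evaluation step is exact (a linear solve), matching the setting of the paper, but the function name and the growing step-size schedule below are placeholders and not the adaptive rule whose necessity the paper establishes.

        import numpy as np

        def exact_pmd_kl(P, R, gamma=0.9, n_iters=100, step_sizes=None):
            """Exact PMD with the KL (negative-entropy) mirror map:
            pi_{k+1}(a|s) is proportional to pi_k(a|s) * exp(eta_k * Q^{pi_k}(s, a))."""
            n_states, n_actions = R.shape
            pi = np.full((n_states, n_actions), 1.0 / n_actions)       # uniform start
            if step_sizes is None:
                step_sizes = [10.0 * (k + 1) for k in range(n_iters)]  # placeholder schedule
            for k in range(n_iters):
                # exact policy evaluation: solve (I - gamma * P_pi) v = r_pi
                P_pi = np.einsum('sa,sat->st', pi, P)
                r_pi = np.sum(pi * R, axis=1)
                v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
                Q = R + gamma * P @ v                                  # state-action values
                # policy improvement: multiplicative-weights / softmax step on Q
                logits = np.log(pi + 1e-300) + step_sizes[k] * Q       # guard against log(0)
                logits -= logits.max(axis=1, keepdims=True)            # numerical stability
                pi = np.exp(logits)
                pi /= pi.sum(axis=1, keepdims=True)
            return pi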

    Optimal Sample Complexity of Reinforcement Learning for Uniformly Ergodic Discounted Markov Decision Processes

    Full text link
    We consider the optimal sample complexity theory of tabular reinforcement learning (RL) for controlling the infinite horizon discounted reward in a Markov decision process (MDP). Optimal min-max complexity results have been developed for tabular RL in this setting, leading to a sample complexity dependence on $\gamma$ and $\epsilon$ of the form $\widetilde{\Theta}((1-\gamma)^{-3}\epsilon^{-2})$, where $\gamma$ is the discount factor and $\epsilon$ is the solution error tolerance. However, in many applications of interest, the optimal policy (or all policies) will induce mixing. We show that in these settings the optimal min-max complexity is $\widetilde{\Theta}(t_{\text{minorize}}(1-\gamma)^{-2}\epsilon^{-2})$, where $t_{\text{minorize}}$ is a measure of mixing that is within an equivalent factor of the total variation mixing time. Our analysis is based on regeneration-type ideas that we believe are of independent interest, since they can be used to study related problems for general state space MDPs.
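    Placing the two minimax rates from the abstract side by side makes the improvement visible: under uniform ergodicity, one factor of $(1-\gamma)^{-1}$ in the general tabular rate is replaced by the minorization time $t_{\text{minorize}}$.

        \[
        \underbrace{\widetilde{\Theta}\big((1-\gamma)^{-3}\epsilon^{-2}\big)}_{\text{general tabular discounted MDPs}}
        \qquad\text{vs.}\qquad
        \underbrace{\widetilde{\Theta}\big(t_{\text{minorize}}\,(1-\gamma)^{-2}\epsilon^{-2}\big)}_{\text{uniformly ergodic case}}
        \]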
