27 research outputs found

    Optimal Sample Complexity for Average Reward Markov Decision Processes

    Full text link
    We settle the sample complexity of policy learning for the maximization of the long-run average reward associated with a uniformly ergodic Markov decision process (MDP), assuming a generative model. In this context, the existing literature provides a sample complexity upper bound of $\widetilde{O}(|S||A|t_{\text{mix}}^{2}\epsilon^{-2})$ and a lower bound of $\Omega(|S||A|t_{\text{mix}}\epsilon^{-2})$. In these expressions, $|S|$ and $|A|$ denote the cardinalities of the state and action spaces respectively, $t_{\text{mix}}$ serves as a uniform upper bound on the total variation mixing times, and $\epsilon$ is the error tolerance. A gap of a factor of $t_{\text{mix}}$ therefore remains to be bridged. Our primary contribution is an estimator for the optimal policy of average reward MDPs with a sample complexity of $\widetilde{O}(|S||A|t_{\text{mix}}\epsilon^{-2})$, matching the lower bound in the literature. This is achieved by combining algorithmic ideas of Jin and Sidford (2021) with those of Li et al. (2020).
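    For quick reference, the three bounds discussed in this abstract can be lined up as follows (a restatement of the quantities above, with logarithmic factors absorbed into the tilde notation):

        \[
        \begin{aligned}
        \text{prior upper bound:} &\quad \widetilde{O}\big(|S|\,|A|\,t_{\text{mix}}^{2}\,\epsilon^{-2}\big) \\
        \text{known lower bound:} &\quad \Omega\big(|S|\,|A|\,t_{\text{mix}}\,\epsilon^{-2}\big) \\
        \text{this work (upper bound):} &\quad \widetilde{O}\big(|S|\,|A|\,t_{\text{mix}}\,\epsilon^{-2}\big)
        \end{aligned}
        \]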

    Finite-Time Analysis of Asynchronous Stochastic Approximation and Q-Learning

    Get PDF
    We consider a general asynchronous Stochastic Approximation (SA) scheme featuring a weighted infinity-norm contractive operator, and prove a bound on its finite-time convergence rate on a single trajectory. Additionally, we specialize the result to asynchronous Q-learning.
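    The asynchronous regime referred to here is the single-trajectory setting in which only the currently visited state-action entry is updated at each step. Below is a minimal tabular Q-learning sketch of that regime; the function name, the toy-environment interface (reset() -> state, step(action) -> (next_state, reward, done)), the epsilon-greedy behaviour policy, and the 1/visit-count step-size are illustrative assumptions, not the scheme analysed in the paper.

        import numpy as np

        def async_q_learning(env, n_states, n_actions, gamma=0.99,
                             n_steps=100_000, epsilon=0.1, seed=0):
            """Single-trajectory (asynchronous) tabular Q-learning: at every step,
            only the (state, action) entry that was actually visited is updated."""
            rng = np.random.default_rng(seed)
            Q = np.zeros((n_states, n_actions))
            visits = np.zeros((n_states, n_actions))    # per-entry counts for the step-size
            s = env.reset()
            for _ in range(n_steps):
                # epsilon-greedy behaviour policy along the single trajectory
                a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
                s_next, r, done = env.step(a)           # assumed toy-environment interface
                visits[s, a] += 1
                alpha = 1.0 / visits[s, a]              # illustrative per-entry step-size
                target = r + gamma * np.max(Q[s_next]) * (not done)
                Q[s, a] += alpha * (target - Q[s, a])   # asynchronous: one entry updated
                s = env.reset() if done else s_next
            return Q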

    Momentum in Reinforcement Learning

    Get PDF
    We adapt the concept of momentum from optimization to reinforcement learning. Viewing state-action value functions as an analog of gradients in optimization, we interpret momentum as an average of consecutive q-functions. We derive Momentum Value Iteration (MoVI), a variation of Value Iteration that incorporates this momentum idea. Our analysis shows that this allows MoVI to average errors over successive iterations. We show that the proposed approach can be readily extended to deep learning. Specifically, we propose a simple improvement of DQN based on MoVI and evaluate it on Atari games.
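    A minimal sketch of the momentum-as-averaging idea for a tabular MDP with reward table R[s, a] and transition kernel P[s, a, s']. It follows the abstract's description (keep a running average of consecutive q-functions and act greedily with respect to that average); the exact MoVI recursion in the paper may differ in its details, and the function name and iteration counts are illustrative.

        import numpy as np

        def movi(P, R, gamma=0.9, n_iters=200):
            """Momentum Value Iteration sketch: iterate a Bellman-style update on q,
            maintain a running average h of the successive q-functions, and act
            greedily with respect to h (the "momentum") rather than the latest q."""
            n_states, n_actions = R.shape
            q = np.zeros((n_states, n_actions))
            h = np.zeros((n_states, n_actions))     # average of consecutive q-functions
            for k in range(1, n_iters + 1):
                pi = np.argmax(h, axis=1)           # greedy policy w.r.t. the average
                v = q[np.arange(n_states), pi]      # evaluate the latest q along pi
                q = R + gamma * P @ v               # Bellman update (P has shape S x A x S)
                h += (q - h) / k                    # incremental average of q_1, ..., q_k
            return np.argmax(h, axis=1), h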

    Optimal Convergence Rate for Exact Policy Mirror Descent in Discounted Markov Decision Processes

    Full text link
    Policy Mirror Descent (PMD) is a general family of algorithms that covers a wide range of novel and fundamental methods in reinforcement learning. Motivated by the instability of policy iteration (PI) with inexact policy evaluation, unregularised PMD algorithmically regularises the policy improvement step of PI without regularising the objective function. With exact policy evaluation, PI is known to converge linearly with a rate given by the discount factor $\gamma$ of a Markov Decision Process. In this work, we bridge the gap between PI and PMD with exact policy evaluation and show that the dimension-free $\gamma$-rate of PI can be achieved by the general family of unregularised PMD algorithms under an adaptive step-size. We show that both the rate and step-size are unimprovable for PMD: we provide matching lower bounds that demonstrate that the $\gamma$-rate is optimal for PMD methods as well as PI and that the adaptive step-size is necessary to achieve it. Our work is the first to relate PMD to rate-optimality and step-size necessity. Our study of the convergence of PMD avoids the use of the performance difference lemma, which leads to a direct analysis of independent interest. We also extend the analysis to the inexact setting and establish the first dimension-optimal sample complexity for unregularised PMD under a generative model, improving upon the best-known result.
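    As one illustration of the family, here is a sketch of exact PMD instantiated with the KL (negative-entropy) mirror map, which yields the familiar multiplicative-weights policy update. The policy evaluation step is exact (a linear solve), matching the setting of the paper, but the function name and the growing step-size schedule below are placeholders and not the adaptive rule whose necessity the paper establishes.

        import numpy as np

        def exact_pmd_kl(P, R, gamma=0.9, n_iters=100, step_sizes=None):
            """Exact PMD with the KL (negative-entropy) mirror map:
            pi_{k+1}(a|s) is proportional to pi_k(a|s) * exp(eta_k * Q^{pi_k}(s, a))."""
            n_states, n_actions = R.shape
            pi = np.full((n_states, n_actions), 1.0 / n_actions)       # uniform start
            if step_sizes is None:
                step_sizes = [10.0 * (k + 1) for k in range(n_iters)]  # placeholder schedule
            for k in range(n_iters):
                # exact policy evaluation: solve (I - gamma * P_pi) v = r_pi
                P_pi = np.einsum('sa,sat->st', pi, P)
                r_pi = np.sum(pi * R, axis=1)
                v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
                Q = R + gamma * P @ v                                  # state-action values
                # policy improvement: multiplicative-weights / softmax step on Q
                logits = np.log(pi + 1e-300) + step_sizes[k] * Q       # guard against log(0)
                logits -= logits.max(axis=1, keepdims=True)            # numerical stability
                pi = np.exp(logits)
                pi /= pi.sum(axis=1, keepdims=True)
            return pi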

    Optimal Sample Complexity of Reinforcement Learning for Uniformly Ergodic Discounted Markov Decision Processes

    Full text link
    We consider the optimal sample complexity theory of tabular reinforcement learning (RL) for controlling the infinite horizon discounted reward in a Markov decision process (MDP). Optimal min-max complexity results have been developed for tabular RL in this setting, leading to a sample complexity dependence on $\gamma$ and $\epsilon$ of the form $\widetilde{\Theta}((1-\gamma)^{-3}\epsilon^{-2})$, where $\gamma$ is the discount factor and $\epsilon$ is the solution error tolerance. However, in many applications of interest, the optimal policy (or all policies) will induce mixing. We show that in these settings the optimal min-max complexity is $\widetilde{\Theta}(t_{\text{minorize}}(1-\gamma)^{-2}\epsilon^{-2})$, where $t_{\text{minorize}}$ is a measure of mixing that is within an equivalent factor of the total variation mixing time. Our analysis is based on regeneration-type ideas that we believe are of independent interest, since they can be used to study related problems for general state space MDPs.
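    Placing the two minimax rates from the abstract side by side makes the improvement visible: under uniform ergodicity, one factor of $(1-\gamma)^{-1}$ in the general tabular rate is replaced by the minorization time $t_{\text{minorize}}$.

        \[
        \underbrace{\widetilde{\Theta}\big((1-\gamma)^{-3}\epsilon^{-2}\big)}_{\text{general tabular discounted MDPs}}
        \qquad\text{vs.}\qquad
        \underbrace{\widetilde{\Theta}\big(t_{\text{minorize}}\,(1-\gamma)^{-2}\epsilon^{-2}\big)}_{\text{uniformly ergodic case}}
        \]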
