Optimal Sample Complexity for Average Reward Markov Decision Processes
We settle the sample complexity of policy learning for the maximization of
the long-run average reward associated with a uniformly ergodic Markov decision
process (MDP), assuming a generative model. In this context, the existing
literature provides a sample complexity upper bound of order |S||A| t_mix² / ε²
and a lower bound of order |S||A| t_mix / ε² (both up to logarithmic factors).
In these expressions, |S| and |A| denote the cardinalities of the state and
action spaces respectively, t_mix serves as a uniform upper limit for the total
variation mixing times, and ε signifies the error tolerance. Therefore, a
notable gap of a factor of t_mix still remains to be bridged. Our primary
contribution is to establish an estimator for the optimal policy of average
reward MDPs with a sample complexity of order |S||A| t_mix / ε², effectively
reaching the lower bound in the literature. This is achieved by combining
algorithmic ideas in Jin and Sidford (2021) with those of Li et al. (2020).
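In symbols, the bounds discussed above can be summarised as follows (a reconstruction in the abstract's notation, with logarithmic factors suppressed; exact constants are not reproduced here):

```latex
\begin{align*}
\text{prior upper bound:}\quad & \widetilde{O}\!\left(\frac{|S|\,|A|\,t_{\mathrm{mix}}^{2}}{\epsilon^{2}}\right)\\
\text{prior lower bound:}\quad & \Omega\!\left(\frac{|S|\,|A|\,t_{\mathrm{mix}}}{\epsilon^{2}}\right)\\
\text{this paper:}\quad & \widetilde{O}\!\left(\frac{|S|\,|A|\,t_{\mathrm{mix}}}{\epsilon^{2}}\right)
\quad\text{(matches the lower bound up to logarithmic factors)}
\end{align*}
```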
Finite-Time Analysis of Asynchronous Stochastic Approximation and Q-Learning
We consider a general asynchronous Stochastic Approximation (SA) scheme featuring a weighted infinity-norm contractive operator, and prove a bound on its finite-time convergence rate on a single trajectory. Additionally, we specialize the result to asynchronous Q-learning.
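For reference, the asynchronous Q-learning scheme that the result specialises to can be sketched as below; the environment interface, behaviour policy, and step-size schedule are illustrative assumptions rather than the paper's exact setting:

```python
import numpy as np

def asynchronous_q_learning(env, num_steps, gamma=0.99, eps=0.1, seed=0):
    """Tabular asynchronous Q-learning: only the visited (state, action) entry
    is updated at each step along a single trajectory.

    Assumes an `env` exposing num_states, num_actions, reset() -> state,
    and step(state, action) -> (next_state, reward)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.num_states, env.num_actions))
    visits = np.zeros_like(Q)                 # per-entry counts for the step-size
    s = env.reset()
    for _ in range(num_steps):
        # epsilon-greedy behaviour policy (illustrative choice)
        if rng.random() < eps:
            a = int(rng.integers(env.num_actions))
        else:
            a = int(np.argmax(Q[s]))
        s_next, r = env.step(s, a)
        visits[s, a] += 1
        alpha = 1.0 / visits[s, a]            # diminishing step-size (one common choice)
        # asynchronous update: a noisy application of the Bellman optimality operator
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next
    return Q
```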
Momentum in Reinforcement Learning
We adapt the concept of momentum from optimization to reinforcement learning.
Viewing the state-action value functions as an analog of the gradients in
optimization, we interpret momentum as an average of consecutive Q-functions.
We derive Momentum Value Iteration (MoVI), a variation of Value Iteration that
incorporates this momentum idea. Our analysis shows that this allows MoVI to
average errors over successive iterations. We show that the proposed approach
can be readily extended to deep learning. Specifically, we propose a simple
improvement on DQN based on MoVI and evaluate it on Atari games.
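A minimal tabular sketch of the averaging idea follows, assuming a known MDP given as a transition tensor P and reward table R; this is one plausible instantiation of "momentum as an average of consecutive Q-functions" (act greedily with respect to the running average), and it may differ in details from the exact MoVI recursion in the paper:

```python
import numpy as np

def momentum_value_iteration(P, R, gamma, num_iters):
    """P: (S, A, S) transition tensor, R: (S, A) rewards, for a known tabular MDP.

    Plain value iteration applies the Bellman operator to q directly; here the
    greedy policy is taken with respect to h, a running average of the past
    q iterates, which is the momentum interpretation described in the abstract."""
    S, A, _ = P.shape
    q = np.zeros((S, A))
    h = np.zeros((S, A))                  # average of consecutive Q-functions
    for k in range(num_iters):
        pi = np.argmax(h, axis=1)         # greedy policy w.r.t. the averaged values
        v = q[np.arange(S), pi]           # value of the current q under that policy
        q = R + gamma * P @ v             # one application of the Bellman operator
        h = (k * h + q) / (k + 1)         # update the running average (momentum term)
    return np.argmax(h, axis=1), h
```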
Optimal Convergence Rate for Exact Policy Mirror Descent in Discounted Markov Decision Processes
Policy Mirror Descent (PMD) is a general family of algorithms that covers a
wide range of novel and fundamental methods in reinforcement learning.
Motivated by the instability of policy iteration (PI) with inexact policy
evaluation, unregularised PMD algorithmically regularises the policy
improvement step of PI without regularising the objective function. With exact
policy evaluation, PI is known to converge linearly with a rate given by the
discount factor γ of a Markov Decision Process. In this work, we bridge
the gap between PI and PMD with exact policy evaluation and show that the
dimension-free γ-rate of PI can be achieved by the general family of
unregularised PMD algorithms under an adaptive step-size. We show that both the
rate and step-size are unimprovable for PMD: we provide matching lower bounds
that demonstrate that the γ-rate is optimal for PMD methods as well as
PI and that the adaptive step-size is necessary to achieve it. Our work is the
first to relate PMD to rate-optimality and step-size necessity. Our study of
the convergence of PMD avoids the use of the performance difference lemma,
which leads to a direct analysis of independent interest. We also extend the
analysis to the inexact setting and establish the first dimension-optimal
sample complexity for unregularised PMD under a generative model, improving
upon the best-known result.
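For orientation, the generic PMD improvement step with exact evaluation has the following proximal form (standard in the PMD literature; the Bregman divergence D and the step-size schedule η_k are the algorithm's design choices, and the paper's adaptive schedule is not reproduced here):

```latex
% Policy mirror descent improvement step, applied at every state s:
\pi_{k+1}(\cdot \mid s) \;\in\; \arg\max_{p \in \Delta(A)}
  \Big\{ \eta_k \big\langle Q^{\pi_k}(s,\cdot),\, p \big\rangle
         - D\big(p,\; \pi_k(\cdot \mid s)\big) \Big\}
% With the negative-entropy mirror map (D = KL) this becomes the multiplicative update
\pi_{k+1}(a \mid s) \;\propto\; \pi_k(a \mid s)\,\exp\!\big(\eta_k\, Q^{\pi_k}(s,a)\big)
```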
Optimal Sample Complexity of Reinforcement Learning for Uniformly Ergodic Discounted Markov Decision Processes
We consider the optimal sample complexity theory of tabular reinforcement
learning (RL) for controlling the infinite horizon discounted reward in a
Markov decision process (MDP). Optimal min-max complexity results have been
developed for tabular RL in this setting, leading to a sample complexity
dependence on γ and ε of the form 1 / ((1−γ)³ ε²) up to logarithmic factors,
where γ is the discount factor and ε is the solution error tolerance.
However, in many applications
of interest, the optimal policy (or all policies) will induce mixing. We show
that in these settings the optimal min-max complexity is of order
t_minorize / ((1−γ)² ε²), where t_minorize is a measure of mixing that is
within an equivalent factor of the total variation mixing time. Our analysis
is based on regeneration-type ideas that, we believe, are of independent
interest, since they can be used to study related problems for general state
space MDPs.
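Read side by side with the worst-case theory, the improvement described above concerns the effective-horizon dependence (a reconstruction from the abstract's description, suppressing logarithmic and state-action factors):

```latex
\underbrace{\widetilde{\Theta}\!\left(\frac{1}{(1-\gamma)^{3}\,\epsilon^{2}}\right)}_{\text{worst-case tabular RL}}
\quad\longrightarrow\quad
\underbrace{\widetilde{\Theta}\!\left(\frac{t_{\mathrm{minorize}}}{(1-\gamma)^{2}\,\epsilon^{2}}\right)}_{\text{mixing / uniformly ergodic MDPs}}
```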