Finite-Time Analysis of Asynchronous Stochastic Approximation and Q-Learning
We consider a general asynchronous Stochastic Approximation (SA) scheme featuring a weighted infinity-norm contractive operator, and prove a bound on its finite-time convergence rate on a single trajectory. Additionally, we specialize the result to asynchronous Q-learning.
Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model
We investigate the sample efficiency of reinforcement learning in a $\gamma$-discounted infinite-horizon Markov decision process (MDP) with state space $\mathcal{S}$ and action space $\mathcal{A}$, assuming access to a generative model. Despite a number of prior works tackling this problem, a complete picture of the trade-offs between sample complexity and statistical accuracy is yet to be determined. In particular, prior results suffer from a sample size barrier, in the sense that their claimed statistical guarantees hold only when the sample size exceeds at least $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}$ (up to some log factor). The current paper overcomes this barrier by certifying the minimax optimality of model-based reinforcement learning as soon as the sample size exceeds the order of $\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}$ (modulo some log factor). More specifically, a perturbed model-based planning algorithm provably finds an $\varepsilon$-optimal policy with an order of $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3\varepsilon^2}$ samples for any $\varepsilon \in (0, \frac{1}{1-\gamma}]$. Along the way, we derive improved (instance-dependent) guarantees for model-based policy evaluation. To the best of our knowledge, this work provides the first minimax-optimal guarantee in a generative model that accommodates the entire range of sample sizes (beyond which finding a meaningful policy is information theoretically impossible).
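The model-based recipe above (estimate an empirical MDP from generative-model samples, then plan in it) can be sketched on a toy problem. The MDP sizes, the per-pair sample count `N`, and the use of plain value iteration (rather than the paper's perturbed planning algorithm) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP (sizes are illustrative, not from the paper).
S, A, gamma = 3, 2, 0.9
P_true = rng.dirichlet(np.ones(S), size=(S, A))  # true transition kernel
R = rng.uniform(0.0, 1.0, size=(S, A))           # known deterministic rewards

# Generative model: draw N next-state samples per (s, a), form empirical P.
N = 500
P_hat = np.zeros_like(P_true)
for s in range(S):
    for a in range(A):
        draws = rng.choice(S, size=N, p=P_true[s, a])
        P_hat[s, a] = np.bincount(draws, minlength=S) / N

# Plan in the empirical MDP with standard value iteration.
Q = np.zeros((S, A))
for _ in range(500):
    V = Q.max(axis=1)
    Q = R + gamma * P_hat @ V   # batched Bellman backup over all (s, a)
policy = Q.argmax(axis=1)
```

The statistical question the paper answers is how large `N * S * A` must be for `policy` to be near-optimal in the true MDP, not just the empirical one.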
Distributed Reinforcement Learning in Multi-Agent Networked Systems
We study distributed reinforcement learning (RL) for a network of agents. The objective is to find localized policies that maximize the (discounted) global reward. In general, scalability is a challenge in this setting because the size of the global state/action space can be exponential in the number of agents. Scalable algorithms are only known in cases where dependencies are local, e.g., between neighbors. In this work, we propose a Scalable Actor Critic framework that applies in settings where the dependencies are non-local and provide a finite-time error bound that shows how the convergence rate depends on the depth of the dependencies in the network. Additionally, as a byproduct of our analysis, we obtain novel finite-time convergence results for a general stochastic approximation scheme and for temporal difference learning with state aggregation that apply beyond the setting of RL in networked systems.
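The temporal-difference-with-state-aggregation byproduct mentioned above admits a compact tabular sketch: each state's value is approximated by a single parameter shared across its cluster. The chain, rewards, and fixed aggregation map below are illustrative, not the paper's networked setting:

```python
import numpy as np

rng = np.random.default_rng(1)

# TD(0) with state aggregation: a special case of linear function
# approximation where each feature indicates cluster membership.
n_states, n_clusters, gamma, alpha = 12, 3, 0.9, 0.05
cluster = np.repeat(np.arange(n_clusters), n_states // n_clusters)

P = rng.dirichlet(np.ones(n_states), size=n_states)  # chain under a fixed policy
r = rng.uniform(size=n_states)                       # reward per state

theta = np.zeros(n_clusters)  # one value parameter per cluster
s = 0
for _ in range(20000):
    s_next = rng.choice(n_states, p=P[s])
    # TD error uses the aggregated value on both sides of the backup.
    delta = r[s] + gamma * theta[cluster[s_next]] - theta[cluster[s]]
    theta[cluster[s]] += alpha * delta
    s = s_next
```

The finite-time results in the paper bound how fast `theta` approaches the best aggregated approximation of the true value function.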
Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction
Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP), based on a single trajectory of Markovian samples induced by a behavior policy. Focusing on a $\gamma$-discounted MDP with state space $\mathcal{S}$ and action space $\mathcal{A}$, we demonstrate that the $\ell_\infty$-based sample complexity of classical asynchronous Q-learning -- namely, the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function -- is at most on the order of \begin{equation*} \frac{1}{\mu_{\mathsf{min}}(1-\gamma)^5\varepsilon^2}+ \frac{t_{\mathsf{mix}}}{\mu_{\mathsf{min}}(1-\gamma)} \end{equation*} up to some logarithmic factor, provided that a proper constant learning rate is adopted. Here, $t_{\mathsf{mix}}$ and $\mu_{\mathsf{min}}$ denote respectively the mixing time and the minimum state-action occupancy probability of the sample trajectory. The first term of this bound matches the complexity in the case with independent samples drawn from the stationary distribution of the trajectory. The second term reflects the expense taken for the empirical distribution of the Markovian trajectory to reach a steady state, which is incurred at the very beginning and becomes amortized as the algorithm runs. Encouragingly, the above bound improves upon the state-of-the-art result by a factor of at least $|\mathcal{S}||\mathcal{A}|$. Further, the scaling on the discount complexity $\frac{1}{1-\gamma}$ can be improved by means of variance reduction.
Comment: accepted in part to Neural Information Processing Systems (NeurIPS) 2020.
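A minimal sketch of classical asynchronous Q-learning on a single Markovian trajectory, as analyzed above. The uniform behavior policy, the toy MDP, and the specific constant learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

S, A, gamma, alpha = 4, 2, 0.9, 0.1
P = rng.dirichlet(np.ones(S), size=(S, A))  # transition kernel
R = rng.uniform(size=(S, A))                # rewards in [0, 1]

Q = np.zeros((S, A))
s = 0
for _ in range(50000):
    a = rng.integers(A)                 # uniform behavior policy (illustrative)
    s_next = rng.choice(S, p=P[s, a])
    # Asynchronous update: only the visited (s, a) entry changes each step.
    target = R[s, a] + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
    s = s_next
```

The bound's $\mu_{\mathsf{min}}$ term captures exactly the cost of this asynchrony: rarely visited entries are rarely updated, so the least-occupied state-action pair gates the overall accuracy.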
Oracle-free Reinforcement Learning in Mean-Field Games along a Single Sample Path
We consider online reinforcement learning in Mean-Field Games (MFGs). Unlike
traditional approaches, we alleviate the need for a mean-field oracle by
developing an algorithm that approximates the Mean-Field Equilibrium (MFE)
using the single sample path of the generic agent. We call this {\it Sandbox
Learning}, as it can be used as a warm-start for any agent learning in a
multi-agent non-cooperative setting. We adopt a two time-scale approach in
which an online fixed-point recursion for the mean-field operates on a slower
time-scale, in tandem with a control policy update on a faster time-scale for
the generic agent. Given that the underlying Markov Decision Process (MDP) of
the agent is communicating, we provide finite sample convergence guarantees in
terms of convergence of the mean-field and control policy to the mean-field
equilibrium. The sample complexity of the Sandbox learning algorithm is $\widetilde{\mathcal{O}}(\epsilon^{-4})$, where $\epsilon$ is the MFE approximation error. This is similar to works which assume access to a mean-field oracle. Finally, we empirically demonstrate the effectiveness of the sandbox learning algorithm in diverse scenarios, including those where the MDP does not necessarily have a single communicating class.
Comment: Accepted for publication in AISTATS 2023.
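The two time-scale idea (a fast Q-update for the generic agent in tandem with a slow fixed-point recursion for the mean-field) can be sketched as follows. The dynamics, the mean-field-dependent reward, and the step-size exponents are illustrative stand-ins, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(3)

S, A, gamma = 3, 2, 0.9
base_P = rng.dirichlet(np.ones(S), size=(S, A))

mu = np.ones(S) / S        # mean-field estimate (distribution over states)
Q = np.zeros((S, A))
s = 0
for t in range(1, 20001):
    alpha_fast = 1.0 / (1 + t) ** 0.6  # faster step size: policy/Q update
    beta_slow = 1.0 / (1 + t) ** 0.9   # slower step size: mean-field recursion
    a = rng.integers(A)
    s_next = rng.choice(S, p=base_P[s, a])
    # Reward coupled to the current mean-field estimate (illustrative form).
    r = 1.0 - abs(mu[s] - 1.0 / S)
    Q[s, a] += alpha_fast * (r + gamma * Q[s_next].max() - Q[s, a])
    # Slow time-scale: move mu toward the trajectory's empirical occupancy.
    e = np.zeros(S)
    e[s_next] = 1.0
    mu += beta_slow * (e - mu)
    s = s_next
```

Because `beta_slow` decays faster than `alpha_fast`, the Q-update sees a nearly frozen mean-field, which is the standard two time-scale separation the analysis relies on.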
Is Pessimism Provably Efficient for Offline RL?
We study offline reinforcement learning (RL), which aims to learn an optimal
policy based on a dataset collected a priori. Due to the lack of further
interactions with the environment, offline RL suffers from the insufficient
coverage of the dataset, which eludes most existing theoretical analysis. In
this paper, we propose a pessimistic variant of the value iteration algorithm
(PEVI), which incorporates an uncertainty quantifier as the penalty function.
Such a penalty function simply flips the sign of the bonus function for
promoting exploration in online RL, which makes it easily implementable and
compatible with general function approximators.
Without assuming the sufficient coverage of the dataset, we establish a
data-dependent upper bound on the suboptimality of PEVI for general Markov
decision processes (MDPs). When specialized to linear MDPs, it matches the
information-theoretic lower bound up to multiplicative factors of the dimension
and horizon. In other words, pessimism is not only provably efficient but also
minimax optimal. In particular, given the dataset, the learned policy serves as
the ``best effort'' among all policies, as no other policies can do better. Our
theoretical analysis identifies the critical role of pessimism in eliminating a
notion of spurious correlation, which emerges from the ``irrelevant''
trajectories that are less covered by the dataset and not informative for the
optimal policy.
Comment: 53 pages, 3 figures.
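A toy sketch of the sign-flip idea behind pessimistic value iteration in the tabular case: instead of adding an exploration bonus, subtract a count-based uncertainty penalty before each Bellman backup. The $1/\sqrt{n}$ penalty form, the stand-in empirical model, and the MDP are illustrative assumptions, not the paper's construction:

```python
import numpy as np

rng = np.random.default_rng(4)

S, A, gamma = 4, 2, 0.9
P_true = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(size=(S, A))

# Offline dataset: uneven coverage -- some (s, a) pairs are rarely visited.
counts = rng.integers(1, 200, size=(S, A)).astype(float)
P_hat = P_true  # stand-in for the empirical model (illustrative simplification)

# Pessimism: a count-based uncertainty penalty replaces the online bonus,
# with its sign flipped (subtracted rather than added).
penalty = 1.0 / np.sqrt(counts)

Q = np.zeros((S, A))
for _ in range(500):
    V = Q.max(axis=1)
    Q = np.clip(R - penalty + gamma * P_hat @ V, 0.0, None)
policy = Q.argmax(axis=1)
```

The effect is that poorly covered pairs get deliberately undervalued, steering the greedy `policy` away from the spurious, data-starved trajectories the abstract describes.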
A Lyapunov Theory for Finite-Sample Guarantees of Asynchronous Q-Learning and TD-Learning Variants
This paper develops a unified framework to study finite-sample convergence guarantees of a large class of value-based asynchronous Reinforcement Learning (RL) algorithms. We do this by first reformulating the RL algorithms as Markovian Stochastic Approximation (SA) algorithms to solve fixed-point equations. We then develop a Lyapunov analysis and derive mean-square error bounds on the convergence of the Markovian SA. Based on this central result, we establish finite-sample mean-square convergence bounds for asynchronous RL algorithms such as $Q$-learning, $n$-step TD, TD$(\lambda)$, and off-policy TD algorithms including V-trace. As a by-product, by analyzing the performance bounds of the TD$(\lambda)$ (and $n$-step TD) algorithm for general $\lambda$ (and $n$), we demonstrate a bias-variance trade-off, i.e., efficiency of bootstrapping in RL. This was first posed as an open problem in [37].
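The TD$(\lambda)$ algorithm analyzed above can be sketched in tabular form with accumulating eligibility traces; the chain, rewards, and hyperparameters below are illustrative. Small $\lambda$ leans on bootstrapping (lower variance, more bias), while $\lambda \to 1$ approaches Monte Carlo returns:

```python
import numpy as np

rng = np.random.default_rng(5)

# Tabular TD(lambda) with accumulating eligibility traces.
n, gamma, lam, alpha = 5, 0.9, 0.7, 0.05
P = rng.dirichlet(np.ones(n), size=n)  # Markov chain under a fixed policy
r = rng.uniform(size=n)

V = np.zeros(n)
z = np.zeros(n)  # eligibility trace vector
s = 0
for _ in range(20000):
    s_next = rng.choice(n, p=P[s])
    # One-step TD error (the bootstrapped part of the update).
    delta = r[s] + gamma * V[s_next] - V[s]
    # Decay all traces, then bump the current state's trace.
    z *= gamma * lam
    z[s] += 1.0
    # Credit the TD error to all recently visited states via the trace.
    V += alpha * delta * z
    s = s_next
```

Sweeping `lam` from 0 to 1 in this sketch interpolates between pure one-step TD and (approximately) Monte Carlo evaluation, which is the trade-off the paper quantifies.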