2,705 research outputs found
Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity
Model-based reinforcement learning (RL), which finds an optimal policy using
an empirical model, has long been recognized as one of the corner stones of RL.
It is especially suitable for multi-agent RL (MARL), as it naturally decouples
the learning and the planning phases, and avoids the non-stationarity problem
when all agents are improving their policies simultaneously using samples.
Though intuitive, easy-to-implement, and widely-used, the sample complexity of
model-based MARL algorithms has not been fully investigated. In this paper, our
goal is to address the fundamental question about its sample complexity. We
study arguably the most basic MARL setting: two-player discounted zero-sum
Markov games, given only access to a generative model. We show that model-based
MARL achieves a sample complexity of $\tilde{O}\big(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2}\big)$
for finding the Nash equilibrium (NE) value up to some $\epsilon$ error, and the
$\epsilon$-NE policies with a smooth planning oracle, where $\gamma$ is the discount
factor, and $S$, $A$, $B$ denote the state space and the action spaces for the two
agents. We further show that such a sample bound is minimax-optimal (up to
logarithmic factors) if the algorithm is reward-agnostic, i.e., the algorithm
queries state-transition samples without reward knowledge, by establishing a
matching lower bound. This is in contrast to the usual reward-aware setting, with a
$\tilde{\Omega}\big(|S|(|A|+|B|)(1-\gamma)^{-3}\epsilon^{-2}\big)$ lower bound, where
this model-based approach is near-optimal with only a gap in the $|A|, |B|$
dependence. Our results not only demonstrate the sample efficiency of this
basic model-based approach in MARL, but also elaborate on the fundamental
tradeoff between its power (easily handling the more challenging
reward-agnostic case) and its limitation (less adaptive and suboptimal in the
$|A|, |B|$ dependence), which arises particularly in the multi-agent context.
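As a concrete illustration of the decoupling between learning and planning, here is a minimal sketch, not the paper's exact algorithm: it assumes a generative model exposed as a Python callable `sample(s, a, b)` returning a next-state index, an illustrative reward tensor `R`, and uses plain Nash value iteration with a linear-program matrix-game solve at each state.

```python
# Minimal sketch (illustrative assumptions, not the paper's algorithm verbatim):
# model-based planning for a two-player zero-sum Markov game with a generative model.
import numpy as np
from scipy.optimize import linprog

def solve_matrix_game(Q):
    """Value and max-player strategy of the zero-sum matrix game Q (A x B) via LP."""
    A, B = Q.shape
    # Variables: x (A mixed-strategy weights) and v (game value); minimize -v.
    c = np.concatenate([np.zeros(A), [-1.0]])
    # For every min-player action b: v - x^T Q[:, b] <= 0.
    A_ub = np.hstack([-Q.T, np.ones((B, 1))])
    b_ub = np.zeros(B)
    A_eq = np.concatenate([np.ones(A), [0.0]]).reshape(1, -1)  # sum_a x_a = 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * A + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:A]

def model_based_nash_vi(sample, S, A, B, R, gamma, n_samples, n_iters=200):
    """Estimate P from the generative model, then run Nash value iteration on it."""
    # Learning phase: build the empirical transition model from i.i.d. samples.
    P_hat = np.zeros((S, A, B, S))
    for s in range(S):
        for a in range(A):
            for b in range(B):
                for _ in range(n_samples):
                    P_hat[s, a, b, sample(s, a, b)] += 1.0 / n_samples
    # Planning phase: fully decoupled from sampling, so no non-stationarity issue.
    V = np.zeros(S)
    for _ in range(n_iters):
        Q = R + gamma * (P_hat @ V)  # shape (S, A, B)
        V = np.array([solve_matrix_game(Q[s])[0] for s in range(S)])
    return V
```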
Model-Free Algorithm with Improved Sample Efficiency for Zero-Sum Markov Games
The problem of two-player zero-sum Markov games has recently attracted
increasing interests in theoretical studies of multi-agent reinforcement
learning (RL). In particular, for finite-horizon episodic Markov decision
processes (MDPs), it has been shown that model-based algorithms can find an
$\epsilon$-optimal Nash Equilibrium (NE) with a sample complexity of
$\tilde{O}(H^3 S A B / \epsilon^2)$, which is optimal in the dependence on the horizon $H$
and the number of states $S$ (where $A$ and $B$ denote the number of actions of
the two players, respectively). However, none of the existing model-free
algorithms can achieve such an optimality. In this work, we propose a
model-free stage-based Q-learning algorithm and show that it achieves the same
sample complexity as the best model-based algorithm, and hence for the first
time demonstrate that model-free algorithms can enjoy the same optimality in
the $H$ dependence as model-based algorithms. The main improvement in the
dependency on $H$ arises by leveraging the popular variance-reduction technique
based on the reference-advantage decomposition previously used only for
single-agent RL. However, such a technique relies on a critical monotonicity
property of the value function, which does not hold in Markov games due to the
update of the policy via the coarse correlated equilibrium (CCE) oracle. Thus,
to extend such a technique to Markov games, our algorithm features a key novel
design: the reference value functions are updated to be the pair of optimistic
and pessimistic value functions whose value difference is the smallest in the
history, which achieves the desired improvement in sample efficiency.
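This reference-update rule can be made concrete with a small sketch; this is one reading of the design described above, not the authors' code, and the class name and interfaces are assumptions: keep optimistic and pessimistic value estimates, and freeze them as the reference pair at a state whenever their gap is the smallest observed so far.

```python
# Minimal sketch of the reference-pair update (illustrative only, not the paper's
# full stage-based Q-learning algorithm for Markov games).
import numpy as np

class ReferencePair:
    def __init__(self, num_states):
        self.V_ref_up = np.full(num_states, np.inf)    # optimistic reference
        self.V_ref_lo = np.full(num_states, -np.inf)   # pessimistic reference
        self.best_gap = np.full(num_states, np.inf)    # smallest gap seen so far

    def maybe_update(self, s, V_up, V_lo):
        """Adopt (V_up[s], V_lo[s]) as the reference pair if their gap is a new minimum."""
        gap = V_up[s] - V_lo[s]
        if gap < self.best_gap[s]:
            self.best_gap[s] = gap
            self.V_ref_up[s] = V_up[s]
            self.V_ref_lo[s] = V_lo[s]

    def advantage(self, s, V_up, V_lo):
        """Reference-advantage decomposition: the small residual V - V_ref is learned
        with fresh samples while the low-variance reference term is reused."""
        return V_up[s] - self.V_ref_up[s], V_lo[s] - self.V_ref_lo[s]
```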
Model-Based Reinforcement Learning for Offline Zero-Sum Markov Games
This paper makes progress towards learning Nash equilibria in two-player
zero-sum Markov games from offline data. Specifically, consider a
$\gamma$-discounted infinite-horizon Markov game with $S$ states, where the
max-player has $A$ actions and the min-player has $B$ actions. We propose a
pessimistic model-based algorithm with Bernstein-style lower confidence bounds
-- called VI-LCB-Game -- that provably finds an $\varepsilon$-approximate Nash
equilibrium with a sample complexity no larger than
$\frac{C_{\mathsf{clipped}}^{\star} S (A+B)}{(1-\gamma)^{3}\varepsilon^{2}}$ (up
to some log factor). Here, $C_{\mathsf{clipped}}^{\star}$ is some unilateral
clipped concentrability coefficient that reflects the coverage and distribution
shift of the available data (vis-à-vis the target data), and the target
accuracy $\varepsilon$ can be any value within
$\big(0, \frac{1}{1-\gamma}\big]$. Our sample complexity bound strengthens prior
art by a factor of $\min\{A, B\}$, achieving minimax optimality for the entire
$\varepsilon$-range. An appealing feature of our result lies in its algorithmic
simplicity, which reveals that neither variance reduction nor sample splitting
is necessary for achieving sample optimality.
Comment: accepted to Operations Research
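A minimal sketch of the pessimism principle behind such Bernstein-style lower confidence bounds follows, assuming empirical transitions `P_hat` and visit counts `N` estimated from the offline dataset; the full VI-LCB-Game also treats the min-player symmetrically with upper confidence bounds, which is omitted here, and all names and constants are illustrative.

```python
# Minimal sketch (assumption-laden, not the paper's VI-LCB-Game in full):
# one pessimistic, Bernstein-penalized backup of the max-player's values.
import numpy as np

def bernstein_bonus(V, p_hat, count, gamma, delta=1e-2):
    """Bernstein-style penalty: sqrt(Var_{p_hat}[V] * log(1/delta) / n) + lower-order term."""
    var = max(p_hat @ (V ** 2) - (p_hat @ V) ** 2, 0.0)
    log_term = np.log(1.0 / delta)
    n = max(count, 1)
    return np.sqrt(var * log_term / n) + log_term / ((1.0 - gamma) * n)

def pessimistic_backup(R, P_hat, N, V, gamma):
    """One lower-confidence-bound backup over all (s, a, b) triples."""
    S, A, B = R.shape
    Q_lcb = np.zeros((S, A, B))
    for s in range(S):
        for a in range(A):
            for b in range(B):
                bonus = bernstein_bonus(V, P_hat[s, a, b], N[s, a, b], gamma)
                backup = R[s, a, b] + gamma * (P_hat[s, a, b] @ V) - bonus
                # Clip to the valid value range [0, 1/(1 - gamma)].
                Q_lcb[s, a, b] = np.clip(backup, 0.0, 1.0 / (1.0 - gamma))
    return Q_lcb
```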
Reinforcement Learning with Perturbed Rewards
Recent studies have shown that reinforcement learning (RL) models are
vulnerable in various noisy scenarios. For instance, the observed reward
channel is often subject to noise in practice (e.g., when rewards are collected
through sensors), and is therefore not credible. In addition, for applications
such as robotics, a deep reinforcement learning (DRL) algorithm can be
manipulated to produce arbitrary errors by receiving corrupted rewards. In this
paper, we consider noisy RL problems with perturbed rewards, which can be
approximated with a confusion matrix. We develop a robust RL framework that
enables agents to learn in noisy environments where only perturbed rewards are
observed. Our solution framework builds on existing RL/DRL algorithms and is
the first to address the biased noisy reward setting without any assumptions on
the true noise distribution (e.g., zero-mean Gaussian noise, as assumed in
previous works). The core ideas of our solution include estimating a reward confusion
matrix and defining a set of unbiased surrogate rewards. We prove the
convergence and sample complexity of our approach. Extensive experiments on
different DRL platforms show that trained policies based on our estimated
surrogate reward can achieve higher expected rewards, and converge faster than
existing baselines. For instance, the state-of-the-art PPO algorithm is able to
obtain 84.6% and 80.8% improvements on average score for five Atari games, with
error rates of 10% and 30%, respectively.
Comment: AAAI 2020 (Spotlight)
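The surrogate-reward construction can be sketched as follows, under one reading of the abstract (not the authors' released code): the confusion matrix `C`, with `C[i, j]` the probability of observing reward level `j` given true level `i`, is assumed to have been estimated already, and the surrogate rewards are chosen so that their expectation under the noise equals the true reward.

```python
# Minimal sketch (illustrative, not the paper's implementation): unbiased surrogate
# rewards from an estimated reward confusion matrix.
import numpy as np

def surrogate_rewards(confusion, reward_levels):
    """Solve C @ r_hat = r, so that E[r_hat(observed) | true level i] = reward_levels[i]."""
    return np.linalg.solve(confusion, reward_levels)

# Illustrative binary-reward example with flip rates e_minus and e_plus (assumed values).
e_minus, e_plus = 0.1, 0.3
C = np.array([[1 - e_minus, e_minus],
              [e_plus, 1 - e_plus]])
levels = np.array([0.0, 1.0])          # true reward levels r_- and r_+
r_hat = surrogate_rewards(C, levels)   # feed r_hat[observed_index] to the RL update
print(r_hat)
```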