    Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity

    Model-based reinforcement learning (RL), which finds an optimal policy using an empirical model, has long been recognized as one of the cornerstones of RL. It is especially suitable for multi-agent RL (MARL), as it naturally decouples the learning and the planning phases, and avoids the non-stationarity problem when all agents are improving their policies simultaneously using samples. Though model-based MARL algorithms are intuitive, easy to implement, and widely used, their sample complexity has not been fully investigated. In this paper, our goal is to address this fundamental question about sample complexity. We study arguably the most basic MARL setting: two-player discounted zero-sum Markov games, given only access to a generative model. We show that model-based MARL achieves a sample complexity of $\tilde O(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error, and the $\epsilon$-NE policies with a smooth planning oracle, where $\gamma$ is the discount factor, and $S,A,B$ denote the state space and the action spaces of the two agents. By establishing a matching lower bound, we further show that this sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, meaning that it queries state transition samples without knowledge of the rewards. This is in contrast to the usual reward-aware setting, with a $\tilde\Omega(|S|(|A|+|B|)(1-\gamma)^{-3}\epsilon^{-2})$ lower bound, where this model-based approach is near-optimal with only a gap in the $|A|,|B|$ dependence. Our results not only demonstrate the sample efficiency of this basic model-based approach in MARL, but also elaborate on the fundamental tradeoff between its power (easily handling the more challenging reward-agnostic case) and its limitation (less adaptive and suboptimal in $|A|,|B|$), a tradeoff that arises particularly in the multi-agent context.
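    To make the gap in the $|A|,|B|$ dependence concrete, the two rates quoted above can be compared numerically. The following is a rough sketch only: it drops all constants and logarithmic factors hidden by $\tilde O$ and $\tilde\Omega$, so the outputs are scaling terms rather than actual sample counts, and the function names are hypothetical, chosen for illustration.

```python
def _check(num_states, num_actions_a, num_actions_b, gamma, epsilon):
    # Basic sanity checks on the problem parameters.
    if min(num_states, num_actions_a, num_actions_b) < 1:
        raise ValueError("state and action counts must be positive")
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    if epsilon <= 0.0:
        raise ValueError("epsilon must be positive")


def model_based_upper_term(num_states, num_actions_a, num_actions_b, gamma, epsilon):
    """Scaling term |S||A||B| (1-gamma)^-3 eps^-2 (constants and logs dropped)."""
    _check(num_states, num_actions_a, num_actions_b, gamma, epsilon)
    joint = num_states * num_actions_a * num_actions_b
    return joint / ((1.0 - gamma) ** 3 * epsilon ** 2)


def reward_aware_lower_term(num_states, num_actions_a, num_actions_b, gamma, epsilon):
    """Scaling term |S|(|A|+|B|) (1-gamma)^-3 eps^-2 (constants and logs dropped)."""
    _check(num_states, num_actions_a, num_actions_b, gamma, epsilon)
    total = num_states * (num_actions_a + num_actions_b)
    return total / ((1.0 - gamma) ** 3 * epsilon ** 2)


def action_dependence_gap(num_actions_a, num_actions_b):
    """Ratio of the two terms; it depends only on the action counts."""
    return (num_actions_a * num_actions_b) / (num_actions_a + num_actions_b)
```

    For example, with 4 actions per player the ratio `action_dependence_gap(4, 4)` is 2, and it grows roughly like $\min\{|A|,|B|\}$ as the action spaces get larger; the dependence on $|S|$, $\gamma$, and $\epsilon$ cancels.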

    Model-Free Algorithm with Improved Sample Efficiency for Zero-Sum Markov Games

    The problem of two-player zero-sum Markov games has recently attracted increasing interest in theoretical studies of multi-agent reinforcement learning (RL). In particular, for finite-horizon episodic Markov decision processes (MDPs), it has been shown that model-based algorithms can find an $\epsilon$-optimal Nash Equilibrium (NE) with a sample complexity of $O(H^3SAB/\epsilon^2)$, which is optimal in its dependence on the horizon $H$ and the number of states $S$ (where $A$ and $B$ denote the number of actions of the two players, respectively). However, none of the existing model-free algorithms achieves such optimality. In this work, we propose a model-free stage-based Q-learning algorithm and show that it achieves the same sample complexity as the best model-based algorithm, and hence demonstrate for the first time that model-free algorithms can enjoy the same optimality in the $H$ dependence as model-based algorithms. The main improvement in the dependence on $H$ comes from leveraging the popular variance reduction technique based on the reference-advantage decomposition, previously used only for single-agent RL. However, this technique relies on a critical monotonicity property of the value function, which does not hold in Markov games because the policy is updated via the coarse correlated equilibrium (CCE) oracle. Thus, to extend the technique to Markov games, our algorithm features a key novel design: the reference value functions are updated to be the pair of optimistic and pessimistic value functions whose value difference is the smallest in the history, in order to achieve the desired improvement in sample efficiency.
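    The reference update rule described above (keep the optimistic/pessimistic pair with the smallest gap seen so far) can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's algorithm: it assumes the gap is tracked separately for each state, omits the stage structure and the step index, and the name `update_reference` and the dictionary layout are hypothetical.

```python
from typing import Dict, Hashable, Tuple

State = Hashable
ValuePair = Tuple[float, float]  # (optimistic value, pessimistic value)


def update_reference(
    reference: Dict[State, ValuePair],
    optimistic: Dict[State, float],
    pessimistic: Dict[State, float],
) -> Dict[State, ValuePair]:
    """Keep, for each state, the value pair with the smallest gap in the history.

    Assumption: the gap is compared per state. The stored pair is replaced
    only when the new (optimistic - pessimistic) difference is strictly smaller.
    """
    for state, upper in optimistic.items():
        lower = pessimistic[state]
        current = reference.get(state)
        if current is None or (upper - lower) < (current[0] - current[1]):
            reference[state] = (upper, lower)
    return reference
```

    Selecting by smallest gap rather than taking the most recent pair reflects the point made in the abstract: without monotonicity of the value estimates, the latest pair is not guaranteed to be the tightest one.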

    Model-Based Reinforcement Learning for Offline Zero-Sum Markov Games

    This paper makes progress towards learning Nash equilibria in two-player zero-sum Markov games from offline data. Specifically, consider a $\gamma$-discounted infinite-horizon Markov game with $S$ states, where the max-player has $A$ actions and the min-player has $B$ actions. We propose a pessimistic model-based algorithm with Bernstein-style lower confidence bounds, called VI-LCB-Game, that provably finds an $\varepsilon$-approximate Nash equilibrium with a sample complexity no larger than $\frac{C_{\mathsf{clipped}}^{\star}S(A+B)}{(1-\gamma)^{3}\varepsilon^{2}}$ (up to some log factor). Here, $C_{\mathsf{clipped}}^{\star}$ is some unilateral clipped concentrability coefficient that reflects the coverage and distribution shift of the available data (vis-à-vis the target data), and the target accuracy $\varepsilon$ can be any value within $\big(0,\frac{1}{1-\gamma}\big]$. Our sample complexity bound strengthens prior art by a factor of $\min\{A,B\}$, achieving minimax optimality for the entire $\varepsilon$-range. An appealing feature of our result lies in its algorithmic simplicity, which shows that variance reduction and sample splitting are unnecessary for achieving sample optimality. Comment: accepted to Operations Researc

    Reinforcement Learning with Perturbed Rewards

    Recent studies have shown that reinforcement learning (RL) models are vulnerable in various noisy scenarios. For instance, the observed reward channel is often subject to noise in practice (e.g., when rewards are collected through sensors), and is therefore not credible. In addition, for applications such as robotics, a deep reinforcement learning (DRL) algorithm can be manipulated to produce arbitrary errors by receiving corrupted rewards. In this paper, we consider noisy RL problems with perturbed rewards, which can be approximated with a confusion matrix. We develop a robust RL framework that enables agents to learn in noisy environments where only perturbed rewards are observed. Our solution framework builds on existing RL/DRL algorithms and is the first to address the biased noisy reward setting without any assumptions on the true distribution (e.g., zero-mean Gaussian noise as made in previous works). The core ideas of our solution include estimating a reward confusion matrix and defining a set of unbiased surrogate rewards. We prove the convergence and sample complexity of our approach. Extensive experiments on different DRL platforms show that trained policies based on our estimated surrogate reward can achieve higher expected rewards, and converge faster than existing baselines. For instance, the state-of-the-art PPO algorithm is able to obtain 84.6% and 80.8% improvements in average score for five Atari games, with error rates of 10% and 30% respectively. Comment: AAAI 2020 (Spotlight
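    The idea of an unbiased surrogate reward can be illustrated for the simplest case of a binary reward. The sketch below assumes two reward levels and flip rates that are already known; the abstract's method estimates the confusion matrix instead, and that estimation step is not shown. The function names and the parameters `e_minus` (probability that a true low reward is observed as high) and `e_plus` (probability that a true high reward is observed as low) are hypothetical. The surrogate values solve the linear system that makes the expected surrogate equal the true reward, which is the inverse of the 2x2 confusion matrix applied to the reward vector.

```python
def surrogate_rewards(r_minus, r_plus, e_minus, e_plus):
    """Return (hat_minus, hat_plus), the surrogate values to use when the
    low or high reward is *observed*, so that the expectation over the
    noise equals the true reward.

    Assumes known flip rates with e_minus + e_plus != 1; otherwise the
    confusion matrix is singular and no unbiased surrogate exists.
    """
    det = 1.0 - e_minus - e_plus  # determinant of the 2x2 confusion matrix
    if abs(det) < 1e-12:
        raise ValueError("confusion matrix is singular (e_minus + e_plus = 1)")
    hat_minus = ((1.0 - e_plus) * r_minus - e_minus * r_plus) / det
    hat_plus = ((1.0 - e_minus) * r_plus - e_plus * r_minus) / det
    return hat_minus, hat_plus


def expected_surrogate(true_is_plus, r_minus, r_plus, e_minus, e_plus):
    """Expected surrogate reward given the true reward level."""
    hat_minus, hat_plus = surrogate_rewards(r_minus, r_plus, e_minus, e_plus)
    if true_is_plus:
        # A true high reward is observed as low with probability e_plus.
        return e_plus * hat_minus + (1.0 - e_plus) * hat_plus
    # A true low reward is observed as high with probability e_minus.
    return (1.0 - e_minus) * hat_minus + e_minus * hat_plus
```

    With rewards of -1 and +1 and flip rates of 0.1 and 0.3, `expected_surrogate` recovers -1 and +1 respectively (up to floating-point error), even though the noise is asymmetric and therefore biased.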