Policy Space Diversity for Non-Transitive Games
Policy-Space Response Oracles (PSRO) is an influential algorithm framework
for approximating a Nash Equilibrium (NE) in multi-agent non-transitive games.
Many previous studies have sought to promote policy diversity in PSRO. A
major weakness of existing diversity metrics is that a more diverse population
(according to those metrics) does not necessarily yield, as we prove in this
paper, a better approximation to an NE. To alleviate this problem, we propose a
new diversity metric whose improvement guarantees a better approximation to an
NE. We also develop a practical and well-justified method for optimizing our
diversity metric using only state-action samples. By incorporating our
diversity regularization into best-response solving in
PSRO, we obtain a new PSRO variant, Policy Space Diversity PSRO (PSD-PSRO). We
present the convergence property of PSD-PSRO. Extensive experiments on various
games demonstrate that PSD-PSRO is more effective than state-of-the-art PSRO
variants at producing significantly less exploitable policies.
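To make the setting concrete, here is a minimal sketch of a vanilla PSRO loop on a symmetric zero-sum matrix game, with a toy novelty bonus standing in for a diversity regularizer. The fictitious-play meta-solver, the bonus, and the names below (meta_nash, psro, lam) are illustrative assumptions; the paper's actual metric and its state-action-sample estimator are not reproduced here.

import numpy as np

def meta_nash(M, iters=5000):
    # Fictitious play as a simple stand-in meta-solver for the restricted game M.
    counts = np.ones(M.shape[0])
    for _ in range(iters):
        counts[np.argmax(M @ (counts / counts.sum()))] += 1
    return counts / counts.sum()

def psro(A, steps=10, lam=0.1, seed=0):
    # A: row player's payoff matrix of a symmetric zero-sum game;
    # "policies" here are simply pure strategies of the underlying game.
    n = A.shape[0]
    pop = [int(np.random.default_rng(seed).integers(n))]
    sigma = np.ones(1)
    for _ in range(steps):
        sigma = meta_nash(A[np.ix_(pop, pop)])   # meta-strategy over the population
        opp = np.zeros(n)
        for s, w in zip(pop, sigma):
            opp[s] += w                          # aggregate opponent strategy
        values = A @ opp                         # value of each candidate best response
        bonus = np.ones(n)
        bonus[pop] = 0.0                         # toy novelty bonus, NOT the paper's metric
        br = int(np.argmax(values + lam * bonus))
        if br in pop:                            # no new regularized best response found
            break
        pop.append(br)
    return pop, sigma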
Efficient Last-iterate Convergence Algorithms in Solving Games
No-regret algorithms are popular for learning a Nash equilibrium (NE) in
two-player zero-sum normal-form games (NFGs) and extensive-form games (EFGs).
Many recent works study no-regret algorithms with last-iterate convergence.
Among them, the two most famous algorithms are Optimistic Gradient Descent
Ascent (OGDA) and Optimistic Multiplicative Weight Update (OMWU). However, OGDA
has high per-iteration complexity. OMWU exhibits a lower per-iteration
complexity but poorer empirical performance, and its convergence holds only
when the NE is unique. Recent works propose a Reward Transformation (RT)
framework for MWU, which removes the uniqueness condition and achieves
performance competitive with OMWU. Unfortunately, RT-based algorithms perform
worse than OGDA under the same number of iterations, and their convergence
guarantee relies on a continuous-time feedback assumption, which does not hold
in most scenarios. To address these issues, we provide a closer analysis of the
RT framework, which holds for both continuous- and discrete-time feedback. We
demonstrate that the essence of the RT framework is to transform the problem of
learning NE in the original game into a series of strongly convex-concave
optimization problems (SCCPs). We show that the bottleneck of RT-based
algorithms is the speed of solving the SCCPs. To improve their empirical
performance, we design a novel transformation method so that the SCCPs can be
solved by Regret Matching+ (RM+), a no-regret algorithm with better empirical
performance, resulting in Reward Transformation RM+ (RTRM+). RTRM+ enjoys
last-iterate convergence under the discrete-time feedback setting. Using the
counterfactual regret decomposition framework, we propose Reward Transformation
CFR+ (RTCFR+) to extend RTRM+ to EFGs. Experimental results show that our
algorithms significantly outperform existing last-iterate convergence
algorithms and RM+ (CFR+).
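Since RM+ is the building block of RTRM+, a minimal self-play RM+ loop on a normal-form game may help. This is only a sketch: the outer reward-transformation loop (perturbing the payoffs toward a reference strategy so that each subproblem becomes strongly convex-concave) is omitted, and the parameter choices are arbitrary.

import numpy as np

def rm_plus(A, T=20000):
    # Simultaneous RM+ self-play on payoff matrix A; row maximizes, column minimizes.
    m, n = A.shape
    Qx, Qy = np.zeros(m), np.zeros(n)            # clipped cumulative regrets
    avg_x, avg_y = np.zeros(m), np.zeros(n)
    for t in range(1, T + 1):
        x = Qx / Qx.sum() if Qx.sum() > 0 else np.full(m, 1 / m)
        y = Qy / Qy.sum() if Qy.sum() > 0 else np.full(n, 1 / n)
        ux, uy = A @ y, -A.T @ x                 # each player's action payoffs
        Qx = np.maximum(0.0, Qx + ux - x @ ux)   # RM+ update: clip regrets at zero
        Qy = np.maximum(0.0, Qy + uy - y @ uy)
        avg_x, avg_y = avg_x + t * x, avg_y + t * y   # linear averaging, as in CFR+
    return avg_x / avg_x.sum(), avg_y / avg_y.sum()

# Rock-paper-scissors: the averaged strategies approach the uniform NE.
A = np.array([[0.0, -1.0, 1.0], [1.0, 0.0, -1.0], [-1.0, 1.0, 0.0]])
x_bar, y_bar = rm_plus(A)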
A Unified Approach to Reinforcement Learning, Quantal Response Equilibria, and Two-Player Zero-Sum Games
Algorithms designed for single-agent reinforcement learning (RL) generally
fail to converge to equilibria in two-player zero-sum (2p0s) games. On the
other hand, game-theoretic algorithms for approximating Nash and regularized
equilibria in 2p0s games are not typically competitive for RL and can be
difficult to scale. As a result, algorithms for these two cases are generally
developed and evaluated separately. In this work, we show that a single
algorithm can produce strong results in both settings, despite their
fundamental differences. This algorithm, which we call magnetic mirror descent
(MMD), is a simple extension of mirror descent and a special case of a
non-Euclidean proximal gradient algorithm. From a theoretical standpoint, we
prove a novel linear convergence result for this non-Euclidean proximal gradient
algorithm for a class of variational inequality problems. It follows from this
result that MMD converges linearly to quantal response equilibria (i.e.,
entropy regularized Nash equilibria) in extensive-form games; this is the first
time linear convergence has been proven for a first-order solver. Moreover,
when applied as a tabular Nash equilibrium solver via self-play, MMD
empirically produces results competitive with CFR; this is the first time that
a standard RL algorithm has done so. Furthermore, for single-agent deep RL, on
a small collection of Atari and MuJoCo tasks, we show that MMD can produce
results competitive with those of PPO. Lastly, for multi-agent deep RL, we show
MMD can outperform NFSP in 3x3 Abrupt Dark Hex.
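On the simplex with the entropy mirror map, the MMD proximal step has a simple closed form. The self-play sketch below on rock-paper-scissors illustrates it; the step size, magnet strength, and iteration count are arbitrary illustrative choices, not the paper's settings.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mmd_step(x, grad, magnet, eta=0.1, alpha=0.05):
    # Closed form of argmin_x eta*<grad, x> + eta*alpha*KL(x, magnet) + KL(x, x_t).
    return softmax((np.log(x) + eta * alpha * np.log(magnet) - eta * grad)
                   / (1 + eta * alpha))

A = np.array([[0.0, -1.0, 1.0], [1.0, 0.0, -1.0], [-1.0, 1.0, 0.0]])
x, y = np.full(3, 1 / 3), np.full(3, 1 / 3)
magnet = np.full(3, 1 / 3)     # uniform magnet: targets the entropy-regularized NE (QRE)
for _ in range(5000):
    gx, gy = -(A @ y), A.T @ x   # gradients of each player's loss
    x, y = mmd_step(x, gx, magnet), mmd_step(y, gy, magnet)
# The last iterates x and y approach the uniform QRE of rock-paper-scissors.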
Scalable First-Order Methods for Robust MDPs
Robust Markov Decision Processes (MDPs) are a powerful framework for modeling
sequential decision-making problems with model uncertainty. This paper proposes
the first first-order framework for solving robust MDPs. Our algorithm
interleaves primal-dual first-order updates with approximate Value Iteration
updates. By carefully controlling the tradeoff between the accuracy and cost of
Value Iteration updates, we achieve an ergodic convergence rate for the best
choice of parameters on ellipsoidal and Kullback-Leibler s-rectangular
uncertainty sets, where S and A denote the number of states and actions,
respectively. Our dependence on S and A is significantly better than that of pure
Value Iteration algorithms. In numerical experiments on ellipsoidal uncertainty
sets, we show that our algorithm is significantly more scalable than
state-of-the-art approaches. Our framework is also the first one to solve
robust MDPs with s-rectangular KL uncertainty sets.
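For intuition, the sketch below shows the robust Bellman update that such algorithms approximate, using a 2-norm (ellipsoidal) ball of radius r around the nominal transition probabilities. The closed-form inner minimum ignores nonnegativity of the perturbed distribution for brevity; exact updates like these are what the paper's interleaved first-order steps are designed to avoid at scale.

import numpy as np

def worst_case_value(p_hat, v, r):
    # min over {p : ||p - p_hat||_2 <= r, sum(p) = 1} of p @ v (p >= 0 ignored):
    # the adversarial shift lies in the hyperplane of zero-sum perturbations.
    return p_hat @ v - r * np.linalg.norm(v - v.mean())

def robust_value_iteration(P, R, r=0.05, gamma=0.9, iters=500):
    # P[s, a]: nominal next-state distribution; R[s, a]: reward.
    S, A_, _ = P.shape
    v = np.zeros(S)
    for _ in range(iters):
        q = np.array([[R[s, a] + gamma * worst_case_value(P[s, a], v, r)
                       for a in range(A_)] for s in range(S)])
        v = q.max(axis=1)
    return v

# Example with a random nominal model (illustrative only).
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))   # shape (S, A, S)
R = rng.uniform(size=(3, 2))
v = robust_value_iteration(P, R)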
Fast swap regret minimization and applications to approximate correlated equilibria
We give a simple and computationally efficient algorithm that, for any
constant ε > 0, obtains εT-swap regret within only T = polylog(n) rounds; this
is an exponential improvement over the super-linear number of rounds required
by the state-of-the-art algorithm, and resolves the main open problem of [Blum
and Mansour 2007]. Our algorithm has an exponential dependence on ε, but we
prove a new, matching lower bound.
Our algorithm for swap regret implies faster convergence to ε-Correlated
Equilibrium (ε-CE) in several regimes: for normal-form two-player games with n
actions, it implies the first uncoupled dynamics that converge to the set of
ε-CE in polylogarithmic rounds; a polylog(n)-bit communication protocol for
ε-CE in two-player games (resolving an open problem mentioned by
[Babichenko-Rubinstein'2017, Goos-Rubinstein'2018, Ganor-CS'2018]); and an
Õ(n)-query algorithm for ε-CE (resolving an open problem of [Babichenko'2020]
and obtaining the first separation between ε-CE and ε-Nash equilibrium in the
query complexity model).
For extensive-form games, our algorithm implies a PTAS for normal-form
correlated equilibria, a solution concept often conjectured to be
computationally intractable (e.g. [Stengel-Forges'08, Fujii'23]).
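For context on the baseline being improved, here is a sketch of the classic reduction of [Blum and Mansour 2007] from swap regret to external regret: one multiplicative-weights learner per action, combined each round through the stationary distribution of the row-stochastic matrix of their recommendations. The learning rate and power-iteration length are illustrative choices.

import numpy as np

def stationary(Q, iters=200):
    # Fixed point p = p @ Q of a row-stochastic matrix, via power iteration.
    p = np.full(Q.shape[0], 1 / Q.shape[0])
    for _ in range(iters):
        p = p @ Q
    return p / p.sum()

def blum_mansour(losses, n, eta=0.1):
    # losses: iterable of length-n loss vectors with entries in [0, 1].
    W = np.ones((n, n))     # row i: weights of the external-regret learner for action i
    plays = []
    for loss in losses:
        Q = W / W.sum(axis=1, keepdims=True)    # each learner's recommended distribution
        p = stationary(Q)                       # play the combined (stationary) strategy
        plays.append(p)
        W *= np.exp(-eta * np.outer(p, loss))   # learner i is charged the scaled loss p[i] * loss
    return plays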