Policy Space Diversity for Non-Transitive Games
Policy-Space Response Oracles (PSRO) is an influential algorithm framework
for approximating a Nash Equilibrium (NE) in multi-agent non-transitive games.
Many previous studies have sought to promote policy diversity in PSRO. A
major weakness of existing diversity metrics is that a more diverse population
(according to those metrics) does not necessarily yield, as we prove in this
paper, a better approximation to an NE. To alleviate this problem, we propose a
new diversity metric whose improvement guarantees a better approximation to an
NE. We also develop a practical and well-justified method for optimizing our
diversity metric using only state-action samples. By incorporating our
diversity regularization into best-response solving in
PSRO, we obtain a new PSRO variant, Policy Space Diversity PSRO (PSD-PSRO). We
present the convergence property of PSD-PSRO. Extensive experiments on various
games demonstrate that PSD-PSRO is more effective than state-of-the-art PSRO
variants at producing significantly less exploitable policies.
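To make the setting concrete, here is a minimal sketch of a vanilla PSRO loop on a symmetric zero-sum matrix game, with a toy novelty bonus standing in for a diversity regularizer. The fictitious-play meta-solver, the bonus, and the names below (meta_nash, psro, lam) are illustrative assumptions; the paper's actual metric and its state-action-sample estimator are not reproduced here.

import numpy as np

def meta_nash(M, iters=5000):
    # Fictitious play as a simple stand-in meta-solver for the restricted game M.
    counts = np.ones(M.shape[0])
    for _ in range(iters):
        counts[np.argmax(M @ (counts / counts.sum()))] += 1
    return counts / counts.sum()

def psro(A, steps=10, lam=0.1, seed=0):
    # A: row player's payoff matrix of a symmetric zero-sum game;
    # "policies" here are simply pure strategies of the underlying game.
    n = A.shape[0]
    pop = [int(np.random.default_rng(seed).integers(n))]
    sigma = np.ones(1)
    for _ in range(steps):
        sigma = meta_nash(A[np.ix_(pop, pop)])   # meta-strategy over the population
        opp = np.zeros(n)
        for s, w in zip(pop, sigma):
            opp[s] += w                          # aggregate opponent strategy
        values = A @ opp                         # value of each candidate best response
        bonus = np.ones(n)
        bonus[pop] = 0.0                         # toy novelty bonus, NOT the paper's metric
        br = int(np.argmax(values + lam * bonus))
        if br in pop:                            # no new regularized best response found
            break
        pop.append(br)
    return pop, sigma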
Efficient Last-iterate Convergence Algorithms in Solving Games
No-regret algorithms are popular for learning a Nash equilibrium (NE) in
two-player zero-sum normal-form games (NFGs) and extensive-form games (EFGs).
Many recent works study no-regret algorithms with last-iterate convergence.
Among them, the two most famous algorithms are Optimistic Gradient Descent
Ascent (OGDA) and Optimistic Multiplicative Weight Update (OMWU). However, OGDA
has high per-iteration complexity. OMWU exhibits a lower per-iteration
complexity but poorer empirical performance, and its convergence holds only
when the NE is unique. Recent works propose a Reward Transformation (RT)
framework for MWU, which removes the uniqueness condition and achieves
performance competitive with OMWU. Unfortunately, RT-based algorithms perform
worse than OGDA under the same number of iterations, and their convergence
guarantee relies on a continuous-time feedback assumption, which does not hold
in most scenarios. To address these issues, we provide a closer analysis of the
RT framework, which holds for both continuous- and discrete-time feedback. We
demonstrate that the essence of the RT framework is to transform the problem of
learning NE in the original game into a series of strongly convex-concave
optimization problems (SCCPs). We show that the bottleneck of RT-based
algorithms is the speed of solving the SCCPs. To improve their empirical
performance, we design a novel transformation method so that the SCCPs can be
solved by Regret Matching+ (RM+), a no-regret algorithm with better empirical
performance, resulting in Reward Transformation RM+ (RTRM+). RTRM+ enjoys
last-iterate convergence under the discrete-time feedback setting. Using the
counterfactual regret decomposition framework, we propose Reward Transformation
CFR+ (RTCFR+) to extend RTRM+ to EFGs. Experimental results show that our
algorithms significantly outperform existing last-iterate convergence
algorithms and RM+ (CFR+).
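Since RM+ is the building block of RTRM+, a minimal self-play RM+ loop on a normal-form game may help. This is only a sketch: the outer reward-transformation loop (perturbing the payoffs toward a reference strategy so that each subproblem becomes strongly convex-concave) is omitted, and the parameter choices are arbitrary.

import numpy as np

def rm_plus(A, T=20000):
    # Simultaneous RM+ self-play on payoff matrix A; row maximizes, column minimizes.
    m, n = A.shape
    Qx, Qy = np.zeros(m), np.zeros(n)            # clipped cumulative regrets
    avg_x, avg_y = np.zeros(m), np.zeros(n)
    for t in range(1, T + 1):
        x = Qx / Qx.sum() if Qx.sum() > 0 else np.full(m, 1 / m)
        y = Qy / Qy.sum() if Qy.sum() > 0 else np.full(n, 1 / n)
        ux, uy = A @ y, -A.T @ x                 # each player's action payoffs
        Qx = np.maximum(0.0, Qx + ux - x @ ux)   # RM+ update: clip regrets at zero
        Qy = np.maximum(0.0, Qy + uy - y @ uy)
        avg_x, avg_y = avg_x + t * x, avg_y + t * y   # linear averaging, as in CFR+
    return avg_x / avg_x.sum(), avg_y / avg_y.sum()

# Rock-paper-scissors: the averaged strategies approach the uniform NE.
A = np.array([[0.0, -1.0, 1.0], [1.0, 0.0, -1.0], [-1.0, 1.0, 0.0]])
x_bar, y_bar = rm_plus(A)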
A Unified Approach to Reinforcement Learning, Quantal Response Equilibria, and Two-Player Zero-Sum Games
Algorithms designed for single-agent reinforcement learning (RL) generally
fail to converge to equilibria in two-player zero-sum (2p0s) games. On the
other hand, game-theoretic algorithms for approximating Nash and regularized
equilibria in 2p0s games are not typically competitive for RL and can be
difficult to scale. As a result, algorithms for these two cases are generally
developed and evaluated separately. In this work, we show that a single
algorithm can produce strong results in both settings, despite their
fundamental differences. This algorithm, which we call magnetic mirror descent
(MMD), is a simple extension of mirror descent and a special case of a
non-Euclidean proximal gradient algorithm. From a theoretical standpoint, we
prove a novel linear convergence result for this non-Euclidean proximal gradient
algorithm for a class of variational inequality problems. It follows from this
result that MMD converges linearly to quantal response equilibria (i.e.,
entropy regularized Nash equilibria) in extensive-form games; this is the first
time linear convergence has been proven for a first-order solver. Moreover,
when applied as a tabular Nash equilibrium solver via self-play, MMD
empirically produces results competitive with CFR; this is the first time that
a standard RL algorithm has done so. Furthermore, for single-agent deep RL, on
a small collection of Atari and MuJoCo tasks, we show that MMD can produce
results competitive with those of PPO. Lastly, for multi-agent deep RL, we show
MMD can outperform NFSP in 3x3 Abrupt Dark Hex.
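On the simplex with the entropy mirror map, the MMD proximal step has a simple closed form. The self-play sketch below on rock-paper-scissors illustrates it; the step size, magnet strength, and iteration count are arbitrary illustrative choices, not the paper's settings.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mmd_step(x, grad, magnet, eta=0.1, alpha=0.05):
    # Closed form of argmin_x eta*<grad, x> + eta*alpha*KL(x, magnet) + KL(x, x_t).
    return softmax((np.log(x) + eta * alpha * np.log(magnet) - eta * grad)
                   / (1 + eta * alpha))

A = np.array([[0.0, -1.0, 1.0], [1.0, 0.0, -1.0], [-1.0, 1.0, 0.0]])
x, y = np.full(3, 1 / 3), np.full(3, 1 / 3)
magnet = np.full(3, 1 / 3)     # uniform magnet: targets the entropy-regularized NE (QRE)
for _ in range(5000):
    gx, gy = -(A @ y), A.T @ x   # gradients of each player's loss
    x, y = mmd_step(x, gx, magnet), mmd_step(y, gy, magnet)
# The last iterates x and y approach the uniform QRE of rock-paper-scissors.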
Scalable First-Order Methods for Robust MDPs
Robust Markov Decision Processes (MDPs) are a powerful framework for modeling
sequential decision-making problems with model uncertainty. This paper proposes
the first first-order framework for solving robust MDPs. Our algorithm
interleaves primal-dual first-order updates with approximate Value Iteration
updates. By carefully controlling the tradeoff between the accuracy and cost of
Value Iteration updates, we achieve an ergodic convergence rate for the best
choice of parameters on ellipsoidal and Kullback-Leibler s-rectangular
uncertainty sets, where S and A denote the number of states and actions,
respectively. Our dependence on S and A is significantly better than that of pure
Value Iteration algorithms. In numerical experiments on ellipsoidal uncertainty
sets, we show that our algorithm is significantly more scalable than
state-of-the-art approaches. Our framework is also the first one to solve
robust MDPs with s-rectangular KL uncertainty sets.
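For intuition, the sketch below shows the robust Bellman update that such algorithms approximate, using a 2-norm (ellipsoidal) ball of radius r around the nominal transition probabilities. The closed-form inner minimum ignores nonnegativity of the perturbed distribution for brevity; exact updates like these are what the paper's interleaved first-order steps are designed to avoid at scale.

import numpy as np

def worst_case_value(p_hat, v, r):
    # min over {p : ||p - p_hat||_2 <= r, sum(p) = 1} of p @ v (p >= 0 ignored):
    # the adversarial shift lies in the hyperplane of zero-sum perturbations.
    return p_hat @ v - r * np.linalg.norm(v - v.mean())

def robust_value_iteration(P, R, r=0.05, gamma=0.9, iters=500):
    # P[s, a]: nominal next-state distribution; R[s, a]: reward.
    S, A_, _ = P.shape
    v = np.zeros(S)
    for _ in range(iters):
        q = np.array([[R[s, a] + gamma * worst_case_value(P[s, a], v, r)
                       for a in range(A_)] for s in range(S)])
        v = q.max(axis=1)
    return v

# Example with a random nominal model (illustrative only).
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))   # shape (S, A, S)
R = rng.uniform(size=(3, 2))
v = robust_value_iteration(P, R)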
Fast swap regret minimization and applications to approximate correlated equilibria
We give a simple and computationally efficient algorithm that, for any
constant ε > 0, obtains εT-swap regret within only T = polylog(n) rounds; this
is an exponential improvement over the super-linear number of rounds required
by the state-of-the-art algorithm, and resolves the main open problem of [Blum
and Mansour 2007]. Our algorithm has an exponential dependence on ε, but we
prove a new, matching lower bound.
Our algorithm for swap regret implies faster convergence to ε-Correlated
Equilibrium (ε-CE) in several regimes: for normal-form two-player games with n
actions, it implies the first uncoupled dynamics that converge to the set of
ε-CE in polylogarithmic rounds; a polylog(n)-bit communication protocol for
ε-CE in two-player games (resolving an open problem mentioned by
[Babichenko-Rubinstein'2017, Goos-Rubinstein'2018, Ganor-CS'2018]); and an
Õ(n)-query algorithm for ε-CE (resolving an open problem of [Babichenko'2020]
and obtaining the first separation between ε-CE and ε-Nash equilibrium in the
query complexity model).
For extensive-form games, our algorithm implies a PTAS for normal-form
correlated equilibria, a solution concept often conjectured to be
computationally intractable (e.g. [Stengel-Forges'08, Fujii'23]).
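For context on the baseline being improved, here is a sketch of the classic reduction of [Blum and Mansour 2007] from swap regret to external regret: one multiplicative-weights learner per action, combined each round through the stationary distribution of the row-stochastic matrix of their recommendations. The learning rate and power-iteration length are illustrative choices.

import numpy as np

def stationary(Q, iters=200):
    # Fixed point p = p @ Q of a row-stochastic matrix, via power iteration.
    p = np.full(Q.shape[0], 1 / Q.shape[0])
    for _ in range(iters):
        p = p @ Q
    return p / p.sum()

def blum_mansour(losses, n, eta=0.1):
    # losses: iterable of length-n loss vectors with entries in [0, 1].
    W = np.ones((n, n))     # row i: weights of the external-regret learner for action i
    plays = []
    for loss in losses:
        Q = W / W.sum(axis=1, keepdims=True)    # each learner's recommended distribution
        p = stationary(Q)                       # play the combined (stationary) strategy
        plays.append(p)
        W *= np.exp(-eta * np.outer(p, loss))   # learner i is charged the scaled loss p[i] * loss
    return plays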