
    Policy Space Diversity for Non-Transitive Games

    Policy-Space Response Oracles (PSRO) is an influential algorithmic framework for approximating a Nash Equilibrium (NE) in multi-agent non-transitive games. Many previous studies have tried to promote policy diversity in PSRO. A major weakness of existing diversity metrics is that a more diverse population (according to those metrics) does not necessarily yield a better approximation to a NE, as we prove in the paper. To alleviate this problem, we propose a new diversity metric whose improvement guarantees a better approximation to a NE. We also develop a practical and well-justified method to optimize our diversity metric using only state-action samples. By incorporating this diversity regularization into the best-response solving in PSRO, we obtain a new PSRO variant, Policy Space Diversity PSRO (PSD-PSRO). We present the convergence property of PSD-PSRO. Empirically, extensive experiments on various games demonstrate that PSD-PSRO is more effective at producing significantly less exploitable policies than state-of-the-art PSRO variants.
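    For orientation, here is a minimal sketch of the generic PSRO loop that PSD-PSRO builds on, run on a small zero-sum matrix game with regret matching as the meta-solver. The game, iteration counts, and function names (meta_nash, psro) are illustrative assumptions; the diversity regularizer that defines PSD-PSRO would be added to the best-response objective, which is left as a plain best response here.

    import numpy as np

    # Row player's payoffs in rock-paper-scissors; GAME[i, j] is the payoff
    # for playing action i against action j.
    GAME = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)

    def meta_nash(M, iters=2000):
        """Approximate a NE of the restricted zero-sum game M by regret matching."""
        n, m = M.shape
        r_reg, c_reg = np.zeros(n), np.zeros(m)
        r_avg, c_avg = np.zeros(n), np.zeros(m)
        for _ in range(iters):
            r = np.maximum(r_reg, 0.0)
            r = r / r.sum() if r.sum() > 0 else np.ones(n) / n
            c = np.maximum(c_reg, 0.0)
            c = c / c.sum() if c.sum() > 0 else np.ones(m) / m
            r_avg, c_avg = r_avg + r, c_avg + c
            value = r @ M @ c
            r_reg += M @ c - value          # row player's instantaneous regrets
            c_reg += value - (r @ M)        # column player minimizes the row payoff
        return r_avg / r_avg.sum(), c_avg / c_avg.sum()

    def psro(num_iters=5):
        population = [0]                                # start from one pure strategy
        for _ in range(num_iters):
            restricted = GAME[np.ix_(population, population)]
            sigma, _ = meta_nash(restricted)            # meta-strategy over the population
            opponent = np.zeros(GAME.shape[1])
            opponent[population] = sigma                # population mixture as the opponent
            # Plain best response; PSD-PSRO adds its policy-space diversity bonus here.
            br = int(np.argmax(GAME @ opponent))
            if br not in population:
                population.append(br)
        return population

    print(psro())  # expands the population until it covers all three pure strategies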

    Efficient Last-iterate Convergence Algorithms in Solving Games

    No-regret algorithms are popular for learning a Nash equilibrium (NE) in two-player zero-sum normal-form games (NFGs) and extensive-form games (EFGs). Many recent works consider last-iterate convergent no-regret algorithms. Among them, the two most famous are Optimistic Gradient Descent Ascent (OGDA) and Optimistic Multiplicative Weights Update (OMWU). However, OGDA has high per-iteration complexity. OMWU exhibits lower per-iteration complexity but poorer empirical performance, and its convergence holds only when the NE is unique. Recent works propose a Reward Transformation (RT) framework for MWU, which removes the uniqueness condition and achieves performance competitive with OMWU. Unfortunately, RT-based algorithms perform worse than OGDA under the same number of iterations, and their convergence guarantee rests on a continuous-time feedback assumption, which does not hold in most scenarios. To address these issues, we provide a closer analysis of the RT framework, which holds for both continuous- and discrete-time feedback. We demonstrate that the essence of the RT framework is to transform the problem of learning a NE in the original game into a series of strongly convex-concave optimization problems (SCCPs). We show that the bottleneck of RT-based algorithms is the speed of solving the SCCPs. To improve their empirical performance, we design a novel transformation method that allows the SCCPs to be solved by Regret Matching+ (RM+), a no-regret algorithm with better empirical performance, resulting in Reward Transformation RM+ (RTRM+). RTRM+ enjoys last-iterate convergence under the discrete-time feedback setting. Using the counterfactual regret decomposition framework, we propose Reward Transformation CFR+ (RTCFR+) to extend RTRM+ to EFGs. Experimental results show that our algorithms significantly outperform existing last-iterate convergence algorithms and RM+ (CFR+).
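    For concreteness, here is a sketch of plain Regret Matching+ (RM+) self-play on a zero-sum matrix game, the building block that RTRM+ wraps. The reward-transformation step that yields last-iterate convergence is not implemented; the game, iteration count, and function name are illustrative, and only the average strategies below converge to a NE.

    import numpy as np

    def rm_plus(M, iters=10_000):
        """Average strategies of RM+ self-play approximate a NE of the zero-sum game M."""
        n, m = M.shape
        Qr, Qc = np.zeros(n), np.zeros(m)           # cumulative regrets, clipped at zero
        avg_r, avg_c = np.zeros(n), np.zeros(m)
        for t in range(1, iters + 1):
            r = Qr / Qr.sum() if Qr.sum() > 0 else np.ones(n) / n
            c = Qc / Qc.sum() if Qc.sum() > 0 else np.ones(m) / m
            avg_r += t * r                          # linear averaging, as in CFR+
            avg_c += t * c
            ur, uc = M @ c, -(M.T @ r)              # each player's payoff vector
            Qr = np.maximum(Qr + ur - r @ ur, 0.0)  # RM+: threshold regrets at zero
            Qc = np.maximum(Qc + uc - c @ uc, 0.0)
        return avg_r / avg_r.sum(), avg_c / avg_c.sum()

    rps = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)
    x, y = rm_plus(rps)
    print(x, y)  # both close to the uniform NE (1/3, 1/3, 1/3)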

    A Unified Approach to Reinforcement Learning, Quantal Response Equilibria, and Two-Player Zero-Sum Games

    Algorithms designed for single-agent reinforcement learning (RL) generally fail to converge to equilibria in two-player zero-sum (2p0s) games. On the other hand, game-theoretic algorithms for approximating Nash and regularized equilibria in 2p0s games are not typically competitive for RL and can be difficult to scale. As a result, algorithms for these two cases are generally developed and evaluated separately. In this work, we show that a single algorithm can produce strong results in both settings, despite their fundamental differences. This algorithm, which we call magnetic mirror descent (MMD), is a simple extension of mirror descent and a special case of a non-Euclidean proximal gradient algorithm. From a theoretical standpoint, we prove a novel linear convergence result for this non-Euclidean proximal gradient algorithm for a class of variational inequality problems. It follows that MMD converges linearly to quantal response equilibria (i.e., entropy-regularized Nash equilibria) in extensive-form games; this is the first time linear convergence has been proven for a first-order solver. Moreover, applied as a tabular Nash equilibrium solver via self-play, MMD empirically produces results competitive with CFR; this is the first time a standard RL algorithm has done so. Furthermore, for single-agent deep RL, on a small collection of Atari and Mujoco tasks, we show that MMD can produce results competitive with those of PPO. Lastly, for multi-agent deep RL, we show that MMD can outperform NFSP in 3x3 Abrupt Dark Hex.
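    A rough sketch of an MMD-style update on the simplex for a matrix game, assuming a fixed uniform magnet and entropy (KL) regularization. The closed-form softmax update below follows $x_{t+1} \propto (x_t\,\rho^{\eta\alpha} e^{\eta g})^{1/(1+\eta\alpha)}$; the step size, temperature, and iteration count are illustrative choices rather than the paper's settings.

    import numpy as np

    def _mmd_step(p, grad, magnet, eta, alpha):
        # Closed-form simplex update: x+ proportional to
        # (x * magnet**(eta*alpha) * exp(eta*grad)) ** (1 / (1 + eta*alpha)).
        logits = (np.log(p) + eta * grad + eta * alpha * np.log(magnet)) / (1 + eta * alpha)
        z = np.exp(logits - logits.max())          # softmax, numerically stable
        return z / z.sum()

    def mmd_selfplay(M, eta=0.1, alpha=0.1, iters=5000):
        n, m = M.shape
        x, y = np.ones(n) / n, np.ones(m) / m      # current (last) iterates
        mag_x, mag_y = np.ones(n) / n, np.ones(m) / m  # fixed uniform magnets
        for _ in range(iters):
            gx, gy = M @ y, -(M.T @ x)             # each player's payoff gradient
            x = _mmd_step(x, gx, mag_x, eta, alpha)
            y = _mmd_step(y, gy, mag_y, eta, alpha)
        return x, y

    rps = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)
    print(mmd_selfplay(rps))  # last iterates approach the (uniform) regularized equilibrium of RPS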

    Scalable First-Order Methods for Robust MDPs

    Robust Markov Decision Processes (MDPs) are a powerful framework for modeling sequential decision-making problems with model uncertainty. This paper proposes the first first-order framework for solving robust MDPs. Our algorithm interleaves primal-dual first-order updates with approximate Value Iteration updates. By carefully controlling the tradeoff between the accuracy and cost of the Value Iteration updates, we achieve an ergodic convergence rate of $O(A^{2} S^{3}\log(S)\log(\epsilon^{-1})\epsilon^{-1})$ for the best choice of parameters on ellipsoidal and Kullback-Leibler $s$-rectangular uncertainty sets, where $S$ and $A$ are the numbers of states and actions, respectively. Our dependence on the number of states and actions is significantly better (by a factor of $O(A^{1.5}S^{1.5})$) than that of pure Value Iteration algorithms. In numerical experiments on ellipsoidal uncertainty sets, we show that our algorithm is significantly more scalable than state-of-the-art approaches. Our framework is also the first to solve robust MDPs with $s$-rectangular KL uncertainty sets.
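    As a point of reference, here is a simplified robust value iteration sketch in which the adversary chooses among a finite set of candidate transition kernels. This enumeration-based inner minimization is a stand-in for the ellipsoidal and KL $s$-rectangular sets the paper handles, and it illustrates the per-iteration cost that the paper's primal-dual first-order updates are designed to avoid. All names and problem sizes are illustrative.

    import numpy as np

    def robust_value_iteration(kernels, rewards, gamma=0.9, tol=1e-8):
        """kernels: shape (K, S, A, S), K candidate transition models.
        rewards: shape (S, A). Returns the robust optimal value function."""
        V = np.zeros(kernels.shape[1])
        while True:
            # Q[k, s, a] = r(s, a) + gamma * E_{s' ~ P_k(.|s,a)}[V(s')]
            Q = rewards[None, :, :] + gamma * (kernels @ V)
            V_new = Q.min(axis=0).max(axis=1)   # adversary minimizes, agent maximizes
            if np.max(np.abs(V_new - V)) < tol:
                return V_new
            V = V_new

    rng = np.random.default_rng(0)
    S, A, K = 4, 2, 3
    P = rng.dirichlet(np.ones(S), size=(K, S, A))   # K random candidate kernels
    R = rng.random((S, A))
    print(robust_value_iteration(P, R))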

    Fast swap regret minimization and applications to approximate correlated equilibria

    We give a simple and computationally efficient algorithm that, for any constant $\varepsilon>0$, obtains $\varepsilon T$-swap regret within only $T = \mathsf{polylog}(n)$ rounds; this is an exponential improvement compared to the super-linear number of rounds required by the state-of-the-art algorithm, and resolves the main open problem of [Blum and Mansour 2007]. Our algorithm has an exponential dependence on $\varepsilon$, but we prove a new, matching lower bound. Our algorithm for swap regret implies faster convergence to $\varepsilon$-Correlated Equilibrium ($\varepsilon$-CE) in several regimes: for normal-form two-player games with $n$ actions, it implies the first uncoupled dynamics that converge to the set of $\varepsilon$-CE in polylogarithmic rounds; a $\mathsf{polylog}(n)$-bit communication protocol for $\varepsilon$-CE in two-player games (resolving an open problem mentioned by [Babichenko-Rubinstein'2017, Goos-Rubinstein'2018, Ganor-CS'2018]); and an $\tilde{O}(n)$-query algorithm for $\varepsilon$-CE (resolving an open problem of [Babichenko'2020] and obtaining the first separation between $\varepsilon$-CE and $\varepsilon$-Nash equilibrium in the query complexity model). For extensive-form games, our algorithm implies a PTAS for normal form correlated equilibria, a solution concept often conjectured to be computationally intractable (e.g. [Stengel-Forges'08, Fujii'23]).
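    For background, here is a sketch of the classic Blum-Mansour reduction whose round complexity the abstract's algorithm exponentially improves: $n$ copies of a multiplicative-weights learner are combined through the stationary distribution of their recommendation matrix. This is the baseline construction, not the new $\mathsf{polylog}(n)$-round algorithm, and the learning rate and loss stream are illustrative.

    import numpy as np

    class BlumMansourLearner:
        def __init__(self, n_actions, lr=0.1):
            self.n, self.lr = n_actions, lr
            self.weights = np.ones((n_actions, n_actions))  # one MW learner per action
            self.p = np.ones(n_actions) / n_actions

        def act(self):
            Q = self.weights / self.weights.sum(axis=1, keepdims=True)  # row-stochastic
            # Play the stationary distribution p solving p = p @ Q
            # (left eigenvector of Q for eigenvalue 1).
            vals, vecs = np.linalg.eig(Q.T)
            p = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
            self.p = np.abs(p) / np.abs(p).sum()
            return self.p

        def update(self, losses):
            # Learner i is charged the loss vector scaled by the probability p_i
            # with which its recommendation was followed.
            self.weights *= np.exp(-self.lr * np.outer(self.p, losses))

    rng = np.random.default_rng(0)
    learner = BlumMansourLearner(n_actions=3)
    for _ in range(1000):
        p = learner.act()
        learner.update(rng.random(3))   # arbitrary loss vector revealed each round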