
    Regret-Minimizing Double Oracle for Extensive-Form Games

    By incorporating regret minimization, double oracle methods have demonstrated rapid convergence to Nash Equilibrium (NE) in normal-form games and extensive-form games, through algorithms such as online double oracle (ODO) and extensive-form double oracle (XDO), respectively. In this study, we further examine the theoretical convergence rate and sample complexity of such regret minimization-based double oracle methods, utilizing a unified framework called Regret-Minimizing Double Oracle. Based on this framework, we extend ODO to extensive-form games and determine its sample complexity. Moreover, we demonstrate that the sample complexity of XDO can be exponential in the number of information sets |S|, owing to the exponentially decaying stopping threshold of restricted games. To solve this problem, we propose the Periodic Double Oracle (PDO) method, which has the lowest sample complexity among regret minimization-based double oracle methods, being only polynomial in |S|. Empirical evaluations on multiple poker and board games show that PDO achieves significantly faster convergence than previous double oracle algorithms and reaches a competitive level with state-of-the-art regret minimization methods. Comment: Accepted at ICML, 202

    Analysis of Hannan Consistent Selection for Monte Carlo Tree Search in Simultaneous Move Games

    Hannan consistency, or no external regret, is a key concept for learning in games. An action selection algorithm is Hannan consistent (HC) if its performance is eventually as good as selecting the best fixed action in hindsight. If both players in a zero-sum normal form game use a Hannan consistent algorithm, their average behavior converges to a Nash equilibrium (NE) of the game. A similar result is known about extensive form games, but the played strategies need to be Hannan consistent with respect to the counterfactual values, which are often difficult to obtain. We study zero-sum extensive form games with simultaneous moves, but otherwise perfect information. These games generalize normal form games and they are a special case of extensive form games. We study whether applying HC algorithms in each decision point of these games directly to the observed payoffs leads to convergence to a Nash equilibrium. This learning process corresponds to a class of Monte Carlo Tree Search algorithms, which are popular for playing simultaneous-move games but do not have any known performance guarantees. We show that using HC algorithms directly on the observed payoffs is not sufficient to guarantee the convergence. With an additional averaging over joint actions, the convergence is guaranteed, but empirically slower. We further define an additional property of HC algorithms, which is sufficient to guarantee the convergence without the averaging and we empirically show that commonly used HC algorithms have this property. Comment: arXiv admin note: substantial text overlap with arXiv:1509.0014
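The normal-form claim in this abstract (two Hannan consistent learners in a zero-sum game have average strategies that approach a Nash equilibrium) can be illustrated with a minimal sketch. It is not the paper's tree-search setting: it assumes regret matching as the HC algorithm, full-information expected-payoff updates rather than sampled payoffs, and a hypothetical 2x2 payoff matrix `PAYOFF` whose unique equilibrium has both players choosing their first action with probability 0.4.

```python
# Hypothetical zero-sum game: entries are the row player's payoffs.
# Its unique equilibrium mixes (0.4, 0.6) for both players, with value 0.2.
PAYOFF = [[2.0, -1.0],
          [-1.0, 1.0]]


def regret_matching_strategy(regrets):
    """Play each action in proportion to its positive cumulative regret."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    return [1.0 / len(regrets)] * len(regrets)  # uniform when no positive regret


def self_play(payoff, iterations):
    """Run regret matching for both players; return their average strategies."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    row_regret, col_regret = [0.0] * n_rows, [0.0] * n_cols
    row_sum, col_sum = [0.0] * n_rows, [0.0] * n_cols
    for _ in range(iterations):
        x = regret_matching_strategy(row_regret)
        y = regret_matching_strategy(col_regret)
        # Expected payoff of each pure action against the opponent's current mix.
        row_values = [sum(payoff[i][j] * y[j] for j in range(n_cols))
                      for i in range(n_rows)]
        col_values = [-sum(payoff[i][j] * x[i] for i in range(n_rows))
                      for j in range(n_cols)]
        row_ev = sum(x[i] * row_values[i] for i in range(n_rows))
        col_ev = sum(y[j] * col_values[j] for j in range(n_cols))
        for i in range(n_rows):
            row_regret[i] += row_values[i] - row_ev
            row_sum[i] += x[i]
        for j in range(n_cols):
            col_regret[j] += col_values[j] - col_ev
            col_sum[j] += y[j]
    return ([s / iterations for s in row_sum],
            [s / iterations for s in col_sum])


def exploitability(payoff, x, y):
    """Sum of both players' best-response gains; zero exactly at equilibrium."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    best_row = max(sum(payoff[i][j] * y[j] for j in range(n_cols))
                   for i in range(n_rows))
    best_col = min(sum(payoff[i][j] * x[i] for i in range(n_rows))
                   for j in range(n_cols))
    return best_row - best_col
```

Calling `self_play(PAYOFF, 20000)` and passing the result to `exploitability` gives a quick check that the averages are close to equilibrium. The abstract's negative result concerns the multi-stage case, where applying such a learner at each decision point directly to observed payoffs is not sufficient on its own.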

    Actor-Critic Fictitious Play in Simultaneous Move Multistage Games

    Fictitious play is a game theoretic iterative procedure meant to learn an equilibrium in normal form games. However, this algorithm requires that each player has full knowledge of other players' strategies. Using an architecture inspired by actor-critic algorithms, we build a stochastic approximation of the fictitious play process. This procedure is on-line, decentralized (an agent has no information about others' strategies and rewards) and applies to multistage games (a generalization of normal form games). In addition, we prove convergence of our method towards a Nash equilibrium in both the cases of zero-sum two-player multistage games and cooperative multistage games. We also provide empirical evidence of the soundness of our approach on the game of Alesia with and without function approximation.
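For orientation, classical fictitious play in a normal form game can be sketched as follows. This is the textbook full-information procedure that the abstract starts from, not the actor-critic approximation the paper proposes. The game `MATCHING_PENNIES`, the choice of first actions, and the tie-breaking rule (lowest index) are assumptions made for illustration.

```python
# Hypothetical example game: the row player wins (+1) when the actions match.
MATCHING_PENNIES = [[1, -1],
                    [-1, 1]]


def best_response(values):
    """Index of the highest value; ties go to the lowest index."""
    return max(range(len(values)), key=lambda i: values[i])


def fictitious_play(payoff, rounds):
    """Each round, both players best-respond to the opponent's empirical play.

    Returns the empirical action frequencies of the row and column players.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    row_counts, col_counts = [0] * n_rows, [0] * n_cols
    # Assumption: both players open with action 0 in the first round.
    row_counts[0] = 1
    col_counts[0] = 1
    for _ in range(rounds - 1):
        # Payoff of each pure action against the opponent's action counts
        # (proportional to the payoff against the empirical frequencies).
        row_values = [sum(payoff[i][j] * col_counts[j] for j in range(n_cols))
                      for i in range(n_rows)]
        col_values = [-sum(payoff[i][j] * row_counts[i] for i in range(n_rows))
                      for j in range(n_cols)]
        # Choose both actions before updating, so the moves are simultaneous.
        i = best_response(row_values)
        j = best_response(col_values)
        row_counts[i] += 1
        col_counts[j] += 1
    return ([c / rounds for c in row_counts],
            [c / rounds for c in col_counts])
```

Note that each player reads the opponent's full action history (`col_counts`, `row_counts`). That is the full-knowledge requirement the abstract sets out to remove.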

    Heuristic Search Value Iteration for zero-sum Stochastic Games

    In sequential decision-making, heuristic search algorithms allow exploiting both the initial situation and an admissible heuristic to efficiently search for an optimal solution, often for planning purposes. Such algorithms exist for problems with uncertain dynamics, partial observability, multiple criteria, or multiple collaborating agents. Here we look at two-player zero-sum stochastic games with discounted criterion, with a view to proposing a solution tailored to the fully observable case, while solutions have been proposed for particular, though still more general, partially observable cases. This setting induces reasoning on both a lower and an upper bound of the value function, which leads us to propose zsSG-HSVI, an algorithm based on Heuristic Search Value Iteration (HSVI), and which thus relies on generating trajectories. We demonstrate that, each player acting optimistically, and employing simple heuristic initializations, HSVI's convergence in finite time to an ϵ-optimal solution is preserved. An empirical study of the resulting approach is conducted on benchmark problems of various sizes.