Actor-Critic Fictitious Play in Simultaneous Move Multistage Games
Fictitious play is a game-theoretic iterative procedure meant to learn an equilibrium in normal form games. However, this algorithm requires that each player has full knowledge of the other players' strategies. Using an architecture inspired by actor-critic algorithms, we build a stochastic approximation of the fictitious play process. This procedure is online, decentralized (an agent has no information about the others' strategies and rewards) and applies to multistage games (a generalization of normal form games). In addition, we prove convergence of our method towards a Nash equilibrium in both zero-sum two-player multistage games and cooperative multistage games. We also provide empirical evidence of the soundness of our approach on the game of Alesia, with and without function approximation.
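For readers unfamiliar with the baseline procedure this abstract builds on, the sketch below implements classical fictitious play on a two-player zero-sum matrix game, where each player best-responds to the empirical average of the opponent's past actions. It is not the paper's decentralized actor-critic approximation; the `fictitious_play` helper and the matching-pennies payoff matrix are illustrative choices.

```python
import numpy as np

def fictitious_play(A, n_iters=10_000, rng=None):
    """Classical fictitious play on a two-player zero-sum matrix game.

    A[i, j] is the payoff to the row player when row plays i and column
    plays j (the column player receives -A[i, j]).  Each player best-responds
    to the empirical average of the opponent's past actions.
    """
    rng = np.random.default_rng(rng)
    n, m = A.shape
    row_counts = np.zeros(n)   # how often the row player chose each action
    col_counts = np.zeros(m)   # how often the column player chose each action
    # Arbitrary initial actions.
    row_counts[rng.integers(n)] += 1
    col_counts[rng.integers(m)] += 1
    for _ in range(n_iters):
        col_avg = col_counts / col_counts.sum()   # empirical column strategy
        row_avg = row_counts / row_counts.sum()   # empirical row strategy
        row_counts[np.argmax(A @ col_avg)] += 1   # row player's best response
        col_counts[np.argmin(row_avg @ A)] += 1   # column player's best response
    return row_counts / row_counts.sum(), col_counts / col_counts.sum()

# Matching pennies: the empirical strategies converge to (1/2, 1/2).
x, y = fictitious_play(np.array([[1.0, -1.0], [-1.0, 1.0]]))
```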
On the Convergence of Model Free Learning in Mean Field Games
Learning by experience in Multi-Agent Systems (MAS) is a difficult and exciting task, due to the lack of stationarity of the environment, whose dynamics evolve as the population learns. In order to design scalable algorithms for systems with a large population of interacting agents (e.g. swarms), this paper focuses on Mean Field MAS, where the number of agents is asymptotically infinite. Recently, a very active and burgeoning field has studied the effects of diverse reinforcement learning algorithms for agents that have no prior information on a stationary Mean Field Game (MFG) and learn their policy through repeated experience. We adopt a high-level perspective on this problem and analyze in full generality the convergence of a fictitious iterative scheme using any single-agent learning algorithm at each step. We quantify the quality of the computed approximate Nash equilibrium in terms of the accumulated errors arising at each learning iteration step. Notably, we show for the first time convergence of model-free learning algorithms towards non-stationary MFG equilibria, relying only on classical assumptions on the MFG dynamics. We illustrate our theoretical results with a numerical experiment in a continuous action-space environment, where the approximate best response of the iterative fictitious play scheme is computed with a deep RL algorithm.
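The fictitious iterative scheme described above is easy to state generically. The following is a minimal sketch of that loop, assuming the caller supplies a `best_response` learner (any single-agent RL algorithm, e.g. the deep RL method used in the paper's experiment) and an `induced_distribution` oracle; both names are placeholders, not the paper's API.

```python
from typing import Callable, Sequence

def mfg_fictitious_play(
    best_response: Callable,         # single-agent learner: mean-field flow -> policy
    induced_distribution: Callable,  # policy -> state-distribution flow it induces
    initial_flow: Sequence,          # initial guess of the population's flow
    n_iterations: int = 50,
):
    """Generic fictitious-play loop for a mean field game (sketch).

    At iteration k, an approximate best response is computed against the
    current averaged mean-field flow, and the flow is then updated with the
    1/(k+1) averaging rate characteristic of fictitious play.
    """
    flow = list(initial_flow)
    policies = []
    for k in range(n_iterations):
        pi_k = best_response(flow)          # any single-agent RL algorithm
        mu_k = induced_distribution(pi_k)   # flow induced by the new policy
        # Running average of the mean-field flow (fictitious play update).
        flow = [(k * mu_bar + mu) / (k + 1) for mu_bar, mu in zip(flow, mu_k)]
        policies.append(pi_k)
    return policies, flow
```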
A General Framework for Learning Mean-Field Games
This paper presents a general mean-field game (GMFG) framework for simultaneous learning and decision-making in stochastic games with a large population. It first establishes the existence of a unique Nash equilibrium for this GMFG and demonstrates that naively combining reinforcement learning with the fixed-point approach of classical MFGs yields unstable algorithms. It then proposes value-based and policy-based reinforcement learning algorithms (GMF-V and GMF-P, respectively) with smoothed policies, together with an analysis of their convergence properties and computational complexities. Experiments on an equilibrium product pricing problem demonstrate that GMF-V-Q and GMF-P-TRPO, two specific instantiations of GMF-V and GMF-P with Q-learning and TRPO respectively, are both efficient and robust in the GMFG setting. Moreover, their performance is superior in convergence speed, accuracy, and stability when compared with existing algorithms for multi-agent reinforcement learning in the N-player setting.
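As a rough illustration of the smoothed-policy idea, the sketch below runs a GMF-V-style fixed-point loop in which each round fits a Q-table against the current population distribution and then derives a softmax (Boltzmann) policy from it rather than a greedy one. The `q_learning` and `next_distribution` callables are assumed placeholders, and the exact update and hyperparameters differ from the paper's GMF-V-Q.

```python
import numpy as np

def gmfg_value_iteration(
    q_learning,          # learner: population distribution -> Q[s, a] table
    next_distribution,   # policy -> population distribution it induces
    mu0,                 # initial population distribution over states
    temperature=0.1,
    n_rounds=100,
):
    """Sketch of a GMF-V-style fixed-point loop with smoothed policies.

    Instead of the greedy (argmax) policy, which makes the naive fixed-point
    iteration unstable, each round uses a softmax ("smoothed") policy derived
    from the learned Q-table.
    """
    mu = mu0
    for _ in range(n_rounds):
        q = q_learning(mu)                            # Q[s, a] given current mu
        logits = q / temperature
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        policy = np.exp(logits)
        policy /= policy.sum(axis=1, keepdims=True)   # softmax over actions
        mu = next_distribution(policy)                # population update
    return policy, mu
```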
Competitive Policy Optimization
A core challenge in policy optimization in competitive Markov decision processes is the design of efficient optimization methods with desirable convergence and stability properties. To tackle this, we propose competitive policy optimization (CoPO), a novel policy gradient approach that exploits the game-theoretic nature of competitive games to derive policy updates. Motivated by the competitive gradient optimization method, we derive a bilinear approximation of the game objective. In contrast, off-the-shelf policy gradient methods utilize only linear approximations and hence do not capture interactions among the players. We instantiate CoPO in two ways: (i) competitive policy gradient, and (ii) trust-region competitive policy optimization. We theoretically study these methods and empirically investigate their behavior on a set of comprehensive, yet challenging, competitive games. We observe that they provide stable optimization, convergence to sophisticated strategies, and higher scores when played against baseline policy gradient methods.
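To make the contrast with linear approximations concrete, here is a minimal sketch of the competitive gradient step that motivates CoPO: each player best-responds to a local bilinear model of the objective built from the gradients and mixed second derivatives. This is the underlying optimization update on a toy bilinear game, not the paper's policy-gradient estimator, and all function and variable names are illustrative.

```python
import numpy as np

def competitive_gradient_step(x, y, grad_x, grad_y, Dxy, Dyx, eta=0.1):
    """One competitive-gradient step for min_x max_y f(x, y) (sketch).

    grad_x, grad_y are the gradients and Dxy, Dyx the mixed second
    derivatives of f at the current point.  Each player's update solves the
    Nash equilibrium of the local bilinear approximation of the game.
    """
    dx = -eta * np.linalg.solve(
        np.eye(x.size) + eta**2 * Dxy @ Dyx,
        grad_x + eta * Dxy @ grad_y,
    )
    dy = eta * np.linalg.solve(
        np.eye(y.size) + eta**2 * Dyx @ Dxy,
        grad_y - eta * Dyx @ grad_x,
    )
    return x + dx, y + dy

# Bilinear game f(x, y) = x * y: simultaneous gradient descent-ascent spirals
# away from the saddle here, while the competitive step contracts toward (0, 0).
x, y = np.array([1.0]), np.array([1.0])
for _ in range(200):
    x, y = competitive_gradient_step(x, y, grad_x=y, grad_y=x,
                                     Dxy=np.eye(1), Dyx=np.eye(1))
```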