Maximum Entropy Heterogeneous-Agent Mirror Learning
Multi-agent reinforcement learning (MARL) has been shown to be effective for
cooperative games in recent years. However, existing state-of-the-art methods
face challenges related to sample inefficiency, brittleness regarding
hyperparameters, and the risk of converging to a suboptimal Nash Equilibrium.
To resolve these issues, in this paper, we propose a novel theoretical
framework, named Maximum Entropy Heterogeneous-Agent Mirror Learning (MEHAML),
that leverages the maximum entropy principle to design maximum entropy MARL
actor-critic algorithms. We prove that algorithms derived from the MEHAML
framework enjoy the desired properties of the monotonic improvement of the
joint maximum entropy objective and the convergence to quantal response
equilibrium (QRE). The practicality of MEHAML is demonstrated by developing a
MEHAML extension of the widely used RL algorithm, HASAC (for soft
actor-critic), which shows significant improvements in exploration and
robustness on three challenging benchmarks: Multi-Agent MuJoCo, StarCraftII,
and Google Research Football. Our results show that HASAC outperforms strong
baseline methods such as HATD3, HAPPO, QMIX, and MAPPO, thereby establishing
the new state of the art. See our project page at
https://sites.google.com/view/mehaml
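For orientation, the joint maximum entropy objective mentioned in the abstract above is commonly written in the following form. This is a sketch in assumed notation, not a quotation from the paper: the symbols $\gamma$ (discount factor), $\alpha$ (temperature), $r$ (shared reward), $n$ (number of agents), and $\mathcal{H}$ (Shannon entropy of agent $i$'s policy $\pi^i$) are introduced here for illustration.

```latex
J(\boldsymbol{\pi})
  = \mathbb{E}_{s_{0:\infty},\, a_{0:\infty} \sim \boldsymbol{\pi}}
    \left[ \sum_{t=0}^{\infty} \gamma^{t}
      \left( r(s_t, a_t)
        + \alpha \sum_{i=1}^{n} \mathcal{H}\!\left( \pi^{i}(\cdot \mid s_t) \right)
      \right)
    \right]
```

Under this reading, setting $\alpha = 0$ recovers the usual expected discounted return, while a larger $\alpha$ rewards more stochastic policies, which is what ties the objective to exploration.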
A Policy Gradient Algorithm for Learning to Learn in Multiagent Reinforcement Learning
A fundamental challenge in multiagent reinforcement learning is to learn
beneficial behaviors in a shared environment with other simultaneously learning
agents. In particular, each agent perceives the environment as effectively
non-stationary due to the changing policies of other agents. Moreover, each
agent is itself constantly learning, leading to natural non-stationarity in the
distribution of experiences encountered. In this paper, we propose a novel
meta-multiagent policy gradient theorem that directly accounts for the
non-stationary policy dynamics inherent to multiagent learning settings. This
is achieved by modeling our gradient updates to consider both an agent's own
non-stationary policy dynamics and the non-stationary policy dynamics of other
agents in the environment. We show that our theoretically grounded approach
provides a general solution to the multiagent learning problem, which
inherently comprises all key aspects of previous state of the art approaches on
this topic. We test our method on a diverse suite of multiagent benchmarks and
demonstrate a more efficient ability to adapt to new agents as they learn than
baseline methods across the full spectrum of mixed incentive, competitive, and
cooperative domains.
Comment: Accepted to ICML 2021. Code at https://github.com/dkkim93/meta-mapg
and videos at https://sites.google.com/view/meta-mapg/hom
Learning in Nonzero-Sum Stochastic Games with Potentials
Multi-agent reinforcement learning (MARL) has become effective in tackling
discrete cooperative game scenarios. However, MARL has yet to penetrate
settings beyond those modelled by team and zero-sum games, confining it to a
small subset of multi-agent systems. In this paper, we introduce a new
generation of MARL learners that can handle nonzero-sum payoff structures and
continuous settings. In particular, we study the MARL problem in a class of
games known as stochastic potential games (SPGs) with continuous state-action
spaces. Unlike cooperative games, in which all agents share a common reward,
SPGs are capable of modelling real-world scenarios where agents seek to fulfil
their individual goals. We prove theoretically that our learning method, SPot-AC,
enables independent agents to learn Nash equilibrium strategies in polynomial
time. We demonstrate that our framework tackles previously unsolvable tasks such as
Coordination Navigation and large selfish routing games and that it outperforms
state-of-the-art MARL baselines such as MADDPG and COMIX in such scenarios.
Comment: ICML 202
Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning
In many real-world settings, a team of agents must coordinate its behaviour
while acting in a decentralised fashion. At the same time, it is often possible
to train the agents in a centralised fashion where global state information is
available and communication constraints are lifted. Learning joint
action-values conditioned on extra state information is an attractive way to
exploit centralised learning, but the best strategy for then extracting
decentralised policies is unclear. Our solution is QMIX, a novel value-based
method that can train decentralised policies in a centralised end-to-end
fashion. QMIX employs a mixing network that estimates joint action-values as a
monotonic combination of per-agent values. We structurally enforce that the
joint-action value is monotonic in the per-agent values, through the use of
non-negative weights in the mixing network, which guarantees consistency
between the centralised and decentralised policies. To evaluate the performance
of QMIX, we propose the StarCraft Multi-Agent Challenge (SMAC) as a new
benchmark for deep multi-agent reinforcement learning. We evaluate QMIX on a
challenging set of SMAC scenarios and show that it significantly outperforms
existing multi-agent reinforcement learning methods.
Comment: Extended version of the ICML 2018 conference paper (arXiv:1803.11485
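The monotonic mixing idea in the QMIX abstract above can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the authors' implementation: the function name `monotonic_mix`, the fixed random weights, and the ReLU activation are all hypothetical choices made here. Non-negativity is enforced by taking absolute values of the weights, which is one simple way to satisfy the constraint the abstract describes.

```python
import numpy as np

def monotonic_mix(agent_qs, w1, b1, w2, b2):
    """Combine per-agent Q-values into one joint value (hypothetical helper).

    Absolute values make every mixing weight non-negative, and ReLU is
    non-decreasing, so the output cannot decrease when any input increases.
    """
    hidden = np.maximum(agent_qs @ np.abs(w1) + b1, 0.0)
    return float(hidden @ np.abs(w2) + b2)

# Fixed random weights stand in for learned ones (an assumption of this sketch).
rng = np.random.default_rng(0)
n_agents, n_hidden = 3, 4
w1 = rng.normal(size=(n_agents, n_hidden))
b1 = rng.normal(size=n_hidden)
w2 = rng.normal(size=n_hidden)
b2 = 0.1

agent_qs = np.array([1.0, -0.5, 2.0])
raised = agent_qs.copy()
raised[1] += 1.0  # raise only the second agent's value

q_before = monotonic_mix(agent_qs, w1, b1, w2, b2)
q_after = monotonic_mix(raised, w1, b1, w2, b2)
print(q_after >= q_before)
```

Because the joint value is non-decreasing in each per-agent value, each agent choosing the action that maximises its own value also maximises the joint value, which is the consistency property the abstract refers to.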
Regularized Softmax Deep Multi-Agent Q-Learning
Tackling overestimation in Q-learning is an important problem that has been extensively studied in single-agent reinforcement learning, but has received comparatively little attention in the multi-agent setting. In this work, we empirically demonstrate that QMIX, a popular Q-learning algorithm for cooperative multi-agent reinforcement learning (MARL), suffers from more severe overestimation in practice than previously acknowledged, and that this overestimation is not mitigated by existing approaches. We rectify this with a novel regularization-based update scheme that penalizes large joint action-values that deviate from a baseline and demonstrate its effectiveness in stabilizing learning. Furthermore, we propose to employ a softmax operator, which we efficiently approximate in a novel way in the multi-agent setting, to further reduce the potential overestimation bias. Our approach, Regularized Softmax (RES) Deep Multi-Agent Q-Learning, is general and can be applied to any Q-learning based MARL algorithm. We demonstrate that, when applied to QMIX, RES avoids severe overestimation and significantly improves performance, yielding state-of-the-art results in a variety of cooperative multi-agent tasks, including the challenging StarCraft II micromanagement benchmarks.
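To make the softmax operator in the abstract above concrete, here is a minimal sketch of the generic operator applied to a small set of candidate action-values. It does not reproduce the paper's multi-agent approximation or its regularization term; the function name `softmax_operator` and the inverse temperature `beta` are assumptions made for illustration.

```python
import numpy as np

def softmax_operator(q_values, beta):
    """Softmax-weighted average of action-values (hypothetical helper).

    beta = 0 gives the plain mean; as beta grows the result approaches the
    max, so for beta >= 0 the value lies between the mean and the max.
    """
    q = np.asarray(q_values, dtype=float)
    z = beta * (q - q.max())  # shift by the max for numerical stability
    weights = np.exp(z) / np.exp(z).sum()
    return float(weights @ q)

q_candidates = [1.0, 2.0, 4.0]
soft_value = softmax_operator(q_candidates, beta=1.0)
print(soft_value, max(q_candidates))
```

Replacing the hard max in a bootstrapped target with a value of this kind yields a target that never exceeds the max-based one, which is the intuition behind using it to dampen overestimation.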