The StarCraft Multi-Agent Challenge
In the last few years, deep multi-agent reinforcement learning (RL) has
become a highly active area of research. A particularly challenging class of
problems in this area is partially observable, cooperative, multi-agent
learning, in which teams of agents must learn to coordinate their behaviour
while conditioning only on their private observations. This is an attractive
research area since such problems are relevant to a large number of real-world
systems and are also more amenable to evaluation than general-sum problems.
Standardised environments such as the ALE and MuJoCo have allowed single-agent
RL to move beyond toy domains, such as grid worlds. However, there is no
comparable benchmark for cooperative multi-agent RL. As a result, most papers
in this field use one-off toy problems, making it difficult to measure real
progress. In this paper, we propose the StarCraft Multi-Agent Challenge (SMAC)
as a benchmark problem to fill this gap. SMAC is based on the popular real-time
strategy game StarCraft II and focuses on micromanagement challenges where each
unit is controlled by an independent agent that must act based on local
observations. We offer a diverse set of challenge maps and recommendations for
best practices in benchmarking and evaluations. We also open-source a deep
multi-agent RL learning framework including state-of-the-art algorithms. We
believe that SMAC can provide a standard benchmark environment for years to
come. Videos of our best agents for several SMAC scenarios are available at:
https://youtu.be/VZ7zmQ_obZ0
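To make the decentralized setting concrete, below is a minimal sketch of the interaction loop exposed by the open-source `smac` Python package: each agent selects only from the actions currently available to it given its local observation, while the global state is reserved for centralized training. The map name, episode count, and random policy are placeholders for illustration only.

```python
import numpy as np
from smac.env import StarCraft2Env

# Minimal random-agent loop on a SMAC micromanagement map (map name illustrative).
env = StarCraft2Env(map_name="8m")
n_agents = env.get_env_info()["n_agents"]

for episode in range(5):
    env.reset()
    terminated = False
    episode_reward = 0.0
    while not terminated:
        obs = env.get_obs()      # per-agent local observations (decentralized execution)
        state = env.get_state()  # global state, available only for centralized training
        actions = []
        for agent_id in range(n_agents):
            avail = env.get_avail_agent_actions(agent_id)
            actions.append(np.random.choice(np.nonzero(avail)[0]))
        reward, terminated, info = env.step(actions)
        episode_reward += reward
    print(f"Episode {episode}: reward = {episode_reward}")
env.close()
```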
The Surprising Effectiveness of PPO in Cooperative, Multi-Agent Games
Proximal Policy Optimization (PPO) is a popular on-policy reinforcement
learning algorithm but is significantly less utilized than off-policy learning
algorithms in multi-agent settings. This is often due to the belief that on-policy
methods are significantly less sample efficient than their off-policy
counterparts in multi-agent problems. In this work, we investigate Multi-Agent
PPO (MAPPO), a variant of PPO which is specialized for multi-agent settings.
Using a 1-GPU desktop, we show that MAPPO achieves surprisingly strong
performance in three popular multi-agent testbeds: the particle-world
environments, the StarCraft Multi-Agent Challenge, and the Hanabi challenge,
with minimal hyperparameter tuning and without any domain-specific algorithmic
modifications or architectures. In the majority of environments, we find that
compared to off-policy baselines, MAPPO achieves strong results while
exhibiting comparable sample efficiency. Finally, through ablation studies, we
present the implementation and algorithmic factors which are most influential
to MAPPO's practical performance.
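At its core, MAPPO keeps the single-agent PPO update: decentralized actors conditioned on local observations are trained with the clipped surrogate objective, while advantages are estimated with a centralized critic that sees the global state during training. The sketch below illustrates that clipped loss; the function and variable names are ours, not the authors' implementation.

```python
import torch

def clipped_actor_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective, shared across agents in MAPPO-style training.

    Advantages would typically come from GAE using a centralized critic V(s)
    conditioned on the global state (an assumption of this sketch).
    """
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```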
Attention-Based Recurrence for Multi-Agent Reinforcement Learning under Stochastic Partial Observability
Stochastic partial observability poses a major challenge for decentralized
coordination in multi-agent reinforcement learning but is largely neglected in
state-of-the-art research due to a strong focus on state-based centralized
training for decentralized execution (CTDE) and on benchmarks that lack sufficient
stochasticity, such as the StarCraft Multi-Agent Challenge (SMAC). In this paper, we
propose Attention-based Embeddings of Recurrence In multi-Agent Learning
(AERIAL) to approximate value functions under stochastic partial observability.
AERIAL replaces the true state with a learned representation of multi-agent
recurrence, considering more accurate information about decentralized agent
decisions than state-based CTDE. We then introduce MessySMAC, a modified
version of SMAC with stochastic observations and higher variance in initial
states, to provide a more general and configurable benchmark regarding
stochastic partial observability. We evaluate AERIAL in Dec-Tiger as well as in
a variety of SMAC and MessySMAC maps, and compare the results with state-based
CTDE. Furthermore, we evaluate the robustness of AERIAL and state-based CTDE
against various stochasticity configurations in MessySMAC. Comment: Accepted at ICML 2023.
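As we read the abstract, the key architectural move is to feed the centralized value function not the true state but an attention-based summary of the agents' recurrent histories. The sketch below captures that idea in broad strokes; layer sizes, the pooling step, and all names are our assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class RecurrenceEmbedding(nn.Module):
    """Sketch: per-agent GRUs encode local observation histories, and self-attention
    across the resulting hidden states produces the embedding that replaces the
    true state as input to the centralized value function (our reading, simplified)."""

    def __init__(self, obs_dim, hidden_dim=64, n_heads=4):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.attn = nn.MultiheadAttention(hidden_dim, n_heads, batch_first=True)

    def forward(self, agent_histories):
        # agent_histories: (n_agents, timesteps, obs_dim); agents act as the batch dim.
        _, h_n = self.rnn(agent_histories)   # final hidden states: (1, n_agents, hidden_dim)
        z, _ = self.attn(h_n, h_n, h_n)      # reinterpret as (batch=1, seq=n_agents, dim); attend across agents
        return z.mean(dim=1)                 # pooled embedding for the central critic
```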
Offline Multi-Agent Reinforcement Learning with Implicit Global-to-Local Value Regularization
Offline reinforcement learning (RL) has received considerable attention in
recent years due to its attractive capability of learning policies from offline
datasets without environmental interactions. Despite some success in the
single-agent setting, offline multi-agent RL (MARL) remains a challenge.
The large joint state-action space and the coupled multi-agent behaviors pose
extra complexities for offline policy optimization. Most existing offline MARL
studies simply apply offline data-related regularizations on individual agents,
without fully considering the multi-agent system at the global level. In this
work, we present OMIGA, a new offline multi-agent RL algorithm with implicit
global-to-local value regularization. OMIGA provides a principled framework to
convert global-level value regularization into equivalent implicit local value
regularizations and simultaneously enables in-sample learning, thus elegantly
bridging multi-agent value decomposition and policy learning with offline
regularizations. Based on comprehensive experiments on the offline multi-agent
MuJoCo and StarCraft II micro-management tasks, we show that OMIGA achieves
superior performance over the state-of-the-art offline MARL methods in almost
all tasks.
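The abstract names two ingredients: a global-to-local conversion of value regularization via value decomposition, and in-sample learning that regresses values only on actions present in the dataset. The heavily simplified sketch below shows how those pieces could fit together (IQL-style expectile regression plus a monotonic linear mixer); it is not the paper's actual objective, and all names and the mixing form are assumptions.

```python
import torch

def expectile_loss(diff, tau=0.7):
    # Asymmetric squared loss used for in-sample value regression (IQL-style).
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()

def global_value(local_values, mix_weights, mix_bias):
    # local_values: (batch, n_agents); non-negative weights keep the global
    # value monotonic in each agent's local value (value-decomposition style).
    return (local_values * mix_weights.abs()).sum(dim=-1) + mix_bias

def in_sample_value_loss(local_q_dataset_actions, local_state_values, mix_weights, mix_bias):
    # local_q_dataset_actions: per-agent Q-values of actions actually taken in the
    # offline dataset, so no out-of-distribution actions are queried (in-sample learning).
    q_tot = global_value(local_q_dataset_actions, mix_weights, mix_bias)
    v_tot = global_value(local_state_values, mix_weights, mix_bias)
    return expectile_loss(q_tot - v_tot)
```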
On Reinforcement Learning for Full-length Game of StarCraft
StarCraft II poses a grand challenge for reinforcement learning. Its main
difficulties include a huge state and action space and a long time horizon.
In this paper, we investigate a hierarchical reinforcement learning approach
for StarCraft II. The hierarchy involves two levels of abstraction. One is the
macro-action, automatically extracted from expert trajectories, which reduces
the action space by an order of magnitude while remaining effective. The other is a
two-layer hierarchical architecture which is modular and easy to scale,
enabling a curriculum transferring from simpler tasks to more complex tasks.
The reinforcement training algorithm for this architecture is also
investigated. On a 64x64 map with restricted units, we achieve a win rate of
more than 99% against the difficulty level-1 built-in AI. Through the
curriculum transfer learning algorithm and a mixture of combat models, we
achieve a win rate of over 93% playing Protoss against the most difficult
non-cheating built-in Terran AI (level-7), training within two days on a
single machine with only 48 CPU cores and 8 K40 GPUs. The approach also shows
strong generalization performance when tested against previously unseen
opponents, including the cheating-level built-in AIs and all levels of the
Zerg and Protoss built-in AIs. We hope this study can shed some light on
future research in large-scale reinforcement learning. Comment: Appeared in AAAI 2019.
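For intuition, the sketch below shows the kind of two-level control loop such a hierarchy implies: a high-level policy picks a macro-action (mined from expert trajectories), and the corresponding low-level policy issues primitive actions for a fixed interval. The interval length and all interfaces are placeholders, not the paper's code.

```python
MACRO_INTERVAL = 8  # low-level steps per macro-action; placeholder value

def run_episode(env, high_level_policy, low_level_policies):
    """Two-level loop: the high-level policy selects a macro-action; the matching
    low-level policy acts in the environment until the interval ends or the
    episode terminates. All interfaces here are illustrative assumptions."""
    obs = env.reset()
    done = False
    total_reward = 0.0
    while not done:
        macro = high_level_policy.select(obs)  # e.g. an extracted macro-action id
        for _ in range(MACRO_INTERVAL):
            obs, reward, done, info = env.step(low_level_policies[macro].select(obs))
            total_reward += reward
            if done:
                break
    return total_reward
```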