Bi-level Actor-Critic for Multi-agent Coordination
Coordination is one of the essential problems in multi-agent systems.
Typically, multi-agent reinforcement learning (MARL) methods treat agents
equally, and the goal is to solve the Markov game to an arbitrary Nash
equilibrium (NE) when multiple equilibria exist, thus lacking a solution for NE
selection. In this paper, we treat agents \emph{unequally} and consider
Stackelberg equilibrium as a potentially better convergence point than Nash
equilibrium in terms of Pareto superiority, especially in cooperative
environments. Under Markov games, we formally define the bi-level reinforcement
learning problem of finding a Stackelberg equilibrium. We propose a novel
bi-level actor-critic learning method that allows agents to have different
knowledge bases (and hence different levels of intelligence), while their
actions can still be executed simultaneously and in a distributed manner. A
convergence proof is given, and the resulting learning algorithm is tested
against state-of-the-art methods. We find that the proposed bi-level
actor-critic algorithm successfully converges to the Stackelberg equilibria in
matrix games and finds an asymmetric solution in a highway-merge environment.
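To make the equilibrium-selection idea concrete, the following minimal sketch
computes a pure-strategy Stackelberg equilibrium of a known two-player matrix
game by bi-level enumeration. It illustrates only the solution concept, not
the paper's actor-critic method (which learns the equilibrium from samples);
the payoff matrices and function name are illustrative assumptions.

import numpy as np

def stackelberg_equilibrium(R_leader, R_follower):
    # Bi-level enumeration: for each leader commitment, the follower
    # best-responds; the leader keeps the commitment with the highest payoff.
    best_value, best_pair = -np.inf, None
    for a in range(R_leader.shape[0]):
        b = int(np.argmax(R_follower[a]))       # inner level: follower
        if R_leader[a, b] > best_value:         # outer level: leader
            best_value, best_pair = R_leader[a, b], (a, b)
    return best_pair, best_value

# A coordination game with two Nash equilibria, (0, 0) and (1, 1); the
# Stackelberg solution selects the Pareto-superior one.
R = np.array([[4.0, 0.0],
              [0.0, 2.0]])
print(stackelberg_equilibrium(R, R))            # ((0, 0), 4.0)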
Multiagent Bidirectionally-Coordinated Nets: Emergence of Human-level Coordination in Learning to Play StarCraft Combat Games
Many artificial intelligence (AI) applications require multiple intelligent
agents to work collaboratively. Efficient learning of inter-agent
communication and coordination is an indispensable step towards general AI. In
this paper, we take the StarCraft combat game as a case study, where the task
is to coordinate multiple agents as a team to defeat their enemies. To
maintain a scalable yet effective communication protocol, we introduce a
Multiagent Bidirectionally-Coordinated Network (BiCNet ['bIknet]) with a
vectorised extension of the actor-critic formulation. We show that BiCNet can
handle different types of combat with arbitrary numbers of AI agents on both
sides. Our analysis demonstrates that without any supervision such as human
demonstrations or labelled data, BiCNet can learn various types of advanced
coordination strategies that are commonly used by experienced game players. In
our experiments, we evaluate our approach against multiple baselines under
different scenarios; it shows state-of-the-art performance and holds potential
value for large-scale real-world applications.
Comment: 10 pages, 10 figures. Previously under the title "Multiagent
Bidirectionally-Coordinated Nets for Learning to Play StarCraft Combat
Games", Mar 2017
A Regularized Opponent Model with Maximum Entropy Objective
In a single-agent setting, reinforcement learning (RL) tasks can be cast as
an inference problem by introducing a binary random variable o, which stands
for "optimality". In this paper, we redefine the binary random variable o in
the multi-agent setting and formalize multi-agent reinforcement learning
(MARL) as probabilistic inference. We derive a variational lower bound on the
likelihood of achieving optimality and name it the Regularized Opponent Model
with Maximum Entropy Objective (ROMMEO). From ROMMEO, we present a novel
perspective on opponent modeling and show, both theoretically and empirically,
how it can improve the performance of training agents in cooperative games. To optimize
ROMMEO, we first introduce a tabular Q-iteration method, ROMMEO-Q, with a
proof of convergence. We extend the exact algorithm to complex environments by
proposing an approximate version, ROMMEO-AC. We evaluate the two algorithms on
a challenging iterated matrix game and a differential game, respectively, and
show that they can outperform strong MARL baselines.
Comment: Accepted to the International Joint Conference on Artificial
Intelligence (IJCAI 2019)
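The inference view builds on the standard maximum-entropy ("RL as inference")
machinery, in which the hard max of the Bellman backup is replaced by a
log-sum-exp soft value. The sketch below shows only that single-agent tabular
core, not the full ROMMEO objective with its regularized opponent model; all
names and defaults are assumptions.

import numpy as np

def soft_q_iteration(R, P, gamma=0.95, alpha=1.0, iters=500):
    # Entropy-regularized backup: V(s) = alpha * logsumexp(Q(s, .) / alpha).
    # R: (S, A) rewards; P: (S, A, S) transitions; alpha: temperature.
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        m = Q.max(axis=1)                                    # for stability
        V = m + alpha * np.log(np.exp((Q - m[:, None]) / alpha).sum(axis=1))
        Q = R + gamma * P.reshape(S * A, S).dot(V).reshape(S, A)
    pi = np.exp((Q - Q.max(axis=1, keepdims=True)) / alpha)
    return Q, pi / pi.sum(axis=1, keepdims=True)             # softmax policy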
Overcoming Exploration in Reinforcement Learning with Demonstrations
Exploration in environments with sparse rewards has been a persistent problem
in reinforcement learning (RL). Many tasks are natural to specify with a sparse
reward, and manually shaping a reward function can result in suboptimal
performance. However, finding a non-zero reward is exponentially more difficult
with increasing task horizon or action dimensionality. This puts many
real-world tasks out of practical reach of RL methods. In this work, we use
demonstrations to overcome the exploration problem and successfully learn to
perform long-horizon, multi-step robotics tasks with continuous control such as
stacking blocks with a robot arm. Our method, which builds on top of Deep
Deterministic Policy Gradients and Hindsight Experience Replay, provides an
order-of-magnitude speedup over RL on simulated robotics tasks. It is simple
to implement and makes only the additional assumption that we can collect a
small set of demonstrations. Furthermore, our method is able to solve tasks not
solvable by either RL or behavior cloning alone, and often ends up
outperforming the demonstrator policy.
Comment: 8 pages, ICRA 2018
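One ingredient that lets such a method eventually outperform its demonstrator
is a behavior-cloning auxiliary loss gated by a Q-filter: imitate a
demonstrated action only where the critic scores it above the policy's own
proposal. The sketch below assumes actor and critic callables standing in for
the learned networks, and the loss is added alongside, not in place of, the
usual DDPG losses.

import numpy as np

def q_filtered_bc_loss(actor, critic, demo_obs, demo_act):
    # Pull the actor toward a demonstrated action only on states where the
    # critic prefers it; where the policy already beats the demonstrator,
    # the cloning term vanishes.
    proposed = actor(demo_obs)                              # (N, act_dim)
    keep = critic(demo_obs, demo_act) > critic(demo_obs, proposed)  # Q-filter
    sq_err = ((proposed - demo_act) ** 2).sum(axis=1)
    return (keep * sq_err).mean()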