Multiagent Bidirectionally-Coordinated Nets: Emergence of Human-level Coordination in Learning to Play StarCraft Combat Games
Many artificial intelligence (AI) applications require multiple intelligent
agents to work in a collaborative effort. Efficient learning of inter-agent
communication and coordination is an indispensable step towards general AI. In
this paper, we take the StarCraft combat game as a case study, where the task
is to coordinate multiple agents as a team to defeat their enemies. To
maintain a scalable yet effective communication protocol, we introduce a
Multiagent Bidirectionally-Coordinated Network (BiCNet ['bɪknɛt]) with a
vectorised extension of the actor-critic formulation. We show that BiCNet can
handle different types of combat with arbitrary numbers of AI agents on both
sides. Our analysis demonstrates that, without any supervision such as human
demonstrations or labelled data, BiCNet can learn various types of advanced
coordination strategies commonly used by experienced game players. In our
experiments, we evaluate our approach against multiple baselines under
different scenarios; it achieves state-of-the-art performance and shows
potential value for large-scale real-world applications.
Comment: 10 pages, 10 figures. Previously titled "Multiagent
Bidirectionally-Coordinated Nets for Learning to Play StarCraft Combat
Games", Mar 201
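The coordination mechanism described above can be sketched as a bidirectional recurrent pass over the team, with shared actor and critic heads applied per agent. This is a minimal numpy illustration, not the paper's implementation: the layer sizes, the tanh recurrences, and the three-action discrete head are all hypothetical stand-ins, but the structure shows why the same parameters can serve an arbitrary number of agents on each side.

```python
import numpy as np

rng = np.random.default_rng(0)

N_AGENTS, OBS_DIM, HID = 5, 8, 16   # hypothetical sizes, not the paper's
N_ACTIONS = 3                       # toy discrete action set

# One shared parameter set for all agents, so the team size can vary freely.
W_in  = rng.normal(scale=0.1, size=(OBS_DIM, HID))
W_fwd = rng.normal(scale=0.1, size=(HID, HID))
W_bwd = rng.normal(scale=0.1, size=(HID, HID))
W_act = rng.normal(scale=0.1, size=(2 * HID, N_ACTIONS))  # actor head
W_val = rng.normal(scale=0.1, size=(2 * HID, 1))          # critic head

def bicoord_step(obs):
    """One bidirectional pass over the agent sequence: each agent's hidden
    state depends on teammates on both sides, giving implicit communication."""
    x = np.tanh(obs @ W_in)                      # (N, HID) per-agent features
    fwd = np.zeros_like(x)
    bwd = np.zeros_like(x)
    h = np.zeros(HID)
    for i in range(len(x)):                      # left-to-right sweep
        h = np.tanh(x[i] + h @ W_fwd)
        fwd[i] = h
    h = np.zeros(HID)
    for i in reversed(range(len(x))):            # right-to-left sweep
        h = np.tanh(x[i] + h @ W_bwd)
        bwd[i] = h
    hid = np.concatenate([fwd, bwd], axis=1)     # (N, 2*HID)
    logits = hid @ W_act
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    policy = e / e.sum(axis=1, keepdims=True)    # per-agent action distribution
    values = (hid @ W_val).ravel()               # per-agent value estimate
    return policy, values

obs = rng.normal(size=(N_AGENTS, OBS_DIM))
policy, values = bicoord_step(obs)
```

Because the recurrence is over the agent index rather than a fixed-width input, the same `bicoord_step` accepts a team of 3 or 30 agents without retraining-time shape changes, which is the property the abstract emphasises.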
End-to-end optimization of goal-driven and visually grounded dialogue systems
End-to-end design of dialogue systems has recently become a popular research
topic thanks to powerful tools such as encoder-decoder architectures for
sequence-to-sequence learning. Yet, most current approaches cast human-machine
dialogue management as a supervised learning problem, aiming at predicting the
next utterance of a participant given the full history of the dialogue. This
view is too simplistic to capture the planning problem intrinsic to dialogue,
as well as its grounded nature, which makes the context of a dialogue larger
than its history alone. This is why only chit-chat and question-answering
tasks have so far been addressed with end-to-end architectures. In this paper,
we introduce a Deep Reinforcement Learning method, based on the policy
gradient algorithm, to optimize visually grounded task-oriented dialogues.
This approach is tested on a dataset of 120k dialogues collected through
Mechanical Turk and provides encouraging results at both generating natural
dialogues and locating a specific object in a complex picture.
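The policy-gradient training loop named above can be sketched with REINFORCE on a toy token-level policy. This is an assumed simplification, not the paper's model: the vocabulary size, the linear policy `theta`, and the scalar terminal reward stand in for the real encoder-decoder and task-success signal, but the per-step log-probability gradients scaled by a delayed reward are the core of the method.

```python
import numpy as np

rng = np.random.default_rng(1)
VOCAB, STATE = 6, 4                # toy sizes; stand-ins for real features
theta = np.zeros((STATE, VOCAB))   # linear policy parameters

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def generate(state, max_len=5):
    """Sample an utterance token-by-token, keeping d log pi / d theta
    for every step so a delayed reward can be applied later."""
    tokens, grads = [], []
    for _ in range(max_len):
        p = softmax(state @ theta)
        a = rng.choice(VOCAB, p=p)
        onehot = np.zeros(VOCAB)
        onehot[a] = 1.0
        grads.append(np.outer(state, onehot - p))  # score-function gradient
        tokens.append(int(a))
    return tokens, grads

def reinforce_update(grads, reward, lr=0.1):
    """Terminal reward only (e.g. did the guesser find the object?):
    every sampled token shares the same end-of-dialogue signal."""
    global theta
    for g in grads:
        theta = theta + lr * reward * g

state = rng.normal(size=STATE)     # stand-in for the encoded dialogue context
tokens, grads = generate(state)
reinforce_update(grads, reward=1.0)
```

The key contrast with the supervised setup criticised in the abstract is that no "correct next utterance" is ever provided: only the end-of-task reward shapes which utterances become more likely.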
iPLAN: Intent-Aware Planning in Heterogeneous Traffic via Distributed Multi-Agent Reinforcement Learning
Navigating safely and efficiently in dense and heterogeneous traffic
scenarios is challenging for autonomous vehicles (AVs) due to their inability
to infer the behaviors or intentions of nearby drivers. In this work, we
introduce a distributed multi-agent reinforcement learning (MARL) algorithm
that can predict trajectories and intents in dense and heterogeneous traffic
scenarios. Our approach for intent-aware planning, iPLAN, allows agents to
infer nearby drivers' intents solely from their local observations. We model
two distinct incentives for agents' strategies: Behavioral Incentive for
high-level decision-making based on their driving behavior or personality and
Instant Incentive for motion planning for collision avoidance based on the
current traffic state. Our approach enables agents to infer their opponents'
behavior incentives and integrate this inferred information into their
decision-making and motion-planning processes. We perform experiments on two
simulation environments, Non-Cooperative Navigation and Heterogeneous Highway.
In Heterogeneous Highway, results show that, compared with
centralized-training decentralized-execution (CTDE) MARL baselines such as
QMIX and MAPPO, our method yields a 4.3% and 38.4% higher episodic reward in
mild and chaotic traffic respectively, with a 48.1% higher success rate and
80.6% longer survival time in chaotic traffic. We also compare with a
decentralized-training decentralized-execution (DTDE) baseline, IPPO, and
demonstrate a 12.7% and 6.3% higher episodic reward in mild and chaotic
traffic respectively, a 25.3% higher success rate, and 13.7% longer survival
time.
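The idea of inferring a neighbor's behavioral incentive from local observations alone can be sketched as a Bayesian filter over latent driving styles. This is an illustrative assumption, not iPLAN's actual inference module: the two styles and their action likelihoods below are invented for the sketch, but they show how an ego agent can sharpen a belief about a nearby driver purely from observed maneuvers, with no communication.

```python
import numpy as np

# Hypothetical latent driving styles and action models (not the paper's).
TYPES = ["conservative", "aggressive"]
LIKELIHOOD = {
    # P(observed maneuver | latent style)
    "conservative": {"brake": 0.5, "accel": 0.1, "cruise": 0.4},
    "aggressive":   {"brake": 0.1, "accel": 0.6, "cruise": 0.3},
}

def update_belief(belief, observed_action):
    """One Bayes step over a neighbor's latent driving style, using only a
    locally observed maneuver (no message passing between vehicles)."""
    posterior = np.array([belief[i] * LIKELIHOOD[t][observed_action]
                          for i, t in enumerate(TYPES)])
    return posterior / posterior.sum()

belief = np.array([0.5, 0.5])               # uniform prior over styles
for act in ["accel", "accel", "cruise"]:    # locally observed maneuvers
    belief = update_belief(belief, act)

inferred = TYPES[int(np.argmax(belief))]    # style fed to downstream planning
```

In the abstract's terms, `belief` plays the role of the inferred Behavioral Incentive: a high-level estimate of the neighbor's style that the ego agent can then feed into its own decision-making and motion planning, while the Instant Incentive would react to the raw current traffic state.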