Learning Heterogeneous Agent Cooperation via Multiagent League Training
Many multiagent systems in the real world include multiple types of agents
with different abilities and functionality. Such heterogeneous multiagent
systems have significant practical advantages. However, compared with
homogeneous systems, they also pose challenges for multiagent reinforcement
learning, such as non-stationarity and the policy version iteration
issue. This work proposes a general-purpose reinforcement learning algorithm
named Heterogeneous League Training (HLT) to address heterogeneous
multiagent problems. HLT keeps track of a pool of policies that agents have
explored during training, gathering a league of heterogeneous policies to
facilitate future policy optimization. Moreover, a hyper-network is introduced
to increase the diversity of agent behaviors when collaborating with teammates
having different levels of cooperation skills. We use heterogeneous benchmark
tasks to demonstrate that (1) HLT improves the success rate in cooperative
heterogeneous tasks; (2) HLT is an effective approach to solving the policy
version iteration problem; and (3) HLT provides a practical way to assess the
difficulty of learning each role in a heterogeneous team.
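The league idea described in this abstract can be sketched as a pool of snapshotted policies per role, sampled as training teammates. This is a minimal illustration only; the class and method names below are hypothetical, not from the paper, and policies are stood in for by plain strings.

```python
import random

class League:
    """Sketch of a heterogeneous policy league: one pool of past
    policies per role, sampled to provide teammates with varying
    levels of cooperation skill."""

    def __init__(self, roles):
        # one policy pool per heterogeneous role
        self.pools = {role: [] for role in roles}

    def add(self, role, policy):
        # snapshot a policy explored during training
        self.pools[role].append(policy)

    def sample_team(self, learner_role, learner_policy):
        # the learning agent keeps its current policy; the other
        # roles are filled by teammates drawn from the league
        team = {learner_role: learner_policy}
        for role, pool in self.pools.items():
            if role != learner_role and pool:
                team[role] = random.choice(pool)
        return team

league = League(["scout", "carrier"])
league.add("carrier", "carrier_v0")
league.add("carrier", "carrier_v1")
team = league.sample_team("scout", "scout_current")
```

Training against a mixture of old teammate policies, rather than only the latest ones, is what lets the learner cope with partners at different skill levels.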
Heterogeneous Multi-Agent Reinforcement Learning via Mirror Descent Policy Optimization
This paper presents an extension of the Mirror Descent method to overcome
challenges in cooperative Multi-Agent Reinforcement Learning (MARL) settings,
where agents have varying abilities and individual policies. The proposed
Heterogeneous-Agent Mirror Descent Policy Optimization (HAMDPO) algorithm
utilizes the multi-agent advantage decomposition lemma to enable efficient
policy updates for each agent while ensuring overall performance improvements.
By iteratively updating agent policies through an approximate solution of the
trust-region problem, HAMDPO guarantees stability and improves performance.
Moreover, the HAMDPO algorithm is capable of handling both continuous and
discrete action spaces for heterogeneous agents in various MARL problems. We
evaluate HAMDPO on Multi-Agent MuJoCo and StarCraft II tasks, demonstrating its
superiority over state-of-the-art algorithms such as HATRPO and HAPPO. These
results suggest that HAMDPO is a promising approach for solving cooperative
MARL problems and could potentially be extended to address other challenging
problems in the field of MARL.
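The core mirror-descent step the abstract refers to has a well-known closed form in the discrete-action case: maximizing the expected advantage minus a KL penalty to the old policy yields an exponentiated-advantage reweighting. The sketch below shows that single-agent step only; HAMDPO's sequential, decomposition-based multi-agent scheme and trust-region approximation are not reproduced, and the function name is illustrative.

```python
import numpy as np

def mirror_descent_update(probs, advantages, step_size):
    """One mirror-descent policy step over a discrete action
    distribution. The solution of
        max_p  <p, A> - (1/step_size) * KL(p || probs)
    is p proportional to probs * exp(step_size * A)."""
    new = probs * np.exp(step_size * advantages)
    return new / new.sum()

old = np.array([0.25, 0.25, 0.25, 0.25])
adv = np.array([1.0, 0.0, 0.0, -1.0])
new = mirror_descent_update(old, adv, step_size=1.0)
```

The KL term is what gives the stability guarantee mentioned in the abstract: each update stays close to the previous policy while shifting probability toward high-advantage actions.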
A Cooperative Optimal Control Framework for Connected and Automated Vehicles in Mixed Traffic Using Social Value Orientation
In this paper, we develop a socially cooperative optimal control framework
for connected and automated vehicles (CAVs) in mixed traffic using social value
orientation (SVO). In our approach, we formulate the interaction between a CAV
and a human-driven vehicle (HDV) as a simultaneous game to facilitate the
derivation of a Nash equilibrium. In the imposed game, each vehicle minimizes a
weighted sum of its egoistic objective and a cooperative objective. The SVO
angles are used to quantify each vehicle's preference toward the egoistic and
cooperative objectives, which leads to an appropriate design of the weighting factors
in a multi-objective optimal control problem. We prove that by solving the
proposed optimal control problem, a Nash equilibrium can be obtained. To
estimate the SVO angle of the HDV, we develop a receding horizon estimation
based on maximum entropy inverse reinforcement learning. The effectiveness of
the proposed approach is demonstrated by numerical simulations at a highway
on-ramp merging scenario.
Comment: submitted to CDC202
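The SVO weighting the abstract describes is conventionally expressed through the angle's cosine and sine. The sketch below uses that standard convention as an assumption; the paper's exact weighting factors and cost definitions may differ.

```python
import math

def svo_cost(phi, ego_cost, coop_cost):
    """Weighted objective for one vehicle: the SVO angle phi trades
    off the egoistic cost against the cooperative cost.
    phi = 0 is fully egoistic; phi = pi/2 is fully prosocial.
    The cos/sin weighting is the common SVO convention, assumed
    here for illustration."""
    return math.cos(phi) * ego_cost + math.sin(phi) * coop_cost
```

With an estimated SVO angle for the human driver (as in the paper's inverse-RL estimator), each vehicle's side of the game reduces to minimizing such a weighted sum.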
Task-Based Information Compression for Multi-Agent Communication Problems with Channel Rate Constraints
A collaborative task is assigned to a multiagent system (MAS) in which agents
are allowed to communicate. The MAS runs over an underlying Markov decision
process and its task is to maximize the averaged sum of discounted one-stage
rewards. Although knowing the global state of the environment is necessary for
the optimal action selection of the MAS, agents are limited to individual
observations. Inter-agent communication can tackle the issue of local
observability; however, its limited rate prevents agents from acquiring
precise global state information. To overcome this challenge, agents need to
communicate their observations compactly, so that the MAS sacrifices as
little of the sum of discounted rewards as possible.
We show that this problem is equivalent to a form of the rate-distortion
problem, which we call task-based information compression. We introduce a
scheme for task-based information compression, State Aggregation for
Information Compression (SAIC), for which the state aggregation algorithm is analytically
designed. SAIC is shown to be capable of achieving near-optimal performance
in terms of the achieved sum of discounted rewards. The proposed algorithm is
applied to a rendezvous problem and its performance is compared with several
benchmarks. Numerical experiments confirm the superiority of the proposed
algorithm.
Comment: 13 pages, 9 figures
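The rate-saving mechanism behind state aggregation can be sketched in a few lines: mapping a local observation to one of k aggregate states means each message costs only ceil(log2(k)) bits. The aggregation boundaries below are illustrative placeholders; SAIC derives them analytically from the task, which this sketch does not attempt.

```python
import math

def aggregate(observation, boundaries):
    """Map a scalar local observation to an aggregate-state index
    via sorted interval boundaries (illustrative values; SAIC
    designs the aggregation analytically from the task)."""
    for i, b in enumerate(boundaries):
        if observation < b:
            return i
    return len(boundaries)

def bits_per_message(num_states):
    # channel cost of sending one aggregate-state index
    return math.ceil(math.log2(num_states))

bounds = [0.25, 0.5, 0.75]   # 4 aggregate states -> 2 bits/message
```

Coarser aggregation lowers the rate but blurs the global state, which is exactly the task-based rate-distortion trade-off the abstract formalizes.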
Guided Deep Reinforcement Learning for Swarm Systems
In this paper, we investigate how to learn to control a group of cooperative
agents with limited sensing capabilities such as robot swarms. The agents have
only very basic sensor capabilities, yet in a group they can accomplish
sophisticated tasks, such as distributed assembly or search and rescue tasks.
Learning a policy for a group of agents is difficult due to distributed partial
observability of the state. Here, we follow a guided approach where a critic
has central access to the global state during learning, which simplifies the
policy evaluation problem from a reinforcement learning point of view. For
example, we can get the positions of all robots of the swarm using a camera
image of a scene. This camera image is only available to the critic and not to
the control policies of the robots. We follow an actor-critic approach, where
the actors base their decisions only on locally sensed information. In
contrast, the critic is learned based on the true global state. Our algorithm
uses deep reinforcement learning to approximate both the Q-function and the
policy. The performance of the algorithm is evaluated on two tasks with simple
simulated 2D agents: 1) finding and maintaining a certain distance to each
other, and 2) locating a target.
Comment: 15 pages, 8 figures, accepted at the AAMAS 2017 Autonomous Robots and
Multirobot Systems (ARMS) Workshop
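The asymmetry at the heart of this guided approach — actors restricted to local observations, critic given the true global state during training — can be sketched with simple linear stand-ins for the deep networks. The dimensions and weight matrices below are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear function approximators standing in for the paper's deep
# networks (dimensions are illustrative).
W_actor = rng.normal(size=(2, 3))   # local obs (2-dim) -> 3 action logits
W_critic = rng.normal(size=(6,))    # global state (6-dim) -> value

def actor(local_obs):
    """Decentralized policy: decides from locally sensed
    information only (softmax over action logits)."""
    logits = local_obs @ W_actor
    e = np.exp(logits - logits.max())
    return e / e.sum()

def critic(global_state):
    """Centralized critic: during learning it may use the full
    global state, e.g. all robot positions from a camera image."""
    return float(global_state @ W_critic)

p = actor(np.array([0.5, -0.2]))
```

At deployment only the actor is needed, so the swarm runs on local sensing alone while the global information was exploited purely to simplify policy evaluation during training.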