Context-Aware Sparse Deep Coordination Graphs
Learning sparse coordination graphs adaptive to the coordination dynamics
among agents is a long-standing problem in cooperative multi-agent learning.
This paper studies this problem and proposes a novel method using the variance
of payoff functions to construct context-aware sparse coordination topologies.
We theoretically consolidate our method by proving that the smaller the
variance of payoff functions is, the less likely action selection will change
after removing the corresponding edge. Moreover, we propose to learn action
representations to effectively reduce the influence of payoff functions'
estimation errors on graph construction. To empirically evaluate our method, we
present the Multi-Agent COordination (MACO) benchmark by collecting classic
coordination problems in the literature, increasing their difficulty, and
classifying them into different types. We carry out a case study and
experiments on the MACO and StarCraft II micromanagement benchmarks to
demonstrate the dynamics of sparse graph learning, the influence of graph
sparseness, and the learning performance of our method.
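As a rough illustration of the variance criterion described above (a minimal sketch with invented names, not the authors' implementation), one can estimate the variance of each pairwise payoff matrix and keep only the highest-variance edges, since a near-constant payoff is unlikely to change greedy action selection when its edge is removed:

import numpy as np

def sparse_topology(payoffs, k):
    """Keep the k highest-variance edges of a dense coordination graph.

    payoffs: dict mapping edge (i, j) -> matrix of shape (|A_i|, |A_j|)
             holding the estimated pairwise payoff q_ij(a_i, a_j).
    Returns the list of retained edges.
    """
    variances = {e: float(np.var(q)) for e, q in payoffs.items()}
    # Low variance => removing the edge rarely changes the argmax,
    # so only the k most context-relevant edges are kept.
    return sorted(variances, key=variances.get, reverse=True)[:k]

# Toy usage: three agents with two actions each.
rng = np.random.default_rng(0)
payoffs = {(0, 1): rng.normal(size=(2, 2)),
           (1, 2): 0.01 * rng.normal(size=(2, 2)),  # near-constant payoff
           (0, 2): rng.normal(size=(2, 2))}
print(sparse_topology(payoffs, k=2))  # the flat (1, 2) edge is dropped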
Pareto Actor-Critic for Equilibrium Selection in Multi-Agent Reinforcement Learning
This work focuses on equilibrium selection in no-conflict multi-agent games,
where we specifically study the problem of selecting a Pareto-optimal
equilibrium among several existing equilibria. It has been shown that many
state-of-the-art multi-agent reinforcement learning (MARL) algorithms are prone
to converging to Pareto-dominated equilibria due to the uncertainty each agent
has about the policy of the other agents during training. To address
sub-optimal equilibrium selection, we propose Pareto Actor-Critic (Pareto-AC),
which is an actor-critic algorithm that utilises a simple property of
no-conflict games (a superset of cooperative games): the Pareto-optimal
equilibrium in a no-conflict game maximises the returns of all agents and
therefore is the preferred outcome for all agents. We evaluate Pareto-AC in a
diverse set of multi-agent games and show that it converges to higher episodic
returns compared to seven state-of-the-art MARL algorithms and that it
successfully converges to a Pareto-optimal equilibrium in a range of matrix
games. Finally, we propose PACDCG, a graph neural network extension of
Pareto-AC, which is shown to scale efficiently to games with a large number of
agents.
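The property Pareto-AC exploits can be seen in a toy no-conflict game such as the classic climbing game; the snippet below is a hedged illustration of the selection target only, not of the actor-critic algorithm itself:

import numpy as np

# Climbing game: both agents receive the same payoff (fully cooperative,
# a special case of no-conflict). Rows are agent 1's actions, columns
# agent 2's.
payoff = np.array([[ 11, -30,  0],
                   [-30,   7,  6],
                   [  0,   0,  5]])

# In a no-conflict game the Pareto-optimal equilibrium maximises every
# agent's return, so it is simply the argmax over *joint* actions --
# whereas independent learners, each uncertain about the other's policy,
# often settle on a safer Pareto-dominated equilibrium such as (2, 2).
joint = np.unravel_index(np.argmax(payoff), payoff.shape)
print(joint, payoff[joint])  # (0, 0) with return 11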
Is Centralized Training with Decentralized Execution Framework Centralized Enough for MARL?
Centralized Training with Decentralized Execution (CTDE) has recently emerged
as a popular framework for cooperative Multi-Agent Reinforcement Learning
(MARL), where agents can use additional global state information to guide
training in a centralized way and make their own decisions only based on
decentralized local policies. Despite the encouraging results achieved, CTDE
makes an independence assumption on agent policies, which prevents agents from
adopting global cooperative information from each other during centralized
training. We therefore argue that existing CTDE methods cannot fully utilize
global information for training, leading to inefficient joint-policy
exploration and even suboptimal results. In this paper, we introduce a novel
Centralized Advising and Decentralized Pruning (CADP) framework for multi-agent
reinforcement learning that not only enables effective message exchange among
agents during training but also guarantees independent policies at execution.
First, CADP endows agents with an explicit communication channel to seek and
take advice from other agents for more centralized training. To further ensure
decentralized execution, we propose a smooth model pruning mechanism that
progressively closes off agent communication without degrading cooperation
capability. Empirical evaluations on StarCraft II micromanagement and Google
Research Football benchmarks demonstrate that the proposed framework achieves
superior performance compared with state-of-the-art counterparts. Our code
will be made publicly available.
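A rough sketch of the advise-then-prune idea, under assumed names and a hand-rolled annealing schedule rather than the paper's actual architecture: during training each agent mixes in advice from all agents, and a smooth mask anneals the cross-agent weights away so the final policy depends only on local information:

import numpy as np

def advised_features(messages, agent_id, progress):
    """Mix an agent's own message with advice from the other agents.

    messages: array of shape (n_agents, d) -- per-agent embeddings.
    progress: training progress in [0, 1]; cross-agent advice is smoothly
              annealed so that at progress = 1 only self-information
              remains, recovering a fully decentralized policy.
    """
    n, _ = messages.shape
    attn = np.full(n, 1.0 / n)        # uniform advice weights (a stand-in
                                      # for a learned attention head)
    keep = max(0.0, 1.0 - progress)   # smooth pruning schedule
    mask = np.full(n, keep)
    mask[agent_id] = 1.0              # never prune the self-connection
    w = attn * mask
    w /= w.sum()
    return w @ messages               # advice-weighted feature

rng = np.random.default_rng(0)
msgs = rng.normal(size=(3, 4))
print(advised_features(msgs, agent_id=0, progress=0.0))  # uses all advice
print(advised_features(msgs, agent_id=0, progress=1.0))  # self only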
UneVEn: Universal Value Exploration for Multi-Agent Reinforcement Learning
VDN and QMIX are two popular value-based algorithms for cooperative MARL that
learn a centralized action value function as a monotonic mixing of per-agent
utilities. While this enables easy decentralization of the learned policy, the
restricted joint action value function can prevent them from solving tasks that
require significant coordination between agents at a given timestep. We show
that this problem can be overcome by improving the joint exploration of all
agents during training. Specifically, we propose a novel MARL approach called
Universal Value Exploration (UneVEn) that learns a set of related tasks
simultaneously with a linear decomposition of universal successor features.
With the policies of already solved related tasks, the joint exploration
process of all agents can be improved to help them achieve better coordination.
Empirical results on a set of exploration games, challenging cooperative
predator-prey tasks requiring significant coordination among agents, and
StarCraft II micromanagement benchmarks show that UneVEn can solve tasks where
other state-of-the-art MARL methods fail.
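The linear decomposition UneVEn builds on can be sketched with generic successor features (shapes and names here are illustrative assumptions, not the authors' code): with features psi(s, a) and a task vector w, the action value is linear, Q(s, a; w) = psi(s, a) . w, so policies for related tasks are obtained by swapping w, and their greedy actions can diversify joint exploration:

import numpy as np

def q_values(psi, w):
    """Universal successor features: Q(s, a; w) = psi(s, a) . w.

    psi: array (n_actions, d) of successor features for the current state.
    w:   task weight vector (d,) defining a (related) task's reward.
    """
    return psi @ w

rng = np.random.default_rng(0)
psi = rng.normal(size=(4, 3))          # 4 actions, 3-dim features
w_target = np.array([1.0, 0.0, 0.0])   # target task
w_related = np.array([0.8, 0.2, 0.0])  # a related task learned alongside

# Greedy actions under related tasks can differ from the target task's,
# which is what drives the improved joint exploration.
print(np.argmax(q_values(psi, w_target)),
      np.argmax(q_values(psi, w_related)))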
Self-Organized Polynomial-Time Coordination Graphs
Coordination graphs are a promising approach to model agent collaboration in
multi-agent reinforcement learning. They conduct a graph-based value
factorization and induce explicit coordination among agents to complete
complicated tasks. However, one critical challenge in this paradigm is the
complexity of greedy action selection with respect to the factorized values.
This corresponds to the decentralized constraint optimization problem (DCOP),
which is NP-hard, as is its constant-ratio approximation. To bypass this
systematic hardness, this paper proposes a novel method, named Self-Organized
Polynomial-time Coordination Graphs (SOP-CG), which uses structured graph
classes to guarantee the accuracy and the computational efficiency of
collaborative action selection. SOP-CG employs dynamic graph topologies to
ensure sufficient value-function expressiveness, and unifies graph selection
into an end-to-end learning paradigm. In experiments, we show that our
approach learns succinct and well-adapted graph topologies, induces effective
coordination, and improves performance across a variety of cooperative
multi-agent tasks.
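The polynomial-time guarantee comes from restricting the graph to classes on which greedy selection is exact; on a tree, for instance, maximising a sum of pairwise payoffs is a standard leaf-to-root dynamic program. The sketch below is that generic tree algorithm, not SOP-CG's implementation:

import numpy as np

def tree_argmax(n_actions, edges, payoffs, root=0):
    """Exact greedy joint action on a tree-structured coordination graph.

    On trees this maximisation is solved exactly in polynomial time by
    dynamic programming; on a general graph it is an NP-hard DCOP.

    edges:   list of (parent, child) pairs forming a tree rooted at `root`.
    payoffs: dict (parent, child) -> matrix (n_actions, n_actions).
    """
    children, pick, actions = {}, {}, {}
    for p, c in edges:
        children.setdefault(p, []).append(c)

    def up(node):                              # leaf-to-root value passing
        v = np.zeros(n_actions)
        for c in children.get(node, []):
            table = payoffs[(node, c)] + up(c) # scores over (a_node, a_c)
            pick[c] = table.argmax(axis=1)     # child's best reply
            v += table.max(axis=1)
        return v

    def down(node):                            # root-to-leaf action decoding
        for c in children.get(node, []):
            actions[c] = int(pick[c][actions[node]])
            down(c)

    actions[root] = int(up(root).argmax())
    down(root)
    return actions

# Toy usage: a 3-agent chain with 2 actions per agent.
edges = [(0, 1), (1, 2)]
rng = np.random.default_rng(0)
payoffs = {e: rng.normal(size=(2, 2)) for e in edges}
print(tree_argmax(2, edges, payoffs))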
Influence of Team Interactions on Multi-Robot Cooperation: A Relational Network Perspective
Relational networks within a team play a critical role in the performance of
many real-world multi-robot systems. To successfully accomplish tasks that
require cooperation and coordination, different agents (e.g., robots)
require different priorities based on their position within the team.
Yet, many of the existing multi-robot cooperation algorithms regard agents as
interchangeable and lack a mechanism to guide the type of cooperation strategy
the agents should exhibit. To account for the team structure in cooperative
tasks, we propose a novel algorithm that uses a relational network comprising
inter-agent relationships to prioritize certain agents over others. Through
appropriate design of the team's relational network, we can guide the
cooperation strategy, resulting in the emergence of new behaviors that
accomplish the specified task. We conducted six experiments in a multi-robot
setting with a cooperative task. Our results demonstrate that the proposed
method can effectively influence the type of solution that the algorithm
converges to by specifying the relationships between the agents, making it a
promising approach for tasks that require cooperation among agents with a
specified team structure.
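One simple way to operationalise a relational network, offered as an illustrative assumption rather than the paper's exact formulation, is a weighted digraph whose entry W[i, j] encodes how much agent i values agent j's outcome, so that shaping rewards through W steers which cooperative behavior emerges:

import numpy as np

# Illustrative relational network for three robots: W[i, j] is how much
# agent i cares about agent j's outcome.
W = np.array([[0.6, 0.4, 0.0],   # agent 0 partly prioritises agent 1
              [0.0, 1.0, 0.0],   # agent 1 is self-interested
              [0.0, 0.5, 0.5]])  # agent 2 also supports agent 1

individual_rewards = np.array([1.0, 0.0, 2.0])

# Relationship-shaped rewards: each agent optimises a blend of teammates'
# returns, so redesigning W changes the cooperation strategy that emerges.
print(W @ individual_rewards)  # [0.6, 0.0, 1.0]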
AccMER: Accelerating Multi-Agent Experience Replay with Cache Locality-aware Prioritization
Multi-Agent Experience Replay (MER) is a key component of off-policy
reinforcement learning (RL) algorithms. By remembering and reusing experiences
from the past, experience replay significantly improves the stability of RL
algorithms and their learning efficiency. In many scenarios, multiple agents
interact in a shared environment during online training under the centralized
training and decentralized execution (CTDE) paradigm. Current multi-agent
reinforcement learning (MARL) algorithms consider experience replay with
uniform sampling or based on priority weights to improve transition data sample
efficiency in the sampling phase. However, moving transition data histories for
each agent through the processor memory hierarchy is a performance limiter.
Also, as the agents' transitions are continuously renewed every iteration, the
finite cache capacity results in increased cache misses.
To this end, we propose AccMER, which repeatedly reuses the
transitions (experiences) for a window of steps in order to improve the
cache locality and minimize the transition data movement, instead of sampling
new transitions at each step. Specifically, our optimization uses priority
weights to select the transitions so that only high-priority transitions will
be reused frequently, thereby improving the cache performance. Our experimental
results on the Predator-Prey environment demonstrate the effectiveness of
reusing the essential transitions based on the priority weights, where we
observe an end-to-end training-time reduction compared to existing prioritized
MER algorithms without notable degradation in the mean reward.
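A minimal sketch of the reuse-window idea, with illustrative names and a no-op stand-in for the learner (not the AccMER implementation): rather than drawing a fresh prioritized batch at every update, one high-priority batch is sampled and reused for the next few updates so the same transitions stay cache-resident:

import numpy as np

def learner_update(batch):
    pass  # stand-in for the actual MARL update (e.g., a QMIX step)

def train(buffer, priorities, n_updates, batch_size=32, window=4, seed=0):
    """Cache-locality-aware replay: resample only every `window` updates.

    buffer:     array of transitions (plain indices here, for illustration).
    priorities: per-transition priority weights.
    """
    rng = np.random.default_rng(seed)
    batch = None
    for step in range(n_updates):
        if step % window == 0:                # refresh occasionally; the
            p = priorities / priorities.sum() # high-priority batch is then
            idx = rng.choice(len(buffer), size=batch_size, p=p)
            batch = buffer[idx]               # reused while cache-resident
        learner_update(batch)                 # reused `window` times

buffer = np.arange(1000)
priorities = np.random.default_rng(1).random(1000)
train(buffer, priorities, n_updates=8)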