Shapley Q-value: A Local Reward Approach to Solve Global Reward Games
Cooperative games are a critical research area in multi-agent reinforcement
learning (MARL). The global reward game is a subclass of cooperative games in
which all agents aim to maximize the global reward. Credit assignment is an
important problem studied in the global reward game. Most previous works take
the view of a non-cooperative-game theoretical framework with the shared reward
approach, i.e., each agent is directly assigned the shared global reward. This,
however, may give each agent an inaccurate reward signal for its contribution
to the group, which can cause inefficient learning. To deal with this problem,
we i) introduce a cooperative-game theoretical framework called the extended
convex game (ECG), which is a superset of the global reward game, and ii)
propose a local reward approach called the Shapley Q-value. The Shapley Q-value
distributes the global reward according to each agent's own contribution, in
contrast to the shared reward approach. Moreover, we derive an MARL algorithm
called Shapley Q-value deep deterministic policy gradient (SQDDPG), which uses
the Shapley Q-value as the critic for each agent. We evaluate SQDDPG on
Cooperative Navigation, Prey-and-Predator and Traffic Junction against
state-of-the-art algorithms such as MADDPG, COMA, Independent DDPG and
Independent A2C. In the experiments, SQDDPG shows a significant improvement in
convergence rate. Finally, we plot the Shapley Q-values and validate the
property of fair credit assignment.
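
As a concrete illustration of the local reward idea (a minimal sketch, not the
paper's implementation): the Shapley value of an agent can be estimated by
Monte Carlo sampling of coalition orderings, averaging the agent's marginal
contribution across samples. The coalition value function q_coalition below is
a hypothetical stand-in for a learned coalition Q-function.

    import random

    def shapley_q_value(agent, agents, q_coalition, num_samples=1000):
        # Monte Carlo approximation of the Shapley value: average the agent's
        # marginal contribution over randomly sampled join orders.
        total = 0.0
        for _ in range(num_samples):
            order = random.sample(agents, len(agents))  # random permutation
            predecessors = frozenset(order[:order.index(agent)])
            # Marginal contribution: coalition value with vs. without the agent.
            total += q_coalition(predecessors | {agent}) - q_coalition(predecessors)
        return total / num_samples

    # Toy 3-agent game: each member adds 1, and agents 0 and 1 add a joint bonus of 2.
    agents = [0, 1, 2]
    q = lambda c: len(c) + (2.0 if {0, 1} <= c else 0.0)
    print([shapley_q_value(a, agents, q) for a in agents])
    # approximately [2.0, 2.0, 1.0]; the values sum to q({0, 1, 2}) = 5 (efficiency).
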
Counterfactual Multi-Agent Policy Gradients
Cooperative multi-agent systems can naturally be used to model many real-world
problems, such as network packet routing and the coordination of
autonomous vehicles. There is a great need for new reinforcement learning
methods that can efficiently learn decentralised policies for such systems. To
this end, we propose a new multi-agent actor-critic method called
counterfactual multi-agent (COMA) policy gradients. COMA uses a centralised
critic to estimate the Q-function and decentralised actors to optimise the
agents' policies. In addition, to address the challenges of multi-agent credit
assignment, it uses a counterfactual baseline that marginalises out a single
agent's action, while keeping the other agents' actions fixed. COMA also uses a
critic representation that allows the counterfactual baseline to be computed
efficiently in a single forward pass. We evaluate COMA in the testbed of
StarCraft unit micromanagement, using a decentralised variant with significant
partial observability. COMA significantly improves average performance over
other multi-agent actor-critic methods in this setting, and the best performing
agents are competitive with state-of-the-art centralised controllers that get
access to the full state.
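
To make the counterfactual baseline concrete, here is a minimal sketch with
illustrative names (not the authors' code): given the centralised critic's
Q-values for every action of one agent, obtained in a single forward pass, the
baseline is the policy-weighted average of those Q-values, and the advantage is
the chosen action's Q-value minus that baseline.

    import numpy as np

    def coma_advantage(q_values, policy_probs, chosen_action):
        # q_values: the centralised critic's Q-estimate for each action of this
        # agent, with all other agents' actions held fixed (one forward pass).
        # policy_probs: this agent's current policy over the same actions.
        baseline = np.dot(policy_probs, q_values)  # marginalise out own action
        return q_values[chosen_action] - baseline

    q = np.array([1.0, 3.0, 2.0])   # per-action Q-values for this agent
    pi = np.array([0.2, 0.5, 0.3])  # agent's policy over those actions
    print(coma_advantage(q, pi, chosen_action=1))  # 3.0 - 2.3 = 0.7
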
- …