Expected Policy Gradients
We propose expected policy gradients (EPG), which unify stochastic policy
gradients (SPG) and deterministic policy gradients (DPG) for reinforcement
learning. Inspired by Expected Sarsa, EPG integrates across the action when
estimating the gradient, instead of relying only on the action in the sampled
trajectory. We establish a new general policy gradient theorem, of which the
stochastic and deterministic policy gradient theorems are special cases. We
also prove that EPG reduces the variance of the gradient estimates without
requiring deterministic policies and, for the Gaussian case, with no
computational overhead. Finally, we show that it is optimal in a certain sense
to explore with a Gaussian policy such that the covariance is proportional to
the exponential of the scaled Hessian of the critic with respect to the
actions. We present empirical results confirming that this new form of
exploration substantially outperforms DPG with the Ornstein-Uhlenbeck heuristic
in four challenging MuJoCo domains.
Comment: Conference paper, AAAI-18, 12 pages including supplemen
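
The idea of integrating across the action, rather than using only the sampled one, can be illustrated in a small discrete-action setting. The following is a minimal sketch, not the paper's implementation: it assumes a softmax policy over three actions at a single state and a known vector of critic values, and the names `softmax`, `sampled_gradient`, `expected_gradient`, `logits`, and `q_values` are hypothetical. With discrete actions the integral becomes a sum, and the sketch checks that the policy-weighted average of the single-action estimates equals the summed form.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()


def sampled_gradient(logits, q_values, action):
    """Score-function estimate that uses only one sampled action.

    For a softmax policy, grad_logits log pi(a) = onehot(a) - pi.
    """
    pi = softmax(logits)
    grad_log_pi = -pi  # new array, safe to modify in place
    grad_log_pi[action] += 1.0
    return q_values[action] * grad_log_pi


def expected_gradient(logits, q_values):
    """Gradient obtained by summing over all actions at this state.

    sum_a grad pi(a) * Q(a) simplifies to pi * (Q - V) for softmax logits,
    where V = sum_a pi(a) * Q(a).
    """
    pi = softmax(logits)
    value = pi @ q_values
    return pi * (q_values - value)


# Assumed toy inputs for a single state (hypothetical numbers).
logits = np.array([0.5, -1.0, 2.0])
q_values = np.array([1.0, 3.0, -2.0])

pi = softmax(logits)
# Exact expectation of the sampled estimate under the policy.
average_of_samples = sum(
    pi[a] * sampled_gradient(logits, q_values, a) for a in range(len(pi))
)
summed_form = expected_gradient(logits, q_values)
```

Both quantities agree in expectation; the summed form simply removes the randomness that comes from which action happened to be sampled at this state.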
Counterfactual Multi-Agent Policy Gradients
Cooperative multi-agent systems can be naturally used to model many
real-world problems, such as network packet routing and the coordination of
autonomous vehicles. There is a great need for new reinforcement learning
methods that can efficiently learn decentralised policies for such systems. To
this end, we propose a new multi-agent actor-critic method called
counterfactual multi-agent (COMA) policy gradients. COMA uses a centralised
critic to estimate the Q-function and decentralised actors to optimise the
agents' policies. In addition, to address the challenges of multi-agent credit
assignment, it uses a counterfactual baseline that marginalises out a single
agent's action, while keeping the other agents' actions fixed. COMA also uses a
critic representation that allows the counterfactual baseline to be computed
efficiently in a single forward pass. We evaluate COMA in the testbed of
StarCraft unit micromanagement, using a decentralised variant with significant
partial observability. COMA significantly improves average performance over
other multi-agent actor-critic methods in this setting, and the best performing
agents are competitive with state-of-the-art centralised controllers that get
access to the full state.
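
The counterfactual baseline described above can be sketched for one agent as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes the critic has already produced one Q-value per alternative action of the agent, with the other agents' actions held fixed, and the names `counterfactual_advantage`, `q_per_action`, and `policy_probs` are hypothetical.

```python
import numpy as np


def counterfactual_advantage(q_per_action, policy_probs, taken_action):
    """Advantage of one agent's action against a counterfactual baseline.

    q_per_action[k] is the (assumed) critic value when this agent takes
    action k and all other agents' actions stay fixed.
    policy_probs[k] is this agent's probability of choosing action k.
    """
    # Marginalise out this agent's action under its own policy.
    baseline = float(np.dot(policy_probs, q_per_action))
    return float(q_per_action[taken_action] - baseline)


# Hypothetical numbers for a single agent with three available actions.
q_per_action = np.array([1.0, 4.0, 2.0])
policy_probs = np.array([0.2, 0.5, 0.3])

advantage = counterfactual_advantage(q_per_action, policy_probs, taken_action=1)

# Averaged over the agent's own policy, the advantage is zero.
mean_advantage = sum(
    policy_probs[k] * counterfactual_advantage(q_per_action, policy_probs, k)
    for k in range(len(policy_probs))
)
```

Because the baseline does not depend on the action the agent actually took, subtracting it changes only the spread of the advantage values, not their policy-weighted mean.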