Terminal Prediction as an Auxiliary Task for Deep Reinforcement Learning
Deep reinforcement learning has achieved great successes in recent years, but
there are still open challenges, such as convergence to locally optimal
policies and sample inefficiency. In this paper, we contribute a novel
self-supervised auxiliary task, i.e., Terminal Prediction (TP), estimating
temporal closeness to terminal states for episodic tasks. The intuition is to
help representation learning by letting the agent predict how close it is to a
terminal state, while learning its control policy. Although TP could be
integrated with multiple algorithms, this paper focuses on Asynchronous
Advantage Actor-Critic (A3C) and demonstrates the advantages of A3C-TP. Our
extensive evaluation includes: a set of Atari games, the BipedalWalker domain,
and a mini version of the recently proposed multi-agent Pommerman game. Our
results on Atari games and the BipedalWalker domain suggest that A3C-TP
outperforms standard A3C in most of the tested domains and in others it has
similar performance. In Pommerman, our proposed method provides significant
improvement both in learning efficiency and in converging to better policies
against different opponents.
Comment: AAAI Conference on Artificial Intelligence and Interactive Digital
Entertainment (AIIDE'19). arXiv admin note: text overlap with
arXiv:1812.0004
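As a rough illustration of the Terminal Prediction idea described above, the sketch below builds per-step regression targets for one finished episode and scores a set of predictions against them. The abstract does not state the exact target definition, so the linear target t / (T - 1), the mean-squared-error loss, and the function names `terminal_prediction_targets` and `terminal_prediction_loss` are all assumptions made for this example, not the authors' implementation.

```python
import numpy as np


def terminal_prediction_targets(episode_length: int) -> np.ndarray:
    """Closeness-to-terminal targets for one episode (assumed linear form).

    Step t of an episode with T steps gets the target t / (T - 1), so the
    first step maps to 0.0 and the terminal step maps to 1.0.
    """
    if episode_length < 1:
        raise ValueError("episode_length must be at least 1")
    if episode_length == 1:
        # A one-step episode starts in its terminal step.
        return np.ones(1)
    steps = np.arange(episode_length, dtype=float)
    return steps / (episode_length - 1)


def terminal_prediction_loss(predictions, episode_length: int) -> float:
    """Mean squared error between predictions and the assumed TP targets."""
    targets = terminal_prediction_targets(episode_length)
    predictions = np.asarray(predictions, dtype=float)
    if predictions.shape != targets.shape:
        raise ValueError("need exactly one prediction per episode step")
    return float(np.mean((predictions - targets) ** 2))
```

In an actor-critic setting, a loss of this kind would presumably be added to the policy and value losses with some weighting coefficient; that coefficient is not given in the abstract and is left out here.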
Adversarial Soft Advantage Fitting: Imitation Learning without Policy Optimization
Adversarial Imitation Learning alternates between learning a discriminator --
which tells expert demonstrations apart from generated ones -- and a
generator's policy to produce trajectories that can fool this discriminator.
This alternating optimization is known to be delicate in practice since it
compounds unstable adversarial training with brittle and sample-inefficient
reinforcement learning. We propose to remove the burden of the policy
optimization steps by leveraging a novel discriminator formulation.
Specifically, our discriminator is explicitly conditioned on two policies: the
one from the previous generator's iteration and a learnable policy. When
optimized, this discriminator directly learns the optimal generator's policy.
Consequently, our discriminator's update solves the generator's optimization
problem for free: learning a policy that imitates the expert does not require
an additional optimization loop. This formulation effectively halves the
implementation and computational burden of Adversarial Imitation Learning
algorithms by removing the Reinforcement Learning phase altogether. We show on
a variety of tasks that our simpler approach is competitive with prevalent
Imitation Learning methods.
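To make the idea of a discriminator conditioned on two policies more concrete, here is a minimal sketch. The abstract does not spell out the functional form, so the ratio D = pi(a|s) / (pi(a|s) + pi_G(a|s)), the binary cross-entropy objective, and the names `structured_discriminator` and `discriminator_loss` are assumptions for illustration only. Here pi is the learnable policy and pi_G is the previous generator's policy, both supplied as log-probabilities of the same state-action pairs.

```python
import numpy as np


def structured_discriminator(logp_learnable, logp_generator) -> np.ndarray:
    """Assumed ratio form D = pi / (pi + pi_G).

    Algebraically this equals sigmoid(log pi - log pi_G).
    """
    diff = np.asarray(logp_learnable, dtype=float) - np.asarray(
        logp_generator, dtype=float
    )
    return 1.0 / (1.0 + np.exp(-diff))


def discriminator_loss(
    logp_learnable_expert,
    logp_generator_expert,
    logp_learnable_generated,
    logp_generator_generated,
) -> float:
    """Binary cross-entropy: expert samples labeled 1, generated samples 0.

    Uses logaddexp for stable log-sigmoid terms:
    -log D = log(1 + exp(-diff)) and -log(1 - D) = log(1 + exp(diff)).
    """
    diff_expert = np.asarray(logp_learnable_expert, dtype=float) - np.asarray(
        logp_generator_expert, dtype=float
    )
    diff_generated = np.asarray(
        logp_learnable_generated, dtype=float
    ) - np.asarray(logp_generator_generated, dtype=float)
    expert_term = np.mean(np.logaddexp(0.0, -diff_expert))
    generated_term = np.mean(np.logaddexp(0.0, diff_generated))
    return float(expert_term + generated_term)
```

Under this assumed form, only the learnable policy's parameters would be updated when minimizing the loss, and the learned policy would then serve as the next generator, which is one way to read the abstract's claim that no separate policy optimization loop is needed.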