An inherent problem in reinforcement learning is that exploring an
environment through random actions is inefficient, since a large portion of
those actions can be unproductive. Exploration can instead be improved by initializing the learning
policy with an existing (previously learned or hard-coded) oracle policy,
offline data, or demonstrations. When an oracle policy is used, however, it can
be unclear how best to incorporate the oracle policy's experience into the
learning policy in a way that maximizes sample efficiency. In this
paper, we propose a method termed Critic Confidence Guided Exploration (CCGE)
for incorporating such an oracle policy into standard actor-critic
reinforcement learning algorithms. More specifically, CCGE takes in the oracle
policy's actions as suggestions and incorporates this information into the
learning scheme when uncertainty is high, while ignoring it when
uncertainty is low. CCGE is agnostic to the method used to estimate uncertainty, and
we show that it is equally effective with two different techniques.
Empirically, we evaluate the effect of CCGE on various benchmark reinforcement
learning tasks, and show that this idea can lead to improved sample efficiency
and final performance. Furthermore, when evaluated on sparse reward
environments, CCGE is able to perform competitively against adjacent algorithms
that also leverage an oracle policy. Our experiments show that it is possible
to utilize uncertainty as a heuristic to guide exploration using an oracle in
reinforcement learning. We expect that this will inspire more research in this
direction, where various heuristics are used to determine the direction of
guidance provided to learning.

Comment: Under review at TML
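
The gating idea described above, in which the oracle's suggested action is used only when uncertainty is high, can be illustrated with a minimal sketch. This is not the CCGE algorithm itself. The names `critic_uncertainty`, `select_action`, and `threshold` are hypothetical, and the uncertainty estimate (disagreement among an ensemble of critics) and the fixed threshold are assumptions chosen for illustration, since the abstract states only that the approach is agnostic to how uncertainty is estimated.

```python
from statistics import pstdev
from typing import Callable, List, Sequence, Tuple

State = Sequence[float]
Action = Sequence[float]
Critic = Callable[[State, Action], float]  # maps (state, action) to a Q-value


def critic_uncertainty(critics: List[Critic], state: State, action: Action) -> float:
    """Assumed uncertainty proxy: population std of Q-values across an ensemble."""
    return pstdev([q(state, action) for q in critics])


def select_action(
    state: State,
    learner_action: Action,
    oracle_action: Action,
    critics: List[Critic],
    threshold: float,
) -> Tuple[Action, bool]:
    """Return (action, used_oracle).

    The oracle's suggestion is followed only when the critics disagree
    strongly about the learner's own action; otherwise it is ignored.
    """
    uncertainty = critic_uncertainty(critics, state, learner_action)
    if uncertainty > threshold:
        return oracle_action, True
    return learner_action, False


# Toy ensemble: three linear critics that differ only in scale (illustrative only).
toy_critics: List[Critic] = [
    (lambda s, a, w=w: w * sum(a)) for w in (0.9, 1.0, 1.1)
]

state = [0.0]
oracle_action = [1.0]

# Large action: the critics disagree noticeably, so the oracle is consulted.
chosen_hi, used_oracle_hi = select_action(state, [2.0], oracle_action, toy_critics, 0.1)

# Small action: the critics nearly agree, so the learner's action is kept.
chosen_lo, used_oracle_lo = select_action(state, [0.1], oracle_action, toy_critics, 0.1)
```

In this toy setup the first call returns the oracle's action and the second returns the learner's action. Any other uncertainty estimate could be substituted for `critic_uncertainty` without changing the gating logic, which mirrors the agnosticism claimed above.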