Off-policy model-free deep reinforcement learning methods using previously
collected data can improve sample efficiency over on-policy policy gradient
techniques. On the other hand, on-policy algorithms are often more stable and
easier to use. This paper examines, both theoretically and empirically,
approaches to merging on- and off-policy updates for deep reinforcement
learning. Theoretical results show that off-policy updates with a value
function estimator can be interpolated with on-policy policy gradient updates
whilst still satisfying performance bounds. Our analysis uses control variate
methods to produce a family of policy gradient algorithms, with several
recently proposed algorithms being special cases of this family. We then
provide an empirical comparison of these techniques with the remaining
algorithmic details fixed, and show how different mixing of off-policy gradient
estimates with on-policy samples contribute to improvements in empirical
performance. The final algorithm provides a generalization and unification of
existing deep policy gradient techniques, has theoretical guarantees on the
bias introduced by off-policy updates, and improves on the state-of-the-art
model-free deep RL methods on a number of OpenAI Gym continuous control
benchmarks