Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning
Off-policy model-free deep reinforcement learning methods using previously
collected data can improve sample efficiency over on-policy policy gradient
techniques. On the other hand, on-policy algorithms are often more stable and
easier to use. This paper examines, both theoretically and empirically,
approaches to merging on- and off-policy updates for deep reinforcement
learning. Theoretical results show that off-policy updates with a value
function estimator can be interpolated with on-policy policy gradient updates
whilst still satisfying performance bounds. Our analysis uses control variate
methods to produce a family of policy gradient algorithms, with several
recently proposed algorithms being special cases of this family. We then
provide an empirical comparison of these techniques with the remaining
algorithmic details fixed, and show how different ways of mixing off-policy gradient
estimates with on-policy samples contribute to improvements in empirical
performance. The final algorithm provides a generalization and unification of
existing deep policy gradient techniques, has theoretical guarantees on the
bias introduced by off-policy updates, and improves on the state-of-the-art
model-free deep RL methods on a number of OpenAI Gym continuous control
benchmarks.
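To make the interpolation idea concrete, the sketch below blends an unbiased, high-variance on-policy gradient estimate with a biased, lower-variance off-policy (critic-based) estimate through a fixed mixing coefficient. The function name, the coefficient `nu`, and the fixed schedule are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def interpolated_gradient(on_policy_grad, off_policy_grad, nu=0.2):
    """Convex combination of on- and off-policy gradient estimates.

    A minimal sketch of the interpolation idea: `nu` trades off the
    unbiased, high-variance on-policy estimate against the biased,
    lower-variance off-policy (critic-based) estimate. The name and
    the fixed coefficient are illustrative assumptions.
    """
    return (1.0 - nu) * np.asarray(on_policy_grad) + nu * np.asarray(off_policy_grad)

# Toy usage: two noisy estimates of the same underlying gradient.
rng = np.random.default_rng(0)
true_grad = np.array([1.0, -2.0, 0.5])
g_on = true_grad + rng.normal(scale=1.0, size=3)   # unbiased, high variance
g_off = 0.9 * true_grad                            # biased, low variance
print(interpolated_gradient(g_on, g_off, nu=0.2))
```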
Off-Policy Actor-Critic with Emphatic Weightings
A variety of theoretically-sound policy gradient algorithms exist for the
on-policy setting due to the policy gradient theorem, which provides a
simplified form for the gradient. The off-policy setting, however, has been
less clear due to the existence of multiple objectives and the lack of an
explicit off-policy policy gradient theorem. In this work, we unify these
objectives into one off-policy objective, and provide a policy gradient theorem
for this unified objective. The derivation involves emphatic weightings and
interest functions. We show multiple strategies to approximate the gradients,
in an algorithm called Actor Critic with Emphatic weightings (ACE). Using a
counterexample, we prove that previous (semi-gradient) off-policy actor-critic
methods, particularly Off-Policy Actor-Critic (OffPAC) and Deterministic Policy
Gradient (DPG), converge to the wrong solution, whereas ACE finds the optimal
solution. We also highlight why these semi-gradient approaches can still
perform well in practice, suggesting strategies for variance reduction in ACE.
We empirically study several variants of ACE on two classic control
environments and an image-based environment designed to illustrate the
tradeoffs made by each gradient approximation. We find that by approximating
the emphatic weightings directly, ACE performs as well as or better than OffPAC
in all settings tested.
Comment: 63 pages
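The sketch below illustrates the followon-trace recursion that underlies emphatic weightings, F_t = i(S_t) + gamma * rho_{t-1} * F_{t-1}. It is a simplified, assumed form of the weighting used to reweight the actor update in ACE; variable names are illustrative.

```python
import numpy as np

def followon_weightings(interests, rhos, gamma=0.99):
    """Followon trace underlying emphatic weightings.

    A minimal sketch of F_t = i(S_t) + gamma * rho_{t-1} * F_{t-1},
    where i(S_t) is the interest in state S_t and rho_t is the
    importance-sampling ratio pi(A_t|S_t) / b(A_t|S_t). Names are
    illustrative; the full ACE weighting builds on this recursion.
    """
    interests = np.asarray(interests, dtype=float)
    rhos = np.asarray(rhos, dtype=float)
    F = np.zeros_like(interests)
    F[0] = interests[0]
    for t in range(1, len(interests)):
        F[t] = interests[t] + gamma * rhos[t - 1] * F[t - 1]
    return F

# Toy usage: uniform interest, mildly off-policy importance ratios.
print(followon_weightings(np.ones(5), [1.2, 0.8, 1.0, 0.9, 1.1]))
```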
Trajectory-Based Off-Policy Deep Reinforcement Learning
Policy gradient methods are powerful reinforcement learning algorithms and
have been demonstrated to solve many complex tasks. However, these methods are
also data-inefficient, afflicted by high-variance gradient estimates, and
prone to getting stuck in local optima. This work addresses these weaknesses by
combining recent improvements in the reuse of off-policy data and exploration
in parameter space with deterministic behavioral policies. The resulting
objective is amenable to standard neural network optimization strategies like
stochastic gradient descent or stochastic gradient Hamiltonian Monte Carlo.
Incorporation of previous rollouts via importance sampling greatly improves
data-efficiency, whilst stochastic optimization schemes facilitate the escape
from local optima. We evaluate the proposed approach on a series of continuous
control benchmark tasks. The results show that the proposed algorithm is able
to successfully and reliably learn solutions using fewer system interactions
than standard policy gradient methods.
Comment: Includes appendix. Accepted for ICML 2019
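The sketch below shows the kind of per-trajectory importance weighting referred to above: whole rollouts gathered under an earlier behavioural policy are reweighted by the product of per-step likelihood ratios to form a self-normalised return estimate. The container format and function names are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def trajectory_importance_weight(log_pi_new, log_pi_old):
    """Per-trajectory importance weight for reusing old rollouts.

    A minimal sketch: the weight is the product of per-step
    likelihood ratios pi_new/pi_old, computed in log space for
    numerical stability. Names are illustrative.
    """
    return np.exp(np.sum(np.asarray(log_pi_new) - np.asarray(log_pi_old)))

def off_policy_return_estimate(trajectories):
    """Self-normalised importance-weighted estimate of expected return.

    `trajectories` is assumed to be a list of dicts with keys
    'log_pi_new', 'log_pi_old', and 'ret' (the observed return);
    this container format is an assumption for the example.
    """
    weights = np.array([trajectory_importance_weight(t['log_pi_new'],
                                                     t['log_pi_old'])
                        for t in trajectories])
    returns = np.array([t['ret'] for t in trajectories])
    return np.sum(weights * returns) / np.sum(weights)

# Toy usage with two fabricated example trajectories.
trajs = [
    {'log_pi_new': [-0.1, -0.2], 'log_pi_old': [-0.15, -0.25], 'ret': 1.0},
    {'log_pi_new': [-0.3, -0.1], 'log_pi_old': [-0.2, -0.1], 'ret': 0.5},
]
print(off_policy_return_estimate(trajs))
```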
Mitigating Off-Policy Bias in Actor-Critic Methods with One-Step Q-learning: A Novel Correction Approach
Compared to on-policy counterparts, off-policy model-free deep reinforcement
learning can improve data efficiency by reusing previously
gathered data. However, off-policy learning becomes challenging when the
discrepancy between the underlying distributions of the agent's policy and
collected data increases. Although well-studied importance sampling and
off-policy policy gradient techniques have been proposed to compensate for this
discrepancy, they usually require collecting long trajectories and introduce
additional problems, such as vanishing/exploding gradients or the discarding of
many useful experiences, which eventually increases the computational complexity.
Moreover, their generalization to either continuous action domains or policies
approximated by deterministic deep neural networks is strictly limited. To
overcome these limitations, we introduce a novel policy similarity measure to
mitigate the effects of such a discrepancy in continuous control. Our method
offers an adequate single-step off-policy correction that is applicable to
deterministic policy networks. Theoretical and empirical studies demonstrate
that it can achieve "safe" off-policy learning and substantially improve on the
state-of-the-art by attaining higher returns in fewer steps than competing
methods, through an effective schedule of the learning rate in Q-learning and
policy optimization.
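As a rough illustration of a single-step correction for deterministic policies, the sketch below scores a replayed transition by the similarity between the stored behavioural action and the action the current policy would take in the same state. The Gaussian-kernel form and the `sigma` parameter are assumptions, not the similarity measure proposed in the paper.

```python
import numpy as np

def policy_similarity_weight(pi_action, buffer_action, sigma=0.5):
    """Single-step off-policy correction weight for deterministic policies.

    A hedged sketch only: a replayed transition is weighted by how
    close the stored behavioural action is to the action the current
    deterministic policy would take, via a Gaussian kernel over the
    action difference. The kernel and `sigma` are illustrative
    assumptions, not the paper's proposed measure.
    """
    diff = np.asarray(pi_action) - np.asarray(buffer_action)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))

# Toy usage: down-weight a transition whose stored action is far
# from what the current policy would do.
print(policy_similarity_weight([0.1, -0.2], [0.12, -0.18]))  # close -> near 1
print(policy_similarity_weight([0.1, -0.2], [0.9, 0.7]))     # far   -> near 0
```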