Trajectory-Based Off-Policy Deep Reinforcement Learning
Policy gradient methods are powerful reinforcement learning algorithms and
have been demonstrated to solve many complex tasks. However, these methods are
also data-inefficient, suffer from high-variance gradient estimates, and
frequently get stuck in local optima. This work addresses these weaknesses by
combining recent improvements in the reuse of off-policy data and exploration
in parameter space with deterministic behavioral policies. The resulting
objective is amenable to standard neural network optimization strategies like
stochastic gradient descent or stochastic gradient Hamiltonian Monte Carlo.
Incorporation of previous rollouts via importance sampling greatly improves
data-efficiency, whilst stochastic optimization schemes facilitate the escape
from local optima. We evaluate the proposed approach on a series of continuous
control benchmark tasks. The results show that the proposed algorithm is able
to successfully and reliably learn solutions using fewer system interactions
than standard policy gradient methods.
Comment: Includes appendix. Accepted for ICML 201
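To make the incorporation of previous rollouts concrete, below is a minimal PyTorch sketch of a policy gradient step with per-trajectory importance sampling over stored rollouts. The Gaussian policy, buffer layout, and hyperparameters are illustrative assumptions, not the paper's implementation (which additionally combines parameter-space exploration with deterministic behavioral policies and admits stochastic gradient Hamiltonian Monte Carlo as the optimizer).

```python
"""Sketch: off-policy policy gradient with per-trajectory importance
sampling, optimized by plain SGD. Everything here (the Gaussian policy,
the buffer layout, the toy data) is an illustrative assumption."""
import torch
from torch.distributions import Normal

class GaussianPolicy(torch.nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean = torch.nn.Linear(obs_dim, act_dim)
        self.log_std = torch.nn.Parameter(torch.zeros(act_dim))

    def log_prob(self, states, actions):
        # Per-step log-probabilities, summed over action dims;
        # result has shape (n_traj, horizon).
        dist = Normal(self.mean(states), self.log_std.exp())
        return dist.log_prob(actions).sum(-1)

def is_policy_gradient_step(policy, opt, batch):
    states, actions, behav_log_probs, returns = batch
    # Trajectory log-likelihood under the current policy: sum over time.
    cur_log_probs = policy.log_prob(states, actions).sum(-1)
    # Per-trajectory importance weight pi_theta(tau) / pi_behav(tau).
    # Differentiating w * R yields w * R * grad log pi_theta(tau),
    # the importance-sampled policy gradient, so the weights are
    # deliberately NOT detached from the graph.
    weights = torch.exp(cur_log_probs - behav_log_probs)
    loss = -(weights * returns).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage with random rollouts: 8 trajectories, horizon 50,
# 4-dim observations, 2-dim actions.
policy = GaussianPolicy(4, 2)
opt = torch.optim.SGD(policy.parameters(), lr=1e-3)
states, actions = torch.randn(8, 50, 4), torch.randn(8, 50, 2)
with torch.no_grad():  # pretend the current policy generated the data
    behav_log_probs = policy.log_prob(states, actions).sum(-1)
returns = torch.randn(8)
is_policy_gradient_step(policy, opt, (states, actions, behav_log_probs, returns))
```

In practice, raw trajectory-level weights are high-variance and are usually clipped or self-normalized; reusing rollouts this way is what buys the data-efficiency gains the abstract reports.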
Smoothing Policies and Safe Policy Gradients
Policy gradient algorithms are among the best candidates for the much
anticipated application of reinforcement learning to real-world control tasks,
such as the ones arising in robotics. However, the trial-and-error nature of
these methods introduces safety issues whenever the learning phase itself must
be performed on a physical system. In this paper, we address a specific safety
formulation, where danger is encoded in the reward signal and the learning
agent is constrained to never worsen its performance. By studying actor-only
policy gradient from a stochastic optimization perspective, we establish
improvement guarantees for a wide class of parametric policies, generalizing
existing results on Gaussian policies. This, together with novel upper bounds
on the variance of policy gradient estimators, allows us to identify the
meta-parameter schedules that guarantee monotonic improvement with high
probability. The two key meta-parameters are the step size of the parameter
updates and the batch size of the gradient estimators. By a joint, adaptive
selection of these meta-parameters, we obtain a safe policy gradient algorithm
- …
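To illustrate the flavor of such a joint, adaptive meta-parameter selection, here is a minimal sketch in which step size and batch size are derived from assumed smoothness and variance bounds. The constants and the Chebyshev-style batch-size rule are stand-ins for the paper's sharper bounds, not its actual schedule.

```python
"""Sketch of a jointly adaptive step-size / batch-size rule of the kind
the abstract describes. The smoothness constant, gradient-variance bound,
and confidence level are assumed inputs; the paper derives specific
guarantees that are not reproduced here."""
import math

def safe_meta_parameters(grad_norm, smoothness, var_bound, delta):
    """Return (step_size, batch_size) chosen so the update improves
    performance with probability at least 1 - delta.

    grad_norm:  norm of the current policy-gradient estimate.
    smoothness: Lipschitz constant of the policy gradient (assumed known).
    var_bound:  variance bound for a single-sample gradient estimate.
    delta:      allowed failure probability.
    """
    # For an L-smooth objective, a step size of at most 1/L guarantees
    # improvement on the exact gradient; halve it to leave room for
    # estimation error.
    step_size = 1.0 / (2.0 * smoothness)
    # By a Chebyshev-style argument, averaging n samples keeps the
    # gradient estimate's error below half its norm with probability
    # >= 1 - delta once n >= 4 * var / (delta * |g|^2).
    batch_size = math.ceil(4.0 * var_bound / (delta * grad_norm ** 2))
    return step_size, max(batch_size, 1)

# Example: |g| = 0.5, L = 10, variance bound 2.0, delta = 0.05
alpha, n = safe_meta_parameters(0.5, 10.0, 2.0, 0.05)  # alpha = 0.05, n = 640
```

The sketch captures the trade-off the abstract points to: a smaller gradient norm forces a larger batch (the estimate must be relatively more accurate), while the step size is governed by the smoothness of the objective.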