Control Regularization for Reduced Variance Reinforcement Learning
Dealing with high variance is a significant challenge in model-free
reinforcement learning (RL). Existing methods are unreliable, exhibiting high
variance in performance from run to run using different initializations/seeds.
Focusing on problems arising in continuous control, we propose a functional
regularization approach to augmenting model-free RL. In particular, we
regularize the behavior of the deep policy to be similar to a policy prior,
i.e., we regularize in function space. We show that functional regularization
yields a bias-variance trade-off, and propose an adaptive tuning strategy to
optimize this trade-off. When the policy prior has control-theoretic stability
guarantees, we further show that this regularization approximately preserves
those stability guarantees throughout learning. We validate our approach
empirically on a range of settings, and demonstrate significantly reduced
variance, guaranteed dynamic stability, and more efficient learning than deep
RL alone.
Comment: Appearing in ICML 2019
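To make the functional-regularization idea concrete, below is a minimal sketch of one way it can be instantiated: the executed action is a weighted combination of the deep policy's action and the prior's action, with a weight lam controlling the bias-variance trade-off the abstract describes. The linear policies, gain matrix, and state here are illustrative placeholders, not the paper's actual prior or architecture.

```python
import numpy as np

def prior_policy(state):
    # Hypothetical control-theoretic prior, e.g. a fixed stabilizing linear
    # gain (illustrative values, not taken from the paper).
    K = np.array([[1.0, 0.5]])
    return -K @ state

def deep_policy(state, theta):
    # Stand-in for a learned deep policy; a linear map keeps the sketch short.
    return theta @ state

def regularized_action(state, theta, lam):
    # Blend the learned action toward the prior in function space.
    # lam = 0 recovers pure model-free RL; larger lam pulls the behavior
    # toward the prior, trading variance for bias.
    u_rl = deep_policy(state, theta)
    u_prior = prior_policy(state)
    return (u_rl + lam * u_prior) / (1.0 + lam)

# Early in training, an untrained policy is dominated by the stable prior.
state = np.array([0.2, -0.1])
theta = np.zeros((1, 2))
print(regularized_action(state, theta, lam=5.0))  # close to the prior's action
```

Under the paper's adaptive tuning strategy, a weight like lam would be adjusted over the course of learning rather than held fixed as it is in this sketch.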
Quantile QT-Opt for Risk-Aware Vision-Based Robotic Grasping
The distributional perspective on reinforcement learning (RL) has given rise
to a series of successful Q-learning algorithms, resulting in state-of-the-art
performance in arcade game environments. However, it has not yet been analyzed
how these findings from a discrete setting translate to complex practical
applications characterized by noisy, high dimensional and continuous
state-action spaces. In this work, we propose Quantile QT-Opt (Q2-Opt), a
distributional variant of the recently introduced distributed Q-learning
algorithm for continuous domains, and examine its behaviour in a series of
simulated and real vision-based robotic grasping tasks. The absence of an actor
in Q2-Opt allows us to directly draw a parallel to the previous discrete
experiments in the literature without the additional complexities induced by an
actor-critic architecture. We demonstrate that Q2-Opt achieves a superior
vision-based object grasping success rate, while also being more sample
efficient. The distributional formulation also allows us to experiment with
various risk distortion metrics that give us an indication of how robots can
concretely manage risk in practice using a Deep RL control policy. As an
additional contribution, we perform batch RL experiments in our virtual
environment and compare them with the latest findings from discrete settings.
Surprisingly, we find that the previous batch RL findings from the literature
obtained on arcade game environments do not generalise to our setup.
Comment: Camera-ready version for RSS 2020. Contains 8 pages, 7 figures
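To illustrate the risk-distortion idea mentioned above, here is a minimal sketch of applying one standard distortion, CVaR, to a quantile-based value estimate. The quantile parameterization and the choice of CVaR are assumptions for illustration, not necessarily the exact metrics evaluated in the paper.

```python
import numpy as np

def cvar_distorted_value(quantiles, alpha=1.0):
    # Risk-distorted value of a quantile return distribution under CVaR_alpha.
    # The distortion g(tau) = min(tau / alpha, 1) reweights probability mass
    # onto the worst outcomes; alpha = 1 recovers the ordinary mean.
    q = np.sort(quantiles)                  # quantile values, non-decreasing
    n = len(q)
    edges = np.linspace(0.0, 1.0, n + 1)    # bin edges of the quantile levels
    g = np.minimum(edges / alpha, 1.0)      # distorted cumulative probabilities
    weights = np.diff(g)                    # mass assigned to each quantile
    return float(np.dot(weights, q))

# A return distribution with a rare catastrophic outcome in the left tail.
q = np.array([-5.0, 0.5, 0.8, 1.0, 1.1, 1.2, 1.3, 1.4])
print(cvar_distorted_value(q))              # risk-neutral: plain average
print(cvar_distorted_value(q, alpha=0.25))  # risk-averse: the -5 tail dominates
```

Acting greedily with respect to such a distorted value, rather than the mean, is what lets a grasping policy trade a little expected return for protection against rare bad outcomes.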