Learning Self-Correctable Policies and Value Functions from Demonstrations with Negative Sampling
Imitation learning followed by reinforcement learning is a
promising paradigm for solving complex control tasks sample-efficiently. However,
learning from demonstrations often suffers from the covariate shift problem,
which results in cascading errors of the learned policy. We introduce a notion
of conservatively-extrapolated value functions, which provably lead to policies
with self-correction. We design an algorithm, Value Iteration with Negative
Sampling (VINS), that learns such value functions with conservative
extrapolation. We show that VINS can correct mistakes of the behavioral cloning
policy on simulated robotics benchmark tasks. We also propose using VINS to
initialize a reinforcement learning algorithm, which is shown to significantly
outperform prior work in sample efficiency.
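The core idea of negative sampling here can be illustrated with a minimal sketch: perturb demonstration states off the data manifold and assign them values lowered by a margin proportional to the perturbation size, so that the fitted value function decreases away from the demonstrations and a greedy policy is pushed back toward them. This toy 1-D example and its feature choice are illustrative assumptions, not the paper's actual architecture or training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D demonstration states on [0, 1], goal at s = 0.
# Value is higher closer to the goal (here, V(s) = -s on the demonstrations).
demo_states = np.linspace(0.0, 1.0, 20).reshape(-1, 1)
demo_values = -demo_states.ravel()

# Negative sampling (illustrative): perturb demonstration states away from the
# data manifold and assign each perturbed state a conservatively lower value,
# i.e. the original value minus a margin equal to the perturbation size.
noise = rng.normal(scale=0.2, size=demo_states.shape)
neg_states = demo_states + noise
neg_values = demo_values - np.linalg.norm(noise, axis=1)

def features(s):
    # Simple quadratic features [s, s^2, 1] for a linear value fit (assumption).
    return np.hstack([s, s**2, np.ones((len(s), 1))])

# Fit a ridge-regression value function on demonstrations plus negative samples.
X = np.vstack([features(demo_states), features(neg_states)])
y = np.concatenate([demo_values, neg_values])
w = np.linalg.solve(X.T @ X + 1e-3 * np.eye(X.shape[1]), X.T @ y)

def value(s):
    return features(np.atleast_2d(s)) @ w

# States nearer the goal keep higher values, while perturbed off-manifold
# states were trained to look worse, which is what drives self-correction.
print(float(value(np.array([[0.0]]))), float(value(np.array([[1.0]]))))
```

Because the negative samples only ever lower values, the fit extrapolates conservatively: deviating from the demonstrated states can only look worse to the value function, so correcting back toward the demonstrations is always preferred.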