6 research outputs found
Lipschitzness Is All You Need To Tame Off-policy Generative Adversarial Imitation Learning
Despite the recent success of reinforcement learning in various domains,
these approaches remain, for the most part, deterringly sensitive to
hyper-parameters and are often riddled with essential engineering feats
allowing their success. We consider the case of off-policy generative
adversarial imitation learning, and perform an in-depth review, qualitative and
quantitative, of the method. We show that forcing the learned reward function
to be locally Lipschitz-continuous is a sine qua non condition for the method to
perform well. We then study the effects of this necessary condition and provide
several theoretical results involving the local Lipschitzness of the
state-value function. We complement these guarantees with empirical evidence
attesting to the strong positive effect that the consistent satisfaction of the
Lipschitzness constraint on the reward has on imitation performance. Finally,
we tackle a generic pessimistic reward preconditioning add-on spawning a large
class of reward shaping methods, which makes the base method it is plugged into
provably more robust, as shown in several additional theoretical guarantees. We
then discuss these through a fine-grained lens and share our insights.
Crucially, the guarantees derived and reported in this work are valid for any
reward satisfying the Lipschitzness condition; nothing is specific to
imitation. As such, these may be of independent interest.
Conditional Neural Relational Inference for Interacting Systems
In this work, we want to learn to model the dynamics of similar yet distinct
groups of interacting objects. These groups follow some common physical laws
that exhibit specificities that are captured through some vectorial
description. We develop a model that allows us to do conditional generation
from any such group given its vectorial description. Unlike previous work on
learning dynamical systems that can only do trajectory completion and require a
part of the trajectory dynamics to be provided as input at generation time, we
do generation using only the conditioning vector, with no access to trajectories
at generation time. We evaluate our model in the setting of modeling human
gait and, in particular, pathological human gait.
Sample-efficient imitation learning via generative adversarial nets
GAIL is a recent successful imitation learning architecture that exploits the adversarial training procedure introduced in GANs. Albeit successful at generating behaviours similar to those demonstrated to the agent, GAIL suffers from a high sample complexity in the number of interactions it has to carry out in the environment in order to achieve satisfactory performance. We dramatically shrink the number of interactions with the environment necessary to learn well-behaved imitation policies, by up to several orders of magnitude. Our framework, operating in the model-free regime, exhibits a significant increase in sample-efficiency over previous methods by simultaneously a) learning a self-tuned adversarially-trained surrogate reward and b) leveraging an off-policy actor-critic architecture. We show that our approach is simple to implement and that the learned agents remain remarkably stable, as shown in our experiments that span a variety of continuous control tasks. Video visualisations are available at: https://youtu.be/-nCsqUJnRKU
Conditional neural relational inference for interacting systems
In this work, we want to learn to model the dynamics of similar yet distinct groups of interacting objects. These groups follow some common physical laws that exhibit specificities that are captured through some vectorial description. We develop a model that allows us to do conditional generation from any such group given its vectorial description. Unlike previous work on learning dynamical systems that can only do trajectory completion and require a part of the trajectory dynamics to be provided as input at generation time, we do generation using only the conditioning vector, with no access to trajectories at generation time. We evaluate our model in the setting of modeling human gait and, in particular, pathological human gait.