A Pragmatic Look at Deep Imitation Learning
The introduction of the generative adversarial imitation learning (GAIL)
algorithm has spurred the development of scalable imitation learning approaches
using deep neural networks. Many of the algorithms that followed used a similar
procedure, combining on-policy actor-critic algorithms with inverse
reinforcement learning. More recently, a much broader range of approaches has
emerged, most of which use off-policy algorithms. However, across this breadth
of algorithms, everything from datasets to base reinforcement learning
algorithms to evaluation settings can vary, making it difficult to fairly
compare them. In this work we re-implement 6 different IL algorithms, updating
3 of them to be off-policy, base them on a common off-policy algorithm (SAC),
and evaluate them on a widely-used expert trajectory dataset (D4RL) for the
most common benchmark (MuJoCo). After giving all algorithms the same
hyperparameter optimisation budget, we compare their results for a range of
expert trajectories. In summary, GAIL, with all of its improvements,
consistently performs well across a range of sample sizes, AdRIL is a simple
contender that performs well with one important hyperparameter to tune, and
behavioural cloning remains a strong baseline when data is more plentiful.
Comment: Asian Conference on Machine Learning, 202
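The shared recipe referenced in this abstract (a discriminator-based reward plugged into an off-policy learner such as SAC) can be sketched briefly. The snippet below is a minimal, hypothetical illustration, not the authors' code: the network sizes, the -log(1 - D) reward form, and the fact that the resulting reward is simply handed to a SAC-style update are all assumptions.

```python
# Hypothetical sketch of the GAIL-style recipe: a discriminator separates
# expert from policy (state, action) pairs, and its output is turned into a
# surrogate reward for an off-policy agent such as SAC.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Discriminator(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))  # logits


def surrogate_reward(disc, obs, act):
    # Common GAIL-style reward: -log(1 - D(s, a)), computed from logits.
    with torch.no_grad():
        logits = disc(obs, act)
        return -F.logsigmoid(-logits)


def discriminator_update(disc, opt, expert_batch, policy_batch):
    # Binary cross-entropy: expert pairs labelled 1, policy pairs labelled 0.
    e_logits = disc(*expert_batch)
    p_logits = disc(*policy_batch)
    loss = F.binary_cross_entropy_with_logits(e_logits, torch.ones_like(e_logits)) \
         + F.binary_cross_entropy_with_logits(p_logits, torch.zeros_like(p_logits))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

In the off-policy setting studied here, the surrogate reward would relabel transitions stored in the replay buffer before each SAC update; that interleaving is an assumption about the generic recipe, not a claim about the paper's exact implementation.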
Contrastive Example-Based Control
While many real-world problems might benefit from reinforcement learning,
these problems rarely fit the MDP mold: interacting with the
environment is often expensive and specifying reward functions is challenging.
Motivated by these challenges, prior work has developed data-driven approaches
that learn entirely from samples from the transition dynamics and examples of
high-return states. These methods typically learn a reward function from
high-return states, use that reward function to label the transitions, and then
apply an offline RL algorithm to these transitions. While these methods can
achieve good results on many tasks, they can be complex, often requiring
regularization and temporal difference updates. In this paper, we propose a
method for offline, example-based control that learns an implicit model of
multi-step transitions, rather than a reward function. We show that this
implicit model can represent the Q-values for the example-based control
problem. Across a range of state-based and image-based offline control tasks,
our method outperforms baselines that use learned reward functions; additional
experiments demonstrate improved robustness and scaling with dataset size.
Comment: This is an updated version of a manuscript that originally appeared
at L4DC 2023. The project website is here:
https://sites.google.com/view/laeo-r
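One way to read the description above is as a contrastive classifier over multi-step transitions: positives pair a state-action with a state drawn from the discounted future of the same trajectory, negatives pair it with an unrelated state, and the classifier's score against the success examples plays the role of a Q-value. The sketch below is a hypothetical illustration under those assumptions; the architecture and sampling choices are not taken from the paper.

```python
# Hypothetical sketch: an implicit multi-step transition model learned
# contrastively, whose score against example high-return states acts as a
# Q-value for example-based control.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FutureClassifier(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.sa_enc = nn.Sequential(nn.Linear(obs_dim + act_dim, hidden),
                                    nn.ReLU(), nn.Linear(hidden, hidden))
        self.g_enc = nn.Sequential(nn.Linear(obs_dim, hidden),
                                   nn.ReLU(), nn.Linear(hidden, hidden))

    def forward(self, obs, act, future_obs):
        # Inner product of embeddings is the classifier logit.
        z = self.sa_enc(torch.cat([obs, act], dim=-1))
        w = self.g_enc(future_obs)
        return (z * w).sum(dim=-1)


def contrastive_loss(model, obs, act, future_pos, future_neg):
    # future_pos: states sampled from the discounted future of the same
    # trajectory; future_neg: random states from other trajectories.
    pos = model(obs, act, future_pos)
    neg = model(obs, act, future_neg)
    return F.binary_cross_entropy_with_logits(pos, torch.ones_like(pos)) \
         + F.binary_cross_entropy_with_logits(neg, torch.zeros_like(neg))


def implicit_q(model, obs, act, example_obs):
    # Score of (s, a) against success-example states (one example per batch
    # element, e.g. sampled with replacement) serves as an implicit Q-value.
    with torch.no_grad():
        return model(obs, act, example_obs)
```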
BC-IRL: Learning Generalizable Reward Functions from Demonstrations
How well do reward functions learned with inverse reinforcement learning
(IRL) generalize? We illustrate that state-of-the-art IRL algorithms, which
maximize a maximum-entropy objective, learn rewards that overfit to the
demonstrations. Such rewards struggle to provide meaningful rewards for states
not covered by the demonstrations, a major detriment when using the reward to
learn policies in new situations. We introduce BC-IRL, a new inverse
reinforcement learning method that learns reward functions that generalize
better when compared to maximum-entropy IRL approaches. In contrast to the
MaxEnt framework, which learns to maximize rewards around demonstrations,
BC-IRL updates reward parameters such that the policy trained with the new
reward matches the expert demonstrations better. We show that BC-IRL learns
rewards that generalize better on an illustrative simple task and two
continuous robotic control tasks, achieving over twice the success rate of
baselines in challenging generalization settings.
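The update described above (train a policy with the current reward, then adjust the reward so the resulting policy matches the expert better) is a bi-level optimisation. Below is a minimal, hypothetical sketch of that idea, not the authors' implementation: a deterministic linear policy and a single differentiable inner gradient step are simplifying assumptions chosen so the example stays self-contained.

```python
# Hypothetical bi-level sketch: one differentiable policy-improvement step
# under the learned reward, followed by a reward update that minimises the
# behaviour-cloning loss of the *updated* policy on expert demonstrations.
import torch
import torch.nn as nn

obs_dim, act_dim, inner_lr = 4, 2, 0.1
reward_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64),
                           nn.ReLU(), nn.Linear(64, 1))
reward_opt = torch.optim.Adam(reward_net.parameters(), lr=1e-3)

# Deterministic linear policy a = s @ theta, kept as a raw tensor so the
# inner update can be written functionally and differentiated through.
theta = torch.zeros(obs_dim, act_dim, requires_grad=True)


def bcirl_step(policy_states, expert_states, expert_actions):
    # Inner step: gradient ascent on the policy under the current reward,
    # keeping the graph so the step depends on the reward parameters.
    actions = policy_states @ theta
    inner_obj = reward_net(torch.cat([policy_states, actions], dim=-1)).mean()
    grad_theta = torch.autograd.grad(inner_obj, theta, create_graph=True)[0]
    theta_updated = theta + inner_lr * grad_theta

    # Outer step: BC loss of the updated policy, backpropagated through the
    # inner step into the reward parameters.
    bc_loss = ((expert_states @ theta_updated - expert_actions) ** 2).mean()
    reward_opt.zero_grad()
    bc_loss.backward()
    reward_opt.step()

    # Commit the (detached) policy improvement so the policy also progresses.
    with torch.no_grad():
        theta.add_(inner_lr * grad_theta.detach())
    return bc_loss.item()
```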
A Simple Solution for Offline Imitation from Observations and Examples with Possibly Incomplete Trajectories
Offline imitation from observations aims to solve MDPs where only
task-specific expert states and task-agnostic non-expert state-action pairs are
available. Offline imitation is useful in real-world scenarios where arbitrary
interactions are costly and expert actions are unavailable. The
state-of-the-art "DIstribution Correction Estimation" (DICE) methods minimize
divergence of state occupancy between expert and learner policies and retrieve
a policy with weighted behavior cloning; however, their results are unstable
when learning from incomplete trajectories, due to a non-robust optimization in
the dual domain. To address the issue, in this paper, we propose
Trajectory-Aware Imitation Learning from Observations (TAILO). TAILO uses a
discounted sum along the future trajectory as the weight for weighted behavior
cloning. The terms for the sum are scaled by the output of a discriminator,
which aims to identify expert states. Despite its simplicity, TAILO works well if
there exist trajectories or segments of expert behavior in the task-agnostic
data, a common assumption in prior work. In experiments across multiple
testbeds, we find TAILO to be more robust and effective, particularly with
incomplete trajectories.
Comment: 35 pages; Accepted as a poster for NeurIPS202
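The weighting scheme described in this abstract can be sketched compactly: each step in the task-agnostic data is weighted by a discounted sum, over its own future within the trajectory, of per-state discriminator scores. The snippet below is a hypothetical illustration; the exact form of the discriminator scores and the discount value are assumptions.

```python
# Hypothetical sketch of a trajectory-aware weight for weighted behaviour
# cloning: a discounted suffix sum of discriminator scores along the future
# of each step in one trajectory.
import numpy as np


def trajectory_weights(disc_scores, gamma=0.98):
    """disc_scores: per-step discriminator outputs for one trajectory
    (higher = more expert-like). Returns one BC weight per step."""
    T = len(disc_scores)
    weights = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):          # discounted suffix sum
        running = disc_scores[t] + gamma * running
        weights[t] = running
    return weights

# Weighted behaviour cloning then maximises
#   sum_t weights[t] * log pi(a_t | s_t)
# over the task-agnostic state-action data.
```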
Lipschitzness Is All You Need To Tame Off-policy Generative Adversarial Imitation Learning
Despite the recent success of reinforcement learning in various domains,
these approaches remain, for the most part, deterringly sensitive to
hyper-parameters and often depend on essential engineering feats to
succeed. We consider the case of off-policy generative
adversarial imitation learning, and perform an in-depth review, qualitative and
quantitative, of the method. We show that forcing the learned reward function
to be local Lipschitz-continuous is a sine qua non condition for the method to
perform well. We then study the effects of this necessary condition and provide
several theoretical results involving the local Lipschitzness of the
state-value function. We complement these guarantees with empirical evidence
attesting to the strong positive effect that the consistent satisfaction of the
Lipschitzness constraint on the reward has on imitation performance. Finally,
we tackle a generic pessimistic reward preconditioning add-on that spawns a large
class of reward-shaping methods and makes the base method it is plugged into
provably more robust, as shown in several additional theoretical guarantees. We
then discuss these through a fine-grained lens and share our insights.
Crucially, the guarantees derived and reported in this work are valid for any
reward satisfying the Lipschitzness condition; nothing is specific to
imitation. As such, they may be of independent interest.
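A standard way to impose the kind of local Lipschitzness condition discussed above is a gradient penalty on the learned reward. The sketch below is a hypothetical illustration of that generic regulariser, not the paper's exact formulation: the penalty target, where the penalty is evaluated, and the coefficient are all assumptions.

```python
# Hypothetical sketch: a gradient penalty that pushes the norm of the learned
# reward's input gradient towards a target value at the training samples,
# encouraging local Lipschitz continuity of the reward.
import torch


def gradient_penalty(reward_net, obs, act, target_norm=1.0):
    obs = obs.clone().requires_grad_(True)
    act = act.clone().requires_grad_(True)
    r = reward_net(torch.cat([obs, act], dim=-1))
    grads = torch.autograd.grad(r.sum(), (obs, act), create_graph=True)
    grad_norm = torch.cat([g.flatten(1) for g in grads], dim=1).norm(dim=1)
    return ((grad_norm - target_norm) ** 2).mean()

# Typical usage: total_loss = discriminator_loss
#                           + lam * gradient_penalty(reward_net, obs, act)
```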