Versatile Inverse Reinforcement Learning via Cumulative Rewards
Inverse Reinforcement Learning infers a reward function from expert demonstrations, aiming to encode the behavior and intentions of the expert. Current approaches usually do this with generative, uni-modal models, meaning that they encode a single behavior. In the common setting where there are various solutions to a problem and the experts show versatile behavior, this severely limits the generalization capabilities of these methods. We propose a novel method for Inverse Reinforcement Learning that overcomes these problems by formulating the recovered reward as a sum of iteratively trained discriminators. We show on simulated tasks that our approach is able to recover general, high-quality reward functions and produces policies of the same quality as behavioral cloning approaches designed for versatile behavior.
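To make the idea concrete, below is a minimal, hypothetical Python sketch of a reward assembled as a sum of iteratively trained discriminators. The LinearDiscriminator class, its logistic-regression-style fit, and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

class LinearDiscriminator:
    """Toy discriminator: a linear scorer over state-action features (illustrative only)."""
    def __init__(self, dim, rng):
        self.w = rng.normal(scale=0.1, size=dim)

    def fit(self, expert_feats, agent_feats, lr=0.1, steps=200):
        """Logistic-regression-style updates separating expert samples from agent samples."""
        for _ in range(steps):
            for x, y in [(expert_feats, 1.0), (agent_feats, 0.0)]:
                p = 1.0 / (1.0 + np.exp(-x @ self.w))
                self.w += lr * x.T @ (y - p) / len(x)

    def score(self, feats):
        """Log-odds that the sample came from the expert."""
        return feats @ self.w


def cumulative_reward(discriminators, feats):
    """Recovered reward: sum of the scores of all iteratively trained discriminators."""
    return sum(d.score(feats) for d in discriminators)


# Usage sketch: train a new discriminator against each successive agent policy,
# then append it, so the reward accumulates one term per iteration.
rng = np.random.default_rng(0)
expert = rng.normal(loc=1.0, size=(256, 4))
discriminators = []
for it in range(3):
    agent = rng.normal(loc=0.0, size=(256, 4))  # stand-in for rollouts of the current policy
    d = LinearDiscriminator(dim=4, rng=rng)
    d.fit(expert, agent)
    discriminators.append(d)
print(cumulative_reward(discriminators, expert[:5]))
```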
Online Observer-Based Inverse Reinforcement Learning
In this paper, a novel approach to the output-feedback inverse reinforcement learning (IRL) problem is developed by casting the IRL problem, for linear systems with quadratic cost functions, as a state estimation problem. Two observer-based techniques for IRL are developed, including a novel observer method that re-uses previous state estimates via history stacks. Theoretical guarantees for convergence and robustness are established under appropriate excitation conditions. Simulations demonstrate the performance of the developed observers and filters under noisy and noise-free measurements.
Comment: 7 pages, 3 figures
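As a rough illustration of the history-stack idea (re-using stored data rather than only the newest sample), the Python sketch below implements a generic concurrent-learning-style parameter estimator on synthetic data. It stands in for the mechanism only; the paper's observers operate on output-feedback measurements of linear-quadratic problems, and the class name, gains, and regression model here are assumptions.

```python
import numpy as np

class HistoryStackEstimator:
    """Hypothetical estimator that updates a parameter estimate using a stack of past data."""
    def __init__(self, dim, stack_size=20, gain=0.05):
        self.theta_hat = np.zeros(dim)   # current parameter estimate
        self.stack = []                  # stored (regressor, measurement) pairs
        self.stack_size = stack_size
        self.gain = gain

    def record(self, phi, y):
        """Store a data point; oldest entries are discarded once the stack is full."""
        self.stack.append((phi, y))
        if len(self.stack) > self.stack_size:
            self.stack.pop(0)

    def update(self):
        """Gradient step that re-uses all stacked data, not just the newest sample."""
        grad = np.zeros_like(self.theta_hat)
        for phi, y in self.stack:
            grad += phi * (y - phi @ self.theta_hat)
        self.theta_hat += self.gain * grad / max(len(self.stack), 1)
        return self.theta_hat


# Usage sketch on synthetic data: y = phi^T theta_star with theta_star unknown.
rng = np.random.default_rng(1)
theta_star = np.array([2.0, -1.0, 0.5])
est = HistoryStackEstimator(dim=3)
for t in range(500):
    phi = rng.normal(size=3)             # regressors must be exciting enough for convergence
    est.record(phi, phi @ theta_star)
    est.update()
print(est.theta_hat)                      # approaches theta_star under sufficient excitation
```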
Primal Wasserstein Imitation Learning
Imitation Learning (IL) methods seek to match the behavior of an agent with that of an expert. In the present work, we propose a new IL method based on a conceptually simple algorithm: Primal Wasserstein Imitation Learning (PWIL), which ties to the primal form of the Wasserstein distance between the expert and the agent state-action distributions. We present a reward function which is derived offline, as opposed to recent adversarial IL algorithms that learn a reward function through interactions with the environment, and which requires little fine-tuning. We show that we can recover expert behavior on a variety of continuous control tasks of the MuJoCo domain in a sample-efficient manner, in terms of both agent interactions and expert interactions with the environment. Finally, we show that the behavior of the agent we train matches the behavior of the expert with the Wasserstein distance, rather than the commonly used proxy of performance.
Comment: Published in International Conference on Learning Representations (ICLR 2021)
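A simplified, hypothetical Python sketch of the offline reward idea: greedily couple agent state-action samples to expert samples and turn the resulting transport cost into a bounded reward. The equal-weight greedy matching, the exponential shaping, and the function name are illustrative choices, not a faithful re-implementation of PWIL.

```python
import numpy as np

def greedy_coupling_rewards(agent_sa, expert_sa, alpha=5.0, beta=5.0):
    """Assign each agent sample to its nearest unused expert sample (greedy transport)
    and convert the matching cost into a bounded reward (illustrative shaping)."""
    remaining = list(range(len(expert_sa)))
    rewards = []
    for x in agent_sa:
        dists = np.linalg.norm(expert_sa[remaining] - x, axis=1)
        j = int(np.argmin(dists))
        cost = dists[j]
        remaining.pop(j)                  # each expert atom is consumed at most once
        rewards.append(alpha * np.exp(-beta * cost))
        if not remaining:
            break
    return np.array(rewards)


# Usage sketch: rewards are computed purely from stored expert data (no discriminator
# is trained through environment interaction) and can be handed to any RL algorithm.
rng = np.random.default_rng(2)
expert_sa = rng.normal(size=(100, 6))     # concatenated (state, action) samples
agent_sa = expert_sa + 0.1 * rng.normal(size=(100, 6))
print(greedy_coupling_rewards(agent_sa, expert_sa)[:5])
```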