MESSI: Maximum Entropy Semi-Supervised Inverse Reinforcement Learning
A popular approach to apprenticeship learning (AL) is to formulate it as an inverse reinforcement learning (IRL) problem. The MaxEnt-IRL algorithm successfully integrates the maximum entropy principle into IRL and, unlike its predecessors, resolves the ambiguity arising from the fact that a possibly large number of policies could match the expert's behavior. In this paper, we study an AL setting in which, in addition to the expert's trajectories, a number of unsupervised trajectories are available. We introduce MESSI, a novel algorithm that combines MaxEnt-IRL with principles from semi-supervised learning. In particular, MESSI integrates the unsupervised data into the MaxEnt-IRL framework using a pairwise penalty on trajectories. Empirical results on highway driving and grid-world problems indicate that MESSI is able to take advantage of the unsupervised trajectories and improve the performance of MaxEnt-IRL.
Maximum entropy approaches for inverse reinforcement learning
We make decisions to maximize our perceived reward, but handcrafting a reward function for an autonomous agent is challenging. Inverse Reinforcement Learning (IRL), which is concerned with learning a reward function from expert demonstrations, has recently attracted significant interest, with the Maximum Entropy (MaxEnt) approach being a popular method. In this talk, we will explore and contrast a variety of MaxEnt IRL approaches. We show that in the presence of stochastic dynamics, a minimum KL-divergence condition provides a rigorous derivation of the MaxEnt model, improving over a prior heuristic derivation. Furthermore, we explore extensions of the MaxEnt IRL method to the case of unknown stochastic transition dynamics, including a generative model for trajectories, a discriminative model for action sequences, and a simple logistic regression model.
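As a hedged sketch of the distinction this abstract draws (the notation below is assumed, not taken from the talk itself): under deterministic dynamics the MaxEnt model weights a trajectory by its exponentiated reward, while the minimum-KL derivation for stochastic dynamics folds the transition probabilities in as a base distribution:

```latex
% Deterministic dynamics (original MaxEnt IRL):
p(\tau) \;\propto\; \exp\!\big(\theta^{\top} f_{\tau}\big)

% Stochastic dynamics (minimum-KL derivation):
p(\tau) \;\propto\; \Big(\prod_{t} P(s_{t+1} \mid s_t, a_t)\Big)\,
                    \exp\!\big(\theta^{\top} f_{\tau}\big)
```

Here $f_{\tau}$ denotes trajectory features and $\theta$ the reward parameters. The stochastic form arises as the solution of $\min_p \mathrm{KL}(p \,\|\, q)$ subject to matching the expert's expected features, where $q$ is the trajectory distribution induced by a uniform policy under the true dynamics; this replaces the earlier heuristic of simply multiplying the transition probabilities into the model.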
Revisiting Maximum Entropy Inverse Reinforcement Learning: New Perspectives and Algorithms
We provide new perspectives and inference algorithms for Maximum Entropy
(MaxEnt) Inverse Reinforcement Learning (IRL), which provides a principled
method to find a most non-committal reward function consistent with given
expert demonstrations, among many consistent reward functions.
We first present a generalized MaxEnt formulation based on minimizing a
KL-divergence instead of maximizing an entropy. This improves the previous
heuristic derivation of the MaxEnt IRL model (for stochastic MDPs), allows a
unified view of MaxEnt IRL and Relative Entropy IRL, and leads to a model-free
learning algorithm for the MaxEnt IRL model. Second, a careful review of
existing inference algorithms and implementations revealed that they only
approximately compute the marginals required for learning the model. We provide
examples to illustrate this, and present an efficient and exact inference
algorithm. Our algorithm can handle variable length demonstrations; in
addition, while a basic version takes time quadratic in the maximum
demonstration length L, an improved version of this algorithm reduces this to
linear using a padding trick.
Experiments show that our exact algorithm improves reward learning as
compared to the approximate ones. Furthermore, our algorithm scales up to a
large, real-world dataset involving driver behaviour forecasting. We provide an
optimized implementation compatible with the OpenAI Gym interface. Our new
insight and algorithms could possibly lead to further interest and exploration
of the original MaxEnt IRL model.
Published as a conference paper at the 2020 IEEE Symposium Series on Computational Intelligence (SSCI).
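To make the learning loop these abstracts describe concrete, here is a minimal, self-contained sketch of MaxEnt IRL on a toy chain MDP. Everything in it (the 5-state chain, one-hot state features, horizon, learning rate) is an illustrative assumption, not the paper's setup: the reward weights are updated by the standard MaxEnt gradient, i.e. expert minus model expected feature counts, with the model marginals computed exactly by soft value iteration and a forward pass.

```python
import numpy as np

# Toy 5-state chain MDP (hypothetical setup). Action 0 = left, action 1 = right.
n_states, n_actions, horizon = 5, 2, 10
P = np.zeros((n_states, n_actions, n_states))  # P[s, a, s'] transition probs
for s in range(n_states):
    P[s, 0, max(s - 1, 0)] = 1.0
    P[s, 1, min(s + 1, n_states - 1)] = 1.0

def soft_policy(r):
    """Finite-horizon soft value iteration under per-state reward r.

    Returns one stochastic policy per timestep, t = 0..horizon-1.
    """
    V = np.zeros(n_states)
    policies = []
    for _ in range(horizon):
        Q = r[:, None] + P @ V            # shape (S, A)
        V = np.log(np.exp(Q).sum(axis=1))  # soft (log-sum-exp) backup
        policies.append(np.exp(Q - V[:, None]))
    return policies[::-1]  # backward induction -> reverse to time order

def visitation(policies):
    """Expected state visitation counts under a time-indexed policy."""
    d = np.zeros(n_states)
    d[0] = 1.0  # fixed start state
    total = d.copy()
    for pi in policies:
        d = np.einsum('s,sa,sap->p', d, pi, P)  # push-forward one step
        total += d
    return total

# Hypothetical expert: always moves right, so its visitation mass
# concentrates on the high-index states.
expert_policy = [np.tile([0.0, 1.0], (n_states, 1)) for _ in range(horizon)]
expert_d = visitation(expert_policy)

# With one-hot state features, feature expectations ARE visitation counts,
# so the MaxEnt gradient is simply expert_d - model_d.
w = np.zeros(n_states)
for _ in range(200):
    model_d = visitation(soft_policy(w))
    w += 0.1 * (expert_d - model_d)  # gradient ascent on the MaxEnt likelihood
```

After training, the learned reward assigns its highest weight to the rightmost states, and the soft-optimal policy under it reproduces the expert's rightward drift. The exact-inference point made in the abstract corresponds to computing `model_d` by this full forward pass rather than by sampled or truncated approximations.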
BC-IRL: Learning Generalizable Reward Functions from Demonstrations
How well do reward functions learned with inverse reinforcement learning
(IRL) generalize? We illustrate that state-of-the-art IRL algorithms, which
maximize a maximum-entropy objective, learn rewards that overfit to the
demonstrations. Such rewards provide little meaningful signal for states
not covered by the demonstrations, a major detriment when using the reward to
learn policies in new situations. We introduce BC-IRL, a new inverse
reinforcement learning method that learns reward functions that generalize
better when compared to maximum-entropy IRL approaches. In contrast to the
MaxEnt framework, which learns to maximize rewards around demonstrations,
BC-IRL updates reward parameters such that the policy trained with the new
reward matches the expert demonstrations better. We show that BC-IRL learns
rewards that generalize better on an illustrative simple task and two
continuous robotic control tasks, achieving over twice the success rate of
baselines in challenging generalization settings.