6 research outputs found
Learning Reward Functions by Integrating Human Demonstrations and Preferences
Our goal is to accurately and efficiently learn reward functions for
autonomous robots. Current approaches to this problem include inverse
reinforcement learning (IRL), which uses expert demonstrations, and
preference-based learning, which iteratively queries the user for her
preferences between trajectories. In robotics, however, IRL often struggles
because it is difficult to get high-quality demonstrations; conversely,
preference-based learning is very inefficient since it attempts to learn a
continuous, high-dimensional function from binary feedback. We propose a new
framework for reward learning, DemPref, that uses both demonstrations and
preference queries to learn a reward function. Specifically, we (1) use the
demonstrations to learn a coarse prior over the space of reward functions, to
reduce the effective size of the space from which queries are generated; and
(2) use the demonstrations to ground the (active) query generation process, to
improve the quality of the generated queries. Our method alleviates the
efficiency issues faced by standard preference-based learning methods and does
not exclusively depend on (possibly low-quality) demonstrations. In numerical
experiments, we find that DemPref is significantly more efficient than a
standard active preference-based learning method. In a user study, we compare
our method to a standard IRL method; we find that users rated the robot trained
with DemPref as being more successful at learning their desired behavior, and
preferred to use the DemPref system (over IRL) to train the robot.
Comment: Presented at RSS 2019
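To make the two-stage idea concrete, here is a minimal sketch of a DemPref-style belief update, assuming linear rewards w·φ(ξ) and a Boltzmann-rational human; the sample-based belief representation, constants, and function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.special import expit, logsumexp

# Belief over linear reward weights w, represented by weighted samples.
# Step (1): demonstrations shape a coarse prior; step (2): binary
# preference answers sharpen it. All constants are illustrative.

rng = np.random.default_rng(0)
N, D, BETA = 10_000, 4, 1.0      # weight samples, feature dim, rationality

w = rng.normal(size=(N, D))
w /= np.linalg.norm(w, axis=1, keepdims=True)   # prior: unit-norm weights
log_p = np.zeros(N)                             # log belief over the samples

def update_with_demo(phi_demo, phi_alts):
    """Boltzmann-rational demo likelihood: the demonstration's feature sums
    should outscore those of sampled alternative trajectories."""
    global log_p
    scores = BETA * (w @ np.vstack([phi_demo, phi_alts]).T)   # (N, 1+K)
    log_p += scores[:, 0] - logsumexp(scores, axis=1)

def update_with_preference(phi_a, phi_b, chose_a):
    """Preference likelihood: P(a preferred | w) = sigmoid(beta*w.(phi_a-phi_b))."""
    global log_p
    diff = BETA * (w @ (phi_a - phi_b))
    log_p += np.log(expit(diff if chose_a else -diff))

# One demonstration (against two random alternatives), then one query answer.
update_with_demo(np.array([1.0, 0.2, 0.0, 0.5]), rng.normal(size=(2, D)))
update_with_preference(np.array([0.9, 0.1, 0.0, 0.4]),
                       np.array([0.0, 1.0, 0.3, 0.0]), chose_a=True)
mean_w = np.average(w, axis=0, weights=np.exp(log_p - log_p.max()))
print("posterior mean reward weights:", mean_w)
```

The demo update plays the role of the coarse prior; subsequent queries can then be generated from, and scored against, this already-narrowed belief.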
Asking Easy Questions: A User-Friendly Approach to Active Reward Learning
Robots can learn the right reward function by querying a human expert.
Existing approaches attempt to choose questions where the robot is most
uncertain about the human's response; however, they do not consider how easy it
will be for the human to answer! In this paper, we explore an information-gain
formulation for optimally selecting questions, one that naturally accounts for
the human's ability to answer. Our approach identifies questions that optimize the
trade-off between robot and human uncertainty, and determines when these
questions become redundant or costly. Simulations and a user study show our
method not only produces easy questions, but also ultimately results in faster
reward learning.
Comment: Proceedings of the 3rd Conference on Robot Learning (CoRL), October 2019
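A minimal sketch of an information-gain criterion of this kind, for pairwise queries under a sigmoidal answer model (the model and names are assumptions for illustration): the gain I(answer; w) = H(answer) − E_w[H(answer | w)] is exactly what penalizes hard questions, since an answer that is near 50/50 under every plausible w drives the conditional-entropy term up.

```python
import numpy as np
from scipy.special import expit

def entropy(p):
    """Binary entropy, elementwise."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def info_gain(w_samples, phi_a, phi_b, beta=1.0):
    """I(answer; w) = H(answer) - E_w[H(answer | w)] for one pairwise query,
    estimated over weight samples from the current belief."""
    p_a = expit(beta * (w_samples @ (phi_a - phi_b)))   # P(answer = a | w)
    return entropy(p_a.mean()) - entropy(p_a).mean()

def pick_query(w_samples, candidate_pairs):
    """Choose the candidate query (phi_a, phi_b) with maximal information gain."""
    return max(candidate_pairs, key=lambda q: info_gain(w_samples, *q))
```

By contrast, a pure max-uncertainty criterion would keep only the first entropy term and happily ask questions the human cannot reliably answer.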
Reward-rational (implicit) choice: A unifying formalism for reward learning
It is often difficult to hand-specify what the correct reward function is for
a task, so researchers have instead aimed to learn reward functions from human
behavior or feedback. The types of behavior interpreted as evidence of the
reward function have expanded greatly in recent years. We've gone from
demonstrations, to comparisons, to reading into the information leaked when the
human is pushing the robot away or turning it off. And surely, there is more to
come. How will a robot make sense of all these diverse types of behavior? Our
key insight is that different types of behavior can be interpreted in a single
unifying formalism - as a reward-rational choice that the human is making,
often implicitly. The formalism offers both a unifying lens with which to view
past work, as well as a recipe for interpreting new sources of information that
are yet to be uncovered. We provide two examples to showcase this: interpreting
a new feedback type, and reading into how the choice of feedback itself leaks
information about the reward.
Comment: Published at NeurIPS 2020
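As a rough rendering of the formalism (the notation below is my assumption, following the abstract's description): each feedback type is modeled as a choice c from a set C of options, grounded into trajectory space by a map ψ, which the human makes approximately rationally with respect to the reward r.

```latex
% Reward-rational implicit choice (sketch; \beta is a rationality coefficient):
P(c \mid r) = \frac{\exp\big(\beta\, \mathbb{E}[\, r(\psi(c)) \,]\big)}
                   {\sum_{c' \in C} \exp\big(\beta\, \mathbb{E}[\, r(\psi(c')) \,]\big)}
% Demonstrations: C is the set of all trajectories and \psi is the identity.
% Comparisons:    C = \{\xi_A, \xi_B\}, again with \psi the identity.
% Switching off:  C = \{\text{let it continue}, \text{shut it down}\}, with
%                 \psi mapping "shut it down" to the halted trajectory.
```

Interpreting a new feedback type then reduces to specifying its choice set C and grounding map ψ, after which the same inference machinery applies.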
Better-than-Demonstrator Imitation Learning via Automatically-Ranked Demonstrations
The performance of imitation learning is typically upper-bounded by the
performance of the demonstrator. While recent empirical results demonstrate
that ranked demonstrations allow for better-than-demonstrator performance,
preferences over demonstrations may be difficult to obtain, and little is known
theoretically about when such methods can be expected to successfully
extrapolate beyond the performance of the demonstrator. To address these
issues, we first contribute a sufficient condition for better-than-demonstrator
imitation learning and provide theoretical results showing why preferences over
demonstrations can better reduce reward function ambiguity when performing
inverse reinforcement learning. Building on this theory, we introduce
Disturbance-based Reward Extrapolation (D-REX), a ranking-based imitation
learning method that injects noise into a policy learned through behavioral
cloning to automatically generate ranked demonstrations. These ranked
demonstrations are used to efficiently learn a reward function that can then be
optimized using reinforcement learning. We empirically validate our approach on
simulated robot and Atari imitation learning benchmarks and show that D-REX
outperforms standard imitation learning approaches and can significantly
surpass the performance of the demonstrator. D-REX is the first imitation
learning approach to achieve significant extrapolation beyond the
demonstrator's performance without additional side-information or supervision,
such as rewards or human preferences. By generating rankings automatically, we
show that preference-based inverse reinforcement learning can be applied in
traditional imitation learning settings where only unlabeled demonstrations are
available.
Comment: In Proceedings of the 3rd Conference on Robot Learning (CoRL), 2019
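The noise-injection trick can be demonstrated end to end on a toy model. In this runnable sketch (an illustration, not the authors' code; the feature model, noise schedule, and optimizer are invented stand-ins), rollouts from a noisier copy of the cloned policy are automatically ranked worse, and a reward is fit with a pairwise Bradley-Terry ranking loss.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 5
true_direction = rng.normal(size=D)  # stands in for what the demonstrator optimizes

def rollout_features(epsilon):
    """Toy stand-in for rolling out the BC policy with epsilon-greedy noise:
    more noise pulls the trajectory's feature sum off the good direction."""
    return (1 - epsilon) * true_direction + epsilon * rng.normal(size=D)

noise_levels = [0.0, 0.25, 0.5, 0.75, 1.0]     # lower noise => ranked better
trajs = [(eps, rollout_features(eps)) for eps in noise_levels for _ in range(20)]

theta, lr = np.zeros(D), 0.1
for _ in range(2000):
    i, j = rng.integers(len(trajs), size=2)
    if trajs[i][0] == trajs[j][0]:
        continue                                # same noise level: no signal
    better, worse = sorted((trajs[i], trajs[j]), key=lambda t: t[0])
    diff = better[1] - worse[1]
    p = 1.0 / (1.0 + np.exp(-theta @ diff))     # P(better > worse | theta)
    theta += lr * (1.0 - p) * diff              # ascent on log p (Bradley-Terry)

cos = theta @ true_direction / (np.linalg.norm(theta) * np.linalg.norm(true_direction))
print(f"learned reward vs. true direction (cosine): {cos:.2f}")
```

The learned reward would then be handed to an RL algorithm, which is where the extrapolation beyond the demonstrator can occur.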
Batch Active Learning Using Determinantal Point Processes
Data collection and labeling are among the main challenges in employing
machine learning algorithms in a variety of real-world applications with
limited data. While active learning methods attempt to tackle this issue by
labeling only the data samples that give high information, they generally
suffer from large computational costs and are impractical in settings where
data can be collected in parallel. Batch active learning methods attempt to
overcome this computational burden by querying batches of samples at a time. To
avoid redundancy between samples, previous works rely on some ad hoc
combination of sample quality and diversity. In this paper, we present a new
principled batch active learning method using Determinantal Point Processes
(DPPs), repulsive point processes that enable generating diverse batches of samples. We
develop tractable algorithms to approximate the mode of a DPP distribution, and
provide theoretical guarantees on the degree of approximation. We further
demonstrate that an iterative greedy method for DPP maximization, which has
lower computational costs but worse theoretical guarantees, still gives
competitive results for batch active learning. Our experiments show the value
of our methods on several datasets against state-of-the-art baselines.
Comment: Submitted to NeurIPS 2019
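For intuition, here is a minimal sketch of a greedy heuristic for approximating the DPP mode (the kernel construction and the quality scores are illustrative assumptions): with an L-ensemble kernel L[i, j] = q_i * q_j * k(x_i, x_j) combining per-sample quality q and similarity k, each step adds the item that most increases the log-determinant of the selected submatrix, trading off informativeness against redundancy.

```python
import numpy as np

def greedy_dpp(L, k):
    """Greedily grow a batch of k indices, each step maximizing
    det(L restricted to the selected set)."""
    selected = []
    for _ in range(k):
        best, best_logdet = None, -np.inf
        for i in range(L.shape[0]):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best, best_logdet = i, logdet
        if best is None:
            break
        selected.append(best)
    return selected

# Toy usage: quality from (hypothetical) active-learning scores, diversity
# from an RBF similarity, so the batch is informative *and* spread out.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                    # candidate samples
q = rng.uniform(0.5, 1.5, size=50)              # per-sample quality scores
sq = ((X[:, None] - X[None]) ** 2).sum(-1)      # pairwise squared distances
L = np.outer(q, q) * np.exp(-0.5 * sq)          # quality-weighted RBF kernel
print("selected batch:", greedy_dpp(L, k=5))
```

Diversity here is not an ad hoc bonus term: it falls out of the determinant, which shrinks when two selected rows of L are nearly parallel.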
Learning Reward Functions from Diverse Sources of Human Feedback: Optimally Integrating Demonstrations and Preferences
Reward functions are a common way to specify the objective of a robot. As
designing reward functions can be extremely challenging, a more promising
approach is to directly learn reward functions from human teachers.
Importantly, humans provide data in a variety of forms: these include
instructions (e.g., natural language), demonstrations (e.g., kinesthetic
guidance), and preferences (e.g., comparative rankings). Prior research has
independently applied reward learning to each of these different data sources.
However, in many domains some of these information sources are inapplicable
or inefficient, while multiple sources together are complementary and
expressive. Motivated by this general problem, we present a framework to
integrate multiple sources of information, which are either passively or
actively collected from human users. In particular, we present an algorithm
that first utilizes user demonstrations to initialize a belief about the reward
function, and then proactively probes the user with preference queries to
zero in on their true reward. This algorithm not only enables us to combine
multiple data sources, but it also informs the robot when it should leverage
each type of information. Further, our approach accounts for the human's
ability to provide data, yielding user-friendly preference queries that are
also theoretically optimal. Our extensive simulated experiments and user
studies on a Fetch mobile manipulator demonstrate the superiority and the
usability of our integrated framework.
Comment: 19 pages, 17 figures. Submitted to The International Journal of Robotics Research (IJRR)
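One concrete reading of "informs the robot when it should leverage each type of information" (my assumption, reusing the information-gain criterion from the "Asking Easy Questions" sketch above): after the demonstration-initialized belief, keep asking preference queries only while the best remaining question's expected gain exceeds the cost of asking.

```python
import numpy as np
from scipy.special import expit

def entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def info_gain(w_samples, phi_a, phi_b, beta=1.0):
    """Expected information gain of one pairwise query (see earlier sketch)."""
    p_a = expit(beta * (w_samples @ (phi_a - phi_b)))
    return entropy(p_a.mean()) - entropy(p_a).mean()

def next_query(w_samples, candidates, query_cost=0.05):
    """Best pairwise query, or None once every remaining question is
    redundant (expected gain below the hypothetical cost of asking)."""
    best = max(candidates, key=lambda q: info_gain(w_samples, *q))
    return best if info_gain(w_samples, *best) > query_cost else None
```

Under this reading, demonstrations are worth collecting while the belief is broad, and preference queries take over once only fine distinctions between rewards remain.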