6 research outputs found

    Learning Reward Functions by Integrating Human Demonstrations and Preferences

    Full text link
    Our goal is to accurately and efficiently learn reward functions for autonomous robots. Current approaches to this problem include inverse reinforcement learning (IRL), which uses expert demonstrations, and preference-based learning, which iteratively queries the user for her preferences between trajectories. In robotics however, IRL often struggles because it is difficult to get high-quality demonstrations; conversely, preference-based learning is very inefficient since it attempts to learn a continuous, high-dimensional function from binary feedback. We propose a new framework for reward learning, DemPref, that uses both demonstrations and preference queries to learn a reward function. Specifically, we (1) use the demonstrations to learn a coarse prior over the space of reward functions, to reduce the effective size of the space from which queries are generated; and (2) use the demonstrations to ground the (active) query generation process, to improve the quality of the generated queries. Our method alleviates the efficiency issues faced by standard preference-based learning methods and does not exclusively depend on (possibly low-quality) demonstrations. In numerical experiments, we find that DemPref is significantly more efficient than a standard active preference-based learning method. In a user study, we compare our method to a standard IRL method; we find that users rated the robot trained with DemPref as being more successful at learning their desired behavior, and preferred to use the DemPref system (over IRL) to train the robot.Comment: Presented at RSS 201

    Asking Easy Questions: A User-Friendly Approach to Active Reward Learning

    Full text link
    Robots can learn the right reward function by querying a human expert. Existing approaches attempt to choose questions where the robot is most uncertain about the human's response; however, they do not consider how easy it will be for the human to answer! In this paper we explore an information gain formulation for optimally selecting questions that naturally account for the human's ability to answer. Our approach identifies questions that optimize the trade-off between robot and human uncertainty, and determines when these questions become redundant or costly. Simulations and a user study show our method not only produces easy questions, but also ultimately results in faster reward learning.Comment: Proceedings of the 3rd Conference on Robot Learning (CoRL), October 201

    Reward-rational (implicit) choice: A unifying formalism for reward learning

    Full text link
    It is often difficult to hand-specify what the correct reward function is for a task, so researchers have instead aimed to learn reward functions from human behavior or feedback. The types of behavior interpreted as evidence of the reward function have expanded greatly in recent years. We've gone from demonstrations, to comparisons, to reading into the information leaked when the human is pushing the robot away or turning it off. And surely, there is more to come. How will a robot make sense of all these diverse types of behavior? Our key insight is that different types of behavior can be interpreted in a single unifying formalism - as a reward-rational choice that the human is making, often implicitly. The formalism offers both a unifying lens with which to view past work, as well as a recipe for interpreting new sources of information that are yet to be uncovered. We provide two examples to showcase this: interpreting a new feedback type, and reading into how the choice of feedback itself leaks information about the reward.Comment: Published at NeurIPS 202

    Better-than-Demonstrator Imitation Learning via Automatically-Ranked Demonstrations

    Full text link
    The performance of imitation learning is typically upper-bounded by the performance of the demonstrator. While recent empirical results demonstrate that ranked demonstrations allow for better-than-demonstrator performance, preferences over demonstrations may be difficult to obtain, and little is known theoretically about when such methods can be expected to successfully extrapolate beyond the performance of the demonstrator. To address these issues, we first contribute a sufficient condition for better-than-demonstrator imitation learning and provide theoretical results showing why preferences over demonstrations can better reduce reward function ambiguity when performing inverse reinforcement learning. Building on this theory, we introduce Disturbance-based Reward Extrapolation (D-REX), a ranking-based imitation learning method that injects noise into a policy learned through behavioral cloning to automatically generate ranked demonstrations. These ranked demonstrations are used to efficiently learn a reward function that can then be optimized using reinforcement learning. We empirically validate our approach on simulated robot and Atari imitation learning benchmarks and show that D-REX outperforms standard imitation learning approaches and can significantly surpass the performance of the demonstrator. D-REX is the first imitation learning approach to achieve significant extrapolation beyond the demonstrator's performance without additional side-information or supervision, such as rewards or human preferences. By generating rankings automatically, we show that preference-based inverse reinforcement learning can be applied in traditional imitation learning settings where only unlabeled demonstrations are available.Comment: In proceedings of 3rd Conference on Robot Learning (CoRL) 201

    Batch Active Learning Using Determinantal Point Processes

    Full text link
    Data collection and labeling is one of the main challenges in employing machine learning algorithms in a variety of real-world applications with limited data. While active learning methods attempt to tackle this issue by labeling only the data samples that give high information, they generally suffer from large computational costs and are impractical in settings where data can be collected in parallel. Batch active learning methods attempt to overcome this computational burden by querying batches of samples at a time. To avoid redundancy between samples, previous works rely on some ad hoc combination of sample quality and diversity. In this paper, we present a new principled batch active learning method using Determinantal Point Processes, a repulsive point process that enables generating diverse batches of samples. We develop tractable algorithms to approximate the mode of a DPP distribution, and provide theoretical guarantees on the degree of approximation. We further demonstrate that an iterative greedy method for DPP maximization, which has lower computational costs but worse theoretical guarantees, still gives competitive results for batch active learning. Our experiments show the value of our methods on several datasets against state-of-the-art baselines.Comment: Submitted to NeurIPS 201

    Learning Reward Functions from Diverse Sources of Human Feedback: Optimally Integrating Demonstrations and Preferences

    Full text link
    Reward functions are a common way to specify the objective of a robot. As designing reward functions can be extremely challenging, a more promising approach is to directly learn reward functions from human teachers. Importantly, humans provide data in a variety of forms: these include instructions (e.g., natural language), demonstrations, (e.g., kinesthetic guidance), and preferences (e.g., comparative rankings). Prior research has independently applied reward learning to each of these different data sources. However, there exist many domains where some of these information sources are not applicable or inefficient -- while multiple sources are complementary and expressive. Motivated by this general problem, we present a framework to integrate multiple sources of information, which are either passively or actively collected from human users. In particular, we present an algorithm that first utilizes user demonstrations to initialize a belief about the reward function, and then proactively probes the user with preference queries to zero-in on their true reward. This algorithm not only enables us combine multiple data sources, but it also informs the robot when it should leverage each type of information. Further, our approach accounts for the human's ability to provide data: yielding user-friendly preference queries which are also theoretically optimal. Our extensive simulated experiments and user studies on a Fetch mobile manipulator demonstrate the superiority and the usability of our integrated framework.Comment: 19 pages, 17 figures. Submitted to The International Journal of Robotics Research (IJRR
    corecore