6 research outputs found

    Imitative Planning using Conditional Normalizing Flow

    We explore the application of normalizing flows for improving the performance of trajectory planning for autonomous vehicles (AVs). Normalizing flows provide an invertible mapping from a known prior distribution to a potentially complex, multi-modal target distribution and allow for fast sampling with exact PDF inference. By modeling a trajectory planner's cost manifold as an energy function, we learn a scene-conditioned mapping from the prior to a Boltzmann distribution over the AV control space. This mapping allows control samples and their associated energies to be generated jointly and in parallel. We propose using neural autoregressive flow (NAF) as part of an end-to-end deep-learned system that uses sensor, map, and route information to condition the flow mapping. Finally, we demonstrate the effectiveness of our approach on real-world datasets over imitation learning (IL) and hand-constructed trajectory sampling techniques. Comment: Submitted to 4th Conference on Robot Learning (CoRL 2020), Cambridge MA, US
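
The abstract's core mechanism, an invertible map from a known prior that yields samples together with their exact density, can be illustrated with a minimal sketch. This is not the paper's neural autoregressive flow: it assumes a one-dimensional affine flow, and `AffineFlow`, `sample_with_log_prob`, `mu`, and `log_sigma` are hypothetical names chosen for illustration. Under a Boltzmann reading, p(x) proportional to exp(-E(x)), the negative log-density plays the role of the energy up to an additive constant.

```python
import math
import random


def standard_normal_logpdf(z):
    """Log-density of the prior, a standard normal N(0, 1)."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))


class AffineFlow:
    """Toy invertible flow x = mu + exp(log_sigma) * z (illustrative only)."""

    def __init__(self, mu, log_sigma):
        self.mu = mu
        self.log_sigma = log_sigma

    def forward(self, z):
        # Map a prior sample z to the target space.
        return self.mu + math.exp(self.log_sigma) * z

    def inverse(self, x):
        # Exact inverse of forward().
        return (x - self.mu) / math.exp(self.log_sigma)

    def log_prob(self, x):
        # Change of variables: log p(x) = log p_prior(z) - log|dx/dz|.
        z = self.inverse(x)
        return standard_normal_logpdf(z) - self.log_sigma


def sample_with_log_prob(flow, rng):
    """Draw one sample and its exact log-density in a single pass."""
    z = rng.gauss(0.0, 1.0)
    x = flow.forward(z)
    return x, standard_normal_logpdf(z) - flow.log_sigma


if __name__ == "__main__":
    flow = AffineFlow(mu=2.0, log_sigma=math.log(3.0))
    x, log_p = sample_with_log_prob(flow, random.Random(0))
    print(x, log_p, -log_p)  # sample, log-density, energy up to a constant
```

Because the sample and its log-density come from the same forward pass, many candidates can be scored without a separate density evaluation, which is the property the abstract relies on.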

    Dual RL: Unification and New Methods for Reinforcement and Imitation Learning

    The goal of reinforcement learning (RL) is to maximize the expected cumulative return. It has been shown that this objective can be represented by an optimization problem of the state-action visitation distribution under linear constraints. The dual problem of this formulation, which we refer to as dual RL, is unconstrained and easier to optimize. We show that several state-of-the-art off-policy deep RL algorithms, in both online and offline settings and for both RL and imitation learning (IL), can be viewed as dual RL approaches in a unified framework. This unification provides a common ground to study and identify the components that contribute to the success of these methods and also reveals the common shortcomings across methods with new insights for improvement. Our analysis shows that prior off-policy imitation learning methods are based on an unrealistic coverage assumption and minimize a particular f-divergence between the visitation distributions of the learned policy and the expert policy. We propose a new method using a simple modification to the dual RL framework that allows for performant imitation learning with arbitrary off-policy data to obtain near-expert performance, without learning a discriminator. Further, by framing a recent SOTA offline RL method, XQL, in the dual RL framework, we propose alternative choices to replace the Gumbel regression loss, which achieve improved performance and resolve the training instability issue of XQL. Project code and details can be found at https://hari-sikchi.github.io/dual-rl. Comment: 46 pages. Under review
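
For orientation, the "optimization problem of the state-action visitation distribution under linear constraints" mentioned in the abstract is commonly written as the linear program below. The notation is an assumption, not taken from the abstract: $d(s,a)$ is the normalized discounted visitation distribution, $d_0$ the initial-state distribution, $P$ the transition kernel, $r$ the reward, and $\gamma$ the discount factor. The paper's own formulation may include additional regularization terms not shown here.

```latex
\begin{aligned}
\max_{d \,\ge\, 0} \quad & \sum_{s,a} d(s,a)\, r(s,a) \\
\text{s.t.} \quad & \sum_{a} d(s,a)
  = (1-\gamma)\, d_0(s)
  + \gamma \sum_{s',a'} P(s \mid s',a')\, d(s',a')
  \qquad \forall s .
\end{aligned}
```

The equality constraints are the Bellman flow conditions: the probability mass entering each state must equal the mass leaving it.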

    Contrastive Preference Learning: Learning from Human Feedback without RL

    Reinforcement Learning from Human Feedback (RLHF) has emerged as a popular paradigm for aligning models with human intent. Typically RLHF algorithms operate in two phases: first, use human preferences to learn a reward function and second, align the model by optimizing the learned reward via reinforcement learning (RL). This paradigm assumes that human preferences are distributed according to reward, but recent work suggests that they instead follow the regret under the user's optimal policy. Thus, learning a reward function from feedback is not only based on a flawed assumption of human preference, but also leads to unwieldy optimization challenges that stem from policy gradients or bootstrapping in the RL phase. Because of these optimization challenges, contemporary RLHF methods restrict themselves to contextual bandit settings (e.g., as in large language models) or limit observation dimensionality (e.g., state-based robotics). We overcome these limitations by introducing a new family of algorithms for optimizing behavior from human feedback using the regret-based model of human preferences. Using the principle of maximum entropy, we derive Contrastive Preference Learning (CPL), an algorithm for learning optimal policies from preferences without learning reward functions, circumventing the need for RL. CPL is fully off-policy, uses only a simple contrastive objective, and can be applied to arbitrary MDPs. This enables CPL to elegantly scale to high-dimensional and sequential RLHF problems while being simpler than prior methods. Comment: Code released at https://github.com/jhejna/cpl. Edited 10/23 only to fix typo in the title
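
To make the idea of a "simple contrastive objective" over preferences concrete, here is a minimal sketch. It is a generic Bradley-Terry-style loss and is not confirmed to match the paper's exact objective: it assumes each segment is scored by a discounted sum of scaled policy log-probabilities, and `segment_score`, `contrastive_preference_loss`, `alpha`, and `gamma` are hypothetical names and parameters chosen for illustration.

```python
import math


def segment_score(log_probs, alpha=0.1, gamma=1.0):
    """Discounted sum of scaled policy log-probabilities along one segment.

    log_probs[t] is assumed to be log pi(a_t | s_t) for step t of the segment.
    """
    return sum((gamma ** t) * alpha * lp for t, lp in enumerate(log_probs))


def contrastive_preference_loss(preferred_log_probs, rejected_log_probs,
                                alpha=0.1, gamma=1.0):
    """Negative log-probability that the preferred segment wins the comparison.

    Equals -log(sigmoid(s_pos - s_neg)), computed as a numerically stable
    softplus of (s_neg - s_pos). No reward model is involved: only policy
    log-probabilities enter the loss.
    """
    s_pos = segment_score(preferred_log_probs, alpha, gamma)
    s_neg = segment_score(rejected_log_probs, alpha, gamma)
    diff = s_neg - s_pos
    return max(diff, 0.0) + math.log1p(math.exp(-abs(diff)))


if __name__ == "__main__":
    # The policy assigns higher likelihood to the preferred segment's actions,
    # so the loss falls below log(2), the value at indifference.
    loss = contrastive_preference_loss([-0.1, -0.2], [-2.0, -3.0])
    print(loss, math.log(2.0))
```

Minimizing a loss of this shape pushes the policy to assign higher likelihood to actions in preferred segments than to those in rejected ones, using only logged comparisons.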

    Real Robot Challenge: A Robotics Competition in the Cloud

    Dexterous manipulation remains an open problem in robotics. To coordinate efforts of the research community towards tackling this problem, we propose a shared benchmark. We designed and built robotic platforms that are hosted at MPI for Intelligent Systems and can be accessed remotely. Each platform consists of three robotic fingers that are capable of dexterous object manipulation. Users are able to control the platforms remotely by submitting code that is executed automatically, akin to a computational cluster. Using this setup, i) we host robotics competitions, where teams from anywhere in the world access our platforms to tackle challenging tasks, ii) we publish the datasets collected during these competitions (consisting of hundreds of robot hours), and iii) we give researchers access to these platforms for their own projects.