Imitative Planning using Conditional Normalizing Flow
We explore the application of normalizing flows for improving the performance
of trajectory planning for autonomous vehicles (AVs). Normalizing flows provide
an invertible mapping from a known prior distribution to a potentially complex,
multi-modal target distribution and allow for fast sampling with exact PDF
inference. By modeling a trajectory planner's cost manifold as an energy function, we learn a scene-conditioned mapping from the prior to a Boltzmann distribution over the AV control space. This mapping allows control samples and their associated energies to be generated jointly and in parallel. We propose using a neural autoregressive flow (NAF) as part of an end-to-end deep-learned system that uses sensor, map, and route information to condition the flow mapping. Finally, we demonstrate the effectiveness of our approach on real-world datasets against imitation learning (IL) and hand-constructed trajectory sampling baselines.
Comment: Submitted to the 4th Conference on Robot Learning (CoRL 2020), Cambridge, MA, USA.
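As a concrete illustration of the mechanism described above (a minimal sketch, not the paper's NAF architecture; the module and all names below are my own), even a single conditional affine flow yields control samples and their exact energies jointly and in parallel:

# Minimal sketch: a conditional affine flow mapping a Gaussian prior to a
# distribution over flattened control trajectories. Sampling and the exact
# log-density (hence a Boltzmann energy) come out of the same pass.
import torch
import torch.nn as nn

class ConditionalAffineFlow(nn.Module):
    """z ~ N(0, I)  ->  u = mu(c) + exp(log_sigma(c)) * z, conditioned on scene context c."""
    def __init__(self, control_dim: int, context_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(context_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * control_dim),  # predicts mu and log_sigma
        )
        self.control_dim = control_dim

    def forward(self, context: torch.Tensor, n_samples: int):
        mu, log_sigma = self.net(context).chunk(2, dim=-1)
        z = torch.randn(n_samples, self.control_dim)
        u = mu + log_sigma.exp() * z  # invertible, so the density is exact
        # Change of variables: log p(u) = log N(z; 0, I) - sum(log_sigma)
        log_prior = -0.5 * (z ** 2 + torch.log(torch.tensor(2 * torch.pi))).sum(-1)
        log_prob = log_prior - log_sigma.sum(-1)
        return u, -log_prob  # energy E(u) = -log p(u), i.e. p(u) ~ exp(-E(u))

# Usage: 50 candidate controls plus their energies from one parallel pass.
flow = ConditionalAffineFlow(control_dim=20, context_dim=64)
controls, energies = flow(torch.randn(64), n_samples=50)

The paper's NAF replaces this single affine transform with a stack of autoregressive ones, which is what buys multi-modality; the change-of-variables bookkeeping above is unchanged.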
Dual RL: Unification and New Methods for Reinforcement and Imitation Learning
The goal of reinforcement learning (RL) is to maximize the expected
cumulative return. It has been shown that this objective can be represented by
an optimization problem of the state-action visitation distribution under
linear constraints. The dual problem of this formulation, which we refer to as
dual RL, is unconstrained and easier to optimize. We show that several state-of-the-art off-policy deep RL algorithms, in both online and offline settings and for both RL and imitation learning (IL), can be viewed as dual RL approaches within a unified framework. This unification provides a
common ground to study and identify the components that contribute to the
success of these methods and also reveals the common shortcomings across
methods, with new insights for improvement. Our analysis shows that prior off-policy imitation learning methods rest on an unrealistic coverage assumption and minimize a particular f-divergence between the visitation distributions of the learned policy and the expert policy. We propose a new
method using a simple modification to the dual RL framework that allows for
performant imitation learning with arbitrary off-policy data to obtain
near-expert performance, without learning a discriminator. Further, by framing a recent state-of-the-art offline RL method, XQL, in the dual RL framework, we propose alternative choices to replace its Gumbel regression loss, which achieve improved performance and resolve the training instability of XQL. Project code and details can be found at https://hari-sikchi.github.io/dual-rl.
Comment: 46 pages. Under review.
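The primal-dual step the abstract builds on can be sketched as follows (a standard DICE-style derivation; the notation alpha, d^O, and f^* is mine, and the paper's exact regularized objective may differ). The primal is a linear program over state-action visitations d, regularized by an f-divergence to an offline distribution d^O:

\begin{align*}
\max_{d \ge 0}\;& \mathbb{E}_{(s,a)\sim d}[r(s,a)] - \alpha\, D_f\!\left(d \,\|\, d^{O}\right) \\
\text{s.t.}\;& \sum_a d(s,a) = (1-\gamma)\,\rho_0(s) + \gamma \sum_{\bar s,\bar a} P(s \mid \bar s,\bar a)\, d(\bar s,\bar a) \quad \forall s,
\end{align*}

and introducing multipliers V(s) for the flow constraints, then applying Fenchel conjugacy (with f^* the convex conjugate of f), gives the unconstrained dual

\begin{align*}
\min_{V}\; (1-\gamma)\, \mathbb{E}_{s\sim\rho_0}[V(s)] + \alpha\, \mathbb{E}_{(s,a)\sim d^{O}}\!\left[ f^{*}\!\left( \tfrac{1}{\alpha}\big(r(s,a) + \gamma\, \mathbb{E}_{s'}[V(s')] - V(s)\big) \right) \right].
\end{align*}

Since the constraints are absorbed into the conjugate term, the dual can be minimized with ordinary stochastic gradients over off-policy samples from d^O, which is what makes it "easier to optimize".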
Contrastive Preference Learning: Learning from Human Feedback without RL
Reinforcement Learning from Human Feedback (RLHF) has emerged as a popular
paradigm for aligning models with human intent. Typically, RLHF algorithms
operate in two phases: first, use human preferences to learn a reward function
and second, align the model by optimizing the learned reward via reinforcement
learning (RL). This paradigm assumes that human preferences are distributed
according to reward, but recent work suggests that they instead follow the
regret under the user's optimal policy. Thus, learning a reward function from
feedback is not only based on a flawed assumption of human preference, but also
leads to unwieldy optimization challenges that stem from policy gradients or
bootstrapping in the RL phase. Because of these optimization challenges,
contemporary RLHF methods restrict themselves to contextual bandit settings
(e.g., as in large language models) or limit observation dimensionality (e.g.,
state-based robotics). We overcome these limitations by introducing a new
family of algorithms for optimizing behavior from human feedback using the
regret-based model of human preferences. Using the principle of maximum
entropy, we derive Contrastive Preference Learning (CPL), an algorithm for
learning optimal policies from preferences without learning reward functions,
circumventing the need for RL. CPL is fully off-policy, uses only a simple
contrastive objective, and can be applied to arbitrary MDPs. This enables CPL
to elegantly scale to high-dimensional and sequential RLHF problems while being
simpler than prior methods.
Comment: Code released at https://github.com/jhejna/cpl. Edited 10/23 only to fix a typo in the title.
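Reading the abstract, the heart of CPL is a Bradley-Terry-style contrastive objective in which a segment's discounted sum of alpha * log pi(a|s) stands in for its advantage (via the maximum-entropy identity A*(s,a) = alpha * log pi*(a|s)). A minimal sketch under that reading (the function and tensor names are mine, not the released code):

# Minimal sketch of a CPL-style objective: a preferred segment should
# accumulate more discounted log-probability under the policy than a
# dispreferred one; no reward model, no value bootstrapping, fully off-policy.
import torch
import torch.nn.functional as F

def segment_score(log_probs: torch.Tensor, gamma: float, alpha: float) -> torch.Tensor:
    """Discounted sum of alpha * log pi(a_t | s_t) over a segment.
    log_probs: (batch, T) per-step action log-probabilities under the policy."""
    T = log_probs.shape[1]
    discounts = gamma ** torch.arange(T, dtype=log_probs.dtype)
    return alpha * (discounts * log_probs).sum(dim=1)  # (batch,)

def cpl_loss(logp_preferred, logp_dispreferred, gamma=0.99, alpha=0.1):
    """Contrastive loss: -log P(preferred > dispreferred) under Bradley-Terry."""
    s_pos = segment_score(logp_preferred, gamma, alpha)
    s_neg = segment_score(logp_dispreferred, gamma, alpha)
    return -F.logsigmoid(s_pos - s_neg).mean()

# Usage with dummy per-step log-probs for 8 preference pairs of length 16;
# in practice these come from the policy being trained.
lp_pos = torch.randn(8, 16, requires_grad=True)
lp_neg = torch.randn(8, 16, requires_grad=True)
cpl_loss(lp_pos, lp_neg).backward()

Because the score depends only on the policy's log-probabilities of the logged actions, this is a plain supervised contrastive loss, which is what lets CPL skip the RL phase entirely.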
Real Robot Challenge: A Robotics Competition in the Cloud
Dexterous manipulation remains an open problem in robotics. To coordinate the efforts of the research community towards tackling this problem, we propose a
shared benchmark. We designed and built robotic platforms that are hosted at
MPI for Intelligent Systems and can be accessed remotely. Each platform
consists of three robotic fingers that are capable of dexterous object
manipulation. Users are able to control the platforms remotely by submitting
code that is executed automatically, akin to a computational cluster. Using
this setup, i) we host robotics competitions, where teams from anywhere in the world access our platforms to tackle challenging tasks; ii) we publish the datasets collected during these competitions (consisting of hundreds of robot hours); and iii) we give researchers access to these platforms for their own projects.