Inverse Decision Modeling: Learning Interpretable Representations of Behavior
Decision analysis deals with modeling and enhancing decision processes. A
principal challenge in improving behavior is obtaining a transparent
description of existing behavior in the first place. In this paper, we develop
an expressive, unifying perspective on inverse decision modeling: a framework
for learning parameterized representations of sequential decision behavior.
First, we formalize the forward problem (as a normative standard), subsuming
common classes of control behavior. Second, we use this to formalize the
inverse problem (as a descriptive model), generalizing existing work on
imitation/reward learning -- while opening up a much broader class of research
problems in behavior representation. Finally, we instantiate this approach with
an example (inverse bounded rational control), illustrating how this structure
enables learning (interpretable) representations of (bounded) rationality --
while naturally capturing intuitive notions of suboptimal actions, biased
beliefs, and imperfect knowledge of environments.
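As a rough schematic of this framing (the notation below is ours, not the paper's), the forward problem maps planner parameters to behavior, and the inverse problem recovers those parameters from demonstrations:

```latex
% Forward problem (normative standard): parameters -> behavior
F : \theta \;\longmapsto\; \pi_\theta
% Inverse problem (descriptive model): fit parameters to demonstrations D
\hat{\theta} = \arg\max_{\theta} \sum_{(s, a) \in \mathcal{D}} \log \pi_\theta(a \mid s)
```

Here the parameters may jointly encode rewards, beliefs, and bounded-rationality constraints, which is what distinguishes this setup from reward-only inverse RL.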
SEABO: A Simple Search-Based Method for Offline Imitation Learning
Offline reinforcement learning (RL) has attracted much attention due to its
ability to learn from static offline datasets, eliminating the need to
interact with the environment. Nevertheless, the success of offline RL
relies heavily on the offline transitions annotated with reward labels. In
practice, we often need to hand-craft the reward function, which is sometimes
difficult, labor-intensive, or inefficient. To tackle this challenge, we set
our focus on the offline imitation learning (IL) setting, and aim at getting a
reward function based on the expert data and unlabeled data. To that end, we
propose a simple yet effective search-based offline IL method, tagged SEABO.
SEABO assigns a larger reward to a transition that lies close to its nearest
neighbor in the expert demonstration, and a smaller reward otherwise, all in an
unsupervised manner. Experimental results on a variety of D4RL
datasets indicate that SEABO can achieve competitive performance to offline RL
algorithms with ground-truth rewards, given only a single expert trajectory,
and can outperform prior reward learning and offline IL methods across many
tasks. Moreover, we demonstrate that SEABO also works well if the expert
demonstrations contain only observations. Our code is publicly available at
https://github.com/dmksjfl/SEABO. To appear at ICLR 2024.
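The nearest-neighbor reward assignment lends itself to a compact sketch. The following Python snippet is our illustration, not the authors' released code; the function name, the feature construction, and the coefficient beta are assumptions, and SEABO's exact reward squashing may differ:

```python
import numpy as np
from scipy.spatial import cKDTree

def seabo_style_rewards(expert_features, unlabeled_features, beta=0.5):
    """Reward each unlabeled transition by proximity to the expert data.

    Each row is a flat feature vector for one transition, e.g.
    concat(state, action, next_state). beta is an assumed scale; the
    paper's exact squashing function may differ.
    """
    tree = cKDTree(expert_features)                 # index expert transitions
    dists, _ = tree.query(unlabeled_features, k=1)  # distance to nearest expert point
    return np.exp(-beta * dists)                    # closer to expert => larger reward

# Usage sketch: rewards = seabo_style_rewards(expert_xs, dataset_xs)
```

The unsupervised flavor of the method is visible here: no reward labels are consumed, only geometric proximity to expert transitions.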
Understanding Expertise through Demonstrations: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning
Offline inverse reinforcement learning (Offline IRL) aims to recover the
structure of rewards and environment dynamics that underlie observed actions in
a fixed, finite set of demonstrations from an expert agent. Accurate models of
expertise in executing a task have applications in safety-sensitive domains
such as clinical decision making and autonomous driving. However, the structure
of an expert's preferences implicit in observed actions is closely linked to
the expert's model of the environment dynamics (i.e., the "world"). Thus,
inaccurate models of the world obtained from finite data with limited coverage
could compound inaccuracy in estimated rewards. To address this issue, we
propose a bi-level optimization formulation of the estimation task wherein the
upper level is likelihood maximization based upon a conservative model of the
expert's policy (lower level). The policy model is conservative in that it
maximizes reward subject to a penalty that is increasing in the uncertainty of
the estimated model of the world. We propose a new algorithmic framework to
solve the bi-level optimization problem formulation and provide statistical and
computational guarantees of performance for the associated reward estimator.
Finally, we demonstrate that the proposed algorithm outperforms the
state-of-the-art offline IRL and imitation learning benchmarks by a large
margin, over the continuous control tasks in MuJoCo and different datasets in
the D4RL benchmark.
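Schematically, the estimation task reads as the following bi-level program (our notation; the paper's exact objective and penalty differ):

```latex
\hat{r} \in \arg\max_{r}\; \sum_{(s,a) \in \mathcal{D}} \log \pi_r(a \mid s)
\quad \text{s.t.} \quad
\pi_r \in \arg\max_{\pi}\;
\mathbb{E}_{\pi, \widehat{P}}\Big[ \sum_{t} \gamma^{t} \big( r(s_t, a_t) - \lambda\, U(s_t, a_t) \big) \Big]
```

Here the upper level fits the reward by maximum likelihood over the demonstrations, while the lower level computes a conservative policy under the estimated dynamics model \widehat{P}, with a penalty U(s, a) that grows with the uncertainty of that model and discourages exploiting poorly covered regions of the data.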
Vision-based Manipulation In-the-Wild
Deploying robots in real-world environments involves immense engineering complexity, potentially surpassing the resources required for autonomous vehicles due to the increased dimensionality and task variety. To maximize the chances of successful real-world deployment, finding a simple solution that minimizes engineering complexity at every level, from hardware to algorithm to operations, is crucial.
In this dissertation, we consider a vision-based manipulation system that can be deployed in-the-wild when trained to imitate human demonstration data of sufficient quantity and diversity on the desired task. At deployment time, the robot is driven by a single diffusion-based visuomotor policy, with raw RGB images as input and robot end-effector poses as output. Compared to existing policy representations, Diffusion Policy handles multimodal action distributions gracefully, scales to high-dimensional action spaces, and exhibits impressive training stability. These properties allow a single software system to be used for multiple tasks, with data collected by multiple demonstrators, deployed to multiple robot embodiments, and without significant hyper-parameter tuning.
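At inference time, such a policy generates actions by iteratively denoising a noise sample, conditioned on the current observation. The sketch below is a minimal schematic of that loop, assuming a trained noise-prediction network, an encoded RGB observation, and a DDPM-style scheduler; the names are placeholders, and the actual Diffusion Policy implementation differs in architecture and conditioning details:

```python
import torch

@torch.no_grad()
def sample_actions(noise_pred_net, obs_embedding, scheduler,
                   horizon=16, action_dim=7):
    """One inference pass of a diffusion visuomotor policy (schematic).

    noise_pred_net, obs_embedding, and scheduler are assumed placeholders
    for a trained denoising network, an encoded camera observation, and a
    DDPM-style noise scheduler, respectively.
    """
    # Start from pure Gaussian noise over a short action horizon.
    actions = torch.randn(1, horizon, action_dim)
    for t in scheduler.timesteps:
        # Predict the noise in the current sample, conditioned on the observation.
        eps = noise_pred_net(actions, t, cond=obs_embedding)
        # Take one reverse-diffusion step toward a clean action trajectory.
        actions = scheduler.step(eps, t, actions).prev_sample
    return actions  # a sequence of end-effector pose targets
```

Because the denoiser models a full distribution over action trajectories rather than a single regression target, multimodal demonstrations (e.g., passing an obstacle on either side) do not average into an invalid action.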
We developed a Universal Manipulation Interface (UMI), a portable, low-cost, and information-rich data collection system to enable direct manipulation skill learning from in-the-wild human demonstrations. UMI provides an intuitive interface for non-expert users, using hand-held grippers with mounted GoPro cameras. Compared to existing robotic data collection systems, UMI enables robotic data collection without needing a robot, drastically reducing the engineering and operational complexity. Trained on UMI data, the resulting diffusion policies can be deployed across multiple robot platforms, in unseen environments, and with novel objects, completing dynamic, bimanual, precise, and long-horizon tasks.
The Diffusion Policy and UMI combination provides a simple full-stack solution to many manipulation problems. The turnaround time for building a single-task manipulation system (such as object tossing or cloth folding) can be reduced from a few months to a few days.