Waypoint-Based Imitation Learning for Robotic Manipulation
While imitation learning methods have seen a resurgence of interest for robotic
manipulation, the well-known problem of compounding errors continues to afflict
behavioral cloning (BC). Waypoints can help address this problem by reducing
the horizon of the learning problem for BC, and thus, the errors compounded
over time. However, waypoint labeling is underspecified, and requires
additional human supervision. Can we generate waypoints automatically without
any additional human supervision? Our key insight is that if a trajectory
segment can be approximated by linear motion, the endpoints can be used as
waypoints. We propose Automatic Waypoint Extraction (AWE) for imitation
learning, a preprocessing module that decomposes a demonstration into a minimal
set of waypoints which, when interpolated linearly, approximate the
trajectory up to a specified error threshold. AWE can be combined with any BC
algorithm, and we find that AWE can increase the success rate of
state-of-the-art algorithms by up to 25% in simulation and by 4-28% on
real-world bimanual manipulation tasks, reducing the decision-making horizon by
up to a factor of 10. Videos and code are available at
https://lucys0.github.io/awe/
Comment: The first two authors contributed equally
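As a rough illustration of the idea in this abstract, the sketch below extracts waypoints by recursively splitting a trajectory until linear interpolation between the chosen endpoints stays within an error threshold. The recursive bisection, the `interp_error` helper, and the position-only trajectory are illustrative assumptions, not the paper's exact procedure (which seeks a minimal waypoint set).
```python
import numpy as np

def interp_error(traj: np.ndarray, start: int, end: int) -> float:
    """Max deviation of traj[start:end+1] from the straight line traj[start] -> traj[end]."""
    if end - start < 2:
        return 0.0
    t = np.linspace(0.0, 1.0, end - start + 1)[:, None]
    line = (1 - t) * traj[start] + t * traj[end]
    return float(np.max(np.linalg.norm(traj[start:end + 1] - line, axis=-1)))

def extract_waypoints(traj: np.ndarray, eps: float) -> list:
    """Return waypoint indices whose linear interpolation approximates traj within eps."""
    def split(start: int, end: int) -> list:
        if interp_error(traj, start, end) <= eps:
            return [end]
        mid = (start + end) // 2
        return split(start, mid) + split(mid, end)
    return [0] + split(0, len(traj) - 1)

# Example: reduce a smooth 3-D trajectory to a handful of waypoints.
if __name__ == "__main__":
    steps = np.linspace(0, 1, 200)[:, None] * np.array([[1.0, 0.5, 0.2]])
    traj = steps + 0.01 * np.sin(10 * steps)
    print(extract_waypoints(traj, eps=0.05))
```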
An Emulator for Fine-Tuning Large Language Models using Small Language Models
Widely used language models (LMs) are typically built by scaling up a
two-stage training pipeline: a pre-training stage that uses a very large,
diverse dataset of text and a fine-tuning (sometimes, 'alignment') stage that
uses targeted examples or other specifications of desired behaviors. While it
has been hypothesized that knowledge and skills come from pre-training, and
fine-tuning mostly filters this knowledge and skillset, this intuition has not
been extensively tested. To aid in doing so, we introduce a novel technique for
decoupling the knowledge and skills gained in these two stages, enabling a
direct answer to the question, "What would happen if we combined the knowledge
learned by a large model during pre-training with the knowledge learned by a
small model during fine-tuning (or vice versa)?" Using an RL-based framework
derived from recent developments in learning from human preferences, we
introduce emulated fine-tuning (EFT), a principled and practical method for
sampling from a distribution that approximates (or 'emulates') the result of
pre-training and fine-tuning at different scales. Our experiments with EFT show
that scaling up fine-tuning tends to improve helpfulness, while scaling up
pre-training tends to improve factuality. Beyond decoupling scale, we show that
EFT enables test-time adjustment of competing behavioral traits like
helpfulness and harmlessness without additional training. Finally, a special
case of emulated fine-tuning, which we call LM up-scaling, avoids
resource-intensive fine-tuning of large pre-trained models by ensembling them
with small fine-tuned models, essentially emulating the result of fine-tuning
the large pre-trained model. Up-scaling consistently improves helpfulness and
factuality of instruction-following models in the Llama, Llama-2, and Falcon
families, without additional hyperparameters or training.
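Since the abstract describes up-scaling as ensembling a large pre-trained model with a small fine-tuned model, one plausible logit-level combination is sketched below; the exact combination rule, the `beta` knob, and the function name are assumptions of mine, not the paper's published formulation.
```python
import torch
import torch.nn.functional as F

def upscaled_next_token_logprobs(
    logits_base_large: torch.Tensor,  # (vocab,) large pre-trained model
    logits_ft_small: torch.Tensor,    # (vocab,) small fine-tuned model
    logits_base_small: torch.Tensor,  # (vocab,) small pre-trained model
    beta: float = 1.0,                # assumed strength of the emulated fine-tuning delta
) -> torch.Tensor:
    """Add the small model's fine-tuning 'delta' (fine-tuned minus base, in
    log-probability space) to the large base model, then renormalize."""
    delta = F.log_softmax(logits_ft_small, dim=-1) - F.log_softmax(logits_base_small, dim=-1)
    combined = F.log_softmax(logits_base_large, dim=-1) + beta * delta
    return F.log_softmax(combined, dim=-1)

# Decoding: sample the next token from the combined distribution, e.g.
#   probs = upscaled_next_token_logprobs(l_big, l_sft, l_small).exp()
#   next_token = torch.multinomial(probs, num_samples=1)
```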
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
While large-scale unsupervised language models (LMs) learn broad world
knowledge and some reasoning skills, achieving precise control of their
behavior is difficult due to the completely unsupervised nature of their
training. Existing methods for gaining such steerability collect human labels
of the relative quality of model generations and fine-tune the unsupervised LM
to align with these preferences, often with reinforcement learning from human
feedback (RLHF). However, RLHF is a complex and often unstable procedure, first
fitting a reward model that reflects the human preferences, and then
fine-tuning the large unsupervised LM using reinforcement learning to maximize
this estimated reward without drifting too far from the original model. In this
paper, we leverage a mapping between reward functions and optimal policies to
show that this constrained reward maximization problem can be optimized exactly
with a single stage of policy training, essentially solving a classification
problem on the human preference data. The resulting algorithm, which we call
Direct Preference Optimization (DPO), is stable, performant and computationally
lightweight, eliminating the need for fitting a reward model, sampling from the
LM during fine-tuning, or performing significant hyperparameter tuning. Our
experiments show that DPO can fine-tune LMs to align with human preferences as
well as or better than existing methods. Notably, fine-tuning with DPO exceeds
RLHF's ability to control sentiment of generations and improves response
quality in summarization and single-turn dialogue while being substantially
simpler to implement and train.
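The abstract's claim that the constrained reward-maximization problem reduces to a classification problem on preference data can be made concrete with a short sketch of a DPO-style loss; the tensor interface and the default `beta` value here are illustrative choices, not the authors' reference code.
```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # summed log pi_theta(y_w | x)
    policy_rejected_logps: torch.Tensor,  # summed log pi_theta(y_l | x)
    ref_chosen_logps: torch.Tensor,       # summed log pi_ref(y_w | x)
    ref_rejected_logps: torch.Tensor,     # summed log pi_ref(y_l | x)
    beta: float = 0.1,                    # strength of the implicit KL constraint
) -> torch.Tensor:
    """Treat preference learning as binary classification: push the policy's
    implicit reward (beta * log-ratio against the frozen reference) to rank
    the preferred response above the dispreferred one."""
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```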
Robot Fine-Tuning Made Easy: Pre-Training Rewards and Policies for Autonomous Real-World Reinforcement Learning
The pre-train and fine-tune paradigm in machine learning has had dramatic
success in a wide range of domains because the use of existing data or
pre-trained models on the internet enables quick and easy learning of new
tasks. We aim to enable this paradigm in robotic reinforcement learning,
allowing a robot to learn a new task with little human effort by leveraging
data and models from the Internet. However, reinforcement learning often
requires significant human effort in the form of manual reward specification or
environment resets, even if the policy is pre-trained. We introduce RoboFuME, a
reset-free fine-tuning system that pre-trains a multi-task manipulation policy
from diverse datasets of prior experiences and self-improves online to learn a
target task with minimal human intervention. Our key insights are to use
calibrated offline reinforcement learning techniques to ensure efficient online
fine-tuning of a pre-trained policy in the presence of distribution shift, and to
leverage pre-trained vision-language models (VLMs) to build a robust reward
classifier that autonomously provides reward signals during the online
fine-tuning process. In a diverse set of five real robot manipulation tasks, we
show that our method can incorporate data from an existing robot dataset
collected at a different institution and improve on a target task within as
little as 3 hours of autonomous real-world experience. We also demonstrate in
simulation experiments that our method outperforms prior works that use
different RL algorithms or different approaches for predicting rewards. Project
website: https://robofume.github.io
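A structural sketch of the kind of loop the abstract describes is given below: a VLM-based success classifier supplies rewards so the robot can keep improving after a single reset. All names here (`vlm_success_prob`, the `env` and `policy` interfaces) are hypothetical placeholders, not the paper's implementation.
```python
# Hypothetical interfaces; not the paper's implementation.
def vlm_success_prob(image) -> float:
    """Placeholder: a pre-trained vision-language model scores how likely the
    image shows the task completed (e.g., by matching it against a text
    description of success)."""
    raise NotImplementedError

def autonomous_finetune(env, policy, replay_buffer, num_steps: int) -> None:
    """Reset-free online fine-tuning: rewards come from the VLM classifier,
    and the policy is updated with an offline-RL-style rule on the growing buffer."""
    obs = env.reset()  # one initial reset; afterwards the robot keeps running
    for _ in range(num_steps):
        action = policy.act(obs)
        next_obs = env.step(action)                   # placeholder env returns the next observation
        reward = vlm_success_prob(next_obs["image"])  # sparse, classifier-based reward
        replay_buffer.add(obs, action, reward, next_obs)
        policy.update(replay_buffer)
        obs = next_obs
```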
Adapt On-the-Go: Behavior Modulation for Single-Life Robot Deployment
To succeed in the real world, robots must cope with situations that differ
from those seen during training. We study the problem of adapting on-the-fly to
such novel scenarios during deployment, by drawing upon a diverse repertoire of
previously learned behaviors. Our approach, RObust Autonomous Modulation
(ROAM), introduces a mechanism that uses the perceived value of pre-trained
behaviors to select and adapt them to the situation at hand.
Crucially, this entire adaptation process happens within a single episode at test
time, without any human supervision. We provide theoretical analysis of our
selection mechanism and demonstrate that ROAM enables a robot to adapt rapidly
to changes in dynamics both in simulation and on a real Go1 quadruped, even
successfully moving forward with roller skates on its feet. Our approach adapts
over 2x more efficiently than existing methods when facing a variety of
out-of-distribution situations during deployment, by effectively choosing and
adapting relevant behaviors on-the-fly.
Comment: 19 pages, 6 figures
RLVF: Learning from Verbal Feedback without Overgeneralization
The diversity of contexts in which large language models (LLMs) are deployed
requires the ability to modify or customize default model behaviors to
incorporate nuanced requirements and preferences. A convenient interface to
specify such model adjustments is high-level verbal feedback, such as "Don't
use emojis when drafting emails to my boss." However, while writing high-level
feedback is far simpler than collecting annotations for reinforcement learning
from human feedback (RLHF), we find that simply prompting a model with such
feedback leads to overgeneralization of the feedback to contexts where it is
not relevant. We study the problem of incorporating verbal feedback without
such overgeneralization and propose a new method, Contextualized Critiques with
Constrained Preference Optimization (C3PO). C3PO uses a piece of high-level
feedback to generate a small synthetic preference dataset specifying how the
feedback should (and should not) be applied. It then fine-tunes the model in
accordance with the synthetic preference data while minimizing the divergence
from the original model for prompts where the feedback does not apply. Our
experimental results indicate that our approach effectively applies verbal
feedback to relevant scenarios while preserving existing behaviors for other
contexts. For both human- and GPT-4-generated high-level feedback, C3PO
adheres to the given feedback about as well as in-context baselines
while reducing overgeneralization by 30%.
Comment: 9 pages, 9 figures
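The fine-tuning objective the abstract outlines, following the feedback where it applies while staying close to the original model elsewhere, might look roughly like the loss below; the specific preference loss, the stay-close term, and the weighting `lam` are my assumptions rather than the published method.
```python
import torch
import torch.nn.functional as F

def c3po_style_loss(
    in_scope_margin: torch.Tensor,         # beta-scaled implicit-reward margin, preferred vs. dispreferred
    out_scope_policy_logps: torch.Tensor,  # log pi_theta of the original model's completions on out-of-scope prompts
    lam: float = 1.0,                      # assumed weight on preserving existing behavior
) -> torch.Tensor:
    """Preference loss on synthetic in-scope pairs plus a term keeping the
    fine-tuned model close to the original where the feedback does not apply."""
    follow_feedback = -F.logsigmoid(in_scope_margin).mean()
    stay_close = -out_scope_policy_logps.mean()  # maximize likelihood of the original model's outputs
    return follow_feedback + lam * stay_close
```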