Robot Composite Learning and the Nunchaku Flipping Challenge
Advanced motor skills are essential for robots to physically coexist with
humans. Much research on robot dynamics and control has achieved success on
hyper robot motor capabilities, but mostly through heavily case-specific
engineering. Meanwhile, in terms of robots acquiring skills in a ubiquitous
manner, robot learning from human demonstration (LfD) has achieved great
progress, but still has limitations handling dynamic skills and compound
actions. In this paper, we present a composite learning scheme which goes
beyond LfD and integrates robot learning from human definition, demonstration,
and evaluation. The method tackles advanced motor skills that require dynamic
time-critical maneuvers, complex contact control, and handling of partly soft,
partly rigid objects. We also introduce the "nunchaku flipping challenge", an
extreme test that places hard requirements on all three of these aspects.
Continuing from our previous presentations, this paper introduces the latest
update of the composite learning scheme and the physical success of the
nunchaku flipping challenge.
Aligning Large Language Models through Synthetic Feedback
Aligning large language models (LLMs) to human values has become increasingly
important as it enables sophisticated steering of LLMs, e.g., making them
follow given instructions while keeping them less toxic. However, it requires a
significant amount of human demonstrations and feedback. Recently, open-sourced
models have attempted to replicate the alignment learning process by distilling
data from already aligned LLMs like InstructGPT or ChatGPT. While this process
reduces human effort, constructing these datasets has a heavy dependency on
the teacher models. In this work, we propose a novel framework for alignment
learning with almost no human labor and no dependency on pre-aligned LLMs.
First, we perform reward modeling (RM) with synthetic feedback by contrasting
responses from vanilla LLMs with various sizes and prompts. Then, we use the RM
for simulating high-quality demonstrations to train a supervised policy and for
further optimizing the model with reinforcement learning. Our resulting model,
Aligned Language Model with Synthetic Training dataset (ALMoST), outperforms
open-sourced models, including Alpaca, Dolly, and OpenAssistant, which are
trained on the outputs of InstructGPT or human-annotated instructions. Our
7B-sized model outperforms the 12-13B models in the A/B tests using GPT-4 as
the judge with about 75% winning rate on average.
Comment: Preprint, 9 pages (with 10 pages of supplementary
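
The synthetic-feedback step described in the abstract above can be illustrated with a minimal sketch. The abstract only says that responses from vanilla LLMs of various sizes and prompts are contrasted; the specific ranking rule used here (a larger model with more in-context demonstrations is preferred), the `Config` fields, and the pairwise logistic loss are illustrative assumptions, not the authors' implementation.

```python
import math
from itertools import combinations
from typing import NamedTuple


class Config(NamedTuple):
    # Hypothetical description of how a response was generated.
    model_size_b: float  # model size in billions of parameters
    num_shots: int       # number of in-context demonstrations in the prompt


def synthetic_pairs(responses):
    """Build (preferred, rejected) text pairs from (Config, text) tuples.

    Assumed heuristic: a response is preferred when its generator is at
    least as large AND used at least as many demonstrations, and is
    strictly better on at least one of the two. Incomparable pairs
    (each side better on one axis) are skipped.
    """
    pairs = []
    for (ca, ta), (cb, tb) in combinations(responses, 2):
        a_ge = ca.model_size_b >= cb.model_size_b and ca.num_shots >= cb.num_shots
        b_ge = cb.model_size_b >= ca.model_size_b and cb.num_shots >= ca.num_shots
        if a_ge and not b_ge:
            pairs.append((ta, tb))
        elif b_ge and not a_ge:
            pairs.append((tb, ta))
    return pairs


def pairwise_loss(reward_preferred, reward_rejected):
    """Pairwise logistic loss: -log(sigmoid(r_preferred - r_rejected))."""
    margin = reward_preferred - reward_rejected
    return math.log1p(math.exp(-margin))


if __name__ == "__main__":
    responses = [
        (Config(30, 5), "A"),
        (Config(7, 1), "B"),
        (Config(7, 5), "C"),
        (Config(30, 1), "D"),
    ]
    for preferred, rejected in synthetic_pairs(responses):
        print(preferred, ">", rejected)
```

A reward model trained on such pairs would be fit by minimizing `pairwise_loss` over its scores for each pair; the loss shrinks as the preferred response is scored further above the rejected one.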
MimicPlay: Long-Horizon Imitation Learning by Watching Human Play
Imitation learning from human demonstrations is a promising paradigm for
teaching robots manipulation skills in the real world. However, learning
complex long-horizon tasks often requires an unattainable amount of
demonstrations. To reduce the high data requirement, we resort to human play
data - video sequences of people freely interacting with the environment using
their hands. Even with different morphologies, we hypothesize that human play
data contain rich and salient information about physical interactions that can
readily facilitate robot policy learning. Motivated by this, we introduce a
hierarchical learning framework named MimicPlay that learns latent plans from
human play data to guide low-level visuomotor control trained on a small number
of teleoperated demonstrations. With systematic evaluations of 14 long-horizon
manipulation tasks in the real world, we show that MimicPlay outperforms
state-of-the-art imitation learning methods in task success rate,
generalization ability, and robustness to disturbances. Code and videos are
available at https://mimic-play.github.io
Comment: 7th Conference on Robot Learning (CoRL 2023 oral presentation
SparseDFF: Sparse-View Feature Distillation for One-Shot Dexterous Manipulation
Humans excel at transferring manipulation skills across diverse object
shapes, poses, and appearances due to their understanding of semantic
correspondences between different instances. To endow robots with a similar
high-level understanding, we develop a Distilled Feature Field (DFF) for 3D
scenes, leveraging large 2D vision models to distill semantic features from
multiview images. While current research demonstrates advanced performance in
reconstructing DFFs from dense views, the development of learning a DFF from
sparse views is relatively nascent, despite its prevalence in numerous
manipulation tasks with fixed cameras. In this work, we introduce SparseDFF, a
novel method for acquiring view-consistent 3D DFFs from sparse RGBD
observations, enabling one-shot learning of dexterous manipulations that are
transferable to novel scenes. Specifically, we map the image features to the 3D
point cloud, allowing for propagation across the 3D space to establish a dense
feature field. At the core of SparseDFF is a lightweight feature refinement
network, optimized with a contrastive loss between pairwise views after
back-projecting the image features onto the 3D point cloud. Additionally, we
implement a point-pruning mechanism to augment feature continuity within each
local neighborhood. By establishing coherent feature fields on both source and
target scenes, we devise an energy function that facilitates the minimization
of feature discrepancies w.r.t. the end-effector parameters between the
demonstration and the target manipulation. We evaluate our approach using a
dexterous hand, mastering real-world manipulations on both rigid and deformable
objects, and showcase robust generalization in the face of object and
scene-context variations.
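
The back-projection step mentioned in the abstract above (mapping image features onto a 3D point cloud from RGBD input) can be sketched as follows. This is a minimal illustration assuming a standard pinhole camera model with intrinsics `fx`, `fy`, `cx`, `cy` and points expressed in the camera frame; the function name and data layout are hypothetical and do not come from the paper.

```python
def back_project(depth, features, fx, fy, cx, cy):
    """Lift per-pixel features to 3D points using a depth map.

    depth:    2D list, depth[v][u] is the depth at pixel (u, v); values
              <= 0 are treated as missing and skipped.
    features: 2D list of the same shape holding one feature per pixel.
    Returns a list of ((x, y, z), feature) tuples in the camera frame.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # no valid depth reading at this pixel
            # Pinhole model: invert u = fx * x / z + cx, v = fy * y / z + cy.
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append(((x, y, z), features[v][u]))
    return points


if __name__ == "__main__":
    depth = [[1.0, 0.0], [2.0, 2.0]]
    features = [["f00", "f01"], ["f10", "f11"]]
    for point, feat in back_project(depth, features, 1.0, 1.0, 0.0, 0.0):
        print(point, feat)
```

Points from several sparse views would additionally need each camera's extrinsic transform to land in a shared world frame, which this sketch leaves out.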