We approach the problem of learning by watching humans in the wild. While
traditional approaches in Imitation and Reinforcement Learning are promising
for learning in the real world, they are either sample inefficient or are
constrained to lab settings. Meanwhile, there has been a lot of success in
processing passive, unstructured human data. We propose tackling this problem
via an efficient one-shot robot learning algorithm, centered around learning
from a third-person perspective. We call our method WHIRL: In-the-Wild Human
Imitating Robot Learning. WHIRL extracts a prior over the intent of the human
demonstrator, using it to initialize our agent's policy. We introduce an
efficient real-world policy learning scheme that improves using interactions.
Our key contributions are a simple sampling-based policy optimization approach,
a novel objective function for aligning human and robot videos as well as an
exploration method to boost sample efficiency. We show one-shot generalization
and success in real-world settings, including 20 different manipulation tasks
in the wild. Videos and talk at https://human2robot.github.ioComment: Published at RSS 2022. Demos at https://human2robot.github.i