Fine manipulation tasks, such as threading cable ties or slotting a battery,
are notoriously difficult for robots because they require precision, careful
coordination of contact forces, and closed-loop visual feedback. Performing
these tasks typically requires high-end robots, accurate sensors, or careful
calibration, which can be expensive and difficult to set up. Can learning
enable low-cost and imprecise hardware to perform these fine manipulation
tasks? We present a low-cost system that performs end-to-end imitation learning
directly from real demonstrations, collected with a custom teleoperation
interface. Imitation learning, however, presents its own challenges,
particularly in high-precision domains: errors in the policy can compound over
time, and human demonstrations can be non-stationary. To address these
challenges, we develop a simple yet novel algorithm, Action Chunking with
Transformers (ACT), which learns a generative model over action sequences. ACT
allows the robot to learn 6 difficult tasks in the real world, such as opening
a translucent condiment cup and slotting a battery, with 80-90% success from
only 10 minutes' worth of demonstrations. Project website:
https://tonyzhaozh.github.io/aloha