Industrial insertion tasks are often performed repetitively with parts that
are subject to tight tolerances and prone to breakage. In this paper, we
present a safe method to learn a visuo-tactile insertion policy that is robust
to grasp pose variations while minimizing human input and collisions
between the robot and the environment. We achieve this by dividing the
insertion task into two phases. In the first align phase, we learn a
tactile-based grasp pose estimation model to align the insertion part with the
receptacle. In the second insert phase, we learn a vision-based policy to guide
the part into the receptacle. Using force-torque sensing, we also develop a
safe self-supervised data collection pipeline that limits collisions between the
part and the surrounding environment. Physical experiments on the USB insertion
task from the NIST Assembly Taskboard suggest that our approach can achieve
45/45 insertion successes across 45 different initial grasp poses, outperforming
two baselines: (1) a behavior cloning agent trained on 50 human insertion
demonstrations (1/45) and (2) an online RL policy (TD3) trained directly on the
real robot (0/45).