Training embodied agents in simulation has become mainstream for the embodied
AI community. However, these agents often struggle when deployed in the
physical world due to their inability to generalize to real-world environments.
In this paper, we present Phone2Proc, a method that uses a 10-minute phone scan
and conditional procedural generation to create a distribution of training
scenes that are semantically similar to the target environment. The generated
scenes are conditioned on the wall layout and arrangement of large objects from
the scan, while also sampling lighting, clutter, surface textures, and
instances of smaller objects with randomized placement and materials.
Leveraging just a simple RGB camera, training with Phone2Proc shows massive
improvements from 34.7% to 70.7% success rate in sim-to-real ObjectNav
performance across a test suite of over 200 trials in diverse real-world
environments, including homes, offices, and RoboTHOR. Furthermore, Phone2Proc's
diverse distribution of generated scenes makes agents remarkably robust to
changes in the real world, such as human movement, object rearrangement,
lighting changes, or clutter.Comment: https://allenai.org/project/phone2pro