Given a natural language instruction and an input scene, our goal is to train
a model to output a manipulation program that can be executed by the robot.
Prior approaches to this task suffer from one of the following limitations: they (i)
rely on hand-coded symbols for concepts, limiting generalization beyond those
seen during training [1], (ii) infer action sequences from instructions but
require dense sub-goal supervision [2], or (iii) lack the semantics required for
the deeper object-centric reasoning inherent in interpreting complex instructions
[3]. In contrast, our approach handles linguistic as well as perceptual
variations, is end-to-end trainable, and requires no intermediate supervision. The
proposed model uses symbolic reasoning constructs that operate on a latent
neural object-centric representation, allowing for deeper reasoning over the
input scene. Central to our approach is a modular structure consisting of a
hierarchical instruction parser and an action simulator to learn disentangled
action representations. Our experiments in a simulated environment with a 7-DOF
manipulator, using instructions with varying numbers of steps and scenes
with different numbers of objects, demonstrate that our model is robust to such
variations and significantly outperforms baselines, particularly in the
generalization settings. The code, dataset and experiment videos are available
at https://nsrmp.github.io

Comment: International Conference on Robotics and Automation (ICRA), 202