Humans are remarkably flexible when understanding new sentences that include
combinations of concepts they have never encountered before. Recent work has
shown that while deep networks can mimic some human language abilities when
presented with novel sentences, systematic variation uncovers the limitations
in the language-understanding abilities of neural networks. We demonstrate that
these limitations can be overcome by addressing the generalization challenges
in a recently-released dataset, gSCAN, which explicitly measures how well a
robotic agent is able to interpret novel ideas grounded in vision, e.g., novel
pairings of adjectives and nouns. The key principle we employ is
compositionality: that the compositional structure of networks should reflect
the compositional structure of the problem domain they address, while allowing
all other parameters and properties to be learned end-to-end with weak
supervision. We build a general-purpose mechanism that enables robots to
generalize their language understanding to compositional domains. Crucially,
our base network has the same state-of-the-art performance as prior work, 97%
execution accuracy, while at the same time generalizing its knowledge when
prior work does not; for example, achieving 95% accuracy on novel
adjective-noun compositions where previous work has 55% average accuracy.
Robust language understanding without dramatic failures and without corner
causes is critical to building safe and fair robots; we demonstrate the
significant role that compositionality can play in achieving that goal