Large language models encode a vast amount of semantic knowledge and possess
remarkable understanding and reasoning capabilities. Previous research has
explored how to ground language models in robotic tasks to ensure that the
sequences generated by the language model are both logically correct and
practically executable. However, low-level execution may deviate from the
high-level plan due to environmental perturbations or imperfect controller
design. In this paper, we propose DoReMi, a novel language model grounding
framework that enables immediate Detection and Recovery from Misalignments
between plan and execution. Specifically, LLMs are leveraged for both planning
and generating constraints for each planned step. These constraints signal
plan-execution misalignments, and we use a vision question answering (VQA) model
to check them during low-level skill execution. If a misalignment is detected,
our method calls the language model to re-plan in order to recover. Experiments
on a range of complex tasks with robot arms and humanoid robots demonstrate that
our method achieves higher task success rates and shorter task completion times.
Videos of DoReMi are available at
https://sites.google.com/view/doremi-paper.

Comment: 21 pages, 13 figures
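The detect-and-recover loop described above can be sketched as follows. This is
a minimal illustration of the abstract's pipeline, not the paper's actual
implementation; all function names (`llm_plan`, `llm_constraints`, `vqa_check`,
`execute_step`, `observe`) are hypothetical placeholders for the LLM planner,
the LLM constraint generator, the VQA checker, the low-level controller, and
the camera observation.

```python
def run_task(goal, llm_plan, llm_constraints, vqa_check, execute_step,
             observe, max_replans=3):
    """Execute a plan step by step; when the VQA model reports that a
    constraint generated for the current step is violated, ask the LLM
    to re-plan from the current observation. Returns the number of
    re-planning events."""
    plan = llm_plan(goal, observe())   # initial high-level plan
    replans = 0
    i = 0
    while i < len(plan):
        step = plan[i]
        # Constraints the LLM attaches to this step,
        # e.g. "the robot is holding the object".
        constraints = llm_constraints(step)
        execute_step(step)             # low-level skill execution
        # Check every constraint against the current observation.
        violated = [c for c in constraints
                    if not vqa_check(observe(), c)]
        if violated and replans < max_replans:
            replans += 1
            plan = llm_plan(goal, observe())  # recover by re-planning
            i = 0
            continue
        i += 1
    return replans
```

In the full method the constraints are checked continuously during skill
execution rather than only after each step completes; the post-step check here
is a simplification.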