This paper describes a domestic service robot (DSR) that fetches everyday
objects and carries them to specified destinations according to free-form
natural language instructions. Given an instruction such as "Move the bottle on
the left side of the plate to the empty chair," the DSR is expected to identify
the bottle and the chair from multiple candidates in the environment and carry
the target object to the destination. Most existing multimodal language
understanding methods are computationally impractical because they require
inference for every combination of target object candidates and destination
candidates. We propose Switching Head-Tail Funnel
UNITER, which solves the task by predicting the target object and the
destination individually using a single model. Our method is validated on a
newly built dataset consisting of object manipulation instructions and
semi-photo-realistic images captured in a standard Embodied AI simulator. The
results show that our method outperforms the baseline method in terms of
language comprehension accuracy. Furthermore, we conduct physical experiments
in which a DSR delivers standardized everyday objects in a standardized
domestic environment as requested by instructions with referring expressions.
The experimental results show that the object grasping and placing actions are
achieved with success rates of more than 90%.

Comment: Accepted for presentation at IROS202
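
The complexity argument in the abstract can be illustrated with a simple count. The sketch below is not the authors' implementation; it only compares how many model inferences are needed when every (target, destination) pair is scored jointly versus when target and destination candidates are scored individually. The function names are hypothetical, and the assumption that individual prediction costs one inference per candidate is ours, not a figure stated in the abstract.

```python
def pairwise_inference_count(num_targets: int, num_destinations: int) -> int:
    """Inferences needed when every (target, destination) pair is scored jointly."""
    return num_targets * num_destinations


def individual_inference_count(num_targets: int, num_destinations: int) -> int:
    """Inferences needed when each candidate is scored on its own.

    Assumption (not from the abstract): one inference per candidate.
    """
    return num_targets + num_destinations


if __name__ == "__main__":
    # Illustrative candidate counts only; these are not dataset statistics.
    for m, n in [(5, 5), (10, 10), (20, 30)]:
        print(
            f"targets={m}, destinations={n}: "
            f"pairwise={pairwise_inference_count(m, n)}, "
            f"individual={individual_inference_count(m, n)}"
        )
```

Under this assumption the pairwise cost grows with the product of the candidate counts, whereas the individual cost grows with their sum, which is why the gap widens as scenes become more cluttered.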