Vision-and-language navigation (VLN) is the task of enabling an embodied agent
to navigate to a remote location in real scenes by following a natural language
instruction. Most previous approaches represent navigable candidates with
either whole-image features or object-centric features. However, these
representations are not informative enough for an agent to perform the actions
needed to arrive at the target location. Since knowledge provides crucial
information complementary to visible content, in this paper we propose a
Knowledge Enhanced Reasoning Model (KERM) that leverages knowledge to improve
the agent's navigation ability. Specifically, we first retrieve facts (i.e.,
knowledge described by language descriptions) for the navigation views from a
constructed knowledge base, based on local regions. The retrieved facts range from
properties of a single object (e.g., color, shape) to relationships between
objects (e.g., action, spatial position), providing crucial information for
VLN. KERM comprises purification, fact-aware
interaction, and instruction-guided aggregation modules to integrate visual,
history, instruction, and fact features. The proposed KERM can automatically
select and gather crucial and relevant cues, yielding more accurate action
prediction. Experimental results on the REVERIE, R2R, and SOON datasets
demonstrate the effectiveness of the proposed method.

Comment: Accepted by CVPR 2023. The code is available at
https://github.com/XiangyangLi20/KER
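The instruction-guided aggregation described above can be pictured as similarity-weighted pooling of retrieved fact features. The sketch below is only an illustrative assumption (single-head dot-product attention with NumPy placeholders for the feature vectors), not the paper's actual implementation:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def instruction_guided_aggregation(fact_feats, instr_feat):
    """Weight each retrieved fact by its similarity to the
    instruction feature and return the weighted sum.
    fact_feats: (num_facts, dim); instr_feat: (dim,)."""
    scores = fact_feats @ instr_feat    # (num_facts,) similarity scores
    weights = softmax(scores)           # attention weights over facts
    return weights @ fact_feats         # (dim,) aggregated fact feature

# Hypothetical embeddings standing in for real fact/instruction features.
rng = np.random.default_rng(0)
facts = rng.normal(size=(5, 8))   # 5 fact embeddings of dimension 8
instr = rng.normal(size=(8,))     # one instruction embedding
agg = instruction_guided_aggregation(facts, instr)
```

Facts whose embeddings align with the instruction receive larger attention weights, so irrelevant retrieved knowledge is suppressed before action prediction.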