Structure-Encoding Auxiliary Tasks for Improved Visual Representation in Vision-and-Language Navigation
In Vision-and-Language Navigation (VLN), researchers typically take an image
encoder pre-trained on ImageNet without fine-tuning it on the environments in
which the agent will be trained or tested. However, the distribution shift between
the training images from ImageNet and the views in the navigation environments
may render the ImageNet pre-trained image encoder suboptimal. Therefore, in
this paper, we design a set of structure-encoding auxiliary tasks (SEA) that
leverage the data in the navigation environments to pre-train and improve the
image encoder. Specifically, we design and customize (1) 3D jigsaw, (2)
traversability prediction, and (3) instance classification to pre-train the
image encoder. Through rigorous ablations, we show that our SEA pre-trained
features better encode structural information of the scenes, which ImageNet
pre-trained features fail to encode properly but which is crucial for the target
navigation task. The SEA pre-trained features can be plugged directly into
existing VLN agents without any tuning. For example, on Test-Unseen
environments, the VLN agents combined with our SEA pre-trained features achieve
absolute success rate improvements of 12% for Speaker-Follower, 5% for
Env-Dropout, and 4% for AuxRN.
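The sketch below (PyTorch) is one way such auxiliary pre-training could be wired up: three task heads share a single image encoder and their losses are summed. The backbone choice, head shapes, label formats, and loss weights are illustrative assumptions, not the paper's exact SEA design.

# Minimal sketch of multi-task auxiliary pre-training for a VLN image encoder.
# Head designs, label formats, and loss weights are illustrative assumptions.
import torch.nn as nn
import torchvision.models as models

class SEAPretrainer(nn.Module):
    def __init__(self, num_jigsaw_perms=24, num_instances=1000, feat_dim=512):
        super().__init__()
        # Shared image encoder; a ResNet-18 stands in for the ImageNet backbone.
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()              # expose 512-d features
        self.encoder = backbone
        # (1) 3D jigsaw: classify which spatial permutation of views was applied.
        self.jigsaw_head = nn.Linear(feat_dim, num_jigsaw_perms)
        # (2) Traversability: predict whether the viewed direction can be entered.
        self.traverse_head = nn.Linear(feat_dim, 1)
        # (3) Instance classification: identify which scene/object instance is shown.
        self.instance_head = nn.Linear(feat_dim, num_instances)

    def forward(self, images):
        feats = self.encoder(images)
        return (self.jigsaw_head(feats),
                self.traverse_head(feats).squeeze(-1),
                self.instance_head(feats))

def sea_loss(model, images, jigsaw_y, traverse_y, instance_y, w=(1.0, 1.0, 1.0)):
    # jigsaw_y, instance_y: class indices; traverse_y: float 0/1 labels.
    jig, trav, inst = model(images)
    return (w[0] * nn.functional.cross_entropy(jig, jigsaw_y)
            + w[1] * nn.functional.binary_cross_entropy_with_logits(trav, traverse_y)
            + w[2] * nn.functional.cross_entropy(inst, instance_y))

In the actual method the supervision comes from the navigation environments themselves, which is what lets the encoder pick up structural cues that ImageNet pre-training misses.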
Bridging the visual gap in VLN via semantically richer instructions
The Vision-and-Language Navigation (VLN) task requires understanding a
textual instruction to navigate a natural indoor environment using only visual
information. While this is a trivial task for most humans, it is still an open
problem for AI models. In this work, we hypothesize that poor use of the visual
information available is at the core of the low performance of current models.
To support this hypothesis, we provide experimental evidence showing that
state-of-the-art models are not severely affected when they receive only
limited or even no visual data, indicating strong overfitting to the textual
instructions. To encourage a more suitable use of the visual information, we
propose a new data augmentation method that fosters the inclusion of more
explicit visual information in the generation of textual navigational
instructions. Our main intuition is that current VLN datasets include textual
instructions that are intended to inform an expert navigator, such as a human,
but not a beginner visual navigational agent, such as a randomly initialized DL
model. Specifically, to bridge the visual semantic gap of current VLN datasets,
we take advantage of metadata available for the Matterport3D dataset that,
among other things, includes the labels of objects present in the scenes.
Training a state-of-the-art model with the new set of instructions increases
its performance by 8% in terms of success rate on unseen environments,
demonstrating the advantages of the proposed data augmentation method.
Comment: Accepted in ECCV 2022. Research completed on November 21, 202
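The sketch below illustrates the augmentation idea in its simplest form: append explicit object landmarks, drawn from scene metadata, to an existing instruction. The helper enrich_instruction, the metadata format, and the sentence template are hypothetical; the paper's actual generation procedure is more involved.

# Minimal sketch of instruction augmentation with object labels from scene
# metadata (e.g., Matterport3D annotations). Names and templates are hypothetical.
import random

def enrich_instruction(instruction, objects_along_path, max_objects=2, seed=0):
    """Append explicit visual landmarks to a textual navigation instruction.

    objects_along_path: labels of objects visible along the ground-truth path,
    e.g. ["sofa", "dining table", "lamp"].
    """
    if not objects_along_path:
        return instruction
    rng = random.Random(seed)
    picks = rng.sample(objects_along_path, min(max_objects, len(objects_along_path)))
    hint = " You will pass " + " and ".join("a " + obj for obj in picks) + "."
    return instruction.rstrip(". ") + "." + hint

# Hypothetical usage:
print(enrich_instruction(
    "Walk past the kitchen and stop at the bedroom door",
    ["refrigerator", "sofa", "lamp"]))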
Evolving Graphical Planner: Contextual Global Planning for Vision-and-Language Navigation
The ability to perform effective planning is crucial for building an
instruction-following agent. When navigating through a new environment, an
agent is challenged with (1) connecting the natural language instructions with
its progressively growing knowledge of the world; and (2) performing long-range
planning and decision making in the form of effective exploration and error
correction. Current methods are still limited on both fronts despite extensive
efforts. In this paper, we introduce the Evolving Graphical Planner (EGP), a
model that performs global planning for navigation based on raw sensory input.
The model dynamically constructs a graphical representation, generalizes the
action space to allow for more flexible decision making, and performs efficient
planning on a proxy graph representation. We evaluate our model on a
challenging Vision-and-Language Navigation (VLN) task with photorealistic
images and achieve superior performance compared to previous navigation
architectures. For instance, we obtain a 53% success rate on the test split of
the Room-to-Room navigation task through pure imitation learning, outperforming
previous navigation architectures by up to 5%.
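The sketch below conveys the core idea: as the agent moves, newly observed viewpoints are added to a graph, and a learned scorer picks the next sub-goal from the whole frontier rather than only the current node's neighbours. Class names, node features, and the bilinear scorer are assumptions, not the EGP architecture itself.

# Minimal sketch of an evolving navigation graph with global candidate scoring
# (PyTorch). Structure and scoring are illustrative assumptions.
import torch
import torch.nn as nn

class EvolvingGraph:
    def __init__(self):
        self.features = {}   # node id -> observation feature, shape (1, feat_dim)
        self.edges = {}      # node id -> set of connected node ids

    def add_node(self, node_id, feature, neighbors=()):
        self.features[node_id] = feature
        self.edges.setdefault(node_id, set())
        for n in neighbors:
            self.edges.setdefault(n, set()).add(node_id)
            self.edges[node_id].add(n)

    def frontier(self, visited):
        # Nodes that have been observed but not yet visited.
        return [n for n in self.features if n not in visited]

class GlobalPlanner(nn.Module):
    """Scores every frontier node against the instruction context."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.scorer = nn.Bilinear(feat_dim, feat_dim, 1)

    def forward(self, instruction_ctx, graph, visited):
        candidates = graph.frontier(visited)
        scores = torch.cat(
            [self.scorer(instruction_ctx, graph.features[n]) for n in candidates],
            dim=0).squeeze(-1)
        # The highest-scoring node becomes the next navigation sub-goal.
        return candidates[int(scores.argmax())]

Scoring the entire frontier is, roughly, what generalizing the action space for more flexible decision making amounts to: backtracking to an earlier unvisited viewpoint is just another candidate action.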