Vision-Language Navigation with Self-Supervised Auxiliary Reasoning Tasks
Vision-Language Navigation (VLN) is a task in which agents learn to navigate
by following natural language instructions. The key to this task is to perceive
both the visual scene and the language instruction sequentially. Conventional
approaches exploit vision and language features through cross-modal grounding.
However, the VLN task remains challenging, since previous works have neglected
the rich semantic information contained in the environment (such as implicit
navigation graphs or sub-trajectory semantics). In this paper, we introduce
Auxiliary Reasoning Navigation (AuxRN), a framework with four self-supervised
auxiliary reasoning tasks to take advantage of the additional training signals
derived from the semantic information. The auxiliary tasks have four reasoning
objectives: explaining the previous actions, estimating the navigation
progress, predicting the next orientation, and evaluating the trajectory
consistency. These additional training signals help the agent acquire semantic
representations with which to reason about its own activity and build a
thorough perception of the environment. Our experiments indicate that the
auxiliary reasoning tasks improve both main-task performance and model
generalizability by a large margin. Empirically, we
demonstrate that an agent trained with self-supervised auxiliary reasoning
tasks substantially outperforms the previous state-of-the-art method, making it
the best existing approach on the standard benchmark.
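
To make the four reasoning objectives concrete, the sketch below shows how such
auxiliary heads might share the agent's hidden state, in PyTorch. This is a
minimal illustration under our own naming, not the authors' implementation;
the head structure, the loss weights, and the choice to omit the sequence
decoder for the action-explaining task are all assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxiliaryHeads(nn.Module):
    """Self-supervised heads over the agent's hidden state (a sketch)."""
    def __init__(self, hidden_dim, num_orientations=12):
        super().__init__()
        # (1) Explaining previous actions would need a full sequence decoder
        #     (speaker-style instruction reconstruction); omitted here.
        # (2) Progress estimation: regress normalized progress in [0, 1].
        self.progress_head = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())
        # (3) Next-orientation prediction over discretized headings.
        self.orientation_head = nn.Linear(hidden_dim, num_orientations)
        # (4) Trajectory-instruction consistency: a binary matching score.
        self.matching_head = nn.Linear(hidden_dim, 1)

    def forward(self, h):
        return {
            "progress": self.progress_head(h).squeeze(-1),
            "orientation_logits": self.orientation_head(h),
            "matching_logit": self.matching_head(h).squeeze(-1),
        }

def auxiliary_loss(out, tgt, w=(1.0, 1.0, 1.0)):
    # Weighted sum of the auxiliary losses; the weights are hypothetical.
    return (w[0] * F.mse_loss(out["progress"], tgt["progress"])
            + w[1] * F.cross_entropy(out["orientation_logits"], tgt["orientation"])
            + w[2] * F.binary_cross_entropy_with_logits(out["matching_logit"],
                                                        tgt["is_consistent"]))

In a setup like this, the auxiliary loss is simply added to the main navigation
loss, so the extra supervision shapes the shared hidden state without changing
the agent's action interface.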
From Seeing to Moving: A Survey on Learning for Visual Indoor Navigation (VIN)
The Visual Indoor Navigation (VIN) task has drawn increasing attention from the
data-driven machine learning communities, especially with the recently reported
success of learning-based methods. Due to the innate complexity of this task,
researchers have tried approaching the problem from a variety of different
angles, the full scope of which has not yet been captured within an overarching
report. This survey first summarizes representative work on learning-based
approaches to the VIN task, then identifies and discusses lingering issues that
impede VIN performance, and motivates future research in key areas worth
exploring for the community.
Structure-Encoding Auxiliary Tasks for Improved Visual Representation in Vision-and-Language Navigation
In Vision-and-Language Navigation (VLN), researchers typically use an image
encoder pre-trained on ImageNet without fine-tuning it on the environments in
which the agent will be trained or tested. However, the distribution shift between
the training images from ImageNet and the views in the navigation environments
may render the ImageNet pre-trained image encoder suboptimal. Therefore, in
this paper, we design a set of structure-encoding auxiliary tasks (SEA) that
leverage the data in the navigation environments to pre-train and improve the
image encoder. Specifically, we design and customize (1) 3D jigsaw, (2)
traversability prediction, and (3) instance classification to pre-train the
image encoder. Rigorous ablations show that our SEA pre-trained features better
encode the structural information of the scenes, information that ImageNet
pre-trained features fail to encode properly but that is crucial for the target
navigation task. The SEA pre-trained features can be easily plugged into
existing VLN agents without any tuning. For example, on Test-Unseen
environments, VLN agents combined with our SEA pre-trained features achieve
absolute success-rate improvements of 12% for Speaker-Follower, 5% for
Env-Dropout, and 4% for AuxRN.
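
As a rough illustration of how the three pre-training tasks could share one
image encoder, here is a hedged PyTorch sketch. The ResNet-18 backbone, the
head dimensions, and the label shapes are assumptions made for the example,
not details taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class SEAPretrainer(nn.Module):
    """One shared encoder with three structure-encoding heads (a sketch)."""
    def __init__(self, num_permutations=24, num_headings=12, num_instances=1000):
        super().__init__()
        backbone = models.resnet18(weights=None)   # backbone choice is assumed
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()                # keep pooled features only
        self.encoder = backbone
        # (1) 3D jigsaw: classify the spatial permutation of shuffled views.
        self.jigsaw_head = nn.Linear(feat_dim, num_permutations)
        # (2) Traversability: per-heading navigable / not-navigable logits.
        self.traversability_head = nn.Linear(feat_dim, num_headings)
        # (3) Instance classification over environment-specific instances.
        self.instance_head = nn.Linear(feat_dim, num_instances)

    def forward(self, images):
        f = self.encoder(images)                   # (B, feat_dim)
        return (self.jigsaw_head(f),
                self.traversability_head(f),
                self.instance_head(f))

# A hypothetical pre-training step combining the three losses.
model = SEAPretrainer()
images = torch.randn(4, 3, 224, 224)
jig, trav, inst = model(images)
loss = (F.cross_entropy(jig, torch.randint(0, 24, (4,)))
        + F.binary_cross_entropy_with_logits(trav, torch.randint(0, 2, (4, 12)).float())
        + F.cross_entropy(inst, torch.randint(0, 1000, (4,))))
loss.backward()

After pre-training, only the encoder would be kept and its features handed to
an existing VLN agent, consistent with the abstract's claim that the SEA
features can be plugged in without any tuning.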
- …