Vision-Language Navigation with Self-Supervised Auxiliary Reasoning Tasks
Vision-Language Navigation (VLN) is a task where agents learn to navigate
following natural language instructions. The key to this task is to perceive
both the visual scene and natural language sequentially. Conventional
approaches exploit the vision and language features in cross-modal grounding.
However, the VLN task remains challenging, since previous works have neglected
the rich semantic information contained in the environment (such as implicit
navigation graphs or sub-trajectory semantics). In this paper, we introduce
Auxiliary Reasoning Navigation (AuxRN), a framework with four self-supervised
auxiliary reasoning tasks to take advantage of the additional training signals
derived from the semantic information. The auxiliary tasks have four reasoning
objectives: explaining the previous actions, estimating the navigation
progress, predicting the next orientation, and evaluating the trajectory
consistency. As a result, these additional training signals help the agent to
acquire knowledge of semantic representations in order to reason about its
activity and build a thorough perception of the environment. Our experiments
indicate that auxiliary reasoning tasks improve both the performance of the
main task and the model generalizability by a large margin. Empirically, we
demonstrate that an agent trained with self-supervised auxiliary reasoning
tasks substantially outperforms the previous state-of-the-art method, making it the
best existing approach on the standard benchmark.
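To make the auxiliary-loss idea concrete, the following is a minimal PyTorch sketch of how four small reasoning heads and their losses could be combined with the main navigation loss. The head architectures, loss functions, and weights are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the AuxRN code) of a navigation loss combined with four
# self-supervised auxiliary objectives. Heads, losses, and weights are assumptions.
import torch
import torch.nn as nn

class AuxReasoningHeads(nn.Module):
    def __init__(self, hidden_dim=512, num_actions=6):
        super().__init__()
        # One small head per auxiliary reasoning objective.
        self.explain_head = nn.Linear(hidden_dim, num_actions)  # explain previous action
        self.progress_head = nn.Linear(hidden_dim, 1)           # estimate progress in [0, 1]
        self.angle_head = nn.Linear(hidden_dim, 2)               # predict next orientation (sin, cos)
        self.match_head = nn.Linear(hidden_dim, 1)               # trajectory-instruction consistency

    def forward(self, h):
        return (self.explain_head(h), self.progress_head(h),
                self.angle_head(h), self.match_head(h))

def total_loss(nav_logits, nav_target, heads_out, aux_targets,
               weights=(1.0, 0.5, 0.5, 0.5, 0.5)):
    """Weighted sum of the main navigation loss and the four auxiliary losses."""
    explain, progress, angle, match = heads_out
    l_nav = nn.functional.cross_entropy(nav_logits, nav_target)
    l_explain = nn.functional.cross_entropy(explain, aux_targets["prev_action"])
    l_progress = nn.functional.mse_loss(progress.squeeze(-1), aux_targets["progress"])
    l_angle = nn.functional.mse_loss(angle, aux_targets["next_angle"])
    l_match = nn.functional.binary_cross_entropy_with_logits(
        match.squeeze(-1), aux_targets["is_consistent"])
    w = weights
    return (w[0] * l_nav + w[1] * l_explain + w[2] * l_progress
            + w[3] * l_angle + w[4] * l_match)
```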
Learning Vision-and-Language Navigation from YouTube Videos
Vision-and-language navigation (VLN) requires an embodied agent to navigate
in realistic 3D environments using natural language instructions. Existing VLN
methods suffer from training on small-scale environments or unreasonable
path-instruction datasets, limiting the generalization to unseen environments.
There are massive house tour videos on YouTube, providing abundant real
navigation experiences and layout information. However, these videos have not
been explored for VLN before. In this paper, we propose to learn an agent from
these videos by creating a large-scale dataset which comprises reasonable
path-instruction pairs from house tour videos and pre-training the agent on it.
To achieve this, we have to tackle the challenges of automatically constructing
path-instruction pairs and exploiting real layout knowledge from raw and
unlabeled videos. To address these, we first leverage an entropy-based method
to construct the nodes of a path trajectory. Then, we propose an action-aware
generator for generating instructions from unlabeled trajectories. Last, we
devise a trajectory judgment pretext task to encourage the agent to mine the
layout knowledge. Experimental results show that our method achieves
state-of-the-art performance on two popular benchmarks (R2R and REVERIE). Code
is available at https://github.com/JeremyLinky/YouTube-VLN. (Accepted by ICCV 2023.)
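As a rough illustration of the entropy-based node construction step, the sketch below keeps only frames whose room-type prediction has low Shannon entropy and collapses consecutive frames with the same label into one trajectory node. The classifier outputs, threshold, and de-duplication rule are assumptions for illustration, not the paper's actual pipeline.

```python
# Minimal sketch of entropy-based frame selection: confident (low-entropy) frames
# become trajectory nodes. Threshold and merging rule are illustrative assumptions.
import numpy as np

def frame_entropy(probs, eps=1e-12):
    """Shannon entropy of a per-frame class distribution."""
    p = np.clip(probs, eps, 1.0)
    return -np.sum(p * np.log(p), axis=-1)

def select_trajectory_nodes(frame_probs, entropy_threshold=1.0):
    """Pick frames with confident predictions as path nodes, collapsing
    consecutive frames that are assigned the same room label."""
    entropies = frame_entropy(frame_probs)
    labels = frame_probs.argmax(axis=-1)
    nodes, prev_label = [], None
    for idx, (h, lab) in enumerate(zip(entropies, labels)):
        if h < entropy_threshold and lab != prev_label:
            nodes.append(idx)   # keep this frame as a new trajectory node
            prev_label = lab
    return nodes

# Example: room-type probabilities for 5 frames (each row sums to 1).
probs = np.array([[0.90, 0.05, 0.05],
                  [0.85, 0.10, 0.05],
                  [0.40, 0.30, 0.30],
                  [0.10, 0.85, 0.05],
                  [0.05, 0.90, 0.05]])
print(select_trajectory_nodes(probs))  # -> [0, 3]
```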
Evolving Graphical Planner: Contextual Global Planning for Vision-and-Language Navigation
The ability to perform effective planning is crucial for building an
instruction-following agent. When navigating through a new environment, an
agent is challenged with (1) connecting the natural language instructions with
its progressively growing knowledge of the world; and (2) performing long-range
planning and decision making in the form of effective exploration and error
correction. Current methods are still limited on both fronts despite extensive
efforts. In this paper, we introduce the Evolving Graphical Planner (EGP), a
model that performs global planning for navigation based on raw sensory input.
The model dynamically constructs a graphical representation, generalizes the
action space to allow for more flexible decision making, and performs efficient
planning on a proxy graph representation. We evaluate our model on a
challenging Vision-and-Language Navigation (VLN) task with photorealistic
images and achieve superior performance compared to previous navigation
architectures. For instance, we achieve a 53% success rate on the test split of
the Room-to-Room navigation task through pure imitation learning, outperforming
previous navigation architectures by up to 5%.
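The sketch below illustrates the general idea of a dynamically growing navigation graph: newly observed viewpoints are added as nodes and edges at each step, and planning reduces to a shortest-path search toward a chosen frontier node. The class, its API, and the breadth-first search are hypothetical simplifications; the EGP model learns how to score candidate nodes and plans on a proxy graph rather than using plain BFS.

```python
# Minimal sketch (hypothetical, not the EGP implementation) of a navigation graph
# that grows as the agent explores, with shortest-path planning toward a frontier node.
from collections import deque

class EvolvingGraph:
    def __init__(self):
        self.adj = {}          # node -> set of neighbouring nodes
        self.frontier = set()  # observed but not yet visited viewpoints

    def expand(self, current, observed_neighbors):
        """Add the current viewpoint and its newly observed neighbours."""
        self.adj.setdefault(current, set())
        self.frontier.discard(current)
        for n in observed_neighbors:
            self.adj.setdefault(n, set())
            self.adj[current].add(n)
            self.adj[n].add(current)
            if n != current:
                self.frontier.add(n)

    def plan(self, current, goal):
        """Breadth-first shortest path from the current node to a goal node."""
        queue, parents = deque([current]), {current: None}
        while queue:
            node = queue.popleft()
            if node == goal:
                path = []
                while node is not None:
                    path.append(node)
                    node = parents[node]
                return list(reversed(path))
            for nxt in self.adj.get(node, ()):
                if nxt not in parents:
                    parents[nxt] = node
                    queue.append(nxt)
        return None

# Usage: grow the graph as the agent moves, then plan toward a frontier node
# (chosen arbitrarily here; a learned policy would score the candidates).
g = EvolvingGraph()
g.expand("start", ["hall", "kitchen"])
g.expand("hall", ["bedroom"])
print(g.plan("start", "bedroom"))  # -> ['start', 'hall', 'bedroom']
```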