General Evaluation for Instruction Conditioned Navigation using Dynamic Time Warping
In instruction conditioned navigation, agents interpret natural language and
their surroundings to navigate through an environment. Datasets for studying
this task typically contain pairs of these instructions and reference
trajectories. Yet, most evaluation metrics used thus far fail to properly
account for the latter, relying instead on insufficient similarity comparisons.
We address fundamental flaws in previously used metrics and show how Dynamic
Time Warping (DTW), a long-known method for measuring similarity between two
time series, can be used to evaluate navigation agents. To this end, we define
the normalized Dynamic Time Warping (nDTW) metric, which softly penalizes
deviations from the reference path, is naturally sensitive to the order of the
nodes composing each path, is suited for both continuous and graph-based
evaluations, and can be efficiently calculated. Further, we define SDTW, which
constrains nDTW to only successful paths. We collect human similarity judgments
for simulated paths and find nDTW correlates better with human rankings than
all other metrics. We also demonstrate that using nDTW as a reward signal for
Reinforcement Learning navigation agents improves their performance on both the
Room-to-Room (R2R) and Room-for-Room (R4R) datasets. The R4R results in
particular highlight the superiority of SDTW over previous success-constrained
metrics.
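For readers who want to experiment with the metric, a minimal sketch follows (it is not the authors' reference implementation, and the success_threshold parameter and exact normalization are assumptions based on the description above): DTW is computed by dynamic programming over pairwise Euclidean distances, nDTW maps the accumulated cost into (0, 1], and SDTW zeroes the score for unsuccessful episodes.

import numpy as np

def dtw(reference, prediction):
    # Classic O(|R||Q|) dynamic-programming DTW over pairwise Euclidean distances.
    r, q = len(reference), len(prediction)
    cost = np.full((r + 1, q + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, r + 1):
        for j in range(1, q + 1):
            d = np.linalg.norm(np.asarray(reference[i - 1]) - np.asarray(prediction[j - 1]))
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[r, q]

def ndtw(reference, prediction, success_threshold=3.0):
    # Softly penalize deviations from the reference path; 1.0 means a perfect match.
    return float(np.exp(-dtw(reference, prediction) / (len(reference) * success_threshold)))

def sdtw(reference, prediction, success_threshold=3.0):
    # Constrain nDTW to successful episodes: zero unless the agent stops near the goal.
    goal_dist = np.linalg.norm(np.asarray(reference[-1]) - np.asarray(prediction[-1]))
    return ndtw(reference, prediction, success_threshold) if goal_dist <= success_threshold else 0.0

Here success_threshold would be the task's success radius (for example, the 3 m radius used in R2R), and dividing by the reference length keeps longer reference paths from being penalized disproportionately.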
VALAN: Vision and Language Agent Navigation
VALAN is a lightweight and scalable software framework for deep reinforcement
learning based on the SEED RL architecture. The framework facilitates the
development and evaluation of embodied agents for solving grounded language
understanding tasks, such as Vision-and-Language Navigation and
Vision-and-Dialog Navigation, in photo-realistic environments, such as
Matterport3D and Google StreetView. We have added a minimal set of abstractions
on top of SEED RL allowing us to generalize the architecture to solve a variety
of other RL problems. In this article, we describe VALAN's software
abstractions and architecture, and present an example of using VALAN to design
agents for instruction-conditioned indoor navigation.
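VALAN's actual abstractions are not reproduced here; the sketch below uses hypothetical class and method names only to illustrate the SEED RL-style split the framework builds on, in which actors step environments and stream trajectories to a central learner that updates the policy.

import queue

class Actor:
    # Steps an environment with the current policy and ships full trajectories to the learner.
    def __init__(self, env, policy, trajectory_queue):
        self.env, self.policy, self.queue = env, policy, trajectory_queue

    def run_episode(self):
        obs, done, trajectory = self.env.reset(), False, []
        while not done:
            action = self.policy.act(obs)                 # in SEED RL, inference is centralized
            next_obs, reward, done, _ = self.env.step(action)
            trajectory.append((obs, action, reward))
            obs = next_obs
        self.queue.put(trajectory)

class Learner:
    # Consumes trajectories and applies (for example) a policy-gradient update.
    def __init__(self, policy, trajectory_queue):
        self.policy, self.queue = policy, trajectory_queue

    def train_step(self):
        self.policy.update(self.queue.get())

# Usage sketch: q = queue.Queue(); Actor(env, policy, q).run_episode(); Learner(policy, q).train_step()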
Evolving Graphical Planner: Contextual Global Planning for Vision-and-Language Navigation
The ability to perform effective planning is crucial for building an
instruction-following agent. When navigating through a new environment, an
agent is challenged with (1) connecting the natural language instructions with
its progressively growing knowledge of the world; and (2) performing long-range
planning and decision making in the form of effective exploration and error
correction. Current methods are still limited on both fronts despite extensive
efforts. In this paper, we introduce the Evolving Graphical Planner (EGP), a
model that performs global planning for navigation based on raw sensory input.
The model dynamically constructs a graphical representation, generalizes the
action space to allow for more flexible decision making, and performs efficient
planning on a proxy graph representation. We evaluate our model on a
challenging Vision-and-Language Navigation (VLN) task with photorealistic
images and achieve superior performance compared to previous navigation
architectures. For instance, we achieve a 53% success rate on the test split of
the Room-to-Room navigation task through pure imitation learning, outperforming
previous navigation architectures by up to 5%.
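The planner itself is learned end to end; purely as an illustration (all names below are hypothetical), the toy structure sketches the underlying idea: a graph that grows as new viewpoints are observed, whose observed-but-unvisited nodes form a global action space that an instruction-conditioned scorer can plan over.

class EvolvingGraph:
    def __init__(self):
        self.nodes = {}        # node_id -> feature vector
        self.edges = {}        # node_id -> set of neighbouring node_ids
        self.visited = set()

    def expand(self, node_id, feature, neighbours):
        # Add the current viewpoint and any newly observed neighbours to the graph.
        self.nodes[node_id] = feature
        self.edges.setdefault(node_id, set()).update(neighbours)
        for n in neighbours:
            self.edges.setdefault(n, set()).add(node_id)
        self.visited.add(node_id)

    def candidate_actions(self):
        # Global action space: every observed-but-unvisited node, not only the adjacent
        # ones, which is what allows long-range exploration and error correction.
        return [n for n in self.edges if n not in self.visited]

def plan_next(graph, score_fn):
    # score_fn stands in for a learned scorer conditioned on the instruction.
    candidates = graph.candidate_actions()
    return max(candidates, key=score_fn) if candidates else None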
Multimodal Attention Networks for Low-Level Vision-and-Language Navigation
Vision-and-Language Navigation (VLN) is a challenging task in which an agent needs to follow a language-specified path to reach a target destination. The goal gets even harder as the actions available to the agent get simpler and move towards low-level, atomic interactions with the environment. This setting is known as low-level VLN. In this paper, we strive to create an agent able to tackle three key issues: multi-modality, long-term dependencies, and adaptability to different locomotive settings. To that end, we devise "Perceive, Transform, and Act" (PTA): a fully-attentive VLN architecture that leaves the recurrent approach behind and is the first Transformer-like architecture to incorporate three different modalities -- natural language, images, and low-level actions for agent control. In particular, we adopt an early fusion strategy to merge linguistic and visual information efficiently in our encoder. We then propose to refine the decoding phase with a late fusion extension between the agent's history of actions and the perceptual modalities. We experimentally validate our model on two datasets: PTA achieves promising results in low-level VLN on R2R and good performance on the recently proposed R4R benchmark. Our code is publicly available at https://github.com/aimagelab/perceive-transform-and-act
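As a rough illustration of the early-fusion step described above (dimensions and module names are assumptions rather than PTA's actual code), the PyTorch sketch below projects language tokens and image features into a shared width and lets a single Transformer encoder attend over the concatenated sequence.

import torch
import torch.nn as nn

class EarlyFusionEncoder(nn.Module):
    def __init__(self, lang_dim=768, img_dim=2048, d_model=512, nhead=8, num_layers=4):
        super().__init__()
        self.lang_proj = nn.Linear(lang_dim, d_model)
        self.img_proj = nn.Linear(img_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, lang_tokens, img_tokens):
        # lang_tokens: (B, L, lang_dim), img_tokens: (B, V, img_dim)
        fused = torch.cat([self.lang_proj(lang_tokens), self.img_proj(img_tokens)], dim=1)
        return self.encoder(fused)     # (B, L + V, d_model), jointly contextualized

The late-fusion decoding stage would then attend over this joint representation together with the agent's action history.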
Retouchdown: Adding Touchdown to StreetLearn as a Shareable Resource for Language Grounding Tasks in Street View
The Touchdown dataset (Chen et al., 2019) provides instructions by human
annotators for navigation through New York City streets and for resolving
spatial descriptions at a given location. To enable the wider research
community to work effectively with the Touchdown tasks, we are publicly
releasing the 29k raw Street View panoramas needed for Touchdown. We follow the
process used for the StreetLearn data release (Mirowski et al., 2019) to check
panoramas for personally identifiable information and blur them as necessary.
These have been added to the StreetLearn dataset and can be obtained via the
same process as used previously for StreetLearn. We also provide a reference
implementation for both of the Touchdown tasks: vision and language navigation
(VLN) and spatial description resolution (SDR). We compare our model results to
those given in Chen et al. (2019) and show that the panoramas we have added to
StreetLearn fully support both Touchdown tasks and can be used effectively for
further research and comparison.
Sim-to-Real Transfer for Vision-and-Language Navigation
We study the challenging problem of releasing a robot in a previously unseen
environment, and having it follow unconstrained natural language navigation
instructions. Recent work on the task of Vision-and-Language Navigation (VLN)
has achieved significant progress in simulation. To assess the implications of
this work for robotics, we transfer a VLN agent trained in simulation to a
physical robot. To bridge the gap between the high-level discrete action space
learned by the VLN agent, and the robot's low-level continuous action space, we
propose a subgoal model to identify nearby waypoints, and use domain
randomization to mitigate visual domain differences. For accurate sim and real
comparisons in parallel environments, we annotate a 325 m² office space with
1.3 km of navigation instructions, and create a digitized replica in simulation.
We find that sim-to-real transfer to an environment not seen in training is
successful if an occupancy map and navigation graph can be collected and
annotated in advance (success rate of 46.8% vs. 55.9% in sim), but much more
challenging in the hardest setting with no prior mapping at all (success rate
of 22.5%).
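The subgoal model in the paper is learned; purely to illustrate the waypoint idea, the toy function below (hypothetical names and parameters) scans a local occupancy grid for headings whose straight-line path is free and returns the farthest free cell along each one as a candidate waypoint for a low-level controller.

import numpy as np

def candidate_waypoints(occupancy, agent_rc, radius_cells=8, num_headings=12):
    # occupancy: 2D array with 0 = free and 1 = occupied; agent_rc: (row, col) of the agent.
    rows, cols = occupancy.shape
    waypoints = []
    for k in range(num_headings):
        theta = 2 * np.pi * k / num_headings
        blocked = False
        for step in range(1, radius_cells + 1):
            r = int(round(agent_rc[0] + step * np.sin(theta)))
            c = int(round(agent_rc[1] + step * np.cos(theta)))
            if not (0 <= r < rows and 0 <= c < cols) or occupancy[r, c]:
                blocked = True
                break
        if not blocked:
            waypoints.append((r, c))   # farthest free cell along this heading
    return waypoints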
Structured Scene Memory for Vision-Language Navigation
Recently, numerous algorithms have been developed to tackle the problem of
vision-language navigation (VLN), i.e., requiring an agent to navigate 3D
environments by following linguistic instructions. However, current VLN
agents simply store their past experiences/observations as latent states in
recurrent networks, failing to capture environment layouts or to perform
long-term planning. To address these limitations, we propose an architecture
called Structured Scene Memory (SSM). It is compartmentalized enough to
accurately memorize the percepts during navigation. It also serves as a
structured scene representation, which captures and disentangles visual and
geometric cues in the environment. SSM has a collect-read controller that
adaptively collects information for supporting current decision making and
mimics iterative algorithms for long-range reasoning. As SSM provides a
complete action space, i.e., all the navigable places on the map, a
frontier-exploration based navigation decision making strategy is introduced to
enable efficient and global planning. Experiment results on two VLN datasets
(i.e., R2R and R4R) show that our method achieves state-of-the-art performance
on several metrics. Accepted at CVPR 2021; the implementation will be available
at https://github.com/HanqingWangAI/SSM-VL
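The collect-read controller is learned; the sketch below (hypothetical names) only illustrates the frontier-style decision making the abstract describes: score every navigable node on the topological map globally, then take the first step of the shortest path toward the best-scoring one, which is what permits long-range replanning and error correction.

from collections import deque

def next_step(adjacency, current, node_scores):
    # adjacency: node -> iterable of neighbouring nodes; node_scores: node -> learned score.
    target = max(node_scores, key=node_scores.get)
    # Breadth-first search from the current node, recording parents to recover the first move.
    parents, frontier = {current: None}, deque([current])
    while frontier:
        node = frontier.popleft()
        if node == target:
            break
        for nbr in adjacency.get(node, ()):
            if nbr not in parents:
                parents[nbr] = node
                frontier.append(nbr)
    if target not in parents:
        return None                     # target is unreachable on the current map
    step = target
    while parents[step] is not None and parents[step] != current:
        step = parents[step]
    return step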
Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation
One of the most challenging topics in Natural Language Processing (NLP) is
visually-grounded language understanding and reasoning. Outdoor
vision-and-language navigation (VLN) is such a task where an agent follows
natural language instructions and navigates a real-life urban environment. Due
to the lack of human-annotated instructions that illustrate intricate urban
scenes, outdoor VLN remains a challenging task to solve. This paper introduces
a Multimodal Text Style Transfer (MTST) learning approach and leverages
external multimodal resources to mitigate data scarcity in outdoor navigation
tasks. We first enrich the navigation data by transferring the style of the
instructions generated by the Google Maps API, and then pre-train the navigator with
the augmented external outdoor navigation dataset. Experimental results show
that our MTST learning approach is model-agnostic and significantly outperforms
the baseline models on the outdoor VLN task, improving the task completion rate
by a relative 8.7% on the test set.
Object-and-Action Aware Model for Visual Language Navigation
Vision-and-Language Navigation (VLN) is unique in that it requires turning
relatively general natural-language instructions into robot agent actions, on
the basis of the visible environment. This requires extracting value from two
very different types of natural-language information. The first is object
descriptions (e.g., 'table', 'door'), each serving as a cue for the agent to
determine the next action by locating the corresponding item in the environment;
the second is action specifications (e.g., 'go straight', 'turn left'), which
allow the robot to directly predict its next movements without relying on
visual perception. However, most existing methods pay little attention to
distinguishing these two kinds of information during instruction encoding, and
they conflate the matching between textual object/action encodings and the
visual perception/orientation features of candidate viewpoints. In this paper, we
propose an Object-and-Action Aware Model (OAAM) that processes these two
different forms of natural-language instruction separately. This enables each
process to flexibly match object-centered or action-centered instructions to
their visual-perception or action-orientation counterparts. However, one side
effect of this design is that an object mentioned in the instructions may be
visible in the direction of two or more candidate viewpoints, so the OAAM may
not predict the viewpoint on the shortest path as the next action. To
handle this problem, we design a simple but effective path loss to penalize
trajectories deviating from the ground truth path. Experimental results
demonstrate the effectiveness of the proposed model and path loss, and the
superiority of their combination with a 50% SPL score on the R2R dataset and a
40% CLS score on the R4R dataset in unseen environments, outperforming the
previous state-of-the-art.
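The exact path loss is defined in the paper; the following is only a minimal sketch of one way such a deviation penalty can be computed (the specific formulation and names are assumptions): average, over the predicted trajectory, each visited viewpoint's distance to the nearest node of the ground-truth path, so that off-path detours are penalized even when the agent eventually reaches the goal.

import numpy as np

def path_deviation_loss(predicted_path, ground_truth_path):
    pred = np.asarray(predicted_path, dtype=float)      # (M, 3) viewpoint coordinates
    gt = np.asarray(ground_truth_path, dtype=float)     # (N, 3) node coordinates
    dists = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)   # (M, N) pairwise
    return float(dists.min(axis=1).mean())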
Bayesian State Tracking and Sim-to-Real Transfer for Vision-and-Language Navigation
A visually-grounded navigation instruction can be interpreted as a sequence of expected observations and actions an agent following the correct trajectory would encounter and perform. Based on this intuition, we formulate the problem of finding the goal location in Vision-and-Language Navigation (VLN) within the framework of Bayesian state tracking -- learning observation and motion models conditioned on these expectable events. Together with a mapper that constructs a semantic spatial map on-the-fly during navigation, we formulate an end-to-end differentiable Bayes filter and train it to identify the goal by predicting the most likely trajectory through the map according to the instructions. The resulting navigation policy constitutes a new approach to instruction following that explicitly models a probability distribution over states, encoding strong geometric and algorithmic priors while enabling greater explainability. Our experiments show that our approach outperforms a strong LingUNet baseline when predicting the goal location on the map. On the full VLN task, i.e. navigating to the goal location, our approach achieves promising results with less reliance on navigation constraints.
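As a hand-written analogue of the learned, end-to-end differentiable filter described above (the real model uses learned observation and motion modules; the kernel and likelihood below are placeholders), a single predict/update step of a discrete Bayes filter over map cells might look like this:

import numpy as np

def bayes_filter_step(belief, motion_kernel, observation_likelihood):
    # belief: (H, W) probabilities over map cells summing to 1;
    # motion_kernel: small (kh, kw) kernel; observation_likelihood: (H, W) per-cell likelihoods.
    kh, kw = motion_kernel.shape
    padded = np.pad(belief, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    predicted = np.zeros_like(belief)
    for i in range(kh):                 # predict: diffuse the belief with the motion model
        for j in range(kw):
            predicted += motion_kernel[i, j] * padded[i:i + belief.shape[0], j:j + belief.shape[1]]
    posterior = predicted * observation_likelihood      # update: reweight by the observation
    return posterior / (posterior.sum() + 1e-12)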
In the second half of the thesis, we study the challenging problem of releasing a robot in a previously unseen environment, and having it follow unconstrained natural language navigation instructions. Recent work on the task of VLN has achieved significant progress in simulation. To assess the implications of this work for robotics, we transfer a VLN agent trained in simulation to a physical robot. To bridge the gap between the high-level discrete action space learned by the VLN agent, and the robot's low-level continuous action space, we propose a subgoal model to identify nearby waypoints, and use domain randomization to mitigate visual domain differences. For accurate sim and real comparisons in parallel environments, we annotate a 325 m² office space with 1.3 km of navigation instructions, and create a digitized replica in simulation. We find that sim-to-real transfer to an environment not seen in training is successful if an occupancy map and navigation graph can be collected and annotated in advance (success rate of 46.8% vs. 55.9% in sim), but much more challenging in the hardest setting with no prior mapping at all (success rate of 22.5%).