Target-Grounded Graph-Aware Transformer for Aerial Vision-and-Dialog Navigation
This report details the method of the winning entry of the AVDN Challenge in
ICCV 2023. The competition addresses the Aerial Navigation from Dialog History
(ANDH) task, which requires a drone agent to associate dialog history with
aerial observations to reach the destination. For better cross-modal grounding
abilities of the drone agent, we propose a Target-Grounded Graph-Aware
Transformer (TG-GAT) framework. Concretely, TG-GAT first leverages a
graph-aware transformer to capture spatiotemporal dependencies, which benefits
navigation state tracking and robust action planning. In addition, an auxiliary visual
grounding task is devised to boost the agent's awareness of referred landmarks.
Moreover, a hybrid augmentation strategy based on large language models is
utilized to mitigate data scarcity limitations. Our TG-GAT framework won the
AVDN Challenge 2023, with 2.2% and 3.0% absolute improvements over the baseline
on SPL and SR metrics, respectively. The code is available at
https://github.com/yifeisu/avdn-challenge
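
The abstract leaves implementation details to the linked repository. Purely as a hedged illustration, the sketch below shows one way a graph-aware attention layer over navigation-graph nodes could look, with an adjacency mask biasing standard multi-head self-attention toward connected viewpoints. All class names, shapes, and hyperparameters here are assumptions for illustration, not the released TG-GAT code.

    import torch
    import torch.nn as nn

    class GraphAwareAttention(nn.Module):
        """Self-attention over navigation-graph nodes, masked by adjacency.

        Hypothetical sketch; the actual TG-GAT layer may differ (see repo).
        Assumes `adj` includes self-loops so every node attends somewhere.
        """
        def __init__(self, dim: int, num_heads: int = 8):
            super().__init__()
            self.num_heads = num_heads
            self.head_dim = dim // num_heads
            self.qkv = nn.Linear(dim, 3 * dim)
            self.proj = nn.Linear(dim, dim)

        def forward(self, nodes: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
            # nodes: (B, N, D) node features; adj: (B, N, N) 0/1 adjacency.
            B, N, D = nodes.shape
            q, k, v = self.qkv(nodes).chunk(3, dim=-1)
            q = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
            k = k.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
            v = v.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
            scores = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
            # Restrict attention to graph neighbours; disconnected pairs get -inf.
            mask = adj.unsqueeze(1) > 0  # (B, 1, N, N), broadcast over heads
            scores = scores.masked_fill(~mask, float("-inf"))
            out = (scores.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, D)
            return self.proj(out)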
Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
A robot that can carry out a natural-language instruction has been a dream
since before the Jetsons cartoon series imagined a life of leisure mediated by
a fleet of attentive robot helpers. It is a dream that remains stubbornly
distant. However, recent advances in vision and language methods have made
incredible progress in closely related areas. This is significant because a
robot interpreting a natural-language navigation instruction on the basis of
what it sees is carrying out a vision and language process that is similar to
Visual Question Answering. Both tasks can be interpreted as visually grounded
sequence-to-sequence translation problems, and many of the same methods are
applicable. To enable and encourage the application of vision and language
methods to the problem of interpreting visually-grounded navigation
instructions, we present the Matterport3D Simulator -- a large-scale
reinforcement learning environment based on real imagery. Using this simulator,
which can in future support a range of embodied vision and language tasks, we
provide the first benchmark dataset for visually-grounded natural language
navigation in real buildings -- the Room-to-Room (R2R) dataset.
Comment: CVPR 2018 Spotlight presentation
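
To make the visually grounded sequence-to-sequence framing concrete, the following is a minimal sketch of an instruction-conditioned policy: an LSTM encodes the instruction, and a decoder predicts the next action from that encoding plus the current visual feature. This is a toy rendering of the framing described in the abstract, not the authors' model; all module names and dimensions are assumptions.

    import torch
    import torch.nn as nn

    class Seq2SeqVLNAgent(nn.Module):
        """Toy visually grounded seq2seq policy: instruction tokens in, actions out."""
        def __init__(self, vocab_size: int, num_actions: int,
                     embed_dim: int = 256, hidden_dim: int = 512,
                     visual_dim: int = 2048):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.decoder = nn.LSTMCell(visual_dim, hidden_dim)
            self.action_head = nn.Linear(hidden_dim, num_actions)

        def forward(self, instruction: torch.Tensor, visual_feats: torch.Tensor):
            # instruction: (B, T) token ids; visual_feats: (B, steps, visual_dim).
            _, (h, c) = self.encoder(self.embed(instruction))
            h, c = h[-1], c[-1]  # seed decoder state from the instruction encoding
            logits = []
            for t in range(visual_feats.size(1)):
                h, c = self.decoder(visual_feats[:, t], (h, c))
                logits.append(self.action_head(h))
            return torch.stack(logits, dim=1)  # (B, steps, num_actions)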
Embodied Question Answering
We present a new AI task -- Embodied Question Answering (EmbodiedQA) -- where
an agent is spawned at a random location in a 3D environment and asked a
question ("What color is the car?"). In order to answer, the agent must first
intelligently navigate to explore the environment, gather information through
first-person (egocentric) vision, and then answer the question ("orange").
This challenging task requires a range of AI skills -- active perception,
language understanding, goal-driven navigation, commonsense reasoning, and
grounding of language into actions. In this work, we develop the environments,
end-to-end-trained reinforcement learning agents, and evaluation protocols for
EmbodiedQA.
Comment: 20 pages, 13 figures, Webpage: https://embodiedqa.org
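
The navigate-then-answer protocol maps naturally onto a short episode loop. The sketch below assumes hypothetical env, nav_policy, and answerer interfaces purely for illustration; it is not the authors' evaluation code.

    # Hypothetical episode loop for the navigate-then-answer protocol above.
    def run_eqa_episode(env, nav_policy, answerer, question, max_steps=100):
        obs = env.reset()  # agent spawned at a random location in the 3D scene
        for _ in range(max_steps):
            action = nav_policy(obs, question)  # e.g. forward / turn / STOP
            if action == "STOP":
                break
            obs = env.step(action)  # gather egocentric observations
        return answerer(obs, question)  # e.g. "orange"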
Exploiting Deep Semantics and Compositionality of Natural Language for Human-Robot-Interaction
We develop a natural language interface for human robot interaction that
implements reasoning about deep semantics in natural language. To realize the
required deep analysis, we employ methods from cognitive linguistics, namely
the modular and compositional framework of Embodied Construction Grammar (ECG)
[Feldman, 2009]. Using ECG, robots are able to solve fine-grained reference
resolution problems and other issues related to deep semantics and
compositionality of natural language. This also includes verbal interaction
with humans to clarify commands and queries that are too ambiguous to be
executed safely. We implement our NLU framework as a ROS package and present
proof-of-concept scenarios with different robots, as well as a survey on the
state of the art.
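
Since the abstract states the NLU framework ships as a ROS package, here is a hedged sketch of how such a package might be wired into ROS: a node subscribes to command strings and publishes either a directive or a clarification request. The topic names and the parse() stub are assumptions, not the authors' actual interface; the ECG-based deep-semantic analysis is elided.

    #!/usr/bin/env python
    # Minimal rospy node sketching one possible wiring for an NLU package.
    import rospy
    from std_msgs.msg import String

    def parse(command):
        # Stand-in for ECG-based deep-semantic analysis; by assumption it
        # returns None when the command is too ambiguous to execute safely.
        return {"action": "unknown", "raw": command}

    def on_command(msg, pub):
        directive = parse(msg.data)
        if directive is None:
            pub.publish(String("Could you clarify your request?"))
        else:
            pub.publish(String(str(directive)))

    if __name__ == "__main__":
        rospy.init_node("nlu_interface")
        pub = rospy.Publisher("/nlu/response", String, queue_size=10)
        rospy.Subscriber("/nlu/command", String, on_command, callback_args=pub)
        rospy.spin()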