1,109 research outputs found
Learning with Latent Language
The named concepts and compositional operators present in natural language
provide a rich source of information about the kinds of abstractions humans use
to navigate the world. Can this linguistic background knowledge improve the
generality and efficiency of learned classifiers and control policies? This
paper aims to show that using the space of natural language strings as a
parameter space is an effective way to capture natural task structure. In a
pretraining phase, we learn a language interpretation model that transforms
inputs (e.g. images) into outputs (e.g. labels) given natural language
descriptions. To learn a new concept (e.g. a classifier), we search directly in
the space of descriptions to minimize the interpreter's loss on training
examples. Crucially, our models do not require language data to learn these
concepts: language is used only in pretraining to impose structure on
subsequent learning. Results on image classification, text editing, and
reinforcement learning show that, in all settings, models with a linguistic
parameterization outperform those without
Gated-Attention Architectures for Task-Oriented Language Grounding
To perform tasks specified by natural language instructions, autonomous
agents need to extract semantically meaningful representations of language and
map it to visual elements and actions in the environment. This problem is
called task-oriented language grounding. We propose an end-to-end trainable
neural architecture for task-oriented language grounding in 3D environments
which assumes no prior linguistic or perceptual knowledge and requires only raw
pixels from the environment and the natural language instruction as input. The
proposed model combines the image and text representations using a
Gated-Attention mechanism and learns a policy to execute the natural language
instruction using standard reinforcement and imitation learning methods. We
show the effectiveness of the proposed model on unseen instructions as well as
unseen maps, both quantitatively and qualitatively. We also introduce a novel
environment based on a 3D game engine to simulate the challenges of
task-oriented language grounding over a rich set of instructions and
environment states.Comment: To appear in AAAI-1
Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions
A long-term goal of AI research is to build intelligent agents that can
communicate with humans in natural language, perceive the environment, and
perform real-world tasks. Vision-and-Language Navigation (VLN) is a fundamental
and interdisciplinary research topic towards this goal, and receives increasing
attention from natural language processing, computer vision, robotics, and
machine learning communities. In this paper, we review contemporary studies in
the emerging field of VLN, covering tasks, evaluation metrics, methods, etc.
Through structured analysis of current progress and challenges, we highlight
the limitations of current VLN and opportunities for future work. This paper
serves as a thorough reference for the VLN research community.Comment: 19 pages. Accepted to ACL 202
Embodied Multimodal Multitask Learning
Recent efforts on training visual navigation agents conditioned on language
using deep reinforcement learning have been successful in learning policies for
different multimodal tasks, such as semantic goal navigation and embodied
question answering. In this paper, we propose a multitask model capable of
jointly learning these multimodal tasks, and transferring knowledge of words
and their grounding in visual objects across the tasks. The proposed model uses
a novel Dual-Attention unit to disentangle the knowledge of words in the
textual representations and visual concepts in the visual representations, and
align them with each other. This disentangled task-invariant alignment of
representations facilitates grounding and knowledge transfer across both tasks.
We show that the proposed model outperforms a range of baselines on both tasks
in simulated 3D environments. We also show that this disentanglement of
representations makes our model modular, interpretable, and allows for transfer
to instructions containing new words by leveraging object detectors.Comment: See https://devendrachaplot.github.io/projects/EMML for demo video
Embodied Question Answering
We present a new AI task -- Embodied Question Answering (EmbodiedQA) -- where
an agent is spawned at a random location in a 3D environment and asked a
question ("What color is the car?"). In order to answer, the agent must first
intelligently navigate to explore the environment, gather information through
first-person (egocentric) vision, and then answer the question ("orange").
This challenging task requires a range of AI skills -- active perception,
language understanding, goal-driven navigation, commonsense reasoning, and
grounding of language into actions. In this work, we develop the environments,
end-to-end-trained reinforcement learning agents, and evaluation protocols for
EmbodiedQA.Comment: 20 pages, 13 figures, Webpage: https://embodiedqa.org
- …