Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout
A grand goal in AI is to build a robot that can accurately navigate based on
natural language instructions, which requires the agent to perceive the scene,
understand and ground language, and act in the real-world environment. One key
challenge here is to learn to navigate in new environments that are unseen
during training. Most of the existing approaches perform dramatically worse in
unseen environments as compared to seen ones. In this paper, we present a
generalizable navigational agent. Our agent is trained in two stages. The first
stage is training via mixed imitation and reinforcement learning, combining the
benefits from both off-policy and on-policy optimization. The second stage is
fine-tuning via newly-introduced 'unseen' triplets (environment, path,
instruction). To generate these unseen triplets, we propose a simple but
effective 'environmental dropout' method to mimic unseen environments, which
overcomes the problem of limited seen environment variability. Next, we apply
semi-supervised learning (via back-translation) on these dropped-out
environments to generate new paths and instructions. Empirically, we show that
our agent is substantially better at generalizability when fine-tuned with
these triplets, outperforming the state-of-the-art approaches by a large margin
on the private unseen test set of the Room-to-Room task, and achieving the top
rank on the leaderboard.
Comment: NAACL 2019 (12 pages
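The 'environmental dropout' idea in the abstract above can be illustrated with a minimal sketch. It assumes an environment is stored as a list of per-viewpoint feature vectors and that a single mask over feature dimensions is sampled once per environment and shared by every viewpoint. The function name, the data layout, and the rescaling are illustrative assumptions, not the authors' implementation.

```python
import random

def environmental_dropout(features, drop_prob, rng):
    """Drop the same feature dimensions in every viewpoint of one environment.

    features: list of viewpoints, each a list of floats (assumed layout).
    Returns the masked features and the shared keep-mask.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    feature_dim = len(features[0])
    # Sample the mask once per environment, not once per viewpoint.
    keep = [rng.random() >= drop_prob for _ in range(feature_dim)]
    # Rescale kept dimensions, as in standard inverted dropout.
    scale = 1.0 / (1.0 - drop_prob)
    dropped = [
        [x * scale if k else 0.0 for x, k in zip(view, keep)]
        for view in features
    ]
    return dropped, keep

rng = random.Random(0)
env = [[1.0] * 8 for _ in range(4)]  # 4 viewpoints, 8 feature dimensions
dropped, keep = environmental_dropout(env, 0.5, rng)
```

Because the mask is shared, every viewpoint loses the same dimensions, so the masked environment stays internally consistent while differing from the original one.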
Emergent Communication in a Multi-Modal, Multi-Step Referential Game
Inspired by previous work on emergent communication in referential games, we
propose a novel multi-modal, multi-step referential game, where the sender and
receiver have access to distinct modalities of an object, and their information
exchange is bidirectional and of arbitrary duration. The multi-modal multi-step
setting allows agents to develop an internal communication significantly closer
to natural language, in that they share a single set of messages, and that the
length of the conversation may vary according to the difficulty of the task. We
examine these properties empirically using a dataset consisting of images and
textual descriptions of mammals, where the agents are tasked with identifying
the correct object. Our experiments indicate that a robust and efficient
communication protocol emerges, where gradual information exchange informs
better predictions and higher communication bandwidth improves generalization.
Comment: Published as a conference paper at ICLR 2018. 12 page
Game-Based Video-Context Dialogue
Current dialogue systems focus more on textual and speech context knowledge
and are usually based on two speakers. Some recent work has investigated static
image-based dialogue. However, several real-world human interactions also
involve dynamic visual context (similar to videos) as well as dialogue
exchanges among multiple speakers. To move closer towards such multimodal
conversational skills and visually-situated applications, we introduce a new
video-context, many-speaker dialogue dataset based on live-broadcast soccer
game videos and chats from Twitch.tv. This challenging testbed allows us to
develop visually-grounded dialogue models that should generate relevant
temporal and spatial event language from the live video, while also being
relevant to the chat history. For strong baselines, we also present several
discriminative and generative models, e.g., based on tridirectional attention
flow (TriDAF). We evaluate these models via retrieval ranking-recall, automatic
phrase-matching metrics, as well as human evaluation studies. We also present
dataset analyses, model ablations, and visualizations to understand the
contribution of different modalities and model components.
Comment: EMNLP 2018 (14 pages) (fixed Table5 typo in v2
When Autonomous Systems Meet Accuracy and Transferability through AI: A Survey
With widespread applications of artificial intelligence (AI), the
capabilities of the perception, understanding, decision-making and control for
autonomous systems have improved significantly in the past years. When
autonomous systems consider the performance of accuracy and transferability,
several AI methods, like adversarial learning, reinforcement learning (RL) and
meta-learning, show their powerful performance. Here, we review the
learning-based approaches in autonomous systems from the perspectives of
accuracy and transferability. Accuracy means that a well-trained model shows
good results during the testing phase, in which the testing set shares the same
task or data distribution with the training set. Transferability means that
when a well-trained model is transferred to other testing domains, the accuracy
is still good. Firstly, we introduce some basic concepts of transfer learning
and then present some preliminaries of adversarial learning, RL and
meta-learning. Secondly, we focus on reviewing the accuracy or transferability
or both of them to show the advantages of adversarial learning, like generative
adversarial networks (GANs), in typical computer vision tasks in autonomous
systems, including image style transfer, image superresolution, image
deblurring/dehazing/rain removal, semantic segmentation, depth estimation,
pedestrian detection and person re-identification (re-ID). Then, we further
review the performance of RL and meta-learning from the aspects of accuracy or
transferability or both of them in autonomous systems, involving pedestrian
tracking, robot navigation and robotic manipulation. Finally, we discuss
several challenges and future topics for using adversarial learning, RL and
meta-learning in autonomous systems.
Span-based Localizing Network for Natural Language Video Localization
Given an untrimmed video and a text query, natural language video
localization (NLVL) is to locate a matching span from the video that
semantically corresponds to the query. Existing solutions formulate NLVL either
as a ranking task and apply multimodal matching architecture, or as a
regression task to directly regress the target video span. In this work, we
address the NLVL task with a span-based QA approach by treating the input video
as a text passage. We propose a video span localizing network (VSLNet), on top
of the standard span-based QA framework, to address NLVL. The proposed VSLNet
tackles the differences between NLVL and span-based QA through a simple yet
effective query-guided highlighting (QGH) strategy. The QGH guides VSLNet to
search for a matching video span within a highlighted region. Through extensive
experiments on three benchmark datasets, we show that the proposed VSLNet
outperforms the state-of-the-art methods, and that adopting a span-based QA
framework is a promising direction to solve NLVL.
Comment: To appear at ACL 202
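The span-based QA view of localization described above can be sketched as a decoding step: given one start score and one end score per video position, pick the pair with start no later than end that maximizes their sum. The function name and the additive scoring are assumptions for illustration. The query-guided highlighting step is not modeled here, and this is not the VSLNet implementation.

```python
def best_span(start_scores, end_scores):
    """Return (start, end) with start <= end maximizing start + end score."""
    if not start_scores or len(start_scores) != len(end_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    best = (0, 0)
    best_score = start_scores[0] + end_scores[0]
    best_start = 0  # index of the highest start score seen so far
    for end in range(len(end_scores)):
        # Only starts at or before the current end are eligible.
        if start_scores[end] > start_scores[best_start]:
            best_start = end
        score = start_scores[best_start] + end_scores[end]
        if score > best_score:
            best_score = score
            best = (best_start, end)
    return best

span = best_span([0.1, 2.0, 0.3, 0.2], [0.5, 0.1, 0.4, 3.0])
```

Tracking the best start seen so far keeps the search linear in the number of video positions instead of checking every pair.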
Multimodal Research in Vision and Language: A Review of Current and Emerging Trends
Deep Learning and its applications have cascaded impactful research and
development with a diverse range of modalities present in the real-world data.
More recently, this has enhanced research interests in the intersection of the
Vision and Language arena with its numerous applications and fast-paced growth.
In this paper, we present a detailed overview of the latest trends in research
pertaining to visual and language modalities. We look at their applications in
terms of task formulations and how to solve various problems related to semantic
perception and content generation. We also address task-specific trends, along
with their evaluation strategies and upcoming challenges. Moreover, we shed
some light on multi-disciplinary patterns and insights that have emerged in the
recent past, directing this field towards more modular and transparent
intelligent systems. This survey identifies key trends in recent literature on
VisLang research and attempts to unearth directions that the field is heading
towards.
A Comprehensive Survey of Deep Learning for Image Captioning
Generating a description of an image is called image captioning. Image
captioning requires recognizing the important objects, their attributes and
their relationships in an image. It also needs to generate syntactically and
semantically correct sentences. Deep learning-based techniques are capable of
handling the complexities and challenges of image captioning. In this survey
paper, we aim to present a comprehensive review of existing deep learning-based
image captioning techniques. We discuss the foundation of the techniques to
analyze their performances, strengths and limitations. We also discuss the
datasets and the evaluation metrics popularly used in deep learning-based
automatic image captioning.
Comment: 36 Pages, Accepted as a Journal Paper in ACM Computing Surveys
(October 2018
Dual Ask-Answer Network for Machine Reading Comprehension
There are three modalities in the reading comprehension setting: question,
answer and context. The task of question answering or question generation aims
to infer an answer or a question when given the counterpart based on context.
We present a novel two-way neural sequence transduction model that connects
the three modalities, allowing it to learn the two tasks simultaneously so that
they benefit one another. During training, the model receives
question-context-answer triplets as input and captures the cross-modal
interaction via a hierarchical attention process. Unlike previous joint
learning paradigms that leverage the duality of question generation and
question answering at data level, we solve such dual tasks at the architecture
level by mirroring the network structure and partially sharing components at
different layers. This enables the knowledge to be transferred from one task to
another, helping the model to find a general representation for each modality.
The evaluation on four public datasets shows that our dual-learning model
outperforms the mono-learning counterpart as well as the state-of-the-art joint
models on both question answering and question generation tasks.
Comment: 8 pages, 5 figures, 4 tables. Code is available at
https://github.com/hanxiao/daane
From Standard Summarization to New Tasks and Beyond: Summarization with Manifold Information
Text summarization is the research area aiming at creating a short and
condensed version of the original document, which conveys the main idea of the
document in a few words. This research topic has started to attract the
attention of a large community of researchers, and it is nowadays counted as
one of the most promising research areas. In general, text summarization
algorithms aim at using a plain text document as input and then output a
summary. However, in real-world applications, most of the data is not in a
plain text format. Instead, there is much manifold information to be
summarized, such as a web page summarized with respect to a search engine
query, extremely long documents (e.g., academic papers), dialog history, and so
on. In this paper, we focus on the survey of these new summarization tasks and
approaches in real-world applications.
Comment: Accepted by IJCAI 2020 Survey Trac
Deep Residual Output Layers for Neural Language Generation
Many tasks, including language generation, benefit from learning the
structure of the output space, particularly when the space of output labels is
large and the data is sparse. State-of-the-art neural language models
indirectly capture the output space structure in their classifier weights since
they lack parameter sharing across output labels. Learning shared output label
mappings helps, but existing methods have limited expressivity and are prone to
overfitting. In this paper, we investigate the usefulness of more powerful
shared mappings for output labels, and propose a deep residual output mapping
with dropout between layers to better capture the structure of the output space
and avoid overfitting. Evaluations on three language generation tasks show that
our output label mapping can match or improve state-of-the-art recurrent and
self-attention architectures, and suggest that the classifier does not
necessarily need to be high-rank to better model natural language if it is
better at capturing the structure of the output space.
Comment: To appear in ICML 201
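A deep residual output mapping with dropout between layers, as described above, might look like the following minimal sketch. It assumes each layer applies a linear map and a ReLU to the label embedding, applies dropout, and adds the result back to the layer's input, and that logits are dot products between the context vector and the mapped label embeddings. All function names, the choice of ReLU, and the exact residual form are assumptions, not the paper's formulation.

```python
import random

def matvec(weights, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def dropout(vec, drop_prob, rng):
    """Inverted dropout; returns a copy unchanged when drop_prob is 0."""
    if drop_prob == 0.0:
        return list(vec)
    scale = 1.0 / (1.0 - drop_prob)
    return [x * scale if rng.random() >= drop_prob else 0.0 for x in vec]

def residual_label_mapping(embedding, layers, drop_prob, rng):
    """Map one label embedding through a stack of residual layers."""
    out = list(embedding)
    for weights in layers:
        hidden = [max(0.0, v) for v in matvec(weights, out)]  # linear + ReLU
        hidden = dropout(hidden, drop_prob, rng)              # dropout between layers
        out = [h + o for h, o in zip(hidden, out)]            # residual connection
    return out

def output_logits(context, label_embeddings, layers, drop_prob=0.0, rng=None):
    """Score every label as the dot product of context and its mapped embedding."""
    rng = rng or random.Random(0)
    mapped = [residual_label_mapping(e, layers, drop_prob, rng)
              for e in label_embeddings]
    return [sum(c * m for c, m in zip(context, row)) for row in mapped]

labels = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
logits = output_logits([2.0, 3.0], labels, [identity])
```

Because the same layers are applied to every label embedding, parameters are shared across output labels rather than learned separately per label.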