Structured Fusion Networks for Dialog
Neural dialog models have exhibited strong performance; however, their
end-to-end nature lacks a representation of the explicit structure of dialog.
This results in a loss of generalizability and controllability, as well as a
data-hungry nature. Conversely, more traditional dialog systems do have strong models of
explicit structure. This paper introduces several approaches for explicitly
incorporating structure into neural models of dialog. Structured Fusion
Networks first learn neural dialog modules corresponding to the structured
components of traditional dialog systems and then incorporate these modules in
a higher-level generative model. Structured Fusion Networks obtain strong
results on the MultiWOZ dataset, both with and without reinforcement learning.
Structured Fusion Networks are shown to have several valuable properties,
including better domain generalizability, improved performance in reduced data
scenarios and robustness to divergence during reinforcement learning.
Comment: Accepted to SIGDial 201
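The idea of learning neural modules for the classical pipeline stages and then conditioning a higher-level generator on their outputs can be sketched as follows. This is a minimal toy illustration with made-up weight matrices and module names (NLU and dialog-manager stand-ins), not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, V = 8, 4, 10                      # toy input, module, and vocab sizes
W_nlu = rng.normal(size=(H, D))
W_dm = rng.normal(size=(H, H))
W_gen = rng.normal(size=(V, D + 2 * H))

# Hypothetical pre-trained dialog modules (stand-ins for learned networks):
# each maps its input to a fixed-size vector, mirroring the structured
# components (NLU -> dialog manager) of a traditional pipeline.
def nlu_module(utterance_vec):
    return np.tanh(W_nlu @ utterance_vec)       # belief-state representation

def dm_module(belief_vec):
    return np.tanh(W_dm @ belief_vec)           # dialog-act representation

def fused_generator(utterance_vec):
    """Higher-level generative model conditioned on the modules' outputs."""
    belief = nlu_module(utterance_vec)
    act = dm_module(belief)
    fused = np.concatenate([utterance_vec, belief, act])
    logits = W_gen @ fused                      # scores over a toy vocabulary
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

dist = fused_generator(rng.normal(size=D))      # next-token distribution
```

The key design choice mirrored here is that the generator sees both the raw input and every module's intermediate representation, so structural information is available without forcing the generator through a lossy discrete bottleneck.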
Factor Graph Attention
Dialog is an effective way to exchange information, but subtle details and
nuances are extremely important. While significant progress has paved a path to
address visual dialog with algorithms, details and nuances remain a challenge.
Attention mechanisms have demonstrated compelling results to extract details in
visual question answering and also provide a convincing framework for visual
dialog due to their interpretability and effectiveness. However, the many data
utilities that accompany visual dialog challenge existing attention techniques.
We address this issue and develop a general attention mechanism for visual
dialog which operates on any number of data utilities. To this end, we design a
factor graph based attention mechanism which combines any number of utility
representations. We illustrate the applicability of the proposed approach on
the challenging and recently introduced VisDial datasets, outperforming recent
state-of-the-art methods by 1.1% on VisDial v0.9 and by 2% on VisDial v1.0 in
MRR. Our ensemble model improved the MRR score on VisDial v1.0 by more than 6%.
Comment: Accepted to CVPR 2019; revised version includes bottom-up feature
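An attention mechanism that operates on any number of data utilities can be sketched with unary (self) factors and pairwise (cross-utility) factors, in the spirit of a factor graph. The scoring functions below are illustrative stand-ins, not the paper's learned factors:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def factor_graph_attention(utilities, W_pair):
    """Toy factor-graph attention over an arbitrary number of utilities.

    utilities: list of (n_i, d) arrays (e.g., image regions, question
    tokens, history turns). Each utility's attention combines a unary
    score on its own elements with pairwise interaction scores against
    every other utility.
    """
    attended = []
    for i, U in enumerate(utilities):
        unary = (U * U).sum(axis=1)                 # local-factor score
        pair = np.zeros(len(U))
        for j, V in enumerate(utilities):
            if j == i:
                continue
            pair += (U @ W_pair @ V.T).max(axis=1)  # pairwise-factor score
        alpha = softmax(unary + pair)               # per-element attention
        attended.append(alpha @ U)                  # attended summary vector
    return attended

rng = np.random.default_rng(1)
d = 6
utils = [rng.normal(size=(5, d)), rng.normal(size=(7, d)), rng.normal(size=(3, d))]
summaries = factor_graph_attention(utils, rng.normal(size=(d, d)))
```

Because the pairwise factors are computed over all utility pairs, the same code handles three utilities or ten without architectural changes, which is the generality the abstract emphasizes.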
SOLOIST: Building Task Bots at Scale with Transfer Learning and Machine Teaching
We present a new method SOLOIST that uses transfer learning and machine
teaching to build task bots at scale. We parameterize classical modular
task-oriented dialog systems using a Transformer-based auto-regressive language
model, which subsumes different dialog modules into a single neural model. We
pre-train, on heterogeneous dialog corpora, a task-grounded response generation
model, which can generate dialog responses grounded in user goals and
real-world knowledge for task completion. The pre-trained model can be
efficiently adapted to accomplish new tasks with a handful of task-specific
dialogs via machine teaching, where training samples are generated by human
teachers interacting with the system. Experiments show that (i) SOLOIST creates
a new state-of-the-art on well-studied task-oriented dialog benchmarks, including
CamRest676 and MultiWOZ; (ii) in few-shot fine-tuning settings, SOLOIST
significantly outperforms existing methods; and (iii) the use of machine
teaching substantially reduces the labeling cost of fine-tuning. The
pre-trained models and code are available at https://aka.ms/soloist.
Comment: 18 pages; To appear at TACL; Project Website: https://aka.ms/solois
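Subsuming the modular pipeline into a single auto-regressive language model amounts to flattening each turn into one training sequence. A minimal sketch of such a serialization, with illustrative delimiter tokens that are not the paper's exact vocabulary:

```python
def serialize_turn(history, belief, db_result, response):
    """Flatten one task-oriented dialog turn into a single sequence, so
    that belief tracking, DB grounding, and response generation are all
    learned by one auto-regressive LM predicting left to right."""
    return " ".join([
        " ".join(history),
        "<belief>", belief, "</belief>",
        "<db>", db_result, "</db>",
        "<response>", response, "</response>",
    ])

seq = serialize_turn(
    history=["user: book a cheap hotel in the north"],
    belief="hotel { price = cheap ; area = north }",
    db_result="3 matches",
    response="I found 3 cheap hotels in the north. Any preference?",
)
```

Training then reduces to ordinary language modeling over such sequences, which is what makes pre-training on heterogeneous dialog corpora and few-shot adaptation straightforward.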
History for Visual Dialog: Do we really need it?
Visual Dialog involves "understanding" the dialog history (what has been
discussed previously) and the current question (what is asked), in addition to
grounding information in the image, to generate the correct response. In this
paper, we show that co-attention models which explicitly encode dialog history
outperform models that don't, achieving state-of-the-art performance (72% NDCG
on val set). However, we also expose shortcomings of the crowd-sourcing dataset
collection procedure by showing that history is indeed only required for a
small amount of the data and that the current evaluation metric encourages
generic replies. To that end, we propose a challenging subset (VisDialConv) of
the VisDial val set and provide a benchmark of 63% NDCG.
Comment: ACL'2
Multi-step Reasoning via Recurrent Dual Attention for Visual Dialog
This paper presents a new model for visual dialog, Recurrent Dual Attention
Network (ReDAN), using multi-step reasoning to answer a series of questions
about an image. In each question-answering turn of a dialog, ReDAN infers the
answer progressively through multiple reasoning steps. In each step of the
reasoning process, the semantic representation of the question is updated based
on the image and the previous dialog history, and the recurrently-refined
representation is used for further reasoning in the subsequent step. On the
VisDial v1.0 dataset, the proposed ReDAN model achieves a new state-of-the-art
of 64.47% NDCG score. Visualization on the reasoning process further
demonstrates that ReDAN can locate context-relevant visual and textual clues
via iterative refinement, which can lead to the correct answer step-by-step.
Comment: Accepted to ACL 201
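The recurrent refinement loop — repeatedly updating the question representation against image and history features — can be sketched as below. Attention form, update rule, and step count are illustrative assumptions, not ReDAN's learned components:

```python
import numpy as np

def multistep_refine(question_vec, image_feats, history_feats, n_steps=3):
    """Toy multi-step reasoning: at each step, the question representation
    attends over image features and dialog-history features, and the
    attended context is folded back into the representation for the next
    reasoning step."""
    q = question_vec
    for _ in range(n_steps):
        for feats in (image_feats, history_feats):
            scores = feats @ q                      # relevance of each item
            alpha = np.exp(scores - scores.max())
            alpha /= alpha.sum()                    # attention weights
            context = alpha @ feats                 # attended context vector
            q = np.tanh(q + context)                # recurrently-refined q
    return q

rng = np.random.default_rng(2)
d = 8
q_final = multistep_refine(rng.normal(size=d),
                           rng.normal(size=(10, d)),   # image regions
                           rng.normal(size=(4, d)))    # history turns
```

Each pass sharpens the representation used by the next one, which is what lets the model localize context-relevant visual and textual clues step by step.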
A Simple Baseline for Audio-Visual Scene-Aware Dialog
The recently proposed audio-visual scene-aware dialog task paves the way to
more data-driven learning of virtual assistants, smart speakers and car
navigation systems. However, very little is known to date about how to
effectively extract meaningful information from a plethora of sensors that
pound the computational engine of those devices. Therefore, in this paper, we
provide and carefully analyze a simple baseline for audio-visual scene-aware
dialog which is trained end-to-end. Our method differentiates in a data-driven
manner useful signals from distracting ones using an attention mechanism. We
evaluate the proposed approach on the recently introduced and challenging
audio-visual scene-aware dataset, and demonstrate the key features that permit
it to outperform the current state-of-the-art by more than 20% on CIDEr.
Comment: Accepted to CVPR 201
MovieGraphs: Towards Understanding Human-Centric Situations from Videos
There is growing interest in artificial intelligence to build socially
intelligent robots. This requires machines to have the ability to "read"
people's emotions, motivations, and other factors that affect behavior. Towards
this goal, we introduce a novel dataset called MovieGraphs which provides
detailed, graph-based annotations of social situations depicted in movie clips.
Each graph consists of several types of nodes, to capture who is present in the
clip, their emotional and physical attributes, their relationships (e.g.,
parent/child), and the interactions between them. Most interactions are
associated with topics that provide additional details, and reasons that give
motivations for actions. In addition, most interactions and many attributes are
grounded in the video with time stamps. We provide a thorough analysis of our
dataset, showing interesting common-sense correlations between different social
aspects of scenes, as well as across scenes over time. We propose a method for
querying videos and text with graphs, and show that: 1) our graphs contain rich
and sufficient information to summarize and localize each scene; and 2)
subgraphs allow us to describe situations at an abstract level and retrieve
multiple semantically relevant situations. We also propose methods for
interaction understanding via ordering, and reason understanding. MovieGraphs
is the first benchmark to focus on inferred properties of human-centric
situations, and opens up an exciting avenue towards socially-intelligent AI
agents.
Comment: Spotlight at CVPR 2018. Webpage: http://moviegraphs.cs.toronto.ed
Making History Matter: History-Advantage Sequence Training for Visual Dialog
We study the multi-round response generation in visual dialog, where a
response is generated according to a visually grounded conversational history.
Given a triplet: an image, Q&A history, and current question, all the
prevailing methods follow a codec (i.e., encoder-decoder) fashion in a
supervised learning paradigm: a multimodal encoder encodes the triplet into a
feature vector, which is then fed into the decoder for the current answer
generation, supervised by the ground-truth. However, this conventional
supervised learning does NOT take into account the impact of imperfect history,
violating the conversational nature of visual dialog and thus making the codec
more inclined to learn history bias but not contextual reasoning. To this end,
inspired by the actor-critic policy gradient in reinforcement learning, we
propose a novel training paradigm called History Advantage Sequence Training
(HAST). Specifically, we intentionally impose wrong answers in the history,
obtaining an adverse critic, and see how the historic error impacts the codec's
future behavior by History Advantage, a quantity obtained by subtracting the
adverse critic from the gold reward of ground-truth history. Moreover, to make
the codec more sensitive to the history, we propose a novel attention network
called History-Aware Co-Attention Network (HACAN) which can be effectively
trained by using HAST. Experimental results on three benchmarks: VisDial
v0.9&v1.0 and GuessWhat?!, show that the proposed HAST strategy consistently
outperforms the state-of-the-art supervised counterparts.
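The History Advantage the abstract defines — gold reward with the ground-truth history minus the critic's reward when wrong answers are injected into the history — is a scalar that can weight the policy-gradient update. A minimal sketch with illustrative reward values:

```python
def history_advantage(gold_reward, adverse_reward):
    """History Advantage: reward under the ground-truth history minus the
    adverse critic's reward under a history corrupted with wrong answers.
    A large positive value means the corrupted history hurt the model,
    giving a strong training signal that the model should attend to
    history rather than ignore it."""
    return gold_reward - adverse_reward

# Hypothetical rewards for one dialog round (e.g., rank-based scores):
adv = history_advantage(gold_reward=0.9, adverse_reward=0.4)
```

Used as an advantage term in actor-critic sequence training, this penalizes answers whose quality is insensitive to historic errors, countering the history bias the abstract describes.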
Abstractive Summarization of Spoken and Written Conversation
Nowadays, a large amount of information is available in the form of dialogues. We propose a
novel abstractive summarization system for conversations. We use sequence
tagging of utterances for identifying the discourse relations of the dialogue.
After aptly capturing these relations in a paragraph, we feed it into an
Attention-based pointer network to produce abstractive summaries. We obtain
ROUGE-1 and ROUGE-2 F-scores similar to those of extractive summaries from
various previous works.
From Standard Summarization to New Tasks and Beyond: Summarization with Manifold Information
Text summarization is the research area aiming at creating a short and
condensed version of the original document, which conveys the main idea of the
document in a few words. This topic has attracted a large community of
researchers and is now counted among the most promising research areas. In
general, text summarization algorithms take a plain text document as input and
output a summary. However, in real-world applications, most data is not in
plain text format. Instead, there is much manifold information to be
summarized, such as a summary of a web page based on a search-engine query,
extremely long documents (e.g., academic papers), dialog history, and so on.
In this paper, we survey these new summarization tasks and approaches in
real-world applications.
Comment: Accepted by IJCAI 2020 Survey Trac