Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems
Developing Video-Grounded Dialogue Systems (VGDS), in which a dialogue is
conducted based on the visual and audio content of a given video, is
significantly more challenging than building traditional image- or
text-grounded dialogue systems because (1) the feature space of a video spans
multiple frames, making semantic information difficult to extract; and (2) a
dialogue agent must perceive and process information from different modalities
(audio, video, caption, etc.) to obtain a comprehensive understanding. Most
existing work is based on RNNs and sequence-to-sequence architectures, which
are not very effective at capturing the complex long-term dependencies found
in videos. To overcome this, we propose Multimodal Transformer Networks (MTN)
to encode videos and incorporate information from different modalities. We
also propose query-aware attention through an auto-encoder to extract
query-aware features from non-text modalities. We develop a training procedure
that simulates token-level decoding to improve the quality of responses
generated during inference. We achieve state-of-the-art performance on the
Dialogue System Technology Challenge 7 (DSTC7). Our model also generalizes to
another multimodal, visually grounded dialogue task and obtains promising
performance. We implemented our models in PyTorch, and the code is released at
https://github.com/henryhungle/MTN.
Comment: Accepted at ACL 2019 (Long Paper)
Agency for Science, Technology and Research (A*STAR) under its AME Programmatic Funding Scheme
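As a rough illustration of the query-aware attention idea described in this abstract, here is a minimal PyTorch sketch (not the authors' code; see the linked repository for the actual MTN implementation). The module name, dimensions, and the auto-encoding reconstruction head are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryAwareAttention(nn.Module):
    """Query tokens attend over video features; an auto-encoding head
    reconstructs the query from the attended features, encouraging them
    to stay query-relevant (illustrative simplification)."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.reconstruct = nn.Linear(d_model, d_model)

    def forward(self, query_tok, video_feat):
        # query_tok: (B, Lq, d), video_feat: (B, Lv, d)
        attended, _ = self.attn(query_tok, video_feat, video_feat)
        recon_loss = F.mse_loss(self.reconstruct(attended), query_tok)
        return attended, recon_loss

# Toy usage with random tensors:
qa = QueryAwareAttention()
q = torch.randn(2, 10, 512)   # question token embeddings
v = torch.randn(2, 40, 512)   # sampled video-frame features
feats, aux_loss = qa(q, v)    # aux_loss would be added to the decoding loss
```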
Structured Co-reference Graph Attention for Video-grounded Dialogue
A video-grounded dialogue system referred to as Structured Co-reference Graph
Attention (SCGA) is presented for decoding the answer sequence to a question
about a given video while keeping track of the dialogue context. Although
recent efforts have made great strides in improving response quality,
performance is still far from satisfactory. The two main challenges are (1)
how to resolve co-reference among multiple modalities and (2) how to reason
over the rich underlying semantic structure of video with its complex spatial
and temporal dynamics. To this end, SCGA is built on (1) a Structured
Co-reference Resolver that performs dereferencing by building a structured
graph over multiple modalities, and (2) a Spatio-temporal Video Reasoner that
captures local-to-global dynamics of video via gradually neighboring graph
attention. SCGA also makes use of a pointer network to dynamically replicate
parts of the question when decoding the answer sequence. The validity of the
proposed SCGA is demonstrated on the AVSD@DSTC7 and AVSD@DSTC8 datasets, two
challenging video-grounded dialogue benchmarks, and on the TVQA dataset, a
large-scale videoQA benchmark. Our empirical results show that SCGA
outperforms other state-of-the-art dialogue systems on these benchmarks, while
an extensive ablation study and qualitative analysis reveal the performance
gains and improved interpretability.
Comment: Accepted to AAAI 2021
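The pointer mechanism mentioned above is, in essence, a pointer-generator decoding step: the model mixes a vocabulary distribution with a copy distribution over question tokens. The sketch below is a generic illustration under assumed shapes and names, not the SCGA code:

```python
import torch
import torch.nn as nn

class PointerDecoderStep(nn.Module):
    """One decoding step that can either generate from the vocabulary or
    copy a token from the question (generic pointer-generator sketch)."""

    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.vocab_size = vocab_size
        self.vocab_proj = nn.Linear(d_model, vocab_size)
        self.copy_gate = nn.Linear(d_model, 1)

    def forward(self, dec_state, question_enc, question_ids):
        # dec_state: (B, d); question_enc: (B, Lq, d); question_ids: (B, Lq)
        p_vocab = torch.softmax(self.vocab_proj(dec_state), dim=-1)
        # Dot-product attention over question tokens gives copy probabilities.
        scores = torch.bmm(question_enc, dec_state.unsqueeze(-1)).squeeze(-1)
        p_copy = torch.softmax(scores, dim=-1)                     # (B, Lq)
        # Scatter the copy mass into vocabulary space.
        copy_dist = dec_state.new_zeros(dec_state.size(0), self.vocab_size)
        copy_dist.scatter_add_(1, question_ids, p_copy)
        gate = torch.sigmoid(self.copy_gate(dec_state))            # (B, 1)
        return gate * p_vocab + (1 - gate) * copy_dist
```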
From Knowledge Augmentation to Multi-tasking: Towards Human-like Dialogue Systems
Building dialogue agents that can converse with humans naturally has been a
long-standing dream of researchers since the early days of artificial
intelligence. The well-known Turing Test proposed judging the ultimate
validity of an artificial intelligence agent by whether its dialogues are
indistinguishable from a human's. It should come as no surprise that
human-level dialogue systems are very challenging to build. While early
efforts on rule-based systems found limited success, the emergence of deep
learning has enabled great advances on this topic.
In this thesis, we focus on methods that address the numerous issues that
have maintained the gap between artificial conversational agents and
human-level interlocutors. These methods were proposed and evaluated in ways
inspired by general state-of-the-art AI methodologies, but they were also
tailored to the characteristics specific to dialogue systems.
Comment: PhD thesis
A Unified Framework for Slot based Response Generation in a Multimodal Dialogue System
Natural Language Understanding (NLU) and Natural Language Generation (NLG)
are the two critical components of every conversational system: the former
understands the user by capturing the necessary information in the form of
slots, and the latter generates an appropriate response in accordance with
the extracted information. Recently, dialogue systems integrated with
complementary information such as images, audio, or video have gained immense
popularity. In this work, we propose an end-to-end framework that extracts
the necessary slot values from an utterance and generates a coherent
response, thereby helping the user achieve their desired goals in a
multimodal dialogue system containing both textual and visual information.
Extracting the necessary information depends not only on the text but also on
the visual cues present in the dialogue. Similarly, for generation, the
previous dialogue context comprising multimodal information is essential for
producing coherent and informative responses. We employ a multimodal
hierarchical encoder built on pre-trained DialoGPT and also exploit a
knowledge base (KB) to provide stronger context for both tasks. We design a
slot attention mechanism to focus on the necessary information in a given
utterance, and a decoder then generates the corresponding response for the
given dialogue context and extracted slot values. Experimental results on the
Multimodal Dialogue Dataset (MMD) show that the proposed framework
outperforms the baseline approaches on both tasks. The code is available at
https://github.com/avinashsai/slot-gpt.
Comment: Published in the journal Multimedia Tools and Applications
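For intuition, a slot attention mechanism of the kind described can be sketched as one learned query vector per slot type attending over the encoder's token states; the names and shapes below are our assumptions, not the released slot-gpt code:

```python
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    """One learnable query per slot type pools slot-relevant features
    from encoder token states (illustrative sketch)."""

    def __init__(self, n_slots: int, d_model: int):
        super().__init__()
        # e.g. one query each for slots like color, material, style, ...
        self.slot_queries = nn.Parameter(torch.randn(n_slots, d_model))

    def forward(self, token_states):
        # token_states: (B, L, d) from a (multimodal) hierarchical encoder.
        scores = torch.einsum("sd,bld->bsl", self.slot_queries, token_states)
        weights = torch.softmax(scores, dim=-1)        # (B, n_slots, L)
        # Weighted sum of token states per slot: (B, n_slots, d).
        return torch.einsum("bsl,bld->bsd", weights, token_states)
```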
Information-Theoretic Text Hallucination Reduction for Video-grounded Dialogue
Video-grounded Dialogue (VGD) aims to decode an answer sentence to a question
about a given video and dialogue context. Despite the recent success of
multi-modal reasoning in generating answer sentences, existing dialogue
systems still suffer from a text hallucination problem: indiscriminate
copying from input texts without an understanding of the question. This stems
from learned spurious correlations: answer sentences in the dataset usually
contain words from the input texts, so a VGD system comes to rely excessively
on copying input words in the hope that they overlap with the ground-truth
answer. Hence, we design the Text Hallucination Mitigating (THAM) framework,
which incorporates a Text Hallucination Regularization (THR) loss derived
from the proposed information-theoretic text hallucination measurement.
Applying THAM to current dialogue systems demonstrates its effectiveness on
VGD benchmarks (i.e., AVSD@DSTC7 and AVSD@DSTC8) and yields enhanced
interpretability.
Comment: 12 pages; accepted at EMNLP 2022
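The paper derives its THR loss from an information-theoretic hallucination measurement; as a loose stand-in, the snippet below penalizes the probability mass the decoder places on input-text tokens, which captures the "discourage indiscriminate copying" intent in its simplest form (our simplification, not the THAM loss):

```python
import torch

def copy_mass_penalty(logits, input_ids):
    """Rough copying penalty (illustrative proxy, not the THR loss).
    logits: (B, T, V) decoder logits; input_ids: (B, Li) token ids of the
    input texts (dialogue history, captions) that tend to be copied."""
    probs = torch.softmax(logits, dim=-1)                       # (B, T, V)
    idx = input_ids.unsqueeze(1).expand(-1, probs.size(1), -1)  # (B, T, Li)
    mass = probs.gather(2, idx).sum(dim=-1)                     # (B, T)
    # Note: repeated ids are double-counted; acceptable for a rough penalty.
    return mass.mean()

# total_loss = cross_entropy + lambda_thr * copy_mass_penalty(logits, history_ids)
```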