Audio-Visual Scene-Aware Dialog
We introduce the task of scene-aware dialog. Our goal is to generate a
complete and natural response to a question about a scene, given video and
audio of the scene and the history of previous turns in the dialog. To answer
successfully, agents must ground concepts from the question in the video while
leveraging contextual cues from the dialog history. To benchmark this task, we
introduce the Audio Visual Scene-Aware Dialog (AVSD) Dataset. For each of more
than 11,000 videos of human actions from the Charades dataset, our dataset
contains a dialog about the video, plus a final summary of the video by one of
the dialog participants. We train several baseline systems for this task and
evaluate the performance of the trained models using both qualitative and
quantitative metrics. Our results indicate that models must utilize all the
available inputs (video, audio, question, and dialog history) to perform best
on this dataset.
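To make the data setup concrete, here is a minimal sketch of what a single AVSD example might look like; the field names and values are illustrative, not the dataset's official schema:

```python
# Hypothetical layout of one AVSD example; field names are illustrative only.
example = {
    "video_id": "CHARADES_XYZ",          # source clip from the Charades dataset
    "dialog": [                          # the released data has 10 QA turns per video
        {"question": "what is the person doing?",
         "answer": "she is folding laundry on the bed"},
        # ... further turns ...
    ],
    "summary": "a woman folds laundry while a tv plays in the background",
}

def build_model_inputs(example, turn_idx):
    """Assemble (dialog history, current question) for the turn to be answered."""
    history = example["dialog"][:turn_idx]
    question = example["dialog"][turn_idx]["question"]
    return history, question
```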
Audio Visual Scene-Aware Dialog (AVSD) Challenge at DSTC7
Scene-aware dialog systems will be able to have conversations with users
about the objects and events around them. Progress on such systems can be made
by integrating state-of-the-art technologies from multiple research areas
including end-to-end dialog systems, visual dialog, and video description. We
introduce the Audio Visual Scene Aware Dialog (AVSD) challenge and dataset. In
this challenge, which is one track of the 7th Dialog System Technology
Challenges (DSTC7) workshop, the task is to build a system that generates
responses in a dialog about an input video.
A Simple Baseline for Audio-Visual Scene-Aware Dialog
The recently proposed audio-visual scene-aware dialog task paves the way to a
more data-driven approach to building virtual assistants, smart speakers and car
navigation systems. However, very little is known to date about how to
effectively extract meaningful information from a plethora of sensors that
pound the computational engine of those devices. Therefore, in this paper, we
provide and carefully analyze a simple baseline for audio-visual scene-aware
dialog, which is trained end-to-end. Using an attention mechanism, our method
differentiates useful signals from distracting ones in a data-driven manner. We
evaluate the proposed approach on the recently introduced and challenging
audio-visual scene-aware dialog dataset, and demonstrate the key features that
allow it to outperform the current state-of-the-art by more than 20% on CIDEr.
Comment: Accepted to CVPR 2019
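The abstract does not spell out the architecture; the sketch below only illustrates the general idea of attention that weighs per-modality features against the question so useful signals can dominate distracting ones. Layer sizes and the scoring function are assumptions, not the paper's actual design:

```python
import torch
import torch.nn as nn

class ModalityAttention(nn.Module):
    """Toy attention over per-modality feature vectors (e.g. video, audio,
    dialog history), scored against the question. Sizes are placeholders."""

    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)  # scores each modality given the question

    def forward(self, modality_feats, question_feat):
        # modality_feats: (batch, n_modalities, dim), question_feat: (batch, dim)
        q = question_feat.unsqueeze(1).expand_as(modality_feats)
        scores = self.score(torch.cat([modality_feats, q], dim=-1)).squeeze(-1)
        weights = torch.softmax(scores, dim=-1)            # useful vs. distracting signals
        fused = (weights.unsqueeze(-1) * modality_feats).sum(dim=1)
        return fused, weights                              # fused context for the decoder
```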
Context, Attention and Audio Feature Explorations for Audio Visual Scene-Aware Dialog
With the recent advancements in AI, Intelligent Virtual Assistants (IVAs) have
become a ubiquitous part of many homes. Going forward, we are witnessing a
confluence of vision, speech and dialog system technologies that are enabling
the IVAs to learn audio-visual groundings of utterances and have conversations
with users about the objects, activities and events surrounding them. As part
of the Audio Visual Scene-Aware Dialog (AVSD) track of the 7th Dialog System
Technology Challenges (DSTC7), we explore 'topics' of the dialog as an important
contextual feature in the architecture, along with explorations around
multimodal attention. We also incorporate an end-to-end audio
classification ConvNet, AclNet, into our models. We present detailed analysis
of the experiments and show that some of our model variations outperform the
baseline system presented for this task.
Comment: 7 pages, 2 figures, DSTC7 workshop at AAAI 2019
Exploring Context, Attention and Audio Features for Audio Visual Scene-Aware Dialog
We are witnessing a confluence of vision, speech and dialog system
technologies that are enabling Intelligent Virtual Assistants (IVAs) to learn
audio-visual groundings of
utterances and have conversations with users about the objects, activities and
events surrounding them. Recent progress in visual grounding techniques and
Audio Understanding are enabling machines to understand shared semantic
concepts and listen to the various sensory events in the environment. With
audio and visual grounding methods, end-to-end multimodal spoken dialog systems
(SDS) are trained to meaningfully communicate with us in natural language about
the real dynamic
audio-visual sensory world around us. In this work, we explore the role of
'topics' as the context of the conversation along with multimodal attention
into such an end-to-end audio-visual scene-aware dialog system architecture. We
also incorporate an end-to-end audio classification ConvNet, AclNet, into our
models. We develop and test our approaches on the Audio Visual Scene-Aware
Dialog (AVSD) dataset released as a part of the DSTC7. We present the analysis
of our experiments and show that some of our model variations outperform the
baseline system released for AVSD.
Comment: Presented at the Visual Question Answering and Dialog Workshop, CVPR
2019, Long Beach, USA. arXiv admin note: substantial text overlap with
arXiv:1912.1013
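AclNet itself is not described in this listing; the following is only a schematic end-to-end 1-D ConvNet over raw audio in that spirit, with placeholder filter counts, kernel sizes, and class count:

```python
import torch
import torch.nn as nn

class TinyAudioConvNet(nn.Module):
    """Schematic end-to-end audio ConvNet: raw waveform in, class logits and an
    embedding (usable by the dialog model) out. All sizes are placeholders."""

    def __init__(self, n_classes=50):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=80, stride=4), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=3), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.AdaptiveAvgPool1d(1),              # global pooling over time
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, waveform):                  # waveform: (batch, 1, samples)
        h = self.features(waveform).squeeze(-1)   # (batch, 64) audio embedding
        return self.classifier(h), h              # logits and embedding for fusion
```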
End-to-End Audio Visual Scene-Aware Dialog using Multimodal Attention-Based Video Features
Dialog systems need to understand dynamic visual scenes in order to have
conversations with users about the objects and events around them. Scene-aware
dialog systems for real-world applications could be developed by integrating
state-of-the-art technologies from multiple research areas, including:
end-to-end dialog technologies, which generate system responses using models
trained from dialog data; visual question answering (VQA) technologies, which
answer questions about images using learned image features; and video
description technologies, in which descriptions/captions are generated from
videos using multimodal information. We introduce a new dataset of dialogs
about videos of human behaviors. Each dialog is a typed conversation that
consists of a sequence of 10 question-and-answer (QA) pairs between two Amazon
Mechanical Turk (AMT) workers. In total, we collected dialogs on roughly 9,000
videos. Using this new dataset for Audio Visual Scene-aware dialog (AVSD), we
trained an end-to-end conversation model that generates responses in a dialog
about a video. Our experiments demonstrate that using multimodal features that
were developed for multimodal attention-based video description enhances the
quality of generated dialog about dynamic scenes (videos). Our dataset, model
code and pretrained models will be publicly available for a new Video
Scene-Aware Dialog challenge.
Comment: A prototype system for the Audio Visual Scene-aware Dialog (AVSD) at
DSTC7
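As a rough illustration of attention over video features during response generation, the sketch below attends over per-frame (or per-segment) features conditioned on the decoder state; dimensions and the scoring form are assumptions, not taken from the paper:

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Sketch of additive attention over a sequence of video features,
    conditioned on the current decoder state. Dimensions are illustrative."""

    def __init__(self, feat_dim=2048, dec_dim=512, att_dim=256):
        super().__init__()
        self.proj_feat = nn.Linear(feat_dim, att_dim)
        self.proj_state = nn.Linear(dec_dim, att_dim)
        self.v = nn.Linear(att_dim, 1)

    def forward(self, video_feats, dec_state):
        # video_feats: (batch, time, feat_dim), dec_state: (batch, dec_dim)
        e = torch.tanh(self.proj_feat(video_feats) + self.proj_state(dec_state).unsqueeze(1))
        alpha = torch.softmax(self.v(e).squeeze(-1), dim=-1)       # (batch, time)
        context = (alpha.unsqueeze(-1) * video_feats).sum(dim=1)   # (batch, feat_dim)
        return context, alpha   # context vector fed to the response decoder
```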
Leveraging Topics and Audio Features with Multimodal Attention for Audio Visual Scene-Aware Dialog
With the recent advancements in Artificial Intelligence (AI), Intelligent
Virtual Assistants (IVAs) such as Alexa, Google Home, etc., have become a
ubiquitous part of many homes. Currently, such IVAs are mostly audio-based, but
going forward, we are witnessing a confluence of vision, speech and dialog
system technologies that are enabling the IVAs to learn audio-visual groundings
of utterances. This will enable agents to have conversations with users about
the objects, activities and events surrounding them. In this work, we present
three main architectural explorations for the Audio Visual Scene-Aware Dialog
(AVSD) task: 1) investigating 'topics' of the dialog as an important contextual
feature for the conversation, 2) exploring several multimodal attention
mechanisms during response generation, 3) incorporating an end-to-end audio
classification ConvNet, AclNet, into our architecture. We discuss detailed
analysis of the experimental results and show that our model variations
outperform the baseline system presented for the AVSD task.
Comment: Presented at the 3rd Visually Grounded Interaction and Language
(ViGIL) Workshop, NeurIPS 2019, Vancouver, Canada. arXiv admin note:
substantial text overlap with arXiv:1812.08407, arXiv:1912.1013
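One plausible, purely illustrative way to realize dialog 'topics' as a contextual feature is to infer a topic posterior for each dialog history and concatenate it to the dialog encoding; the paper's actual topic model and fusion point may differ:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy dialog histories; in practice these would come from the AVSD data.
histories = [
    "is there a person in the video yes a man is cooking",
    "what is the woman holding she is holding a towel",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(histories)
lda = LatentDirichletAllocation(n_components=10, random_state=0).fit(counts)

topic_feat = lda.transform(counts)                       # (n_dialogs, 10) topic posteriors
dialog_encoding = np.random.randn(len(histories), 512)   # stand-in for an encoder output
context = np.concatenate([dialog_encoding, topic_feat], axis=1)  # extra context for the decoder
```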
The Eighth Dialog System Technology Challenge
This paper introduces the Eighth Dialog System Technology Challenge. In line
with recent challenges, the eighth edition focuses on applying end-to-end
dialog technologies in a pragmatic way for multi-domain task-completion, noetic
response selection, audio visual scene-aware dialog, and schema-guided dialog
state tracking tasks. This paper describes the task definition, provided
datasets, and evaluation set-up for each track. We also summarize the results
of the submitted systems to highlight the overall trends of the
state-of-the-art technologies for the tasks.
Comment: Submitted to NeurIPS 2019 3rd Conversational AI Workshop
Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog
Audio-Visual Scene-Aware Dialog (AVSD) is a task to generate responses when
chatting about a given video, which is organized as a track of the 8th Dialog
System Technology Challenge (DSTC8). To solve the task, we propose a universal
multimodal transformer and introduce the multi-task learning method to learn
joint representations among different modalities as well as generate
informative and fluent responses. Our method extends a pre-trained natural
language generation model to the multimodal dialogue generation task. Our system
achieves the best performance in both objective and subjective evaluations in
the challenge.
Comment: Accepted by the AAAI 2020 DSTC8 workshop. Ranked 1st in the DSTC8-AVSD track
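A rough sketch of the core idea, projecting video and audio features into the same embedding space as the dialog tokens so a pretrained generation transformer can attend over one joint sequence; the vocabulary size, feature dimensions, and downstream decoder are assumptions, not the authors' exact model:

```python
import torch
import torch.nn as nn

class MultimodalInputEmbedder(nn.Module):
    """Sketch: map video/audio segment features into the word-embedding space
    so a GPT-2-style decoder can process one joint multimodal sequence."""

    def __init__(self, vocab_size=50257, d_model=768, video_dim=2048, audio_dim=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.video_proj = nn.Linear(video_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)

    def forward(self, video_feats, audio_feats, token_ids):
        # video_feats: (B, Tv, video_dim), audio_feats: (B, Ta, audio_dim),
        # token_ids: (B, Tt) dialog history + question tokens
        seq = torch.cat(
            [self.video_proj(video_feats),
             self.audio_proj(audio_feats),
             self.tok_emb(token_ids)], dim=1)
        return seq   # (B, Tv + Ta + Tt, d_model), input to the transformer decoder
```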
Multi-step Joint-Modality Attention Network for Scene-Aware Dialogue System
Understanding dynamic scenes and dialogue contexts in order to converse with
users has been challenging for multimodal dialogue systems. The 8th Dialog
System Technology Challenge (DSTC8) proposed an Audio Visual Scene-Aware Dialog
(AVSD) task, which contains multiple modalities including audio, vision, and
language, to evaluate how dialogue systems understand different modalities and
respond to users. In this paper, we propose a multi-step joint-modality
attention network (JMAN) based on a recurrent neural network (RNN) to reason over
videos. Our model performs a multi-step attention mechanism and jointly
considers both visual and textual representations in each reasoning process to
better integrate information from the two different modalities. Compared to the
baseline released by the AVSD organizers, our model achieves relative
improvements of 12.1% on ROUGE-L and 22.4% on CIDEr.
Comment: DSTC8 collocated with AAAI 2020
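A compact sketch of multi-step attention that jointly considers visual and textual memories while a recurrent cell refines the query at each reasoning step; sizes, the step count, and the GRU-based update are assumptions rather than the authors' exact JMAN formulation:

```python
import torch
import torch.nn as nn

class MultiStepJointAttention(nn.Module):
    """Sketch: repeat attention over visual and textual memories for a fixed
    number of reasoning steps, refining a query vector each step."""

    def __init__(self, dim=512, steps=3):
        super().__init__()
        self.steps = steps
        self.update = nn.GRUCell(2 * dim, dim)   # recurrent refinement of the query

    @staticmethod
    def attend(memory, query):
        # memory: (B, N, dim), query: (B, dim) -> attended context (B, dim)
        scores = torch.softmax((memory @ query.unsqueeze(-1)).squeeze(-1), dim=-1)
        return (scores.unsqueeze(-1) * memory).sum(dim=1)

    def forward(self, visual_mem, text_mem, query):
        for _ in range(self.steps):
            v_ctx = self.attend(visual_mem, query)
            t_ctx = self.attend(text_mem, query)
            query = self.update(torch.cat([v_ctx, t_ctx], dim=-1), query)
        return query   # refined joint representation used for response generation
```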