99 research outputs found

Audio-Visual Scene-Aware Dialog

Author: Alamri Huda
Anderson Peter
Batra Dhruv
Cartillier Vincent
Cherian Anoop
Das Abhishek
Essa Irfan
Hori Chiori
Lee Stefan
Marks Tim K.
Parikh Devi
Wang Jue
Publication date: 08/05/2019

We introduce the task of scene-aware dialog. Our goal is to generate a complete and natural response to a question about a scene, given video and audio of the scene and the history of previous turns in the dialog. To answer successfully, agents must ground concepts from the question in the video while leveraging contextual cues from the dialog history. To benchmark this task, we introduce the Audio Visual Scene-Aware Dialog (AVSD) Dataset. For each of more than 11,000 videos of human actions from the Charades dataset, our dataset contains a dialog about the video, plus a final summary of the video by one of the dialog participants. We train several baseline systems for this task and evaluate the performance of the trained models using both qualitative and quantitative metrics. Our results indicate that models must utilize all the available inputs (video, audio, question, and dialog history) to perform best on this dataset.

arXiv.org e-Print Archive

Audio Visual Scene-Aware Dialog (AVSD) Challenge at DSTC7

Author: Alamri Huda
Batra Dhruv
Cartillier Vincent
Cherian Anoop
Das Abhishek
Essa Irfan
Hori Chiori
Lopes Raphael Gontijo
Marks Tim K.
Parikh Devi
Wang Jue
Publication date: 01/06/2018

Scene-aware dialog systems will be able to have conversations with users about the objects and events around them. Progress on such systems can be made by integrating state-of-the-art technologies from multiple research areas, including end-to-end dialog systems, visual dialog, and video description. We introduce the Audio Visual Scene Aware Dialog (AVSD) challenge and dataset. In this challenge, which is one track of the 7th Dialog System Technology Challenges (DSTC7) workshop, the task is to build a system that generates responses in a dialog about an input video.

arXiv.org e-Print Archive

A Simple Baseline for Audio-Visual Scene-Aware Dialog

Author: Hazan Tamir
Schwartz Idan
Schwing Alexander
Publication date: 11/04/2019

The recently proposed audio-visual scene-aware dialog task paves the way to more data-driven learning of virtual assistants, smart speakers and car navigation systems. However, very little is known to date about how to effectively extract meaningful information from the plethora of sensors that feed the computational engine of those devices. Therefore, in this paper, we provide and carefully analyze a simple baseline for audio-visual scene-aware dialog which is trained end-to-end. Our method uses an attention mechanism to differentiate, in a data-driven manner, useful signals from distracting ones. We evaluate the proposed approach on the recently introduced and challenging audio-visual scene-aware dataset, and demonstrate the key features that allow it to outperform the current state-of-the-art by more than 20% on CIDEr. Comment: Accepted to CVPR 201

arXiv.org e-Print Archive

Context, Attention and Audio Feature Explorations for Audio Visual Scene-Aware Dialog

Author: Huang Jonathan
Kumar Shachi H
Leanos Juan Jose Alvarado
Nachman Lama
Okur Eda
Sahay Saurav
Publication date: 20/12/2018

With the recent advancements in AI, Intelligent Virtual Assistants (IVA) have become a ubiquitous part of every home. Going forward, we are witnessing a confluence of vision, speech and dialog system technologies that are enabling the IVAs to learn audio-visual groundings of utterances and have conversations with users about the objects, activities and events surrounding them. As a part of the 7th Dialog System Technology Challenges (DSTC7), for the Audio Visual Scene-Aware Dialog (AVSD) track, we explore 'topics' of the dialog as an important contextual feature in the architecture, along with explorations around multimodal attention. We also incorporate an end-to-end audio classification ConvNet, AclNet, into our models. We present detailed analysis of the experiments and show that some of our model variations outperform the baseline system presented for this task. Comment: 7 pages, 2 figures, DSTC7 workshop at AAAI 201

arXiv.org e-Print Archive

Exploring Context, Attention and Audio Features for Audio Visual Scene-Aware Dialog

Author: Huang Jonathan
Kumar Shachi H
Nachman Lama
Okur Eda
Sahay Saurav
Publication date: 20/12/2019

We are witnessing a confluence of vision, speech and dialog system technologies that are enabling the IVAs to learn audio-visual groundings of utterances and have conversations with users about the objects, activities and events surrounding them. Recent progress in visual grounding techniques and audio understanding is enabling machines to understand shared semantic concepts and listen to the various sensory events in the environment. With audio and visual grounding methods, end-to-end multimodal SDS are trained to meaningfully communicate with us in natural language about the real dynamic audio-visual sensory world around us. In this work, we explore the role of 'topics' as the context of the conversation, along with multimodal attention, in such an end-to-end audio-visual scene-aware dialog system architecture. We also incorporate an end-to-end audio classification ConvNet, AclNet, into our models. We develop and test our approaches on the Audio Visual Scene-Aware Dialog (AVSD) dataset released as a part of the DSTC7. We present the analysis of our experiments and show that some of our model variations outperform the baseline system released for AVSD. Comment: Presented at the Visual Question Answering and Dialog Workshop, CVPR 2019, Long Beach, USA. arXiv admin note: substantial text overlap with arXiv:1912.1013

arXiv.org e-Print Archive

End-to-End Audio Visual Scene-Aware Dialog using Multimodal Attention-Based Video Features

Author: Alamri Huda
Batra Dhruv
Cartillier Vincent
Cherian Anoop
Das Abhishek
Essa Irfan
Hori Chiori
Hori Takaaki
Lopes Raphael Gontijo
Marks Tim K.
Parikh Devi
Wang Jue
Wichern Gordon
Publication date: 29/06/2018

Dialog systems need to understand dynamic visual scenes in order to have conversations with users about the objects and events around them. Scene-aware dialog systems for real-world applications could be developed by integrating state-of-the-art technologies from multiple research areas, including: end-to-end dialog technologies, which generate system responses using models trained from dialog data; visual question answering (VQA) technologies, which answer questions about images using learned image features; and video description technologies, in which descriptions/captions are generated from videos using multimodal information. We introduce a new dataset of dialogs about videos of human behaviors. Each dialog is a typed conversation that consists of a sequence of 10 question-and-answer (QA) pairs between two Amazon Mechanical Turk (AMT) workers. In total, we collected dialogs on roughly 9,000 videos. Using this new dataset for Audio Visual Scene-aware dialog (AVSD), we trained an end-to-end conversation model that generates responses in a dialog about a video. Our experiments demonstrate that using multimodal features that were developed for multimodal attention-based video description enhances the quality of generated dialog about dynamic scenes (videos). Our dataset, model code and pretrained models will be publicly available for a new Video Scene-Aware Dialog challenge. Comment: A prototype system for the Audio Visual Scene-aware Dialog (AVSD) at DSTC

arXiv.org e-Print Archive

Leveraging Topics and Audio Features with Multimodal Attention for Audio Visual Scene-Aware Dialog

Author: Huang Jonathan
Kumar Shachi H
Nachman Lama
Okur Eda
Sahay Saurav
Publication date: 20/12/2019

With the recent advancements in Artificial Intelligence (AI), Intelligent Virtual Assistants (IVA) such as Alexa, Google Home, etc., have become a ubiquitous part of many homes. Currently, such IVAs are mostly audio-based, but going forward, we are witnessing a confluence of vision, speech and dialog system technologies that are enabling the IVAs to learn audio-visual groundings of utterances. This will enable agents to have conversations with users about the objects, activities and events surrounding them. In this work, we present three main architectural explorations for the Audio Visual Scene-Aware Dialog (AVSD): 1) investigating 'topics' of the dialog as an important contextual feature for the conversation, 2) exploring several multimodal attention mechanisms during response generation, 3) incorporating an end-to-end audio classification ConvNet, AclNet, into our architecture. We discuss detailed analysis of the experimental results and show that our model variations outperform the baseline system presented for the AVSD task. Comment: Presented at the 3rd Visually Grounded Interaction and Language (ViGIL) Workshop, NeurIPS 2019, Vancouver, Canada. arXiv admin note: substantial text overlap with arXiv:1812.08407, arXiv:1912.1013

arXiv.org e-Print Archive

The Eighth Dialog System Technology Challenge

Author: Adada Mahmoud
Atkinson Adam
Cherian Anoop
Galley Michel
Gao Jianfeng
Gunasekara Chulaka
Gupta Raghav
Hori Chiori
Huang Minlie
Kim Seokhwan
Kummerfeld Jonathan K.
Lasecki Walter S.
Lastras Luis
Lee Sungjin
Li Jinchao
Marks Tim K.
Peng Baolin
Rastogi Abhinav
Schulz Hannes
Sunkara Srinivas
Zang Xiaoxue
Publication date: 14/11/2019

This paper introduces the Eighth Dialog System Technology Challenge. In line with recent challenges, the eighth edition focuses on applying end-to-end dialog technologies in a pragmatic way for multi-domain task-completion, noetic response selection, audio visual scene-aware dialog, and schema-guided dialog state tracking tasks. This paper describes the task definition, provided datasets, and evaluation set-up for each track. We also summarize the results of the submitted systems to highlight the overall trends of the state-of-the-art technologies for the tasks. Comment: Submitted to NeurIPS 2019 3rd Conversational AI Workshop

arXiv.org e-Print Archive

Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog

Author: Feng Yang
Li Zekang
Li Zongjia
Niu Cheng
Zhang Jinchao
Zhou Jie
Publication date: 01/02/2020

Audio-Visual Scene-Aware Dialog (AVSD) is a task to generate responses when chatting about a given video, which is organized as a track of the 8th Dialog System Technology Challenge (DSTC8). To solve the task, we propose a universal multimodal transformer and introduce a multi-task learning method to learn joint representations among different modalities as well as generate informative and fluent responses. Our method extends a pre-trained natural language generation model to the multimodal dialogue generation task. Our system achieves the best performance in both objective and subjective evaluations in the challenge. Comment: Accepted by AAAI2020 DSTC8 workshop. Ranked 1st in DSTC8-AVSD track

arXiv.org e-Print Archive

Multi-step Joint-Modality Attention Network for Scene-Aware Dialogue System

Author: Chu Yun-Wei
Hsu Chao-Chun
Ku Lun-Wei
Lin Kuan-Yen
Publication date: 17/01/2020

Understanding dynamic scenes and dialogue contexts in order to converse with users has been challenging for multimodal dialogue systems. The 8th Dialog System Technology Challenge (DSTC8) proposed an Audio Visual Scene-Aware Dialog (AVSD) task, which contains multiple modalities including audio, vision, and language, to evaluate how dialogue systems understand different modalities and respond to users. In this paper, we propose a multi-step joint-modality attention network (JMAN) based on a recurrent neural network (RNN) to reason on videos. Our model performs a multi-step attention mechanism and jointly considers both visual and textual representations in each reasoning process to better integrate information from the two different modalities. Compared to the baseline released by the AVSD organizers, our model achieves relative improvements of 12.1% on ROUGE-L score and 22.4% on CIDEr score. Comment: DSTC8 collocated with AAAI202

arXiv.org e-Print Archive