DeepStory: Video Story QA by Deep Embedded Memory Networks

Choi, Seong-Ho; Heo, Min-Oh; Kim, Kyung-Min; Zhang, Byoung-Tak

Authors: Seong-Ho Choi, Min-Oh Heo, Kyung-Min Kim, Byoung-Tak Zhang
Publication date: 4 July 2017

Abstract

Question answering (QA) on video content is a significant challenge for achieving human-level intelligence, as it involves both vision and language in real-world settings. Here we demonstrate the possibility of an AI agent performing video story QA by learning from a large number of cartoon videos. We develop a video-story learning model, Deep Embedded Memory Networks (DEMN), to reconstruct stories from a joint scene-dialogue video stream using a latent embedding space of observed data. The video stories are stored in a long-term memory component. For a given question, an LSTM-based attention model uses the long-term memory to recall the best question-story-answer triplet by focusing on specific words containing key information. We trained the DEMN on a novel QA dataset of a children's cartoon video series, Pororo. The dataset contains 16,066 scene-dialogue pairs from 20.5 hours of video, 27,328 fine-grained sentences for scene description, and 8,913 story-related QA pairs. Our experimental results show that the DEMN outperforms other QA models. This is mainly due to 1) the reconstruction of video stories in a combined scene-dialogue form that utilizes the latent embedding and 2) attention. DEMN also achieved state-of-the-art results on the MovieQA benchmark.

Comment: 7 pages, accepted for IJCAI 201
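
The recall step described in the abstract (store story sentences in memory, retrieve the one most relevant to a question, then pick the best-matching answer) can be illustrated with a toy sketch. This is not the DEMN model: it assumes bag-of-words counts in place of the learned latent embedding and plain cosine similarity in place of the LSTM-based attention. The function names and the example sentences are hypothetical and are not taken from the Pororo dataset.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a learned embedding: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def answer(question, memory, candidates):
    """Recall the most relevant story sentence, then score answer candidates."""
    q = embed(question)
    # Step 1: retrieve the stored story sentence closest to the question.
    story = max(memory, key=lambda s: cosine(q, embed(s)))
    # Step 2: score each candidate against the question plus recalled story.
    context = q + embed(story)
    best = max(candidates, key=lambda c: cosine(context, embed(c)))
    return story, best


# Invented example data, for illustration only.
memory = [
    "pororo and crong are playing with a ball in the snow",
    "loopy bakes cookies for her friends at home",
    "eddy builds a robot in his workshop",
]
candidates = ["loopy bakes cookies", "eddy flies a plane", "pororo goes fishing"]

story, best = answer("what does loopy bake for her friends", memory, candidates)
print(story)
print(best)
```

Scoring the question, the recalled story, and the candidate answer together loosely mirrors the question-story-answer triplet mentioned in the abstract; a learned model would replace both the embedding and the similarity function.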

Available Versions

Crossref

info:doi/10.24963%2Fijcai.2017...

Last time updated on 01/04/2019