DramaQA: Character-Centered Video Story Understanding with Hierarchical QA
Despite recent progress in computer vision and natural language processing,
video understanding intelligence remains hard to achieve due to the intrinsic
difficulty of stories in video. Moreover, there is no principled metric for
evaluating the degree of video understanding. In this paper, we propose a
novel video question answering (Video QA) task, DramaQA, for comprehensive
understanding of video stories. DramaQA focuses on two perspectives: 1)
hierarchical QAs as an evaluation metric based on the cognitive developmental
stages of human intelligence, and 2) character-centered video annotations to
model the local coherence of the story. Our dataset is built upon the TV drama
"Another Miss Oh" and contains 16,191 QA pairs from 23,928 video clips of
various lengths, with each QA pair belonging to one of four difficulty levels.
We provide 217,308 annotated images with rich character-centered annotations,
including visual bounding boxes, behaviors, and emotions of main characters,
as well as coreference-resolved scripts. Additionally, we provide analyses of
the dataset along with a Dual Matching Multistream model that effectively
learns character-centered representations of video to answer questions about
it. We plan to release our dataset and model publicly for research purposes
and expect our work to provide a new perspective on video story understanding
research.

Comment: 21 pages, 10 figures, submitted to ECCV 2020
DeepStory: Video Story QA by Deep Embedded Memory Networks
Question answering (QA) on video content is a significant challenge for
achieving human-level intelligence, as it involves both vision and language
in real-world settings. Here we demonstrate the possibility of an AI agent
performing video story QA by learning from a large number of cartoon videos.
We develop a video-story learning model, Deep Embedded Memory Networks
(DEMN), that reconstructs stories from a joint scene-dialogue video stream
using a latent embedding space of observed data. The video stories are stored
in a long-term memory component. For a given question, an LSTM-based
attention model uses the long-term memory to recall the best
question-story-answer triplet by focusing on specific words containing key
information. We trained the DEMN on a novel QA dataset built from the
children's cartoon video series Pororo. The dataset contains 16,066
scene-dialogue pairs from 20.5 hours of video, 27,328 fine-grained
scene-description sentences, and 8,913 story-related QA pairs. Our
experimental results show that the DEMN outperforms other QA models, mainly
due to 1) the reconstruction of video stories in a combined scene-dialogue
form that utilizes the latent embedding, and 2) attention. The DEMN also
achieved state-of-the-art results on the MovieQA benchmark.

Comment: 7 pages, accepted for IJCAI 2017
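The abstract outlines DEMN's recall step only at a high level; the sketch below approximates it with plain dot-product attention over a long-term memory of story embeddings. This is an assumption-laden illustration (the actual model uses an LSTM-based attention module and learned embeddings), not the authors' implementation; the function name and shapes are invented for the example.

```python
import numpy as np

# Sketch of DEMN-style recall, assuming story sentences and the question
# already live in a shared latent embedding space of dimension d.

def recall_best_answer(question_vec, memory, answers):
    """memory: (num_story_sentences, d) array of story embeddings.
    answers: list of (answer_vec, answer_text) candidate pairs."""
    # Attention over long-term memory: softmax of dot-product scores.
    scores = memory @ question_vec
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    context = weights @ memory  # attended story representation

    # Pick the candidate whose embedding best matches question + context,
    # a stand-in for scoring question-story-answer triplets.
    best = max(answers, key=lambda a: a[0] @ (question_vec + context))
    return best[1]

# Toy usage with random embeddings (d = 8).
rng = np.random.default_rng(0)
d = 8
memory = rng.normal(size=(5, d))
question = rng.normal(size=d)
answers = [(rng.normal(size=d), f"candidate {i}") for i in range(4)]
print(recall_best_answer(question, memory, answers))
```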