2 research outputs found
Learning Question-Guided Video Representation for Multi-Turn Video Question Answering
Understanding and conversing about dynamic scenes is one of the key
capabilities of AI agents that navigate the environment and convey useful
information to humans. Video question answering is a specific scenario of such
AI-human interaction where an agent generates a natural language response to a
question regarding the video of a dynamic scene. Incorporating features from
multiple modalities, which often provide supplementary information, is one of
the challenging aspects of video question answering. Furthermore, a question
often concerns only a small segment of the video, hence encoding the entire
video sequence using a recurrent neural network is not computationally
efficient. Our proposed question-guided video representation module efficiently
generates the token-level video summary guided by each word in the question.
The learned representations are then fused with the question to generate the
answer. Through empirical evaluation on the Audio Visual Scene-aware Dialog
(AVSD) dataset, our proposed models in single-turn and multi-turn question
answering achieve state-of-the-art performance on several automatic natural
language generation evaluation metrics.
Comment: Accepted at SIGDIAL 201
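
The abstract above describes a token-level video summary guided by each question word, which is then fused with the question. The following is a minimal sketch of that general idea, not the authors' implementation: it assumes plain scaled dot-product attention and fusion by concatenation, neither of which the abstract specifies. The function names (`question_guided_summary`, `fuse`), the shapes, and the shared feature dimension are hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def question_guided_summary(video_feats, question_embs):
    """Build one video summary vector per question token.

    video_feats:   (T, d) array, one feature vector per video frame.
    question_embs: (N, d) array, one embedding per question word.
    Returns (summaries, weights) with shapes (N, d) and (N, T).

    Assumption: scaled dot-product attention stands in for whatever
    scoring function the paper actually uses.
    """
    d = video_feats.shape[1]
    # Relevance of every frame to every question word.
    scores = question_embs @ video_feats.T / np.sqrt(d)
    # Each question word gets a distribution over the T frames.
    weights = softmax(scores, axis=1)
    # Weighted average of frame features, one row per question word.
    summaries = weights @ video_feats
    return summaries, weights


def fuse(summaries, question_embs):
    """Fuse summaries with the question by concatenation (an assumption)."""
    return np.concatenate([summaries, question_embs], axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(6, 8))     # 6 frames, 8-dim features
    question = rng.normal(size=(4, 8))  # 4 question words
    summaries, weights = question_guided_summary(video, question)
    print(summaries.shape, weights.shape, fuse(summaries, question).shape)
```

Because each question word attends over all frames in a single matrix product, this sketch does not run a recurrent network over the whole video, which matches the efficiency motivation stated in the abstract.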
Recent Advances in Video Question Answering: A Review of Datasets and Methods
Video Question Answering (VQA) is a recently emerging and challenging task in the
field of Computer Vision. Several visual information retrieval techniques like
Video Captioning/Description and Video-guided Machine Translation have preceded
the task of VQA. VQA helps to retrieve temporal and spatial information from
the video scenes and interpret it. In this survey, we review a number of
methods and datasets for the task of VQA. To the best of our knowledge, no
previous survey has been conducted for the VQA task.
Comment: 18 pages, 5 tables, Video and Image Question Answering Workshop, 25th
International Conference on Pattern Recognition