Video Question Answering: Datasets, Algorithms and Challenges

Author: Chua Tat-Seng
Deng Weihong
Ji Wei
Li Yicong
Xiao Junbin
Zhong Yaoyao
Publication date: 02/11/2022

Video Question Answering (VideoQA) aims to answer natural language questions about given videos. It has attracted increasing attention with recent research trends in joint vision and language understanding. Yet, compared with ImageQA, VideoQA is largely underexplored and progresses slowly. Although different algorithms have continually been proposed and shown success on different VideoQA datasets, we find that a meaningful survey categorizing them is lacking, which seriously impedes the field's advancement. This paper thus provides a clear taxonomy and comprehensive analyses of VideoQA, focusing on the datasets, algorithms, and unique challenges. We then point out the research trend of moving beyond factoid QA to inference QA, towards the cognition of video content. Finally, we outline some promising directions for future exploration.
Comment: Accepted by EMNLP 202

arXiv.org e-Print Archive

Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question Answering

Author: Cheng Yi
Fan Hehe
Kankanhalli Mohan
Lim Joo-Hwee
Lin Dongyun
Sun Ying
Publication date: 25/07/2023

The main challenge in video question answering (VideoQA) is to capture and understand the complex spatial and temporal relations between objects based on given questions. Existing graph-based methods for VideoQA usually ignore keywords in questions and employ a simple graph to aggregate features without considering relative relations between objects, which may lead to inferior performance. In this paper, we propose a Keyword-aware Relative Spatio-Temporal (KRST) graph network for VideoQA. First, to make question features aware of keywords, we employ an attention mechanism to assign high weights to keywords during question encoding. The keyword-aware question features are then used to guide video graph construction. Second, because relations are relative, we integrate relative relation modeling to better capture the spatio-temporal dynamics among object nodes. Moreover, we disentangle the spatio-temporal reasoning into an object-level spatial graph and a frame-level temporal graph, which reduces the interference between spatial and temporal relation reasoning. Extensive experiments on the TGIF-QA, MSVD-QA and MSRVTT-QA datasets demonstrate the superiority of our KRST over multiple state-of-the-art methods.
Comment: under review

arXiv.org e-Print Archive
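
The keyword-aware question encoding mentioned in the KRST abstract can be illustrated with a generic attention-pooling sketch. This is not the authors' implementation: the function name `keyword_attention_pool`, the scoring vector `score_vec`, and the toy word features are all hypothetical, and a plain dot-product softmax stands in for whatever attention form the paper actually uses.

```python
import math

def keyword_attention_pool(word_feats, score_vec):
    """Pool per-word features into one question feature via softmax attention.

    word_feats: list of per-word feature vectors (hypothetical toy inputs).
    score_vec:  hypothetical scoring vector; in a trained model this would be learned.
    """
    # One relevance score per word: dot product with the scoring vector.
    scores = [sum(x * s for x, s in zip(feat, score_vec)) for feat in word_feats]
    # Softmax with max-subtraction for numerical stability.
    top = max(scores)
    exps = [math.exp(v - top) for v in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of word features: high-scoring (keyword-like) words dominate.
    dim = len(word_feats[0])
    pooled = [sum(wt * feat[j] for wt, feat in zip(weights, word_feats))
              for j in range(dim)]
    return pooled, weights

# Toy example: the third word scores highest and so receives the largest weight.
word_feats = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
score_vec = [1.0, 0.0]
question_feat, attn = keyword_attention_pool(word_feats, score_vec)
```

Here `attn` sums to one and `question_feat` has the same dimensionality as a single word feature, so it could in principle be passed on to a later stage such as graph construction.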