Video Question Answering: Datasets, Algorithms and Challenges
Video Question Answering (VideoQA) aims to answer natural language questions
according to the given videos. It has earned increasing attention with recent
research trends in joint vision and language understanding. Yet, compared with
ImageQA, VideoQA is largely underexplored and progresses slowly. Although
different algorithms have continually been proposed and shown success on
different VideoQA datasets, we find that a meaningful survey to categorize them
is lacking, which seriously impedes the field's advancement. This paper thus
provides a clear taxonomy and comprehensive analysis of VideoQA, focusing on
the datasets, algorithms, and unique challenges. We then point out the research
trend of moving beyond factoid QA to inference QA towards the cognition of
video contents. Finally, we conclude with some promising directions for future
exploration.
Comment: Accepted by EMNLP 202
Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question Answering
The main challenge in video question answering (VideoQA) is to capture and
understand the complex spatial and temporal relations between objects based on
given questions. Existing graph-based methods for VideoQA usually ignore
keywords in questions and employ a simple graph to aggregate features without
considering relative relations between objects, which may lead to inferior
performance. In this paper, we propose a Keyword-aware Relative Spatio-Temporal
(KRST) graph network for VideoQA. First, to make question features aware of
keywords, we employ an attention mechanism to assign high weights to keywords
during question encoding. The keyword-aware question features are then used to
guide video graph construction. Second, because relations are relative, we
integrate relative relation modeling to better capture the spatio-temporal
dynamics among object nodes. Moreover, we disentangle the spatio-temporal
reasoning into an object-level spatial graph and a frame-level temporal graph,
which reduces the interference between spatial and temporal relation reasoning.
Extensive experiments on the TGIF-QA, MSVD-QA and MSRVTT-QA datasets
demonstrate the superiority of our KRST over multiple state-of-the-art methods.
Comment: under review
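
The keyword-aware question encoding described in the abstract above can be illustrated with a generic attention-pooling step. This is only a minimal sketch under stated assumptions, not the KRST implementation: the function names, the toy token features, and the fixed scoring vector are all hypothetical, and a real model would learn the scoring parameters rather than fix them by hand.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(token_features, scoring_vector):
    """Pool per-token features into one question feature.

    Each token gets a scalar score (dot product with a scoring vector,
    assumed here; learned in practice), scores are normalized with
    softmax, and the question feature is the weighted sum of tokens.
    """
    scores = [
        sum(f * w for f, w in zip(feat, scoring_vector))
        for feat in token_features
    ]
    weights = softmax(scores)
    dim = len(token_features[0])
    pooled = [
        sum(w * feat[d] for w, feat in zip(weights, token_features))
        for d in range(dim)
    ]
    return weights, pooled


# Toy example (hypothetical values): content words carry larger features,
# so they receive higher attention weights than function words.
tokens = ["what", "is", "the", "dog", "chasing"]
features = [[0.1, 0.0], [0.0, 0.1], [0.0, 0.0], [1.0, 0.5], [0.8, 1.0]]
scoring_vector = [2.0, 2.0]

weights, question_feature = attention_pool(features, scoring_vector)
for tok, w in zip(tokens, weights):
    print(f"{tok:8s} {w:.3f}")
```

In this toy setting the weights sum to one and concentrate on the tokens with the largest scores; the pooled vector stands in for the keyword-aware question feature that would then guide graph construction.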