Video Question Answering: Datasets, Algorithms and Challenges

Authors: Chua Tat-Seng, Deng Weihong, Ji Wei, Li Yicong, Xiao Junbin, Zhong Yaoyao
Publication date: 02/11/2022

Video Question Answering (VideoQA) aims to answer natural language questions about given videos. It has earned increasing attention with recent research trends in joint vision and language understanding. Yet, compared with ImageQA, VideoQA is largely underexplored and progresses slowly. Although different algorithms have continually been proposed and shown success on different VideoQA datasets, we find that a meaningful survey to categorize them is lacking, which seriously impedes the field's advancement. This paper thus provides a clear taxonomy and comprehensive analysis of VideoQA, focusing on the datasets, algorithms, and unique challenges. We then point out the research trend of moving beyond factoid QA to inference QA, towards the cognition of video contents. Finally, we conclude with some promising directions for future exploration.

Comment: Accepted by EMNLP 202

arXiv.org e-Print Archive

MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding

Authors: Dong Lu, Huang Yifei, Ling Zhen-Hua, Liu Yi, Qiao Yu, Wang Limin, Wang Yali, Zhang Hongjie
Publication date: 07/12/2023

While several long-form VideoQA datasets have been introduced, the lengths of both the videos used to curate questions and the sub-clips of clues leveraged to answer those questions have not yet reached the criteria for genuine long-form video understanding. Moreover, their QAs are unduly narrow and modality-biased, lacking a wider view of understanding long-term video content with rich dynamics and complex narratives. To remedy this, we introduce MoVQA, a long-form movie question-answering dataset and benchmark that assesses the diverse cognitive capabilities of multimodal systems across multi-level temporal lengths, considering both video length and clue length. Additionally, to take a step towards human-level understanding of long-form video, versatile and multimodal question-answering is designed from the moviegoer's perspective to assess model capabilities on various perceptual and cognitive axes. Analysis involving various baselines reveals a consistent trend: the performance of all methods deteriorates significantly with increasing video and clue length. Meanwhile, our established baseline method has shown some improvements, but there is still ample scope for enhancement on our challenging MoVQA dataset. We expect MoVQA to provide a new perspective and to encourage inspiring work on long-form video understanding research.

arXiv.org e-Print Archive