Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos

Kim, Gun Hee; Lee, Kangil; Yang, Wonsuk; Yu, Youngjae; Yun, Heeseung

Authors: Gun Hee Kim, Kangil Lee, Wonsuk Yang, Youngjae Yu, Heeseung Yun
Publication date: 1 January 2021
Publisher: Institute of Electrical and Electronics Engineers (IEEE)

Abstract

© 2021 IEEE. 360° videos convey holistic views of the surroundings of a scene. They provide audio-visual cues beyond a predetermined normal field of view and display distinctive spatial relations on a sphere. However, previous benchmark tasks for panoramic videos are still limited in their ability to evaluate the semantic understanding of audio-visual relationships or spherical spatial properties of the surroundings. We propose a novel benchmark named Pano-AVQA, a large-scale grounded audio-visual question answering dataset on panoramic videos. Using 5.4K 360° video clips harvested online, we collect two types of novel question-answer pairs with bounding-box grounding: spherical spatial relation QAs and audio-visual relation QAs. We train several transformer-based models on Pano-AVQA, and the results suggest that our proposed spherical spatial embeddings and multimodal training objectives fairly contribute to a better semantic understanding of the panoramic surroundings on the dataset.

Available Versions

SNU Open Repository and Archive

oai:s-space.snu.ac.kr:10371/18...

Last time updated on 06/07/2022