Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos
Authors
Gun Hee Kim
Kangil Lee
Wonsuk Yang
Youngjae Yu
Heeseung Yun
Publication date
1 January 2021
Publisher
Institute of Electrical and Electronics Engineers (IEEE)
Abstract
© 2021 IEEE. 360° videos convey holistic views of the surroundings of a scene. They provide audio-visual cues beyond a predetermined normal field of view and display distinctive spatial relations on a sphere. However, previous benchmark tasks for panoramic videos are still limited in evaluating the semantic understanding of audio-visual relationships or of spherical spatial properties in the surroundings. We propose a novel benchmark named Pano-AVQA, a large-scale grounded audio-visual question answering dataset on panoramic videos. Using 5.4K 360° video clips harvested online, we collect two types of novel question-answer pairs with bounding-box grounding: spherical spatial relation QAs and audio-visual relation QAs. We train several transformer-based models on Pano-AVQA, where the results suggest that our proposed spherical spatial embeddings and multimodal training objectives fairly contribute to a better semantic understanding of the panoramic surroundings on the dataset.
Available Versions
SNU Open Repository and Archive
oai:s-space.snu.ac.kr:10371/18...
Last updated on 06/07/2022