While the direction of arrival (DOA) of sound events is generally estimated from
multichannel audio data recorded with a microphone array, sound events usually
originate from visually perceptible source objects, e.g., the sound of footsteps
comes from the feet of a walker. This paper proposes an audio-visual sound event
localization and detection (SELD) task, which uses multichannel audio and video
information to estimate the temporal activation and DOA of target sound events.
Audio-visual SELD systems can detect and localize sound events using signals
from a microphone array and audio-visual correspondence. We also introduce an
audio-visual dataset, Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23),
which consists of multichannel audio data recorded with a microphone array,
video data, and spatiotemporal annotation of sound events. Sound scenes in
STARSS23 are recorded with instructions, which guide recording participants to
ensure adequate activity and occurrences of sound events. STARSS23 also provides
human-annotated temporal activation labels and human-confirmed DOA labels,
which are based on tracking results of a motion capture system. Our benchmark
results demonstrate the benefits of using visual object positions in
audio-visual SELD tasks. The data is available at
https://zenodo.org/record/7880637.

Comment: 27 pages, 9 figures, accepted for publication in the NeurIPS 2023 Track
on Datasets and Benchmarks