Audio-visual sound separation typically assumes that every sound source is
visible in the video, which excludes invisible sounds originating beyond the
camera's view. Current methods struggle with such sounds because they lack
visual cues. This paper introduces a novel
"Audio-Visual Scene-Aware Separation" (AVSA-Sep) framework. It includes a
semantic parser for visible and invisible sounds and a separator for
scene-informed separation. AVSA-Sep successfully separates both sound types,
with joint training and cross-modal alignment enhancing effectiveness.

Comment: Accepted at ICCV 2023 - AV4D, 4 figures, 3 tables