Audio captioning aims to generate text descriptions of audio clips. In the
real world, many objects produce similar sounds. Accurately recognizing such
ambiguous sounds is a major challenge for audio captioning. In this work,
inspired by the inherently multimodal nature of human perception, we propose
visually-aware audio captioning, which leverages visual information to aid the
description of ambiguous sounding objects. Specifically, we introduce an off-the-shelf
visual encoder to extract video features and incorporate them into an audio
captioning system. Furthermore, to better exploit complementary audio-visual
contexts, we propose an audio-visual attention mechanism that adaptively
integrates audio and visual contexts and removes redundant information in the
latent space. Experimental results on AudioCaps, the largest
audio captioning dataset, show that our proposed method achieves
state-of-the-art results on machine translation metrics.
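
The abstract does not detail the fusion architecture; the snippet below is a
minimal illustrative sketch, assuming PyTorch, of how audio features could
attend to features from an off-the-shelf visual encoder, with a learned gate
acting as a rough stand-in for suppressing redundant visual context. All names
and dimensions (AudioVisualAttention, d_model=256, n_heads=4) are assumptions
for illustration, not the authors' implementation.

    import torch
    import torch.nn as nn

    class AudioVisualAttention(nn.Module):
        """Illustrative audio-visual attention: audio frames attend to visual
        frames; a sigmoid gate controls how much visual context is kept."""
        def __init__(self, d_model=256, n_heads=4):
            super().__init__()
            self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())
            self.norm = nn.LayerNorm(d_model)

        def forward(self, audio, visual):
            # audio:  (B, T_a, d_model) audio-encoder frame embeddings
            # visual: (B, T_v, d_model) projected visual-encoder frame embeddings
            attended, _ = self.cross_attn(query=audio, key=visual, value=visual)
            g = self.gate(torch.cat([audio, attended], dim=-1))  # per-dimension gate in [0, 1]
            fused = self.norm(audio + g * attended)              # keep only useful visual context
            return fused  # (B, T_a, d_model), passed on to the caption decoder

    # usage sketch with random tensors standing in for real encoder outputs
    audio = torch.randn(2, 100, 256)
    visual = torch.randn(2, 32, 256)
    fused = AudioVisualAttention()(audio, visual)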