Zero-shot audio captioning aims to automatically generate descriptive
textual captions for audio content without prior task-specific training.
Unlike speech recognition, which transcribes spoken language into text,
audio captioning is commonly concerned with ambient sounds, or sounds
produced by a human performing an action. Inspired by
zero-shot image captioning methods, we propose ZerAuCap, a novel framework for
summarising such general audio signals in a text caption without requiring
task-specific training. In particular, our framework exploits a pre-trained
large language model (LLM) to generate the text, guided by a pre-trained
audio-language model so that the produced captions describe the audio
content. Additionally, we prompt the LLM with audio context keywords that
steer the generated text towards being broadly relevant to sounds; a sketch
of this guided decoding is given below.
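To make the guidance mechanism concrete, the following is a minimal sketch of audio-language-model-guided decoding, not the authors' exact implementation: at each step the LLM proposes top-k candidate tokens, and each candidate continuation is re-scored by its similarity to the audio under a pre-trained audio-language model. Here GPT-2 (via Hugging Face transformers) stands in for the LLM, `audio_text_similarity` is a hypothetical placeholder for a CLAP-style scorer, and the prompt template and hyperparameters (`top_k`, `alpha`) are illustrative assumptions rather than values from the paper.

```python
# Sketch of guided decoding for zero-shot audio captioning.
# `audio_text_similarity` is a hypothetical placeholder for a
# pre-trained audio-language model (e.g. a CLAP-style scorer).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def audio_text_similarity(audio_embedding, texts):
    """Placeholder: similarity between the audio embedding and text
    embeddings from a pre-trained audio-language model. Should return
    a tensor of shape (len(texts),)."""
    raise NotImplementedError

@torch.no_grad()
def guided_caption(audio_embedding, keywords, top_k=15, max_len=20, alpha=0.5):
    # Audio context keywords prompt the LLM towards sound-relevant text.
    prompt = f"Objects: {', '.join(keywords)}. This is a sound of"
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_len):
        logits = lm(ids).logits[0, -1]          # next-token logits
        cand_probs, cand_ids = logits.softmax(-1).topk(top_k)
        # Re-score each candidate continuation against the audio.
        texts = [tokenizer.decode(torch.cat([ids[0], c.view(1)]))
                 for c in cand_ids]
        sim = audio_text_similarity(audio_embedding, texts)
        # alpha trades off LM fluency against audio grounding.
        best = (cand_probs.log() + alpha * sim).argmax()
        ids = torch.cat([ids, cand_ids[best].view(1, 1)], dim=1)
        if cand_ids[best].item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```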
Our proposed framework achieves state-of-the-art results in zero-shot audio
captioning on the AudioCaps and Clotho datasets. Our code is available at
https://github.com/ExplainableML/ZerAuCap.

Comment: NeurIPS 2023 - Machine Learning for Audio Workshop (Oral)