A Comprehensive Survey of Automated Audio Captioning
Automated audio captioning, a task that mimics human perception and
innovatively links audio processing with natural language processing, has seen
much progress over the last few years. Audio captioning requires recognizing
the acoustic scene, the primary audio events, and sometimes the spatial and
temporal relationships between events in an audio clip. It also requires
describing these elements in fluent and vivid sentences. Deep learning-based
approaches are widely adopted to tackle this problem. This paper presents a
comprehensive review covering the benchmark datasets, existing deep learning
techniques, and the evaluation metrics in automated audio captioning.
Enhance Temporal Relations in Audio Captioning with Sound Event Detection
Automated audio captioning aims at generating natural language descriptions
for given audio clips, not only detecting and classifying sounds, but also
summarizing the relationships between audio events. Recent research advances in
audio captioning have introduced additional guidance to improve the accuracy of
audio events in generated sentences. However, temporal relations between audio
events have received little attention, even though revealing such complex
relations is a key component of summarizing audio content. Therefore, this
paper aims to
better capture temporal relationships in caption generation with sound event
detection (SED), a task that locates events' timestamps. We investigate the
best approach to integrate temporal information in a captioning model and
propose a temporal tag system to transform the timestamps into comprehensible
relations. Results evaluated with the proposed temporal metrics suggest that
substantial improvement is achieved in temporal relation generation.
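The abstract does not detail the temporal tag system itself. As a hedged
sketch only, the Python snippet below (all names, labels, and the tolerance
value are illustrative assumptions, not the authors' implementation) shows one
way SED timestamps could be mapped to coarse relations such as "before" or
"overlapping".

```python
from dataclasses import dataclass

# Hypothetical illustration only: the paper's actual temporal tag system is
# not described in the abstract, so names and thresholds here are assumptions.

@dataclass
class SoundEvent:
    label: str
    onset: float   # start time in seconds, as output by an SED model
    offset: float  # end time in seconds

def temporal_tag(a: SoundEvent, b: SoundEvent, tol: float = 0.5) -> str:
    """Map two detected events' timestamps to a coarse, readable relation."""
    if a.offset <= b.onset + tol:
        return f"{a.label} before {b.label}"
    if b.offset <= a.onset + tol:
        return f"{b.label} before {a.label}"
    return f"{a.label} overlapping with {b.label}"

# Example: a dog bark ending before a passing car starts.
dog = SoundEvent("dog barking", onset=0.0, offset=2.1)
car = SoundEvent("car passing by", onset=3.0, offset=6.5)
print(temporal_tag(dog, car))  # dog barking before car passing by
```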
Improving Audio Caption Fluency with Automatic Error Correction
Automated audio captioning (AAC) is an important cross-modality translation
task, aiming at generating descriptions for audio clips. However, captions
generated by previous AAC models have suffered from "false-repetition" errors
due to the training objective. To address this, we propose a new task of AAC
error correction, aiming to reduce such errors by post-processing AAC outputs.
To tackle this problem, we use observation-based rules to corrupt error-free
captions and generate pseudo grammatically-erroneous sentences. Each pair of
corrupted and clean sentences can then be used for training. We train a neural
network-based model on the synthetic error dataset and apply it to correct
real errors in AAC outputs. Results on two benchmark datasets indicate that
our approach significantly improves fluency while maintaining semantic
information.
Comment: Accepted by NCMMSC 202
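The specific corruption rules are not given in the abstract. The following is
a minimal, hypothetical sketch of one plausible rule: duplicating a short span
of a clean caption to imitate a "false-repetition" error, yielding a
(corrupted, clean) pair for training a correction model.

```python
import random

# Hypothetical illustration: the concrete observation-based rules are not
# described in the abstract. This sketch imitates a "false-repetition" error
# by duplicating a short span of a clean caption.

def corrupt_with_repetition(caption: str, span_len: int = 3, seed: int = 0) -> str:
    rng = random.Random(seed)
    words = caption.split()
    if len(words) <= span_len:
        return caption  # too short to corrupt meaningfully
    start = rng.randrange(len(words) - span_len)
    span = words[start:start + span_len]
    # Re-insert the chosen span right after its original occurrence.
    corrupted = words[:start + span_len] + span + words[start + span_len:]
    return " ".join(corrupted)

clean = "a dog barks while a car passes by in the distance"
noisy = corrupt_with_repetition(clean)
print((noisy, clean))  # (pseudo-erroneous input, correction target)
```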
BLAT: Bootstrapping Language-Audio Pre-training based on AudioSet Tag-guided Synthetic Data
Compared with ample visual-text pre-training research, few works explore
audio-text pre-training, mostly due to the lack of sufficient parallel
audio-text data. Most existing methods incorporate the visual modality as a
pivot for audio-text pre-training, which inevitably induces data noise. In this
paper, we propose BLAT: Bootstrapping Language-Audio pre-training based on
Tag-guided synthetic data. We utilize audio captioning to generate text
directly from audio, without the aid of the visual modality, so that potential
noise from modality mismatch is eliminated. Furthermore, we propose caption
generation under the guidance of AudioSet tags, leading to more accurate
captions. With the above two improvements, we curate high-quality, large-scale
parallel audio-text data, based on which we perform audio-text pre-training.
Evaluation on a series of downstream tasks indicates that BLAT achieves
state-of-the-art (SOTA) zero-shot classification performance on most datasets
and significant performance improvements when fine-tuned, suggesting the
effectiveness of our synthetic data.
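BLAT's encoders and training pipeline are not reproduced here. The sketch
below only illustrates, under assumed interfaces, how zero-shot classification
with a contrastive audio-text model is typically scored: an audio embedding is
compared against text embeddings of candidate labels. The encode_audio /
encode_text placeholders are assumptions, not BLAT's actual API.

```python
import numpy as np

# Hypothetical illustration: the random vectors below stand in for the outputs
# of BLAT's pre-trained audio and text encoders, which are assumptions here.

def zero_shot_classify(audio_emb: np.ndarray,
                       label_embs: np.ndarray,
                       labels: list) -> str:
    """Return the label whose text embedding is most cosine-similar to the audio."""
    audio_emb = audio_emb / np.linalg.norm(audio_emb)
    label_embs = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    scores = label_embs @ audio_emb  # cosine similarities
    return labels[int(np.argmax(scores))]

labels = ["dog barking", "siren", "rain"]
rng = np.random.default_rng(0)
audio_emb = rng.normal(size=512)        # placeholder for encode_audio(clip)
label_embs = rng.normal(size=(3, 512))  # placeholder for encode_text(labels)
print(zero_shot_classify(audio_emb, label_embs, labels))
```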
- …