Automated Audio Captioning with Recurrent Neural Networks
We present the first approach to automated audio captioning. We employ an
encoder-decoder scheme with an alignment model in between. The input to the
encoder is a sequence of log mel-band energies calculated from an audio file,
while the output is a sequence of words, i.e. a caption. The encoder is a
multi-layered, bi-directional gated recurrent unit (GRU) and the decoder a
multi-layered GRU with a classification layer connected to the last GRU of the
decoder. The classification layer and the alignment model are fully connected
layers with shared weights between timesteps. The proposed method is evaluated
using data drawn from a commercial sound effects library, ProSound Effects. The
resulting captions were rated using metrics from the machine translation and
image captioning fields. The results show that the proposed method can predict
words appearing in the original caption, although not always in the correct
order. Comment: Presented at the 11th IEEE Workshop on Applications of Signal
Processing to Audio and Acoustics (WASPAA), 2017.
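To make the described architecture concrete, the following is a minimal PyTorch sketch of an encoder-decoder of this kind: a multi-layered bi-directional GRU encoder over log mel-band energies and a GRU decoder topped by a classification layer. The layer sizes, vocabulary size, and the pooled context used here in place of the paper's alignment model are illustrative assumptions, not the authors' exact implementation.

    # Minimal encoder-decoder sketch (sizes are illustrative, not the authors' setup).
    import torch
    import torch.nn as nn

    class CaptionModel(nn.Module):
        def __init__(self, n_mels=64, hidden=256, vocab_size=1000, max_len=20):
            super().__init__()
            # Multi-layered, bi-directional GRU encoder over log mel-band energies.
            self.encoder = nn.GRU(n_mels, hidden, num_layers=3,
                                  bidirectional=True, batch_first=True)
            # Fully connected projection shared across timesteps, standing in
            # for the paper's alignment model (simplified to mean pooling below).
            self.align = nn.Linear(2 * hidden, hidden)
            # Multi-layered GRU decoder with a classification layer on top.
            self.decoder = nn.GRU(hidden, hidden, num_layers=2, batch_first=True)
            self.classifier = nn.Linear(hidden, vocab_size)
            self.max_len = max_len

        def forward(self, mel):                        # mel: (batch, time, n_mels)
            enc_out, _ = self.encoder(mel)             # (batch, time, 2 * hidden)
            context = self.align(enc_out).mean(dim=1)  # pooled context vector
            dec_in = context.unsqueeze(1).repeat(1, self.max_len, 1)
            dec_out, _ = self.decoder(dec_in)          # (batch, max_len, hidden)
            return self.classifier(dec_out)            # word logits per output step

    model = CaptionModel()
    logits = model(torch.randn(2, 500, 64))            # two clips, 500 frames each
    print(logits.shape)                                # torch.Size([2, 20, 1000])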
Audio Caption: Listen and Tell
An increasing amount of research has shed light on machine perception of audio
events, most of which concerns detection and classification tasks. However,
human-like perception of audio scenes involves not only detecting and
classifying audio sounds, but also summarizing the relationship between
different audio events. Comparable research, such as image captioning, has been
conducted, yet the audio field remains relatively barren. This paper introduces a
manually annotated dataset for audio captioning. The purpose is to automatically
generate natural sentences for audio scene description and to bridge the gap
between machine perception of audio and image. The whole dataset is labelled in
Mandarin and we also include translated English annotations. A baseline
encoder-decoder model is provided for both English and Mandarin. Similar BLEU
scores are derived for both languages: our model can generate understandable
and data-related captions based on the dataset. Comment: Accepted by ICASSP 2019.
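As an illustration of the BLEU-based evaluation mentioned above, caption hypotheses can be scored against reference annotations with NLTK; the captions below are invented for the example and are not taken from the dataset.

    # Hypothetical BLEU scoring of a generated caption against a reference.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    references = [["a", "man", "is", "speaking", "while", "water", "is", "running"]]
    hypothesis = ["a", "man", "speaks", "and", "water", "runs"]

    score = sentence_bleu(references, hypothesis,
                          smoothing_function=SmoothingFunction().method1)
    print(f"BLEU: {score:.3f}")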
A Comprehensive Survey of Automated Audio Captioning
Automated audio captioning, a task that mimics human perception and links audio
processing with natural language processing, has seen much progress over the
last few years. Audio captioning requires
recognizing the acoustic scene, primary audio events and sometimes the spatial
and temporal relationship between events in an audio clip. It also requires
describing these elements in a fluent and vivid sentence. Deep learning-based
approaches are widely adopted to tackle this problem. This paper presents a
comprehensive review covering the benchmark datasets, existing deep learning
techniques, and the evaluation metrics in automated audio captioning.
Temporal Sub-sampling of Audio Feature Sequences for Automated Audio Captioning
Audio captioning is the task of automatically creating a textual description
for the contents of a general audio signal. Typical audio captioning methods
rely on deep neural networks (DNNs), where the target of the DNN is to map the
input audio sequence to an output sequence of words, i.e. the caption. However,
the length of the textual description is considerably shorter than that of the
audio signal, for example 10 words versus some thousands of audio feature
vectors. This clearly indicates that an output word corresponds to multiple
input feature vectors. In this work, we present an approach that explicitly
exploits this difference in sequence lengths by applying temporal sub-sampling
to the audio input sequence. We employ a
sequence-to-sequence method, which uses a fixed-length vector as an output from
the encoder, and we apply temporal sub-sampling between the RNNs of the
encoder. We evaluate the benefit of our approach on the freely available Clotho
dataset and assess the impact of different temporal sub-sampling factors. Our
results show an improvement in all considered metrics.
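As a rough sketch of the idea, temporal sub-sampling between stacked encoder RNNs can be implemented by keeping every k-th frame of one GRU's output before feeding the next; the sub-sampling factor and layer sizes below are assumptions and not the paper's exact configuration.

    # Encoder with temporal sub-sampling between its RNN layers (factor is illustrative).
    import torch
    import torch.nn as nn

    class SubSamplingEncoder(nn.Module):
        def __init__(self, n_mels=64, hidden=256, factor=2):
            super().__init__()
            self.rnn1 = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
            self.rnn2 = nn.GRU(2 * hidden, hidden, batch_first=True, bidirectional=True)
            self.rnn3 = nn.GRU(2 * hidden, hidden, batch_first=True, bidirectional=True)
            self.factor = factor

        def forward(self, x):                  # x: (batch, time, n_mels)
            h, _ = self.rnn1(x)
            h = h[:, ::self.factor, :]         # keep every `factor`-th frame
            h, _ = self.rnn2(h)
            h = h[:, ::self.factor, :]         # sub-sample again between RNNs
            h, _ = self.rnn3(h)
            return h[:, -1, :]                 # fixed-length vector for the decoder

    enc = SubSamplingEncoder()
    z = enc(torch.randn(2, 1000, 64))          # 1000 frames -> 500 -> 250 internally
    print(z.shape)                             # torch.Size([2, 512])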
Graph Attention for Automated Audio Captioning
State-of-the-art audio captioning methods typically use the encoder-decoder
structure with pretrained audio neural networks (PANNs) as encoders for feature
extraction. However, the convolution operation used in PANNs is limited in
capturing the long-time dependencies within an audio signal, thereby leading to
potential performance degradation in audio captioning. This letter presents a
novel method using graph attention (GraphAC) for encoder-decoder based audio
captioning. In the encoder, a graph attention module is introduced after the
PANNs to learn contextual association (i.e. the dependency among the audio
features over different time frames) through an adjacency graph, and a top-k
mask is used to mitigate the interference from noisy nodes. The learnt
contextual association leads to a more effective feature representation with
feature node aggregation. As a result, the decoder can predict important
semantic information about the acoustic scene and events based on the
contextual associations learned from the audio signal. Experimental results
show that GraphAC outperforms the state-of-the-art methods with PANNs as the
encoders, thanks to the incorporation of the graph attention module into the
encoder for capturing the long-time dependencies within the audio signal. The
source code is available at https://github.com/LittleFlyingSheep/GraphAC. Comment: Accepted by IEEE Signal Processing Letters.
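The sketch below shows one way a graph-attention block with a top-k adjacency mask over frame-level audio features could be written; the dimensions and attention formulation are assumptions rather than the released GraphAC code, which is available at the repository above.

    # Graph-attention block with a top-k adjacency mask (illustrative, not GraphAC itself).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GraphAttentionBlock(nn.Module):
        def __init__(self, dim=512, k=25):
            super().__init__()
            self.query = nn.Linear(dim, dim)
            self.key = nn.Linear(dim, dim)
            self.value = nn.Linear(dim, dim)
            self.k = k

        def forward(self, feats):                      # feats: (batch, frames, dim)
            q, kmat, v = self.query(feats), self.key(feats), self.value(feats)
            scores = q @ kmat.transpose(1, 2) / feats.size(-1) ** 0.5
            # Top-k mask: each node keeps only its k strongest neighbours.
            kth = scores.topk(self.k, dim=-1).values[..., -1:]
            scores = scores.masked_fill(scores < kth, float("-inf"))
            adj = F.softmax(scores, dim=-1)            # masked adjacency
            return adj @ v + feats                     # aggregate neighbours + residual

    block = GraphAttentionBlock()
    out = block(torch.randn(2, 100, 512))              # e.g. 100 frame embeddings from PANNs
    print(out.shape)                                   # torch.Size([2, 100, 512])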
Enhance Temporal Relations in Audio Captioning with Sound Event Detection
Automated audio captioning aims at generating natural language descriptions
for given audio clips, not only detecting and classifying sounds, but also
summarizing the relationships between audio events. Recent research advances in
audio captioning have introduced additional guidance to improve the accuracy of
audio events in generated sentences. However, temporal relations between audio
events have received little attention, even though revealing such complex
relations is a key component of summarizing audio content. Therefore, this paper aims to
better capture temporal relationships in caption generation with sound event
detection (SED), a task that locates events' timestamps. We investigate the
best approach to integrate temporal information in a captioning model and
propose a temporal tag system to transform the timestamps into comprehensible
relations. Results evaluated with the proposed temporal metrics suggest that a
substantial improvement is achieved in temporal relation generation.
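As a hypothetical illustration of how SED timestamps might be mapped to temporal-relation tags, a simple rule-based function could look as follows; the actual tag categories and thresholds used in the paper may differ.

    # Hypothetical mapping from SED timestamps to coarse temporal-relation tags.
    def temporal_tag(event_a, event_b, overlap_ratio=0.5):
        """Each event is an (onset_seconds, offset_seconds) pair."""
        (a_on, a_off), (b_on, b_off) = event_a, event_b
        overlap = max(0.0, min(a_off, b_off) - max(a_on, b_on))
        shorter = min(a_off - a_on, b_off - b_on)
        if shorter > 0 and overlap / shorter >= overlap_ratio:
            return "simultaneous"
        return "before" if a_on < b_on else "after"

    # Example: a dog bark (0-2 s) and a door slam (5-6 s) detected by an SED model.
    print(temporal_tag((0.0, 2.0), (5.0, 6.0)))   # before
    print(temporal_tag((1.0, 4.0), (2.0, 5.0)))   # simultaneous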