CNN Architectures for Large-Scale Audio Classification
Convolutional Neural Networks (CNNs) have proven very effective in image
classification and show promise for audio. We use various CNN architectures to
classify the soundtracks of a dataset of 70M training videos (5.24 million
hours) with 30,871 video-level labels. We examine fully connected Deep Neural
Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We
investigate varying the size of both training set and label vocabulary, finding
that analogs of the CNNs used in image classification do well on our audio
classification task, and larger training and label sets help up to a point. A
model using embeddings from these classifiers does much better than raw
features on the Audio Set [5] Acoustic Event Detection (AED) classification
task.
Comment: Accepted for publication at ICASSP 2017. Changes: added definitions of mAP, AUC, and d-prime; updated mAP/AUC/d-prime numbers for Audio Set based on changes in the latest Audio Set revision; reworded to fit the 4-page limit with the new additions.
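The comment above mentions adding definitions of mAP, AUC, and d-prime. As an illustrative sketch (not the authors' code), balanced d-prime can be derived from AUC under an equal-variance Gaussian assumption as d' = sqrt(2) * Phi^{-1}(AUC), where Phi^{-1} is the inverse standard-normal CDF:

```python
import math
import statistics

def auc(pos_scores, neg_scores):
    """Rank-based AUC: probability that a random positive example
    scores higher than a random negative one (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def d_prime(auc_value):
    """d' corresponding to an AUC, assuming equal-variance Gaussians.
    Requires 0 < auc_value < 1 (d' diverges at AUC = 1)."""
    return math.sqrt(2.0) * statistics.NormalDist().inv_cdf(auc_value)
```

A chance-level classifier (AUC = 0.5) gives d' = 0, and d' grows without bound as AUC approaches 1, which is why d-prime is sometimes preferred for comparing strong classifiers.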
Weakly Labelled AudioSet Tagging with Attention Neural Networks
Audio tagging is the task of predicting the presence or absence of sound
classes within an audio clip. Previous work in audio tagging focused on
relatively small datasets limited to recognising a small number of sound
classes. We investigate audio tagging on AudioSet, which is a dataset
consisting of over 2 million audio clips and 527 classes. AudioSet is weakly
labelled, in that only the presence or absence of sound classes is known for
each clip, while the onset and offset times are unknown. To address the
weakly-labelled audio tagging problem, we propose attention neural networks as
a way to attend to the most salient parts of an audio clip. We bridge the
connection between attention neural networks and multiple instance learning
(MIL) methods, and propose decision-level and feature-level attention neural
networks for audio tagging. We investigate attention neural networks modeled by
different functions, depths and widths. Experiments on AudioSet show that the
feature-level attention neural network achieves a state-of-the-art mean average
precision (mAP) of 0.369, outperforming the best multiple instance learning
(MIL) method of 0.317 and Google's deep neural network baseline of 0.314. In
addition, we discover that the audio tagging performance on AudioSet embedding
features has a weak correlation with the number of training samples and the
quality of labels of each sound class.
Comment: 13 pages
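A minimal sketch of the decision-level attention pooling described above (the weights and dimensions here are hypothetical placeholders, not the paper's configuration): a classifier branch gives per-segment class probabilities, an attention branch gives normalised weights over time, and the clip-level prediction is their weighted sum:

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decision_level_attention(h, w_cls, w_att):
    """h: (T, D) segment embeddings; w_cls, w_att: (D, C) weights.
    Returns (C,) clip-level class probabilities."""
    p = sigmoid(h @ w_cls)          # (T, C) per-segment class probabilities
    a = softmax(h @ w_att, axis=0)  # (T, C) attention weights over time
    return (a * p).sum(axis=0)      # attention-weighted pooling over segments

rng = np.random.default_rng(0)
T, D, C = 10, 8, 4
h = rng.standard_normal((T, D))
w_cls = rng.standard_normal((D, C)) * 0.1
w_att = rng.standard_normal((D, C)) * 0.1
clip_probs = decision_level_attention(h, w_cls, w_att)
```

Because the attention weights sum to one over time for each class, the clip-level output stays a valid probability, while salient segments contribute more than silent ones.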
Multilingual Adaptation of RNN Based ASR Systems
In this work, we focus on multilingual systems based on recurrent neural
networks (RNNs), trained using the Connectionist Temporal Classification (CTC)
loss function. Using a multilingual set of acoustic units poses difficulties.
To address this issue, we previously proposed Language Feature Vectors (LFVs) to train
language adaptive multilingual systems. Language adaptation, in contrast to
speaker adaptation, needs to be applied not only on the feature level, but also
to deeper layers of the network. In this work, we therefore extended our
previous approach by introducing a novel technique which we call "modulation".
Based on this method, we modulated the hidden layers of RNNs using LFVs. We
evaluated this approach in both full and low resource conditions, as well as
for grapheme and phone based systems. The modulation lowered error rates
across all of these conditions.
Comment: 5 pages, 1 figure, to appear in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018)
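A minimal sketch of the "modulation" idea, under the assumption that the LFV is projected to per-unit gates that scale the RNN's hidden activations (the dimensions and weights here are hypothetical placeholders):

```python
import numpy as np

def modulate(hidden, lfv, W_mod):
    """Scale each hidden unit by a language-dependent gate in (0, 1).
    hidden: (T, H) RNN hidden states; lfv: (L,) language feature vector;
    W_mod: (H, L) projection from the LFV to per-unit gates."""
    gate = 1.0 / (1.0 + np.exp(-(W_mod @ lfv)))  # (H,) sigmoid gates
    return hidden * gate                          # broadcast over time steps

rng = np.random.default_rng(1)
hidden = rng.standard_normal((5, 16))  # 5 time steps, 16 hidden units
lfv = rng.standard_normal(3)           # hypothetical 3-dim language code
W_mod = rng.standard_normal((16, 3))
modulated = modulate(hidden, lfv, W_mod)
```

The same network weights are shared across languages; only the gates change per language, which is what lets the modulation reach deeper layers than feature-level adaptation alone.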
Sound Event Detection with Sequentially Labelled Data Based on Connectionist Temporal Classification and Unsupervised Clustering
Sound event detection (SED) methods typically rely on either strongly
labelled data or weakly labelled data. As an alternative, sequentially labelled
data (SLD) was proposed. In SLD, the events and the order of events in audio
clips are known, without knowing the occurrence time of events. This paper
proposes a connectionist temporal classification (CTC) based SED system that
uses SLD instead of strongly labelled data, with a novel unsupervised
clustering stage. Experiments on 41 classes of sound events show that the
proposed two-stage method trained on SLD achieves performance comparable to the
previous state-of-the-art SED system trained on strongly labelled data, and is
far better than another state-of-the-art SED system trained on weakly labelled
data. This indicates the effectiveness of the proposed two-stage method, which
is trained on SLD without any onset/offset times of sound events.
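A sequential label in SLD corresponds to the CTC collapse of a framewise label path: merge consecutive repeats, then drop blanks. A minimal sketch of that many-to-one mapping:

```python
def ctc_collapse(path, blank=0):
    """Map a framewise label path to its event sequence:
    merge consecutive repeated labels, then drop blanks."""
    out = []
    prev = None
    for label in path:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out
```

Note that a blank between two identical labels keeps them distinct, so repeated events in a clip are still recoverable from the path.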
Reducing Model Complexity for DNN Based Large-Scale Audio Classification
Audio classification is the task of identifying the sound categories that are
associated with a given audio signal. This paper presents an investigation on
large-scale audio classification based on the recently released AudioSet
database. AudioSet comprises 2 million audio samples from YouTube, which
are human-annotated with 527 sound category labels. Audio classification
experiments with the balanced training set and the evaluation set of AudioSet
are carried out by applying different types of neural network models. The
classification performance and the model complexity of these models are
compared and analyzed. While the CNN models show better performance than the
MLP and RNN models, their model complexity is relatively high and undesirable for practical
use. We propose two different strategies that aim at constructing
low-dimensional embedding feature extractors and hence reducing the number of
model parameters. It is shown that the simplified CNN model has only 1/22 of
the parameters of the original model, with only a slight degradation of
performance.
Comment: Accepted by ICASSP 201
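The parameter savings from a low-dimensional embedding layer can be illustrated by counting fully connected weights; the layer sizes below are hypothetical, not the paper's actual configuration:

```python
def dense_params(layer_dims):
    """Total weights + biases of a stack of fully connected layers."""
    return sum(d_in * d_out + d_out
               for d_in, d_out in zip(layer_dims[:-1], layer_dims[1:]))

# Hypothetical dimensions for illustration only:
wide = dense_params([2048, 2048, 527])       # wide penultimate layer
bottleneck = dense_params([2048, 128, 527])  # low-dimensional embedding
ratio = wide / bottleneck
```

Because the final classifier's size scales with the embedding dimension, shrinking that one layer cuts parameters in both the layer before it and the output layer after it.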
Learning Audio Sequence Representations for Acoustic Event Classification
Acoustic Event Classification (AEC) has become a significant task for
machines to perceive the surrounding auditory scene. However, extracting
effective representations that capture the underlying characteristics of the
acoustic events is still challenging. Previous methods mainly focused on
designing the audio features in a 'hand-crafted' manner. Interestingly,
data-learnt features have been recently reported to show better performance. Up
to now, these were only considered on the frame-level. In this paper, we
propose an unsupervised learning framework to learn a vector representation of
an audio sequence for AEC. This framework consists of a Recurrent Neural
Network (RNN) encoder and an RNN decoder: the encoder transforms the
variable-length audio sequence into a fixed-length vector, and the decoder
reconstructs the input sequence from that vector. After training the encoder-decoder, we
feed the audio sequences to the encoder and then take the learnt vectors as the
audio sequence representations. Compared with previous methods, the proposed
method can not only handle audio streams of arbitrary length, but also learn
the salient information of the sequence. Extensive evaluation on a large
acoustic event database is performed, and the empirical results demonstrate
that the learnt audio sequence representation outperforms other
state-of-the-art hand-crafted sequence features for AEC by a large margin.
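A minimal sketch of the encoder half of this idea: a simple tanh RNN folds a variable-length sequence into its final hidden state, giving a fixed-length vector regardless of input length (the weights here are random placeholders, not a trained model):

```python
import numpy as np

def rnn_encode(x, Wx, Wh, b):
    """Fold a variable-length sequence x of shape (T, D) into the final
    hidden state of a simple tanh RNN: a fixed-length vector of shape (H,)."""
    h = np.zeros(Wh.shape[0])
    for x_t in x:
        h = np.tanh(Wx @ x_t + Wh @ h + b)
    return h

rng = np.random.default_rng(2)
D, H = 6, 12
Wx = rng.standard_normal((H, D)) * 0.1
Wh = rng.standard_normal((H, H)) * 0.1
b = np.zeros(H)
short_clip = rng.standard_normal((4, D))  # 4-frame clip
long_clip = rng.standard_normal((9, D))   # 9-frame clip
v_short = rnn_encode(short_clip, Wx, Wh, b)
v_long = rnn_encode(long_clip, Wx, Wh, b)
```

Both clips map to vectors of the same size, which is what makes the learnt representation directly usable as a fixed-dimensional feature for a downstream AEC classifier.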