A joint separation-classification model for sound event detection of weakly labelled data
Source separation (SS) aims to separate individual sources from an audio
recording. Sound event detection (SED) aims to detect sound events from an
audio recording. We propose a joint separation-classification (JSC) model
trained only on weakly labelled audio data, that is, only the tags of an audio
recording are known but the times of the events are unknown. First, we propose a
separation mapping from the time-frequency (T-F) representation of an audio to
the T-F segmentation masks of the audio events. Second, a classification
mapping is built from each T-F segmentation mask to the presence probability of
each audio event. In the source separation stage, sources of audio events and
time of sound events can be obtained from the T-F segmentation masks. The
proposed method achieves an equal error rate (EER) of 0.14 in SED,
outperforming a deep neural network baseline of 0.29. A source separation SDR of
8.08 dB is obtained by using global weighted rank pooling (GWRP) as the probability
mapping, outperforming the global max pooling (GMP) based probability mapping,
which gives an SDR of 0.03 dB. Source code of our work is published.
Comment: Accepted by ICASSP 201
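The contrast between the two probability mappings named in this abstract can be illustrated with a minimal NumPy sketch. It assumes the common definition of global weighted rank pooling, in which the sorted mask values are averaged with geometrically decaying weights; the function names and the `decay` value are hypothetical and are not taken from the paper's released code.

```python
import numpy as np

def global_max_pooling(mask):
    """Clip-level probability taken as the single largest T-F mask value."""
    return float(np.max(mask))

def global_weighted_rank_pooling(mask, decay=0.5):
    """Weighted average of the mask values sorted in descending order.

    The j-th largest value (j = 0, 1, ...) receives weight decay**j, so
    decay=0 reduces to max pooling and decay=1 to average pooling.
    The default decay is an illustrative assumption.
    """
    values = np.sort(mask.ravel())[::-1]
    weights = decay ** np.arange(values.size)
    return float(np.sum(weights * values) / np.sum(weights))
```

Under this definition, intermediate `decay` values let many T-F bins contribute to the clip-level probability, whereas max pooling depends on a single bin.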
Large-scale weakly supervised audio classification using gated convolutional neural network
In this paper, we present a gated convolutional neural network and a temporal
attention-based localization method for audio classification, which won the 1st
place in the large-scale weakly supervised sound event detection task of
Detection and Classification of Acoustic Scenes and Events (DCASE) 2017
challenge. The audio clips in this task, which are extracted from YouTube
videos, are manually labeled with one or a few audio tags but without
timestamps of the audio events; such data is called weakly labeled data. Two
sub-tasks are defined in this challenge including audio tagging and sound event
detection using this weakly labeled data. A convolutional recurrent neural
network (CRNN) with learnable gated linear units (GLUs) non-linearity applied
on the log Mel spectrogram is proposed. In addition, a temporal attention
method is proposed along the frames to predict the locations of each audio
event in a chunk from the weakly labeled data. As a team, we ranked 1st and 2nd
in these two sub-tasks of the DCASE 2017 challenge, with an F-value of 55.6% and
an equal error of 0.73, respectively.
Comment: submitted to ICASSP2018, summary on the 1st place system in DCASE2017
task4 challeng
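The gated linear unit mentioned in this abstract multiplies a linear branch by a sigmoid gate computed from the same input. The sketch below is a simplification under stated assumptions: it uses dense layers on a (frames x features) matrix instead of the convolutions of the described system, and all names and shapes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_linear_unit(x, w_lin, b_lin, w_gate, b_gate):
    """GLU applied frame by frame: linear branch times a sigmoid gate.

    x:      (frames, in_features) input, e.g. log Mel spectrogram frames
    w_lin:  (in_features, out_features) weights of the linear branch
    w_gate: (in_features, out_features) weights of the gating branch
    """
    linear = x @ w_lin + b_lin
    gate = sigmoid(x @ w_gate + b_gate)  # values in (0, 1)
    return linear * gate
```

A gate value near 0 suppresses a feature and a value near 1 passes it through unchanged, which is how the unit can select informative local features.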
Audio Set classification with attention model: A probabilistic perspective
This paper investigates the classification of the Audio Set dataset. Audio
Set is a large scale weakly labelled dataset of sound clips. Previous work used
multiple instance learning (MIL) to classify weakly labelled data. In MIL, a
bag consists of several instances, and a bag is labelled positive if at least
one instance in the audio clip is positive. A bag is labelled negative if all
the instances in the bag are negative. We propose an attention model to tackle
the MIL problem and explain this attention model from a novel probabilistic
perspective. We define a probability space on each bag, where each instance in
the bag has a trainable probability measure for each class. Then the
classification of a bag is the expectation of the classification output of the
instances in the bag with respect to the learned probability measure.
Experimental results show that our proposed attention model, modeled by a fully
connected deep neural network, obtains an mAP of 0.327 on the Audio Set dataset,
outperforming Google's baseline of 0.314 and a recurrent neural network at
0.325.
Comment: Accepted by ICASSP 201
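The bag-level prediction described in this abstract, an expectation of instance outputs under a learned per-class probability measure, can be sketched as follows. This is a minimal illustration that assumes the instance probabilities and non-negative attention scores have already been produced by some network; the function and argument names are hypothetical.

```python
import numpy as np

def attention_pooling(instance_probs, attention_scores):
    """Bag-level prediction as an expectation over instances.

    instance_probs:   (instances, classes) classification output per instance
    attention_scores: (instances, classes) non-negative scores; normalising
                      them over instances gives a probability measure on the
                      bag for each class
    """
    weights = attention_scores / attention_scores.sum(axis=0, keepdims=True)
    return (weights * instance_probs).sum(axis=0)
```

With uniform scores this reduces to average pooling over the bag, and with all mass on one instance it returns that instance's output, so the max-style MIL decision is a limiting case.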
Surrey-cvssp system for DCASE2017 challenge task4
In this technical report, we present a set of methods for Task 4 of the
Detection and Classification of Acoustic Scenes and Events 2017 (DCASE2017)
challenge. This task evaluates systems for the large-scale detection of sound
events using weakly labeled training data. The data are YouTube video excerpts
focusing on transportation and warnings due to their industry applications.
There are two tasks, audio tagging and sound event detection from weakly
labeled data. A convolutional neural network (CNN) and a gated recurrent unit
(GRU) based recurrent neural network (RNN) are adopted as our basic framework.
We propose a learnable gating activation function for selecting informative
local features. An attention-based scheme is used for localizing the specific
events in a weakly supervised mode. A new batch-level balancing strategy is also
proposed to tackle the data imbalance problem. Fusion of posteriors from
different systems is found effective to improve the performance. In summary, we
obtain a 61% F-value for the audio tagging subtask and a 0.73 error rate (ER)
for the sound event detection subtask on the development set, whereas the
official multilayer perceptron (MLP) based baseline obtained only a 13.1%
F-value for audio tagging and an ER of 1.02 for sound event detection.
Comment: DCASE2017 challenge ranked 1st system, task4, tech repor
Convolutional Gated Recurrent Neural Network Incorporating Spatial Features for Audio Tagging
Environmental audio tagging is a newly proposed task to predict the presence
or absence of a specific audio event in a chunk. Deep neural network (DNN)
based methods have been successfully adopted for predicting the audio tags in
the domestic audio scene. In this paper, we propose to use a convolutional
neural network (CNN) to extract robust features from mel-filter banks (MFBs),
spectrograms or even raw waveforms for audio tagging. Gated recurrent unit
(GRU) based recurrent neural networks (RNNs) are then cascaded to model the
long-term temporal structure of the audio signal. To complement the input
information, an auxiliary CNN is designed to learn from the spatial features of
stereo recordings. We evaluate our proposed methods on Task 4 (audio tagging)
of the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE
2016) challenge. Compared with our recent DNN-based method, the proposed
structure can reduce the equal error rate (EER) from 0.13 to 0.11 on the
development set. The spatial features can further reduce the EER to 0.10. The
performance of the end-to-end learning on raw waveforms is also comparable.
Finally, on the evaluation set, we obtain state-of-the-art performance with
0.12 EER, while the performance of the best existing system is 0.15 EER.
Comment: Accepted to IJCNN2017, Anchorage, Alaska, US
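The equal error rate reported in this abstract is the operating point at which the false positive rate equals the false negative rate. A minimal sketch of how it can be estimated for one tag is given below; it assumes binary labels with both classes present, sweeps only the observed scores as thresholds, and is not the challenge's official evaluation code.

```python
import numpy as np

def equal_error_rate(labels, scores):
    """Estimate the EER by sweeping a decision threshold over the scores.

    Returns the mean of the false positive and false negative rates at the
    threshold where the two are closest. Assumes both classes are present.
    """
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    thresholds = np.concatenate([np.unique(scores), [np.inf]])
    best_gap, eer = np.inf, 1.0
    for th in thresholds:
        pred = scores >= th
        fpr = np.mean(pred[~labels])   # negatives predicted as positive
        fnr = np.mean(~pred[labels])   # positives predicted as negative
        gap = abs(fpr - fnr)
        if gap < best_gap:
            best_gap, eer = gap, (fpr + fnr) / 2.0
    return float(eer)
```

For multi-label tagging, a per-tag EER computed this way would typically be averaged over tags; that averaging step is an assumption here and is not shown.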
Attention and Localization based on a Deep Convolutional Recurrent Model for Weakly Supervised Audio Tagging
Audio tagging aims to perform multi-label classification on audio chunks and
it is a newly proposed task in the Detection and Classification of Acoustic
Scenes and Events 2016 (DCASE 2016) challenge. This task encourages research
efforts to better analyze and understand the content of the huge amounts of
audio data on the web. The difficulty in audio tagging is that it only has a
chunk-level label without a frame-level label. This paper presents a weakly
supervised method to not only predict the tags but also indicate the temporal
locations of the acoustic events that occur. The attention scheme is found to be
effective in identifying the important frames while ignoring the unrelated
frames. The proposed framework is a deep convolutional recurrent model with two
auxiliary modules: an attention module and a localization module. The proposed
algorithm was evaluated on Task 4 of the DCASE 2016 challenge. State-of-the-art
performance was achieved on the evaluation set with equal error rate (EER)
reduced from 0.13 to 0.11, compared with the convolutional recurrent baseline
system.
Comment: 5 pages, submitted to interspeech201
DCASE 2018 Challenge Surrey Cross-Task convolutional neural network baseline
The Detection and Classification of Acoustic Scenes and Events (DCASE) 2018
challenge consists of five audio classification and sound event detection tasks: 1)
Acoustic scene classification, 2) General-purpose audio tagging of Freesound,
3) Bird audio detection, 4) Weakly-labeled semi-supervised sound event
detection and 5) Multi-channel audio classification. In this paper, we create a
cross-task baseline system for all five tasks based on a convolutional neural
network (CNN): a "CNN Baseline" system. We implemented CNNs with 4 layers and 8
layers originating from AlexNet and VGG from computer vision. We investigated
how the performance varies from task to task with the same configuration of
neural networks. Experiments show that the deeper CNN with 8 layers performs better
than the CNN with 4 layers on all tasks except Task 1. Using the CNN with 8 layers, we
achieve an accuracy of 0.680 on Task 1, an accuracy of 0.895 and a mean average
precision (MAP) of 0.928 on Task 2, an accuracy of 0.751 and an area under the
curve (AUC) of 0.854 on Task 3, a sound event detection F1 score of 20.8% on
Task 4, and an F1 score of 87.75% on Task 5. We released the Python source code
of the baseline systems under the MIT license for further research.
Comment: Accepted by DCASE 2018 Workshop. 4 pages. Source code availabl
Weakly Labelled AudioSet Tagging with Attention Neural Networks
Audio tagging is the task of predicting the presence or absence of sound
classes within an audio clip. Previous work in audio tagging focused on
relatively small datasets limited to recognising a small number of sound
classes. We investigate audio tagging on AudioSet, which is a dataset
consisting of over 2 million audio clips and 527 classes. AudioSet is weakly
labelled, in that only the presence or absence of sound classes is known for
each clip, while the onset and offset times are unknown. To address the
weakly-labelled audio tagging problem, we propose attention neural networks as
a way to attend to the most salient parts of an audio clip. We bridge the
connection between attention neural networks and multiple instance learning
(MIL) methods, and propose decision-level and feature-level attention neural
networks for audio tagging. We investigate attention neural networks modeled by
different functions, depths and widths. Experiments on AudioSet show that the
feature-level attention neural network achieves a state-of-the-art mean average
precision (mAP) of 0.369, outperforming the best multiple instance learning
(MIL) method of 0.317 and Google's deep neural network baseline of 0.314. In
addition, we discover that the audio tagging performance on AudioSet embedding
features has a weak correlation with the number of training samples and the
quality of labels of each sound class.
Comment: 13 page
Simultaneous Codeword Optimization (SimCO) for Dictionary Update and Learning
We consider the data-driven dictionary learning problem. The goal is to seek
an over-complete dictionary from which every training signal can be best
approximated by a linear combination of only a few codewords. This task is
often achieved by iteratively executing two operations: sparse coding and
dictionary update. In the literature, there are two benchmark mechanisms to
update a dictionary. The first approach, such as the MOD algorithm, is
characterized by searching for the optimal codewords while fixing the sparse
coefficients. In the second approach, represented by the K-SVD method, one
codeword and the related sparse coefficients are simultaneously updated while
all other codewords and coefficients remain unchanged. We propose a novel
framework that generalizes the aforementioned two methods. The unique feature
of our approach is that one can update an arbitrary set of codewords and the
corresponding sparse coefficients simultaneously: when sparse coefficients are
fixed, the underlying optimization problem is similar to that in the MOD
algorithm; when only one codeword is selected for update, it can be proved that
the proposed algorithm is equivalent to the K-SVD method; and more importantly,
our method allows us to update all codewords and all sparse coefficients
simultaneously, hence the term simultaneous codeword optimization (SimCO).
Under the proposed framework, we design two algorithms, namely, primitive and
regularized SimCO. We implement these two algorithms based on a simple gradient
descent mechanism. Simulations are provided to demonstrate the performance of
the proposed algorithms, as compared with two baseline algorithms, MOD and
K-SVD. Results show that regularized SimCO is particularly appealing in terms
of both learning performance and running speed.
Comment: 13 page
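The first benchmark mechanism named in this abstract, updating all codewords while the sparse coefficients stay fixed, is a least-squares problem. The sketch below shows that MOD-style baseline step only, not SimCO itself; the function name is hypothetical and the pseudo-inverse solution is the standard closed form, used here as an illustration.

```python
import numpy as np

def mod_dictionary_update(signals, coefficients):
    """MOD-style update: best dictionary for fixed sparse coefficients.

    signals:      (signal_dim, num_signals) training signals Y
    coefficients: (num_codewords, num_signals) fixed sparse codes X
    Returns the D that minimises ||Y - D X||_F, i.e. D = Y X^+.
    """
    return signals @ np.linalg.pinv(coefficients)
```

In a full dictionary learning loop this step would alternate with a sparse coding step, and codewords are usually renormalised afterwards; both are omitted here.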