Polyphonic audio tagging with sequentially labelled data using CRNN with learnable gated linear units
Audio tagging aims to detect the types of sound events occurring in an audio
recording. To tag polyphonic audio recordings, we propose to use a
Connectionist Temporal Classification (CTC) loss function on top of a
Convolutional Recurrent Neural Network (CRNN) with learnable Gated Linear Units
(GLU-CTC), based on a new type of audio label data: Sequentially Labelled Data
(SLD). In GLU-CTC, the CTC objective function maps the frame-level label
probabilities to clip-level label probabilities. To assess the mapping ability
of GLU-CTC for sound events, we train a CRNN with GLU based on global max
pooling (GLU-GMP) and a CRNN with GLU based on global average pooling
(GLU-GAP). We also compare the proposed GLU-CTC system with a baseline system,
a CRNN trained with the CTC loss function but without GLU. The experiments show
that GLU-CTC achieves an area under the curve (AUC) score of 0.882 in audio
tagging, outperforming GLU-GMP (0.803), GLU-GAP (0.766) and the baseline system
(0.837). This indicates that, for the same CRNN model with GLU, CTC mapping
performs better than GMP and GAP mapping, and that, for the same CTC mapping,
the CRNN with GLU outperforms the CRNN without GLU.
Comment: DCASE2018 Workshop. arXiv admin note: text overlap with arXiv:1808.0193
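To make the three frame-to-clip mappings concrete, the snippet below is a
minimal PyTorch sketch rather than the paper's released code: GMP takes the
maximum frame probability, GAP takes the mean, and the CTC mapping trains
frame-level posteriors (with an added blank class) against the clip's tag
sequence. The tensor shapes, number of classes and label encoding are
illustrative assumptions.

```python
import torch
import torch.nn.functional as F

frame_probs = torch.rand(4, 240, 10)  # (batch, frames, classes): CRNN frame-level outputs in [0, 1]

# Global max pooling (GMP): the clip probability is the maximum over frames.
clip_probs_gmp = frame_probs.max(dim=1).values

# Global average pooling (GAP): the clip probability is the mean over frames.
clip_probs_gap = frame_probs.mean(dim=1)

# CTC mapping: frame-level posteriors over the classes plus a blank are trained
# against the clip's tag sequence, so clip-level labels are reached via the loss.
log_probs = F.log_softmax(torch.randn(240, 4, 11), dim=-1)  # (frames, batch, classes + blank)
targets = torch.randint(1, 11, (4, 3))                      # hypothetical tag sequences per clip
input_lengths = torch.full((4,), 240, dtype=torch.long)
target_lengths = torch.full((4,), 3, dtype=torch.long)
loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)
```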
Surrey-cvssp system for DCASE2017 challenge task4
In this technical report, we present a set of methods for Task 4 of the
Detection and Classification of Acoustic Scenes and Events 2017 (DCASE2017)
challenge. This task evaluates systems for the large-scale detection of sound
events using weakly labeled training data. The data are YouTube video excerpts
focusing on transportation and warnings because of their industrial
applications. There are two subtasks: audio tagging and sound event detection
from weakly labeled data. A convolutional neural network (CNN) and a gated
recurrent unit (GRU) based recurrent neural network (RNN) are adopted as our
basic framework. We propose a learnable gating activation function for
selecting informative local features. An attention-based scheme is used to
localize the specific events in a weakly supervised manner. A new batch-level
balancing strategy is also proposed to tackle the data imbalance problem.
Fusing the posteriors of different systems is found to be effective in
improving performance. In summary, we obtain a 61% F-score for the audio
tagging subtask and a 0.73 error rate (ER) for the sound event detection
subtask on the development set, while the official multilayer perceptron (MLP)
based baseline obtains only a 13.1% F-score for audio tagging and an ER of 1.02
for sound event detection.
Comment: DCASE2017 challenge ranked 1st system, task4, tech report
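As a rough illustration of the learnable gating activation mentioned above, the
PyTorch block below pairs a linear convolution with a sigmoid-gated convolution
so that the gate can select informative local time-frequency features. The
channel count and kernel size are assumptions for illustration, not the
configuration used in the report.

```python
import torch
import torch.nn as nn

class GatedConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.linear = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)  # "content" path
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)    # learnable gate

    def forward(self, x):
        # The sigmoid gate selects informative local time-frequency features.
        return self.linear(x) * torch.sigmoid(self.gate(x))

x = torch.randn(4, 1, 240, 64)   # (batch, channel, frames, mel bins)
y = GatedConvBlock(1, 64)(x)     # -> (4, 64, 240, 64)
```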
A joint separation-classification model for sound event detection of weakly labelled data
Source separation (SS) aims to separate individual sources from an audio
recording. Sound event detection (SED) aims to detect sound events in an audio
recording. We propose a joint separation-classification (JSC) model trained
only on weakly labelled audio data, that is, only the tags of an audio
recording are known, while the times of the events are unknown. First, we
propose a separation mapping from the time-frequency (T-F) representation of an
audio recording to the T-F segmentation masks of the audio events. Second, a
classification mapping is built from each T-F segmentation mask to the presence
probability of each audio event. In the source separation stage, the sources of
the audio events and the times of the sound events can be obtained from the T-F
segmentation masks. The proposed method achieves an equal error rate (EER) of
0.14 in SED, outperforming a deep neural network baseline with an EER of 0.29.
A source separation signal-to-distortion ratio (SDR) of 8.08 dB is obtained by
using global weighted rank pooling (GWRP) as the probability mapping,
outperforming the global max pooling (GMP) based probability mapping, which
gives an SDR of 0.03 dB. The source code of our work is published.
Comment: Accepted by ICASSP 201
Large-scale weakly supervised audio classification using gated convolutional neural network
In this paper, we present a gated convolutional neural network and a temporal
attention-based localization method for audio classification, which won first
place in the large-scale weakly supervised sound event detection task of the
Detection and Classification of Acoustic Scenes and Events (DCASE) 2017
challenge. The audio clips in this task, which are extracted from YouTube
videos, are manually labeled with one or a few audio tags but without
timestamps of the audio events; such data are referred to as weakly labeled
data. Two subtasks are defined in this challenge: audio tagging and sound event
detection using this weakly labeled data. A convolutional recurrent neural
network (CRNN) with a learnable gated linear unit (GLU) non-linearity applied
to the log mel spectrogram is proposed. In addition, a temporal attention
method along the frames is proposed to predict the location of each audio event
in a chunk from the weakly labeled data. As a team, we ranked 1st and 2nd in
these two subtasks of the DCASE 2017 challenge, with an F-score of 55.6% and an
equal error rate of 0.73, respectively.
Comment: submitted to ICASSP2018, summary on the 1st place system in the DCASE2017 task4 challenge
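The temporal attention idea can be sketched as follows, purely for
illustration: frame-level class probabilities are weighted by attention weights
normalised along the frames and summed into clip-level probabilities, and the
attention peaks indicate where each event is likely to occur. The layer sizes
below are assumptions, not the winning system's configuration.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, feat_dim, n_classes):
        super().__init__()
        self.cla = nn.Linear(feat_dim, n_classes)   # frame-level classifier
        self.att = nn.Linear(feat_dim, n_classes)   # frame-level attention

    def forward(self, h):                            # h: (batch, frames, feat_dim)
        frame_prob = torch.sigmoid(self.cla(h))      # (batch, frames, classes)
        att = torch.softmax(self.att(h), dim=1)      # normalised along frames
        clip_prob = (frame_prob * att).sum(dim=1)    # (batch, classes)
        return clip_prob, frame_prob, att            # attention peaks localise events

h = torch.randn(4, 240, 128)                         # CRNN frame embeddings
clip_prob, frame_prob, att = TemporalAttention(128, 17)(h)
```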
Audio Set classification with attention model: A probabilistic perspective
This paper investigates the classification of the Audio Set dataset. Audio Set
is a large-scale weakly labelled dataset of sound clips. Previous work used
multiple instance learning (MIL) to classify weakly labelled data. In MIL, a
bag consists of several instances, and a bag is labelled positive if at least
one instance in the bag is positive; a bag is labelled negative if all the
instances in the bag are negative. We propose an attention model to tackle the
MIL problem and explain this attention model from a novel probabilistic
perspective. We define a probability space on each bag, where each instance in
the bag has a trainable probability measure for each class. The classification
of a bag is then the expectation of the classification outputs of the instances
in the bag with respect to the learned probability measure. Experimental
results show that our proposed attention model, implemented with a fully
connected deep neural network, obtains an mAP of 0.327 on the Audio Set
dataset, outperforming Google's baseline of 0.314 and a recurrent neural
network of 0.325.
Comment: Accepted by ICASSP 201
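The expectation view can be written down directly. The sketch below is a
hypothetical minimal version in which a Softplus output, normalised over the
instances of a bag, plays the role of the learned probability measure; the
feature dimension and the choice of non-negative mapping are assumptions for
illustration.

```python
import torch
import torch.nn as nn

n_instances, feat_dim, n_classes = 10, 128, 527    # Audio Set has 527 sound classes

f = nn.Sequential(nn.Linear(feat_dim, n_classes), nn.Sigmoid())   # instance-level classifier f(x_i)
v = nn.Sequential(nn.Linear(feat_dim, n_classes), nn.Softplus())  # non-negative weight for the measure

x = torch.randn(n_instances, feat_dim)   # instance embeddings of one bag (audio clip)
w = v(x)                                 # (instances, classes), non-negative
q = w / w.sum(dim=0, keepdim=True)       # normalise: a probability measure per class
bag_prob = (q * f(x)).sum(dim=0)         # expectation E_q[f] -> (classes,)
```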
Sound Event Detection with Sequentially Labelled Data Based on Connectionist Temporal Classification and Unsupervised Clustering
Sound event detection (SED) methods typically rely on either strongly
labelled data or weakly labelled data. As an alternative, sequentially labelled
data (SLD) has been proposed. In SLD, the events and the order of events in an
audio clip are known, but the occurrence times of the events are not. This paper
proposes a connectionist temporal classification (CTC) based SED system that
uses SLD instead of strongly labelled data, with a novel unsupervised
clustering stage. Experiments on 41 classes of sound events show that the
proposed two-stage method trained on SLD achieves performance comparable to the
previous state-of-the-art SED system trained on strongly labelled data, and is
far better than another state-of-the-art SED system trained on weakly labelled
data, which indicates the effectiveness of the proposed two-stage method
trained on SLD without any onset/offset times of sound events.
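Since the abstract does not detail the unsupervised clustering stage, the toy
snippet below only illustrates one simple way of turning CTC-trained frame
posteriors into rough event boundaries (thresholding and grouping consecutive
active frames); the method used in the paper is likely different.

```python
import numpy as np

def rough_segments(frame_posterior, threshold=0.5, hop_seconds=0.01):
    """Group consecutive frames whose posterior exceeds a threshold into segments."""
    active = frame_posterior > threshold
    segments, start = [], None
    for t, is_active in enumerate(active):
        if is_active and start is None:
            start = t
        elif not is_active and start is not None:
            segments.append((start * hop_seconds, t * hop_seconds))
            start = None
    if start is not None:
        segments.append((start * hop_seconds, len(active) * hop_seconds))
    return segments

posterior = np.random.rand(500)   # frame posteriors for one sound class
print(rough_segments(posterior))
```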
Convolutional Gated Recurrent Neural Network Incorporating Spatial Features for Audio Tagging
Environmental audio tagging is a newly proposed task to predict the presence
or absence of a specific audio event in a chunk. Deep neural network (DNN)
based methods have been successfully adopted for predicting the audio tags in
the domestic audio scene. In this paper, we propose to use a convolutional
neural network (CNN) to extract robust features from mel-filter banks (MFBs),
spectrograms or even raw waveforms for audio tagging. Gated recurrent unit
(GRU) based recurrent neural networks (RNNs) are then cascaded to model the
long-term temporal structure of the audio signal. To complement the input
information, an auxiliary CNN is designed to learn from the spatial features of
stereo recordings. We evaluate our proposed methods on Task 4 (audio tagging)
of the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE
2016) challenge. Compared with our recent DNN-based method, the proposed
structure can reduce the equal error rate (EER) from 0.13 to 0.11 on the
development set. The spatial features can further reduce the EER to 0.10. The
performance of the end-to-end learning on raw waveforms is also comparable.
Finally, on the evaluation set, we achieve state-of-the-art performance with an
EER of 0.12, while the best existing system achieves an EER of 0.15.
Comment: Accepted to IJCNN2017, Anchorage, Alaska, US
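A minimal PyTorch sketch of the CNN to GRU to sigmoid tagging pipeline
described above; the auxiliary spatial-feature CNN is omitted for brevity, and
the layer sizes and 7-tag output are assumptions for illustration rather than
the paper's configuration.

```python
import torch
import torch.nn as nn

class CRNNTagger(nn.Module):
    def __init__(self, n_mels=40, n_tags=7):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 2)),                        # pool over frequency only
        )
        self.gru = nn.GRU(32 * (n_mels // 2), 64, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * 64, n_tags)

    def forward(self, x):                                # x: (batch, 1, frames, mels)
        h = self.cnn(x)                                  # (batch, 32, frames, mels // 2)
        h = h.permute(0, 2, 1, 3).flatten(2)             # (batch, frames, 32 * mels // 2)
        h, _ = self.gru(h)                               # model long-term temporal structure
        return torch.sigmoid(self.out(h)).mean(dim=1)    # clip-level tag probabilities

tags = CRNNTagger()(torch.randn(4, 1, 240, 40))           # -> (4, 7)
```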
Joint Detection and Classification Convolutional Neural Network on Weakly Labelled Bird Audio Detection
Bird audio detection (BAD) aims to detect whether or not there is a bird call in an audio recording. One difficulty of this task is that the bird sound datasets are weakly labelled, that is, only the presence or absence of a bird in a recording is known, without knowing when the birds call. We propose to apply a joint detection and classification (JDC) model to the weakly labelled data (WLD) to detect and classify an audio clip at the same time. First, we apply a VGG-like convolutional neural network (CNN) on the mel spectrogram as a baseline. Then we propose a JDC-CNN model with VGG as a classifier and a CNN as a detector. We report that the denoising methods, including optimally-modified log-spectral amplitude (OM-LSA), median filtering and spectral spectrogram, worsen the classification accuracy, contrary to previous work. JDC-CNN can predict the timestamps of the events from weakly labelled data, and so is able to perform sound event detection from WLD. We obtain an area under the curve (AUC) of 95.70% on the development data and 81.36% on the unseen evaluation data, which is nearly comparable to that of the baseline CNN model.
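One way to read the JDC idea is sketched below: a detector scores each frame
and those scores weight the classifier's frame-wise outputs into a clip-level
decision, while the detector output gives rough timestamps. This is an
illustrative interpretation with made-up layer sizes, not the paper's exact
JDC-CNN.

```python
import torch
import torch.nn as nn

frames, feat_dim = 240, 128
embeddings = torch.randn(1, frames, feat_dim)   # frame features from a CNN front end

classifier = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())  # bird vs. no bird per frame
detector = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())    # does this frame matter?

cla = classifier(embeddings).squeeze(-1)              # (1, frames)
det = detector(embeddings).squeeze(-1)                # (1, frames)
clip_prob = (cla * det).sum(dim=1) / det.sum(dim=1)   # detection-weighted average
frame_active = det > 0.5                              # rough timestamps from the detector
```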
DCASE 2018 Challenge Surrey Cross-Task convolutional neural network baseline
The Detection and Classification of Acoustic Scenes and Events (DCASE) 2018
challenge consists of five audio classification and sound event detection
tasks: 1) Acoustic scene classification, 2) General-purpose audio tagging of
Freesound, 3) Bird audio detection, 4) Weakly-labeled semi-supervised sound
event detection and 5) Multi-channel audio classification. In this paper, we
create a cross-task baseline system for all five tasks based on a convolutional
neural network (CNN): a "CNN Baseline" system. We implement CNNs with 4 layers
and 8 layers, originating from AlexNet and VGG in computer vision. We
investigate how the performance varies from task to task with the same neural
network configuration. Experiments show that the deeper CNN with 8 layers
performs better than the CNN with 4 layers on all tasks except Task 1. Using
the CNN with 8 layers, we achieve an accuracy of 0.680 on Task 1, an accuracy
of 0.895 and a mean average precision (mAP) of 0.928 on Task 2, an accuracy of
0.751 and an area under the curve (AUC) of 0.854 on Task 3, a sound event
detection F1 score of 20.8% on Task 4, and an F1 score of 87.75% on Task 5. We
release the Python source code of the baseline systems under the MIT license
for further research.
Comment: Accepted by DCASE 2018 Workshop. 4 pages. Source code available
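As an indication of what such a VGG-like baseline looks like, the sketch below
stacks 3x3 convolution blocks with batch normalisation and pooling, followed by
global average pooling and a task-specific output layer. The channel counts and
the exact 8-layer arrangement are assumptions for illustration; the released
MIT-licensed code defines the actual architecture.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
        nn.MaxPool2d(2),
    )

class Cnn8(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(1, 64), conv_block(64, 128),
            conv_block(128, 256), conv_block(256, 512),   # 4 blocks x 2 convs = 8 layers
        )
        self.fc = nn.Linear(512, n_classes)

    def forward(self, x):                                 # x: (batch, 1, frames, mel bins)
        h = self.features(x)
        h = h.mean(dim=(2, 3))                            # global average pooling
        return self.fc(h)                                 # logits per class

logits = Cnn8()(torch.randn(4, 1, 320, 64))
```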