A joint separation-classification model for sound event detection of weakly labelled data
Source separation (SS) aims to separate individual sources from an audio
recording. Sound event detection (SED) aims to detect sound events from an
audio recording. We propose a joint separation-classification (JSC) model
trained only on weakly labelled audio data, that is, only the tags of an audio
recording are known but the times of the events are unknown. First, we propose a separation mapping from the time-frequency (T-F) representation of an audio recording to
the T-F segmentation masks of the audio events. Second, a classification
mapping is built from each T-F segmentation mask to the presence probability of
each audio event. In the source separation stage, the sources and occurrence times of sound events can be obtained from the T-F segmentation masks. The
proposed method achieves an equal error rate (EER) of 0.14 in SED,
outperforming a deep neural network baseline with an EER of 0.29. A source separation SDR of 8.08 dB is obtained using global weighted rank pooling (GWRP) as the probability mapping, outperforming the global max pooling (GMP) based probability mapping,
which yields an SDR of 0.03 dB. The source code of our work is published.
Comment: Accepted by ICASSP 2018
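For concreteness, the GWRP aggregation mentioned in this abstract can be written in a few lines. The following is a minimal NumPy sketch, assuming a per-class T-F segmentation mask of bin-wise probabilities; the decay hyperparameter r is an illustrative value, not one taken from the paper.

```python
import numpy as np

def global_weighted_rank_pooling(mask, r=0.999):
    """Collapse a T-F segmentation mask into a clip-level presence
    probability via global weighted rank pooling (GWRP).

    mask: 2-D array of per-bin probabilities for one event class.
    r:    decay in (0, 1]; r = 1 recovers average pooling, while
          r -> 0 approaches global max pooling (GMP).
    """
    x = np.sort(mask.ravel())[::-1]          # sort bins, largest first
    w = r ** np.arange(x.size)               # decaying rank weights
    return float(np.sum(w * x) / np.sum(w))  # normalised weighted sum
```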
Audio Set classification with attention model: A probabilistic perspective
This paper investigates the classification of the Audio Set dataset. Audio
Set is a large scale weakly labelled dataset of sound clips. Previous work used
multiple instance learning (MIL) to classify weakly labelled data. In MIL, a
bag consists of several instances, and a bag is labelled positive if at least one instance in the bag is positive. A bag is labelled negative if all
the instances in the bag are negative. We propose an attention model to tackle
the MIL problem and explain this attention model from a novel probabilistic
perspective. We define a probability space on each bag, where each instance in
the bag has a trainable probability measure for each class. Then the
classification of a bag is the expectation of the classification output of the
instances in the bag with respect to the learned probability measure.
Experimental results show that our proposed attention model, modeled by a fully connected deep neural network, obtains an mAP of 0.327 on the Audio Set dataset, outperforming Google's baseline of 0.314 and a recurrent neural network baseline of 0.325.
Comment: Accepted by ICASSP 2018
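The expectation described above maps directly onto an attention layer. Below is a minimal PyTorch sketch of this probabilistic reading of attention over a bag of instances; the layer shapes and the exponential weighting are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Bag-level classification as an expectation of instance
    classifications under a learned probability measure (a sketch)."""
    def __init__(self, feat_dim, n_classes):
        super().__init__()
        self.cla = nn.Linear(feat_dim, n_classes)  # instance classifier f(x_i)
        self.att = nn.Linear(feat_dim, n_classes)  # unnormalised measure on x_i

    def forward(self, x):                   # x: (batch, instances, feat_dim)
        f = torch.sigmoid(self.cla(x))      # per-instance class probabilities
        v = torch.exp(self.att(x))          # positive weight per instance/class
        q = v / v.sum(dim=1, keepdim=True)  # normalise to a probability measure
        return (q * f).sum(dim=1)           # expectation = bag-level prediction
```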
Large-scale weakly supervised audio classification using gated convolutional neural network
In this paper, we present a gated convolutional neural network and a temporal
attention-based localization method for audio classification, which won the 1st
place in the large-scale weakly supervised sound event detection task of
Detection and Classification of Acoustic Scenes and Events (DCASE) 2017
challenge. The audio clips in this task, which are extracted from YouTube
videos, are manually labeled with one or a few audio tags but without
timestamps of the audio events; this is known as weakly labeled data. Two
sub-tasks are defined in this challenge including audio tagging and sound event
detection using this weakly labeled data. A convolutional recurrent neural
network (CRNN) with learnable gated linear unit (GLU) non-linearities applied to the log mel spectrogram is proposed. In addition, a temporal attention method is proposed along the frames to predict the locations of each audio
event in a chunk from the weakly labeled data. Our team ranked 1st and 2nd in these two sub-tasks of the DCASE 2017 challenge, with an F-score of 55.6% and an equal error rate of 0.73, respectively.
Comment: Submitted to ICASSP 2018; a summary of the 1st-place system in the DCASE 2017 Task 4 challenge
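The gating idea is compact enough to show directly. Here is a minimal PyTorch sketch of a convolutional block with a GLU non-linearity as described above; the kernel size and channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedConvBlock(nn.Module):
    """Convolution with a gated linear unit (GLU): a linear branch
    multiplied element-wise by a sigmoid gate branch."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):          # x: (batch, channels, time, mel_bins)
        return self.conv(x) * torch.sigmoid(self.gate(x))
```

The sigmoid branch can be read as a soft attention over the T-F representation, emphasising informative regions and suppressing others, which is one motivation for gating under weak labels.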
Sound Event Detection with Sequentially Labelled Data Based on Connectionist Temporal Classification and Unsupervised Clustering
Sound event detection (SED) methods typically rely on either strongly
labelled data or weakly labelled data. As an alternative, sequentially labelled
data (SLD) was proposed. In SLD, the events and the order of events in audio
clips are known, without knowing the occurrence time of events. This paper
proposes a connectionist temporal classification (CTC) based SED system that
uses SLD instead of strongly labelled data, with a novel unsupervised
clustering stage. Experiments on 41 classes of sound events show that the
proposed two-stage method trained on SLD achieves performance comparable to the
previous state-of-the-art SED system trained on strongly labelled data, and is
far better than another state-of-the-art SED system trained on weakly labelled
data, which indicates the effectiveness of the proposed two-stage method
trained on SLD without any onset/offset times of sound events.
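Since SLD provides the order of events but not their times, the setting matches the standard CTC formulation. The snippet below is a minimal PyTorch sketch of computing a CTC loss over ordered event labels; the frame count, batch size, and sequence length are illustrative assumptions, and the random tensors stand in for real network outputs and labels.

```python
import torch
import torch.nn as nn

n_classes = 41                 # sound event classes; index 0 is the CTC blank
ctc = nn.CTCLoss(blank=0)

# (time_steps, batch, n_classes + 1) frame-wise log-probabilities
log_probs = torch.randn(100, 8, n_classes + 1).log_softmax(-1)

# Sequential labels: the ordered events per clip, with no timestamps
targets = torch.randint(1, n_classes + 1, (8, 5))
input_lengths = torch.full((8,), 100, dtype=torch.long)   # frames per clip
target_lengths = torch.full((8,), 5, dtype=torch.long)    # events per clip

loss = ctc(log_probs, targets, input_lengths, target_lengths)
```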
Weakly-Supervised Temporal Localization via Occurrence Count Learning
We propose a novel model for temporal detection and localization which allows
the training of deep neural networks using only counts of event occurrences as
training labels. This powerful weakly-supervised framework alleviates the
burden of the imprecise and time-consuming process of annotating event
locations in temporal data. Unlike existing methods, in which localization is
explicitly achieved by design, our model learns localization implicitly as a
byproduct of learning to count instances. This unique feature is a direct
consequence of the model's theoretical properties. We validate the
effectiveness of our approach in a number of experiments (drum hit and piano
onset detection in audio, digit detection in images) and demonstrate
performance comparable to that of fully-supervised state-of-the-art methods,
despite much weaker training requirements.
Comment: Accepted at ICML 2019
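To make the count-as-label idea concrete, here is a toy PyTorch sketch, and only a sketch: a per-frame detector is trained solely on clip-level counts, with the sum of its frame activations serving as the predicted count, so the frame-wise outputs localise events as a byproduct. This illustrates the general principle, not the authors' model; the architecture, shapes, and data are all invented for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical per-frame event detector (frame features -> probability)
detector = nn.Sequential(
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 1), nn.Sigmoid(),
)

features = torch.randn(8, 200, 64)            # (batch, frames, feat_dim), dummy
true_counts = torch.randint(0, 10, (8,)).float()

frame_probs = detector(features).squeeze(-1)  # (batch, frames): localisation
pred_counts = frame_probs.sum(dim=1)          # counting = summing over time
loss = nn.functional.mse_loss(pred_counts, true_counts)
```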
Weakly Labelled AudioSet Tagging with Attention Neural Networks
Audio tagging is the task of predicting the presence or absence of sound
classes within an audio clip. Previous work in audio tagging focused on
relatively small datasets limited to recognising a small number of sound
classes. We investigate audio tagging on AudioSet, which is a dataset
consisting of over 2 million audio clips and 527 classes. AudioSet is weakly
labelled, in that only the presence or absence of sound classes is known for
each clip, while the onset and offset times are unknown. To address the
weakly-labelled audio tagging problem, we propose attention neural networks as
a way to attend to the most salient parts of an audio clip. We bridge the
connection between attention neural networks and multiple instance learning
(MIL) methods, and propose decision-level and feature-level attention neural
networks for audio tagging. We investigate attention neural networks modeled by
different functions, depths and widths. Experiments on AudioSet show that the
feature-level attention neural network achieves a state-of-the-art mean average
precision (mAP) of 0.369, outperforming the best MIL method, which achieves 0.317, and Google's deep neural network baseline of 0.314. In
addition, we discover that the audio tagging performance on AudioSet embedding
features has a weak correlation with the number of training samples and the
quality of the labels of each sound class.
Comment: 13 pages
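Of the two variants, the feature-level attention pools instance features into a single clip embedding before classification, in contrast to the decision-level variant sketched earlier, which pools instance decisions. Below is a minimal PyTorch sketch of one plausible reading of feature-level attention; the layer shapes and the softmax placement are illustrative assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class FeatureLevelAttention(nn.Module):
    """Attend over instances to pool features into one clip embedding,
    then classify the embedding (a sketch of feature-level attention)."""
    def __init__(self, feat_dim, n_classes):
        super().__init__()
        self.att = nn.Linear(feat_dim, feat_dim)   # per-instance weights
        self.cls = nn.Linear(feat_dim, n_classes)  # clip-level classifier

    def forward(self, x):                 # x: (batch, instances, feat_dim)
        w = torch.softmax(self.att(x), dim=1)  # normalise over instances
        clip = (w * x).sum(dim=1)              # attention-pooled embedding
        return torch.sigmoid(self.cls(clip))   # multi-label probabilities
```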