Sound Event Detection and Time-Frequency Segmentation from Weakly Labelled Data
Sound event detection (SED) aims to detect when and recognize what sound
events happen in an audio clip. Many supervised SED algorithms rely on strongly
labelled data which contains the onset and offset annotations of sound events.
However, many audio tagging datasets are weakly labelled, that is, only the
presence of the sound events is known, without knowing their onset and offset
annotations. In this paper, we propose a time-frequency (T-F) segmentation
framework trained on weakly labelled data to tackle the sound event detection
and separation problem. In training, a segmentation mapping is applied to a T-F
representation, such as the log mel spectrogram of an audio clip, to obtain T-F
segmentation masks of sound events. The T-F segmentation masks can be used for
separating the sound events from the background scenes in the time-frequency
domain. Then, a classification mapping is applied to the T-F segmentation masks
to estimate the presence probabilities of the sound events. We model the
segmentation mapping using a convolutional neural network and the
classification mapping using a global weighted rank pooling (GWRP). In SED,
predicted onset and offset times can be obtained from the T-F segmentation
masks. As a byproduct, separated waveforms of sound events can be obtained from
the T-F segmentation masks. We remixed the DCASE 2018 Task 1 acoustic scene
data with the DCASE 2018 Task 2 sound events data. When mixing under 0 dB, the
proposed method achieved F1 scores of 0.534, 0.398 and 0.167 in audio tagging,
frame-wise SED and event-wise SED, outperforming the fully connected deep
neural network baseline of 0.331, 0.237 and 0.120, respectively. In T-F
segmentation, we achieved an F1 score of 0.218, whereas previous methods were not
able to perform T-F segmentation at all.
Comment: 12 pages, 8 figures
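The global weighted rank pooling used as the classification mapping above reduces a T-F segmentation mask to a clip-level probability. A minimal numpy sketch, assuming a geometric decay ratio r (the value of r here is illustrative, not the paper's setting):

```python
import numpy as np

def gwrp(mask, r=0.5):
    """Global weighted rank pooling: sort mask values in descending
    order and apply geometrically decaying weights r**j.
    r=1 recovers global average pooling; r=0 recovers global max pooling."""
    x = np.sort(mask.ravel())[::-1]          # values, largest first
    w = r ** np.arange(x.size)               # decaying rank weights
    return float(np.sum(w * x) / np.sum(w))  # normalised weighted sum
```

The decay ratio interpolates between max pooling (sensitive only to the strongest T-F bin) and average pooling (diluted by silent bins), which is why GWRP suits weakly labelled training.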
A joint separation-classification model for sound event detection of weakly labelled data
Source separation (SS) aims to separate individual sources from an audio
recording. Sound event detection (SED) aims to detect sound events from an
audio recording. We propose a joint separation-classification (JSC) model
trained only on weakly labelled audio data, that is, only the tags of an audio
recording are known, while the times of the events are unknown. First, we propose a
separation mapping from the time-frequency (T-F) representation of an audio recording to
the T-F segmentation masks of the audio events. Second, a classification
mapping is built from each T-F segmentation mask to the presence probability of
each audio event. In the source separation stage, the sources and occurrence
times of sound events can be obtained from the T-F segmentation masks. The
proposed method achieves an equal error rate (EER) of 0.14 in SED,
outperforming a deep neural network baseline of 0.29. A source separation SDR
of 8.08 dB is obtained by using global weighted rank pooling (GWRP) as the
probability mapping, outperforming the global max pooling (GMP) based
probability mapping, which gives an SDR of 0.03 dB. The source code of our work
is published.
Comment: Accepted by ICASSP 201
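The two mappings described above — masking the mixture spectrogram and pooling a mask into a presence probability — can be sketched with numpy. This is an illustrative sketch of the interfaces, not the paper's trained model:

```python
import numpy as np

def separate_event(mixture_stft, mask):
    """Separation mapping: a T-F segmentation mask (values in [0, 1])
    applied element-wise to the complex mixture STFT isolates one
    event's spectrogram."""
    return mask * mixture_stft

def presence_prob(mask):
    """Classification mapping via global max pooling (GMP): the
    clip-level presence probability is the mask's largest value."""
    return float(mask.max())
```

In the full model, both mappings are learned jointly, so the masks must both reconstruct the event source and support the clip-level tag prediction.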
Polyphonic audio tagging with sequentially labelled data using CRNN with learnable gated linear units
Audio tagging aims to detect the types of sound events occurring in an audio
recording. To tag polyphonic audio recordings, we propose to use a
Connectionist Temporal Classification (CTC) loss function on top of a
Convolutional Recurrent Neural Network (CRNN) with learnable Gated Linear Units
(GLU-CTC), based on a new type of audio label data: Sequentially Labelled Data
(SLD). In GLU-CTC, the CTC objective function maps frame-level label
probabilities to clip-level label probabilities. To compare the mapping ability of
GLU-CTC for sound events, we train a CRNN with GLU based on Global Max Pooling
(GLU-GMP) and a CRNN with GLU based on Global Average Pooling (GLU-GAP). We
also compare the proposed GLU-CTC system with the baseline system, which is a
CRNN trained using CTC loss function without GLU. The experiments show that the
GLU-CTC achieves an Area Under Curve (AUC) score of 0.882 in audio tagging,
outperforming the GLU-GMP of 0.803, GLU-GAP of 0.766 and baseline system of
0.837. This means that, for the same CRNN model with GLU, the CTC mapping
performs better than the GMP and GAP mappings. Given the same CTC mapping, the
CRNN with GLU outperforms the CRNN without GLU.
Comment: DCASE2018 Workshop. arXiv admin note: text overlap with arXiv:1808.0193
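A learnable gated linear unit of the kind used here multiplies a linear path by a sigmoid gate, letting the network learn which time-frequency regions to attend to. A minimal numpy sketch (the parameter names W, V, b, c are my own, not the paper's notation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glu(x, W, V, b, c):
    """Gated linear unit: a linear path modulated by a learned
    sigmoid gate, GLU(x) = (xW + b) * sigmoid(xV + c).
    W, V, b, c are the learnable parameters."""
    return (x @ W + b) * sigmoid(x @ V + c)
```

With the gate at 0 a feature is suppressed entirely; with the gate at 1 it passes through unchanged, which is the attention-like behaviour the GLU contributes to the CRNN.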
Sound Event Detection with Sequentially Labelled Data Based on Connectionist Temporal Classification and Unsupervised Clustering
Sound event detection (SED) methods typically rely on either strongly
labelled data or weakly labelled data. As an alternative, sequentially labelled
data (SLD) was proposed. In SLD, the events and the order of events in audio
clips are known, without knowing the occurrence time of events. This paper
proposes a connectionist temporal classification (CTC) based SED system that
uses SLD instead of strongly labelled data, with a novel unsupervised
clustering stage. Experiments on 41 classes of sound events show that the
proposed two-stage method trained on SLD achieves performance comparable to the
previous state-of-the-art SED system trained on strongly labelled data, and is
far better than another state-of-the-art SED system trained on weakly labelled
data, which indicates the effectiveness of the proposed two-stage method
trained on SLD without any onset/offset times of sound events.
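The CTC mapping at the heart of this approach turns frame-level predictions into an ordered event sequence without timing supervision. A minimal sketch of greedy CTC decoding (the unsupervised clustering stage is omitted):

```python
def ctc_collapse(frame_labels, blank=0):
    """Greedy CTC decoding: collapse consecutive repeated labels,
    then drop blanks, mapping frame-level predictions to an event
    sequence without any onset/offset supervision."""
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)   # new non-blank label begins
        prev = lab
    return out
```

The blank symbol is what lets CTC represent both silence between events and the boundary between two occurrences of the same event.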
Audio Caption: Listen and Tell
An increasing amount of research has shed light on machine perception of audio
events, most of which concerns detection and classification tasks. However,
human-like perception of audio scenes involves not only detecting and
classifying audio sounds, but also summarizing the relationship between
different audio events. Comparable research, such as image captioning, has been
conducted, yet the audio field is still quite barren. This paper introduces a
manually annotated dataset for audio captioning. The purpose is to automatically
generate natural sentences for audio scene description and to bridge the gap
between machine perception of audio and image. The whole dataset is labelled in
Mandarin and we also include translated English annotations. A baseline
encoder-decoder model is provided for both English and Mandarin. Similar BLEU
scores are derived for both languages: our model can generate understandable
and data-related captions based on the dataset.
Comment: accepted by ICASSP201
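An encoder-decoder captioner of this kind emits a sentence token by token. A minimal greedy decoding loop, where step_fn is a hypothetical stand-in for one decoder step returning next-token probabilities:

```python
import numpy as np

def greedy_decode(step_fn, start_id, eos_id, max_len=20):
    """Greedy decoding for an encoder-decoder captioner: repeatedly
    pick the most probable next token until <eos> or max_len.
    step_fn(prev_token) -> probability vector (hypothetical interface)."""
    caption, tok = [], start_id
    for _ in range(max_len):
        tok = int(np.argmax(step_fn(tok)))
        if tok == eos_id:
            break
        caption.append(tok)
    return caption
```

Real captioning baselines typically replace greedy search with beam search, but the token-by-token structure is the same.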
Classification of Animal Sound Using Convolutional Neural Network
Recently, the labelling of acoustic events has emerged as an active topic covering a wide range of applications. High-level semantic inference can be conducted based on the main audio effects to facilitate various content-based applications for analysis, efficient retrieval and content management. This paper proposes a flexible convolutional neural network (CNN) based framework for animal audio classification. The work takes inspiration from various deep neural networks developed recently for multimedia classification. The model is driven by the idea of identifying the animal sound in an audio file by forcing the network to pay attention to the core audio effects present in the audio's mel-spectrogram. The designed framework achieves an accuracy of 98% when classifying animal audio on weakly labelled datasets. A further aim of this research is to build a framework that can run even on a basic machine and does not require high-end devices to perform the classification.
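The spectrogram front end that feeds such a CNN can be sketched with numpy. Here a plain log-magnitude spectrogram stands in for the mel-spectrogram (the mel filterbank is omitted, and the frame parameters are chosen for illustration):

```python
import numpy as np

def log_spectrogram(wave, frame_len=1024, hop=512, eps=1e-10):
    """Log-magnitude spectrogram via framed, windowed FFT: the 2-D
    time-frequency image a CNN classifier consumes
    (mel warping omitted for brevity)."""
    n_frames = 1 + (len(wave) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([wave[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.log(np.abs(np.fft.rfft(frames, axis=1)) + eps)   # (frames, bins)
```

In practice a mel filterbank (e.g. via a library such as librosa) would be applied to the magnitude spectrum before the log, compressing the frequency axis perceptually.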