DCASE 2018 Challenge Surrey Cross-Task convolutional neural network baseline
The Detection and Classification of Acoustic Scenes and Events (DCASE)
consists of five audio classification and sound event detection tasks: 1)
Acoustic scene classification, 2) General-purpose audio tagging of Freesound,
3) Bird audio detection, 4) Weakly-labeled semi-supervised sound event
detection and 5) Multi-channel audio classification. In this paper, we create a
cross-task baseline system for all five tasks based on a convolutional neural
network (CNN): a "CNN Baseline" system. We implemented CNNs with 4 layers and 8
layers originating from AlexNet and VGG from computer vision. We investigated
how the performance varies from task to task with the same configuration of
neural networks. Experiments show that the deeper 8-layer CNN outperforms the
4-layer CNN on all tasks except Task 1. Using the 8-layer CNN, we
achieve an accuracy of 0.680 on Task 1, an accuracy of 0.895 and a mean average
precision (MAP) of 0.928 on Task 2, an accuracy of 0.751 and an area under the
curve (AUC) of 0.854 on Task 3, a sound event detection F1 score of 20.8% on
Task 4, and an F1 score of 87.75% on Task 5. We released the Python source code
of the baseline systems under the MIT license for further research.
Comment: Accepted by DCASE 2018 Workshop. 4 pages. Source code available.
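One of the metrics reported above, mean average precision (MAP) for the Task 2 tagging results, can be illustrated with a short sketch. This is not taken from the released baseline code; the function names and the per-class averaging convention are assumptions for illustration.

```python
import numpy as np

def average_precision(y_true, y_score):
    """AP for one tag class: mean of the precision values measured at each
    true positive, with clips ranked by descending predicted score."""
    order = np.argsort(-np.asarray(y_score))
    y_true = np.asarray(y_true)[order]
    hits, precisions = 0, []
    for rank, is_positive in enumerate(y_true, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return float(np.mean(precisions)) if precisions else 0.0

def mean_average_precision(Y_true, Y_score):
    """MAP: the per-class APs averaged over all tag classes (columns)."""
    aps = [average_precision(Y_true[:, c], Y_score[:, c])
           for c in range(Y_true.shape[1])]
    return float(np.mean(aps))
```

For example, with ranked relevance [1, 0, 1] for a single class, precision is 1/1 at the first hit and 2/3 at the second, giving AP = (1 + 2/3) / 2.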
Sound Event Detection with Sequentially Labelled Data Based on Connectionist Temporal Classification and Unsupervised Clustering
Sound event detection (SED) methods typically rely on either strongly
labelled data or weakly labelled data. As an alternative, sequentially labelled
data (SLD) was proposed. In SLD, the events and the order of events in audio
clips are known, without knowing the occurrence time of events. This paper
proposes a connectionist temporal classification (CTC) based SED system that
uses SLD instead of strongly labelled data, with a novel unsupervised
clustering stage. Experiments on 41 classes of sound events show that the
proposed two-stage method trained on SLD achieves performance comparable to the
previous state-of-the-art SED system trained on strongly labelled data, and is
far better than another state-of-the-art SED system trained on weakly labelled
data, which indicates the effectiveness of the proposed two-stage method
trained on SLD without any onset/offset times of sound events.
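The fit between CTC and SLD comes from CTC's label-collapsing rule: a frame-level path is reduced to an event sequence by merging consecutive repeats and dropping blanks, so only the events and their order are supervised, never their timing. The following is a minimal sketch of that rule only (CTC training itself is not shown); the function name and blank convention are illustrative.

```python
def ctc_collapse(path, blank=0):
    """Collapse a frame-level CTC path into an event sequence:
    merge consecutive repeated labels, then remove blank symbols."""
    out, prev = [], None
    for label in path:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out
```

Note that a blank between two identical labels keeps them distinct: the path [0, 3, 3, 0, 3] collapses to [3, 3], two occurrences of event 3.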
Acoustic Scene Classification by Implicitly Identifying Distinct Sound Events
In this paper, we propose a new strategy for acoustic scene classification
(ASC), namely recognizing acoustic scenes through identifying distinct sound
events. This differs from existing strategies, which focus on characterizing
global acoustical distributions of audio or the temporal evolution of
short-term audio features, without analysis down to the level of sound events.
To identify distinct sound events for each scene, we formulate ASC in a
multi-instance learning (MIL) framework, where each audio recording is mapped
into a bag-of-instances representation. Here, instances can be seen as
high-level representations for sound events inside a scene. We also propose an
MIL neural network model, which implicitly identifies distinct instances
(i.e., sound events). Furthermore, we propose two specially designed modules
that model the multi-temporal scale and multi-modal natures of the sound events
respectively. The experiments were conducted on the official development set of
the DCASE2018 Task1 Subtask B, and our best-performing model improves over the
official baseline by 9.4% (68.3% vs 58.9%) in terms of classification accuracy.
This study indicates that recognizing acoustic scenes by identifying distinct
sound events is effective and paves the way for future studies that combine
this strategy with previous ones.
Comment: code URL typo; code is available at
https://github.com/hackerekcah/distinct-events-asc.gi
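The bag-of-instances idea above reduces, at prediction time, to pooling instance-level (sound-event) scores into one bag-level (scene) score. A minimal numpy sketch under assumed max or noisy-or pooling follows; the paper's actual MIL pooling and network may differ.

```python
import numpy as np

def mil_scene_scores(instance_probs, pooling="max"):
    """Pool instance-level event probabilities to bag-level scene scores.

    instance_probs: (n_instances, n_classes) array of per-instance
    probabilities. "max" keeps the most confident instance per class;
    "noisy-or" marks the bag positive if any instance is positive."""
    if pooling == "max":
        return instance_probs.max(axis=0)
    return 1.0 - np.prod(1.0 - instance_probs, axis=0)
```

Max pooling embodies the standard MIL assumption that one distinctive instance (here, one distinct sound event) suffices to label the whole recording.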
Environmental Sound Classification with Parallel Temporal-spectral Attention
Convolutional neural networks (CNN) are one of the best-performing neural
network architectures for environmental sound classification (ESC). Recently,
temporal attention mechanisms have been used in CNN to capture the useful
information from the relevant time frames for audio classification, especially
for weakly labelled data where the onset and offset times of the sound events
are not annotated. In these methods, however, the inherent spectral
characteristics and variations are not explicitly exploited when obtaining the
deep features. In this paper, we propose a novel parallel temporal-spectral
attention mechanism for CNN to learn discriminative sound representations,
which enhances the temporal and spectral features by capturing the importance
of different time frames and frequency bands. Parallel branches are constructed
to allow temporal attention and spectral attention to be applied respectively
in order to mitigate interference from the segments without the presence of
sound events. The experiments on three environmental sound classification (ESC)
datasets and two acoustic scene classification (ASC) datasets show that our
method improves the classification performance and also exhibits robustness to
noise.
Comment: submitted to INTERSPEECH202
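The parallel-branch idea can be sketched in a few lines of numpy: one branch weights time frames, the other weights frequency bands, and the two reweighted spectrograms are merged. This is a simplified illustration under assumed fixed weight vectors and an additive merge; the paper's mechanism uses learned CNN features.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def parallel_temporal_spectral_attention(spec, w_t, w_f):
    """spec: (T, F) log-mel spectrogram.
    w_t: (F,) vector scoring each time frame; w_f: (T,) vector scoring
    each frequency band. Both are assumed given (learned in practice)."""
    a_t = softmax(spec @ w_t, axis=0)   # (T,) attention over time frames
    a_f = softmax(w_f @ spec, axis=0)   # (F,) attention over frequency bands
    temporal = spec * a_t[:, None]      # emphasize informative frames
    spectral = spec * a_f[None, :]      # emphasize informative bands
    return temporal + spectral          # merge the two parallel branches
```

Because each branch normalizes over its own axis, segments without sound events receive low temporal weight without suppressing the spectral branch, which is the motivation for running the two attentions in parallel rather than in sequence.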