Frame-Wise dynamic threshold based polyphonic acoustic event detection
Acoustic event detection, i.e., determining the type of an acoustic event and localising it in time, has been widely applied in many real-world applications. Many works adopt multi-label classification techniques to perform polyphonic acoustic event detection, using a global threshold to decide which acoustic events are active. However, the global threshold has to be set manually and is highly dependent on the database being tested. To address this, we replace the fixed threshold with a frame-wise dynamic threshold in this paper. Two novel approaches, namely contour-based and regressor-based dynamic thresholding, are proposed. Experimental results on the popular TUT Acoustic Scenes 2016 database of polyphonic events demonstrate the superior performance of the proposed approaches.
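The difference between a fixed global threshold and a frame-wise dynamic one can be sketched as follows. This is a hypothetical illustration: the "contour" here is a smoothed running maximum of the frame probabilities, not necessarily the paper's exact formulation, and the window and scale parameters are made up for the example.

```python
import numpy as np

def global_threshold_detect(probs, thr=0.5):
    # probs: (frames, classes) frame-wise event probabilities.
    # Fixed global threshold: the same cut-off for every frame and class.
    return probs >= thr

def contour_dynamic_threshold_detect(probs, win=9, scale=0.8):
    # Illustrative frame-wise variant: derive a per-frame threshold from
    # a smoothed "contour" of the maximum class probability, so the
    # threshold adapts to the local activity level instead of being fixed.
    kernel = np.ones(win) / win
    contour = np.convolve(probs.max(axis=1), kernel, mode="same")
    thr = scale * contour[:, None]   # (frames, 1), broadcast over classes
    return probs >= thr
```

With a fixed threshold, frames where all probabilities are moderate are silenced; the dynamic contour lowers the bar in such frames, which is the behaviour the paper's approaches aim for.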
Polyphonic Sound Event Detection by using Capsule Neural Networks
Artificial sound event detection (SED) aims to mimic the human ability to perceive and understand what is happening in the surroundings. Nowadays, Deep Learning offers valuable techniques for this goal, such as Convolutional Neural Networks (CNNs). The Capsule Neural Network (CapsNet) architecture was recently introduced in the image processing field with the intent of overcoming some of the known limitations of CNNs, specifically their limited robustness to affine transformations (i.e., perspective, size, orientation) and poor detection of overlapping images. This motivated the authors to employ CapsNets for the polyphonic-SED task, in which multiple sound events occur simultaneously. Specifically, we propose to exploit capsule units to represent a set of distinctive properties of each individual sound event. Capsule units are connected through a so-called "dynamic routing" mechanism that encourages learning part-whole relationships and improves detection performance in a polyphonic context. This paper reports extensive evaluations carried out on three publicly available datasets, showing that the CapsNet-based algorithm not only outperforms standard CNNs but also achieves the best results with respect to state-of-the-art algorithms.
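The "dynamic routing" mechanism mentioned above can be sketched in a few lines of numpy. This is a minimal rendition of routing-by-agreement between capsule layers; the shapes and the iteration count are illustrative, not taken from the paper's configuration.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def squash(s, axis=-1, eps=1e-9):
    # CapsNet non-linearity: short vectors shrink toward zero length,
    # long vectors approach (but never reach) unit length.
    n2 = (s ** 2).sum(axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + eps)

def dynamic_routing(u_hat, iters=3):
    # u_hat: (in_caps, out_caps, dim) prediction vectors from lower capsules.
    b = np.zeros(u_hat.shape[:2])            # routing logits
    for _ in range(iters):
        c = softmax(b, axis=1)               # coupling coefficients per input capsule
        s = (c[..., None] * u_hat).sum(0)    # weighted sum -> (out_caps, dim)
        v = squash(s)                        # output capsule vectors
        b += (u_hat * v[None]).sum(-1)       # agreement strengthens the route
    return v
```

The agreement update is what encourages part-whole relationships: lower capsules whose predictions agree with an output capsule route more of their output to it on the next iteration.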
Polyphonic audio tagging with sequentially labelled data using CRNN with learnable gated linear units
Audio tagging aims to detect the types of sound events occurring in an audio recording. To tag polyphonic audio recordings, we propose to use a Connectionist Temporal Classification (CTC) loss function on top of a Convolutional Recurrent Neural Network (CRNN) with learnable Gated Linear Units (GLU-CTC), based on a new type of audio label data: Sequentially Labelled Data (SLD). In GLU-CTC, the CTC objective function maps the frame-level probabilities of labels to clip-level probabilities of labels. To assess the mapping ability of GLU-CTC for sound events, we train a CRNN with GLU based on Global Max Pooling (GLU-GMP) and a CRNN with GLU based on Global Average Pooling (GLU-GAP). We also compare the proposed GLU-CTC system with a baseline system, a CRNN trained with the CTC loss function but without GLU. The experiments show that GLU-CTC achieves an Area Under Curve (AUC) score of 0.882 in audio tagging, outperforming GLU-GMP at 0.803, GLU-GAP at 0.766, and the baseline system at 0.837. That is, for the same CRNN model with GLU, CTC mapping performs better than GMP and GAP mapping; and with both systems based on CTC mapping, the CRNN with GLU outperforms the CRNN without GLU. Comment: DCASE2018 Workshop.
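A gated linear unit and the two pooling-based mappings it is compared against can be sketched as follows. This is a minimal numpy illustration: the actual model applies GLUs inside a CRNN and maps frame-level to clip-level probabilities with CTC, which is omitted here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_linear_unit(x, W, b, Wg, bg):
    # Learnable GLU: a linear path modulated element-wise by a sigmoid
    # gate, so the network can learn which features to pass through.
    return (x @ W + b) * sigmoid(x @ Wg + bg)

def global_max_pool(frame_probs):
    # GMP mapping: clip-level probability = max over frames.
    return frame_probs.max(axis=0)

def global_avg_pool(frame_probs):
    # GAP mapping: clip-level probability = mean over frames.
    return frame_probs.mean(axis=0)
```

GMP keys on the single strongest frame and GAP dilutes short events over the whole clip; the paper's point is that the CTC mapping sidesteps both failure modes while also exploiting the event order encoded in SLD.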
Sound Event Detection of Weakly Labelled Data with CNN-Transformer and Automatic Threshold Optimization
Sound event detection (SED) is the task of detecting sound events in an audio recording. One challenge of the SED task is that many datasets, such as the Detection and Classification of Acoustic Scenes and Events (DCASE) datasets, are weakly labelled: there are only audio tags for each audio clip, without the onset and offset times of sound events. We compare segment-wise and clip-wise training for SED, a comparison lacking in previous works. We propose a convolutional neural network transformer (CNN-Transformer) for audio tagging and SED, and show that the CNN-Transformer performs similarly to a convolutional recurrent neural network (CRNN). Another challenge of SED is that thresholds are required for detecting sound events. Previous works set thresholds empirically, which is not optimal. To solve this problem, we propose an automatic threshold optimization method. The first stage is to optimize the system with respect to metrics that do not depend on thresholds, such as mean average precision (mAP). The second stage is to optimize the thresholds with respect to metrics that do depend on those thresholds. Our proposed automatic threshold optimization system achieves a state-of-the-art audio tagging F1 of 0.646, outperforming the 0.629 obtained without threshold optimization, and a sound event detection F1 of 0.584, outperforming the 0.564 obtained without threshold optimization.
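The second stage, picking thresholds against a threshold-dependent metric, can be sketched as follows. The coordinate-wise grid search over per-class thresholds is one simple way to do it and is an assumption here; the paper's optimizer may differ.

```python
import numpy as np

def f1(preds, targets, eps=1e-9):
    # Binary F1 from boolean prediction/target arrays.
    tp = np.logical_and(preds, targets).sum()
    fp = np.logical_and(preds, ~targets).sum()
    fn = np.logical_and(~preds, targets).sum()
    p = tp / (tp + fp + eps)
    r = tp / (tp + fn + eps)
    return 2 * p * r / (p + r + eps)

def optimize_thresholds(probs, targets, grid=np.linspace(0.05, 0.95, 19)):
    # Stage-2 idea: after the model is trained against a threshold-free
    # metric (e.g. mAP), choose a per-class threshold that maximizes
    # validation F1, here by a simple per-class grid search.
    n_classes = probs.shape[1]
    thresholds = np.full(n_classes, 0.5)
    for k in range(n_classes):
        scores = [f1(probs[:, k] >= t, targets[:, k]) for t in grid]
        thresholds[k] = grid[int(np.argmax(scores))]
    return thresholds
```

Because stage 1 never looked at thresholds, the learned probabilities are free to be well separated, and stage 2 only has to find the separating cut-off per class.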
Efficient Data Utilization and Weakly Supervised Learning Techniques for Audio Event Detection
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, February 2020. Advisor: Nam Soo Kim.
Conventional audio event detection (AED) models are based on supervised approaches. For supervised approaches, strongly labeled data is required. However, collecting large-scale strongly labeled data of audio events is challenging due to the diversity of audio event types and labeling difficulties. In this thesis, we propose data-efficient and weakly supervised techniques for AED.
In the first approach, a data-efficient AED system is proposed. In the proposed system, data augmentation is performed to deal with the data sparsity problem and generate polyphonic event examples. An exemplar-based noise reduction algorithm is proposed for feature enhancement. For polyphonic event detection, a multi-labeled deep neural network (DNN) classifier is employed. An adaptive thresholding algorithm is applied as a post-processing method for robust event detection in noisy conditions. From the experimental results, the proposed algorithm has shown promising performance for AED on a low-resource dataset.
In the second approach, a convolutional neural network (CNN)-based audio tagging system is proposed. The proposed model consists of a local detector and a global classifier. The local detector detects local audio words that contain distinct characteristics of events, and the global classifier summarizes the information to predict audio events on the recording. From the experimental results, we have found that the proposed model outperforms conventional artificial neural network models.
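The local-detector/global-classifier split described above can be sketched as follows. The logistic detector and the softmax-weighted summary are hypothetical stand-ins for the thesis's CNN components, chosen only to make the two roles concrete.

```python
import numpy as np

def local_detector(segments, W, b):
    # Hypothetical stand-in for the CNN local detector: score each time
    # segment (an "audio word") for each event class.
    return 1.0 / (1.0 + np.exp(-(segments @ W + b)))   # (segments, classes)

def global_classifier(local_scores):
    # Summarize local evidence into a clip-level prediction: segments
    # with stronger responses receive larger softmax weights per class,
    # so a short but distinct event is not averaged away.
    w = np.exp(local_scores)
    w = w / w.sum(axis=0, keepdims=True)
    return (w * local_scores).sum(axis=0)              # (classes,)
```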
In the final approach, we propose a weakly supervised AED model. The proposed model takes advantage of the strengthened feature propagation of DenseNet and the channel-wise relationship modeling of SENet. In addition, correlations among segments in audio recordings are represented by a recurrent neural network (RNN) and a conditional random field (CRF): the RNN exploits contextual information, and CRF post-processing helps refine segment-level predictions. We evaluate the proposed method and compare its performance with a CNN-based baseline approach. A number of experiments show that the proposed method is effective for both audio tagging and weakly supervised AED.
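The SENet-style channel reweighting used in the final approach can be sketched as a minimal numpy squeeze-and-excitation block; the weight shapes (and hence the bottleneck reduction ratio) are illustrative.

```python
import numpy as np

def squeeze_excitation(feat, W1, W2):
    # feat: (channels, time, freq) feature map.
    # Squeeze: global average pooling per channel.
    z = feat.mean(axis=(1, 2))                        # (channels,)
    # Excite: bottleneck MLP + sigmoid yields per-channel weights in (0, 1).
    s = 1.0 / (1.0 + np.exp(-(np.maximum(z @ W1, 0.0) @ W2)))
    # Scale: recalibrate each channel of the feature map.
    return feat * s[:, None, None]
```

The block lets the network emphasize channels that are informative for the current recording and suppress the rest, which is the channel-wise relationship modeling the abstract refers to.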
1 Introduction
2 Audio Event Detection
2.1 Data-Efficient Audio Event Detection
2.2 Audio Tagging
2.3 Weakly Supervised Audio Event Detection
2.4 Metrics
3 Data-Efficient Techniques for Audio Event Detection
3.1 Introduction
3.2 DNN-Based AED System
3.2.1 Data Augmentation
3.2.2 Exemplar-Based Approach for Noise Reduction
3.2.3 DNN Classifier
3.2.4 Post-Processing
3.3 Experiments
3.4 Summary
4 Audio Tagging using Local Detector and Global Classifier
4.1 Introduction
4.2 CNN-Based Audio Tagging Model
4.2.1 Local Detector and Global Classifier
4.2.2 Temporal Localization of Events
4.3 Experiments
4.3.1 Dataset and Feature
4.3.2 Model Training
4.3.3 Results
4.4 Summary
5 Deep Convolutional Neural Network with Structured Prediction for Weakly Supervised Audio Event Detection
5.1 Introduction
5.2 CNN with Structured Prediction for Weakly Supervised AED
5.2.1 DenseNet
5.2.2 Squeeze-and-Excitation
5.2.3 Global Pooling for Aggregation
5.2.4 Structured Prediction for Accurate Event Localization
5.3 Experiments
5.3.1 Dataset
5.3.2 Feature Extraction
5.3.3 DSNet and DSNet-RNN Structures
5.3.4 Baseline CNN Structure
5.3.5 Training and Evaluation
5.3.6 Metrics
5.3.7 Results and Discussion
5.3.8 Comparison with the DCASE 2017 Task 4 Results
5.4 Summary
6 Conclusions
Bibliography
Abstract (in Korean)
Acknowledgements
- โฆ