556 research outputs found

    Frame-Wise dynamic threshold based polyphonic acoustic event detection

    Acoustic event detection, i.e., determining the type of an acoustic event and localising it in time, has been widely applied in many real-world applications. Many works adopt multi-label classification techniques to perform polyphonic acoustic event detection, using a global threshold to detect the active acoustic events. However, the global threshold has to be set manually and is highly dependent on the database being tested. To deal with this, we replace the fixed threshold with a frame-wise dynamic threshold in this paper. Two novel approaches, namely contour-based and regressor-based dynamic thresholding, are proposed. Experimental results on the popular TUT Acoustic Scenes 2016 database of polyphonic events demonstrate the superior performance of the proposed approaches.
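The contrast between a fixed global threshold and a frame-wise dynamic one can be sketched as follows. This is a minimal illustration, not the paper's exact contour or regressor method: the per-frame threshold here is simply each frame's maximum probability minus an offset, with a floor so silent frames do not fire.

```python
import numpy as np

def detect_global(probs, threshold=0.5):
    """Binarize frame-level event probabilities with one fixed threshold."""
    return probs >= threshold

def detect_dynamic(probs, offset=0.1, floor=0.15):
    """Binarize with a frame-wise threshold derived from each frame's own
    probability contour (per-frame maximum minus an offset). This is only
    an illustrative stand-in for the contour- and regressor-based
    approaches described in the abstract."""
    thresholds = np.maximum(probs.max(axis=1, keepdims=True) - offset, floor)
    return probs >= thresholds

# probs: (frames, classes) posteriors from a multi-label classifier
probs = np.array([[0.9, 0.7, 0.1],
                  [0.3, 0.2, 0.1]])
print(detect_global(probs))   # fixed 0.5: nothing in the quiet frame passes
print(detect_dynamic(probs))  # frame-wise: the quiet frame's events pass too
```

The point of the dynamic variant is that a quiet frame with a genuine low-confidence event can still be detected, which a single database-tuned global threshold cannot do.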

    Polyphonic Sound Event Detection by using Capsule Neural Networks

    Artificial sound event detection (SED) aims to mimic the human ability to perceive and understand what is happening in the surroundings. Deep Learning now offers valuable techniques for this goal, such as Convolutional Neural Networks (CNNs). The Capsule Neural Network (CapsNet) architecture was recently introduced in the image processing field to overcome some known limitations of CNNs, specifically their limited robustness to affine transformations (i.e., perspective, size, orientation) and the detection of overlapping objects. This motivated the authors to employ CapsNets for the polyphonic-SED task, in which multiple sound events occur simultaneously. Specifically, we propose to exploit the capsule units to represent a set of distinctive properties for each individual sound event. Capsule units are connected through a so-called "dynamic routing" mechanism that encourages learning part-whole relationships and improves detection performance in a polyphonic context. This paper reports extensive evaluations carried out on three publicly available datasets, showing that the CapsNet-based algorithm not only outperforms standard CNNs but also achieves the best results with respect to state-of-the-art algorithms.
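The "dynamic routing" the abstract refers to is routing-by-agreement from the original CapsNet work (Sabour et al., 2017): coupling coefficients between lower and upper capsules are iteratively reweighted toward upper capsules whose outputs agree with the lower capsules' predictions. A minimal NumPy sketch, with shapes and iteration count chosen for illustration:

```python
import numpy as np

def squash(v, axis=-1, eps=1e-8):
    """Capsule activation: scales a vector's norm into [0, 1)."""
    n2 = np.sum(v * v, axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * v / np.sqrt(n2 + eps)

def dynamic_routing(u_hat, iterations=3):
    """Routing-by-agreement between two capsule layers.
    u_hat: (num_in, num_out, dim) prediction vectors from lower capsules."""
    num_in, num_out, dim = u_hat.shape
    b = np.zeros((num_in, num_out))            # routing logits
    for _ in range(iterations):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling coefficients
        s = (c[..., None] * u_hat).sum(axis=0)                # weighted sum per output capsule
        v = squash(s)                                         # output capsule vectors
        b += (u_hat * v[None]).sum(axis=-1)    # agreement (dot product) updates the logits
    return v

rng = np.random.default_rng(0)
v = dynamic_routing(rng.normal(size=(6, 3, 8)))
print(v.shape)  # one pose vector per output capsule
```

In the polyphonic-SED setting described above, each output capsule's vector norm can be read as the presence probability of one sound event, which is what lets overlapping events coexist in the prediction.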

    Polyphonic audio tagging with sequentially labelled data using CRNN with learnable gated linear units

    Audio tagging aims to detect the types of sound events occurring in an audio recording. To tag polyphonic audio recordings, we propose to use the Connectionist Temporal Classification (CTC) loss function on top of a Convolutional Recurrent Neural Network (CRNN) with learnable Gated Linear Units (GLU-CTC), based on a new type of audio label data: Sequentially Labelled Data (SLD). In GLU-CTC, the CTC objective function maps the frame-level label probabilities to clip-level label probabilities. To compare the mapping ability of GLU-CTC for sound events, we train a CRNN with GLU based on Global Max Pooling (GLU-GMP) and a CRNN with GLU based on Global Average Pooling (GLU-GAP). We also compare the proposed GLU-CTC system with a baseline system, a CRNN trained using the CTC loss function without GLU. The experiments show that GLU-CTC achieves an Area Under Curve (AUC) score of 0.882 in audio tagging, outperforming GLU-GMP (0.803), GLU-GAP (0.766) and the baseline system (0.837). That is, for the same CRNN model with GLU, CTC mapping performs better than GMP and GAP mapping; and with CTC mapping in both cases, the CRNN with GLU outperforms the CRNN without GLU. (DCASE2018 Workshop)
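The two pooling baselines compared above differ only in how frame-level probabilities are collapsed into one clip-level tag probability. A short sketch of that difference (the CTC mapping itself is omitted; function names are illustrative):

```python
import numpy as np

def clip_prob_gmp(frame_probs):
    """Global max pooling: a tag is as probable as its single strongest frame."""
    return frame_probs.max(axis=0)

def clip_prob_gap(frame_probs):
    """Global average pooling: a tag's probability is averaged over all frames,
    which dilutes short events in long clips."""
    return frame_probs.mean(axis=0)

# frame_probs: (frames, tags); the second tag is a short, strong event
frame_probs = np.array([[0.1, 0.9],
                        [0.1, 0.1],
                        [0.1, 0.1],
                        [0.1, 0.1]])
print(clip_prob_gmp(frame_probs))  # short event kept (~0.9)
print(clip_prob_gap(frame_probs))  # short event diluted (~0.3)
```

This dilution of brief events under GAP, and GMP's sensitivity to a single noisy frame, are the usual reasons a learned frame-to-clip mapping such as CTC can outperform both.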

    Sound Event Detection of Weakly Labelled Data with CNN-Transformer and Automatic Threshold Optimization

    Sound event detection (SED) is the task of detecting sound events in an audio recording. One challenge of the SED task is that many datasets, such as the Detection and Classification of Acoustic Scenes and Events (DCASE) datasets, are weakly labelled: there are only audio tags for each audio clip, without the onset and offset times of sound events. We compare segment-wise and clip-wise training for SED, a comparison lacking in previous works. We propose a convolutional neural network transformer (CNN-Transformer) for audio tagging and SED, and show that the CNN-Transformer performs similarly to a convolutional recurrent neural network (CRNN). Another challenge of SED is that thresholds are required for detecting sound events. Previous works set thresholds empirically, which is not optimal. To solve this problem, we propose an automatic threshold optimization method. The first stage optimizes the system with respect to metrics that do not depend on thresholds, such as mean average precision (mAP). The second stage optimizes the thresholds with respect to metrics that do depend on them. Our proposed automatic threshold optimization system achieves a state-of-the-art audio tagging F1 of 0.646, outperforming the 0.629 obtained without threshold optimization, and a sound event detection F1 of 0.584, outperforming the 0.564 obtained without threshold optimization.
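The second stage of the scheme above can be sketched concretely: with the model (and hence its predicted probabilities) frozen after the threshold-free first stage, pick per-class thresholds that maximize the threshold-dependent metric on a validation set. A simple grid search stands in here for whatever optimizer the paper actually uses, and all names are illustrative:

```python
import numpy as np

def f1(y_true, y_pred):
    """Binary F1 for one class, from boolean arrays."""
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def optimize_thresholds(probs, labels, grid=np.linspace(0.05, 0.95, 19)):
    """Stage two (sketch): for each class, choose the threshold on a
    validation set that maximizes that class's F1, the model being fixed."""
    thresholds = []
    for k in range(probs.shape[1]):
        scores = [f1(labels[:, k], probs[:, k] >= t) for t in grid]
        thresholds.append(grid[int(np.argmax(scores))])
    return np.array(thresholds)

# toy validation set: class 0 needs a low threshold, class 1 a higher one
probs = np.array([[0.30, 0.90], [0.35, 0.40], [0.10, 0.85], [0.05, 0.95]])
labels = np.array([[1, 1], [1, 0], [0, 1], [0, 1]], dtype=bool)
print(optimize_thresholds(probs, labels))
```

Decoupling the two stages is what makes this "automatic": the network never sees the thresholds during training, so they can be tuned per class afterwards without retraining.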

    ์Œํ–ฅ ์ด๋ฒคํŠธ ํƒ์ง€๋ฅผ ์œ„ํ•œ ํšจ์œจ์  ๋ฐ์ดํ„ฐ ํ™œ์šฉ ๋ฐ ์•ฝํ•œ ๊ต์‚ฌํ•™์Šต ๊ธฐ๋ฒ•

    Doctoral dissertation, Seoul National University, Department of Electrical and Computer Engineering, February 2020. Advisor: Nam Soo Kim. Conventional audio event detection (AED) models are based on supervised approaches. Supervised approaches require strongly labeled data; however, collecting large-scale strongly labeled audio event data is challenging due to the diversity of audio event types and the difficulty of labeling. In this thesis, we propose data-efficient and weakly supervised techniques for AED. In the first approach, a data-efficient AED system is proposed. In the proposed system, data augmentation is performed to deal with the data sparsity problem and to generate polyphonic event examples. An exemplar-based noise reduction algorithm is proposed for feature enhancement. For polyphonic event detection, a multi-labeled deep neural network (DNN) classifier is employed. An adaptive thresholding algorithm is applied as a post-processing method for robust event detection in noisy conditions. The experimental results show that the proposed algorithm achieves promising AED performance on a low-resource dataset. In the second approach, a convolutional neural network (CNN)-based audio tagging system is proposed. The proposed model consists of a local detector and a global classifier. The local detector detects local audio words that contain distinct characteristics of events, and the global classifier summarizes this information to predict the audio events in the recording. The experimental results show that the proposed model outperforms conventional artificial neural network models. In the final approach, we propose a weakly supervised AED model. The proposed model takes advantage of the strengthened feature propagation of DenseNet and the channel-wise relationship modeling of SENet. Also, the correlations among segments in audio recordings are represented by a recurrent neural network (RNN) and a conditional random field (CRF). The RNN utilizes contextual information, and CRF post-processing helps refine segment-level predictions. We evaluate the proposed method and compare its performance with a CNN-based baseline. A number of experiments show that the proposed method is effective for both audio tagging and weakly supervised AED.
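The thesis's second approach, a local detector whose segment-level scores a global classifier summarizes, can be sketched with attention pooling. The layer shapes, the attention-based summary, and every name below are illustrative assumptions, not the thesis's exact architecture:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tag_clip(features, w_det, w_att):
    """Local detector + global classifier (sketch).
    features: (segments, dim) local 'audio word' embeddings.
    w_det scores each segment per class (the local detector); w_att yields
    attention weights with which the global classifier averages the local
    scores into one clip-level prediction."""
    det = sigmoid(features @ w_det)                 # (segments, classes) local scores
    z = features @ w_att
    z -= z.max(axis=0, keepdims=True)               # numerically stable softmax
    att = np.exp(z)
    att /= att.sum(axis=0, keepdims=True)           # normalize over segments
    return (att * det).sum(axis=0)                  # (classes,) clip-level tags

rng = np.random.default_rng(1)
features = rng.normal(size=(10, 16))
tags = tag_clip(features, rng.normal(size=(16, 5)), rng.normal(size=(16, 5)))
print(tags.shape)  # one probability per event class
```

A useful side effect of this split, noted in the abstract, is temporal localization for free: the per-segment `det` scores indicate where in the recording each tagged event occurred, even though only clip-level labels were used for training.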
    • โ€ฆ
    corecore