767 research outputs found

    ์Œํ–ฅ ์ด๋ฒคํŠธ ํƒ์ง€๋ฅผ ์œ„ํ•œ ํšจ์œจ์  ๋ฐ์ดํ„ฐ ํ™œ์šฉ ๋ฐ ์•ฝํ•œ ๊ต์‚ฌํ•™์Šต ๊ธฐ๋ฒ•

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ(๋ฐ•์‚ฌ)--์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› :๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€,2020. 2. ๊น€๋‚จ์ˆ˜.Conventional audio event detection (AED) models are based on supervised approaches. For supervised approaches, strongly labeled data is required. However, collecting large-scale strongly labeled data of audio events is challenging due to the diversity of audio event types and labeling difficulties. In this thesis, we propose data-efficient and weakly supervised techniques for AED. In the first approach, a data-efficient AED system is proposed. In the proposed system, data augmentation is performed to deal with the data sparsity problem and generate polyphonic event examples. An exemplar-based noise reduction algorithm is proposed for feature enhancement. For polyphonic event detection, a multi-labeled deep neural network (DNN) classifier is employed. An adaptive thresholding algorithm is applied as a post-processing method for robust event detection in noisy conditions. From the experimental results, the proposed algorithm has shown promising performance for AED on a low-resource dataset. In the second approach, a convolutional neural network (CNN)-based audio tagging system is proposed. The proposed model consists of a local detector and a global classifier. The local detector detects local audio words that contain distinct characteristics of events, and the global classifier summarizes the information to predict audio events on the recording. From the experimental results, we have found that the proposed model outperforms conventional artificial neural network models. In the final approach, we propose a weakly supervised AED model. The proposed model takes advantage of strengthening feature propagation from DenseNet and modeling channel-wise relationships by SENet. Also, the correlations among segments in audio recordings are represented by a recurrent neural network (RNN) and conditional random field (CRF). RNN utilizes contextual information and CRF post-processing helps to refine segment-level predictions. We evaluate our proposed method and compare its performance with a CNN based baseline approach. From a number of experiments, it has been shown that the proposed method is effective both on audio tagging and weakly supervised AED.์ผ๋ฐ˜์ ์ธ ์Œํ–ฅ ์ด๋ฒคํŠธ ํƒ์ง€ ์‹œ์Šคํ…œ์€ ๊ต์‚ฌํ•™์Šต์„ ํ†ตํ•ด ํ›ˆ๋ จ๋œ๋‹ค. ๊ต์‚ฌํ•™์Šต์„ ์œ„ํ•ด์„œ๋Š” ๊ฐ•ํ•œ ๋ ˆ์ด๋ธ” ๋ฐ์ดํ„ฐ๊ฐ€ ์š”๊ตฌ๋œ๋‹ค. ํ•˜์ง€๋งŒ ๊ฐ•ํ•œ ๋ ˆ์ด๋ธ” ๋ฐ์ดํ„ฐ๋Š” ์Œํ–ฅ ์ด๋ฒคํŠธ์˜ ๋‹ค์–‘์„ฑ ๋ฐ ๋ ˆ์ด๋ธ”์˜ ๋‚œ์ด๋„๋กœ ์ธํ•ด ํฐ ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค๋ฅผ ๊ตฌ์ถ•ํ•˜๊ธฐ ์–ด๋ ต๋‹ค๋Š” ๋ฌธ์ œ๊ฐ€ ์žˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์ด๋Ÿฌํ•œ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์Œํ–ฅ ์ด๋ฒคํŠธ ํƒ์ง€๋ฅผ ์œ„ํ•œ ๋ฐ์ดํ„ฐ ํšจ์œจ์  ํ™œ์šฉ ๋ฐ ์•ฝํ•œ ๊ต์‚ฌํ•™์Šต ๊ธฐ๋ฒ•์— ๋Œ€ํ•ด ์ œ์•ˆํ•œ๋‹ค. ์ฒซ ๋ฒˆ์งธ ์ ‘๊ทผ๋ฒ•์œผ๋กœ์„œ, ๋ฐ์ดํ„ฐ ํšจ์œจ์ ์ธ ์Œํ–ฅ ์ด๋ฒคํŠธ ํƒ์ง€ ์‹œ์Šคํ…œ์„ ์ œ์•ˆํ•œ๋‹ค. ์ œ์•ˆ๋œ ์‹œ์Šคํ…œ์—์„œ๋Š” ๋ฐ์ดํ„ฐ ์ฆ๋Œ€ ๊ธฐ๋ฒ•์„ ์‚ฌ์šฉํ•ด ๋ฐ์ดํ„ฐ ํฌ์†Œ์„ฑ ๋ฌธ์ œ์— ๋Œ€์‘ํ•˜๊ณ  ์ค‘์ฒฉ ์ด๋ฒคํŠธ ๋ฐ์ดํ„ฐ๋ฅผ ์ƒ์„ฑํ•˜์˜€๋‹ค. ํŠน์ง• ๋ฒกํ„ฐ ํ–ฅ์ƒ์„ ์œ„ํ•ด ์žก์Œ ์–ต์ œ ๊ธฐ๋ฒ•์ด ์‚ฌ์šฉ๋˜์—ˆ๊ณ  ์ค‘์ฒฉ ์Œํ–ฅ ์ด๋ฒคํŠธ ํƒ์ง€๋ฅผ ์œ„ํ•ด ๋‹ค์ค‘ ๋ ˆ์ด๋ธ” ์‹ฌ์ธต ์ธ๊ณต์‹ ๊ฒฝ๋ง(DNN) ๋ถ„๋ฅ˜๊ธฐ๊ฐ€ ์‚ฌ์šฉ๋˜์—ˆ๋‹ค. ์‹คํ—˜ ๊ฒฐ๊ณผ, ์ œ์•ˆ๋œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๋ถˆ์ถฉ๋ถ„ํ•œ ๋ฐ์ดํ„ฐ์—์„œ๋„ ์šฐ์ˆ˜ํ•œ ์Œํ–ฅ ์ด๋ฒคํŠธ ํƒ์ง€ ์„ฑ๋Šฅ์„ ๋‚˜ํƒ€๋‚ด์—ˆ๋‹ค. ๋‘ ๋ฒˆ์งธ ์ ‘๊ทผ๋ฒ•์œผ๋กœ์„œ, ์ปจ๋ณผ๋ฃจ์…˜ ์‹ ๊ฒฝ๋ง(CNN) ๊ธฐ๋ฐ˜ ์˜ค๋””์˜ค ํƒœ๊น… ์‹œ์Šคํ…œ์„ ์ œ์•ˆํ•œ๋‹ค. ์ œ์•ˆ๋œ ๋ชจ๋ธ์€ ๋กœ์ปฌ ๊ฒ€์ถœ๊ธฐ์™€ ๊ธ€๋กœ๋ฒŒ ๋ถ„๋ฅ˜๊ธฐ๋กœ ๊ตฌ์„ฑ๋œ๋‹ค. ๋กœ์ปฌ ๊ฒ€์ถœ๊ธฐ๋Š” ๊ณ ์œ ํ•œ ์Œํ–ฅ ์ด๋ฒคํŠธ ํŠน์„ฑ์„ ํฌํ•จํ•˜๋Š” ๋กœ์ปฌ ์˜ค๋””์˜ค ๋‹จ์–ด๋ฅผ ๊ฐ์ง€ํ•˜๊ณ  ๊ธ€๋กœ๋ฒŒ ๋ถ„๋ฅ˜๊ธฐ๋Š” ํƒ์ง€๋œ ์ •๋ณด๋ฅผ ์š”์•ฝํ•˜์—ฌ ์˜ค๋””์˜ค ์ด๋ฒคํŠธ๋ฅผ ์˜ˆ์ธกํ•œ๋‹ค. ์‹คํ—˜ ๊ฒฐ๊ณผ, ์ œ์•ˆ๋œ ๋ชจ๋ธ์ด ๊ธฐ์กด ์ธ๊ณต์‹ ๊ฒฝ๋ง ๊ธฐ๋ฒ•๋ณด๋‹ค ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ๋‚˜ํƒ€๋‚ด์—ˆ๋‹ค. ๋งˆ์ง€๋ง‰ ์ ‘๊ทผ๋ฒ•์œผ๋กœ์„œ, ์•ฝํ•œ ๊ต์‚ฌํ•™์Šต ์Œํ–ฅ ์ด๋ฒคํŠธ ํƒ์ง€ ๋ชจ๋ธ์„ ์ œ์•ˆํ•œ๋‹ค. ์ œ์•ˆ๋œ ๋ชจ๋ธ์€ DenseNet์˜ ๊ตฌ์กฐ๋ฅผ ํ™œ์šฉํ•˜์—ฌ ์ •๋ณด์˜ ์›ํ™œํ•œ ํ๋ฆ„์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๊ณ  SENet์„ ํ™œ์šฉํ•ด ์ฑ„๋„๊ฐ„์˜ ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ๋ชจ๋ธ๋ง ํ•œ๋‹ค. ๋˜ํ•œ, ์˜ค๋””์˜ค ์‹ ํ˜ธ์—์„œ ๋ถ€๋ถ„ ๊ฐ„์˜ ์ƒ๊ด€๊ด€๊ณ„ ์ •๋ณด๋ฅผ ์žฌ์ˆœํ™˜ ์‹ ๊ฒฝ๋ง(RNN) ๋ฐ ์กฐ๊ฑด๋ถ€ ๋ฌด์ž‘์œ„ ํ•„๋“œ(CRF)๋ฅผ ์‚ฌ์šฉํ•ด ํ™œ์šฉํ•˜์˜€๋‹ค. ์—ฌ๋Ÿฌ ์‹คํ—˜์„ ํ†ตํ•ด ์ œ์•ˆ๋œ ๋ชจ๋ธ์ด ๊ธฐ์กด CNN ๊ธฐ๋ฐ˜ ๊ธฐ๋ฒ•๋ณด๋‹ค ์˜ค๋””์˜ค ํƒœ๊น… ๋ฐ ์Œํ–ฅ ์ด๋ฒคํŠธ ํƒ์ง€ ๋ชจ๋‘์—์„œ ๋” ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋‚˜ํƒ€๋ƒ„์„ ๋ณด์˜€๋‹ค.1 Introduction 1 2 Audio Event Detection 5 2.1 Data-Ecient Audio Event Detection 6 2.2 Audio Tagging 7 2.3 Weakly Supervised Audio Event Detection 9 2.4 Metrics 10 3 Data-Ecient Techniques for Audio Event Detection 17 3.1 Introduction 17 3.2 DNN-Based AED system 18 3.2.1 Data Augmentation 20 3.2.2 Exemplar-Based Approach for Noise Reduction 21 3.2.3 DNN Classier 22 3.2.4 Post-Processing 23 3.3 Experiments 24 3.4 Summary 27 4 Audio Tagging using Local Detector and Global Classier 29 4.1 Introduction 29 4.2 CNN-Based Audio Tagging Model 31 4.2.1 Local Detector and Global Classier 32 4.2.2 Temporal Localization of Events 34 4.3 Experiments 34 4.3.1 Dataset and Feature 34 4.3.2 Model Training 35 4.3.3 Results 36 4.4 Summary 39 5 Deep Convolutional Neural Network with Structured Prediction for Weakly Supervised Audio Event Detection 41 5.1 Introduction 41 5.2 CNN with Structured Prediction for Weakly Supervised AED 46 5.2.1 DenseNet 47 5.2.2 Squeeze-and-Excitation 48 5.2.3 Global Pooling for Aggregation 49 5.2.4 Structured Prediction for Accurate Event Localization 50 5.3 Experiments 53 5.3.1 Dataset 53 5.3.2 Feature Extraction 54 5.3.3 DSNet and DSNet-RNN Structures 54 5.3.4 Baseline CNN Structure 56 5.3.5 Training and Evaluation 57 5.3.6 Metrics 57 5.3.7 Results and Discussion 58 5.3.8 Comparison with the DCASE 2017 task 4 Results 61 5.4 Summary 62 6 Conclusions 65 Bibliography 67 ์š” ์•ฝ 77 ๊ฐ์‚ฌ์˜ ๊ธ€ 79Docto

    Deep Learning for Audio Signal Processing

    Full text link
    Given the recent surge in developments of deep learning, this article provides a review of the state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side-by-side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross-fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, as well as more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e. audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified.Comment: 15 pages, 2 pdf figure

    A Closer Look at Weak Label Learning for Audio Events

    Get PDF
    โ€”Audio content analysis in terms of sound events is an important research problem for a variety of applications. Recently, the development of weak labeling approaches for audio or sound event detection (AED) and availability of large scale weakly labeled dataset have finally opened up the possibility of large scale AED. However, a deeper understanding of how weak labels affect the learning for sound events is still missing from literature. In this work, we first describe a CNN based approach for weakly supervised training of audio events. The approach follows some basic design principle desirable in a learning method relying on weakly labeled audio. We then describe important characteristics, which naturally arise in weakly supervised learning of sound events. We show how these aspects of weak labels affect the generalization of models. More specifically, we study how characteristics such as label density and corruption of labels affects weakly supervised training for audio events. We also study the feasibility of directly obtaining weak labeled data from the web without any manual label and compare it with a dataset which has been manually labeled. The analysis and understanding of these factors should be taken into picture in the development of future weak label learning methods. Audioset, a large scale weakly labeled dataset for sound events is used in our experiments
    • โ€ฆ
    corecore