728 research outputs found

    Sample Mixed-Based Data Augmentation for Domestic Audio Tagging

    Audio tagging has attracted increasing attention over the last decade and has potential applications in many fields. The objective of audio tagging is to predict the labels of an audio clip. Recently, deep learning methods have been applied to audio tagging and have achieved state-of-the-art performance. However, due to the limited size of audio tagging datasets such as the DCASE data, the trained models tend to overfit, which leads to poor generalization on new data. Previous data augmentation methods such as pitch shifting, time stretching and adding background noise do not show much improvement in audio tagging. In this paper, we explore sample-mixed data augmentation for the domestic audio tagging task, including mixup, SamplePairing and extrapolation. We apply a convolutional recurrent neural network (CRNN) with an attention module and log-scaled mel spectrum features as a baseline system. In our experiments, we achieve a state-of-the-art equal error rate (EER) of 0.10 on the DCASE 2016 Task 4 dataset with the mixup approach, outperforming the baseline system without data augmentation. Comment: submitted to the workshop of Detection and Classification of Acoustic Scenes and Events 2018 (DCASE 2018), 19-20 November 2018, Surrey, U
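
    The mixup augmentation named in this abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `mixup_batch`, the default `alpha=0.2`, and the choice to pair each clip with a random partner from the same batch are all assumptions made for the example.

```python
import numpy as np

def mixup_batch(features, labels, alpha=0.2, rng=None):
    """Blend each example with a randomly chosen partner from the same batch.

    features: array of shape (batch, ...), e.g. log-mel spectrograms.
    labels:   array of shape (batch, n_classes) with multi-hot targets.
    alpha:    Beta(alpha, alpha) parameter (hypothetical default).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = features.shape[0]
    lam = rng.beta(alpha, alpha, size=n)       # one mixing weight per example
    perm = rng.permutation(n)                  # partner index for each example
    # Reshape lam so it broadcasts over all non-batch feature axes.
    lam_x = lam.reshape((n,) + (1,) * (features.ndim - 1))
    mixed_x = lam_x * features + (1.0 - lam_x) * features[perm]
    lam_y = lam.reshape(n, 1)
    mixed_y = lam_y * labels + (1.0 - lam_y) * labels[perm]  # soft targets
    return mixed_x, mixed_y, lam
```

    The mixed targets are soft multi-hot vectors, so a loss that accepts real-valued targets (for example binary cross-entropy) would be needed downstream.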

    DCASE 2019 Task 2: Multitask Learning, Semi-supervised Learning and Model Ensemble with Noisy Data for Audio Tagging

    This paper describes our approach to the DCASE 2019 challenge Task 2: Audio tagging with noisy labels and minimal supervision. This task is a multi-label audio classification task with 80 classes. The training data is composed of a small amount of reliably labeled data (curated data) and a larger amount of data with unreliable labels (noisy data). Additionally, the data distributions of the curated data and the noisy data differ. To tackle this difficulty, we propose three strategies. The first is multitask learning using noisy data. The second is semi-supervised learning using noisy data and labels that are relabeled using trained models' predictions. The third is an ensemble method that averages models trained with different time lengths. By using these methods, our solution was ranked in 3rd place on the public leaderboard (LB) with a label-weighted label-ranking average precision (lwlrap) score of 0.750 and ranked in 4th place on the private LB with a lwlrap score of 0.75787. The code of our solution is available at https://github.com/OsciiArt/Freesound-Audio-Tagging-2019.
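
    The lwlrap metric quoted in this abstract can be illustrated with a short sketch. This is a plain NumPy version written for this listing, not the challenge's official scoring code. It assumes that lwlrap is the mean, over every (clip, true label) pair, of the precision at that label's rank, and the function name `lwlrap` is chosen for the example.

```python
import numpy as np

def lwlrap(truth, scores):
    """Label-weighted label-ranking average precision.

    truth:  (n_clips, n_classes) array, nonzero where a label is present.
    scores: (n_clips, n_classes) array of predicted scores.
    Ties in scores are broken by class index order (an arbitrary choice).
    """
    truth = np.asarray(truth) > 0
    scores = np.asarray(scores, dtype=float)
    precisions = []
    for t, s in zip(truth, scores):
        positives = np.flatnonzero(t)
        if positives.size == 0:
            continue                       # clips without labels contribute nothing
        order = np.argsort(-s, kind="stable")
        ranks = np.empty_like(order)
        ranks[order] = np.arange(1, len(s) + 1)   # rank 1 = highest score
        hits = np.cumsum(t[order])         # true labels at or above each rank
        for c in positives:
            precisions.append(hits[ranks[c] - 1] / ranks[c])
    return float(np.mean(precisions)) if precisions else 0.0
```

    Because every true label contributes one precision value, classes with more labeled clips carry proportionally more weight in the final score.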

    ์Œํ–ฅ ์ด๋ฒคํŠธ ํƒ์ง€๋ฅผ ์œ„ํ•œ ํšจ์œจ์  ๋ฐ์ดํ„ฐ ํ™œ์šฉ ๋ฐ ์•ฝํ•œ ๊ต์‚ฌํ•™์Šต ๊ธฐ๋ฒ•

    Doctoral thesis, Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, February 2020. Nam Soo Kim.
    Conventional audio event detection (AED) models are based on supervised approaches, which require strongly labeled data. However, collecting large-scale strongly labeled data of audio events is challenging due to the diversity of audio event types and the difficulty of labeling. In this thesis, we propose data-efficient and weakly supervised techniques for AED. In the first approach, a data-efficient AED system is proposed. In the proposed system, data augmentation is performed to deal with the data sparsity problem and to generate polyphonic event examples. An exemplar-based noise reduction algorithm is proposed for feature enhancement. For polyphonic event detection, a multi-labeled deep neural network (DNN) classifier is employed. An adaptive thresholding algorithm is applied as a post-processing method for robust event detection in noisy conditions. In the experiments, the proposed algorithm showed promising performance for AED on a low-resource dataset. In the second approach, a convolutional neural network (CNN)-based audio tagging system is proposed. The proposed model consists of a local detector and a global classifier. The local detector detects local audio words that contain distinct characteristics of events, and the global classifier summarizes the information to predict audio events in the recording. In the experiments, the proposed model outperformed conventional artificial neural network models. In the final approach, we propose a weakly supervised AED model. The proposed model takes advantage of the strengthened feature propagation of DenseNet and the channel-wise relationship modeling of SENet. In addition, the correlations among segments in audio recordings are represented by a recurrent neural network (RNN) and a conditional random field (CRF). The RNN utilizes contextual information, and CRF post-processing helps to refine segment-level predictions. We evaluate the proposed method and compare its performance with a CNN-based baseline approach. A number of experiments show that the proposed method is effective on both audio tagging and weakly supervised AED.
    Contents:
    1 Introduction
    2 Audio Event Detection: 2.1 Data-Efficient Audio Event Detection; 2.2 Audio Tagging; 2.3 Weakly Supervised Audio Event Detection; 2.4 Metrics
    3 Data-Efficient Techniques for Audio Event Detection: 3.1 Introduction; 3.2 DNN-Based AED System (3.2.1 Data Augmentation; 3.2.2 Exemplar-Based Approach for Noise Reduction; 3.2.3 DNN Classifier; 3.2.4 Post-Processing); 3.3 Experiments; 3.4 Summary
    4 Audio Tagging using Local Detector and Global Classifier: 4.1 Introduction; 4.2 CNN-Based Audio Tagging Model (4.2.1 Local Detector and Global Classifier; 4.2.2 Temporal Localization of Events); 4.3 Experiments (4.3.1 Dataset and Feature; 4.3.2 Model Training; 4.3.3 Results); 4.4 Summary
    5 Deep Convolutional Neural Network with Structured Prediction for Weakly Supervised Audio Event Detection: 5.1 Introduction; 5.2 CNN with Structured Prediction for Weakly Supervised AED (5.2.1 DenseNet; 5.2.2 Squeeze-and-Excitation; 5.2.3 Global Pooling for Aggregation; 5.2.4 Structured Prediction for Accurate Event Localization); 5.3 Experiments (5.3.1 Dataset; 5.3.2 Feature Extraction; 5.3.3 DSNet and DSNet-RNN Structures; 5.3.4 Baseline CNN Structure; 5.3.5 Training and Evaluation; 5.3.6 Metrics; 5.3.7 Results and Discussion; 5.3.8 Comparison with the DCASE 2017 Task 4 Results); 5.4 Summary
    6 Conclusions
    Bibliography
    Abstract (in Korean)
    Acknowledgments
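
    The thesis outline lists global pooling as the step that aggregates segment-level predictions into a clip-level tag, which is what allows training from weak (clip-level) labels. The sketch below shows three common pooling choices in NumPy. It is an illustration only: the listing does not say which pooling function the thesis uses, and the function name `pool_segments` and the mode names are assumptions made for the example.

```python
import numpy as np

def pool_segments(segment_probs, mode="max"):
    """Aggregate per-segment class probabilities into clip-level probabilities.

    segment_probs: (n_segments, n_classes) array with values in [0, 1].
    mode: "max", "mean", or "linear_softmax" (hypothetical option names).
    """
    p = np.asarray(segment_probs, dtype=float)
    if mode == "max":
        return p.max(axis=0)               # clip is positive if any segment is
    if mode == "mean":
        return p.mean(axis=0)              # every segment counts equally
    if mode == "linear_softmax":
        # Each segment is weighted by its own probability.
        return (p * p).sum(axis=0) / np.maximum(p.sum(axis=0), 1e-8)
    raise ValueError(f"unknown pooling mode: {mode}")
```

    Max pooling tends to favor short events, mean pooling favors events that fill the clip, and the probability-weighted variant sits between the two.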

    Data augmentation for speech separation
