4,767 research outputs found

    Audio Event Detection using Weakly Labeled Data

    Full text link
    Acoustic event detection is essential for content analysis and description of multimedia recordings. The majority of the current literature on the topic learns detectors through fully supervised techniques that rely on strongly labeled data. However, the labels available for most multimedia data are weak and do not provide sufficient detail for such methods. In this paper we propose a framework for learning acoustic event detectors using only weakly labeled data. We first show that audio event detection with weak labels can be formulated as a Multiple Instance Learning (MIL) problem. We then suggest two frameworks for solving the MIL problem, one based on support vector machines and the other on neural networks. The proposed methods can help remove the time-consuming and expensive process of manually annotating data that fully supervised learning requires. Moreover, they not only detect events in a recording but also provide the temporal locations of those events, which yields a complete description of the recording and is notable because weakly labeled data carry no temporal information to begin with.
    Comment: ACM Multimedia 201
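    The MIL-plus-neural-network formulation above is compact enough to sketch. Below is a minimal PyTorch illustration, not the paper's exact architecture: a recording is a bag of segment features, a small network scores every segment, and max-pooling over segments yields a bag-level prediction trainable from weak labels alone. All dimensions and names are assumptions.

```python
import torch
import torch.nn as nn

class MILEventDetector(nn.Module):
    """Scores each segment (instance) of a recording (bag), then
    max-pools over segments so training needs only weak, recording-level
    labels, while per-segment scores still localize events in time."""
    def __init__(self, feat_dim=64, n_events=10):
        super().__init__()
        self.instance_scorer = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, n_events), nn.Sigmoid(),
        )

    def forward(self, bag):                        # bag: (n_segments, feat_dim)
        inst_probs = self.instance_scorer(bag)     # per-segment event probabilities
        bag_probs, _ = inst_probs.max(dim=0)       # MIL max-pooling to recording level
        return bag_probs, inst_probs

# Training touches only the weak label; segment-level scores come for free.
model = MILEventDetector()
bag = torch.randn(20, 64)                          # e.g. 20 segments of audio features
weak_label = torch.zeros(10); weak_label[3] = 1.0  # "event 3 occurs somewhere"
bag_probs, inst_probs = model(bag)
loss = nn.functional.binary_cross_entropy(bag_probs, weak_label)
```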

    Large-scale weakly supervised audio classification using gated convolutional neural network

    Get PDF
    In this paper, we present a gated convolutional neural network and a temporal attention-based localization method for audio classification, which won 1st place in the large-scale weakly supervised sound event detection task of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 challenge. The audio clips in this task, extracted from YouTube videos, are manually labeled with one or a few audio tags but without timestamps of the audio events; such data are known as weakly labeled data. The challenge defines two sub-tasks on this data: audio tagging and sound event detection. We propose a convolutional recurrent neural network (CRNN) with a learnable gated linear unit (GLU) non-linearity applied to the log mel spectrogram. In addition, we propose a temporal attention method along the frames to predict the location of each audio event in a chunk from the weakly labeled data. As a team, we ranked 1st and 2nd in these two sub-tasks of the DCASE 2017 challenge, with an F-value of 55.6% and an equal error of 0.73, respectively.
    Comment: submitted to ICASSP2018, summary of the 1st-place system in the DCASE2017 task 4 challenge
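    The two components named above, the GLU non-linearity and frame-wise temporal attention, can be sketched briefly. This is a hedged PyTorch illustration of the general idea rather than the authors' exact network; layer shapes and names are assumptions.

```python
import torch
import torch.nn as nn

class GatedConvBlock(nn.Module):
    """Gated linear unit on conv features: one branch is linear, the other
    a sigmoid gate that selects informative time-frequency regions of the
    log mel spectrogram."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.feat = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):                          # x: (batch, ch, time, mel)
        return self.feat(x) * torch.sigmoid(self.gate(x))

class TemporalAttentionPool(nn.Module):
    """Frame-level classification weighted by frame-level attention; the
    attention weights double as a localization of each event in time."""
    def __init__(self, feat_dim, n_events):
        super().__init__()
        self.cla = nn.Linear(feat_dim, n_events)
        self.att = nn.Linear(feat_dim, n_events)

    def forward(self, h):                          # h: (batch, time, feat_dim)
        cla = torch.sigmoid(self.cla(h))           # per-frame event probabilities
        att = torch.softmax(self.att(h), dim=1)    # attention over frames
        return (cla * att).sum(dim=1), cla         # clip-level probs, frame-level probs
```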

    Surrey-cvssp system for DCASE2017 challenge task4

    Get PDF
    In this technical report, we present a set of methods for task 4 of the Detection and Classification of Acoustic Scenes and Events 2017 (DCASE2017) challenge. This task evaluates systems for the large-scale detection of sound events using weakly labeled training data. The data are YouTube video excerpts focusing on transportation and warnings, chosen for their industrial applications. There are two sub-tasks: audio tagging and sound event detection from weakly labeled data. A convolutional neural network (CNN) and a gated recurrent unit (GRU)-based recurrent neural network (RNN) are adopted as our basic framework. We propose a learnable gating activation function for selecting informative local features. An attention-based scheme is used for localizing specific events in a weakly supervised mode. A new batch-level balancing strategy is also proposed to tackle the data imbalance problem. Fusion of posteriors from different systems is found effective for improving performance. In summary, we obtain a 61% F-value for the audio tagging subtask and a 0.73 error rate (ER) for the sound event detection subtask on the development set, whereas the official multilayer perceptron (MLP)-based baseline obtained only a 13.1% F-value for audio tagging and a 1.02 ER for sound event detection.
    Comment: DCASE2017 challenge ranked 1st system, task4, tech report
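    Of the methods listed, the batch-level balancing strategy is the easiest to make concrete. The sketch below is one plausible reading, assuming hashable (e.g. integer) class labels and a round-robin draw per minibatch; the report's actual strategy may differ in detail, and real audio tagging data is multi-label.

```python
import random
from collections import defaultdict

def balanced_batches(samples, batch_size):
    """Yield minibatches that draw examples class-by-class in round-robin
    order, so rare event classes appear in every batch instead of being
    swamped by frequent ones. `samples` is a list of (features, label)
    pairs; names are illustrative."""
    by_class = defaultdict(list)
    for feats, label in samples:
        by_class[label].append((feats, label))
    classes = list(by_class)
    while True:                                      # endless stream of batches
        batch = []
        for _ in range(batch_size):
            c = classes[len(batch) % len(classes)]   # round-robin over classes
            batch.append(random.choice(by_class[c])) # uniform draw within class
        yield batch
```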

    Knowledge Transfer from Weakly Labeled Audio using Convolutional Neural Network for Sound Events and Scenes

    Full text link
    In this work we propose approaches to effectively transfer knowledge from weakly labeled web audio data. We first describe a convolutional neural network (CNN)-based framework for sound event detection and classification using weakly labeled audio data. Our model trains efficiently on audio of variable length and is therefore well suited for transfer learning. We then propose methods to learn representations with this model that can be used effectively for the target task. We study both transductive and inductive transfer learning, showing the effectiveness of our methods for both domain and task adaptation. The representations learned with the proposed CNN model generalize well enough to reach human-level accuracy on the ESC-50 sound events dataset and set state-of-the-art results on it. We further apply them to the acoustic scene classification task and again show that our approaches suit it well. We also show that our methods help capture semantic meanings and relations. Moreover, in the process we set state-of-the-art results on the AudioSet dataset, relying on the balanced training set.
    Comment: ICASSP 201
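    The transfer recipe the abstract describes, pretrain on weakly labeled web audio and then reuse the representation, might look like the following PyTorch sketch. The tiny architecture, layer sizes, and target head are illustrative assumptions, not the paper's model.

```python
import torch
import torch.nn as nn

class AudioCNN(nn.Module):
    """A small CNN whose conv stack handles variable-length log mel input;
    global pooling turns a clip of any length into a fixed-size embedding."""
    def __init__(self, n_src_events=527):          # 527 = AudioSet label count
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(64, n_src_events)    # used only during weak pretraining

    def embed(self, logmel):                       # logmel: (batch, 1, time, mel)
        return self.conv(logmel).mean(dim=(2, 3))  # pool over time/freq -> (batch, 64)

# Inductive transfer: freeze the pretrained body, fit a light target-task head.
cnn = AudioCNN()                                   # assume pretrained weights are loaded
target_head = nn.Linear(64, 50)                    # e.g. 50 classes for ESC-50
clip = torch.randn(1, 1, 431, 64)                  # any clip length works
with torch.no_grad():
    emb = cnn.embed(clip)
logits = target_head(emb)
```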

    ์Œํ–ฅ ์ด๋ฒคํŠธ ํƒ์ง€๋ฅผ ์œ„ํ•œ ํšจ์œจ์  ๋ฐ์ดํ„ฐ ํ™œ์šฉ ๋ฐ ์•ฝํ•œ ๊ต์‚ฌํ•™์Šต ๊ธฐ๋ฒ•

    Get PDF
    Doctoral dissertation, Seoul National University Graduate School, Department of Electrical and Computer Engineering, College of Engineering, February 2020. Advisor: Nam Soo Kim.
    Conventional audio event detection (AED) models are based on supervised approaches, which require strongly labeled data. However, collecting large-scale, strongly labeled data for audio events is challenging because of the diversity of audio event types and the difficulty of labeling. In this thesis, we propose data-efficient and weakly supervised techniques for AED. In the first approach, a data-efficient AED system is proposed: data augmentation deals with the data sparsity problem and generates polyphonic event examples, an exemplar-based noise reduction algorithm enhances the features, a multi-label deep neural network (DNN) classifier performs polyphonic event detection, and an adaptive thresholding algorithm serves as post-processing for robust detection in noisy conditions. Experimental results show that the proposed algorithm performs promisingly on a low-resource dataset. In the second approach, a convolutional neural network (CNN)-based audio tagging system is proposed. The model consists of a local detector and a global classifier: the local detector detects local audio words that carry distinct event characteristics, and the global classifier summarizes this information to predict the audio events in the recording. Experiments show that the proposed model outperforms conventional artificial neural network models. In the final approach, we propose a weakly supervised AED model that takes advantage of the strengthened feature propagation of DenseNet and the channel-wise relationship modeling of SENet. Correlations among segments of an audio recording are modeled by a recurrent neural network (RNN) and a conditional random field (CRF): the RNN exploits contextual information, and CRF post-processing refines the segment-level predictions. We evaluate the proposed method against a CNN-based baseline, and a number of experiments show that it is effective for both audio tagging and weakly supervised AED.
    Table of contents:
    1 Introduction
    2 Audio Event Detection
      2.1 Data-Efficient Audio Event Detection
      2.2 Audio Tagging
      2.3 Weakly Supervised Audio Event Detection
      2.4 Metrics
    3 Data-Efficient Techniques for Audio Event Detection
      3.1 Introduction
      3.2 DNN-Based AED System
        3.2.1 Data Augmentation
        3.2.2 Exemplar-Based Approach for Noise Reduction
        3.2.3 DNN Classifier
        3.2.4 Post-Processing
      3.3 Experiments
      3.4 Summary
    4 Audio Tagging using Local Detector and Global Classifier
      4.1 Introduction
      4.2 CNN-Based Audio Tagging Model
        4.2.1 Local Detector and Global Classifier
        4.2.2 Temporal Localization of Events
      4.3 Experiments
        4.3.1 Dataset and Feature
        4.3.2 Model Training
        4.3.3 Results
      4.4 Summary
    5 Deep Convolutional Neural Network with Structured Prediction for Weakly Supervised Audio Event Detection
      5.1 Introduction
      5.2 CNN with Structured Prediction for Weakly Supervised AED
        5.2.1 DenseNet
        5.2.2 Squeeze-and-Excitation
        5.2.3 Global Pooling for Aggregation
        5.2.4 Structured Prediction for Accurate Event Localization
      5.3 Experiments
        5.3.1 Dataset
        5.3.2 Feature Extraction
        5.3.3 DSNet and DSNet-RNN Structures
        5.3.4 Baseline CNN Structure
        5.3.5 Training and Evaluation
        5.3.6 Metrics
        5.3.7 Results and Discussion
        5.3.8 Comparison with the DCASE 2017 Task 4 Results
      5.4 Summary
    6 Conclusions
    Bibliography
    Abstract (in Korean)
    Acknowledgments
    • โ€ฆ
    corecore