Audio Event Detection using Weakly Labeled Data
Acoustic event detection is essential for content analysis and description of
multimedia recordings. The majority of current literature on the topic learns
the detectors through fully-supervised techniques employing strongly labeled
data. However, the labels available for the majority of multimedia data are
generally weak and do not provide sufficient detail for such methods to be
employed. In this paper we propose a framework for learning acoustic event
detectors using only weakly labeled data. We first show that audio event
detection using weak labels can be formulated as a multiple-instance learning
problem. We then suggest two frameworks for solving multiple-instance learning,
one based on support vector machines, and the other on neural networks. The
proposed methods can help in removing the time consuming and expensive process
of manually annotating data to facilitate fully supervised learning. Moreover,
they not only detect events in a recording but also provide the temporal
locations of events in the recording. This helps in obtaining a complete
description of the recording and is notable since temporal information was
never known in the first place in weakly labeled data.
Comment: ACM Multimedia 201
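The multiple-instance view described above can be illustrated with a minimal sketch. It assumes a recording (the bag) has been split into fixed-length segments (the instances) and that some scorer has already produced a probability per segment. The function names, the max-pooling rule, the 1-second segment length, and the 0.5 threshold are illustrative assumptions, not the SVM or neural-network formulations of the paper.

```python
import numpy as np

def bag_probability(instance_probs):
    """Bag-level score under the standard MIL assumption:
    a recording is positive if at least one segment is positive."""
    return float(np.max(instance_probs))

def localize(instance_probs, segment_seconds=1.0, threshold=0.5):
    """Return (start, end) times of contiguous runs of segments
    whose score reaches the threshold."""
    active = np.asarray(instance_probs) >= threshold
    events, start = [], None
    for i, on in enumerate(active):
        if on and start is None:
            start = i                      # a run of active segments begins
        elif not on and start is not None:
            events.append((start * segment_seconds, i * segment_seconds))
            start = None                   # the run ended at segment i
    if start is not None:                  # run extends to the end of the recording
        events.append((start * segment_seconds, len(active) * segment_seconds))
    return events

# Hypothetical per-segment scores for one event class.
probs = np.array([0.1, 0.2, 0.9, 0.8, 0.1, 0.7])
print(bag_probability(probs))
print(localize(probs))
```

Only the bag-level score needs a label during training, while the per-segment scores are what make temporal localization possible afterwards.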
Large-scale weakly supervised audio classification using gated convolutional neural network
In this paper, we present a gated convolutional neural network and a temporal
attention-based localization method for audio classification, which won the 1st
place in the large-scale weakly supervised sound event detection task of
Detection and Classification of Acoustic Scenes and Events (DCASE) 2017
challenge. The audio clips in this task, which are extracted from YouTube
videos, are manually labeled with one or a few audio tags but without
timestamps of the audio events; such data are referred to as weakly labeled. Two
sub-tasks are defined in this challenge: audio tagging and sound event
detection using this weakly labeled data. A convolutional recurrent neural
network (CRNN) with a learnable gated linear unit (GLU) non-linearity applied
to the log Mel spectrogram is proposed. In addition, a temporal attention
method is proposed along the frames to predict the locations of each audio
event in a chunk from the weakly labeled data. As a team, we ranked 1st and 2nd
in these two sub-tasks of the DCASE 2017 challenge, with an F-value of 55.6% and
an equal error of 0.73, respectively.
Comment: submitted to ICASSP2018, summary on the 1st place system in DCASE2017
task4 challenge
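The gating and temporal attention ideas in this abstract can be sketched roughly as follows. This is a simplified per-frame (dense) version for illustration only: the actual system is convolutional and recurrent, and all shapes, weight names, and the sigmoid-based attention normalization here are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glu(x, w_lin, w_gate):
    """Gated linear unit: a linear branch scaled element-wise
    by a sigmoid gate computed from the same input."""
    return (x @ w_lin) * sigmoid(x @ w_gate)

def attention_pool(frame_feats, w_cla, w_att):
    """Attention-weighted average of frame-level class probabilities.
    Returns clip-level probabilities and the per-frame attention weights."""
    cla = sigmoid(frame_feats @ w_cla)            # (frames, classes)
    att = sigmoid(frame_feats @ w_att)            # (frames, classes)
    att = att / att.sum(axis=0, keepdims=True)    # weights sum to 1 over time
    clip = (cla * att).sum(axis=0)                # (classes,)
    return clip, att

# Arbitrary sizes and random weights, purely for illustration.
rng = np.random.default_rng(0)
frames, n_mel, hidden, classes = 10, 16, 8, 3
x = rng.standard_normal((frames, n_mel))          # stand-in for a log Mel spectrogram
h = glu(x,
        rng.standard_normal((n_mel, hidden)) * 0.1,
        rng.standard_normal((n_mel, hidden)) * 0.1)
clip_probs, att = attention_pool(h,
                                 rng.standard_normal((hidden, classes)) * 0.1,
                                 rng.standard_normal((hidden, classes)) * 0.1)
```

In a sketch like this, `clip_probs` would serve the audio tagging sub-task, while frames with large attention weights indicate where each event is likely to occur.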
Surrey-cvssp system for DCASE2017 challenge task4
In this technical report, we present a set of methods for task 4 of the
Detection and Classification of Acoustic Scenes and Events 2017 (DCASE2017)
challenge. This task evaluates systems for the large-scale detection of sound
events using weakly labeled training data. The data are YouTube video excerpts
focusing on transportation and warnings due to their industry applications.
There are two tasks, audio tagging and sound event detection from weakly
labeled data. Convolutional neural network (CNN) and gated recurrent unit (GRU)
based recurrent neural network (RNN) are adopted as our basic framework. We
proposed a learnable gating activation function for selecting informative local
features. An attention-based scheme is used for localizing the specific events in
a weakly supervised mode. A new batch-level balancing strategy is also proposed
to tackle the data imbalance problem. Fusion of posteriors from different
systems is found effective in improving performance. In summary, we obtain
61% F-value for the audio tagging subtask and 0.73 error rate (ER) for the
sound event detection subtask on the development set, whereas the official
multilayer perceptron (MLP) based baseline obtained only 13.1% F-value for
audio tagging and an ER of 1.02 for sound event detection.
Comment: DCASE2017 challenge ranked 1st system, task4, tech report
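The F-value and error rate quoted above can be made concrete with a small sketch of the commonly used segment-based definitions, in which substitutions, deletions, and insertions are counted per segment. Whether the reported numbers were computed exactly this way is an assumption; the function name and the toy data are hypothetical.

```python
import numpy as np

def segment_metrics(reference, estimate):
    """Segment-based F-score and error rate (ER).
    reference, estimate: binary arrays of shape (segments, classes).
    Assumes the reference contains at least one active event."""
    ref = np.asarray(reference, dtype=bool)
    est = np.asarray(estimate, dtype=bool)
    tp = np.sum(ref & est, axis=1)         # correct detections per segment
    fp = np.sum(~ref & est, axis=1)        # false alarms per segment
    fn = np.sum(ref & ~est, axis=1)        # misses per segment
    subs = np.minimum(fp, fn)              # a miss paired with a false alarm
    dels = np.maximum(0, fn - fp)          # remaining misses
    ins = np.maximum(0, fp - fn)           # remaining false alarms
    er = (subs.sum() + dels.sum() + ins.sum()) / ref.sum()
    f = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
    return float(f), float(er)

# Toy example: 3 segments, 2 classes.
reference = [[1, 0], [1, 0], [0, 0]]
estimate = [[1, 0], [0, 1], [1, 1]]
f_score, error_rate = segment_metrics(reference, estimate)
print(f_score, error_rate)
```

Because insertions count against a system, ER is not bounded by 1, which is how a baseline can score an ER above 1 as in the figure quoted above.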
Knowledge Transfer from Weakly Labeled Audio using Convolutional Neural Network for Sound Events and Scenes
In this work we propose approaches to effectively transfer knowledge from
weakly labeled web audio data. We first describe a convolutional neural network
(CNN) based framework for sound event detection and classification using weakly
labeled audio data. Our model trains efficiently from audio of variable
lengths; hence, it is well suited for transfer learning. We then propose
methods to learn representations using this model which can be effectively used
for solving the target task. We study both transductive and inductive transfer
learning tasks, showing the effectiveness of our methods for both domain and
task adaptation. We show that the representations learned using the proposed
CNN model generalize well enough to reach human-level accuracy on the ESC-50
sound events dataset and set state-of-the-art results on this dataset. We
further use them for the acoustic scene classification task and once again show
that our proposed approaches are well suited to this task. We also show that our
methods are helpful in capturing semantic meanings and relations.
Moreover, in this process we also set state-of-the-art results on the Audioset
dataset, relying on the balanced training set.
Comment: ICASSP 201
Efficient Data Utilization and Weakly Supervised Learning Techniques for Audio Event Detection
Doctoral thesis, Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, 2020. 2. Nam Soo Kim.
Conventional audio event detection (AED) models are based on supervised approaches. For supervised approaches, strongly labeled data is required. However, collecting large-scale strongly labeled data of audio events is challenging due to the diversity of audio event types and labeling difficulties. In this thesis, we propose data-efficient and weakly supervised techniques for AED.
In the first approach, a data-efficient AED system is proposed. In the proposed system, data augmentation is performed to deal with the data sparsity problem and generate polyphonic event examples. An exemplar-based noise reduction algorithm is proposed for feature enhancement. For polyphonic event detection, a multi-labeled deep neural network (DNN) classifier is employed. An adaptive thresholding algorithm is applied as a post-processing method for robust event detection in noisy conditions. From the experimental results, the proposed algorithm has shown promising performance for AED on a low-resource dataset.
In the second approach, a convolutional neural network (CNN)-based audio tagging system is proposed. The proposed model consists of a local detector and a global classifier. The local detector detects local audio words that contain distinct characteristics of events, and the global classifier summarizes the information to predict audio events on the recording. From the experimental results, we have found that the proposed model outperforms conventional artificial neural network models.
In the final approach, we propose a weakly supervised AED model. The proposed model takes advantage of strengthening feature propagation from DenseNet and modeling channel-wise relationships by SENet. Also, the correlations among segments in audio recordings are represented by a recurrent neural network (RNN) and conditional random field (CRF). RNN utilizes contextual information and CRF post-processing helps to refine segment-level predictions. We evaluate our proposed method and compare its performance with a CNN-based baseline approach. From a number of experiments, it has been shown that the proposed method is effective both on audio tagging and weakly supervised AED.
1 Introduction
2 Audio Event Detection
2.1 Data-Efficient Audio Event Detection
2.2 Audio Tagging
2.3 Weakly Supervised Audio Event Detection
2.4 Metrics
3 Data-Efficient Techniques for Audio Event Detection
3.1 Introduction
3.2 DNN-Based AED System
3.2.1 Data Augmentation
3.2.2 Exemplar-Based Approach for Noise Reduction
3.2.3 DNN Classifier
3.2.4 Post-Processing
3.3 Experiments
3.4 Summary
4 Audio Tagging using Local Detector and Global Classifier
4.1 Introduction
4.2 CNN-Based Audio Tagging Model
4.2.1 Local Detector and Global Classifier
4.2.2 Temporal Localization of Events
4.3 Experiments
4.3.1 Dataset and Feature
4.3.2 Model Training
4.3.3 Results
4.4 Summary
5 Deep Convolutional Neural Network with Structured Prediction for Weakly Supervised Audio Event Detection
5.1 Introduction
5.2 CNN with Structured Prediction for Weakly Supervised AED
5.2.1 DenseNet
5.2.2 Squeeze-and-Excitation
5.2.3 Global Pooling for Aggregation
5.2.4 Structured Prediction for Accurate Event Localization
5.3 Experiments
5.3.1 Dataset
5.3.2 Feature Extraction
5.3.3 DSNet and DSNet-RNN Structures
5.3.4 Baseline CNN Structure
5.3.5 Training and Evaluation
5.3.6 Metrics
5.3.7 Results and Discussion
5.3.8 Comparison with the DCASE 2017 task 4 Results
5.4 Summary
6 Conclusions
Bibliography
Abstract (in Korean)
Acknowledgements