4,767 research outputs found

    Audio Event Detection using Weakly Labeled Data

    Full text link
    Acoustic event detection is essential for content analysis and description of multimedia recordings. The majority of the current literature on the topic learns detectors through fully supervised techniques that rely on strongly labeled data. However, the labels available for most multimedia data are weak and do not provide sufficient detail for such methods. In this paper we propose a framework for learning acoustic event detectors using only weakly labeled data. We first show that audio event detection with weak labels can be formulated as a Multiple Instance Learning (MIL) problem. We then suggest two frameworks for solving the MIL problem, one based on support vector machines and the other on neural networks. The proposed methods can help remove the time-consuming and expensive process of manually annotating data that fully supervised learning requires. Moreover, they not only detect events in a recording but also provide the temporal locations of those events, which yields a complete description of the recording and is notable because weakly labeled data carry no temporal information to begin with.
    Comment: ACM Multimedia 201
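    The MIL-plus-neural-network formulation above is compact enough to sketch. Below is a minimal PyTorch illustration, not the paper's exact architecture: a recording is a bag of segment features, a small network scores every segment, and max-pooling over segments yields a bag-level prediction trainable from weak labels alone. All dimensions and names are assumptions.

```python
import torch
import torch.nn as nn

class MILEventDetector(nn.Module):
    """Scores each segment (instance) of a recording (bag), then
    max-pools over segments so training needs only weak, recording-level
    labels, while per-segment scores still localize events in time."""
    def __init__(self, feat_dim=64, n_events=10):
        super().__init__()
        self.instance_scorer = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, n_events), nn.Sigmoid(),
        )

    def forward(self, bag):                        # bag: (n_segments, feat_dim)
        inst_probs = self.instance_scorer(bag)     # per-segment event probabilities
        bag_probs, _ = inst_probs.max(dim=0)       # MIL max-pooling to recording level
        return bag_probs, inst_probs

# Training touches only the weak label; segment-level scores come for free.
model = MILEventDetector()
bag = torch.randn(20, 64)                          # e.g. 20 segments of audio features
weak_label = torch.zeros(10); weak_label[3] = 1.0  # "event 3 occurs somewhere"
bag_probs, inst_probs = model(bag)
loss = nn.functional.binary_cross_entropy(bag_probs, weak_label)
```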

    Large-scale weakly supervised audio classification using gated convolutional neural network

    Get PDF
    In this paper, we present a gated convolutional neural network and a temporal attention-based localization method for audio classification, which won 1st place in the large-scale weakly supervised sound event detection task of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 challenge. The audio clips in this task, extracted from YouTube videos, are manually labeled with one or a few audio tags but without timestamps of the audio events; such data are known as weakly labeled data. The challenge defines two sub-tasks on this data: audio tagging and sound event detection. We propose a convolutional recurrent neural network (CRNN) with a learnable gated linear unit (GLU) non-linearity applied to the log mel spectrogram. In addition, we propose a temporal attention method along the frames to predict the location of each audio event in a chunk from the weakly labeled data. As a team, we ranked 1st and 2nd in these two sub-tasks of the DCASE 2017 challenge, with an F-value of 55.6% and an equal error of 0.73, respectively.
    Comment: submitted to ICASSP2018, summary of the 1st-place system in the DCASE2017 task 4 challenge
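    The two components named above, the GLU non-linearity and frame-wise temporal attention, can be sketched briefly. This is a hedged PyTorch illustration of the general idea rather than the authors' exact network; layer shapes and names are assumptions.

```python
import torch
import torch.nn as nn

class GatedConvBlock(nn.Module):
    """Gated linear unit on conv features: one branch is linear, the other
    a sigmoid gate that selects informative time-frequency regions of the
    log mel spectrogram."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.feat = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):                          # x: (batch, ch, time, mel)
        return self.feat(x) * torch.sigmoid(self.gate(x))

class TemporalAttentionPool(nn.Module):
    """Frame-level classification weighted by frame-level attention; the
    attention weights double as a localization of each event in time."""
    def __init__(self, feat_dim, n_events):
        super().__init__()
        self.cla = nn.Linear(feat_dim, n_events)
        self.att = nn.Linear(feat_dim, n_events)

    def forward(self, h):                          # h: (batch, time, feat_dim)
        cla = torch.sigmoid(self.cla(h))           # per-frame event probabilities
        att = torch.softmax(self.att(h), dim=1)    # attention over frames
        return (cla * att).sum(dim=1), cla         # clip-level probs, frame-level probs
```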

    Surrey-cvssp system for DCASE2017 challenge task4

    Get PDF
    In this technical report, we present a set of methods for task 4 of the Detection and Classification of Acoustic Scenes and Events 2017 (DCASE2017) challenge. This task evaluates systems for the large-scale detection of sound events using weakly labeled training data. The data are YouTube video excerpts focusing on transportation and warnings, chosen for their industrial applications. There are two sub-tasks: audio tagging and sound event detection from weakly labeled data. A convolutional neural network (CNN) and a gated recurrent unit (GRU)-based recurrent neural network (RNN) are adopted as our basic framework. We propose a learnable gating activation function for selecting informative local features. An attention-based scheme is used for localizing specific events in a weakly supervised mode. A new batch-level balancing strategy is also proposed to tackle the data imbalance problem. Fusion of posteriors from different systems is found effective for improving performance. In summary, we obtain a 61% F-value for the audio tagging subtask and a 0.73 error rate (ER) for the sound event detection subtask on the development set, whereas the official multilayer perceptron (MLP)-based baseline obtained only a 13.1% F-value for audio tagging and a 1.02 ER for sound event detection.
    Comment: DCASE2017 challenge ranked 1st system, task4, tech report
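    Of the methods listed, the batch-level balancing strategy is the easiest to make concrete. The sketch below is one plausible reading, assuming hashable (e.g. integer) class labels and a round-robin draw per minibatch; the report's actual strategy may differ in detail, and real audio tagging data is multi-label.

```python
import random
from collections import defaultdict

def balanced_batches(samples, batch_size):
    """Yield minibatches that draw examples class-by-class in round-robin
    order, so rare event classes appear in every batch instead of being
    swamped by frequent ones. `samples` is a list of (features, label)
    pairs; names are illustrative."""
    by_class = defaultdict(list)
    for feats, label in samples:
        by_class[label].append((feats, label))
    classes = list(by_class)
    while True:                                      # endless stream of batches
        batch = []
        for _ in range(batch_size):
            c = classes[len(batch) % len(classes)]   # round-robin over classes
            batch.append(random.choice(by_class[c])) # uniform draw within class
        yield batch
```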

    Knowledge Transfer from Weakly Labeled Audio using Convolutional Neural Network for Sound Events and Scenes

    Full text link
    In this work we propose approaches to effectively transfer knowledge from weakly labeled web audio data. We first describe a convolutional neural network (CNN)-based framework for sound event detection and classification using weakly labeled audio data. Our model trains efficiently on audio of variable length and is therefore well suited for transfer learning. We then propose methods to learn representations with this model that can be used effectively for the target task. We study both transductive and inductive transfer learning, showing the effectiveness of our methods for both domain and task adaptation. The representations learned with the proposed CNN model generalize well enough to reach human-level accuracy on the ESC-50 sound events dataset and set state-of-the-art results on it. We further apply them to the acoustic scene classification task and again show that our approaches suit it well. We also show that our methods help capture semantic meanings and relations. Moreover, in the process we set state-of-the-art results on the AudioSet dataset, relying on the balanced training set.
    Comment: ICASSP 201
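    The transfer recipe the abstract describes, pretrain on weakly labeled web audio and then reuse the representation, might look like the following PyTorch sketch. The tiny architecture, layer sizes, and target head are illustrative assumptions, not the paper's model.

```python
import torch
import torch.nn as nn

class AudioCNN(nn.Module):
    """A small CNN whose conv stack handles variable-length log mel input;
    global pooling turns a clip of any length into a fixed-size embedding."""
    def __init__(self, n_src_events=527):          # 527 = AudioSet label count
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(64, n_src_events)    # used only during weak pretraining

    def embed(self, logmel):                       # logmel: (batch, 1, time, mel)
        return self.conv(logmel).mean(dim=(2, 3))  # pool over time/freq -> (batch, 64)

# Inductive transfer: freeze the pretrained body, fit a light target-task head.
cnn = AudioCNN()                                   # assume pretrained weights are loaded
target_head = nn.Linear(64, 50)                    # e.g. 50 classes for ESC-50
clip = torch.randn(1, 1, 431, 64)                  # any clip length works
with torch.no_grad():
    emb = cnn.embed(clip)
logits = target_head(emb)
```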

    ์Œํ–ฅ ์ด๋ฒคํŠธ ํƒ์ง€๋ฅผ ์œ„ํ•œ ํšจ์œจ์  ๋ฐ์ดํ„ฐ ํ™œ์šฉ ๋ฐ ์•ฝํ•œ ๊ต์‚ฌํ•™์Šต ๊ธฐ๋ฒ•

    Get PDF
    Doctoral dissertation, Seoul National University Graduate School, Department of Electrical and Computer Engineering, College of Engineering, February 2020. Advisor: Nam Soo Kim.
    Conventional audio event detection (AED) models are based on supervised approaches, which require strongly labeled data. However, collecting large-scale, strongly labeled data for audio events is challenging because of the diversity of audio event types and the difficulty of labeling. In this thesis, we propose data-efficient and weakly supervised techniques for AED. In the first approach, a data-efficient AED system is proposed: data augmentation deals with the data sparsity problem and generates polyphonic event examples, an exemplar-based noise reduction algorithm enhances the features, a multi-label deep neural network (DNN) classifier performs polyphonic event detection, and an adaptive thresholding algorithm serves as post-processing for robust detection in noisy conditions. Experimental results show that the proposed algorithm performs promisingly on a low-resource dataset. In the second approach, a convolutional neural network (CNN)-based audio tagging system is proposed. The model consists of a local detector and a global classifier: the local detector detects local audio words that carry distinct event characteristics, and the global classifier summarizes this information to predict the audio events in the recording. Experiments show that the proposed model outperforms conventional artificial neural network models. In the final approach, we propose a weakly supervised AED model that takes advantage of the strengthened feature propagation of DenseNet and the channel-wise relationship modeling of SENet. Correlations among segments of an audio recording are modeled by a recurrent neural network (RNN) and a conditional random field (CRF): the RNN exploits contextual information, and CRF post-processing refines the segment-level predictions. We evaluate the proposed method against a CNN-based baseline, and a number of experiments show that it is effective for both audio tagging and weakly supervised AED.
    Table of contents:
    1 Introduction
    2 Audio Event Detection
      2.1 Data-Efficient Audio Event Detection
      2.2 Audio Tagging
      2.3 Weakly Supervised Audio Event Detection
      2.4 Metrics
    3 Data-Efficient Techniques for Audio Event Detection
      3.1 Introduction
      3.2 DNN-Based AED System
        3.2.1 Data Augmentation
        3.2.2 Exemplar-Based Approach for Noise Reduction
        3.2.3 DNN Classifier
        3.2.4 Post-Processing
      3.3 Experiments
      3.4 Summary
    4 Audio Tagging using Local Detector and Global Classifier
      4.1 Introduction
      4.2 CNN-Based Audio Tagging Model
        4.2.1 Local Detector and Global Classifier
        4.2.2 Temporal Localization of Events
      4.3 Experiments
        4.3.1 Dataset and Feature
        4.3.2 Model Training
        4.3.3 Results
      4.4 Summary
    5 Deep Convolutional Neural Network with Structured Prediction for Weakly Supervised Audio Event Detection
      5.1 Introduction
      5.2 CNN with Structured Prediction for Weakly Supervised AED
        5.2.1 DenseNet
        5.2.2 Squeeze-and-Excitation
        5.2.3 Global Pooling for Aggregation
        5.2.4 Structured Prediction for Accurate Event Localization
      5.3 Experiments
        5.3.1 Dataset
        5.3.2 Feature Extraction
        5.3.3 DSNet and DSNet-RNN Structures
        5.3.4 Baseline CNN Structure
        5.3.5 Training and Evaluation
        5.3.6 Metrics
        5.3.7 Results and Discussion
        5.3.8 Comparison with the DCASE 2017 Task 4 Results
      5.4 Summary
    6 Conclusions
    Bibliography
    Abstract (in Korean)
    Acknowledgments
    • โ€ฆ
    corecore