208 research outputs found

    Data-Efficient and Weakly Supervised Learning Techniques for Audio Event Detection

    Doctoral thesis -- Seoul National University Graduate School, Department of Electrical and Computer Engineering, College of Engineering, February 2020. Advisor: Nam Soo Kim (김남수).

    Conventional audio event detection (AED) models are based on supervised approaches, which require strongly labeled data. However, collecting large-scale strongly labeled data for audio events is challenging due to the diversity of audio event types and the difficulty of labeling. In this thesis, we propose data-efficient and weakly supervised techniques for AED. In the first approach, a data-efficient AED system is proposed. Data augmentation is performed to deal with the data sparsity problem and to generate polyphonic event examples, and an exemplar-based noise reduction algorithm is proposed for feature enhancement. For polyphonic event detection, a multi-label deep neural network (DNN) classifier is employed, and an adaptive thresholding algorithm is applied as a post-processing step for robust detection in noisy conditions. Experimental results show that the proposed algorithm performs well on a low-resource dataset. In the second approach, a convolutional neural network (CNN)-based audio tagging system is proposed. The model consists of a local detector and a global classifier: the local detector detects local audio words that contain distinct characteristics of events, and the global classifier summarizes this information to predict the audio events present in a recording. Experiments show that the proposed model outperforms conventional artificial neural network models. In the final approach, we propose a weakly supervised AED model that combines the strengthened feature propagation of DenseNet with the channel-wise relationship modeling of SENet. The correlations among segments of an audio recording are modeled by a recurrent neural network (RNN) and a conditional random field (CRF): the RNN exploits contextual information, and CRF post-processing refines the segment-level predictions. We evaluate the proposed method against a CNN-based baseline and show, over a number of experiments, that it is effective for both audio tagging and weakly supervised AED.

    Table of contents:
    1 Introduction
    2 Audio Event Detection
      2.1 Data-Efficient Audio Event Detection
      2.2 Audio Tagging
      2.3 Weakly Supervised Audio Event Detection
      2.4 Metrics
    3 Data-Efficient Techniques for Audio Event Detection
      3.1 Introduction
      3.2 DNN-Based AED System
        3.2.1 Data Augmentation
        3.2.2 Exemplar-Based Approach for Noise Reduction
        3.2.3 DNN Classifier
        3.2.4 Post-Processing
      3.3 Experiments
      3.4 Summary
    4 Audio Tagging using Local Detector and Global Classifier
      4.1 Introduction
      4.2 CNN-Based Audio Tagging Model
        4.2.1 Local Detector and Global Classifier
        4.2.2 Temporal Localization of Events
      4.3 Experiments
        4.3.1 Dataset and Feature
        4.3.2 Model Training
        4.3.3 Results
      4.4 Summary
    5 Deep Convolutional Neural Network with Structured Prediction for Weakly Supervised Audio Event Detection
      5.1 Introduction
      5.2 CNN with Structured Prediction for Weakly Supervised AED
        5.2.1 DenseNet
        5.2.2 Squeeze-and-Excitation
        5.2.3 Global Pooling for Aggregation
        5.2.4 Structured Prediction for Accurate Event Localization
      5.3 Experiments
        5.3.1 Dataset
        5.3.2 Feature Extraction
        5.3.3 DSNet and DSNet-RNN Structures
        5.3.4 Baseline CNN Structure
        5.3.5 Training and Evaluation
        5.3.6 Metrics
        5.3.7 Results and Discussion
        5.3.8 Comparison with the DCASE 2017 Task 4 Results
      5.4 Summary
    6 Conclusions
    Bibliography
    요 약 (Abstract in Korean)
    감사의 글 (Acknowledgments)
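
    The local detector / global classifier design and the weak-label aggregation described in the abstract can be made concrete with a small sketch. The PyTorch-style toy model below only illustrates that aggregation pattern; the layer sizes, the sigmoid segment scores, and the mean pooling used as the "global classifier" are assumptions, not the thesis architecture.

```python
# Minimal sketch of a weakly supervised tagging model in the spirit of the
# abstract: a convolutional "local detector" scores each time frame, and a
# "global classifier" (here, simple average pooling over time) aggregates the
# frame scores into clip-level tag probabilities. All sizes are illustrative.
import torch
import torch.nn as nn

class LocalGlobalTagger(nn.Module):
    def __init__(self, n_mels=64, n_classes=17):   # e.g. 17 classes as in DCASE 2017 task 4
        super().__init__()
        # Local detector: 2-D CNN over (mel, time) that preserves the time axis.
        self.local = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((4, 1)),                   # pool frequency only
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((4, 1)),
        )
        self.segment_scores = nn.Conv2d(64, n_classes, kernel_size=(n_mels // 16, 1))

    def forward(self, x):                           # x: (batch, 1, n_mels, n_frames)
        h = self.local(x)                           # (batch, 64, n_mels // 16, n_frames)
        s = torch.sigmoid(self.segment_scores(h))   # per-frame event scores
        s = s.squeeze(2)                            # (batch, n_classes, n_frames)
        clip = s.mean(dim=-1)                       # global aggregation over time
        return clip, s                              # clip-level tags, frame-level scores

model = LocalGlobalTagger()
clip_probs, frame_probs = model(torch.randn(2, 1, 64, 500))
```

    Training such a model only needs clip-level (weak) labels for `clip_probs`, while `frame_probs` can be thresholded afterwards for temporal localization.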

    Shennong: a Python toolbox for audio speech features extraction

    We introduce Shennong, a Python toolbox and command-line utility for speech feature extraction. It implements a wide range of well-established, state-of-the-art algorithms, including spectro-temporal filters such as Mel-Frequency Cepstral Filterbanks or Predictive Linear Filters, pre-trained neural networks, pitch estimators, speaker normalization methods, and post-processing algorithms. Shennong is an open-source, easy-to-use, reliable, and extensible framework, and its use of Python makes integration with other speech modeling and machine learning tools easy. It aims to replace or complement several heterogeneous tools, such as Kaldi or Praat. After describing the Shennong software architecture, its core components, and the implemented algorithms, this paper illustrates its use in three applications: a comparison of speech feature performance on a phone discrimination task, an analysis of a Vocal Tract Length Normalization model as a function of the speech duration used for training, and a comparison of pitch estimation algorithms under various noise conditions.
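
    As a rough usage sketch, extracting features with Shennong's Python API looks like the following. The import paths and class names (`Audio.load`, `MfccProcessor`) are taken from the toolbox's documentation as best as can be inferred here and may differ between versions, so treat them as assumptions and check the installed release; the WAV file name is a placeholder.

```python
# Hedged sketch of MFCC extraction with Shennong's Python API.
# Import paths reflect the documented API but may vary across versions.
from shennong.audio import Audio
from shennong.processor.mfcc import MfccProcessor

audio = Audio.load('utterance.wav')                       # placeholder file name
processor = MfccProcessor(sample_rate=audio.sample_rate)  # configure the extractor
features = processor.process(audio)                       # frames-by-coefficients features
```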


    Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments

    Eliminating the negative effect of non-stationary environmental noise is a long-standing research topic for automatic speech recognition that still remains an important challenge. Data-driven supervised approaches, including ones based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches and, with sufficient training, can alleviate the shortcomings of the unsupervised methods in various real-life acoustic environments. In this light, we review recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech, with the aim of providing guidelines for those involved in the development of environmentally robust speech recognition systems. We separately discuss single- and multi-channel techniques developed for the front-end and back-end of speech recognition systems, as well as joint front-end and back-end training frameworks.
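
    As one concrete example of the single-channel front-end techniques surveyed in this class of work, a small network can estimate a time-frequency mask that suppresses additive noise in the spectrogram before recognition. The sketch below is generic and not taken from this paper; the LSTM mask estimator and all sizes are assumptions.

```python
# Generic single-channel "front-end" enhancement sketch: a small recurrent
# network predicts a ratio mask in [0, 1] that is applied to the noisy
# magnitude spectrogram. Sizes and architecture are illustrative only.
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    def __init__(self, n_freq=257, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(n_freq, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_freq)

    def forward(self, noisy_mag):              # (batch, frames, n_freq)
        h, _ = self.rnn(noisy_mag)
        mask = torch.sigmoid(self.out(h))      # time-frequency ratio mask
        return mask * noisy_mag                # enhanced magnitude spectrogram

enhancer = MaskEstimator()
enhanced = enhancer(torch.rand(1, 100, 257))   # toy noisy spectrogram
```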

    Simple acoustic features can explain phoneme-based predictions of cortical responses to speech

    When we listen to speech, we have to make sense of a waveform of sound pressure. Hierarchical models of speech perception assume that, to extract semantic meaning, the signal is transformed into unknown, intermediate neuronal representations. Traditionally, studies of such intermediate representations are guided by linguistically defined concepts, such as phonemes. Here, we argue that in order to arrive at an unbiased understanding of the neuronal responses to speech, we should focus instead on representations obtained directly from the stimulus. We illustrate our view with a data-driven, information-theoretic analysis of a dataset of 24 young, healthy humans who listened to a 1 h narrative while their magnetoencephalogram (MEG) was recorded. We find that two recent results, namely the improved performance of an encoding model in which annotated linguistic and acoustic features were combined, and the decoding of phoneme subgroups from phoneme-locked responses, can be explained by an encoding model that is based entirely on acoustic features. These acoustic features capitalize on acoustic edges and outperform Gabor-filtered spectrograms, which can explicitly describe the spectrotemporal characteristics of individual phonemes. By replicating our results in publicly available electroencephalography (EEG) data, we conclude that models of brain responses based on linguistic features can serve as excellent benchmarks. However, we believe that in order to further our understanding of human cortical responses to speech, we should also explore low-level and parsimonious explanations for apparent high-level phenomena.
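
    For readers unfamiliar with such encoding models, the sketch below shows their generic form: ridge regression from time-lagged acoustic features (here a crude "acoustic edge" proxy, the rectified derivative of a synthetic envelope) to a single recorded channel. The data, lags, and feature definition are illustrative assumptions and do not reproduce this paper's analysis.

```python
# Generic linear encoding model (temporal response function) sketch:
# ridge regression maps time-lagged acoustic-edge features onto one
# MEG/EEG channel. All data here are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
fs = 100                                    # feature sampling rate in Hz
envelope = rng.random(60 * fs)              # stand-in for a speech envelope
edges = np.maximum(np.diff(envelope, prepend=envelope[0]), 0)  # "acoustic edges"
meg = rng.standard_normal(60 * fs)          # stand-in for one neural channel

lags = range(0, 40)                         # 0-400 ms of stimulus history
# np.roll wraps around at the edges, which is acceptable for a sketch.
X = np.column_stack([np.roll(edges, lag) for lag in lags])
model = Ridge(alpha=1.0).fit(X, meg)
prediction = model.predict(X)               # encoding-model prediction of the channel
```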