Search CORE

3,136 research outputs found

Robust Speech Recognition for Adverse Environments

Author: Liu Chao-Hong
Wu Chung-Hsien
Publication venue: 'IntechOpen'
Publication date: 28/11/2012
Field of study

A Bayesian Network View on Acoustic Model-Based Techniques for Robust Speech Recognition

Author: Huemmer Christian
Kellermann Walter
Maas Roland
Sehr Armin
Publication venue
Publication date: 22/09/2014
Field of study

This article provides a unifying Bayesian network view on various approaches for acoustic model adaptation, missing feature, and uncertainty decoding that are well-known in the literature of robust automatic speech recognition. The representatives of these classes can often be deduced from a Bayesian network that extends the conventional hidden Markov models used in speech recognition. These extensions, in turn, can in many cases be motivated from an underlying observation model that relates clean and distorted feature vectors. By converting the observation models into a Bayesian network representation, we formulate the corresponding compensation rules leading to a unified view on known derivations as well as to new formulations for certain approaches. The generic Bayesian perspective provided in this contribution thus highlights structural differences and similarities between the analyzed approaches

arXiv.org e-Print Archive

Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments

Author: Geiger Jürgen
Jin Wenyu
Mousa Amr El-Desoky
Pohjalainen Jouni
Schuller Björn
Zhang Zixing
Publication venue
Publication date: 01/01/2018
Field of study

Eliminating the negative effect of non-stationary environmental noise is a long-standing research topic for automatic speech recognition that stills remains an important challenge. Data-driven supervised approaches, including ones based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches and with sufficient training, can alleviate the shortcomings of the unsupervised methods in various real-life acoustic environments. In this light, we review recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech with the aim of providing guidelines for those involved in the development of environmentally robust speech recognition systems. We separately discuss single- and multi-channel techniques developed for the front-end and back-end of speech recognition systems, as well as joint front-end and back-end training frameworks

arXiv.org e-Print Archive

OPUS Augsburg

강인한 음성인식을 위한 DNN 기반 음향 모델링

Author: 이강현
Publication venue: 서울대학교 대학원
Publication date: 01/02/2019
Field of study

학위논문 (박사)-- 서울대학교 대학원 : 공과대학 전기·컴퓨터공학부, 2019. 2. 김남수.본 논문에서는 강인한 음성인식을 위해서 DNN을 활용한 음향 모델링 기법들을 제안한다. 본 논문에서는 크게 세 가지의 DNN 기반 기법을 제안한다. 첫 번째는 DNN이 가지고 있는 잡음 환경에 대한 강인함을 보조 특징 벡터들을 통하여 최대로 활용하는 음향 모델링 기법이다. 이러한 기법을 통하여 DNN은 왜곡된 음성, 깨끗한 음성, 잡음 추정치, 그리고 음소 타겟과의 복잡한 관계를 보다 원활하게 학습하게 된다. 본 기법은 Aurora-5 DB 에서 기존의 보조 잡음 특징 벡터를 활용한 모델 적응 기법인 잡음 인지 학습 (noise-aware training, NAT) 기법을 크게 뛰어넘는 성능을 보였다. 두 번째는 DNN을 활용한 다 채널 특징 향상 기법이다. 기존의 다 채널 시나리오에서는 전통적인 신호 처리 기법인 빔포밍 기법을 통하여 향상된 단일 소스 음성 신호를 추출하고 그를 통하여 음성인식을 수행한다. 우리는 기존의 빔포밍 중에서 가장 기본적 기법 중 하나인 delay-and-sum (DS) 빔포밍 기법과 DNN을 결합한 다 채널 특징 향상 기법을 제안한다. 제안하는 DNN은 중간 단계 특징 벡터를 활용한 공동 학습 기법을 통하여 왜곡된 다 채널 입력 음성 신호들과 깨끗한 음성 신호와의 관계를 효과적으로 표현한다. 제안된 기법은 multichannel wall street journal audio visual (MC-WSJAV) corpus에서의 실험을 통하여, 기존의 다채널 향상 기법들보다 뛰어난 성능을 보임을 확인하였다. 마지막으로, 불확정성 인지 학습 (Uncertainty-aware training, UAT) 기법이다. 위에서 소개된 기법들을 포함하여 강인한 음성인식을 위한 기존의 DNN 기반 기법들은 각각의 네트워크의 타겟을 추정하는데 있어서 결정론적인 추정 방식을 사용한다. 이는 추정치의 불확정성 문제 혹은 신뢰도 문제를 야기한다. 이러한 문제점을 극복하기 위하여 제안하는 UAT 기법은 확률론적인 변화 추정을 학습하고 수행할 수 있는 뉴럴 네트워크 모델인 변화 오토인코더 (variational autoencoder, VAE) 모델을 사용한다. UAT는 왜곡된 음성 특징 벡터와 음소 타겟과의 관계를 매개하는 강인한 은닉 변수를 깨끗한 음성 특징 벡터 추정치의 분포 정보를 이용하여 모델링한다. UAT의 은닉 변수들은 딥 러닝 기반 음향 모델에 최적화된 uncertainty decoding (UD) 프레임워크로부터 유도된 최대 우도 기준에 따라서 학습된다. 제안된 기법은 Aurora-4 DB와 CHiME-4 DB에서 기존의 DNN 기반 기법들을 크게 뛰어넘는 성능을 보였다.In this thesis, we propose three acoustic modeling techniques for robust automatic speech recognition (ASR). Firstly, we propose a DNN-based acoustic modeling technique which makes the best use of the inherent noise-robustness of DNN is proposed. By applying this technique, the DNN can automatically learn the complicated relationship among the noisy, clean speech and noise estimate to phonetic target smoothly. The proposed method outperformed noise-aware training (NAT), i.e., the conventional auxiliary-feature-based model adaptation technique in Aurora-5 DB. The second method is multi-channel feature enhancement technique. In the general multi-channel speech recognition scenario, the enhanced single speech signal source is extracted from the multiple inputs using beamforming, i.e., the conventional signal-processing-based technique and the speech recognition process is performed by feeding that source into the acoustic model. We propose the multi-channel feature enhancement DNN algorithm by properly combining the delay-and-sum (DS) beamformer, which is one of the conventional beamforming techniques and DNN. Through the experiments using multichannel wall street journal audio visual (MC-WSJ-AV) corpus, it has been shown that the proposed method outperformed the conventional multi-channel feature enhancement techniques. Finally, uncertainty-aware training (UAT) technique is proposed. The most of the existing DNN-based techniques including the techniques introduced above, aim to optimize the point estimates of the targets (e.g., clean features, and acoustic model parameters). This tampers with the reliability of the estimates. In order to overcome this issue, UAT employs a modified structure of variational autoencoder (VAE), a neural network model which learns and performs stochastic variational inference (VIF). UAT models the robust latent variables which intervene the mapping between the noisy observed features and the phonetic target using the distributive information of the clean feature estimates. The proposed technique outperforms the conventional DNN-based techniques on Aurora-4 and CHiME-4 databases.Abstract i Contents iv List of Figures ix List of Tables xiii 1 Introduction 1 2 Background 9 2.1 Deep Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2 Experimental Database . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2.1 Aurora-4 DB . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2.2 Aurora-5 DB . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.2.3 MC-WSJ-AV DB . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.2.4 CHiME-4 DB . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3 Two-stage Noise-aware Training for Environment-robust Speech Recognition 25 iii 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.2 Noise-aware Training . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.3 Two-stage NAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.3.1 Lower DNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.3.2 Upper DNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.3.3 Joint Training . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.4.1 GMM-HMM System . . . . . . . . . . . . . . . . . . . . . . . 37 3.4.2 Training and Structures of DNN-based Techniques . . . . . . 37 3.4.3 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . 40 3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4 DNN-based Feature Enhancement for Robust Multichannel Speech Recognition 45 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.2 Observation Model in Multi-Channel Reverberant Noisy Environment 49 4.3 Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.3.1 Lower DNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.3.2 Upper DNN and Joint Training . . . . . . . . . . . . . . . . . 54 4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.4.1 Recognition System and Feature Extraction . . . . . . . . . . 56 4.4.2 Training and Structures of DNN-based Techniques . . . . . . 58 4.4.3 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.4.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . 62 4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 iv 5 Uncertainty-aware Training for DNN-HMM System using Varia- tional Inference 67 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 5.2 Uncertainty Decoding for Noise Robustness . . . . . . . . . . . . . . 72 5.3 Variational Autoencoder . . . . . . . . . . . . . . . . . . . . . . . . . 77 5.4 VIF-based uncertainty-aware Training . . . . . . . . . . . . . . . . . 83 5.4.1 Clean Uncertainty Network . . . . . . . . . . . . . . . . . . . 91 5.4.2 Environment Uncertainty Network . . . . . . . . . . . . . . . 93 5.4.3 Prediction Network and Joint Training . . . . . . . . . . . . . 95 5.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 5.5.1 Experimental Setup: Feature Extraction and ASR System . . 96 5.5.2 Network Structures . . . . . . . . . . . . . . . . . . . . . . . . 98 5.5.3 Eects of CUN on the Noise Robustness . . . . . . . . . . . . 104 5.5.4 Uncertainty Representation in Dierent SNR Condition . . . 105 5.5.5 Result of Speech Recognition . . . . . . . . . . . . . . . . . . 112 5.5.6 Result of Speech Recognition with LSTM-HMM . . . . . . . 114 5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 6 Conclusions 127 Bibliography 131 요약 145Docto

SNU Open Repository and Archive

Time-domain speech enhancement using generative adversarial networks

Author: Bonafonte Cávez Antonio
Pascual de la Puente Santiago
Serra Joan
Publication venue: 'Elsevier BV'
Publication date: 01/11/2019
Field of study

Speech enhancement improves recorded voice utterances to eliminate noise that might be impeding their intelligibility or compromising their quality. Typical speech enhancement systems are based on regression approaches that subtract noise or predict clean signals. Most of them do not operate directly on waveforms. In this work, we propose a generative approach to regenerate corrupted signals into a clean version by using generative adversarial networks on the raw signal. We also explore several variations of the proposed system, obtaining insights into proper architectural choices for an adversarially trained, convolutional autoencoder applied to speech. We conduct both objective and subjective evaluations to assess the performance of the proposed method. The former helps us choose among variations and better tune hyperparameters, while the latter is used in a listening experiment with 42 subjects, confirming the effectiveness of the approach in the real world. We also demonstrate the applicability of the approach for more generalized speech enhancement, where we have to regenerate voices from whispered signals.Peer ReviewedPostprint (author's final draft

UPCommons. Portal del coneixement obert de la UPC