1,636 research outputs found

    Band-pass filtering of the time sequences of spectral parameters for robust wireless speech recognition

    Get PDF
    In this paper we address the problem of automatic speech recognition when wireless speech communication systems are involved. In this context, three main sources of distortion should be considered: acoustic environment, speech coding and transmission errors. Whilst the first one has already received a lot of attention, the last two deserve further investigation in our opinion. We have found out that band-pass filtering of the recognition features improves ASR performance when distortions due to these particular communication systems are present. Furthermore, we have evaluated two alternative configurations at different bit error rates (BER) typical of these channels: band-pass filtering the LP-MFCC parameters or a modification of the RASTA-PLP using a sharper low-pass section perform consistently better than LP-MFCC and RASTA-PLP, respectively.Publicad

    Temporally Varying Weight Regression for Speech Recognition

    Get PDF
    Ph.DDOCTOR OF PHILOSOPH

    Speech dereverberation for enhancement and recognition using dynamic features constrained deep neural networks and feature adaptation

    Get PDF
    This paper investigates deep neural networks (DNN) based on nonlinear feature mapping and statistical linear feature adaptation approaches for reducing reverberation in speech signals. In the nonlinear feature mapping approach, DNN is trained from parallel clean/distorted speech corpus to map reverberant and noisy speech coefficients (such as log magnitude spectrum) to the underlying clean speech coefficients. The constraint imposed by dynamic features (i.e., the time derivatives of the speech coefficients) are used to enhance the smoothness of predicted coefficient trajectories in two ways. One is to obtain the enhanced speech coefficients with a least square estimation from the coefficients and dynamic features predicted by DNN. The other is to incorporate the constraint of dynamic features directly into the DNN training process using a sequential cost function. In the linear feature adaptation approach, a sparse linear transform, called cross transform, is used to transform multiple frames of speech coefficients to a new feature space. The transform is estimated to maximize the likelihood of the transformed coefficients given a model of clean speech coefficients. Unlike the DNN approach, no parallel corpus is used and no assumption on distortion types is made. The two approaches are evaluated on the REVERB Challenge 2014 tasks. Both speech enhancement and automatic speech recognition (ASR) results show that the DNN-based mappings significantly reduce the reverberation in speech and improve both speech quality and ASR performance. For the speech enhancement task, the proposed dynamic feature constraint help to improve cepstral distance, frequency-weighted segmental signal-to-noise ratio (SNR), and log likelihood ratio metrics while moderately degrades the speech-to-reverberation modulation energy ratio. In addition, the cross transform feature adaptation improves the ASR performance significantly for clean-condition trained acoustic models.Published versio

    Speech Recognition

    Get PDF
    Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation for speech signals and the methods for speech-features extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes

    On adaptive decision rules and decision parameter adaptation for automatic speech recognition

    Get PDF
    Recent advances in automatic speech recognition are accomplished by designing a plug-in maximum a posteriori decision rule such that the forms of the acoustic and language model distributions are specified and the parameters of the assumed distributions are estimated from a collection of speech and language training corpora. Maximum-likelihood point estimation is by far the most prevailing training method. However, due to the problems of unknown speech distributions, sparse training data, high spectral and temporal variabilities in speech, and possible mismatch between training and testing conditions, a dynamic training strategy is needed. To cope with the changing speakers and speaking conditions in real operational conditions for high-performance speech recognition, such paradigms incorporate a small amount of speaker and environment specific adaptation data into the training process. Bayesian adaptive learning is an optimal way to combine prior knowledge in an existing collection of general models with a new set of condition-specific adaptation data. In this paper, the mathematical framework for Bayesian adaptation of acoustic and language model parameters is first described. Maximum a posteriori point estimation is then developed for hidden Markov models and a number of useful parameters densities commonly used in automatic speech recognition and natural language processing.published_or_final_versio

    강인한 음성인식을 위한 DNN 기반 음향 모델링

    Get PDF
    학위논문 (박사)-- 서울대학교 대학원 : 공과대학 전기·컴퓨터공학부, 2019. 2. 김남수.본 논문에서는 강인한 음성인식을 위해서 DNN을 활용한 음향 모델링 기법들을 제안한다. 본 논문에서는 크게 세 가지의 DNN 기반 기법을 제안한다. 첫 번째는 DNN이 가지고 있는 잡음 환경에 대한 강인함을 보조 특징 벡터들을 통하여 최대로 활용하는 음향 모델링 기법이다. 이러한 기법을 통하여 DNN은 왜곡된 음성, 깨끗한 음성, 잡음 추정치, 그리고 음소 타겟과의 복잡한 관계를 보다 원활하게 학습하게 된다. 본 기법은 Aurora-5 DB 에서 기존의 보조 잡음 특징 벡터를 활용한 모델 적응 기법인 잡음 인지 학습 (noise-aware training, NAT) 기법을 크게 뛰어넘는 성능을 보였다. 두 번째는 DNN을 활용한 다 채널 특징 향상 기법이다. 기존의 다 채널 시나리오에서는 전통적인 신호 처리 기법인 빔포밍 기법을 통하여 향상된 단일 소스 음성 신호를 추출하고 그를 통하여 음성인식을 수행한다. 우리는 기존의 빔포밍 중에서 가장 기본적 기법 중 하나인 delay-and-sum (DS) 빔포밍 기법과 DNN을 결합한 다 채널 특징 향상 기법을 제안한다. 제안하는 DNN은 중간 단계 특징 벡터를 활용한 공동 학습 기법을 통하여 왜곡된 다 채널 입력 음성 신호들과 깨끗한 음성 신호와의 관계를 효과적으로 표현한다. 제안된 기법은 multichannel wall street journal audio visual (MC-WSJAV) corpus에서의 실험을 통하여, 기존의 다채널 향상 기법들보다 뛰어난 성능을 보임을 확인하였다. 마지막으로, 불확정성 인지 학습 (Uncertainty-aware training, UAT) 기법이다. 위에서 소개된 기법들을 포함하여 강인한 음성인식을 위한 기존의 DNN 기반 기법들은 각각의 네트워크의 타겟을 추정하는데 있어서 결정론적인 추정 방식을 사용한다. 이는 추정치의 불확정성 문제 혹은 신뢰도 문제를 야기한다. 이러한 문제점을 극복하기 위하여 제안하는 UAT 기법은 확률론적인 변화 추정을 학습하고 수행할 수 있는 뉴럴 네트워크 모델인 변화 오토인코더 (variational autoencoder, VAE) 모델을 사용한다. UAT는 왜곡된 음성 특징 벡터와 음소 타겟과의 관계를 매개하는 강인한 은닉 변수를 깨끗한 음성 특징 벡터 추정치의 분포 정보를 이용하여 모델링한다. UAT의 은닉 변수들은 딥 러닝 기반 음향 모델에 최적화된 uncertainty decoding (UD) 프레임워크로부터 유도된 최대 우도 기준에 따라서 학습된다. 제안된 기법은 Aurora-4 DB와 CHiME-4 DB에서 기존의 DNN 기반 기법들을 크게 뛰어넘는 성능을 보였다.In this thesis, we propose three acoustic modeling techniques for robust automatic speech recognition (ASR). Firstly, we propose a DNN-based acoustic modeling technique which makes the best use of the inherent noise-robustness of DNN is proposed. By applying this technique, the DNN can automatically learn the complicated relationship among the noisy, clean speech and noise estimate to phonetic target smoothly. The proposed method outperformed noise-aware training (NAT), i.e., the conventional auxiliary-feature-based model adaptation technique in Aurora-5 DB. The second method is multi-channel feature enhancement technique. In the general multi-channel speech recognition scenario, the enhanced single speech signal source is extracted from the multiple inputs using beamforming, i.e., the conventional signal-processing-based technique and the speech recognition process is performed by feeding that source into the acoustic model. We propose the multi-channel feature enhancement DNN algorithm by properly combining the delay-and-sum (DS) beamformer, which is one of the conventional beamforming techniques and DNN. Through the experiments using multichannel wall street journal audio visual (MC-WSJ-AV) corpus, it has been shown that the proposed method outperformed the conventional multi-channel feature enhancement techniques. Finally, uncertainty-aware training (UAT) technique is proposed. The most of the existing DNN-based techniques including the techniques introduced above, aim to optimize the point estimates of the targets (e.g., clean features, and acoustic model parameters). This tampers with the reliability of the estimates. In order to overcome this issue, UAT employs a modified structure of variational autoencoder (VAE), a neural network model which learns and performs stochastic variational inference (VIF). UAT models the robust latent variables which intervene the mapping between the noisy observed features and the phonetic target using the distributive information of the clean feature estimates. The proposed technique outperforms the conventional DNN-based techniques on Aurora-4 and CHiME-4 databases.Abstract i Contents iv List of Figures ix List of Tables xiii 1 Introduction 1 2 Background 9 2.1 Deep Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2 Experimental Database . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2.1 Aurora-4 DB . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2.2 Aurora-5 DB . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.2.3 MC-WSJ-AV DB . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.2.4 CHiME-4 DB . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3 Two-stage Noise-aware Training for Environment-robust Speech Recognition 25 iii 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.2 Noise-aware Training . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.3 Two-stage NAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.3.1 Lower DNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.3.2 Upper DNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.3.3 Joint Training . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.4.1 GMM-HMM System . . . . . . . . . . . . . . . . . . . . . . . 37 3.4.2 Training and Structures of DNN-based Techniques . . . . . . 37 3.4.3 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . 40 3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4 DNN-based Feature Enhancement for Robust Multichannel Speech Recognition 45 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.2 Observation Model in Multi-Channel Reverberant Noisy Environment 49 4.3 Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.3.1 Lower DNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.3.2 Upper DNN and Joint Training . . . . . . . . . . . . . . . . . 54 4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.4.1 Recognition System and Feature Extraction . . . . . . . . . . 56 4.4.2 Training and Structures of DNN-based Techniques . . . . . . 58 4.4.3 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.4.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . 62 4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 iv 5 Uncertainty-aware Training for DNN-HMM System using Varia- tional Inference 67 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 5.2 Uncertainty Decoding for Noise Robustness . . . . . . . . . . . . . . 72 5.3 Variational Autoencoder . . . . . . . . . . . . . . . . . . . . . . . . . 77 5.4 VIF-based uncertainty-aware Training . . . . . . . . . . . . . . . . . 83 5.4.1 Clean Uncertainty Network . . . . . . . . . . . . . . . . . . . 91 5.4.2 Environment Uncertainty Network . . . . . . . . . . . . . . . 93 5.4.3 Prediction Network and Joint Training . . . . . . . . . . . . . 95 5.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 5.5.1 Experimental Setup: Feature Extraction and ASR System . . 96 5.5.2 Network Structures . . . . . . . . . . . . . . . . . . . . . . . . 98 5.5.3 Eects of CUN on the Noise Robustness . . . . . . . . . . . . 104 5.5.4 Uncertainty Representation in Dierent SNR Condition . . . 105 5.5.5 Result of Speech Recognition . . . . . . . . . . . . . . . . . . 112 5.5.6 Result of Speech Recognition with LSTM-HMM . . . . . . . 114 5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 6 Conclusions 127 Bibliography 131 요약 145Docto

    Hidden Markov model-based speech enhancement

    Get PDF
    This work proposes a method of model-based speech enhancement that uses a network of HMMs to first decode noisy speech and to then synthesise a set of features that enables a speech production model to reconstruct clean speech. The motivation is to remove the distortion and residual and musical noises that are associated with conventional filteringbased methods of speech enhancement. STRAIGHT forms the speech production model for speech reconstruction and requires a time-frequency spectral surface, aperiodicity and a fundamental frequency contour. The technique of HMM-based synthesis is used to create the estimate of the timefrequency surface, and aperiodicity after the model and state sequence is obtained from HMM decoding of the input noisy speech. Fundamental frequency were found to be best estimated using the PEFAC method rather than synthesis from the HMMs. For the robust HMM decoding in noisy conditions it is necessary for the HMMs to model noisy speech and consequently noise adaptation is investigated to achieve this and its resulting effect on the reconstructed speech measured. Even with such noise adaptation to match the HMMs to the noisy conditions, decoding errors arise, both in terms of incorrect decoding and time alignment errors. Confidence measures are developed to identify such errors and then compensation methods developed to conceal these errors in the enhanced speech signal. Speech quality and intelligibility analysis is first applied in terms of PESQ and NCM showing the superiority of the proposed method against conventional methods at low SNRs. Three way subjective MOS listening test then discovers the performance of the proposed method overwhelmingly surpass the conventional methods over all noise conditions and then a subjective word recognition test shows an advantage of the proposed method over speech intelligibility to the conventional methods at low SNRs
    corecore