1,465 research outputs found
Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments
Eliminating the negative effect of non-stationary environmental noise is a
long-standing research topic for automatic speech recognition that still
remains an important challenge. Data-driven supervised approaches, including
ones based on deep neural networks, have recently emerged as potential
alternatives to traditional unsupervised approaches and with sufficient
training, can alleviate the shortcomings of the unsupervised methods in various
real-life acoustic environments. In this light, we review recently developed,
representative deep learning approaches for tackling non-stationary additive
and convolutional degradation of speech with the aim of providing guidelines
for those involved in the development of environmentally robust speech
recognition systems. We separately discuss single- and multi-channel techniques
developed for the front-end and back-end of speech recognition systems, as well
as joint front-end and back-end training frameworks
Speech Recognition
Chapters in the first part of the book cover all the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals and methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and other applications able to operate in real-world environments, like mobile communication services and smart homes
Audio-visual speech processing system for Polish applicable to human-computer interaction
This paper describes an audio-visual speech recognition system for the Polish language and a set of performance tests under various acoustic conditions. We first present the overall structure of AVASR systems, with three main areas: audio feature extraction, visual feature extraction and, subsequently, audio-visual speech integration. We present MFCC features for the audio stream with the standard HMM modeling technique, then we describe appearance- and shape-based visual features. Subsequently, we present two feature integration techniques: feature concatenation and model fusion. We also discuss the results of a set of experiments conducted to select the best system setup for Polish under noisy audio conditions. The experiments simulate human-computer interaction in a computer-control use case with voice commands in difficult audio environments. With the Active Appearance Model (AAM) and a multistream Hidden Markov Model (HMM), we can improve system accuracy, reducing the Word Error Rate by more than 30% compared to audio-only speech recognition when the Signal-to-Noise Ratio goes down to 0 dB
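The feature-concatenation integration described in the abstract can be sketched in a few lines: upsample the slower visual stream to the audio frame rate and stack the two feature matrices per frame. This is a minimal numpy illustration under assumed frame rates (100 audio frames/s, 25 video frames/s); the function name and dimensions are illustrative, not taken from the paper.

```python
import numpy as np

def concat_av_features(mfcc, visual, audio_fps=100, video_fps=25):
    """Feature-level audio-visual integration: upsample the visual
    stream to the audio frame rate, then concatenate per frame.

    mfcc:   (T_a, D_a) audio features (e.g. 13 MFCCs per 10 ms frame)
    visual: (T_v, D_v) visual features (e.g. AAM parameters per video frame)
    """
    step = audio_fps // video_fps                # audio frames per video frame
    visual_up = np.repeat(visual, step, axis=0)  # nearest-neighbour upsampling
    T = min(len(mfcc), len(visual_up))           # trim to a common length
    return np.hstack([mfcc[:T], visual_up[:T]])  # (T, D_a + D_v)

# toy example: 1 s of audio (100 frames x 13 MFCCs), 25 video frames x 6 AAM params
fused = concat_av_features(np.zeros((100, 13)), np.zeros((25, 6)))
print(fused.shape)  # (100, 19)
```

The concatenated vectors would then feed a single HMM; the alternative the paper tests, model fusion, keeps separate audio and visual streams inside a multistream HMM instead.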
Generalized multi-stream hidden Markov models.
For complex classification systems, data is usually gathered from multiple sources of information that have varying degrees of reliability. In fact, assuming that the different sources have the same relevance in describing all the data might lead to erroneous behavior. The classification error accumulates and can be more severe for temporal data, where each sample is represented by a sequence of observations. Thus, there is compelling evidence that learning algorithms should include a relevance weight for each source of information (stream) as a parameter that needs to be learned. In this dissertation, we assume that the multi-stream temporal data is generated by independent and synchronous streams. Using this assumption, we develop, implement, and test multi-stream continuous and discrete hidden Markov model (HMM) algorithms. For the discrete case, we propose two new approaches to generalize the baseline discrete HMM. The first one combines unsupervised learning, feature discrimination, standard discrete HMMs and weighted distances to learn the codebook with feature-dependent weights for each symbol. The second approach consists of modifying the HMM structure to include stream relevance weights, generalizing the standard discrete Baum-Welch learning algorithm, and deriving the necessary conditions to optimize all model parameters simultaneously. We also generalize the minimum classification error (MCE) discriminative training algorithm to include stream relevance weights. For the continuous HMM, we introduce a new approach that integrates the stream relevance weights in the objective function. Our approach is based on the linearization of the probability density function. Two variations are proposed: the mixture-level and state-level variations. As in the discrete case, we generalize the continuous Baum-Welch learning algorithm to accommodate these changes, and we derive the necessary conditions for updating the model parameters.
We also generalize the MCE learning algorithm to derive the necessary conditions for the model parameters' update. The proposed discrete and continuous HMMs are tested on synthetic data sets. They are also validated on various applications, including Australian Sign Language, audio classification, face classification, and, more extensively, the problem of landmine detection using ground-penetrating radar data. For all applications, we show that considerable improvement can be achieved compared to the baseline HMM and the existing multi-stream HMM algorithms
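The stream relevance weights at the heart of such models act on the HMM emission probability: raising each stream's likelihood to its weight corresponds, in the log domain, to a weighted sum of per-stream log-likelihoods. The sketch below illustrates only that combination step; the function name and toy numbers are illustrative assumptions, not taken from the dissertation.

```python
import numpy as np

def multistream_log_likelihood(stream_loglikes, stream_weights):
    """Combine per-stream HMM emission log-likelihoods with relevance weights.

    stream_loglikes: (S, N) array of log b_{j,s}(o_s) for S streams, N states
    stream_weights:  (S,) non-negative relevance weights

    The weighted product of stream likelihoods, prod_s b_{j,s}(o_s)^{w_s},
    becomes a weighted sum in the log domain.
    """
    w = np.asarray(stream_weights, dtype=float)
    return w @ np.asarray(stream_loglikes)   # (N,) combined log-likelihood per state

# two streams, three states: stream 0 weighted twice as heavily as stream 1
ll = multistream_log_likelihood([[-1.0, -2.0, -3.0],
                                 [-4.0, -1.0, -2.0]],
                                stream_weights=[2/3, 1/3])
print(ll[0])  # 2/3*(-1) + 1/3*(-4) = -2.0
```

In the dissertation's setting these weights are not fixed by hand as above but learned jointly with the other HMM parameters via the generalized Baum-Welch and MCE procedures.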
Evaluation and analysis of hybrid intelligent pattern recognition techniques for speaker identification
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. The rapid momentum of technological progress in recent years has led to a tremendous rise in the use of biometric authentication systems. The objective of this research is to investigate the problem of identifying a speaker from the voice regardless of the content (i.e. text-independent), and to design efficient methods of combining face and voice to produce a robust authentication system.
A novel approach to speaker identification is developed using wavelet analysis and multiple neural networks, including the Probabilistic Neural Network (PNN), General Regression Neural Network (GRNN) and Radial Basis Function Neural Network (RBF NN) with an AND voting scheme. This approach is tested on the GRID and VidTIMIT corpora, and comprehensive test results have been validated against state-of-the-art approaches. The system was found to be competitive: it improved the recognition rate by 15% compared to classical Mel-frequency Cepstral Coefficients (MFCC), and reduced the recognition time by 40% compared to the Back Propagation Neural Network (BPNN), Gaussian Mixture Models (GMM) and Principal Component Analysis (PCA).
Another novel approach using vowel formant analysis is implemented using Linear Discriminant Analysis (LDA). Vowel-formant-based speaker identification is well suited to real-time implementation and requires only a few bytes of information to be stored for each speaker, making it both storage- and time-efficient. Tested on GRID and VidTIMIT, the proposed scheme was found to be 85.05% accurate when Linear Predictive Coding (LPC) is used to extract the vowel formants, which is much higher than the accuracy of BPNN and GMM. Since the proposed scheme requires no training time other than creating a small database of vowel formants, it is faster as well. Furthermore, an increasing number of speakers makes it difficult for BPNN and GMM to sustain their accuracy, but the proposed score-based methodology stays almost linear.
Finally, a novel audio-visual fusion-based identification system is implemented using GMM and MFCC for speaker identification and PCA for face recognition. The results of speaker identification and face recognition are fused at different levels, namely the feature, score and decision levels. Both the score-level and decision-level (with OR voting) fusions were shown to outperform the feature-level fusion in terms of accuracy and error resilience. This result is in line with the distinct nature of the two modalities, which is lost when they are combined at the feature level. The GRID and VidTIMIT test results validate that the proposed scheme is one of the best candidates for the fusion of face and voice due to its low computational time and high recognition accuracy
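The score-level fusion that the abstract reports as a winner can be illustrated with a small sketch: normalise each modality's per-identity scores to a common range, then combine them with a weighted sum and take the argmax. The min-max normalisation and the weight value are assumptions made for illustration, not the thesis's exact scheme.

```python
import numpy as np

def score_level_fusion(voice_scores, face_scores, w_voice=0.5):
    """Score-level audio-visual fusion: min-max normalise each
    modality's per-identity scores, then combine with a weighted sum.
    Returns the fused best identity and the fused score vector.
    """
    def minmax(s):
        s = np.asarray(s, dtype=float)
        rng = s.max() - s.min()
        return (s - s.min()) / rng if rng > 0 else np.zeros_like(s)

    fused = w_voice * minmax(voice_scores) + (1 - w_voice) * minmax(face_scores)
    return int(np.argmax(fused)), fused

# toy example: voice strongly favours identity 2, face mildly favours identity 1
best, fused = score_level_fusion([0.1, 0.3, 0.9], [0.2, 0.8, 0.6], w_voice=0.6)
print(best)  # 2
```

Decision-level fusion, by contrast, would combine only the two argmax decisions (e.g. with the OR voting mentioned above), discarding the score magnitudes.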
DNN-based Acoustic Modeling for Robust Speech Recognition
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Dept. of Electrical and Computer Engineering, February 2019.
In this thesis, we propose three acoustic modeling techniques for robust automatic speech recognition (ASR). First, we propose a DNN-based acoustic modeling technique that makes the best use of the inherent noise robustness of DNNs by means of auxiliary feature vectors. With this technique, the DNN learns the complicated relationship among the noisy speech, clean speech, noise estimate and phonetic target more smoothly. The proposed method substantially outperformed noise-aware training (NAT), the conventional auxiliary-feature-based model adaptation technique, on the Aurora-5 DB.
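As a rough illustration of how auxiliary noise features enter the network input in NAT-style training, the sketch below appends a fixed per-utterance noise estimate to every noisy frame. The mean-of-leading-frames noise estimator and all names are assumptions for illustration, not the thesis's procedure.

```python
import numpy as np

def nat_input(noisy_feats, n_context=5):
    """Build noise-aware-training (NAT) style DNN inputs: append a fixed
    per-utterance noise estimate to every noisy feature frame.

    The noise estimate here is the mean of the first few frames, assumed
    to contain no speech -- a common crude estimator.
    """
    noise_est = noisy_feats[:n_context].mean(axis=0)   # (D,) noise estimate
    tiled = np.tile(noise_est, (len(noisy_feats), 1))  # (T, D) repeated per frame
    return np.hstack([noisy_feats, tiled])             # (T, 2D) augmented input

feats = np.random.randn(50, 40)   # 50 frames of 40-dim log-Mel features
augmented = nat_input(feats)
print(augmented.shape)  # (50, 80)
```

The acoustic model then sees both the noisy observation and an explicit cue about the corrupting noise, which is the relationship the first proposed technique aims to exploit more fully.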
The second method is a multi-channel feature enhancement technique. In the typical multi-channel speech recognition scenario, an enhanced single speech source is extracted from the multiple inputs using beamforming, a conventional signal-processing technique, and recognition is performed by feeding that source into the acoustic model. We propose a multi-channel feature enhancement DNN algorithm that properly combines the delay-and-sum (DS) beamformer, one of the most basic conventional beamforming techniques, with a DNN. Through experiments on the multichannel Wall Street Journal audio-visual (MC-WSJ-AV) corpus, it has been shown that the proposed method outperforms the conventional multi-channel feature enhancement techniques.
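The delay-and-sum step itself is simple to sketch: shift each microphone channel so the target source is time-aligned, then average. This is a generic time-domain illustration assuming integer sample delays computed elsewhere (e.g. by cross-correlation), not the thesis's implementation.

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Delay-and-sum beamforming on time-domain signals.

    channels: (M, T) array, one row per microphone
    delays:   length-M integer sample delays that time-align each
              channel with the reference channel

    Each channel is advanced by its delay and the aligned channels are
    averaged, reinforcing the target and averaging out uncorrelated noise.
    """
    M = len(channels)
    out = np.zeros(channels.shape[1])
    for ch, d in zip(channels, delays):
        out += np.roll(ch, -d)   # advance channel by d samples (circular for brevity)
    return out / M

# toy example: the same unit pulse arrives 0, 2 and 4 samples late on 3 mics
x = np.zeros((3, 10))
for m, d in enumerate([0, 2, 4]):
    x[m, 3 + d] = 1.0
y = delay_and_sum(x, delays=[0, 2, 4])
print(y[3])  # 1.0 -- the aligned pulses add coherently
```

The proposed method feeds such beamformed output together with the raw channels into a jointly trained DNN rather than using the DS output alone.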
Finally, an uncertainty-aware training (UAT) technique is proposed. Most existing DNN-based techniques, including those introduced above, use deterministic point estimates of each network's targets (e.g., clean features and acoustic model parameters), which raises an uncertainty, or reliability, issue for those estimates. To overcome this issue, UAT employs a modified structure of the variational autoencoder (VAE), a neural network model which learns and performs stochastic variational inference (VIF). UAT models the robust latent variables, which mediate the mapping between the noisy observed features and the phonetic targets, using the distributional information of the clean feature estimates. The latent variables are trained according to a maximum-likelihood criterion derived from an uncertainty decoding (UD) framework adapted to deep-learning-based acoustic models. The proposed technique substantially outperforms the conventional DNN-based techniques on the Aurora-4 and CHiME-4 databases.
Abstract i
Contents iv
List of Figures ix
List of Tables xiii
1 Introduction 1
2 Background 9
2.1 Deep Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Experimental Database . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.1 Aurora-4 DB . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.2 Aurora-5 DB . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.3 MC-WSJ-AV DB . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.4 CHiME-4 DB . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3 Two-stage Noise-aware Training for Environment-robust Speech Recognition 25
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Noise-aware Training . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3 Two-stage NAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3.1 Lower DNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3.2 Upper DNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.3 Joint Training . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.1 GMM-HMM System . . . . . . . . . . . . . . . . . . . . . . . 37
3.4.2 Training and Structures of DNN-based Techniques . . . . . . 37
3.4.3 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . 40
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4 DNN-based Feature Enhancement for Robust Multichannel Speech Recognition 45
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2 Observation Model in Multi-Channel Reverberant Noisy Environment 49
4.3 Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3.1 Lower DNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3.2 Upper DNN and Joint Training . . . . . . . . . . . . . . . . . 54
4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4.1 Recognition System and Feature Extraction . . . . . . . . . . 56
4.4.2 Training and Structures of DNN-based Techniques . . . . . . 58
4.4.3 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . 62
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5 Uncertainty-aware Training for DNN-HMM System using Variational Inference 67
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.2 Uncertainty Decoding for Noise Robustness . . . . . . . . . . . . . . 72
5.3 Variational Autoencoder . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.4 VIF-based uncertainty-aware Training . . . . . . . . . . . . . . . . . 83
5.4.1 Clean Uncertainty Network . . . . . . . . . . . . . . . . . . . 91
5.4.2 Environment Uncertainty Network . . . . . . . . . . . . . . . 93
5.4.3 Prediction Network and Joint Training . . . . . . . . . . . . . 95
5.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.5.1 Experimental Setup: Feature Extraction and ASR System . . 96
5.5.2 Network Structures . . . . . . . . . . . . . . . . . . . . . . . . 98
5.5.3 Effects of CUN on the Noise Robustness . . . . . . . . . . . . 104
5.5.4 Uncertainty Representation in Different SNR Condition . . . 105
5.5.5 Result of Speech Recognition . . . . . . . . . . . . . . . . . . 112
5.5.6 Result of Speech Recognition with LSTM-HMM . . . . . . . 114
5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6 Conclusions 127
Bibliography 131
Abstract (in Korean) 145
- โฆ