305 research outputs found
Independent Component Analysis and Time-Frequency Masking for Speech Recognition in Multitalker Conditions
When a number of speakers are simultaneously active, for example in meetings or noisy public places, the sources of interest need to be separated from interfering speakers and from each other in order to be robustly recognized. Independent component analysis (ICA) has proven a valuable tool for this purpose. However, ICA outputs can still contain strong residual components of the interfering speakers whenever noise or reverberation is high. In such cases, nonlinear postprocessing can be applied to the ICA outputs to reduce the remaining interference. To improve robustness to the artefacts and loss of information caused by this process, recognition can be greatly enhanced by treating the processed speech feature vector as a random variable with time-varying uncertainty rather than as deterministic. The aim of this paper is to show the potential to improve recognition of multiple overlapping speech signals through nonlinear postprocessing together with uncertainty-based decoding techniques.
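As a rough illustration of the nonlinear postprocessing step, the sketch below applies a binary time-frequency mask to two separated spectrograms: each time-frequency bin is kept only in the output where its magnitude dominates. This is one common masking choice, assumed here for illustration; the abstract does not state which mask the paper uses, and the function name `binary_tf_mask` and the `floor` parameter are hypothetical.

```python
import numpy as np

def binary_tf_mask(spec_a, spec_b, floor=0.0):
    """Apply a binary time-frequency mask to two separated spectrograms.

    spec_a, spec_b: complex STFT arrays of equal shape (freq_bins, frames),
    for example the two outputs of an ICA stage (an assumption of this sketch).
    Each bin is kept in the output whose magnitude dominates and is scaled
    by `floor` in the other output.
    """
    spec_a = np.asarray(spec_a)
    spec_b = np.asarray(spec_b)
    if spec_a.shape != spec_b.shape:
        raise ValueError("spectrograms must have the same shape")
    a_dominates = np.abs(spec_a) >= np.abs(spec_b)
    mask_a = np.where(a_dominates, 1.0, floor)
    mask_b = np.where(a_dominates, floor, 1.0)
    return mask_a * spec_a, mask_b * spec_b
```

Hard masking of this kind removes residual interference but also discards information in the masked bins, which is the kind of artefact that uncertainty-based decoding is meant to compensate for.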
Deep Learning for Distant Speech Recognition
Deep learning is an emerging technology that is considered one of the most
promising directions for reaching higher levels of artificial intelligence.
Among the other achievements, building computers that understand speech
represents a crucial leap towards intelligent machines. Despite the great
efforts of the past decades, however, a natural and robust human-machine speech
interaction still appears to be out of reach, especially when users interact
with a distant microphone in noisy and reverberant environments. The latter
disturbances severely hamper the intelligibility of a speech signal, making
Distant Speech Recognition (DSR) one of the major open challenges in the field.
This thesis addresses the latter scenario and proposes some novel techniques,
architectures, and algorithms to improve the robustness of distant-talking
acoustic models. We first elaborate on methodologies for realistic data
contamination, with a particular emphasis on DNN training with simulated data.
We then investigate approaches for better exploiting speech contexts,
proposing some original methodologies for both feed-forward and recurrent
neural networks. Lastly, inspired by the idea that cooperation across different
DNNs could be the key for counteracting the harmful effects of noise and
reverberation, we propose a novel deep learning paradigm called network of deep
neural networks. The analysis of the original concepts was based on extensive
experimental validations conducted on both real and simulated data, considering
different corpora, microphone configurations, environments, noisy conditions,
and ASR tasks.
Comment: PhD Thesis Unitn, 201
Sparse and Low-rank Modeling for Automatic Speech Recognition
This thesis deals with exploiting the low-dimensional multi-subspace structure of speech towards the goal of improving acoustic modeling for automatic speech recognition (ASR). Leveraging the parsimonious hierarchical nature of speech, we hypothesize that whenever a speech signal is measured in a high-dimensional feature space, the true class information is embedded in low-dimensional subspaces whereas noise is scattered as random high-dimensional erroneous estimations in the features. In this context, the contribution of this thesis is twofold: (i) identify sparse and low-rank modeling approaches as excellent tools for extracting the class-specific low-dimensional subspaces in speech features, and (ii) employ these tools under novel ASR frameworks to enrich the acoustic information present in the speech features towards the goal of improving ASR. Techniques developed in this thesis focus on deep neural network (DNN) based posterior features which, under the sparse and low-rank modeling approaches, unveil the underlying class-specific low-dimensional subspaces very elegantly.
In this thesis, we tackle ASR tasks of varying difficulty, ranging from isolated word recognition (IWR) and connected digit recognition (CDR) to large-vocabulary continuous speech recognition (LVCSR). For IWR and CDR, we propose a novel compressive sensing (CS) perspective towards ASR. Here, exemplar-based speech recognition is posed as a problem of recovering sparse high-dimensional word representations from compressed low-dimensional phonetic representations. In the context of LVCSR, this thesis argues that, despite their power in representation learning, DNN-based acoustic models still have room for improvement in exploiting the union-of-low-dimensional-subspaces structure of speech data. Therefore, this thesis proposes to enhance DNN posteriors by projecting them onto the manifolds of the underlying classes using principal component analysis (PCA) or compressive-sensing-based dictionaries. Projected posteriors are shown to be more accurate training targets for learning better acoustic models, resulting in improved ASR performance. The proposed approach is evaluated in both close-talk and far-field conditions, confirming the importance of sparse and low-rank modeling of speech in building a robust ASR framework. Finally, the conclusions of this thesis are further consolidated by an information-theoretic analysis that explicitly quantifies the contribution of the proposed techniques to improving ASR.
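The low-rank projection idea can be sketched as follows: feature vectors assigned to one class are projected onto the subspace spanned by their leading principal components, and the remainder is discarded as noise. This is a minimal sketch that assumes plain PCA applied directly to a feature matrix; the thesis may work in a different domain (for example transformed posteriors) and with class-specific models, and the function name `pca_project` is hypothetical.

```python
import numpy as np

def pca_project(features, rank):
    """Project the rows of `features` onto their top-`rank` principal subspace.

    features: array of shape (num_frames, dim), e.g. posterior vectors that
    belong to one class (an assumption of this sketch).
    Returns the low-rank reconstruction with the same shape.
    """
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0, keepdims=True)
    centered = features - mean
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:rank]  # shape (rank, dim)
    return centered @ basis.T @ basis + mean
```

Choosing `rank` well below the feature dimension keeps the dominant class structure and drops the directions with little variance.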
Automatic speech recognition: from study to practice
Today, automatic speech recognition (ASR) is widely used in areas such as robotics, multimedia, and medical and industrial applications. Although much research has been carried out in this field over the past decades, there is still considerable room for further work. Starting work in this area requires a thorough knowledge of ASR systems as well as their weak points and problems. In addition, practical experience reliably deepens the understanding of the theory. With this in mind, in this master's thesis we first review the principal structure of standard HMM-based ASR systems from a technical point of view. This includes feature extraction, acoustic modeling, language modeling, and decoding. We then discuss the most significant challenges in ASR systems. These challenges concern the characteristics of internal components or external factors that affect ASR system performance. Furthermore, we have implemented a Spanish-language recognizer using the HTK toolkit. Finally, two open research lines, drawn from the study of different sources in the field of ASR, are suggested for future work.
Improving Dysarthric Speech Recognition by Enriching Training Datasets
Dysarthria is a motor speech disorder that results from disruptions in the neuro-motor interface. It is characterised by poor articulation of phonemes and hyper-nasality, and it differs characteristically from normal speech. Many modern automatic speech recognition systems focus on a narrow range of speech diversity; as a consequence, when training datasets are built, they exclude groups of speakers who deviate in gender, race, age, or speech impairment. This study attempts to develop an automatic speech recognition system that handles dysarthric speech using limited dysarthric speech data. Speech utterances collected from the TORGO database are used to conduct experiments on a wav2vec2.0 model trained only on the Librispeech 960h dataset, to obtain a baseline word error rate (WER) for recognising dysarthric speech. A version of the Librispeech model fine-tuned on multi-language datasets was tested to see whether it would improve accuracy; it achieved a top reduction of 24.15% in the WER for one of the male dysarthric speakers in the dataset. Transfer learning with speech recognition models, and preprocessing dysarthric speech with generative adversarial networks to improve its intelligibility, were limited in their potential by the lack of a dysarthric speech dataset large enough for these technologies. The main conclusion drawn from this study is that a large, diverse dysarthric speech dataset, comparable in size to datasets such as Librispeech that are used to train machine learning ASR systems and covering different types of speech, both scripted and unscripted, is required to improve performance.
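For reference, the WER reported above is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the recogniser output, divided by the number of reference words. The sketch below shows that standard definition; the function name `word_error_rate` is hypothetical, and any text normalisation applied in the study is not reproduced here.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion against six reference words. Because insertions are counted, WER can exceed 100% on heavily misrecognised speech.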
DNN-based Acoustic Modeling for Robust Speech Recognition
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2019. 2. Nam Soo Kim.
In this thesis, we propose three DNN-based acoustic modeling techniques for robust automatic speech recognition (ASR). The first is an acoustic modeling technique that makes the best use of the DNN's inherent noise robustness through auxiliary feature vectors. With this technique, the DNN learns the complicated relationship among the noisy speech, the clean speech, the noise estimate, and the phonetic target more smoothly. On the Aurora-5 DB, the proposed method outperformed noise-aware training (NAT), the conventional model adaptation technique based on auxiliary noise features.
The second method is a multi-channel feature enhancement technique. In the general multi-channel speech recognition scenario, a single enhanced speech signal is extracted from the multiple inputs by beamforming, a conventional signal-processing technique, and speech recognition is performed by feeding that signal into the acoustic model. We propose a multi-channel feature enhancement DNN algorithm that combines the DNN with the delay-and-sum (DS) beamformer, one of the most basic conventional beamforming techniques. Through joint training that uses intermediate feature vectors, the proposed DNN effectively represents the relationship between the distorted multi-channel input speech signals and the clean speech signal. Experiments on the multichannel Wall Street Journal audio visual (MC-WSJ-AV) corpus show that the proposed method outperforms conventional multi-channel feature enhancement techniques.
Finally, we propose an uncertainty-aware training (UAT) technique. Most existing DNN-based techniques for robust speech recognition, including those introduced above, optimize point estimates of the targets (e.g., clean features and acoustic model parameters); that is, they estimate deterministically. This raises concerns about the uncertainty and reliability of the estimates. To overcome this issue, UAT employs a modified structure of the variational autoencoder (VAE), a neural network model that learns and performs stochastic variational inference (VIF). UAT models the robust latent variables that mediate the mapping between the noisy observed features and the phonetic target, using the distributional information of the clean feature estimates. The latent variables of UAT are trained according to a maximum likelihood criterion derived from an uncertainty decoding (UD) framework optimized for deep-learning-based acoustic models. The proposed technique outperforms conventional DNN-based techniques on the Aurora-4 and CHiME-4 databases.
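The delay-and-sum (DS) beamformer mentioned in the abstract can be sketched in a few lines: each microphone channel is shifted to undo its arrival delay, and the aligned channels are averaged so that the target adds coherently while uncorrelated noise is attenuated. The sketch covers only the DS stage, not the DNN combined with it in the thesis, and it assumes the delays are already known and are whole numbers of samples; the function name `delay_and_sum` is hypothetical.

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Align each channel by its integer sample delay and average the result.

    channels: array of shape (num_mics, num_samples).
    delays: non-negative integer arrival delays in samples, one per
            microphone, relative to the earliest microphone (assumed known).
    Returns the beamformed signal, shortened by the largest delay.
    """
    channels = np.asarray(channels, dtype=float)
    delays = np.asarray(delays, dtype=int)
    if channels.ndim != 2 or delays.shape != (channels.shape[0],):
        raise ValueError("need one delay per microphone channel")
    if np.any(delays < 0):
        raise ValueError("delays must be non-negative")
    out_len = channels.shape[1] - int(delays.max())
    if out_len <= 0:
        raise ValueError("delays exceed the signal length")
    # Skip the first `d` samples of each channel so all channels line up.
    aligned = np.stack([channels[m, d:d + out_len] for m, d in enumerate(delays)])
    return aligned.mean(axis=0)
```

In practice the delays would be estimated from the recordings, and fractional delays would need interpolation or a frequency-domain phase shift.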
Contents
List of Figures
List of Tables
1 Introduction
2 Background
2.1 Deep Neural Networks
2.2 Experimental Database
2.2.1 Aurora-4 DB
2.2.2 Aurora-5 DB
2.2.3 MC-WSJ-AV DB
2.2.4 CHiME-4 DB
3 Two-stage Noise-aware Training for Environment-robust Speech Recognition
3.1 Introduction
3.2 Noise-aware Training
3.3 Two-stage NAT
3.3.1 Lower DNN
3.3.2 Upper DNN
3.3.3 Joint Training
3.4 Experiments
3.4.1 GMM-HMM System
3.4.2 Training and Structures of DNN-based Techniques
3.4.3 Performance Evaluation
3.5 Summary
4 DNN-based Feature Enhancement for Robust Multichannel Speech Recognition
4.1 Introduction
4.2 Observation Model in Multi-Channel Reverberant Noisy Environment
4.3 Proposed Approach
4.3.1 Lower DNN
4.3.2 Upper DNN and Joint Training
4.4 Experiments
4.4.1 Recognition System and Feature Extraction
4.4.2 Training and Structures of DNN-based Techniques
4.4.3 Dropout
4.4.4 Performance Evaluation
4.5 Summary
5 Uncertainty-aware Training for DNN-HMM System using Variational Inference
5.1 Introduction
5.2 Uncertainty Decoding for Noise Robustness
5.3 Variational Autoencoder
5.4 VIF-based Uncertainty-aware Training
5.4.1 Clean Uncertainty Network
5.4.2 Environment Uncertainty Network
5.4.3 Prediction Network and Joint Training
5.5 Experiments
5.5.1 Experimental Setup: Feature Extraction and ASR System
5.5.2 Network Structures
5.5.3 Effects of CUN on the Noise Robustness
5.5.4 Uncertainty Representation in Different SNR Condition
5.5.5 Result of Speech Recognition
5.5.6 Result of Speech Recognition with LSTM-HMM
5.6 Summary
6 Conclusions
Bibliography
Abstract (in Korean)
Docto
- …