643 research outputs found

    A Bayesian Network View on Acoustic Model-Based Techniques for Robust Speech Recognition

    Full text link
    This article provides a unifying Bayesian network view on various approaches for acoustic model adaptation, missing feature, and uncertainty decoding that are well-known in the literature of robust automatic speech recognition. The representatives of these classes can often be deduced from a Bayesian network that extends the conventional hidden Markov models used in speech recognition. These extensions, in turn, can in many cases be motivated from an underlying observation model that relates clean and distorted feature vectors. By converting the observation models into a Bayesian network representation, we formulate the corresponding compensation rules leading to a unified view on known derivations as well as to new formulations for certain approaches. The generic Bayesian perspective provided in this contribution thus highlights structural differences and similarities between the analyzed approaches

    Spatial Diffuseness Features for DNN-Based Speech Recognition in Noisy and Reverberant Environments

    Full text link
    We propose a spatial diffuseness feature for deep neural network (DNN)-based automatic speech recognition to improve recognition accuracy in reverberant and noisy environments. The feature is computed in real-time from multiple microphone signals without requiring knowledge or estimation of the direction of arrival, and represents the relative amount of diffuse noise in each time and frequency bin. It is shown that using the diffuseness feature as an additional input to a DNN-based acoustic model leads to a reduced word error rate for the REVERB challenge corpus, both compared to logmelspec features extracted from noisy signals, and features enhanced by spectral subtraction.Comment: accepted for ICASSP201

    Joint Uncertainty Decoding with Unscented Transform for Noise Robust Subspace Gaussian Mixture Models

    Get PDF
    Common noise compensation techniques use vector Taylor series (VTS) to approximate the mismatch function. Recent work shows that the approximation accuracy may be improved by sampling. One such sampling technique is the unscented transform (UT), which draws samples deterministically from clean speech and noise model to derive the noise corrupted speech parameters. This paper applies UT to noise compensation of the subspace Gaussian mixture model (SGMM). Since UT requires relatively smaller number of samples for accurate estimation, it has significantly lower computational cost compared to other random sampling techniques. However, the number of surface Gaussians in an SGMM is typically very large, making the direct application of UT, for compensating individual Gaussian components, computationally impractical. In this paper, we avoid the computational burden by employing UT in the framework of joint uncertainty decoding (JUD), which groups all the Gaussian components into small number of classes, sharing the compensation parameters by class. We evaluate the JUD-UT technique for an SGMM system using the Aurora 4 corpus. Experimental results indicate that UT can lead to increased accuracy compared to VTS approximation if the JUD phase factor is untuned, and to similar accuracy if the phase factor is tuned empirically. 1

    Environmentally robust ASR front-end for deep neural network acoustic models

    Get PDF
    This paper examines the individual and combined impacts of various front-end approaches on the performance of deep neural network (DNN) based speech recognition systems in distant talking situations, where acoustic environmental distortion degrades the recognition performance. Training of a DNN-based acoustic model consists of generation of state alignments followed by learning the network parameters. This paper first shows that the network parameters are more sensitive to the speech quality than the alignments and thus this stage requires improvement. Then, various front-end robustness approaches to addressing this problem are categorised based on functionality. The degree to which each class of approaches impacts the performance of DNN-based acoustic models is examined experimentally. Based on the results, a front-end processing pipeline is proposed for efficiently combining different classes of approaches. Using this front-end, the combined effects of different classes of approaches are further evaluated in a single distant microphone-based meeting transcription task with both speaker independent (SI) and speaker adaptive training (SAT) set-ups. By combining multiple speech enhancement results, multiple types of features, and feature transformation, the front-end shows relative performance gains of 7.24% and 9.83% in the SI and SAT scenarios, respectively, over competitive DNN-based systems using log mel-filter bank features.This is the final version of the article. It first appeared from Elsevier via http://dx.doi.org/10.1016/j.csl.2014.11.00

    ๊ฐ•์ธํ•œ ์Œ์„ฑ์ธ์‹์„ ์œ„ํ•œ DNN ๊ธฐ๋ฐ˜ ์Œํ–ฅ ๋ชจ๋ธ๋ง

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ)-- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2019. 2. ๊น€๋‚จ์ˆ˜.๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๊ฐ•์ธํ•œ ์Œ์„ฑ์ธ์‹์„ ์œ„ํ•ด์„œ DNN์„ ํ™œ์šฉํ•œ ์Œํ–ฅ ๋ชจ๋ธ๋ง ๊ธฐ๋ฒ•๋“ค์„ ์ œ์•ˆํ•œ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ํฌ๊ฒŒ ์„ธ ๊ฐ€์ง€์˜ DNN ๊ธฐ๋ฐ˜ ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. ์ฒซ ๋ฒˆ์งธ๋Š” DNN์ด ๊ฐ€์ง€๊ณ  ์žˆ๋Š” ์žก์Œ ํ™˜๊ฒฝ์— ๋Œ€ํ•œ ๊ฐ•์ธํ•จ์„ ๋ณด์กฐ ํŠน์ง• ๋ฒกํ„ฐ๋“ค์„ ํ†ตํ•˜์—ฌ ์ตœ๋Œ€๋กœ ํ™œ์šฉํ•˜๋Š” ์Œํ–ฅ ๋ชจ๋ธ๋ง ๊ธฐ๋ฒ•์ด๋‹ค. ์ด๋Ÿฌํ•œ ๊ธฐ๋ฒ•์„ ํ†ตํ•˜์—ฌ DNN์€ ์™œ๊ณก๋œ ์Œ์„ฑ, ๊นจ๋—ํ•œ ์Œ์„ฑ, ์žก์Œ ์ถ”์ •์น˜, ๊ทธ๋ฆฌ๊ณ  ์Œ์†Œ ํƒ€๊ฒŸ๊ณผ์˜ ๋ณต์žกํ•œ ๊ด€๊ณ„๋ฅผ ๋ณด๋‹ค ์›ํ™œํ•˜๊ฒŒ ํ•™์Šตํ•˜๊ฒŒ ๋œ๋‹ค. ๋ณธ ๊ธฐ๋ฒ•์€ Aurora-5 DB ์—์„œ ๊ธฐ์กด์˜ ๋ณด์กฐ ์žก์Œ ํŠน์ง• ๋ฒกํ„ฐ๋ฅผ ํ™œ์šฉํ•œ ๋ชจ๋ธ ์ ์‘ ๊ธฐ๋ฒ•์ธ ์žก์Œ ์ธ์ง€ ํ•™์Šต (noise-aware training, NAT) ๊ธฐ๋ฒ•์„ ํฌ๊ฒŒ ๋›ฐ์–ด๋„˜๋Š” ์„ฑ๋Šฅ์„ ๋ณด์˜€๋‹ค. ๋‘ ๋ฒˆ์งธ๋Š” DNN์„ ํ™œ์šฉํ•œ ๋‹ค ์ฑ„๋„ ํŠน์ง• ํ–ฅ์ƒ ๊ธฐ๋ฒ•์ด๋‹ค. ๊ธฐ์กด์˜ ๋‹ค ์ฑ„๋„ ์‹œ๋‚˜๋ฆฌ์˜ค์—์„œ๋Š” ์ „ํ†ต์ ์ธ ์‹ ํ˜ธ ์ฒ˜๋ฆฌ ๊ธฐ๋ฒ•์ธ ๋น”ํฌ๋ฐ ๊ธฐ๋ฒ•์„ ํ†ตํ•˜์—ฌ ํ–ฅ์ƒ๋œ ๋‹จ์ผ ์†Œ์Šค ์Œ์„ฑ ์‹ ํ˜ธ๋ฅผ ์ถ”์ถœํ•˜๊ณ  ๊ทธ๋ฅผ ํ†ตํ•˜์—ฌ ์Œ์„ฑ์ธ์‹์„ ์ˆ˜ํ–‰ํ•œ๋‹ค. ์šฐ๋ฆฌ๋Š” ๊ธฐ์กด์˜ ๋น”ํฌ๋ฐ ์ค‘์—์„œ ๊ฐ€์žฅ ๊ธฐ๋ณธ์  ๊ธฐ๋ฒ• ์ค‘ ํ•˜๋‚˜์ธ delay-and-sum (DS) ๋น”ํฌ๋ฐ ๊ธฐ๋ฒ•๊ณผ DNN์„ ๊ฒฐํ•ฉํ•œ ๋‹ค ์ฑ„๋„ ํŠน์ง• ํ–ฅ์ƒ ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. ์ œ์•ˆํ•˜๋Š” DNN์€ ์ค‘๊ฐ„ ๋‹จ๊ณ„ ํŠน์ง• ๋ฒกํ„ฐ๋ฅผ ํ™œ์šฉํ•œ ๊ณต๋™ ํ•™์Šต ๊ธฐ๋ฒ•์„ ํ†ตํ•˜์—ฌ ์™œ๊ณก๋œ ๋‹ค ์ฑ„๋„ ์ž…๋ ฅ ์Œ์„ฑ ์‹ ํ˜ธ๋“ค๊ณผ ๊นจ๋—ํ•œ ์Œ์„ฑ ์‹ ํ˜ธ์™€์˜ ๊ด€๊ณ„๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ํ‘œํ˜„ํ•œ๋‹ค. ์ œ์•ˆ๋œ ๊ธฐ๋ฒ•์€ multichannel wall street journal audio visual (MC-WSJAV) corpus์—์„œ์˜ ์‹คํ—˜์„ ํ†ตํ•˜์—ฌ, ๊ธฐ์กด์˜ ๋‹ค์ฑ„๋„ ํ–ฅ์ƒ ๊ธฐ๋ฒ•๋“ค๋ณด๋‹ค ๋›ฐ์–ด๋‚œ ์„ฑ๋Šฅ์„ ๋ณด์ž„์„ ํ™•์ธํ•˜์˜€๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ, ๋ถˆํ™•์ •์„ฑ ์ธ์ง€ ํ•™์Šต (Uncertainty-aware training, UAT) ๊ธฐ๋ฒ•์ด๋‹ค. ์œ„์—์„œ ์†Œ๊ฐœ๋œ ๊ธฐ๋ฒ•๋“ค์„ ํฌํ•จํ•˜์—ฌ ๊ฐ•์ธํ•œ ์Œ์„ฑ์ธ์‹์„ ์œ„ํ•œ ๊ธฐ์กด์˜ DNN ๊ธฐ๋ฐ˜ ๊ธฐ๋ฒ•๋“ค์€ ๊ฐ๊ฐ์˜ ๋„คํŠธ์›Œํฌ์˜ ํƒ€๊ฒŸ์„ ์ถ”์ •ํ•˜๋Š”๋ฐ ์žˆ์–ด์„œ ๊ฒฐ์ •๋ก ์ ์ธ ์ถ”์ • ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•œ๋‹ค. ์ด๋Š” ์ถ”์ •์น˜์˜ ๋ถˆํ™•์ •์„ฑ ๋ฌธ์ œ ํ˜น์€ ์‹ ๋ขฐ๋„ ๋ฌธ์ œ๋ฅผ ์•ผ๊ธฐํ•œ๋‹ค. ์ด๋Ÿฌํ•œ ๋ฌธ์ œ์ ์„ ๊ทน๋ณตํ•˜๊ธฐ ์œ„ํ•˜์—ฌ ์ œ์•ˆํ•˜๋Š” UAT ๊ธฐ๋ฒ•์€ ํ™•๋ฅ ๋ก ์ ์ธ ๋ณ€ํ™” ์ถ”์ •์„ ํ•™์Šตํ•˜๊ณ  ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋Š” ๋‰ด๋Ÿด ๋„คํŠธ์›Œํฌ ๋ชจ๋ธ์ธ ๋ณ€ํ™” ์˜คํ† ์ธ์ฝ”๋” (variational autoencoder, VAE) ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•œ๋‹ค. UAT๋Š” ์™œ๊ณก๋œ ์Œ์„ฑ ํŠน์ง• ๋ฒกํ„ฐ์™€ ์Œ์†Œ ํƒ€๊ฒŸ๊ณผ์˜ ๊ด€๊ณ„๋ฅผ ๋งค๊ฐœํ•˜๋Š” ๊ฐ•์ธํ•œ ์€๋‹‰ ๋ณ€์ˆ˜๋ฅผ ๊นจ๋—ํ•œ ์Œ์„ฑ ํŠน์ง• ๋ฒกํ„ฐ ์ถ”์ •์น˜์˜ ๋ถ„ํฌ ์ •๋ณด๋ฅผ ์ด์šฉํ•˜์—ฌ ๋ชจ๋ธ๋งํ•œ๋‹ค. UAT์˜ ์€๋‹‰ ๋ณ€์ˆ˜๋“ค์€ ๋”ฅ ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜ ์Œํ–ฅ ๋ชจ๋ธ์— ์ตœ์ ํ™”๋œ uncertainty decoding (UD) ํ”„๋ ˆ์ž„์›Œํฌ๋กœ๋ถ€ํ„ฐ ์œ ๋„๋œ ์ตœ๋Œ€ ์šฐ๋„ ๊ธฐ์ค€์— ๋”ฐ๋ผ์„œ ํ•™์Šต๋œ๋‹ค. ์ œ์•ˆ๋œ ๊ธฐ๋ฒ•์€ Aurora-4 DB์™€ CHiME-4 DB์—์„œ ๊ธฐ์กด์˜ DNN ๊ธฐ๋ฐ˜ ๊ธฐ๋ฒ•๋“ค์„ ํฌ๊ฒŒ ๋›ฐ์–ด๋„˜๋Š” ์„ฑ๋Šฅ์„ ๋ณด์˜€๋‹ค.In this thesis, we propose three acoustic modeling techniques for robust automatic speech recognition (ASR). Firstly, we propose a DNN-based acoustic modeling technique which makes the best use of the inherent noise-robustness of DNN is proposed. By applying this technique, the DNN can automatically learn the complicated relationship among the noisy, clean speech and noise estimate to phonetic target smoothly. The proposed method outperformed noise-aware training (NAT), i.e., the conventional auxiliary-feature-based model adaptation technique in Aurora-5 DB. The second method is multi-channel feature enhancement technique. In the general multi-channel speech recognition scenario, the enhanced single speech signal source is extracted from the multiple inputs using beamforming, i.e., the conventional signal-processing-based technique and the speech recognition process is performed by feeding that source into the acoustic model. We propose the multi-channel feature enhancement DNN algorithm by properly combining the delay-and-sum (DS) beamformer, which is one of the conventional beamforming techniques and DNN. Through the experiments using multichannel wall street journal audio visual (MC-WSJ-AV) corpus, it has been shown that the proposed method outperformed the conventional multi-channel feature enhancement techniques. Finally, uncertainty-aware training (UAT) technique is proposed. The most of the existing DNN-based techniques including the techniques introduced above, aim to optimize the point estimates of the targets (e.g., clean features, and acoustic model parameters). This tampers with the reliability of the estimates. In order to overcome this issue, UAT employs a modified structure of variational autoencoder (VAE), a neural network model which learns and performs stochastic variational inference (VIF). UAT models the robust latent variables which intervene the mapping between the noisy observed features and the phonetic target using the distributive information of the clean feature estimates. The proposed technique outperforms the conventional DNN-based techniques on Aurora-4 and CHiME-4 databases.Abstract i Contents iv List of Figures ix List of Tables xiii 1 Introduction 1 2 Background 9 2.1 Deep Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2 Experimental Database . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2.1 Aurora-4 DB . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2.2 Aurora-5 DB . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.2.3 MC-WSJ-AV DB . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.2.4 CHiME-4 DB . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3 Two-stage Noise-aware Training for Environment-robust Speech Recognition 25 iii 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.2 Noise-aware Training . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.3 Two-stage NAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.3.1 Lower DNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.3.2 Upper DNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.3.3 Joint Training . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.4.1 GMM-HMM System . . . . . . . . . . . . . . . . . . . . . . . 37 3.4.2 Training and Structures of DNN-based Techniques . . . . . . 37 3.4.3 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . 40 3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4 DNN-based Feature Enhancement for Robust Multichannel Speech Recognition 45 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.2 Observation Model in Multi-Channel Reverberant Noisy Environment 49 4.3 Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.3.1 Lower DNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.3.2 Upper DNN and Joint Training . . . . . . . . . . . . . . . . . 54 4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.4.1 Recognition System and Feature Extraction . . . . . . . . . . 56 4.4.2 Training and Structures of DNN-based Techniques . . . . . . 58 4.4.3 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.4.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . 62 4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 iv 5 Uncertainty-aware Training for DNN-HMM System using Varia- tional Inference 67 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 5.2 Uncertainty Decoding for Noise Robustness . . . . . . . . . . . . . . 72 5.3 Variational Autoencoder . . . . . . . . . . . . . . . . . . . . . . . . . 77 5.4 VIF-based uncertainty-aware Training . . . . . . . . . . . . . . . . . 83 5.4.1 Clean Uncertainty Network . . . . . . . . . . . . . . . . . . . 91 5.4.2 Environment Uncertainty Network . . . . . . . . . . . . . . . 93 5.4.3 Prediction Network and Joint Training . . . . . . . . . . . . . 95 5.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 5.5.1 Experimental Setup: Feature Extraction and ASR System . . 96 5.5.2 Network Structures . . . . . . . . . . . . . . . . . . . . . . . . 98 5.5.3 Eects of CUN on the Noise Robustness . . . . . . . . . . . . 104 5.5.4 Uncertainty Representation in Dierent SNR Condition . . . 105 5.5.5 Result of Speech Recognition . . . . . . . . . . . . . . . . . . 112 5.5.6 Result of Speech Recognition with LSTM-HMM . . . . . . . 114 5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 6 Conclusions 127 Bibliography 131 ์š”์•ฝ 145Docto
    • โ€ฆ
    corecore