202 research outputs found

    A Bayesian Network View on Acoustic Model-Based Techniques for Robust Speech Recognition

    This article provides a unifying Bayesian network view on various approaches for acoustic model adaptation, missing feature, and uncertainty decoding that are well-known in the literature of robust automatic speech recognition. The representatives of these classes can often be deduced from a Bayesian network that extends the conventional hidden Markov models used in speech recognition. These extensions, in turn, can in many cases be motivated from an underlying observation model that relates clean and distorted feature vectors. By converting the observation models into a Bayesian network representation, we formulate the corresponding compensation rules leading to a unified view on known derivations as well as to new formulations for certain approaches. The generic Bayesian perspective provided in this contribution thus highlights structural differences and similarities between the analyzed approaches.

    Uncertainty decoding on Frequency Filtered parameters for robust ASR

    The use of feature enhancement techniques to obtain estimates of the clean parameters is a common approach for robust automatic speech recognition (ASR). However, the decoding algorithm typically ignores how accurate these estimates are. Uncertainty decoding methods incorporate this type of information. In this paper, we develop a formulation of the uncertainty decoding paradigm for Frequency Filtered (FF) parameters using spectral subtraction as a feature enhancement method. Additionally, we show that the uncertainty decoding method for FF parameters admits a simple interpretation as a spectral weighting method that assigns more importance to the most reliable spectral components. Furthermore, we suggest combining this method with SSBD-HMM (Spectral Subtraction and Bounded Distance HMM), a recently proposed technique that is able to compensate for the effects of highly contaminated features (outliers). This combination pursues two objectives: to improve the results achieved by uncertainty decoding methods, and to determine which part of the improvement is due to compensating for the effects of outliers and which part is due to compensating for other, less deteriorated features.
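    The idea of letting the decoder account for the accuracy of enhanced features can be illustrated with a minimal sketch. It assumes the common textbook form of uncertainty decoding, in which the estimated variance of the enhancement error is added to the variance of each diagonal Gaussian before scoring. This is not the paper's Frequency Filtered formulation, and the function and parameter names are hypothetical.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def uncertainty_decoding_loglik(x_hat, mean, var, unc_var):
    """Score an enhanced feature vector x_hat against one Gaussian.

    Assumption (illustrative): the enhancement error is zero-mean Gaussian
    with per-component variance unc_var, so each model variance is inflated
    by the corresponding uncertainty before the likelihood is computed.
    """
    inflated = [v + u for v, u in zip(var, unc_var)]
    return log_gauss_diag(x_hat, mean, inflated)


# The second component is far from the mean but flagged as unreliable,
# so its mismatch is penalized less than in conventional decoding.
x_hat, mean, var = [0.0, 3.0], [0.0, 0.0], [1.0, 1.0]
plain = log_gauss_diag(x_hat, mean, var)
ud = uncertainty_decoding_loglik(x_hat, mean, var, [0.0, 8.0])
```

    In this sketch a large uncertainty flattens the Gaussian along that component, which is one way to read the abstract's remark that reliable components receive more weight.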

    ๊ฐ•์ธํ•œ ์Œ์„ฑ์ธ์‹์„ ์œ„ํ•œ DNN ๊ธฐ๋ฐ˜ ์Œํ–ฅ ๋ชจ๋ธ๋ง

    Ph.D. thesis, Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, February 2019. Nam Soo Kim.
    In this thesis, we propose three DNN-based acoustic modeling techniques for robust automatic speech recognition (ASR). The first is an acoustic modeling technique that makes the best use of the inherent noise robustness of DNNs through auxiliary feature vectors. With this technique, the DNN can more readily learn the complicated relationship among noisy speech, clean speech, the noise estimate, and the phonetic target. On the Aurora-5 DB, the proposed method outperformed noise-aware training (NAT), the conventional model adaptation technique based on auxiliary noise features. The second is a multi-channel feature enhancement technique. In the general multi-channel speech recognition scenario, an enhanced single-source speech signal is extracted from the multiple inputs using beamforming, a conventional signal processing technique, and speech recognition is performed by feeding that signal into the acoustic model. We propose a multi-channel feature enhancement algorithm that combines a DNN with the delay-and-sum (DS) beamformer, one of the most basic beamforming techniques. Through joint training that uses intermediate feature vectors, the proposed DNN effectively represents the relationship between the distorted multi-channel input speech signals and the clean speech signal. Experiments on the multichannel Wall Street Journal audio visual (MC-WSJ-AV) corpus showed that the proposed method outperformed conventional multi-channel feature enhancement techniques. Finally, we propose an uncertainty-aware training (UAT) technique. Most existing DNN-based techniques for robust speech recognition, including those introduced above, estimate their network targets (e.g., clean features and acoustic model parameters) deterministically, as point estimates. This raises the problem of the uncertainty, or reliability, of the estimates. To overcome this issue, UAT employs a modified variational autoencoder (VAE), a neural network model that learns and performs stochastic variational inference (VIF). UAT models robust latent variables that mediate the mapping between the noisy observed features and the phonetic target, using distributional information about the clean feature estimates. The latent variables are trained under a maximum likelihood criterion derived from an uncertainty decoding (UD) framework optimized for deep-learning-based acoustic models. The proposed technique outperformed conventional DNN-based techniques on the Aurora-4 and CHiME-4 databases.
    Contents:
    1 Introduction
    2 Background: 2.1 Deep Neural Networks; 2.2 Experimental Database (2.2.1 Aurora-4 DB, 2.2.2 Aurora-5 DB, 2.2.3 MC-WSJ-AV DB, 2.2.4 CHiME-4 DB)
    3 Two-stage Noise-aware Training for Environment-robust Speech Recognition: 3.1 Introduction; 3.2 Noise-aware Training; 3.3 Two-stage NAT (3.3.1 Lower DNN, 3.3.2 Upper DNN, 3.3.3 Joint Training); 3.4 Experiments (3.4.1 GMM-HMM System, 3.4.2 Training and Structures of DNN-based Techniques, 3.4.3 Performance Evaluation); 3.5 Summary
    4 DNN-based Feature Enhancement for Robust Multichannel Speech Recognition: 4.1 Introduction; 4.2 Observation Model in Multi-Channel Reverberant Noisy Environment; 4.3 Proposed Approach (4.3.1 Lower DNN, 4.3.2 Upper DNN and Joint Training); 4.4 Experiments (4.4.1 Recognition System and Feature Extraction, 4.4.2 Training and Structures of DNN-based Techniques, 4.4.3 Dropout, 4.4.4 Performance Evaluation); 4.5 Summary
    5 Uncertainty-aware Training for DNN-HMM System using Variational Inference: 5.1 Introduction; 5.2 Uncertainty Decoding for Noise Robustness; 5.3 Variational Autoencoder; 5.4 VIF-based Uncertainty-aware Training (5.4.1 Clean Uncertainty Network, 5.4.2 Environment Uncertainty Network, 5.4.3 Prediction Network and Joint Training); 5.5 Experiments (5.5.1 Experimental Setup: Feature Extraction and ASR System, 5.5.2 Network Structures, 5.5.3 Effects of CUN on the Noise Robustness, 5.5.4 Uncertainty Representation in Different SNR Condition, 5.5.5 Result of Speech Recognition, 5.5.6 Result of Speech Recognition with LSTM-HMM); 5.6 Summary
    6 Conclusions
    Bibliography
    Abstract (in Korean)

    Nonparametric uncertainty estimation and propagation for noise robust ASR

    We consider the framework of uncertainty propagation for automatic speech recognition (ASR) in highly non-stationary noise environments. Uncertainty is considered as the variance of speech distortion. Yet, its accurate estimation in the spectral domain and its propagation to the feature domain remain difficult. Existing methods typically rely on a single uncertainty estimator and propagator fixed by mathematical approximation. In this paper, we propose a new paradigm where we seek to learn more powerful mappings to predict uncertainty from data. We investigate two such possible mappings: linear fusion of multiple uncertainty estimators/propagators and nonparametric uncertainty estimation/propagation. In addition, a procedure to propagate the estimated spectral-domain uncertainty to the static Mel frequency cepstral coefficients (MFCCs), to the log-energy, and to their first- and second-order time derivatives is proposed. This results in a full uncertainty covariance matrix over both static and dynamic MFCCs. Experimental evaluation on Tracks 1 and 2 of the 2nd CHiME Challenge resulted in up to 29% and 28% relative keyword error rate reduction with respect to speech enhancement alone.

    Multivariate Cepstral Feature Compensation on Band-limited Data for Robust Speech Recognition

    Proceedings of the 16th Nordic Conference of Computational Linguistics NODALIDA-2007. Editors: Joakim Nivre, Heiki-Jaan Kaalep, Kadri Muischnek and Mare Koit. University of Tartu, Tartu, 2007. ISBN 978-9985-4-0513-0 (online) ISBN 978-9985-4-0514-7 (CD-ROM) pp. 144-151

    Exploration and Optimization of Noise Reduction Algorithms for Speech Recognition in Embedded Devices

    Environmental noise present in real-life applications substantially degrades the performance of speech recognition systems. An example is an in-car scenario where a speech recognition system has to support the man-machine interface. Several sources of noise coming from the engine, wipers, wheels, etc., interact with speech. A particular challenge arises in an open-window scenario, where traffic noise, park noise, etc., have to be taken into account. The main goal of this thesis is to improve the performance of a speech recognition system based on a state-of-the-art hidden Markov model (HMM) using noise reduction methods. The performance is measured in terms of word error rate and mutual information. The noise reduction methods are based on weighting rules. Least-squares weighting rules in the frequency domain have been developed to enable continuous development based on the existing system and to guarantee low complexity and a small footprint for applications in embedded devices. The weighting rule parameters are optimized with a multidimensional Monte Carlo optimization followed by a compass search. Root compression and cepstral smoothing methods have also been implemented to boost the recognition performance. The additional complexity and memory requirements of the proposed system are minimal. The performance of the proposed system was compared to the European Telecommunications Standards Institute (ETSI) standardized system. The proposed system outperforms the ETSI system by up to 8.6 % relative increase in word accuracy and achieves up to 35.1 % relative increase in word accuracy compared to the existing baseline system on the ETSI Aurora 3 German task. A relative increase of up to 18 % in word accuracy over the existing baseline system is also obtained from the proposed weighting rules on large vocabulary databases. An entropy-based feature vector analysis method has also been developed to assess the quality of feature vectors. The entropy estimation is based on the histogram approach. The method has the advantage of objectively assessing the feature vector quality regardless of the acoustic modeling assumption used in the speech recognition system.
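    The histogram approach to entropy estimation mentioned in this abstract can be sketched as follows. This is a minimal illustration for a single feature dimension with equal-width bins; the bin count, the per-dimension treatment, and the function name are assumptions, not details taken from the thesis.

```python
import math


def histogram_entropy(values, num_bins=10):
    """Estimate the entropy (in bits) of one feature dimension.

    Assumption (illustrative): equal-width bins spanning the observed
    range; relative bin counts are used as probability estimates.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # all samples identical: no uncertainty
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # Clamp so the maximum value falls into the last bin.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)


# Eight evenly spread samples fill eight bins uniformly.
spread = histogram_entropy([0, 1, 2, 3, 4, 5, 6, 7], num_bins=8)
constant = histogram_entropy([2.5, 2.5, 2.5], num_bins=8)
```

    Because the estimate depends only on the empirical distribution of the feature values, it does not require any acoustic model, which matches the property the abstract highlights.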