Transfer Learning for Speech and Language Processing
Transfer learning is a vital technique that generalizes models trained for
one setting or task to other settings or tasks. For example, in speech
recognition, an acoustic model trained for one language can be used to
recognize speech in another language, with little or no re-training data.
Transfer learning is closely related to multi-task learning (cross-lingual vs.
multilingual), and is traditionally studied under the name of `model adaptation'.
Recent advances in deep learning show that transfer learning becomes much
easier and more effective with high-level abstract features learned by deep
models, and the `transfer' can be conducted not only between data distributions
and data types, but also between model structures (e.g., shallow nets and deep
nets) or even model types (e.g., Bayesian models and neural models). This
review paper summarizes some recent prominent research in this direction,
particularly for speech and language processing. We also report some results
from our group and highlight the potential of this very interesting research
field.
Comment: 13 pages, APSIPA 201
Very Deep Convolutional Neural Networks for Robust Speech Recognition
This paper describes the extension and optimization of our previous work on
very deep convolutional neural networks (CNNs) for effective recognition of
noisy speech in the Aurora 4 task. The appropriate number of convolutional
layers, the sizes of the filters, pooling operations and input feature maps are
all modified: the filter and pooling sizes are reduced and dimensions of input
feature maps are extended to allow adding more convolutional layers.
Furthermore, appropriate input padding and input feature map selection
strategies are developed. In addition, an adaptation framework using joint
training of the very deep CNN with auxiliary i-vector and fMLLR features
is developed. These modifications give substantial word error rate reductions
over the standard CNN used as the baseline. Finally, the very deep CNN is combined
with an LSTM-RNN acoustic model and it is shown that state-level weighted log
likelihood score combination in a joint acoustic model decoding scheme is very
effective. On the Aurora 4 task, the very deep CNN achieves a WER of 8.81%,
which improves to 7.99% with auxiliary feature joint training and to 7.09% with
LSTM-RNN joint decoding.
Comment: accepted by SLT 201
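The state-level score combination described in the abstract above can be illustrated with a minimal sketch. It assumes that each acoustic model emits one log-likelihood per HMM state for a frame and that a single scalar weight is shared by all states; the function name `combine_state_scores` and the default weight are hypothetical and are not taken from the paper.

```python
from typing import List


def combine_state_scores(
    cnn_loglik: List[float],
    lstm_loglik: List[float],
    cnn_weight: float = 0.5,  # assumed interpolation weight, not from the paper
) -> List[float]:
    """Weighted sum of per-state log-likelihoods from two models for one frame."""
    if len(cnn_loglik) != len(lstm_loglik):
        raise ValueError("both models must score the same set of states")
    if not 0.0 <= cnn_weight <= 1.0:
        raise ValueError("cnn_weight must lie in [0, 1]")
    lstm_weight = 1.0 - cnn_weight
    return [
        cnn_weight * a + lstm_weight * b
        for a, b in zip(cnn_loglik, lstm_loglik)
    ]


# Toy frame with two states; the values are illustrative only.
frame_scores = combine_state_scores([-1.0, -3.0], [-2.0, -1.0], cnn_weight=0.5)
```

In a real decoder the combined scores would replace the single-model acoustic scores before the search, and the weight would typically be tuned on held-out data.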
DNN-Based Acoustic Modeling for Robust Speech Recognition
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2019. 2. 김남수.
In this thesis, we propose three DNN-based acoustic modeling techniques for robust automatic speech recognition (ASR). First, we propose an acoustic modeling technique that makes the best use of the inherent noise robustness of the DNN through auxiliary feature vectors. With this technique, the DNN learns the complicated relationship among the noisy speech, the clean speech, the noise estimate, and the phonetic target more smoothly. On the Aurora-5 DB, the proposed method outperformed noise-aware training (NAT), the conventional auxiliary-feature-based model adaptation technique.
The second method is a multi-channel feature enhancement technique. In the general multi-channel speech recognition scenario, an enhanced single-source speech signal is extracted from the multiple inputs using beamforming, a conventional signal-processing technique, and recognition is performed by feeding that signal into the acoustic model. We propose a multi-channel feature enhancement DNN algorithm that combines the delay-and-sum (DS) beamformer, one of the most basic conventional beamforming techniques, with a DNN. Through joint training that uses intermediate feature vectors, the proposed DNN effectively represents the relationship between the distorted multi-channel input speech signals and the clean speech signal. Experiments on the multichannel Wall Street Journal audio visual (MC-WSJ-AV) corpus showed that the proposed method outperformed the conventional multi-channel feature enhancement techniques.
Finally, we propose the uncertainty-aware training (UAT) technique. Most existing DNN-based techniques, including those introduced above, optimize point estimates of the targets (e.g., clean features and acoustic model parameters). This raises the problem of the uncertainty, or reliability, of the estimates. To overcome this issue, UAT employs a modified structure of the variational autoencoder (VAE), a neural network model that learns and performs stochastic variational inference (VIF). UAT models the robust latent variables that mediate the mapping between the noisy observed features and the phonetic target using the distributional information of the clean feature estimates. These latent variables are trained according to a maximum likelihood criterion derived from the uncertainty decoding (UD) framework optimized for deep-learning-based acoustic models. The proposed technique outperforms the conventional DNN-based techniques on the Aurora-4 and CHiME-4 databases.
Abstract
Contents
List of Figures
List of Tables
1 Introduction
2 Background
2.1 Deep Neural Networks
2.2 Experimental Database
2.2.1 Aurora-4 DB
2.2.2 Aurora-5 DB
2.2.3 MC-WSJ-AV DB
2.2.4 CHiME-4 DB
3 Two-stage Noise-aware Training for Environment-robust Speech Recognition
3.1 Introduction
3.2 Noise-aware Training
3.3 Two-stage NAT
3.3.1 Lower DNN
3.3.2 Upper DNN
3.3.3 Joint Training
3.4 Experiments
3.4.1 GMM-HMM System
3.4.2 Training and Structures of DNN-based Techniques
3.4.3 Performance Evaluation
3.5 Summary
4 DNN-based Feature Enhancement for Robust Multichannel Speech Recognition
4.1 Introduction
4.2 Observation Model in Multi-Channel Reverberant Noisy Environment
4.3 Proposed Approach
4.3.1 Lower DNN
4.3.2 Upper DNN and Joint Training
4.4 Experiments
4.4.1 Recognition System and Feature Extraction
4.4.2 Training and Structures of DNN-based Techniques
4.4.3 Dropout
4.4.4 Performance Evaluation
4.5 Summary
5 Uncertainty-aware Training for DNN-HMM System using Variational Inference
5.1 Introduction
5.2 Uncertainty Decoding for Noise Robustness
5.3 Variational Autoencoder
5.4 VIF-based Uncertainty-aware Training
5.4.1 Clean Uncertainty Network
5.4.2 Environment Uncertainty Network
5.4.3 Prediction Network and Joint Training
5.5 Experiments
5.5.1 Experimental Setup: Feature Extraction and ASR System
5.5.2 Network Structures
5.5.3 Effects of CUN on the Noise Robustness
5.5.4 Uncertainty Representation in Different SNR Condition
5.5.5 Result of Speech Recognition
5.5.6 Result of Speech Recognition with LSTM-HMM
5.6 Summary
6 Conclusions
Bibliography
Abstract (in Korean)
Doctor
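The delay-and-sum (DS) beamformer named in the thesis abstract above can be sketched in a few lines. This is a simplified illustration and not the thesis's implementation: it assumes time-domain channels of equal sampling rate and non-negative integer sample delays that are already known, whereas practical systems estimate the delays and often apply fractional delays. The name `delay_and_sum` is hypothetical.

```python
from typing import List


def delay_and_sum(channels: List[List[float]], delays: List[int]) -> List[float]:
    """Align each channel by its known integer delay (in samples) and average.

    delays[i] is how many samples later the source arrives at microphone i
    than at the earliest microphone (assumed known here).
    """
    if not channels or len(channels) != len(delays):
        raise ValueError("need exactly one delay per channel")
    if any(d < 0 for d in delays):
        raise ValueError("delays must be non-negative integers")
    # Number of output samples for which every shifted channel has data.
    n = min(len(ch) - d for ch, d in zip(channels, delays))
    if n <= 0:
        return []
    num_ch = len(channels)
    return [
        sum(ch[t + d] for ch, d in zip(channels, delays)) / num_ch
        for t in range(n)
    ]


# Toy example: the second microphone receives the same ramp one sample later.
mic_a = [1.0, 2.0, 3.0, 4.0]
mic_b = [0.0, 1.0, 2.0, 3.0]
enhanced = delay_and_sum([mic_a, mic_b], delays=[0, 1])
```

After alignment the source adds coherently across channels while uncorrelated noise averages down, which is why DS is a common baseline front end.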
Adaptation Algorithms for Neural Network-Based Speech Recognition: An Overview
We present a structured overview of adaptation algorithms for neural
network-based speech recognition, considering both hybrid hidden Markov model /
neural network systems and end-to-end neural network systems, with a focus on
speaker adaptation, domain adaptation, and accent adaptation. The overview
characterizes adaptation algorithms as based on embeddings, model parameter
adaptation, or data augmentation. We present a meta-analysis of the performance
of speech recognition adaptation algorithms, based on relative error rate
reductions as reported in the literature.
Comment: Submitted to IEEE Open Journal of Signal Processing. 30 pages, 27 figures
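The relative error rate reduction on which the meta-analysis above is based is a simple ratio. The sketch below shows the usual definition; the function name and the example figures are hypothetical and are not drawn from the overview.

```python
def relative_error_rate_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Relative reduction, in percent, of the adapted system over the baseline."""
    if baseline_wer <= 0.0:
        raise ValueError("baseline error rate must be positive")
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer


# Hypothetical example: a 10.0% baseline WER reduced to 9.0% after adaptation.
rerr = relative_error_rate_reduction(10.0, 9.0)
```

Reporting relative rather than absolute reductions makes results from systems with very different baseline error rates easier to compare.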
DNN adaptation by automatic quality estimation of ASR hypotheses
In this paper we propose to exploit the automatic Quality Estimation (QE) of
ASR hypotheses to perform the unsupervised adaptation of a deep neural network
modeling acoustic probabilities. Our hypothesis is that significant
improvements can be achieved by: i) automatically transcribing the evaluation
data we are currently trying to recognise, and ii) selecting from it a subset
of "good quality" instances based on the word error rate (WER) scores predicted
by a QE component. To validate this hypothesis, we run several experiments on
the evaluation data sets released for the CHiME-3 challenge. First, we operate
in oracle conditions in which manual transcriptions of the evaluation data are
available, thus allowing us to compute the "true" sentence WER. In this
scenario, we perform the adaptation with variable amounts of data, which are
characterised by different levels of quality. Then, we move to realistic
conditions in which the manual transcriptions of the evaluation data are not
available. In this case, the adaptation is performed on data selected according
to the WER scores "predicted" by a QE component. Our results indicate that: i)
QE predictions allow us to closely approximate the adaptation results obtained
in oracle conditions, and ii) the overall ASR performance based on the proposed
QE-driven adaptation method is significantly better than the strong, most
recent, CHiME-3 baseline.
Comment: Computer Speech & Language December 201
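The data-selection step described in the abstract above (keep only automatically transcribed utterances whose predicted WER is low, then adapt on them) can be sketched as follows. The `Hypothesis` record, the `select_adaptation_data` function, and the threshold value are all hypothetical; the paper's QE component and its selection criteria are not reproduced here.

```python
from typing import List, NamedTuple


class Hypothesis(NamedTuple):
    utt_id: str
    text: str             # automatic transcription of the utterance
    predicted_wer: float  # QE prediction, assumed to be a fraction in [0, 1]


def select_adaptation_data(
    hyps: List[Hypothesis],
    max_predicted_wer: float = 0.1,  # assumed threshold, for illustration only
) -> List[Hypothesis]:
    """Keep hypotheses whose predicted WER does not exceed the threshold."""
    return [h for h in hyps if h.predicted_wer <= max_predicted_wer]


# Toy hypotheses with made-up QE scores.
hyps = [
    Hypothesis("utt1", "turn the light on", 0.05),
    Hypothesis("utt2", "turn the right on", 0.40),
    Hypothesis("utt3", "play some music", 0.10),
]
selected = select_adaptation_data(hyps)
selected_ids = [h.utt_id for h in selected]
```

The selected transcriptions would then serve as pseudo-labels for unsupervised adaptation of the acoustic model; in the oracle condition the true sentence WER would take the place of `predicted_wer`.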
Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments
Eliminating the negative effect of non-stationary environmental noise is a
long-standing research topic for automatic speech recognition that still
remains an important challenge. Data-driven supervised approaches, including
ones based on deep neural networks, have recently emerged as potential
alternatives to traditional unsupervised approaches and, with sufficient
training, can alleviate the shortcomings of the unsupervised methods in various
real-life acoustic environments. In this light, we review recently developed,
representative deep learning approaches for tackling non-stationary additive
and convolutional degradation of speech with the aim of providing guidelines
for those involved in the development of environmentally robust speech
recognition systems. We separately discuss single- and multi-channel techniques
developed for the front-end and back-end of speech recognition systems, as well
as joint front-end and back-end training frameworks.