Improved training for online end-to-end speech recognition systems
Achieving high accuracy with end-to-end speech recognizers requires careful
parameter initialization prior to training. Otherwise, the networks may fail to
find a good local optimum. This is particularly true for online networks, such
as unidirectional LSTMs. Currently, the best strategy to train such systems is
to bootstrap the training from a tied-triphone system. However, this is time
consuming, and more importantly, is impossible for languages without a
high-quality pronunciation lexicon. In this work, we propose an initialization
strategy that uses teacher-student learning to transfer knowledge from a large,
well-trained, offline end-to-end speech recognition model to an online
end-to-end model, eliminating the need for a lexicon or any other linguistic
resources. We also explore curriculum learning and label smoothing and show how
they can be combined with the proposed teacher-student learning for further
improvements. We evaluate our methods on a Microsoft Cortana personal assistant
task and show that the proposed method results in a 19% relative improvement
in word error rate compared to a randomly initialized baseline system.
Comment: Interspeech 201
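A minimal sketch of the frame-level teacher-student transfer described above, assuming a PyTorch setup; the `teacher`/`student` call convention, the temperature `T`, and feeding both models the same features are assumptions, and the curriculum learning and label smoothing extensions are omitted:

```python
import torch
import torch.nn.functional as F

def teacher_student_step(teacher, student, features, optimizer, T=1.0):
    """One transfer step: the online (unidirectional) student is trained to
    match the per-frame posteriors of the well-trained offline teacher, so
    no transcripts, lexicon, or other linguistic resources are required."""
    with torch.no_grad():
        teacher_logits = teacher(features)      # (batch, time, labels)
    student_logits = student(features)          # same shape
    loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                 # standard temperature scaling
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```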
Online Speech Recognition Using Recurrent Neural Networks
Thesis (Ph.D.) -- Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2017. Advisor: Wonyong Sung.
Recurrent neural networks (RNNs) have shown outstanding sequence-to-sequence modeling performance. Thanks to recent advances in end-to-end training approaches for automatic speech recognition (ASR), RNNs can learn direct mapping functions from a sequence of audio features to a sequence of output characters or words, without any intermediate phoneme or lexicon layers. So far, the majority of studies on end-to-end ASR have focused on raising recognition accuracy to the level of traditional state-of-the-art models. However, although end-to-end ASR models have reached the accuracy of traditional systems, their application has usually been limited to utterance-level recognition of pre-segmented audio rather than online recognition of continuous audio. This is because RNNs trained on segmented audio do not generalize easily to very long streams of audio.
To address this problem, we propose an RNN training approach for sequences of virtually infinite length. Specifically, we describe an efficient GPU-based RNN training framework for the truncated backpropagation through time (BPTT) algorithm, which is suitable for online (continuous) training. Then, we present an online version of the connectionist temporal classification (CTC) loss computation algorithm, in which the original CTC loss is estimated over a partial sliding window. This modified CTC algorithm can be directly employed for truncated-BPTT-based RNN training.
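A minimal sketch of the truncated-BPTT loop on a continuous stream, hedged: it uses the standard `torch.nn.CTCLoss` on each window (roughly the CTC-TR idea) rather than the dissertation's CTC-EM estimator, and the availability of per-window target labels is an assumption:

```python
import torch

ctc_loss = torch.nn.CTCLoss(blank=0, zero_infinity=True)

def train_on_stream(rnn, classifier, stream, optimizer):
    """Truncated BPTT: the hidden state is carried across windows but
    detached, so gradients flow only within the current window."""
    hidden = None
    for feats, labels in stream:        # feats: (win_len, 1, feat_dim)
        out, hidden = rnn(feats, hidden)
        hidden = tuple(h.detach() for h in hidden)      # truncate the graph
        log_probs = classifier(out).log_softmax(-1)     # (win_len, 1, labels)
        loss = ctc_loss(log_probs,
                        labels.unsqueeze(0),            # (1, labels_in_window)
                        torch.tensor([log_probs.size(0)]),
                        torch.tensor([labels.size(0)]))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```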
In addition, a fully RNN-based end-to-end online ASR model is proposed. The model is composed of an acoustic RNN with CTC output and a character-level RNN language model that is augmented with a hierarchical structure. Prefix-tree-based beam search decoding is employed with a new beam pruning algorithm that prevents exponential growth of the tree. The model is free from phoneme or lexicon models and can decode infinitely long audio sequences. It also has a very small memory footprint compared to other end-to-end systems while showing competitive accuracy.
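For orientation, here is a compact, textbook CTC prefix beam search over per-frame character log-probabilities (a NumPy array of shape `(T, num_labels)`); the dissertation's prefix-tree organization, LM fusion, and new pruning algorithm are not reproduced here, only the width-based pruning that every beam search shares:

```python
import collections
import math

NEG_INF = -float("inf")

def logsumexp(a, b):
    if a == NEG_INF: return b
    if b == NEG_INF: return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def ctc_beam_search(log_probs, beam_width=16, blank=0):
    """Standard CTC prefix beam search. Each prefix keeps two log-scores:
    ending in blank (pb) and ending in a non-blank label (pnb)."""
    beams = {(): (0.0, NEG_INF)}                    # empty prefix
    for t in range(log_probs.shape[0]):
        nxt = collections.defaultdict(lambda: [NEG_INF, NEG_INF])
        for prefix, (pb, pnb) in beams.items():
            total = logsumexp(pb, pnb)
            for c in range(log_probs.shape[1]):
                p = log_probs[t, c]
                if c == blank:
                    nxt[prefix][0] = logsumexp(nxt[prefix][0], total + p)
                elif prefix and c == prefix[-1]:
                    # repeated label: only a preceding blank starts a new one
                    nxt[prefix + (c,)][1] = logsumexp(nxt[prefix + (c,)][1], pb + p)
                    nxt[prefix][1] = logsumexp(nxt[prefix][1], pnb + p)
                else:
                    nxt[prefix + (c,)][1] = logsumexp(nxt[prefix + (c,)][1], total + p)
        # beam pruning: keep only the beam_width best prefixes
        best = sorted(nxt.items(), key=lambda kv: -logsumexp(*kv[1]))[:beam_width]
        beams = {k: (v[0], v[1]) for k, v in best}
    return max(beams, key=lambda k: logsumexp(*beams[k]))
```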
Furthermore, we propose an improved character-level RNN LM with a hierarchical structure. This character-level RNN LM achieves lower perplexity than a lightweight word-level RNN LM of comparable size. When this RNN LM is applied to the proposed character-level online ASR system, better speech recognition accuracy can be achieved with a reduced amount of computation.
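As a rough illustration only (the module layout, the space-as-word-boundary convention, and all sizes are assumptions, not the dissertation's exact architecture), the hierarchical idea can be sketched as a character-level LSTM conditioned on a slower word-level LSTM that is clocked only at word boundaries:

```python
import torch
import torch.nn as nn

class HierarchicalCharLM(nn.Module):
    """Character-level LSTM plus a slower word-level LSTM that steps only
    at word boundaries, giving the character model longer-range context."""
    def __init__(self, vocab_size, hidden=256, space_id=1):
        super().__init__()
        self.space_id = space_id
        self.embed = nn.Embedding(vocab_size, hidden)
        self.char_rnn = nn.LSTMCell(2 * hidden, hidden)  # char emb + word context
        self.word_rnn = nn.LSTMCell(hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, chars):                    # chars: (batch, time) int64
        B, T = chars.shape
        h = c = wh = wc = torch.zeros(B, self.word_rnn.hidden_size,
                                      device=chars.device)
        logits = []
        for t in range(T):
            x = torch.cat([self.embed(chars[:, t]), wh], dim=-1)
            h, c = self.char_rnn(x, (h, c))
            # clock the word-level LSTM only where a word just ended
            is_boundary = (chars[:, t] == self.space_id).float().unsqueeze(1)
            nwh, nwc = self.word_rnn(h, (wh, wc))
            wh = is_boundary * nwh + (1 - is_boundary) * wh
            wc = is_boundary * nwc + (1 - is_boundary) * wc
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)        # (batch, time, vocab)
```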
1 Introduction
1.1 Automatic Speech Recognition
1.1.1 Traditional ASR
1.1.2 End-to-End ASR with Recurrent Neural Networks
1.1.3 Offline and Online ASR
1.2 Scope of the Dissertation
1.2.1 End-to-End Online ASR with RNNs
1.2.2 Challenges and Contributions
2 Flexible and Efficient RNN Training on GPUs
2.1 Introduction
2.2 Generalization
2.2.1 Generalized RNN Structure
2.2.2 Training
2.3 Parallelization
2.3.1 Intra-Stream Parallelism
2.3.2 Inter-Stream Parallelism
2.4 Experiments
2.5 Concluding Remarks
3 Online Sequence Training with Connectionist Temporal Classification
3.1 Introduction
3.2 Connectionist Temporal Classification
3.3 Online Sequence Training
3.3.1 Problem Definition
3.3.2 Overview of the Proposed Approach
3.3.3 CTC-TR: Standard CTC with Truncation
3.3.4 CTC-EM: EM-Based Online CTC
3.4 Training Continuously Running RNNs
3.5 Parallel Training
3.6 Experiments
3.6.1 End-to-End Speech Recognition with RNNs
3.6.2 Phoneme Recognition on TIMIT
3.7 Concluding Remarks
4 Character-Level Incremental Speech Recognition
4.1 Introduction
4.2 Models
4.2.1 Acoustic Model
4.2.2 Language Model
4.3 Character-Level Beam Search
4.3.1 Prefix-Tree-Based CTC Beam Search
4.3.2 Pruning
4.4 Experiments
4.5 Concluding Remarks
5 Character-Level Language Modeling with Hierarchical RNNs
5.1 Introduction
5.2 Related Work
5.2.1 Character-Level Language Modeling with RNNs
5.2.2 Character-Aware Word-Level Language Modeling
5.3 RNNs with External Clock and Reset Signals
5.4 Character-Level Language Modeling with a Hierarchical RNN
5.5 Experiments
5.5.1 Perplexity
5.5.2 End-to-End Automatic Speech Recognition (ASR)
5.6 Concluding Remarks
6 Conclusion
Bibliography
Abstract in Korean
RWTH ASR Systems for LibriSpeech: Hybrid vs Attention -- w/o Data Augmentation
We present state-of-the-art automatic speech recognition (ASR) systems
employing a standard hybrid DNN/HMM architecture compared to an attention-based
encoder-decoder design for the LibriSpeech task. Detailed descriptions of the
system development, including model design, pretraining schemes, training
schedules, and optimization approaches are provided for both system
architectures. Both hybrid DNN/HMM and attention-based systems employ
bi-directional LSTMs for acoustic modeling/encoding. For language modeling, we
employ both LSTM and Transformer based architectures. All our systems are built
using RWTH's open-source toolkits RASR and RETURNN. To the best of the
authors' knowledge, the results obtained when training on the full LibriSpeech
training set are the best published to date, both for the hybrid DNN/HMM and the
attention-based systems. Our single hybrid system even outperforms previous
results obtained from combining eight single systems. Our comparison shows that
on the LibriSpeech 960h task, the hybrid DNN/HMM system outperforms the
attention-based system by 15% relative on the clean and 40% relative on the
other test sets in terms of word error rate. Moreover, experiments on a reduced
100h-subset of the LibriSpeech training corpus even show a more pronounced
margin between the hybrid DNN/HMM and attention-based architectures.
Comment: Proceedings of INTERSPEECH 201
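Both systems build on the same encoder family; a minimal sketch of a stacked bidirectional LSTM acoustic model in PyTorch, with illustrative sizes that are not RWTH's actual configuration:

```python
import torch.nn as nn

class BLSTMEncoder(nn.Module):
    """Stacked bidirectional LSTM mapping frame features to per-frame
    encodings, usable as a hybrid acoustic model (with tied-state outputs)
    or as the encoder of an attention-based decoder."""
    def __init__(self, feat_dim=40, hidden=512, layers=6, num_outputs=12001):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=layers,
                             bidirectional=True, batch_first=True)
        self.output = nn.Linear(2 * hidden, num_outputs)

    def forward(self, feats):           # feats: (batch, time, feat_dim)
        enc, _ = self.blstm(feats)
        return self.output(enc)         # (batch, time, num_outputs)
```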
Character-Level Incremental Speech Recognition with Recurrent Neural Networks
In real-time speech recognition applications, latency is an important issue.
We have developed a character-level incremental speech recognition (ISR)
system that responds quickly even while the user is still speaking, with
hypotheses that are gradually refined as the speech proceeds. The algorithm employs a
speech-to-character unidirectional recurrent neural network (RNN), which is
end-to-end trained with connectionist temporal classification (CTC), and an
RNN-based character-level language model (LM). The output values of the
CTC-trained RNN are character-level probabilities, which are processed by beam
search decoding. The RNN LM augments the decoding by providing long-term
dependency information. We propose tree-based online beam search with
additional depth-pruning, which enables the system to process infinitely long
input speech with low latency. The system not only responds quickly to speech
but can also dictate out-of-vocabulary (OOV) words according to their pronunciation.
The proposed model achieves the word error rate (WER) of 8.90% on the Wall
Street Journal (WSJ) Nov'92 20K evaluation set when trained on the WSJ SI-284
training set.
Comment: To appear in ICASSP 201
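The low-latency behaviour can be illustrated with a stripped-down greedy stand-in for the paper's tree-based beam search: collapse repeated CTC labels, drop blanks, and emit each character as soon as the corresponding frame arrives:

```python
def incremental_greedy_ctc(frame_scores, blank=0):
    """Generator: consumes per-frame CTC label scores one frame at a time
    and yields character ids as soon as they are decided (greedy path)."""
    prev = blank
    for scores in frame_scores:         # each: sequence of num_labels scores
        best = max(range(len(scores)), key=lambda i: scores[i])
        if best != blank and best != prev:
            yield best                  # emitted immediately: low latency
        prev = best

# usage sketch:
# for cid in incremental_greedy_ctc(stream): print(idx2char[cid], end="")
```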
Improved training of end-to-end attention models for speech recognition
Sequence-to-sequence attention-based models on subword units allow simple
open-vocabulary end-to-end speech recognition. In this work, we show that such
models can achieve competitive results on the Switchboard 300h and LibriSpeech
1000h tasks. In particular, we report the state-of-the-art word error rates
(WER) of 3.54% on the dev-clean and 3.82% on the test-clean evaluation subsets
of LibriSpeech. We introduce a new pretraining scheme by starting with a high
time reduction factor and lowering it during training, which is crucial both
for convergence and final performance. In some experiments, we also use an
auxiliary CTC loss function to aid convergence. In addition, we train long
short-term memory (LSTM) language models on subword units. By shallow fusion,
we report up to 27% relative improvements in WER over the attention baseline
without a language model.
Comment: Submitted to Interspeech 201
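Shallow fusion itself is a one-line combination, `score = log p_dec(y|x) + lambda * log p_lm(y)`, applied at every beam-search step; a sketch with the LM weight as a tuned hyperparameter:

```python
import torch

def shallow_fusion(dec_log_probs, lm_log_probs, lm_weight=0.3):
    """Combine next-token scores from the attention decoder and an external
    LM over the same subword vocabulary; both inputs: (beam, vocab)."""
    return dec_log_probs + lm_weight * lm_log_probs

# inside one beam-search step (sketch):
# scores = shallow_fusion(dec_out.log_softmax(-1), lm_out.log_softmax(-1))
# topk = scores.view(-1).topk(beam_width)
```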