Character-Level Incremental Speech Recognition with Recurrent Neural Networks
In real-time speech recognition applications, latency is an important
issue. We have developed a character-level incremental speech recognition (ISR)
system that responds quickly even during speech, where the hypotheses are
gradually improved as the speaking proceeds. The algorithm employs a
speech-to-character unidirectional recurrent neural network (RNN), which is
end-to-end trained with connectionist temporal classification (CTC), and an
RNN-based character-level language model (LM). The output values of the
CTC-trained RNN are character-level probabilities, which are processed by beam
search decoding. The RNN LM augments the decoding by providing long-term
dependency information. We propose tree-based online beam search with
additional depth-pruning, which enables the system to process infinitely long
input speech with low latency. The system not only responds quickly during
speech but can also dictate out-of-vocabulary (OOV) words according to their pronunciation.
The proposed model achieves a word error rate (WER) of 8.90% on the Wall
Street Journal (WSJ) Nov'92 20K evaluation set when trained on the WSJ SI-284
training set.
Comment: To appear in ICASSP 2016
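The abstract describes the decoding pipeline but gives no implementation. Below is a minimal sketch of CTC prefix beam search with character-level LM fusion over per-frame character probabilities; the function name, the `lm_score` hook, and the `alpha`/`beam_width` parameters are illustrative assumptions, and the simple top-k pruning stands in for the paper's tree-based decoder with depth-pruning.

```python
import numpy as np
from collections import defaultdict

def ctc_beam_search(probs, alphabet, lm_score=lambda prefix, c: 0.0,
                    beam_width=8, alpha=0.5):
    """Simplified CTC prefix beam search with character-level LM fusion.

    probs: (T, V) per-frame character probabilities from the CTC-trained RNN;
    column 0 is taken to be the CTC blank. lm_score(prefix, c) should return
    the LM log-probability of character c given the decoded prefix.
    """
    NEG_INF = -np.inf
    # Each entry maps prefix -> (log P ending in blank, log P ending in non-blank).
    beams = {(): (0.0, NEG_INF)}
    for t in range(probs.shape[0]):
        next_beams = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beams.items():
            for i, c in enumerate(alphabet):
                p = np.log(probs[t, i] + 1e-12)
                if i == 0:  # blank: prefix unchanged, path now ends in blank
                    b, nb = next_beams[prefix]
                    next_beams[prefix] = (
                        np.logaddexp(b, np.logaddexp(p_b, p_nb) + p), nb)
                    continue
                new_prefix = prefix + (c,)
                lm = alpha * lm_score(prefix, c)
                b, nb = next_beams[new_prefix]
                if prefix and c == prefix[-1]:
                    # Repeats: extending the prefix needs a blank in between ...
                    next_beams[new_prefix] = (b, np.logaddexp(nb, p_b + p + lm))
                    # ... otherwise the repeated character collapses into prefix.
                    sb, snb = next_beams[prefix]
                    next_beams[prefix] = (sb, np.logaddexp(snb, p_nb + p))
                else:
                    next_beams[new_prefix] = (
                        b, np.logaddexp(nb, np.logaddexp(p_b, p_nb) + p + lm))
        # Keep only the best beam_width prefixes (the paper instead prunes a
        # prefix tree and adds depth-pruning so the tree cannot grow unboundedly).
        beams = dict(sorted(next_beams.items(),
                            key=lambda kv: np.logaddexp(*kv[1]),
                            reverse=True)[:beam_width])
    best_prefix, _ = max(beams.items(), key=lambda kv: np.logaddexp(*kv[1]))
    return "".join(best_prefix)

# Toy usage with a 3-symbol alphabet; column 0 is the blank.
probs = np.array([[0.1, 0.8, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.1, 0.1, 0.8]])
print(ctc_beam_search(probs, ["_", "a", "b"]))  # -> "ab"
```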
Single stream parallelization of generalized LSTM-like RNNs on a GPU
Recurrent neural networks (RNNs) have shown outstanding performance on
processing sequence data. However, they suffer from long training times, which
demand parallel implementations of the training procedure. Parallelizing
the training algorithms for RNNs is very challenging because internal
recurrent paths form dependencies between two different time frames. In this
paper, we first propose a generalized graph-based RNN structure that covers the
most popular long short-term memory (LSTM) network. Then, we present a
parallelization approach that automatically explores the parallelisms of arbitrary
RNNs by analyzing the graph structure. The experimental results show that the
proposed approach achieves a significant speed-up even with a single training stream, and
further accelerates the training when combined with multiple parallel training
streams.
Comment: Accepted by the 40th IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP) 2015
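The abstract does not show the graph-analysis machinery, but one concrete instance of single-stream (intra-stream) parallelism in LSTM-like RNNs is that the input-to-hidden projections carry no recurrent dependency, so they can be batched into one large matrix product over all time frames, leaving only the elementwise recurrence sequential. Below is a minimal NumPy sketch under that assumption; it is not the paper's automatic graph exploration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(x, Wx, Wh, b, h0, c0):
    """LSTM over a (T, D) sequence with the input GEMM batched over time.

    x @ Wx has no recurrent dependency, so it is computed for all T frames
    at once; only the h/c recurrence is evaluated frame by frame.
    """
    T = x.shape[0]
    H = h0.shape[0]
    gates_x = x @ Wx + b              # (T, 4H): parallel over all time steps
    h, c = h0, c0
    hs = np.empty((T, H))
    for t in range(T):                # sequential part: recurrent path only
        g = gates_x[t] + h @ Wh       # (4H,)
        i, f = sigmoid(g[:H]), sigmoid(g[H:2 * H])
        o, u = sigmoid(g[2 * H:3 * H]), np.tanh(g[3 * H:])
        c = f * c + i * u
        h = o * np.tanh(c)
        hs[t] = h
    return hs

# Toy usage: 100 frames of 40-dim features, 128 hidden units.
rng = np.random.default_rng(0)
T, D, H = 100, 40, 128
hs = lstm_forward(rng.standard_normal((T, D)),
                  0.1 * rng.standard_normal((D, 4 * H)),
                  0.1 * rng.standard_normal((H, 4 * H)),
                  np.zeros(4 * H), np.zeros(H), np.zeros(H))
```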
Fixed-Point Performance Analysis of Recurrent Neural Networks
Recurrent neural networks have shown excellent performance in many
applications; however, they require increased complexity in hardware- or software-based
implementations. The hardware complexity can be lowered considerably by
minimizing the word-length of the weights and signals. This work analyzes the
fixed-point performance of recurrent neural networks using a retrain-based
quantization method. The quantization sensitivity of each layer in RNNs is
studied, and overall fixed-point optimization results that minimize the
capacity of the weights without sacrificing performance are presented.
Language modeling and phoneme recognition examples are used.
Online Speech Recognition with Recurrent Neural Networks
Thesis (Ph.D.) -- Graduate School of Seoul National University, Department of Electrical and Computer Engineering, February 2017. Advisor: Wonyong Sung.
Recurrent neural networks (RNNs) have shown outstanding sequence-to-sequence modeling performance. Thanks to recent advances in end-to-end training approaches for automatic speech recognition (ASR), RNNs can learn direct mapping functions from a sequence of audio features to a sequence of output characters or words, without any intermediate phoneme or lexicon layers. So far, the majority of studies on end-to-end ASR have focused on raising the accuracy of speech recognition to the level of traditional state-of-the-art models. However, although end-to-end ASR models have reached the accuracy of traditional systems, their application has usually been limited to utterance-level speech recognition with pre-segmented audio, rather than online speech recognition with continuous audio. This is because RNNs cannot easily be generalized to very long streams of audio when they are trained with segmented audio.
To address this problem, we propose an RNN training approach for training sequences of virtually infinite length. Specifically, we describe an efficient GPU-based RNN training framework for the truncated backpropagation through time (BPTT) algorithm, which is suitable for online (continuous) training. Then, we present an online version of the connectionist temporal classification (CTC) loss computation algorithm, where the original CTC loss is estimated with a partial sliding window. This modified CTC algorithm can be directly employed for truncated-BPTT-based RNN training.
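A minimal sketch of the state-carrying truncated BPTT loop described above, in PyTorch-style Python; `model`, `stream`, and `loss_fn` are assumed interfaces, not the thesis framework.

```python
import torch

def online_truncated_bptt(model, stream, loss_fn, optimizer):
    """Train an RNN on a continuous stream with truncated BPTT.

    stream yields (inputs, targets) chunks of a single unbroken sequence.
    The hidden state is carried across chunks, so the network effectively
    sees an infinitely long input, while gradients are truncated at the
    chunk boundary by detaching the carried state.
    """
    state = None
    for inputs, targets in stream:
        if state is not None:
            state = tuple(s.detach() for s in state)  # cut the gradient path
        outputs, state = model(inputs, state)         # e.g. an nn.LSTM stack
        loss = loss_fn(outputs, targets)              # e.g. a CTC-style loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```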
In addition, a fully RNN-based end-to-end online ASR model is proposed. The model is composed of an acoustic RNN with CTC output and a character-level RNN language model that is augmented with a hierarchical structure. Prefix-tree based beam search decoding is employed with a new beam pruning algorithm to prevent exponential growth of the tree. The model is free from phoneme or lexicon models, and can be used for decoding infinitely long audio sequences. Also, this model has a very small memory footprint compared to other end-to-end systems while showing competitive accuracy.
Furthermore, we propose an improved character-level RNN LM with a hierarchical structure. This character-level RNN LM shows improved perplexity compared to a lightweight word-level RNN LM of comparable size. When this RNN LM is applied to the proposed character-level online ASR, better speech recognition accuracy can be achieved with a reduced amount of computation.
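As a rough sketch of such a hierarchical structure (cf. the "external clock and reset signals" of Chapter 5 in the outline below), the model here pairs a fast character-level LSTM with a slow context LSTM that ticks only at word boundaries, so long-range (word-level) dependencies travel through far fewer recurrent steps. Layer sizes and the boundary rule are illustrative assumptions, not the thesis architecture.

```python
import torch
import torch.nn as nn

class HierCharLM(nn.Module):
    """Hierarchical character-level LM sketch: a character LSTM conditioned
    on a word-level context LSTM that is clocked only at word delimiters."""

    def __init__(self, vocab, char_dim=64, char_hid=256, word_hid=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, char_dim)
        self.char_rnn = nn.LSTMCell(char_dim + word_hid, char_hid)
        self.word_rnn = nn.LSTMCell(char_hid, word_hid)
        self.out = nn.Linear(char_hid, vocab)

    def forward(self, chars, boundary):
        # chars: (T,) character ids; boundary: (T,) booleans, True at delimiters.
        zeros = lambda n: (torch.zeros(1, n), torch.zeros(1, n))
        hc = zeros(self.char_rnn.hidden_size)
        hw = zeros(self.word_rnn.hidden_size)
        logits = []
        for t in range(chars.shape[0]):
            x = torch.cat([self.embed(chars[t:t + 1]), hw[0]], dim=-1)
            hc = self.char_rnn(x, hc)
            if boundary[t]:              # the slow module ticks at word ends only
                hw = self.word_rnn(hc[0], hw)
            logits.append(self.out(hc[0]))
        return torch.stack(logits)       # (T, 1, vocab) next-character logits
```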
1 Introduction
1.1 Automatic Speech Recognition
1.1.1 Traditional ASR
1.1.2 End-to-End ASR with Recurrent Neural Networks
1.1.3 Offline and Online ASR
1.2 Scope of the Dissertation
1.2.1 End-to-End Online ASR with RNNs
1.2.2 Challenges and Contributions
2 Flexible and Efficient RNN Training on GPUs
2.1 Introduction
2.2 Generalization
2.2.1 Generalized RNN Structure
2.2.2 Training
2.3 Parallelization
2.3.1 Intra-Stream Parallelism
2.3.2 Inter-Stream Parallelism
2.4 Experiments
2.5 Concluding Remarks
3 Online Sequence Training with Connectionist Temporal Classification
3.1 Introduction
3.2 Connectionist Temporal Classification
3.3 Online Sequence Training
3.3.1 Problem Definition
3.3.2 Overview of the Proposed Approach
3.3.3 CTC-TR: Standard CTC with Truncation
3.3.4 CTC-EM: EM-Based Online CTC
3.4 Training Continuously Running RNNs
3.5 Parallel Training
3.6 Experiments
3.6.1 End-to-End Speech Recognition with RNNs
3.6.2 Phoneme Recognition on TIMIT
3.7 Concluding Remarks
4 Character-Level Incremental Speech Recognition
4.1 Introduction
4.2 Models
4.2.1 Acoustic Model
4.2.2 Language Model
4.3 Character-Level Beam Search
4.3.1 Prefix-Tree-Based CTC Beam Search
4.3.2 Pruning
4.4 Experiments
4.5 Concluding Remarks
5 Character-Level Language Modeling with Hierarchical RNNs
5.1 Introduction
5.2 Related Work
5.2.1 Character-Level Language Modeling with RNNs
5.2.2 Character-Aware Word-Level Language Modeling
5.3 RNNs with External Clock and Reset Signals
5.4 Character-Level Language Modeling with a Hierarchical RNN
5.5 Experiments
5.5.1 Perplexity
5.5.2 End-to-End Automatic Speech Recognition (ASR)
5.6 Concluding Remarks
6 Conclusion
Bibliography
Abstract in Korean
FPGA-Based Low-Power Speech Recognition with Recurrent Neural Networks
In this paper, a neural-network-based real-time speech recognition (SR)
system is developed using an FPGA for very low-power operation. The implemented
system employs two recurrent neural networks (RNNs): one is a
speech-to-character RNN for acoustic modeling (AM), and the other is for
character-level language modeling (LM). The system also employs a statistical
word-level LM to improve the recognition accuracy. The results of the AM, the
character-level LM, and the word-level LM are combined using a fairly simple
N-best search algorithm instead of a hidden Markov model (HMM) based network.
The RNNs are implemented using massively parallel processing elements (PEs) for
low latency and high throughput. The weights are quantized to 6 bits so that
all of them can be stored in the on-chip memory of an FPGA. The proposed algorithm is
implemented on a Xilinx XC7Z045, and the system can operate much faster than
real-time.
Comment: Accepted to SiPS 2016
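A minimal sketch of how the three scores might be combined in a simple N-best search; the log-linear weights and the scoring callbacks are illustrative assumptions, not the implemented hardware algorithm.

```python
def rank_nbest(hyps, am_score, char_lm_score, word_lm_score,
               beta=0.8, gamma=1.2, word_bonus=0.5):
    """Rank N-best hypotheses by a log-linear combination of the acoustic
    model, character-level LM, and word-level LM log-scores."""
    def total(h):
        return (am_score(h)
                + beta * char_lm_score(h)
                + gamma * word_lm_score(h)
                + word_bonus * len(h.split()))  # word insertion bonus
    return sorted(hyps, key=total, reverse=True)
```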