Exploring efficient neural architectures for linguistic-acoustic mapping in text-to-speech
Conversion from text to speech relies on an accurate mapping from linguistic to acoustic symbol sequences, for which current practice employs recurrent statistical models such as recurrent neural networks. Despite the good performance of such models (in terms of low distortion in the generated speech), their recursive structure with intermediate affine transformations tends to make them slow to train and to sample from. In this work, we explore two different mechanisms that enhance the operational efficiency of recurrent neural networks, and study their performance-speed trade-off. The first mechanism is based on the quasi-recurrent neural network, where expensive affine transformations are removed from temporal connections and placed only on feed-forward computational directions. The second mechanism includes a module based on the Transformer decoder network, designed without recurrent connections but emulating them with attention and positional codes. Our results show that the proposed decoder networks are competitive in terms of distortion when compared to a recurrent baseline, while being significantly faster in CPU and GPU inference time. The best-performing model is the one based on the quasi-recurrent mechanism, reaching the same level of naturalness as the recurrent neural network based model with a speedup of 11.2x on CPU and 3.3x on GPU.
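The quasi-recurrent mechanism can be illustrated with a minimal numpy sketch. This is not the paper's implementation; the function name, weight shapes, and gate layout are illustrative assumptions following the standard QRNN formulation (masked convolution followed by elementwise "fo-pooling"). The point the abstract makes is visible in the code: every expensive matrix product is computed for all time steps at once, and the only sequential work left is elementwise.

```python
import numpy as np

def qrnn_layer(x, Wz, Wf, Wo):
    """One QRNN layer with fo-pooling (illustrative sketch).

    x: (T, D) input sequence; Wz, Wf, Wo: (K*D, H) filter matrices for a
    width-K causal convolution. All heavy affine transforms run over the
    whole sequence in parallel; only an elementwise recurrence remains.
    """
    T, D = x.shape
    K = Wz.shape[0] // D
    # Causal convolution: stack the K most recent frames for each step.
    pad = np.vstack([np.zeros((K - 1, D)), x])
    windows = np.stack([pad[t:t + K].ravel() for t in range(T)])  # (T, K*D)
    z = np.tanh(windows @ Wz)               # candidate, all steps at once
    f = 1 / (1 + np.exp(-(windows @ Wf)))   # forget gate
    o = 1 / (1 + np.exp(-(windows @ Wo)))   # output gate
    # fo-pooling: the only sequential part, cheap and elementwise.
    c = np.zeros(Wz.shape[1])
    h = np.empty((T, Wz.shape[1]))
    for t in range(T):
        c = f[t] * c + (1 - f[t]) * z[t]
        h[t] = o[t] * c
    return h
```

Because the recurrence carries no matrix multiply, training and sampling avoid the per-step affine cost that slows a conventional RNN.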
Low latency modeling of temporal contexts for speech recognition
This thesis focuses on the development of neural network acoustic models for large vocabulary continuous speech recognition (LVCSR) that satisfy the design goals of low latency and low computational complexity. Low latency enables online speech recognition, while low computational complexity reduces the computational cost of both training and inference.
Long-span sequential dependencies and sequential distortions in the input vector sequence are a major challenge in acoustic modeling. Recurrent neural networks have been shown to model these dependencies effectively. In particular, bidirectional long short-term memory (BLSTM) networks provide state-of-the-art performance across several LVCSR tasks. However, deploying bidirectional models for online LVCSR is non-trivial due to their large latency, so unidirectional LSTM models are typically preferred.
In this thesis we explore the use of hierarchical temporal convolution to model long-span temporal dependencies with time-delay neural networks (TDNNs), and propose a sub-sampled variant of these networks. The sub-sampled TDNNs reduce computational complexity by roughly 5x compared to standard TDNNs during frame-randomized pre-training. They are effective at modeling long-span temporal contexts, but a performance gap remains compared to (B)LSTMs.
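A TDNN layer with sub-sampling can be sketched in a few lines of numpy. This is an illustrative sketch, not the thesis code: the function name, offset sets, and edge-padding-by-clipping are assumptions. The computational saving comes from splicing only a sparse set of temporal offsets (e.g. {-3, 3}) instead of a full contiguous context, which is equivalent to a dilated 1-D convolution.

```python
import numpy as np

def tdnn_layer(x, W, offsets):
    """One TDNN layer: splice frames at the given temporal offsets and
    apply a shared affine transform plus ReLU (a dilated 1-D convolution).

    x: (T, D) input; W: (len(offsets)*D, H). A sub-sampled layer uses a
    sparse offset set such as (-3, 3) rather than the full context
    (-3, ..., 3), cutting per-frame computation by the sparsity factor.
    """
    T, D = x.shape
    spliced = np.concatenate(
        [x[np.clip(np.arange(T) + off, 0, T - 1)] for off in offsets],
        axis=1)                        # (T, len(offsets)*D)
    return np.maximum(spliced @ W, 0)  # shared affine + ReLU
```

Stacking such layers widens the receptive field multiplicatively, which is how the hierarchy captures long-span context at low cost.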
As recent advancements in acoustic model training have eliminated the need for frame-randomized pre-training, we modify the TDNN architecture to use higher sampling rates, as the increased computation can be amortized over the sequence. These variants of sub-sampled TDNNs outperform unidirectional LSTM networks while also affording a lower real-time factor (RTF) during inference. However, we show that BLSTM models outperform both the TDNN and LSTM models.
We propose a hybrid architecture interleaving temporal convolution and LSTM layers, which is shown to outperform the BLSTM models. We further improve the BLSTM models by using higher frame rates at lower layers, and show that the proposed TDNN-LSTM model performs similarly to these superior BLSTM models while reducing the overall latency to 200 ms.
Finally, we describe an online system for reverberation-robust ASR, using the above-described models in conjunction with data augmentation techniques such as reverberation simulation, which simulates far-field environments, and volume perturbation, which helps tackle volume variation even without gain normalization.
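The two augmentation techniques named above can be sketched together in numpy. This is a hedged illustration, not the system's implementation: the function name, gain range, and impulse-response handling are assumptions. Reverberation simulation convolves clean speech with a room impulse response; volume perturbation scales the waveform by a random gain.

```python
import numpy as np

def augment(wave, rir, rng, gain_range=(0.125, 2.0)):
    """Far-field augmentation sketch: convolve clean speech with a room
    impulse response (reverberation simulation), then apply a random
    gain (volume perturbation). gain_range is an illustrative choice,
    not a value from the thesis.
    """
    reverbed = np.convolve(wave, rir)[:len(wave)]  # keep original length
    gain = rng.uniform(*gain_range)
    return gain * reverbed
```

Training on such perturbed copies exposes the acoustic model to level and reverberation variation it will meet at test time, which is why gain normalization becomes unnecessary.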
On-device Efficient Acoustic Modeling with Simple Gated Convolutional Networks
Thesis (Master's) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, February 2019. Wonyong Sung.
Today, artificial neural network based algorithms play a central role in automatic speech recognition systems. At the same time, demand is growing for on-device speech recognition, which runs on smartphones and embedded devices without going through a server. In on-device recognition the user's speech is never handed over to the service provider's server; recognition is carried out entirely on the user's device, so concerns about privacy violations and security can be largely resolved.
However, the LSTM-based recurrent neural networks (RNNs) mainly used in neural speech recognition systems are not efficient for on-device recognition. An LSTM RNN is difficult to parallelize over a sequence, because the current time step depends on the previous one through a feedback connection. Moreover, this feedback state is too large to fit in cache memory, so it must be fetched from DRAM at every time step, which increases both power consumption and execution time.
In this thesis we propose on-device friendly neural network models that replace the LSTM RNN in acoustic modeling: the Gated ConvNet, the Diagonal LSTM, and the QRNN (quasi-RNN). Most of the operations in these models carry no sequential dependence, so they can be parallelized across time steps.
Once a one-dimensional depthwise convolution is added, these models clearly surpass the performance of the LSTM RNN in automatic speech recognition; in particular, a deep Gated ConvNet shows the best performance among the acoustic models. Above all, through parallelization across time steps the proposed on-device efficient models achieve at least a 5x speedup over the LSTM RNN on a real embedded device.
Going further, we propose the Simple Gated ConvNet. It is based on the simplest form of the Gated ConvNet and reduces the parameter count dramatically, a property well suited to on-device speech recognition under tight hardware constraints. The Simple Gated ConvNet also has no sequential dependence between time steps, so parallelization across time steps remains possible, and applying one-dimensional depthwise convolution along several directions further improves its performance. Concretely, the Simple Gated ConvNet keeps the parameter count under 3 M; given the same number of parameters, it outperforms both the LSTM RNN and the Gated ConvNet in automatic speech recognition, and a 3 M Simple Gated ConvNet even outperforms a 10 M LSTM. Through parallelization across time steps it also achieves a 10x speedup over the LSTM RNN on an ARM CPU.
Automatic speech recognition (ASR) has been widely adopted on smartphones and many embedded devices in recent years, and neural network based algorithms show the best performance for ASR. While most ASR systems rely on server-based processing, there is an increasing demand for on-device speech recognition because of privacy concerns and low-latency processing. Reducing power consumption is especially important for on-device speech recognition, to lengthen the battery life.
Among several neural network models, recurrent neural network (RNN) based algorithms are mostly used for speech recognition, and the long short-term memory (LSTM) RNN is the most popular because of its superior performance over the others. However, executing an LSTM RNN demands many DRAM accesses, because the cache of an embedded device is usually much smaller than the parameter size of the RNN. The multi-time step parallelization technique computes multiple output samples at a time while fetching one set of parameters, and thus reduces the number of DRAM accesses in proportion to the number of time steps computed at once. However, the LSTM RNN does not permit multi-time step parallelization because of the complex feedback structure of the model.
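The reason a gated convolutional model permits multi-time step parallelization can be sketched in numpy. This is an illustrative sketch under assumed names and shapes, not the thesis implementation: because no output feeds back into the network, the weights are fetched once and reused across every step of a block, which is exactly the amortization an LSTM's feedback loop prevents.

```python
import numpy as np

def gated_conv_block(x, Wc, Wg, K):
    """Gated 1-D convolution evaluated for a whole block of time steps
    with a single fetch of the weights Wc, Wg (multi-time step
    parallelization). x: (T, D); Wc, Wg: (K*D, H); K: filter width.
    """
    T, D = x.shape
    # Causal windows: each row holds the K most recent input frames.
    pad = np.vstack([np.zeros((K - 1, D)), x])
    win = np.stack([pad[t:t + K].ravel() for t in range(T)])  # (T, K*D)
    a = win @ Wc                        # linear path, all steps at once
    g = 1 / (1 + np.exp(-(win @ Wg)))   # gate, all steps at once
    return a * g                        # gated output, (T, H)
```

An LSTM cannot be batched this way: its step-t matrix products need h[t-1], so the parameters must be streamed from DRAM once per step rather than once per block.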
This thesis presents neural network models that support efficient on-device speech recognition. First, a few models that permit multi-time step parallel processing are evaluated: the Gated ConvNet, the Diagonal LSTM, and the QRNN (quasi-RNN). Since the performance of these models is not as good as the LSTM, a one-dimensional depthwise convolution is added; this convolution helps capture the temporal patterns of the speech signal. Second, the Simple Gated Convolutional Network (Simple Gated ConvNet) is proposed for improved performance when the parameter count is very small. It employs the simplest form of the Gated ConvNet and instead relies on one-dimensional convolution for temporal modeling. The Simple Gated ConvNet supports low-power on-device speech recognition because it can be executed with multi-time step parallelization. A Simple Gated ConvNet with under 3 million parameters even outperforms an LSTM with 10 million parameters. In addition, through multi-time step parallelization, its execution speed on an ARM CPU can be increased more than ten times compared with the LSTM RNN.
1 Introduction
1.1 On-device speech recognition: advantages and challenges
1.2 The components of speech recognition
1.3 The downsides of RNN based acoustic models
1.4 Exploration of efficient on-device acoustic modeling with neural networks
1.5 Simple Gated ConvNet for small footprint acoustic modeling
1.6 Outline of the thesis
2 Exploration of Efficient On-device Acoustic Modeling with Neural Networks
2.1 Acoustic Modeling Algorithms
2.1.1 Diagonal LSTM RNN
2.1.2 QRNN
2.1.3 Gated ConvNet
2.2 Experiments
2.2.1 End-to-end speech recognition
2.2.2 Phoneme classification
2.2.3 Implementation Results on Embedded Systems
2.3 Concluding Remarks
3 Simple Gated Convolutional Networks for Small Footprint Acoustic Modeling
3.1 Simple Gated ConvNet
3.1.1 Gated ConvNet
3.1.2 Simple Gated ConvNet
3.2 Experiments
3.2.1 Experiment Setups
3.2.2 Experimental Results
3.3 Concluding Remarks
4 Conclusions
Abstract (In Korean)