282 research outputs found

    Exploring efficient neural architectures for linguistic-acoustic mapping in text-to-speech

    Conversion from text to speech relies on the accurate mapping from linguistic to acoustic symbol sequences, for which current practice employs recurrent statistical models such as recurrent neural networks. Despite the good performance of such models (in terms of low distortion in the generated speech), their recursive structure with intermediate affine transformations tends to make them slow to train and to sample from. In this work, we explore two different mechanisms that enhance the operational efficiency of recurrent neural networks, and study their performance–speed trade-off. The first mechanism is based on the quasi-recurrent neural network, where expensive affine transformations are removed from the temporal connections and placed only along the feed-forward computational directions. The second mechanism includes a module based on the transformer decoder network, designed without recurrent connections but emulating them with attention and positional codes. Our results show that the proposed decoder networks are competitive in terms of distortion when compared to a recurrent baseline, whilst being significantly faster in terms of CPU and GPU inference time. The best performing model is the one based on the quasi-recurrent mechanism, reaching the same level of naturalness as the recurrent neural network based model with a speedup of 11.2× on CPU and 3.3× on GPU. Peer Reviewed. Postprint (published version).
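
    The quasi-recurrent mechanism can be illustrated with a short sketch. The following is a minimal PyTorch rendering of the general QRNN idea with fo-pooling, not the paper's actual decoder; the class name, dimensions, and kernel width are illustrative assumptions. All affine transformations sit in a feed-forward causal convolution, so the only per-step operation left is a cheap element-wise gated pooling, which is what makes training and sampling fast.

```python
import torch
import torch.nn as nn

class QRNNLayer(nn.Module):
    """Sketch of a quasi-recurrent layer (illustrative, not the paper's code).

    A causal convolution computes the candidate and gates for every time
    step in parallel; the recurrence is reduced to element-wise fo-pooling.
    """

    def __init__(self, in_dim: int, hid_dim: int, kernel: int = 2):
        super().__init__()
        # One convolution produces candidate z, forget gate f, output gate o.
        self.conv = nn.Conv1d(in_dim, 3 * hid_dim, kernel, padding=kernel - 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, time, in_dim)
        c = self.conv(x.transpose(1, 2))[..., : x.size(1)]  # trim to causal length
        z, f, o = c.transpose(1, 2).chunk(3, dim=-1)
        z, f, o = torch.tanh(z), torch.sigmoid(f), torch.sigmoid(o)
        h = torch.zeros_like(z[:, 0])
        outs = []
        for t in range(z.size(1)):  # element-wise only: no affine op per step
            h = f[:, t] * h + (1 - f[:, t]) * z[:, t]  # fo-pooling state update
            outs.append(o[:, t] * h)
        return torch.stack(outs, dim=1)  # (batch, time, hid_dim)
```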

    Low latency modeling of temporal contexts for speech recognition

    This thesis focuses on the development of neural network acoustic models for large vocabulary continuous speech recognition (LVCSR) that satisfy the design goals of low latency and low computational complexity. Low latency enables online speech recognition, and low computational complexity reduces the cost of both training and inference. Long-span sequential dependencies and sequential distortions in the input vector sequence are a major challenge in acoustic modeling. Recurrent neural networks have been shown to model these dependencies effectively; in particular, bidirectional long short-term memory (BLSTM) networks provide state-of-the-art performance across several LVCSR tasks. However, the deployment of bidirectional models for online LVCSR is non-trivial due to their large latency, so unidirectional LSTM models are typically preferred. In this thesis we explore the use of hierarchical temporal convolution to model long-span temporal dependencies. We propose a sub-sampled variant of these temporal convolution networks, which are termed time-delay neural networks (TDNNs). The sub-sampled TDNNs reduce the computational complexity by ~5x compared to standard TDNNs during frame-randomized pre-training. These models are shown to be effective in modeling long-span temporal contexts, but a performance gap remains compared to (B)LSTMs. As recent advancements in acoustic model training have eliminated the need for frame-randomized pre-training, we modify the TDNN architecture to use higher sampling rates, since the increased computation can be amortized over the sequence. These variants of sub-sampled TDNNs provide performance superior to unidirectional LSTM networks, while also affording a lower real-time factor (RTF) during inference. However, we show that BLSTM models outperform both the TDNN and LSTM models. We propose a hybrid architecture interleaving temporal convolution and LSTM layers, which is shown to outperform the BLSTM models. We further improve these BLSTM models by using higher frame rates at lower layers, and show that the proposed TDNN-LSTM model performs similarly to these superior BLSTM models while reducing the overall latency to 200 ms. Finally, we describe an online system for reverberation-robust ASR that uses the above models in conjunction with data augmentation techniques such as reverberation simulation, which simulates far-field environments, and volume perturbation, which helps tackle volume variation even without gain normalization.
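
    The sub-sampled temporal convolution can be sketched with dilated 1-D convolutions, a common way to implement TDNN frame splicing. This is a hedged sketch: the layer widths, context offsets, and target count below are illustrative assumptions, not the thesis's configuration.

```python
import torch.nn as nn

class TDNN(nn.Module):
    """Sketch of a sub-sampled TDNN stack (illustrative configuration).

    Each layer splices frames at fixed temporal offsets; using dilations
    instead of dense contexts gives the higher layers a wide receptive
    field while keeping the per-frame computation low.
    """

    def __init__(self, feat_dim: int = 40, hid: int = 512, n_targets: int = 2000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, hid, kernel_size=5, dilation=1), nn.ReLU(),  # context [-2, +2]
            nn.Conv1d(hid, hid, kernel_size=3, dilation=2), nn.ReLU(),       # splice {-2, 0, +2}
            nn.Conv1d(hid, hid, kernel_size=3, dilation=3), nn.ReLU(),       # splice {-3, 0, +3}
            nn.Conv1d(hid, hid, kernel_size=3, dilation=6), nn.ReLU(),       # splice {-6, 0, +6}
            nn.Conv1d(hid, n_targets, kernel_size=1),                        # per-frame scores
        )

    def forward(self, x):  # x: (batch, feat_dim, time)
        return self.net(x)
```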

    On-device Efficient Acoustic Modeling with Simple Gated Convolutional Networks

    Master's thesis (석사), Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, February 2019. Advisor: μ„±μ›μš©.
    Automatic speech recognition (ASR) has been widely adopted on smartphones and many embedded devices in recent years, and neural network based algorithms show the best performance for ASR.
    While most ASR systems rely on server-side processing, there is an increasing demand for on-device speech recognition because of privacy concerns and the need for low-latency processing. Reducing power consumption is especially important for on-device speech recognition to lengthen battery life. Among neural network models, recurrent neural network (RNN) based algorithms are the most widely used for speech recognition, and the long short-term memory (LSTM) RNN is the most popular because of its superior performance. However, executing an LSTM RNN demands many DRAM accesses, because the cache of an embedded device is usually much smaller than the parameter size of the RNN. The multi-time step parallelization technique computes multiple output samples at a time after fetching one set of parameters, and thus reduces the number of DRAM accesses in proportion to the number of time steps computed at once. However, the LSTM RNN does not permit multi-time step parallelization because of the complex feedback structure of the model. This thesis presents neural network models that support efficient on-device speech recognition. First, several models that permit multi-time step parallel processing are evaluated: Gated ConvNet, Diagonal LSTM, and QRNN (quasi-RNN). Since the performance of these models is not as good as that of the LSTM, one-dimensional depthwise convolution is added, which helps capture the temporal patterns of the speech signal; with this addition the models clearly surpass the LSTM RNN, and on an actual embedded device they run at least five times faster. Second, the Simple Gated Convolutional Network (Simple Gated ConvNet) is proposed for improved performance when the parameter count is very small. The Simple Gated ConvNet employs the simplest form of the Gated ConvNet and instead relies on one-dimensional depthwise convolution for temporal modeling, a property that is advantageous under the hardware constraints of on-device recognition. It supports low-power on-device speech recognition because it can be executed with multi-time step parallelization. A Simple Gated ConvNet with fewer than 3 million parameters even shows better performance than an LSTM with 10 million parameters, and its execution speed on an ARM CPU can be increased more than ten-fold compared with the LSTM RNN through multi-time step parallelization.

    Contents:
    1 Introduction
    1.1 On-device speech recognition: advantages and challenges
    1.2 The components of speech recognition
    1.3 The downsides of RNN based acoustic models
    1.4 Exploration of efficient on-device acoustic modeling with neural networks
    1.5 Simple Gated ConvNet for small footprint acoustic modeling
    1.6 Outline of the thesis
    2 Exploration of Efficient On-device Acoustic Modeling with Neural Networks
    2.1 Acoustic Modeling Algorithms
    2.1.1 Diagonal LSTM RNN
    2.1.2 QRNN
    2.1.3 Gated ConvNet
    2.2 Experiments
    2.2.1 End-to-end speech recognition
    2.2.2 Phoneme classification
    2.2.3 Implementation Results on Embedded Systems
    2.3 Concluding Remarks
    3 Simple Gated Convolutional Networks for small footprint acoustic modeling
    3.1 Simple Gated ConvNet
    3.1.1 Gated ConvNet
    3.1.2 Simple Gated ConvNet
    3.2 Experiments
    3.2.1 Experiment Setups
    3.2.2 Experimental Results
    3.3 Concluding Remarks
    4 Conclusions
    Abstract (In Korean)
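
    A gated layer built from a 1-D depthwise convolution, in the spirit of the Simple Gated ConvNet, can be sketched as below. The thesis's exact architecture differs; the class name, width, and kernel size here are illustrative assumptions. The point of the sketch is the hardware property: no time step depends on a previous output, so an entire utterance can be computed from a single fetch of the weights (multi-time step parallelization), unlike an LSTM's feedback loop.

```python
import torch
import torch.nn as nn

class GatedDepthwiseConv(nn.Module):
    """Sketch of a gated 1-D depthwise convolutional layer (illustrative,
    not the thesis's exact Simple Gated ConvNet).

    With groups == channels, each channel has its own small temporal
    filter, keeping the parameter count tiny; and since no time step
    depends on a previous output, all steps can be computed in parallel.
    """

    def __init__(self, dim: int = 256, kernel: int = 7):
        super().__init__()
        pad = kernel // 2
        self.content = nn.Conv1d(dim, dim, kernel, padding=pad, groups=dim)
        self.gate = nn.Conv1d(dim, dim, kernel, padding=pad, groups=dim)

    def forward(self, x):  # x: (batch, dim, time)
        # GLU-style gating: the sigmoid branch decides what passes through.
        return self.content(x) * torch.sigmoid(self.gate(x))
```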