282 research outputs found

    Exploring efficient neural architectures for linguistic-acoustic mapping in text-to-speech

    Conversion from text to speech relies on the accurate mapping from linguistic to acoustic symbol sequences, for which current practice employs recurrent statistical models such as recurrent neural networks. Despite the good performance of such models (in terms of low distortion in the generated speech), their recursive structure with intermediate affine transformations tends to make them slow to train and to sample from. In this work, we explore two different mechanisms that enhance the operational efficiency of recurrent neural networks, and study their performance–speed trade-off. The first mechanism is based on the quasi-recurrent neural network, where expensive affine transformations are removed from the temporal connections and placed only along the feed-forward computational directions. The second mechanism includes a module based on the transformer decoder network, designed without recurrent connections but emulating them with attention and positional codes. Our results show that the proposed decoder networks are competitive in terms of distortion when compared to a recurrent baseline, whilst being significantly faster in terms of CPU and GPU inference time. The best performing model is the one based on the quasi-recurrent mechanism, reaching the same level of naturalness as the recurrent neural network based model with a speedup of 11.2× on CPU and 3.3× on GPU. Peer Reviewed. Postprint (published version).
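
    The quasi-recurrent mechanism can be illustrated with a short sketch. The following is a minimal PyTorch rendering of the general QRNN idea with fo-pooling, not the paper's actual decoder; the class name, dimensions, and kernel width are illustrative assumptions. All affine transformations sit in a feed-forward causal convolution, so the only per-step operation left is a cheap element-wise gated pooling, which is what makes training and sampling fast.

```python
import torch
import torch.nn as nn

class QRNNLayer(nn.Module):
    """Sketch of a quasi-recurrent layer (illustrative, not the paper's code).

    A causal convolution computes the candidate and gates for every time
    step in parallel; the recurrence is reduced to element-wise fo-pooling.
    """

    def __init__(self, in_dim: int, hid_dim: int, kernel: int = 2):
        super().__init__()
        # One convolution produces candidate z, forget gate f, output gate o.
        self.conv = nn.Conv1d(in_dim, 3 * hid_dim, kernel, padding=kernel - 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, time, in_dim)
        c = self.conv(x.transpose(1, 2))[..., : x.size(1)]  # trim to causal length
        z, f, o = c.transpose(1, 2).chunk(3, dim=-1)
        z, f, o = torch.tanh(z), torch.sigmoid(f), torch.sigmoid(o)
        h = torch.zeros_like(z[:, 0])
        outs = []
        for t in range(z.size(1)):  # element-wise only: no affine op per step
            h = f[:, t] * h + (1 - f[:, t]) * z[:, t]  # fo-pooling state update
            outs.append(o[:, t] * h)
        return torch.stack(outs, dim=1)  # (batch, time, hid_dim)
```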

    Low latency modeling of temporal contexts for speech recognition

    This thesis focuses on the development of neural network acoustic models for large vocabulary continuous speech recognition (LVCSR) that satisfy the design goals of low latency and low computational complexity. Low latency enables online speech recognition, and low computational complexity reduces the cost of both training and inference. Long-span sequential dependencies and sequential distortions in the input vector sequence are a major challenge in acoustic modeling. Recurrent neural networks have been shown to model these dependencies effectively; in particular, bidirectional long short-term memory (BLSTM) networks provide state-of-the-art performance across several LVCSR tasks. However, the deployment of bidirectional models for online LVCSR is non-trivial due to their large latency, so unidirectional LSTM models are typically preferred. In this thesis we explore the use of hierarchical temporal convolution to model long-span temporal dependencies. We propose a sub-sampled variant of these temporal convolution networks, which are termed time-delay neural networks (TDNNs). The sub-sampled TDNNs reduce the computational complexity by ~5x compared to standard TDNNs during frame-randomized pre-training. These models are shown to be effective in modeling long-span temporal contexts, but a performance gap remains compared to (B)LSTMs. As recent advancements in acoustic model training have eliminated the need for frame-randomized pre-training, we modify the TDNN architecture to use higher sampling rates, since the increased computation can be amortized over the sequence. These variants of sub-sampled TDNNs provide performance superior to unidirectional LSTM networks, while also affording a lower real-time factor (RTF) during inference. However, we show that BLSTM models outperform both the TDNN and LSTM models. We propose a hybrid architecture interleaving temporal convolution and LSTM layers, which is shown to outperform the BLSTM models. We further improve these BLSTM models by using higher frame rates at lower layers, and show that the proposed TDNN-LSTM model performs similarly to these superior BLSTM models while reducing the overall latency to 200 ms. Finally, we describe an online system for reverberation-robust ASR that uses the above models in conjunction with data augmentation techniques such as reverberation simulation, which simulates far-field environments, and volume perturbation, which helps tackle volume variation even without gain normalization.
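
    The sub-sampled temporal convolution can be sketched with dilated 1-D convolutions, a common way to implement TDNN frame splicing. This is a hedged sketch: the layer widths, context offsets, and target count below are illustrative assumptions, not the thesis's configuration.

```python
import torch.nn as nn

class TDNN(nn.Module):
    """Sketch of a sub-sampled TDNN stack (illustrative configuration).

    Each layer splices frames at fixed temporal offsets; using dilations
    instead of dense contexts gives the higher layers a wide receptive
    field while keeping the per-frame computation low.
    """

    def __init__(self, feat_dim: int = 40, hid: int = 512, n_targets: int = 2000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, hid, kernel_size=5, dilation=1), nn.ReLU(),  # context [-2, +2]
            nn.Conv1d(hid, hid, kernel_size=3, dilation=2), nn.ReLU(),       # splice {-2, 0, +2}
            nn.Conv1d(hid, hid, kernel_size=3, dilation=3), nn.ReLU(),       # splice {-3, 0, +3}
            nn.Conv1d(hid, hid, kernel_size=3, dilation=6), nn.ReLU(),       # splice {-6, 0, +6}
            nn.Conv1d(hid, n_targets, kernel_size=1),                        # per-frame scores
        )

    def forward(self, x):  # x: (batch, feat_dim, time)
        return self.net(x)
```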

    On-device Efficient Acoustic Modeling with Simple Gated Convolutional Networks

    Master's thesis (석사), Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, February 2019. Advisor: μ„±μ›μš©.
    Automatic speech recognition (ASR) has been widely adopted on smartphones and many embedded devices in recent years, and neural network based algorithms show the best performance for ASR.
    While most ASR systems rely on server-side processing, there is an increasing demand for on-device speech recognition because of privacy concerns and the need for low-latency processing. Reducing power consumption is especially important for on-device speech recognition to lengthen battery life. Among neural network models, recurrent neural network (RNN) based algorithms are the most widely used for speech recognition, and the long short-term memory (LSTM) RNN is the most popular because of its superior performance. However, executing an LSTM RNN demands many DRAM accesses, because the cache of an embedded device is usually much smaller than the parameter size of the RNN. The multi-time step parallelization technique computes multiple output samples at a time after fetching one set of parameters, and thus reduces the number of DRAM accesses in proportion to the number of time steps computed at once. However, the LSTM RNN does not permit multi-time step parallelization because of the complex feedback structure of the model. This thesis presents neural network models that support efficient on-device speech recognition. First, several models that permit multi-time step parallel processing are evaluated: Gated ConvNet, Diagonal LSTM, and QRNN (quasi-RNN). Since the performance of these models is not as good as that of the LSTM, one-dimensional depthwise convolution is added, which helps capture the temporal patterns of the speech signal; with this addition the models clearly surpass the LSTM RNN, and on an actual embedded device they run at least five times faster. Second, the Simple Gated Convolutional Network (Simple Gated ConvNet) is proposed for improved performance when the parameter count is very small. The Simple Gated ConvNet employs the simplest form of the Gated ConvNet and instead relies on one-dimensional depthwise convolution for temporal modeling, a property that is advantageous under the hardware constraints of on-device recognition. It supports low-power on-device speech recognition because it can be executed with multi-time step parallelization. A Simple Gated ConvNet with fewer than 3 million parameters even shows better performance than an LSTM with 10 million parameters, and its execution speed on an ARM CPU can be increased more than ten-fold compared with the LSTM RNN through multi-time step parallelization.

    Contents:
    1 Introduction
    1.1 On-device speech recognition: advantages and challenges
    1.2 The components of speech recognition
    1.3 The downsides of RNN based acoustic models
    1.4 Exploration of efficient on-device acoustic modeling with neural networks
    1.5 Simple Gated ConvNet for small footprint acoustic modeling
    1.6 Outline of the thesis
    2 Exploration of Efficient On-device Acoustic Modeling with Neural Networks
    2.1 Acoustic Modeling Algorithms
    2.1.1 Diagonal LSTM RNN
    2.1.2 QRNN
    2.1.3 Gated ConvNet
    2.2 Experiments
    2.2.1 End-to-end speech recognition
    2.2.2 Phoneme classification
    2.2.3 Implementation Results on Embedded Systems
    2.3 Concluding Remarks
    3 Simple Gated Convolutional Networks for small footprint acoustic modeling
    3.1 Simple Gated ConvNet
    3.1.1 Gated ConvNet
    3.1.2 Simple Gated ConvNet
    3.2 Experiments
    3.2.1 Experiment Setups
    3.2.2 Experimental Results
    3.3 Concluding Remarks
    4 Conclusions
    Abstract (In Korean)
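
    A gated layer built from a 1-D depthwise convolution, in the spirit of the Simple Gated ConvNet, can be sketched as below. The thesis's exact architecture differs; the class name, width, and kernel size here are illustrative assumptions. The point of the sketch is the hardware property: no time step depends on a previous output, so an entire utterance can be computed from a single fetch of the weights (multi-time step parallelization), unlike an LSTM's feedback loop.

```python
import torch
import torch.nn as nn

class GatedDepthwiseConv(nn.Module):
    """Sketch of a gated 1-D depthwise convolutional layer (illustrative,
    not the thesis's exact Simple Gated ConvNet).

    With groups == channels, each channel has its own small temporal
    filter, keeping the parameter count tiny; and since no time step
    depends on a previous output, all steps can be computed in parallel.
    """

    def __init__(self, dim: int = 256, kernel: int = 7):
        super().__init__()
        pad = kernel // 2
        self.content = nn.Conv1d(dim, dim, kernel, padding=pad, groups=dim)
        self.gate = nn.Conv1d(dim, dim, kernel, padding=pad, groups=dim)

    def forward(self, x):  # x: (batch, dim, time)
        # GLU-style gating: the sigmoid branch decides what passes through.
        return self.content(x) * torch.sigmoid(self.gate(x))
```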