
    Improved training for online end-to-end speech recognition systems

    Achieving high accuracy with end-to-end speech recognizers requires careful parameter initialization prior to training. Otherwise, the networks may fail to find a good local optimum. This is particularly true for online networks, such as unidirectional LSTMs. Currently, the best strategy to train such systems is to bootstrap the training from a tied-triphone system. However, this is time consuming and, more importantly, impossible for languages without a high-quality pronunciation lexicon. In this work, we propose an initialization strategy that uses teacher-student learning to transfer knowledge from a large, well-trained, offline end-to-end speech recognition model to an online end-to-end model, eliminating the need for a lexicon or any other linguistic resources. We also explore curriculum learning and label smoothing and show how they can be combined with the proposed teacher-student learning for further improvements. We evaluate our methods on a Microsoft Cortana personal assistant task and show that the proposed method results in a 19% relative improvement in word error rate compared to a randomly-initialized baseline system. Comment: Interspeech 201
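
    The teacher-student transfer described above can be illustrated with a frame-level knowledge-distillation step. The sketch below is a minimal PyTorch example, assuming hypothetical `teacher` and `student` modules that map a feature tensor to per-frame output logits; it is an illustration of the general technique, not the paper's implementation.

```python
# Minimal sketch of frame-level teacher-student training: the online
# (unidirectional) student is trained to match the output distribution of
# the offline (bidirectional) teacher via KL divergence. `teacher`,
# `student`, and the data pipeline are placeholders.
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, features, optimizer, T=1.0):
    """One training step; `features` is a (batch, time, feat_dim) tensor."""
    with torch.no_grad():
        teacher_logits = teacher(features)          # soft targets, no gradient
    student_logits = student(features)
    # KL(teacher || student) over the per-frame output label distribution
    loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```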

    Online Speech Recognition Using Recurrent Neural Networks

    Doctoral dissertation (Ph.D.), Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2017. Advisor: Wonyong Sung. Recurrent neural networks (RNNs) have shown outstanding sequence-to-sequence modeling performance.
    Thanks to recent advances in end-to-end training approaches for automatic speech recognition (ASR), RNNs can learn direct mapping functions from a sequence of audio features to a sequence of output characters or words without any intermediate phoneme or lexicon layers. So far, the majority of studies on end-to-end ASR have focused on raising recognition accuracy to the level of traditional state-of-the-art models. However, although end-to-end ASR models have reached the accuracy of traditional systems, their application has usually been limited to utterance-level speech recognition with pre-segmented audio, rather than online speech recognition with continuous audio. This is because RNNs trained on segmented audio cannot easily generalize to very long streams of audio. To address this problem, we propose an RNN training approach for sequences of virtually infinite length. Specifically, we describe an efficient GPU-based RNN training framework for the truncated backpropagation through time (BPTT) algorithm, which is suitable for online (continuous) training. Then, we present an online version of the connectionist temporal classification (CTC) loss computation algorithm, in which the original CTC loss is estimated with a partial sliding window. This modified CTC algorithm can be directly employed for truncated-BPTT-based RNN training. In addition, a fully RNN-based end-to-end online ASR model is proposed. The model is composed of an acoustic RNN with CTC output and a character-level RNN language model (LM) that is augmented with a hierarchical structure. Prefix-tree-based beam search decoding is employed with a new beam pruning algorithm to prevent exponential growth of the tree. The model is free of phoneme or lexicon models and can decode infinitely long audio sequences. Also, this model has a very small memory footprint compared to other end-to-end systems while showing competitive accuracy. Furthermore, we propose an improved character-level RNN LM with a hierarchical structure. This character-level RNN LM shows improved perplexity compared to a lightweight word-level RNN LM of comparable size.
    When this RNN LM is applied to the proposed character-level online ASR, better speech recognition accuracy can be achieved with a reduced amount of computation.

    Contents:
    1 Introduction
      1.1 Automatic Speech Recognition
        1.1.1 Traditional ASR
        1.1.2 End-to-End ASR with Recurrent Neural Networks
        1.1.3 Offline and Online ASR
      1.2 Scope of the Dissertation
        1.2.1 End-to-End Online ASR with RNNs
        1.2.2 Challenges and Contributions
    2 Flexible and Efficient RNN Training on GPUs
      2.1 Introduction
      2.2 Generalization
        2.2.1 Generalized RNN Structure
        2.2.2 Training
      2.3 Parallelization
        2.3.1 Intra-Stream Parallelism
        2.3.2 Inter-Stream Parallelism
      2.4 Experiments
      2.5 Concluding Remarks
    3 Online Sequence Training with Connectionist Temporal Classification
      3.1 Introduction
      3.2 Connectionist Temporal Classification
      3.3 Online Sequence Training
        3.3.1 Problem Definition
        3.3.2 Overview of the Proposed Approach
        3.3.3 CTC-TR: Standard CTC with Truncation
        3.3.4 CTC-EM: EM-Based Online CTC
      3.4 Training Continuously Running RNNs
      3.5 Parallel Training
      3.6 Experiments
        3.6.1 End-to-End Speech Recognition with RNNs
        3.6.2 Phoneme Recognition on TIMIT
      3.7 Concluding Remarks
    4 Character-Level Incremental Speech Recognition
      4.1 Introduction
      4.2 Models
        4.2.1 Acoustic Model
        4.2.2 Language Model
      4.3 Character-Level Beam Search
        4.3.1 Prefix-Tree-Based CTC Beam Search
        4.3.2 Pruning
      4.4 Experiments
      4.5 Concluding Remarks
    5 Character-Level Language Modeling with Hierarchical RNNs
      5.1 Introduction
      5.2 Related Work
        5.2.1 Character-Level Language Modeling with RNNs
        5.2.2 Character-Aware Word-Level Language Modeling
      5.3 RNNs with External Clock and Reset Signals
      5.4 Character-Level Language Modeling with a Hierarchical RNN
      5.5 Experiments
        5.5.1 Perplexity
        5.5.2 End-to-End Automatic Speech Recognition (ASR)
      5.6 Concluding Remarks
    6 Conclusion
    Bibliography
    Abstract in Korean
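
    The training scheme sketched in the abstract, carrying RNN state across a continuous stream while truncating gradients at chunk boundaries, can be illustrated roughly as follows. This is a minimal PyTorch sketch in the spirit of the CTC-TR variant: it assumes a hypothetical `stream_of_chunks()` iterator that yields each audio chunk together with the label subsequence falling inside it, which is exactly the requirement the dissertation's CTC-EM algorithm is designed to relax.

```python
# Truncated-BPTT training on a continuous stream: LSTM state is carried
# across fixed-size chunks, but gradients are cut at chunk boundaries.
import torch
import torch.nn as nn

rnn = nn.LSTM(input_size=40, hidden_size=256, batch_first=True)
proj = nn.Linear(256, 30)                 # e.g. 29 characters + CTC blank (id 0)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
opt = torch.optim.Adam(list(rnn.parameters()) + list(proj.parameters()))

state = None
# `stream_of_chunks()` is a hypothetical iterator yielding
# (chunk, labels, label_lens); labels are concatenated 1-D targets.
for chunk, labels, label_lens in stream_of_chunks():
    out, state = rnn(chunk, state)
    state = tuple(s.detach() for s in state)            # truncate BPTT here
    log_probs = proj(out).log_softmax(-1).transpose(0, 1)   # (T, B, C)
    input_lens = torch.full((chunk.size(0),), log_probs.size(0),
                            dtype=torch.long)
    loss = ctc(log_probs, labels, input_lens, label_lens)
    opt.zero_grad()
    loss.backward()
    opt.step()
```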

    RWTH ASR Systems for LibriSpeech: Hybrid vs Attention -- w/o Data Augmentation

    We present state-of-the-art automatic speech recognition (ASR) systems employing a standard hybrid DNN/HMM architecture compared to an attention-based encoder-decoder design for the LibriSpeech task. Detailed descriptions of the system development, including model design, pretraining schemes, training schedules, and optimization approaches, are provided for both system architectures. Both the hybrid DNN/HMM and the attention-based system employ bi-directional LSTMs for acoustic modeling/encoding. For language modeling, we employ both LSTM and Transformer based architectures. All our systems are built using RWTH's open-source toolkits RASR and RETURNN. To the best knowledge of the authors, the results obtained when training on the full LibriSpeech training set are the best published currently, both for the hybrid DNN/HMM and the attention-based systems. Our single hybrid system even outperforms previous results obtained from combining eight single systems. Our comparison shows that on the LibriSpeech 960h task, the hybrid DNN/HMM system outperforms the attention-based system by 15% relative on the clean and 40% relative on the other test sets in terms of word error rate. Moreover, experiments on a reduced 100h subset of the LibriSpeech training corpus show an even more pronounced margin between the hybrid DNN/HMM and attention-based architectures. Comment: Proceedings of INTERSPEECH 201
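
    For reference, the "relative" figures quoted above follow the usual convention for relative WER reduction, (WER_baseline - WER_system) / WER_baseline. The numbers in the snippet below are illustrative placeholders, not the paper's results.

```python
# Relative WER improvement: (baseline - system) / baseline.
# The example WERs are hypothetical, chosen only to show the arithmetic.
def relative_improvement(baseline_wer, system_wer):
    return (baseline_wer - system_wer) / baseline_wer

print(f"{relative_improvement(4.0, 3.4):.0%} relative")  # -> 15% relative
```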

    Character-Level Incremental Speech Recognition with Recurrent Neural Networks

    In real-time speech recognition applications, latency is an important issue. We have developed a character-level incremental speech recognition (ISR) system that responds quickly even during speech, where the hypotheses are gradually improved as the speaking proceeds. The algorithm employs a speech-to-character unidirectional recurrent neural network (RNN), which is end-to-end trained with connectionist temporal classification (CTC), and an RNN-based character-level language model (LM). The output values of the CTC-trained RNN are character-level probabilities, which are processed by beam search decoding. The RNN LM augments the decoding by providing long-term dependency information. We propose tree-based online beam search with additional depth-pruning, which enables the system to process infinitely long input speech with low latency. This system not only responds quickly to speech but can also dictate out-of-vocabulary (OOV) words according to their pronunciation. The proposed model achieves a word error rate (WER) of 8.90% on the Wall Street Journal (WSJ) Nov'92 20K evaluation set when trained on the WSJ SI-284 training set. Comment: To appear in ICASSP 201
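
    The decoding procedure described above, CTC beam search over character hypotheses with an RNN LM steering the expansion, can be sketched as a flat prefix beam search. The code below is an illustrative Python version without the paper's prefix-tree data structure or depth-pruning; `lm_logp(prefix, c)` is a hypothetical hook standing in for the character-level RNN LM, and `alpha` is an assumed LM weight.

```python
# Compact CTC prefix beam search with character-LM rescoring (a sketch).
import math
from collections import defaultdict

NEG_INF = -float("inf")

def logsumexp(*xs):
    m = max(xs)
    return NEG_INF if m == NEG_INF else m + math.log(sum(math.exp(x - m) for x in xs))

def prefix_beam_search(log_probs, alphabet, lm_logp, beam_width=8, alpha=0.5):
    """log_probs: (T, C) per-frame CTC log-probabilities; blank is index 0."""
    # beams maps prefix -> (log p ending in blank, log p ending in non-blank)
    beams = {(): (0.0, NEG_INF)}
    for frame in log_probs:
        next_beams = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beams.items():
            # extend with blank: the prefix is unchanged
            nb_b, nb_nb = next_beams[prefix]
            next_beams[prefix] = (logsumexp(nb_b, p_b + frame[0], p_nb + frame[0]), nb_nb)
            for i, c in enumerate(alphabet, start=1):
                p_c = frame[i]
                if prefix and prefix[-1] == c:
                    # repeated character: only extends after a blank
                    new = prefix + (c,)
                    b1, n1 = next_beams[new]
                    next_beams[new] = (b1, logsumexp(n1, p_b + p_c + alpha * lm_logp(prefix, c)))
                    b2, n2 = next_beams[prefix]
                    next_beams[prefix] = (b2, logsumexp(n2, p_nb + p_c))
                else:
                    new = prefix + (c,)
                    b1, n1 = next_beams[new]
                    score = logsumexp(p_b, p_nb) + p_c + alpha * lm_logp(prefix, c)
                    next_beams[new] = (b1, logsumexp(n1, score))
        # keep only the best `beam_width` prefixes
        beams = dict(sorted(next_beams.items(),
                            key=lambda kv: -logsumexp(*kv[1]))[:beam_width])
    best = max(beams.items(), key=lambda kv: logsumexp(*kv[1]))
    return "".join(best[0])
```

    The paper's prefix-tree organization stores these prefixes as tree nodes so that shared stems and LM states are computed once, and its depth-pruning bounds the tree's growth on unbounded input; the flat dictionary above trades that efficiency for brevity.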

    Improved training of end-to-end attention models for speech recognition

    Sequence-to-sequence attention-based models on subword units allow simple open-vocabulary end-to-end speech recognition. In this work, we show that such models can achieve competitive results on the Switchboard 300h and LibriSpeech 1000h tasks. In particular, we report state-of-the-art word error rates (WER) of 3.54% on the dev-clean and 3.82% on the test-clean evaluation subsets of LibriSpeech. We introduce a new pretraining scheme that starts with a high time-reduction factor and lowers it during training, which is crucial for both convergence and final performance. In some experiments, we also use an auxiliary CTC loss function to help convergence. In addition, we train long short-term memory (LSTM) language models on subword units. With shallow fusion, we report up to 27% relative improvement in WER over the attention baseline without a language model. Comment: submitted to Interspeech 201
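
    Shallow fusion, as used here, simply interpolates the LM score into the decoder score at each beam-expansion step. Below is a minimal sketch assuming per-step logits from hypothetical ASR-decoder and LM models; the fusion weight `lam` is illustrative and is typically tuned on a development set.

```python
# Shallow fusion for one decoding step: log p_ASR + lam * log p_LM is used
# to rank beam expansions. Both logits tensors have shape (beam, vocab).
import torch

def fused_step_scores(decoder_logits, lm_logits, lam=0.3):
    asr_logp = torch.log_softmax(decoder_logits, dim=-1)
    lm_logp = torch.log_softmax(lm_logits, dim=-1)
    return asr_logp + lam * lm_logp
```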