
    Improved training for online end-to-end speech recognition systems

    Achieving high accuracy with end-to-end speech recognizers requires careful parameter initialization prior to training. Otherwise, the networks may fail to find a good local optimum. This is particularly true for online networks, such as unidirectional LSTMs. Currently, the best strategy to train such systems is to bootstrap the training from a tied-triphone system. However, this is time consuming and, more importantly, impossible for languages without a high-quality pronunciation lexicon. In this work, we propose an initialization strategy that uses teacher-student learning to transfer knowledge from a large, well-trained, offline end-to-end speech recognition model to an online end-to-end model, eliminating the need for a lexicon or any other linguistic resources. We also explore curriculum learning and label smoothing and show how they can be combined with the proposed teacher-student learning for further improvements. We evaluate our methods on a Microsoft Cortana personal assistant task and show that the proposed method results in a 19% relative improvement in word error rate compared to a randomly-initialized baseline system. Comment: Interspeech 201
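
    The teacher-student transfer described above can be illustrated with a frame-level knowledge-distillation step. The sketch below is a minimal PyTorch example, assuming hypothetical `teacher` and `student` modules that map a feature tensor to per-frame output logits; it is an illustration of the general technique, not the paper's implementation.

```python
# Minimal sketch of frame-level teacher-student training: the online
# (unidirectional) student is trained to match the output distribution of
# the offline (bidirectional) teacher via KL divergence. `teacher`,
# `student`, and the data pipeline are placeholders.
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, features, optimizer, T=1.0):
    """One training step; `features` is a (batch, time, feat_dim) tensor."""
    with torch.no_grad():
        teacher_logits = teacher(features)          # soft targets, no gradient
    student_logits = student(features)
    # KL(teacher || student) over the per-frame output label distribution
    loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```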

    Online Speech Recognition Using Recurrent Neural Networks

    Doctoral dissertation (Ph.D.), Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2017. Advisor: Wonyong Sung. Recurrent neural networks (RNNs) have shown outstanding sequence-to-sequence modeling performance.
    Thanks to recent advances in end-to-end training approaches for automatic speech recognition (ASR), RNNs can learn direct mapping functions from a sequence of audio features to a sequence of output characters or words without any intermediate phoneme or lexicon layers. So far, the majority of studies on end-to-end ASR have focused on raising recognition accuracy to the level of traditional state-of-the-art models. However, although end-to-end ASR models have reached the accuracy of traditional systems, their application has usually been limited to utterance-level speech recognition with pre-segmented audio, rather than online speech recognition with continuous audio. This is because RNNs trained on segmented audio cannot easily generalize to very long streams of audio. To address this problem, we propose an RNN training approach for sequences of virtually infinite length. Specifically, we describe an efficient GPU-based RNN training framework for the truncated backpropagation through time (BPTT) algorithm, which is suitable for online (continuous) training. Then, we present an online version of the connectionist temporal classification (CTC) loss computation algorithm, in which the original CTC loss is estimated with a partial sliding window. This modified CTC algorithm can be directly employed for truncated-BPTT-based RNN training. In addition, a fully RNN-based end-to-end online ASR model is proposed. The model is composed of an acoustic RNN with CTC output and a character-level RNN language model (LM) that is augmented with a hierarchical structure. Prefix-tree-based beam search decoding is employed with a new beam pruning algorithm to prevent exponential growth of the tree. The model is free of phoneme or lexicon models and can decode infinitely long audio sequences. Also, this model has a very small memory footprint compared to other end-to-end systems while showing competitive accuracy. Furthermore, we propose an improved character-level RNN LM with a hierarchical structure. This character-level RNN LM shows improved perplexity compared to a lightweight word-level RNN LM of comparable size.
    When this RNN LM is applied to the proposed character-level online ASR, better speech recognition accuracy can be achieved with a reduced amount of computation.

    Contents:
    1 Introduction
      1.1 Automatic Speech Recognition
        1.1.1 Traditional ASR
        1.1.2 End-to-End ASR with Recurrent Neural Networks
        1.1.3 Offline and Online ASR
      1.2 Scope of the Dissertation
        1.2.1 End-to-End Online ASR with RNNs
        1.2.2 Challenges and Contributions
    2 Flexible and Efficient RNN Training on GPUs
      2.1 Introduction
      2.2 Generalization
        2.2.1 Generalized RNN Structure
        2.2.2 Training
      2.3 Parallelization
        2.3.1 Intra-Stream Parallelism
        2.3.2 Inter-Stream Parallelism
      2.4 Experiments
      2.5 Concluding Remarks
    3 Online Sequence Training with Connectionist Temporal Classification
      3.1 Introduction
      3.2 Connectionist Temporal Classification
      3.3 Online Sequence Training
        3.3.1 Problem Definition
        3.3.2 Overview of the Proposed Approach
        3.3.3 CTC-TR: Standard CTC with Truncation
        3.3.4 CTC-EM: EM-Based Online CTC
      3.4 Training Continuously Running RNNs
      3.5 Parallel Training
      3.6 Experiments
        3.6.1 End-to-End Speech Recognition with RNNs
        3.6.2 Phoneme Recognition on TIMIT
      3.7 Concluding Remarks
    4 Character-Level Incremental Speech Recognition
      4.1 Introduction
      4.2 Models
        4.2.1 Acoustic Model
        4.2.2 Language Model
      4.3 Character-Level Beam Search
        4.3.1 Prefix-Tree-Based CTC Beam Search
        4.3.2 Pruning
      4.4 Experiments
      4.5 Concluding Remarks
    5 Character-Level Language Modeling with Hierarchical RNNs
      5.1 Introduction
      5.2 Related Work
        5.2.1 Character-Level Language Modeling with RNNs
        5.2.2 Character-Aware Word-Level Language Modeling
      5.3 RNNs with External Clock and Reset Signals
      5.4 Character-Level Language Modeling with a Hierarchical RNN
      5.5 Experiments
        5.5.1 Perplexity
        5.5.2 End-to-End Automatic Speech Recognition (ASR)
      5.6 Concluding Remarks
    6 Conclusion
    Bibliography
    Abstract in Korean
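
    The training scheme sketched in the abstract, carrying RNN state across a continuous stream while truncating gradients at chunk boundaries, can be illustrated roughly as follows. This is a minimal PyTorch sketch in the spirit of the CTC-TR variant: it assumes a hypothetical `stream_of_chunks()` iterator that yields each audio chunk together with the label subsequence falling inside it, which is exactly the requirement the dissertation's CTC-EM algorithm is designed to relax.

```python
# Truncated-BPTT training on a continuous stream: LSTM state is carried
# across fixed-size chunks, but gradients are cut at chunk boundaries.
import torch
import torch.nn as nn

rnn = nn.LSTM(input_size=40, hidden_size=256, batch_first=True)
proj = nn.Linear(256, 30)                 # e.g. 29 characters + CTC blank (id 0)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
opt = torch.optim.Adam(list(rnn.parameters()) + list(proj.parameters()))

state = None
# `stream_of_chunks()` is a hypothetical iterator yielding
# (chunk, labels, label_lens); labels are concatenated 1-D targets.
for chunk, labels, label_lens in stream_of_chunks():
    out, state = rnn(chunk, state)
    state = tuple(s.detach() for s in state)            # truncate BPTT here
    log_probs = proj(out).log_softmax(-1).transpose(0, 1)   # (T, B, C)
    input_lens = torch.full((chunk.size(0),), log_probs.size(0),
                            dtype=torch.long)
    loss = ctc(log_probs, labels, input_lens, label_lens)
    opt.zero_grad()
    loss.backward()
    opt.step()
```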

    RWTH ASR Systems for LibriSpeech: Hybrid vs Attention -- w/o Data Augmentation

    We present state-of-the-art automatic speech recognition (ASR) systems employing a standard hybrid DNN/HMM architecture compared to an attention-based encoder-decoder design for the LibriSpeech task. Detailed descriptions of the system development, including model design, pretraining schemes, training schedules, and optimization approaches, are provided for both system architectures. Both the hybrid DNN/HMM and the attention-based system employ bi-directional LSTMs for acoustic modeling/encoding. For language modeling, we employ both LSTM and Transformer based architectures. All our systems are built using RWTH's open-source toolkits RASR and RETURNN. To the best knowledge of the authors, the results obtained when training on the full LibriSpeech training set are the best published currently, both for the hybrid DNN/HMM and the attention-based systems. Our single hybrid system even outperforms previous results obtained from combining eight single systems. Our comparison shows that on the LibriSpeech 960h task, the hybrid DNN/HMM system outperforms the attention-based system by 15% relative on the clean and 40% relative on the other test sets in terms of word error rate. Moreover, experiments on a reduced 100h subset of the LibriSpeech training corpus show an even more pronounced margin between the hybrid DNN/HMM and attention-based architectures. Comment: Proceedings of INTERSPEECH 201
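
    For reference, the "relative" figures quoted above follow the usual convention for relative WER reduction, (WER_baseline - WER_system) / WER_baseline. The numbers in the snippet below are illustrative placeholders, not the paper's results.

```python
# Relative WER improvement: (baseline - system) / baseline.
# The example WERs are hypothetical, chosen only to show the arithmetic.
def relative_improvement(baseline_wer, system_wer):
    return (baseline_wer - system_wer) / baseline_wer

print(f"{relative_improvement(4.0, 3.4):.0%} relative")  # -> 15% relative
```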

    Character-Level Incremental Speech Recognition with Recurrent Neural Networks

    In real-time speech recognition applications, latency is an important issue. We have developed a character-level incremental speech recognition (ISR) system that responds quickly even during speech, where the hypotheses are gradually improved as the speaking proceeds. The algorithm employs a speech-to-character unidirectional recurrent neural network (RNN), which is end-to-end trained with connectionist temporal classification (CTC), and an RNN-based character-level language model (LM). The output values of the CTC-trained RNN are character-level probabilities, which are processed by beam search decoding. The RNN LM augments the decoding by providing long-term dependency information. We propose tree-based online beam search with additional depth-pruning, which enables the system to process infinitely long input speech with low latency. This system not only responds quickly to speech but can also dictate out-of-vocabulary (OOV) words according to their pronunciation. The proposed model achieves a word error rate (WER) of 8.90% on the Wall Street Journal (WSJ) Nov'92 20K evaluation set when trained on the WSJ SI-284 training set. Comment: To appear in ICASSP 201
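
    The decoding procedure described above, CTC beam search over character hypotheses with an RNN LM steering the expansion, can be sketched as a flat prefix beam search. The code below is an illustrative Python version without the paper's prefix-tree data structure or depth-pruning; `lm_logp(prefix, c)` is a hypothetical hook standing in for the character-level RNN LM, and `alpha` is an assumed LM weight.

```python
# Compact CTC prefix beam search with character-LM rescoring (a sketch).
import math
from collections import defaultdict

NEG_INF = -float("inf")

def logsumexp(*xs):
    m = max(xs)
    return NEG_INF if m == NEG_INF else m + math.log(sum(math.exp(x - m) for x in xs))

def prefix_beam_search(log_probs, alphabet, lm_logp, beam_width=8, alpha=0.5):
    """log_probs: (T, C) per-frame CTC log-probabilities; blank is index 0."""
    # beams maps prefix -> (log p ending in blank, log p ending in non-blank)
    beams = {(): (0.0, NEG_INF)}
    for frame in log_probs:
        next_beams = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beams.items():
            # extend with blank: the prefix is unchanged
            nb_b, nb_nb = next_beams[prefix]
            next_beams[prefix] = (logsumexp(nb_b, p_b + frame[0], p_nb + frame[0]), nb_nb)
            for i, c in enumerate(alphabet, start=1):
                p_c = frame[i]
                if prefix and prefix[-1] == c:
                    # repeated character: only extends after a blank
                    new = prefix + (c,)
                    b1, n1 = next_beams[new]
                    next_beams[new] = (b1, logsumexp(n1, p_b + p_c + alpha * lm_logp(prefix, c)))
                    b2, n2 = next_beams[prefix]
                    next_beams[prefix] = (b2, logsumexp(n2, p_nb + p_c))
                else:
                    new = prefix + (c,)
                    b1, n1 = next_beams[new]
                    score = logsumexp(p_b, p_nb) + p_c + alpha * lm_logp(prefix, c)
                    next_beams[new] = (b1, logsumexp(n1, score))
        # keep only the best `beam_width` prefixes
        beams = dict(sorted(next_beams.items(),
                            key=lambda kv: -logsumexp(*kv[1]))[:beam_width])
    best = max(beams.items(), key=lambda kv: logsumexp(*kv[1]))
    return "".join(best[0])
```

    The paper's prefix-tree organization stores these prefixes as tree nodes so that shared stems and LM states are computed once, and its depth-pruning bounds the tree's growth on unbounded input; the flat dictionary above trades that efficiency for brevity.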

    Improved training of end-to-end attention models for speech recognition

    Sequence-to-sequence attention-based models on subword units allow simple open-vocabulary end-to-end speech recognition. In this work, we show that such models can achieve competitive results on the Switchboard 300h and LibriSpeech 1000h tasks. In particular, we report state-of-the-art word error rates (WER) of 3.54% on the dev-clean and 3.82% on the test-clean evaluation subsets of LibriSpeech. We introduce a new pretraining scheme that starts with a high time-reduction factor and lowers it during training, which is crucial for both convergence and final performance. In some experiments, we also use an auxiliary CTC loss function to help convergence. In addition, we train long short-term memory (LSTM) language models on subword units. With shallow fusion, we report up to 27% relative improvement in WER over the attention baseline without a language model. Comment: submitted to Interspeech 201
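
    Shallow fusion, as used here, simply interpolates the LM score into the decoder score at each beam-expansion step. Below is a minimal sketch assuming per-step logits from hypothetical ASR-decoder and LM models; the fusion weight `lam` is illustrative and is typically tuned on a development set.

```python
# Shallow fusion for one decoding step: log p_ASR + lam * log p_LM is used
# to rank beam expansions. Both logits tensors have shape (beam, vocab).
import torch

def fused_step_scores(decoder_logits, lm_logits, lam=0.3):
    asr_logp = torch.log_softmax(decoder_logits, dim=-1)
    lm_logp = torch.log_softmax(lm_logits, dim=-1)
    return asr_logp + lam * lm_logp
```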