Search CORE

3 research outputs found

A Comparison of Modeling Units in Sequence-to-Sequence Speech Recognition with the Transformer on Mandarin Chinese

Author: Dong Linhao
Xu Bo
Xu Shuang
Zhou Shiyu
Publication venue
Publication date: 18/05/2018
Field of study

The choice of modeling units is critical to automatic speech recognition (ASR) tasks. Conventional ASR systems typically choose context-dependent states (CD-states) or context-dependent phonemes (CD-phonemes) as their modeling units. However, it has been challenged by sequence-to-sequence attention-based models, which integrate an acoustic, pronunciation and language model into a single neural network. On English ASR tasks, previous attempts have already shown that the modeling unit of graphemes can outperform that of phonemes by sequence-to-sequence attention-based model. In this paper, we are concerned with modeling units on Mandarin Chinese ASR tasks using sequence-to-sequence attention-based models with the Transformer. Five modeling units are explored including context-independent phonemes (CI-phonemes), syllables, words, sub-words and characters. Experiments on HKUST datasets demonstrate that the lexicon free modeling units can outperform lexicon related modeling units in terms of character error rate (CER). Among five modeling units, character based model performs best and establishes a new state-of-the-art CER of

26.64\%

on HKUST datasets without a hand-designed lexicon and an extra language model integration, which corresponds to a

4.8\%

relative improvement over the existing best CER of

28.0\%

by the joint CTC-attention based encoder-decoder network.Comment: arXiv admin note: substantial text overlap with arXiv:1804.1075

arXiv.org e-Print Archive

Layer Trajectory LSTM

Author: Gong Yifan
Li Jinyu
Liu Changliang
Publication venue
Publication date: 28/08/2018
Field of study

It is popular to stack LSTM layers to get better modeling power, especially when large amount of training data is available. However, an LSTM-RNN with too many vanilla LSTM layers is very hard to train and there still exists the gradient vanishing issue if the network goes too deep. This issue can be partially solved by adding skip connections between layers, such as residual LSTM. In this paper, we propose a layer trajectory LSTM (ltLSTM) which builds a layer-LSTM using all the layer outputs from a standard multi-layer time-LSTM. This layer-LSTM scans the outputs from time-LSTMs, and uses the summarized layer trajectory information for final senone classification. The forward-propagation of time-LSTM and layer-LSTM can be handled in two separate threads in parallel so that the network computation time is the same as the standard time-LSTM. With a layer-LSTM running through layers, a gated path is provided from the output layer to the bottom layer, alleviating the gradient vanishing issue. Trained with 30 thousand hours of EN-US Microsoft internal data, the proposed ltLSTM performed significantly better than the standard multi-layer LSTM and residual LSTM, with up to 9.0% relative word error rate reduction across different tasks.Comment: Accepted at Interspeech 2018. Note the computational cost in Table 2 in the original Interspeech publication was doubled. Please refer this publication for the right computational cos

arXiv.org e-Print Archive

Recent Progresses in Deep Learning based Acoustic Models (Updated)

Author: Li Jinyu
Yu Dong
Publication venue
Publication date: 26/04/2018
Field of study

In this paper, we summarize recent progresses made in deep learning based acoustic models and the motivation and insights behind the surveyed techniques. We first discuss acoustic models that can effectively exploit variable-length contextual information, such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and their various combination with other models. We then describe acoustic models that are optimized end-to-end with emphasis on feature representations learned jointly with rest of the system, the connectionist temporal classification (CTC) criterion, and the attention-based sequence-to-sequence model. We further illustrate robustness issues in speech recognition systems, and discuss acoustic model adaptation, speech enhancement and separation, and robust training strategies. We also cover modeling techniques that lead to more efficient decoding and discuss possible future directions in acoustic model research.Comment: This is an updated version with latest literature until ICASSP2018 of the paper: Dong Yu and Jinyu Li, "Recent Progresses in Deep Learning based Acoustic Models," vol.4, no.3, IEEE/CAA Journal of Automatica Sinica, 201

arXiv.org e-Print Archive