3 research outputs found
A Comparison of Modeling Units in Sequence-to-Sequence Speech Recognition with the Transformer on Mandarin Chinese
The choice of modeling units is critical to automatic speech recognition
(ASR) tasks. Conventional ASR systems typically choose context-dependent states
(CD-states) or context-dependent phonemes (CD-phonemes) as their modeling
units. However, this convention has been challenged by sequence-to-sequence
attention-based models, which integrate the acoustic, pronunciation, and
language models into a single neural network. On English ASR tasks, previous
work has already shown that graphemes can outperform phonemes as the modeling
unit in sequence-to-sequence attention-based models.
In this paper, we are concerned with modeling units on Mandarin Chinese ASR
tasks using sequence-to-sequence attention-based models with the Transformer.
Five modeling units are explored: context-independent phonemes (CI-phonemes),
syllables, words, sub-words, and characters. Experiments on the HKUST datasets
demonstrate that lexicon-free modeling units can outperform lexicon-related
modeling units in terms of character error rate (CER). Among the five modeling
units, the character-based model performs best and establishes a new
state-of-the-art CER on the HKUST datasets without a hand-designed lexicon or
an extra language model, a relative improvement over the previous best CER
achieved by the joint CTC-attention based encoder-decoder network.
Comment: arXiv admin note: substantial text overlap with arXiv:1804.1075
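The contrast between lexicon-free and lexicon-related units can be made concrete with a small sketch (not from the paper; the sentence and the tiny hand-written pronunciation lexicon below are illustrative only) showing how one Mandarin utterance decomposes under three of the five modeling units:

```python
# Illustrative decomposition of a Mandarin sentence under different
# modeling units. The lexicon here is hypothetical, written by hand
# for this example; real systems use large pronunciation lexicons.
sentence = "我爱北京"

# Characters: lexicon-free, one token per Chinese character.
characters = list(sentence)            # ['我', '爱', '北', '京']

# Syllables (pinyin with tone): lexicon-related, requires a
# hand-designed pronunciation lexicon mapping characters to syllables.
lexicon = {"我": "wo3", "爱": "ai4", "北": "bei3", "京": "jing1"}
syllables = [lexicon[c] for c in sentence]   # ['wo3', 'ai4', 'bei3', 'jing1']

# Words: requires a word segmenter; the segmentation is hand-done here.
words = ["我", "爱", "北京"]

print(characters, syllables, words)
```

Character units sidestep both the pronunciation lexicon and the word segmenter, which is why a character-based model "lexicon-free" in the abstract's sense.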
Layer Trajectory LSTM
It is popular to stack LSTM layers to obtain better modeling power, especially
when a large amount of training data is available. However, an LSTM-RNN with
too many vanilla LSTM layers is very hard to train, and the gradient-vanishing
issue persists if the network goes too deep. This issue can be
partially solved by adding skip connections between layers, such as residual
LSTM. In this paper, we propose a layer trajectory LSTM (ltLSTM) which builds a
layer-LSTM using all the layer outputs from a standard multi-layer time-LSTM.
This layer-LSTM scans the outputs from time-LSTMs, and uses the summarized
layer trajectory information for final senone classification. The
forward-propagation of time-LSTM and layer-LSTM can be handled in two separate
threads in parallel so that the network computation time is the same as the
standard time-LSTM. With a layer-LSTM running through layers, a gated path is
provided from the output layer to the bottom layer, alleviating the gradient
vanishing issue. Trained with 30 thousand hours of EN-US Microsoft internal
data, the proposed ltLSTM performed significantly better than the standard
multi-layer LSTM and residual LSTM, with up to 9.0% relative word error rate
reduction across different tasks.
Comment: Accepted at Interspeech 2018. Note that the computational cost in
Table 2 of the original Interspeech publication was doubled; please refer to
this publication for the correct computational cost.
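The core idea, a time-LSTM stack whose per-layer outputs are re-scanned by a second LSTM along the depth axis, can be sketched in a few lines. This is a minimal PyTorch sketch under stated assumptions: all names, dimensions, and the final-depth-step readout are illustrative choices, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class LtLSTM(nn.Module):
    """Sketch of a layer-trajectory LSTM: time-LSTMs run over frames,
    a layer-LSTM runs over the stack of layer outputs at each frame."""

    def __init__(self, input_dim, hidden_dim, num_layers, num_senones):
        super().__init__()
        dims = [input_dim] + [hidden_dim] * (num_layers - 1)
        # One single-layer time-LSTM per depth, so every layer's output
        # is kept (a multi-layer nn.LSTM would expose only the top one).
        self.time_lstms = nn.ModuleList(
            nn.LSTM(d, hidden_dim, batch_first=True) for d in dims
        )
        # Layer-LSTM scans across depth, independently at each frame.
        self.layer_lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_senones)

    def forward(self, x):                       # x: (B, T, input_dim)
        layer_outputs, h = [], x
        for lstm in self.time_lstms:
            h, _ = lstm(h)                      # (B, T, H) per layer
            layer_outputs.append(h)
        traj = torch.stack(layer_outputs, dim=2)  # (B, T, L, H)
        B, T, L, H = traj.shape
        # Treat depth as the sequence axis; frames become batch entries.
        g, _ = self.layer_lstm(traj.reshape(B * T, L, H))
        g_top = g[:, -1, :].reshape(B, T, H)    # trajectory summary
        return self.classifier(g_top)           # (B, T, num_senones)
```

Because the time-LSTM recursion does not depend on the layer-LSTM, the two scans can run in separate threads, which is how the paper keeps the computation time equal to a standard time-LSTM.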
Recent Progresses in Deep Learning based Acoustic Models (Updated)
In this paper, we summarize recent progresses made in deep learning based
acoustic models and the motivation and insights behind the surveyed techniques.
We first discuss acoustic models that can effectively exploit variable-length
contextual information, such as recurrent neural networks (RNNs), convolutional
neural networks (CNNs), and their various combinations with other models. We
then describe acoustic models that are optimized end-to-end, with emphasis on
feature representations learned jointly with the rest of the system, the
connectionist temporal classification (CTC) criterion, and the attention-based
sequence-to-sequence model. We further illustrate robustness issues in speech
recognition systems, and discuss acoustic model adaptation, speech enhancement
and separation, and robust training strategies. We also cover modeling
techniques that lead to more efficient decoding and discuss possible future
directions in acoustic model research.
Comment: This is an updated version, with the latest literature through
ICASSP 2018, of the paper: Dong Yu and Jinyu Li, "Recent Progresses in Deep
Learning based Acoustic Models," IEEE/CAA Journal of Automatica Sinica,
vol. 4, no. 3, 201