Improved TDNNs using Deep Kernels and Frequency Dependent Grid-RNNs
Time delay neural networks (TDNNs) are an effective acoustic model for large
vocabulary speech recognition. The strength of the model can be attributed to
its ability to effectively model long temporal contexts. However, current TDNN
models are relatively shallow, which limits the modelling capability. This
paper proposes a method of increasing the network depth by deepening the kernel
used in the TDNN temporal convolutions. The best performing kernel consists of
three fully connected layers with a residual (ResNet) connection from the
output of the first to the output of the third. The addition of
spectro-temporal processing as the input to the TDNN in the form of a
convolutional neural network (CNN) and a newly designed Grid-RNN was
investigated. The Grid-RNN strongly outperforms a CNN if different sets of
parameters for different frequency bands are used and can be further enhanced
by using a bi-directional Grid-RNN. Experiments using the multi-genre broadcast
(MGB3) English data (275h) show that deep kernel TDNNs reduce the word error
rate (WER) by 6% relative and, when combined with the frequency-dependent
Grid-RNN, give a relative WER reduction of 9%.
Comment: 5 pages, 3 figures, 2 tables, to appear in 2018 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP 2018)
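The deep kernel described in this abstract (three fully connected layers, with a residual connection from the output of the first layer to the output of the third) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the ReLU activations, the layer sizes, the random initialisation, the context offsets and the helper names (`affine`, `init_layer`, `deep_kernel`, `tdnn_layer`) are all hypothetical.

```python
import random


def relu(v):
    """Element-wise ReLU (the choice of activation is an assumption)."""
    return [max(0.0, x) for x in v]


def affine(weights, bias, v):
    """Fully connected layer without activation: weights @ v + bias."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]


def init_layer(n_out, n_in, rng):
    """Small random weights and zero bias (hypothetical initialisation)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def deep_kernel(spliced, layers):
    """Three fully connected layers; the output of the first layer is
    added to the output of the third (the residual connection)."""
    (w1, b1), (w2, b2), (w3, b3) = layers
    h1 = relu(affine(w1, b1, spliced))
    h2 = relu(affine(w2, b2, h1))
    h3 = relu(affine(w3, b3, h2))
    return [a + b for a, b in zip(h1, h3)]


def tdnn_layer(frames, offsets, layers):
    """Apply the same kernel at every time step where the whole temporal
    context (frames at t + offset) is available."""
    lo, hi = min(offsets), max(offsets)
    outputs = []
    for t in range(-lo, len(frames) - hi):
        spliced = [x for off in offsets for x in frames[t + off]]
        outputs.append(deep_kernel(spliced, layers))
    return outputs


rng = random.Random(0)
feat_dim, hidden = 4, 8          # toy sizes, chosen for illustration
offsets = (-1, 0, 1)             # hypothetical temporal context
layers = [
    init_layer(hidden, feat_dim * len(offsets), rng),
    init_layer(hidden, hidden, rng),
    init_layer(hidden, hidden, rng),
]
frames = [[rng.gauss(0.0, 1.0) for _ in range(feat_dim)] for _ in range(10)]
out = tdnn_layer(frames, offsets, layers)
```

With 10 input frames and a context of one frame on each side, the layer produces 8 output vectors of size `hidden`. The first and third layers must have the same output width for the residual sum to be defined.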
Layer Trajectory LSTM
It is popular to stack LSTM layers to get better modeling power, especially
when a large amount of training data is available. However, an LSTM-RNN with too
many vanilla LSTM layers is very hard to train and there still exists the
gradient vanishing issue if the network goes too deep. This issue can be
partially solved by adding skip connections between layers, such as residual
LSTM. In this paper, we propose a layer trajectory LSTM (ltLSTM) which builds a
layer-LSTM using all the layer outputs from a standard multi-layer time-LSTM.
This layer-LSTM scans the outputs from time-LSTMs, and uses the summarized
layer trajectory information for final senone classification. The
forward-propagation of time-LSTM and layer-LSTM can be handled in two separate
threads in parallel so that the network computation time is the same as the
standard time-LSTM. With a layer-LSTM running through layers, a gated path is
provided from the output layer to the bottom layer, alleviating the gradient
vanishing issue. Trained with 30 thousand hours of EN-US Microsoft internal
data, the proposed ltLSTM performed significantly better than the standard
multi-layer LSTM and residual LSTM, with up to 9.0% relative word error rate
reduction across different tasks.
Comment: Accepted at Interspeech 2018. Note the computational cost in Table 2
in the original Interspeech publication was doubled. Please refer to this
publication for the right computational cost
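The abstracts in this listing report relative, not absolute, word error rate reductions. A minimal sketch of that arithmetic is given below; the function name and the WER figures in the example are hypothetical and are not taken from any of the papers.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction: the drop in WER as a fraction of the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical figures: a baseline WER of 10.0% improved to 9.1%
# is a 0.9 point absolute reduction, i.e. about 9% relative.
reduction = relative_wer_reduction(10.0, 9.1)
print(f"{reduction:.1%} relative WER reduction")
```

The same absolute gain corresponds to a larger relative reduction when the baseline WER is lower, which is why relative figures are not directly comparable across tasks with very different baselines.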
Recent Progresses in Deep Learning based Acoustic Models (Updated)
In this paper, we summarize recent progresses made in deep learning based
acoustic models and the motivation and insights behind the surveyed techniques.
We first discuss acoustic models that can effectively exploit variable-length
contextual information, such as recurrent neural networks (RNNs), convolutional
neural networks (CNNs), and their various combinations with other models. We
then describe acoustic models that are optimized end-to-end with emphasis on
feature representations learned jointly with the rest of the system, the
connectionist temporal classification (CTC) criterion, and the attention-based
sequence-to-sequence model. We further illustrate robustness issues in speech
recognition systems, and discuss acoustic model adaptation, speech enhancement
and separation, and robust training strategies. We also cover modeling
techniques that lead to more efficient decoding and discuss possible future
directions in acoustic model research.
Comment: This is an updated version, with the latest literature up to ICASSP 2018, of
the paper: Dong Yu and Jinyu Li, "Recent Progresses in Deep Learning based
Acoustic Models," vol.4, no.3, IEEE/CAA Journal of Automatica Sinica, 201
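The CTC criterion mentioned in this abstract relies on a many-to-one mapping from frame-level label paths to output sequences: repeated labels are merged, then blank symbols are removed. A small sketch of that mapping follows, together with a greedy (best-path) decoder built on it. The greedy decoder is a simplification chosen for illustration and is not taken from the survey; the blank symbol `"_"` and the function names are assumptions.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol


def ctc_collapse(path, blank=BLANK):
    """Map a frame-level path to an output sequence:
    merge consecutive repeats, then drop blanks."""
    return [label for label, _ in groupby(path) if label != blank]


def greedy_ctc_decode(posteriors, labels, blank=BLANK):
    """Pick the most probable label in each frame, then collapse the path.
    posteriors: one list of per-label probabilities per frame."""
    best_path = [
        labels[max(range(len(frame)), key=frame.__getitem__)]
        for frame in posteriors
    ]
    return ctc_collapse(best_path, blank)


labels = [BLANK, "a", "b"]
posteriors = [
    [0.1, 0.8, 0.1],    # "a"
    [0.2, 0.7, 0.1],    # "a" (repeat, merged)
    [0.9, 0.05, 0.05],  # blank (separates the two "a"s)
    [0.1, 0.6, 0.3],    # "a"
    [0.1, 0.2, 0.7],    # "b"
]
decoded = greedy_ctc_decode(posteriors, labels)
```

The blank between the second and fourth frames is what allows the output to contain two consecutive "a" labels; without it the repeats would be merged into one.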