Transformers with convolutional context for ASR
The recent success of transformer networks for neural machine translation and
other NLP tasks has led to a surge in research work trying to apply them to
speech recognition. Recent efforts have studied key research questions around ways
of combining positional embeddings with speech features and around the stability of
optimization for large-scale training of transformer networks. In this paper,
we propose replacing the sinusoidal positional embedding for transformers with
convolutionally learned input representations. These contextual representations
provide subsequent transformer blocks with relative positional information
needed for discovering long-range relationships between local concepts. The
proposed system has favorable optimization characteristics: our reported
results are produced with a fixed learning rate of 1.0 and no warmup steps. The
proposed model achieves a competitive 4.7% and 12.9% WER on the LibriSpeech
"test clean" and "test other" subsets when no extra LM text is provided.
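A minimal PyTorch sketch of the idea (layer count, kernel sizes, strides, and dimensions are illustrative assumptions, not the paper's exact configuration): a small stack of strided 1-D convolutions over the acoustic features produces contextual input representations, and no sinusoidal positional embedding is added before the transformer blocks.

```python
import torch
import torch.nn as nn

class ConvContextFrontEnd(nn.Module):
    """Convolutional front-end used in place of sinusoidal positional encoding.

    Hyperparameters are illustrative, not the paper's exact setup.
    """
    def __init__(self, feat_dim=80, d_model=512, kernel_size=3, num_layers=2):
        super().__init__()
        layers, in_ch = [], feat_dim
        for _ in range(num_layers):
            layers += [nn.Conv1d(in_ch, d_model, kernel_size,
                                 stride=2, padding=kernel_size // 2),
                       nn.ReLU()]
            in_ch = d_model
        self.conv = nn.Sequential(*layers)

    def forward(self, feats):
        # feats: (batch, time, feat_dim) acoustic features, e.g. log-mel frames
        x = feats.transpose(1, 2)     # (batch, feat_dim, time)
        x = self.conv(x)              # (batch, d_model, subsampled time)
        return x.transpose(1, 2)      # (batch, time', d_model), fed to transformer blocks

# Usage: the output goes straight into standard transformer encoder layers,
# with no additive positional embedding.
frontend = ConvContextFrontEnd()
out = frontend(torch.randn(4, 200, 80))   # -> torch.Size([4, 50, 512])
```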
Weak-Attention Suppression For Transformer Based Speech Recognition
Transformers, originally proposed for natural language processing (NLP)
tasks, have recently achieved great success in automatic speech recognition
(ASR). However, adjacent acoustic units (i.e., frames) are highly correlated,
and long-distance dependencies between them are weak, unlike text units. This
suggests that ASR will likely benefit from sparse and localized attention. In
this paper, we propose Weak-Attention Suppression (WAS), a method that
dynamically induces sparsity in attention probabilities. We demonstrate that
WAS leads to consistent Word Error Rate (WER) improvement over strong
transformer baselines. On the widely used LibriSpeech benchmark, our proposed
method reduced WER by 10% on test-clean and 5% on test-other for streamable
transformers, resulting in a new state-of-the-art among streaming models.
Further analysis shows that WAS learns to suppress attention on non-critical
and redundant continuous acoustic frames, and is more likely to suppress past
frames than future ones. This indicates the importance of lookahead in
attention-based ASR models.
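One plausible reading of the suppression step is sketched below: for each query, attention probabilities below a dynamic threshold are zeroed out and the rest are renormalized. The mean-minus-scaled-standard-deviation threshold and the value of gamma are illustrative assumptions, not necessarily the paper's exact rule.

```python
import torch

def weak_attention_suppression(attn_probs, gamma=0.5, eps=1e-8):
    """Zero out 'weak' attention weights per query and renormalize.

    attn_probs: (..., num_queries, num_keys) softmax outputs.
    The threshold rule (mean - gamma * std per query) and gamma are illustrative.
    """
    mean = attn_probs.mean(dim=-1, keepdim=True)
    std = attn_probs.std(dim=-1, keepdim=True)
    threshold = mean - gamma * std
    suppressed = torch.where(attn_probs < threshold,
                             torch.zeros_like(attn_probs), attn_probs)
    # Renormalize so each query's weights still sum to one.
    return suppressed / (suppressed.sum(dim=-1, keepdim=True) + eps)

probs = torch.softmax(torch.randn(2, 4, 10, 10), dim=-1)   # (batch, heads, Q, K)
sparse_probs = weak_attention_suppression(probs)
```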
Multilingual End-to-End Speech Recognition with A Single Transformer on Low-Resource Languages
Sequence-to-sequence attention-based models integrate acoustic, pronunciation,
and language models into a single neural network, which makes them very suitable
for multilingual automatic speech recognition (ASR). In this paper, we address
multilingual speech recognition on low-resource languages with a single
Transformer, a sequence-to-sequence attention-based model. Sub-words are
employed as the multilingual modeling unit without using any pronunciation
lexicon. First, we show that a single multilingual ASR Transformer performs
well on low-resource languages despite some language confusion. We then look at
incorporating language information into the model by inserting the language
symbol at the beginning or at the end of the original sub-word sequence,
assuming the language identity is known during training. Experiments on the
CALLHOME datasets demonstrate that the multilingual ASR Transformer with the
language symbol at the end performs better and obtains a relative 10.5% average
word error rate (WER) reduction compared to SHL-MLSTM with residual learning.
We go on to show that, when the language identity is known during both training
and testing, a relative 12.4% average WER reduction over SHL-MLSTM with residual
learning can be obtained by giving the language symbol as the sentence start token.
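A minimal sketch of the target-side preparation described above, assuming per-utterance language labels and a shared sub-word vocabulary (the function name and token ids are illustrative):

```python
def add_language_symbol(subword_ids, lang_id, position="end"):
    """Insert a language token into a sub-word target sequence.

    subword_ids: list of sub-word token ids for one utterance.
    lang_id: id of a language symbol (an illustrative, hypothetical token).
    position: "begin" prepends the symbol, "end" appends it.
    """
    if position == "begin":
        return [lang_id] + subword_ids
    return subword_ids + [lang_id]

# Example: append the language symbol, the variant reported to work best.
targets = add_language_symbol([132, 87, 954], lang_id=3, position="end")
```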
End-to-End Speech Translation with Knowledge Distillation
End-to-end speech translation (ST), which directly translates from source
language speech into target language text, has attracted intensive attention
in recent years. Compared to conventional pipeline systems, end-to-end ST
models have the advantages of lower latency, smaller model size, and less error
propagation. However, combining speech recognition and text translation in one
model is more difficult than either task alone. In this paper, we propose a
knowledge distillation approach that improves the ST model by transferring
knowledge from a text translation model. Specifically, we first train a text
translation model as a teacher, and then train the ST model to match the
teacher's output probabilities through knowledge distillation. Experiments on
the English-French Augmented LibriSpeech and English-Chinese TED corpora show
that end-to-end ST can be applied to both similar and dissimilar language pairs.
In addition, with the guidance of the teacher model, the end-to-end ST model
gains significant improvements of over 3.5 BLEU points.
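A minimal sketch of the distillation objective, assuming token-level distillation in which the ST student matches the teacher's per-token output distribution over a shared target vocabulary; the interpolation weight, temperature, and lack of padding masking in the KD term are illustrative simplifications.

```python
import torch
import torch.nn.functional as F

def st_distillation_loss(student_logits, teacher_logits, targets,
                         alpha=0.8, temperature=1.0, pad_id=0):
    """Combine cross-entropy on references with a KL term toward the teacher.

    student_logits, teacher_logits: (batch, tgt_len, vocab)
    targets: (batch, tgt_len) reference token ids.
    alpha and temperature are illustrative hyperparameters.
    """
    vocab = student_logits.size(-1)
    ce = F.cross_entropy(student_logits.view(-1, vocab), targets.view(-1),
                         ignore_index=pad_id)
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # Padding positions are not masked here for brevity.
    kd = F.kl_div(log_p_student.view(-1, vocab), p_teacher.view(-1, vocab),
                  reduction="batchmean") * temperature ** 2
    return alpha * kd + (1.0 - alpha) * ce
```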
On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition
Recently, there has been a strong push to transition from hybrid models to
end-to-end (E2E) models for automatic speech recognition. Currently, there are
three promising E2E methods: recurrent neural network transducer (RNN-T), RNN
attention-based encoder-decoder (AED), and Transformer-AED. In this study, we
conduct an empirical comparison of RNN-T, RNN-AED, and Transformer-AED models
in both non-streaming and streaming modes. We use 65 thousand hours of
Microsoft anonymized training data to train these models. As E2E models are
more data-hungry, it is better to compare their effectiveness with a large
amount of training data. To the best of our knowledge, no such comprehensive
study has been conducted before. We show that although AED models are stronger
than RNN-T in the non-streaming mode, RNN-T is very competitive in the streaming
mode if its encoder is properly initialized. Among all three E2E models,
Transformer-AED achieves the best accuracy in both streaming and non-streaming
modes. We show that both streaming RNN-T and Transformer-AED models can obtain
better accuracy than a highly-optimized hybrid model.
Cross-task pre-training for acoustic scene classification
Acoustic scene classification (ASC) and acoustic event detection (AED) are
different but related tasks. Acoustic scenes are shaped by the acoustic events
that occur in them, which can provide useful information for training ASC
models. However, most datasets do not provide both acoustic event and scene
labels. Therefore, we explore a cross-task pre-training mechanism that utilizes
acoustic event information extracted from a pre-trained model to optimize the
ASC task. We present three cross-task pre-training architectures and evaluate
them with feature-based and fine-tuning strategies on two datasets: the TAU
Urban Acoustic Scenes 2019 dataset and the TUT Acoustic Scenes 2017 dataset.
Results show that the cross-task pre-training mechanism can significantly
improve ASC performance: compared with the official baselines, our best model
achieves a relative improvement of 9.5% on the TAU Urban Acoustic Scenes 2019
dataset and 10% on the TUT Acoustic Scenes 2017 dataset.
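A minimal sketch of the fine-tuning strategy, assuming a pre-trained event-recognition encoder is reused and only a new scene-classification head is attached; module names, dimensions, and the stand-in encoder are illustrative. Freezing the encoder corresponds roughly to the feature-based strategy.

```python
import torch
import torch.nn as nn

class ASCFromPretrainedAED(nn.Module):
    """Reuse an encoder pre-trained on acoustic events for scene classification."""
    def __init__(self, pretrained_encoder, embed_dim=512, num_scenes=10, freeze=False):
        super().__init__()
        self.encoder = pretrained_encoder            # pre-trained AED feature extractor
        if freeze:                                    # feature-based strategy
            for p in self.encoder.parameters():
                p.requires_grad = False
        self.classifier = nn.Linear(embed_dim, num_scenes)   # new ASC head

    def forward(self, spectrogram):
        emb = self.encoder(spectrogram)               # (batch, embed_dim) clip embedding
        return self.classifier(emb)                   # scene logits

# Example with a stand-in encoder (the real one would come from AED pre-training).
dummy_encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 400, 512), nn.ReLU())
model = ASCFromPretrainedAED(dummy_encoder, freeze=True)
logits = model(torch.randn(8, 1, 64, 400))            # (8, num_scenes)
```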
NIESR: Nuisance Invariant End-to-end Speech Recognition
Deep neural network models for speech recognition have achieved great success
recently, but they can learn incorrect associations between the target and
nuisance factors of speech (e.g., speaker identities, background noise, etc.),
which can lead to overfitting. Several methods have been proposed to tackle
this problem, but they incorporate additional information about nuisance
factors during training to develop invariant models. However, enumerating all
possible nuisance factors in speech data and collecting their annotations is
difficult and expensive. We present a robust training scheme for end-to-end
speech recognition that adopts an unsupervised adversarial invariance induction
framework to separate out essential factors for speech recognition from
nuisances without using any supplementary labels besides the transcriptions.
Experiments show that the speech recognition model trained with the proposed
scheme achieves relative improvements of 5.48% on WSJ0, 6.16% on CHiME3, and
6.61% on the TIMIT dataset over the base model. Additionally, the proposed
method achieves a relative improvement of 14.44% on the combined WSJ0+CHiME3
dataset.
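The unsupervised adversarial invariance framework itself is more involved, but its adversarial ingredient can be roughly illustrated with a gradient reversal layer, a common device (swapped in here for illustration, not the paper's exact mechanism) for pushing nuisance information out of a learned representation.

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass, reversed (scaled) gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Features from the ASR encoder; an auxiliary head fed through grad_reverse tries
# to recover nuisance structure, and the reversed gradient discourages the encoder
# from keeping that information. Shapes and lambd are illustrative.
features = torch.randn(8, 256, requires_grad=True)
adv_input = grad_reverse(features, lambd=0.1)
```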
Masked Pre-trained Encoder base on Joint CTC-Transformer
This study (completed during an internship at Tencent AI Lab) addresses
semi-supervised acoustic modeling, i.e. learning high-level representations
from unlabeled audio data and fine-tuning the parameters of the pre-trained
model with supervised data. The proposed approach adopts a two-stage training
framework, consisting of a masked pre-trained encoder (MPE) and a Joint
CTC-Transformer (JCT). In the MPE stage, part of the input frames are masked
and reconstructed at the encoder output using massive amounts of unlabeled data.
In the JCT stage, compared with the original Transformer, acoustic features are
used as input instead of plain text; a CTC loss serves as the prediction target
on top of the encoder, and the decoder blocks remain unchanged. This paper
presents a comparison between the two-stage training method and fully supervised
JCT, and investigates our approach's robustness to different volumes of training
data. Experiments show that the two-stage training method delivers much better
performance than the fully supervised model. Two-stage training that exploits
only 30% of the WSJ labeled data achieves a 17% word error rate (WER) reduction
compared with a model trained on 50% of WSJ in a fully supervised way. Moreover,
increasing the unlabeled data for MPE from WSJ (81 h) to LibriSpeech (960 h)
yields about a 22% WER reduction.
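A minimal sketch of the MPE objective, assuming randomly chosen input frames are zeroed out and reconstructed from the encoder output with an L1 loss; the masking ratio, loss choice, and stand-in modules are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mpe_step(encoder, reconstructor, feats, mask_ratio=0.15):
    """One masked-pre-training step on unlabeled acoustic features.

    encoder: maps (batch, time, feat_dim) -> (batch, time, d_model)
    reconstructor: maps (batch, time, d_model) -> (batch, time, feat_dim)
    mask_ratio and the L1 loss are illustrative choices.
    """
    batch, time, _ = feats.shape
    mask = torch.rand(batch, time, device=feats.device) < mask_ratio   # frames to hide
    masked_feats = feats.masked_fill(mask.unsqueeze(-1), 0.0)
    recon = reconstructor(encoder(masked_feats))
    # Only the masked positions contribute to the reconstruction loss.
    return F.l1_loss(recon[mask], feats[mask])

# Toy usage with per-frame linear maps standing in for the transformer encoder.
enc, rec = nn.Linear(80, 256), nn.Linear(256, 80)
loss = mpe_step(enc, rec, torch.randn(4, 120, 80))
```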
Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss
In this paper we present an end-to-end speech recognition model with
Transformer encoders that can be used in a streaming speech recognition system.
Transformer computation blocks based on self-attention are used to encode both
audio and label sequences independently. The activations from both audio and
label encoders are combined with a feed-forward layer to compute a probability
distribution over the label space for every combination of acoustic frame
position and label history. This is similar to the Recurrent Neural Network
Transducer (RNN-T) model, which uses RNNs for information encoding instead of
Transformer encoders. The model is trained with the RNN-T loss, which is well-suited to
streaming decoding. We present results on the LibriSpeech dataset showing that
limiting the left context for self-attention in the Transformer layers makes
decoding computationally tractable for streaming, with only a slight
degradation in accuracy. We also show that the full attention version of our
model surpasses the state-of-the-art accuracy on the LibriSpeech benchmarks. Our
results also show that we can bridge the gap between full attention and limited
attention versions of our model by attending to a limited number of future
frames.
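The joint computation described above can be sketched as follows; the dimensions, the additive combination, and the tanh nonlinearity are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TransducerJoiner(nn.Module):
    """Combine audio and label encoder activations into per-(frame, label-history) logits."""
    def __init__(self, enc_dim=512, pred_dim=512, join_dim=640, vocab_size=1000):
        super().__init__()
        self.audio_proj = nn.Linear(enc_dim, join_dim)
        self.label_proj = nn.Linear(pred_dim, join_dim)
        self.out = nn.Linear(join_dim, vocab_size + 1)   # +1 for the blank label

    def forward(self, audio_enc, label_enc):
        # audio_enc: (batch, T, enc_dim), label_enc: (batch, U, pred_dim)
        joint = torch.tanh(self.audio_proj(audio_enc).unsqueeze(2) +
                           self.label_proj(label_enc).unsqueeze(1))
        return self.out(joint)                            # (batch, T, U, vocab_size + 1)

joiner = TransducerJoiner()
logits = joiner(torch.randn(2, 100, 512), torch.randn(2, 20, 512))
# logits feed the RNN-T loss over all (frame, label-history) combinations.
```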
Transformer with Bidirectional Decoder for Speech Recognition
Attention-based models have made tremendous progress on end-to-end automatic
speech recognition (ASR) recently. However, conventional transformer-based
approaches usually generate the output sequence token by token from left to
right, leaving the right-to-left contexts unexploited. In this work, we
introduce a bidirectional speech transformer to utilize both directional
contexts simultaneously. Specifically, the outputs of our proposed transformer
include a left-to-right target and a right-to-left target. At the inference
stage, we use the proposed bidirectional beam search method, which generates
both left-to-right and right-to-left candidates and determines the best
hypothesis by its score.
To demonstrate our proposed speech transformer with a bidirectional
decoder (STBD), we conduct extensive experiments on the AISHELL-1 dataset. The
experimental results show that STBD achieves a 3.6% relative CER
reduction (CERR) over the unidirectional speech transformer baseline. Moreover,
the strongest model in this paper, called STBD-Big, achieves a 6.64% CER on the
test set without language model rescoring or any extra data augmentation
strategies.
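A minimal sketch of the final hypothesis selection, assuming each decoding direction returns scored candidates and right-to-left hypotheses are reversed back to natural order before comparison; the data structures and scores are illustrative.

```python
def select_best_hypothesis(l2r_candidates, r2l_candidates):
    """Pick the best hypothesis from both decoding directions by score.

    Each candidate is a (token_ids, log_prob) pair; right-to-left hypotheses are
    reversed back into natural order before comparison (illustrative convention).
    """
    pool = list(l2r_candidates)
    pool += [(list(reversed(tokens)), score) for tokens, score in r2l_candidates]
    return max(pool, key=lambda cand: cand[1])

best_tokens, best_score = select_best_hypothesis(
    [([5, 9, 2], -1.3), ([5, 7, 2], -1.8)],     # left-to-right beam
    [([2, 9, 5], -1.1)],                         # right-to-left beam
)
```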