10 research outputs found
Semantic Mask for Transformer based End-to-End Speech Recognition
Attention-based encoder-decoder models have achieved impressive results for
both automatic speech recognition (ASR) and text-to-speech (TTS) tasks. This
approach takes advantage of the memorization capacity of neural networks to
learn the mapping from the input sequence to the output sequence from scratch,
without the assumption of prior knowledge such as the alignments. However, this
model is prone to overfitting, especially when the amount of training data is
limited. Inspired by SpecAugment and BERT, in this paper, we propose a semantic
mask-based regularization for training such end-to-end (E2E) models. The
idea is to mask the input features corresponding to a particular output token,
e.g., a word or a word-piece, in order to encourage the model to fill in the token
based on the contextual information. While this approach is applicable to the
encoder-decoder framework with any type of neural network architecture, we
study the transformer-based model for ASR in this work. We perform experiments
on the Librispeech 960h and TedLium2 data sets, and achieve state-of-the-art
performance among E2E models on the test sets.
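To make the masking idea concrete, here is a minimal sketch (not the authors' implementation) that masks all input frames aligned to randomly selected output tokens, assuming a token-to-frame alignment is available; the mean-value fill and the mask_ratio parameter are illustrative assumptions.

```python
import numpy as np

def semantic_mask(features, token_spans, mask_ratio=0.15, rng=None):
    """Mask every frame belonging to randomly chosen output tokens.

    features:    (T, F) array of acoustic frames (e.g., filterbanks).
    token_spans: list of (start_frame, end_frame) pairs, one per token,
                 e.g., from a forced aligner (assumed to be given here).
    """
    rng = rng or np.random.default_rng()
    masked = features.copy()
    fill = features.mean(axis=0)          # illustrative fill value
    for start, end in token_spans:
        if rng.random() < mask_ratio:     # drop this token's acoustic evidence
            masked[start:end] = fill
    return masked
```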
On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition
Recently, there has been a strong push to transition from hybrid models to
end-to-end (E2E) models for automatic speech recognition. Currently, there are
three promising E2E methods: recurrent neural network transducer (RNN-T), RNN
attention-based encoder-decoder (AED), and Transformer-AED. In this study, we
conduct an empirical comparison of RNN-T, RNN-AED, and Transformer-AED models,
in both non-streaming and streaming modes. We use 65 thousand hours of
Microsoft anonymized training data to train these models. As E2E models are
more data hungry, it is better to compare their effectiveness with a large amount
of training data. To the best of our knowledge, no such comprehensive study has
been conducted yet. We show that although AED models are stronger than RNN-T in
the non-streaming mode, RNN-T is very competitive in streaming mode if its
encoder can be properly initialized. Among all three E2E models,
Transformer-AED achieved the best accuracy in both streaming and non-streaming
modes. We show that both streaming RNN-T and Transformer-AED models can obtain
better accuracy than a highly-optimized hybrid model. Comment: Accepted by Interspeech 202
Low Latency End-to-End Streaming Speech Recognition with a Scout Network
The attention-based Transformer model has achieved promising results for
speech recognition (SR) in the offline mode. However, in the streaming mode,
the Transformer model usually incurs significant latency to maintain its
recognition accuracy when applying a fixed-length look-ahead window in each
encoder layer. In this paper, we propose a novel low-latency streaming approach
for Transformer models, which consists of a scout network and a recognition
network. The scout network detects whole-word boundaries without seeing any
future frames, while the recognition network predicts the next subword by
utilizing the information from all the frames before the predicted boundary.
Our model achieves the best performance (2.7/6.4 WER) with only 639 ms latency
on the test-clean and test-other data sets of Librispeech.
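A rough sketch of how such a two-network streaming loop could be organized; `scout_boundary` and `recognize_segment` are hypothetical callables standing in for the scout and recognition networks, and the frame-synchronous loop is a simplification.

```python
def scout_streaming_decode(frames, scout_boundary, recognize_segment):
    """Decode with a scout/recognition split: the scout sees only past
    frames to flag word boundaries; the recognizer then attends to every
    frame up to the detected boundary to emit the next subwords."""
    tokens = []
    last_boundary = 0
    for t in range(1, len(frames) + 1):
        if scout_boundary(frames[:t]):              # no future frames used
            tokens += recognize_segment(frames[:t], tokens)
            last_boundary = t
    if last_boundary < len(frames):                 # flush trailing audio
        tokens += recognize_segment(frames, tokens)
    return tokens
```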
Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers
We propose an end-to-end speaker-attributed automatic speech recognition
model that unifies speaker counting, speech recognition, and speaker
identification on monaural overlapped speech. Our model is built on serialized
output training (SOT) with attention-based encoder-decoder, a recently proposed
method for recognizing overlapped speech comprising an arbitrary number of
speakers. We extend SOT by introducing a speaker inventory as an auxiliary
input to produce speaker labels as well as multi-speaker transcriptions. All
model parameters are optimized by a speaker-attributed maximum mutual information
criterion, which represents a joint probability for overlapped speech
recognition and speaker identification. Experiments on the LibriSpeech corpus show
that our proposed method achieves significantly better speaker-attributed word
error rate than the baseline that separately performs overlapped speech
recognition and speaker identification. Comment: Accepted to INTERSPEECH 202
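For context, serialized output training targets are built by concatenating the speakers' transcriptions in order of their start times, with a speaker-change symbol in between; the sketch below illustrates that construction (the `<sc>` token name and the input layout are assumptions).

```python
def build_sot_target(utterances, change_token="<sc>"):
    """Concatenate per-speaker transcriptions, earliest speaker first,
    inserting a speaker-change token between speakers."""
    ordered = sorted(utterances, key=lambda u: u["start"])
    target = []
    for i, utt in enumerate(ordered):
        if i > 0:
            target.append(change_token)
        target.extend(utt["words"])
    return target

# Two overlapped speakers -> one serialized token sequence
print(build_sot_target([
    {"start": 0.0, "words": ["good", "morning"]},
    {"start": 1.4, "words": ["thank", "you"]},
]))
# ['good', 'morning', '<sc>', 'thank', 'you']
```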
Exploring Transformers for Large-Scale Speech Recognition
While recurrent neural networks still largely define state-of-the-art speech
recognition systems, the Transformer network has been proven to be a
competitive alternative, especially in the offline condition. Most studies with
Transformers have been constrained to relatively small-scale settings, and
some form of data augmentation is usually applied to combat the
data sparsity issue. In this paper, we aim at understanding the behaviors of
Transformers in the large-scale speech recognition setting, where we have used
around 65,000 hours of training data. We investigated various aspects of
scaling up Transformers, including model initialization, warmup training as
well as different Layer Normalization strategies. In the streaming condition,
we compared the widely used attention mask based future context lookahead
approach to the Transformer-XL network. From our experiments, we show that
Transformers can achieve around 6% relative word error rate (WER) reduction
compared to the BLSTM baseline in the offline fashion, while in the streaming
fashion, Transformer-XL is comparable to LC-BLSTM with 800 millisecond latency
constraint. Comment: 5 pages, 1 figure, Interspeech 2020 Camera Ready
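As a small illustration of the attention-mask-based lookahead mentioned above (a sketch, with illustrative names), each frame is allowed to attend to all past frames plus a fixed number of future frames:

```python
import numpy as np

def streaming_attention_mask(num_frames, lookahead):
    """Boolean self-attention mask: row t marks the frames that frame t
    may attend to, i.e., everything up to t plus `lookahead` future frames."""
    mask = np.zeros((num_frames, num_frames), dtype=bool)
    for t in range(num_frames):
        mask[t, : min(t + lookahead + 1, num_frames)] = True
    return mask

print(streaming_attention_mask(5, lookahead=1).astype(int))
```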
Continuous Speech Separation with Conformer
Continuous speech separation plays a vital role in complicated speech related
tasks such as conversation transcription. The separation model extracts a
single-speaker signal from the mixed speech. In this paper, we use Transformer
and Conformer models in lieu of recurrent neural networks in the separation system, as
we believe that capturing global information with self-attention is
crucial for speech separation. Evaluated on the LibriCSS dataset, the
Conformer separation model achieves state-of-the-art results, with a relative
23.5% word error rate (WER) reduction from bi-directional LSTM (BLSTM) in the
utterance-wise evaluation and a 15.4% WER reduction in the continuous
evaluation.
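As a minimal sketch of the mask-based separation step (not the paper's exact pipeline), the separation network predicts one time-frequency mask per output channel, and each mask is applied to the mixture spectrogram to recover a single-speaker signal:

```python
import numpy as np

def apply_separation_masks(mixture_stft, masks):
    """mixture_stft: (T, F) complex spectrogram of the mixed speech.
    masks: (S, T, F) real-valued masks in [0, 1], one per output channel.
    Returns a list of S masked spectrograms, one per separated source."""
    return [mask * mixture_stft for mask in masks]
```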
Investigation of Practical Aspects of Single Channel Speech Separation for ASR
Speech separation has been successfully applied as a frontend processing
module of conversation transcription systems thanks to its ability to handle
overlapped speech and its flexibility to combine with downstream tasks such as
automatic speech recognition (ASR). However, a speech separation model often
introduces target speech distortion, resulting in a sub-optimum word error rate
(WER). In this paper, we describe our efforts to improve the performance of a
single channel speech separation system. Specifically, we investigate a
two-stage training scheme that first applies a feature-level optimization
criterion for pretraining, followed by an ASR-oriented optimization criterion
using an end-to-end (E2E) speech recognition model. Meanwhile, to keep the
model light-weight, we introduce a modified teacher-student learning technique
for model compression. By combining these approaches, we achieve an absolute
average WER improvement of 2.70% and 0.77% using models with less than 10M
parameters compared with the previous state-of-the-art results on the LibriCSS
dataset for utterance-wise evaluation and continuous evaluation, respectively. Comment: Accepted by Interspeech 202
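To illustrate the teacher-student compression idea in its simplest form (the paper's actual criteria are not reproduced here), a compact student can be trained to match both the clean reference and the outputs of a large pretrained teacher; `alpha` is a hypothetical weighting:

```python
import torch
import torch.nn.functional as F

def teacher_student_loss(student_out, teacher_out, clean_target, alpha=0.5):
    """Blend a supervised term with a distillation term so that a compact
    student mimics a large, frozen teacher while still fitting the reference."""
    supervised = F.mse_loss(student_out, clean_target)   # feature-level criterion
    distill = F.mse_loss(student_out, teacher_out)       # mimic the teacher
    return alpha * supervised + (1.0 - alpha) * distill
```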
MAM: Masked Acoustic Modeling for End-to-End Speech-to-Text Translation
End-to-end Speech-to-text Translation (E2E-ST), which directly translates
source language speech to target language text, is widely useful in practice,
but traditional cascaded approaches (ASR+MT) often suffer from error
propagation in the pipeline. On the other hand, existing end-to-end solutions
heavily depend on the source language transcriptions for pre-training or
multi-task training with Automatic Speech Recognition (ASR). We instead propose
a simple technique to learn a robust speech encoder in a self-supervised
fashion only on the speech side, which can utilize speech data without
transcription. This technique, termed Masked Acoustic Modeling (MAM), not only
provides an alternative solution to improving E2E-ST, but also can perform
pre-training on any acoustic signals (including non-speech ones) without
annotation. We conduct our experiments over 8 different translation directions.
In the setting without using any transcriptions, our technique achieves an
average improvement of +1.1 BLEU, and of +2.3 BLEU with MAM pre-training.
Pre-training MAM on arbitrary acoustic signals also yields an average
improvement of +1.6 BLEU for those languages. Compared with the ASR multi-task
learning solution, which relies on transcription during training, our
pre-trained MAM model, which does not use transcription, achieves similar
accuracy. Comment: 12 pages
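A rough sketch of the span-masking step behind MAM, with illustrative span settings: random spans of the input spectrogram are corrupted, and a reconstruction loss would be applied only at the masked positions.

```python
import numpy as np

def mam_corrupt(spectrogram, span=10, num_spans=2, rng=None):
    """Zero out random spans of frames and return both the corrupted input
    and the boolean mask marking where reconstruction should be scored."""
    rng = rng or np.random.default_rng()
    corrupted = spectrogram.copy()
    mask = np.zeros(len(spectrogram), dtype=bool)
    for _ in range(num_spans):
        start = rng.integers(0, max(1, len(spectrogram) - span))
        corrupted[start:start + span] = 0.0
        mask[start:start + span] = True
    return corrupted, mask
```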
Multi-microphone Complex Spectral Mapping for Utterance-wise and Continuous Speaker Separation
We propose multi-microphone complex spectral mapping, a simple way of
applying deep learning for time-varying non-linear beamforming, for offline
utterance-wise and block-online continuous speaker separation in reverberant
conditions, aiming at both speaker separation and dereverberation. Assuming a
fixed array geometry between training and testing, we train deep neural
networks (DNN) to predict the real and imaginary (RI) components of target
speech at a reference microphone from the RI components of multiple
microphones. We then integrate multi-microphone complex spectral mapping with
beamforming and post-filtering to further improve separation, and combine it
with frame-level speaker counting for block-online continuous speaker
separation (CSS). Although our system is trained on simulated room impulse
responses (RIR) based on a fixed number of microphones arranged in a given
geometry, it generalizes well to a real array with the same geometry.
State-of-the-art separation performance is obtained on the simulated two-talker
SMS-WSJ corpus and the real-recorded LibriCSS dataset. Comment: 10 pages, in submission
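To make the input representation concrete, here is a small sketch of stacking the real and imaginary (RI) components of every microphone's STFT into per-frame features; the DNN that maps these to the reference microphone's target RI components is not shown, and the exact layout is an assumption.

```python
import numpy as np

def stack_ri_features(multichannel_stft):
    """multichannel_stft: (M, T, F) complex array for M microphones.
    Returns a (T, 2*M*F) real-valued feature matrix of stacked RI components."""
    real = multichannel_stft.real                       # (M, T, F)
    imag = multichannel_stft.imag                       # (M, T, F)
    stacked = np.concatenate([real, imag], axis=0)      # (2M, T, F)
    num_frames = real.shape[1]
    return np.transpose(stacked, (1, 0, 2)).reshape(num_frames, -1)
```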
Alignment Knowledge Distillation for Online Streaming Attention-based Speech Recognition
This article describes an efficient training method for online streaming
attention-based encoder-decoder (AED) automatic speech recognition (ASR)
systems. AED models have achieved competitive performance in offline scenarios
by jointly optimizing all components. They have recently been extended to an
online streaming framework via models such as monotonic chunkwise attention
(MoChA). However, the elaborate attention calculation process is not robust for
long-form speech utterances. Moreover, the sequence-level training objective
and time-restricted streaming encoder cause a nonnegligible delay in token
emission during inference. To address these problems, we propose CTC
synchronous training (CTC-ST), in which CTC alignments are leveraged as a
reference for token boundaries to enable a MoChA model to learn optimal
monotonic input-output alignments. We formulate a purely end-to-end training
objective to synchronize the boundaries of MoChA to those of CTC. The CTC model
shares an encoder with the MoChA model to enhance the encoder representation.
Moreover, the proposed method provides alignment information learned in the CTC
branch to the attention-based decoder. Therefore, CTC-ST can be regarded as
self-distillation of alignment knowledge from CTC to MoChA. Experimental
evaluations on a variety of benchmark datasets show that the proposed method
significantly reduces recognition errors and emission latency simultaneously,
especially for long-form and noisy speech. We also compare CTC-ST with several
methods that distill alignment knowledge from a hybrid ASR system and show that
the CTC-ST can achieve a comparable tradeoff of accuracy and latency without
relying on external alignment information. The best MoChA system shows
performance comparable to that of the RNN-transducer (RNN-T).
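As a deliberately simplified sketch of the synchronization idea (not the paper's exact objective), the expected emission frame under each token's MoChA boundary distribution can be pulled toward the corresponding CTC boundary; tensor shapes and names are assumptions.

```python
import torch

def ctc_sync_loss(mocha_boundary_probs, ctc_boundaries):
    """mocha_boundary_probs: (U, T) tensor; row u is a distribution over
    input frames for token u's boundary.
    ctc_boundaries: (U,) tensor of boundary frame indices from the CTC alignment."""
    frames = torch.arange(mocha_boundary_probs.size(1), dtype=torch.float32)
    expected = (mocha_boundary_probs * frames).sum(dim=1)  # expected boundary frame per token
    return torch.abs(expected - ctc_boundaries.float()).mean()
```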