Edit Distance based RL for RNNT decoding
RNN-T is currently considered the industry standard in ASR due to its
exceptional WERs in various benchmark tests and its ability to support seamless
streaming and longform transcription. However, its biggest drawback lies in the
significant discrepancy between its training and inference objectives. During
training, RNN-T maximizes all alignment probabilities by teacher forcing, while
during inference it uses beam search, which is not guaranteed to find the
most probable alignment. Additionally, because RNN-T never encounters its own
mistakes during teacher-forced training, it copes poorly when one occurs at
inference. To address this issue, this paper proposes a
Reinforcement Learning method that narrows the gap between training and
inference. Our Edit Distance based RL (EDRL) approach computes rewards
based on the edit distance, and trains the network at every action level. The
proposed approach yielded SoTA WERs on LibriSpeech for the 600M Conformer RNN-T
model.
Comment: 5 pages, 2 figures
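The abstract does not spell out the reward formulation, so the sketch below is only a rough illustration of one plausible way to derive per-action rewards from edit distance: each emitted token is credited with how much it reduces the edit distance to the reference. Both functions are hypothetical, not the paper's exact method.

```python
# Hedged sketch: per-action rewards from edit distance
# (illustrative only, not the paper's exact formulation).

def edit_distance(hyp, ref):
    """Standard Levenshtein distance between two token sequences."""
    d = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev, d[0] = d[0], i
        for j, r in enumerate(ref, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (h != r))  # substitution
    return d[len(ref)]

def per_token_rewards(hyp, ref):
    """Reward each emitted token by how much it reduces the edit
    distance to the reference, relative to the previous prefix."""
    rewards = []
    prev_dist = edit_distance([], ref)  # empty prefix: distance = len(ref)
    for t in range(1, len(hyp) + 1):
        dist = edit_distance(hyp[:t], ref)
        rewards.append(prev_dist - dist)  # +1 if the token helped, <= 0 otherwise
        prev_dist = dist
    return rewards

# Example: reward sequence for a beam-search hypothesis vs. the reference.
print(per_token_rewards(list("kat"), list("cat")))  # [0, 1, 1]
```

Rewarding every action, rather than only the final hypothesis, is what lets such a method train the network "at every action level" as the abstract describes.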
Pseudo Label Is Better Than Human Label
State-of-the-art automatic speech recognition (ASR) systems are trained with
tens of thousands of hours of labeled speech data. Human transcription is
expensive and time-consuming. Factors such as the quality and consistency of
the transcription can greatly affect the performance of the ASR models trained
with these data. In this paper, we show that we can train a strong teacher
model to produce high quality pseudo labels by utilizing recent self-supervised
and semi-supervised learning techniques. Specifically, we use JUST (Joint
Unsupervised/Supervised Training) and iterative noisy student teacher training
to train a 600 million parameter bi-directional teacher model. This model
achieved a 4.0% word error rate (WER) on a voice search task, an 11.1%
relative improvement over the baseline. We further show that by using this strong teacher model
to generate high-quality pseudo labels for training, we can achieve 13.6%
relative WER reduction (5.9% to 5.1%) for a streaming model compared to using
human labels.
Comment: 6 pages, 2 figures, 9 tables, submitted to INTERSPEECH
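As a hedged sketch of the iterative noisy-student recipe the abstract describes: a strong, non-streaming teacher transcribes unlabeled audio, and the student trains on augmented copies of that audio paired with the pseudo labels. The names teacher.transcribe, augment, and train_step below are hypothetical placeholders.

```python
# Hedged sketch of one round of noisy-student pseudo-labeling
# (all callables are hypothetical placeholders).

def pseudo_label_round(teacher, student, unlabeled_audio, train_step, augment):
    """One round: the strong teacher transcribes clean audio to produce
    pseudo labels; the student is trained on an augmented (noisy) copy of
    the same audio paired with those pseudo labels."""
    for audio in unlabeled_audio:
        pseudo_text = teacher.transcribe(audio)  # high-quality pseudo label
        noisy_audio = augment(audio)             # e.g. SpecAugment-style noise
        train_step(student, noisy_audio, pseudo_text)
    return student

# In the iterative variant, the improved student can become the next
# round's teacher.
```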
Multi-Dialect Speech Recognition With A Single Sequence-To-Sequence Model
Sequence-to-sequence models provide a simple and elegant solution for
building speech recognition systems by folding separate components of a typical
system, namely acoustic (AM), pronunciation (PM) and language (LM) models into
a single neural network. In this work, we look at one such sequence-to-sequence
model, namely listen, attend and spell (LAS), and explore the possibility of
training a single model to serve different English dialects, which simplifies
the process of training multi-dialect systems without the need for separate AM,
PM and LMs for each dialect. We show that simply pooling the data from all
dialects into one LAS model falls behind the performance of a model fine-tuned
on each dialect. We then look at incorporating dialect-specific information
into the model, both by modifying the training targets, inserting the dialect
symbol at the end of the original grapheme sequence, and by feeding a 1-hot
representation of the dialect into all layers of the model.
Experimental results on seven English dialects show that our proposed system is
effective in modeling dialect variations within a single LAS model,
outperforming a LAS model trained individually on each of the seven dialects by
3.1% to 16.5% relative.
Comment: submitted to ICASSP 2018
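To make the two conditioning mechanisms concrete, here is a minimal sketch. The dialect codes and grapheme inventory are hypothetical, and the real model feeds the 1-hot vector into its layers rather than merely constructing it.

```python
import numpy as np

# Hedged sketch of the two dialect-conditioning mechanisms
# (dialect codes are hypothetical examples).

DIALECTS = ["en-us", "en-gb", "en-in", "en-au", "en-ca", "en-za", "en-ie"]

def add_dialect_token(graphemes, dialect):
    """Append a dialect symbol to the grapheme targets,
    e.g. ['h', 'i'] -> ['h', 'i', '<en-gb>']."""
    return graphemes + [f"<{dialect}>"]

def dialect_one_hot(dialect):
    """1-hot dialect vector, to be fed into every layer of the model."""
    vec = np.zeros(len(DIALECTS), dtype=np.float32)
    vec[DIALECTS.index(dialect)] = 1.0
    return vec

print(add_dialect_token(list("hi"), "en-gb"))  # ['h', 'i', '<en-gb>']
print(dialect_one_hot("en-in"))                # [0. 0. 1. 0. 0. 0. 0.]
```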
Resource-Efficient Transfer Learning From Speech Foundation Model Using Hierarchical Feature Fusion
Self-supervised pre-training of a speech foundation model, followed by
supervised fine-tuning, has shown impressive quality improvements on automatic
speech recognition (ASR) tasks. Fine-tuning a separate foundation model for
each of many downstream tasks is expensive, since foundation models are usually very large.
Parameter-efficient fine-tuning methods (e.g. adapter, sparse update methods)
offer an alternative paradigm where a small set of parameters is updated to
adapt the foundation model to new tasks. However, these methods still suffer
from a high computational memory cost and slow training speed because they
require backpropagation through the entire neural network at each step. In
this paper, we analyze the performance of features at different layers of a
foundation model on the speech recognition task and propose a novel
hierarchical feature fusion method for resource-efficient transfer learning
from speech foundation models. Experimental results show that the proposed
method achieves better performance on the speech recognition task than existing
algorithms, with fewer trainable parameters, lower computational memory
cost, and faster training. When combined with Adapters at all layers,
the proposed method matches the performance of fine-tuning the whole
model, with fewer trainable encoder parameters and faster training.
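A minimal sketch of layer-wise feature fusion follows, assuming precomputed (frozen) encoder features and softmax-normalized per-layer weights; the paper's exact fusion architecture may differ. The key property is that gradients flow only into the fusion weights and the head on top, never back through the foundation model.

```python
import numpy as np

# Hedged sketch: weighted fusion of frozen intermediate features
# (illustrative, not necessarily the paper's exact architecture).

def fuse_layer_features(layer_feats, fusion_logits):
    """Combine features from several frozen encoder layers with a
    softmax-normalized learned weight per layer. Treating the encoder
    outputs as fixed inputs means no backpropagation through the
    foundation model itself."""
    w = np.exp(fusion_logits - fusion_logits.max())
    w /= w.sum()  # softmax over layers
    # layer_feats: [num_layers, time, dim] -> weighted sum over layers
    return np.tensordot(w, layer_feats, axes=1)  # [time, dim]

feats = np.random.randn(4, 100, 256).astype(np.float32)  # 4 tapped layers
logits = np.zeros(4, dtype=np.float32)                   # start uniform
print(fuse_layer_features(feats, logits).shape)          # (100, 256)
```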
Large vocabulary speech recognition for languages of Africa: multilingual modeling and self-supervised learning
Almost none of the 2,000+ languages spoken in Africa have widely available
automatic speech recognition systems, and the data required to build them is
available for only a few languages. We have experimented with two techniques which
may provide pathways to large vocabulary speech recognition for African
languages: multilingual modeling and self-supervised learning. We gathered
available open source data and collected data for 15 languages, and trained
experimental models using these techniques. Our results show that pooling the
small amounts of available data in multilingual end-to-end models, and
pre-training on unsupervised data, can help improve speech recognition quality
for many African languages.
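As a hedged sketch of multilingual data pooling, a common recipe for such models (the language-tagging scheme and dataset contents below are hypothetical, not necessarily this paper's setup):

```python
import random

# Hedged sketch: pooling small per-language datasets into one
# multilingual training set (tags and data are hypothetical).

def pool_multilingual(datasets):
    """Merge per-language (audio, text) examples into one shuffled list,
    prefixing each transcript with a language tag so a single end-to-end
    model can be trained on all languages at once."""
    pooled = [(audio, f"<{lang}> {text}")
              for lang, examples in datasets.items()
              for audio, text in examples]
    random.shuffle(pooled)
    return pooled

datasets = {
    "sw": [("utt1.wav", "habari")],   # Swahili
    "yo": [("utt2.wav", "bawo ni")],  # Yoruba
}
print(pool_multilingual(datasets))
```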