Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies of Large End-to-End Models
The accuracy of end-to-end (E2E) automatic speech recognition (ASR) models
continues to improve as they are scaled to larger sizes, with some now reaching
billions of parameters. Widespread deployment and adoption of these models,
however, requires computationally efficient strategies for decoding. In the
present work, we study one such strategy: applying multiple frame reduction
layers in the encoder to compress encoder outputs into a small number of output
frames. While similar techniques have been investigated in previous work, we
achieve dramatically more reduction than has previously been demonstrated
through the use of multiple funnel reduction layers. Through ablations, we
study the impact of various architectural choices in the encoder to identify
the most effective strategies. We demonstrate that we can generate one encoder
output frame for every 2.56 sec of input speech, without significantly
affecting word error rate on a large-scale voice search task, while improving
encoder and decoder latencies by 48% and 92% respectively, relative to a strong
but computationally expensive baseline.
Comment: Accepted to 2024 IEEE International Conference on Acoustics, Speech,
and Signal Processing (ICASSP 2024
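As a rough illustration of the arithmetic behind "one encoder output frame for every 2.56 sec", the sketch below computes the output frame duration produced by a stack of reduction layers. The 10 ms input frame and the eight 2x layers are hypothetical values chosen only so that the product equals 2.56 sec; the abstract does not state the actual configuration.

```python
from math import prod


def output_frame_ms(input_frame_ms, reduction_factors):
    """Duration of audio covered by one encoder output frame."""
    return input_frame_ms * prod(reduction_factors)


def num_output_frames(num_input_frames, reduction_factors):
    """Frames left after each reduction layer (ceiling division per layer)."""
    n = num_input_frames
    for r in reduction_factors:
        n = -(-n // r)  # ceil(n / r) using integer arithmetic
    return n


# Hypothetical setup: 10 ms input frames, eight layers that each halve the rate.
factors = [2] * 8
print(output_frame_ms(10, factors))      # 2560 ms, i.e. 2.56 sec per output frame
print(num_output_frames(1000, factors))  # 10 sec of audio -> 4 output frames
```

Because decoder cost grows with the number of encoder output frames, fewer frames per utterance is one plausible reason for the large decoder latency improvement reported above.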
On the Relation between Internal Language Model and Sequence Discriminative Training for Neural Transducers
Internal language model (ILM) subtraction has been widely applied to improve
the performance of the RNN-Transducer with external language model (LM) fusion
for speech recognition. In this work, we show that sequence discriminative
training has a strong correlation with ILM subtraction from both theoretical
and empirical points of view. Theoretically, we derive that the global optimum
of maximum mutual information (MMI) training has a form similar to that of ILM
subtraction. Empirically, we show that ILM subtraction and sequence
discriminative training achieve similar performance across a wide range of
experiments on Librispeech, including both MMI and minimum Bayes risk (MBR)
criteria, as well as neural transducers and LMs of both full and limited
context. The benefit of ILM subtraction also becomes much smaller after
sequence discriminative training. We also provide an in-depth study to show
that sequence discriminative training has a minimal effect on the commonly used
zero-encoder ILM estimation, but a joint effect on both encoder and prediction
+ joint network for posterior probability reshaping including both ILM and
blank suppression.
Comment: submitted to ICASSP 202
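For readers unfamiliar with ILM subtraction, a minimal sketch of the commonly used log-linear scoring rule follows: the external LM score is added and the estimated internal LM score is subtracted from the transducer score. The weights and the hypothesis scores are hypothetical placeholders, not values from the paper.

```python
def fused_score(log_p_asr, log_p_ext_lm, log_p_ilm, lm_weight=0.5, ilm_weight=0.3):
    """Score a hypothesis with external LM fusion and ILM subtraction.

    All inputs are log-probabilities; the two weights are assumed tuning knobs.
    """
    return log_p_asr + lm_weight * log_p_ext_lm - ilm_weight * log_p_ilm


# Hypothetical hypotheses: (ASR score, external LM score, estimated ILM score).
hyps = {
    "a": (-4.0, -6.0, -3.0),
    "b": (-4.5, -3.0, -5.0),
}
best = max(hyps, key=lambda h: fused_score(*hyps[h]))
print(best)  # "b": the external LM prefers it and the ILM penalty is smaller
```

Setting `ilm_weight=0.0` recovers plain shallow fusion, which makes it easy to compare the two scoring rules on the same hypotheses.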
A Comparison of Semi-Supervised Learning Techniques for Streaming ASR at Scale
Unpaired text and audio injection have emerged as dominant methods for
improving ASR performance in the absence of a large labeled corpus. However,
little guidance exists on deploying these methods to improve production ASR
systems that are trained on very large supervised corpora and with realistic
requirements like a constrained model size and CPU budget, streaming
capability, and a rich lattice for rescoring and for downstream NLU tasks. In
this work, we compare three state-of-the-art semi-supervised methods
encompassing both unpaired text and audio as well as several of their
combinations in a controlled setting using joint training. We find that in our
setting these methods offer many improvements beyond raw WER, including
substantial gains in tail-word WER, decoder computation during inference, and
lattice density.
Hybrid Attention-based Encoder-decoder Model for Efficient Language Model Adaptation
The attention-based encoder-decoder (AED) speech recognition model has been
widely successful in recent years. However, the joint optimization of the acoustic
model and language model in an end-to-end manner has created challenges for text
adaptation. In particular, performing text adaptation effectively, quickly, and
inexpensively has become a primary concern for deploying AED systems in industry.
To address this issue, we propose a novel model, the hybrid attention-based
encoder-decoder (HAED) speech recognition model, which preserves the modularity
of conventional hybrid automatic speech recognition systems. Our HAED model
separates the acoustic and language models, allowing for the use of
conventional text-based language model adaptation techniques. We demonstrate
that the proposed HAED model yields a 21% relative word error rate (WER)
improvement when out-of-domain text data is used for language model adaptation,
with only a minor degradation in WER on a general test set compared with
a conventional AED model.
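Since several abstracts in this listing report relative rather than absolute WER changes, a small sketch of the calculation may help. The baseline and adapted WER values below are hypothetical, picked only to reproduce a 21% relative improvement; they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction, as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers: a baseline at 10.0% WER and an adapted model at 7.9% WER.
print(f"{relative_wer_reduction(10.0, 7.9):.0%}")  # 21%
```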
Large-scale Language Model Rescoring on Long-form Data
In this work, we study the impact of Large-scale Language Models (LLM) on
Automated Speech Recognition (ASR) of YouTube videos, which we use as a source
for long-form ASR. We demonstrate up to 8% relative reduction in Word Error
Rate (WER) on US English (en-us) and code-switched Indian English (en-in)
long-form ASR test sets and a reduction of up to 30% relative on Salient Term
Error Rate (STER) over a strong first-pass baseline that uses a maximum-entropy
based language model. Improved lattice processing that results in a lattice
with a proper (non-tree) digraph topology, together with carrying context from
the 1-best hypothesis of the previous segment(s), results in significant wins in
rescoring with LLMs. We also find that the gains in performance from combining
LLMs trained on vast quantities of available data (such as C4) with conventional
neural LMs are additive, and that the combination significantly outperforms a
strong first-pass baseline with a maximum entropy LM.
Comment: 5 pages, accepted in ICASSP 202
Modular Domain Adaptation for Conformer-Based Streaming ASR
Speech data from different domains has distinct acoustic and linguistic
characteristics. It is common to train a single multidomain model such as a
Conformer transducer for speech recognition on a mixture of data from all
domains. However, changing data in one domain or adding a new domain would
require the multidomain model to be retrained. To this end, we propose a
framework called modular domain adaptation (MDA) that enables a single model to
process multidomain data while keeping all parameters domain-specific, i.e.,
each parameter is only trained by data from one domain. On a streaming
Conformer transducer trained only on video caption data, experimental results
show that an MDA-based model can reach performance similar to that of the
multidomain model on other domains such as voice search and dictation by adding
per-domain adapters and per-domain feed-forward networks in the Conformer encoder.
Comment: Accepted to Interspeech 202
Internal Language Model Estimation Through Explicit Context Vector Learning for Attention-based Encoder-decoder ASR
An end-to-end (E2E) ASR model implicitly learns a prior Internal Language
Model (ILM) from the training transcripts. To fuse an external LM using Bayes
posterior theory, the log likelihood produced by the ILM has to be accurately
estimated and subtracted. In this paper we propose two novel approaches to
estimate the ILM based on the Listen-Attend-Spell (LAS) framework. The first method
replaces the context vector of the LAS decoder at every time step with a
vector that is learned from the training transcripts. The second method
uses a lightweight feed-forward network to map the query vector directly
to the context vector in a dynamic sense. Since the context vectors
are learned by minimizing the perplexities on training transcripts, and their
estimation is independent of the encoder output, the ILMs are accurately
learned for both methods. Experiments show that the ILMs achieve the lowest
perplexity, indicating the efficacy of the proposed methods. In addition, they
also significantly outperform the shallow fusion method, as well as two
previously proposed ILM Estimation (ILME) approaches on several datasets.
Comment: Proceedings of INTERSPEEC