External Language Model Integration for Factorized Neural Transducers
We propose an adaptation method for factorized neural transducers (FNT) with
external language models. We demonstrate that both neural and n-gram external
LMs add significantly more value when linearly interpolated with predictor
output compared to shallow fusion, thus confirming that FNT forces the
predictor to act like a regular language model. Further, we propose a method to
integrate class-based n-gram language models into the FNT framework, resulting
in accuracy gains similar to a hybrid setup. We show average gains of 18% WERR
with lexical adaptation across various scenarios and additive gains of up to
60% WERR in one entity-rich scenario through a combination of class-based
n-gram and neural LMs.
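As a rough sketch of the two integration styles compared above: shallow fusion adds log-scores, while the proposed method interpolates probabilities with the predictor output. All weights and distributions below are invented toy values, not the paper's configuration:

```python
def shallow_fusion(log_p_model, log_p_ext, lam=0.3):
    # Log-linear combination: scores are added in log space (unnormalized).
    return log_p_model + lam * log_p_ext

def linear_interpolation(p_predictor, p_ext, lam=0.3):
    # Linear interpolation in probability space with the FNT predictor
    # output; the result remains a valid probability distribution.
    return (1 - lam) * p_predictor + lam * p_ext

# Toy distributions over a 3-word vocabulary (hypothetical numbers).
p_pred = [0.7, 0.2, 0.1]
p_ext  = [0.1, 0.6, 0.3]
p_mix = [linear_interpolation(a, b) for a, b in zip(p_pred, p_ext)]
print(p_mix)       # interpolated distribution
print(sum(p_mix))  # still sums to 1.0
```

Note that the shallow-fusion score is not a normalized distribution, which is one intuition for why interpolating with the predictor output can behave differently.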
Updated Corpora and Benchmarks for Long-Form Speech Recognition
The vast majority of ASR research uses corpora in which both the training and
test data have been pre-segmented into utterances. In most real-world ASR
use-cases, however, test audio is not segmented, leading to a mismatch between
inference-time conditions and models trained on segmented utterances. In this
paper, we re-release three standard ASR corpora - TED-LIUM 3, GigaSpeech, and
VoxPopuli-en - with updated transcription and alignments to enable their use
for long-form ASR research. We use these reconstituted corpora to study the
train-test mismatch problem for transducers and attention-based
encoder-decoders (AEDs), confirming that AEDs are more susceptible to this
issue. Finally, we benchmark a simple long-form training approach for these
models, showing its efficacy for model robustness under this domain shift.
Comment: Submitted to ICASSP 202
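A simple long-form training setup can be approximated by concatenating consecutive pre-segmented utterances up to a duration budget. The function and numbers below are an invented sketch of that idea, not the released pipeline:

```python
def make_long_form(utterances, max_len=30.0):
    """Greedily concatenate consecutive pre-segmented utterances
    (duration_seconds, transcript) into longer training examples,
    approximating unsegmented test-time audio."""
    examples, cur_dur, cur_text = [], 0.0, []
    for dur, text in utterances:
        # Start a new example when the budget would be exceeded.
        if cur_text and cur_dur + dur > max_len:
            examples.append((cur_dur, " ".join(cur_text)))
            cur_dur, cur_text = 0.0, []
        cur_dur += dur
        cur_text.append(text)
    if cur_text:
        examples.append((cur_dur, " ".join(cur_text)))
    return examples

utts = [(8.0, "hello world"), (10.0, "this is a test"),
        (15.0, "long form speech"), (5.0, "recognition")]
print(make_long_form(utts))
```

A real pipeline would also carry the audio samples and alignments along with the transcripts; only the segment-merging logic is shown here.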
Incorporating Class-based Language Model for Named Entity Recognition in Factorized Neural Transducer
In spite of the excellent strides made by end-to-end (E2E) models in speech
recognition in recent years, named entity recognition is still challenging but
critical for semantic understanding. In order to enhance the ability to
recognize named entities in E2E models, previous studies mainly focus on
various rule-based or attention-based contextual biasing algorithms. However,
their performance might be sensitive to the biasing weight or degraded by
excessive attention to the named entity list, along with a risk of false
triggering. Inspired by the success of the class-based language model (LM) in
named entity recognition in conventional hybrid systems and the effective
decoupling of acoustic and linguistic information in the factorized neural
Transducer (FNT), we propose a novel E2E model to incorporate class-based LMs
into FNT, which is referred to as C-FNT. In C-FNT, the language model score of
named entities can be associated with the name class instead of its surface
form. The experimental results show that our proposed C-FNT achieves a
significant error reduction on named entities without hurting performance in
general word recognition.
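The class-based LM idea above follows the standard factorization p(word | history) = p(class | history) * p(word | class), so a rare entity surface form inherits the score of its class. A minimal sketch with invented class inventories and probabilities:

```python
import math

# Hypothetical class-based LM tables (toy numbers, not from the paper).
class_of = {"smith": "PERSON", "jonsson": "PERSON", "paris": "CITY"}
p_class_given_hist = {"PERSON": 0.2, "CITY": 0.05}
p_word_given_class = {"smith": 0.5, "jonsson": 0.5, "paris": 1.0}

def class_lm_logprob(word):
    # Score the word via its class rather than its surface form:
    # log p(class | history) + log p(word | class).
    c = class_of[word]
    return math.log(p_class_given_hist[c]) + math.log(p_word_given_class[word])

# A rare name and a common one get the same class-level treatment:
print(class_lm_logprob("smith"), class_lm_logprob("jonsson"))
```

This is the property the abstract exploits: the score of a named entity is tied to its class (e.g. a person name), so unseen entities are not penalized for their surface form.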
End-to-end speech recognition modeling from de-identified data
De-identification of data used for automatic speech recognition modeling is a
critical component in protecting privacy, especially in the medical domain.
However, simply removing all personally identifiable information (PII) from
end-to-end model training data leads to a significant performance degradation
in particular for the recognition of names, dates, locations, and words from
similar categories. We propose and evaluate a two-step method for partially
recovering this loss. First, PII is identified, and each occurrence is replaced
with a random word sequence of the same category. Then, corresponding audio is
produced via text-to-speech or by splicing together matching audio fragments
extracted from the corpus. These artificial audio/label pairs, together with
speaker turns from the original data without PII, are used to train models. We
evaluate the performance of this method on in-house data of medical
conversations and observe a recovery of almost the entire performance
degradation in the general word error rate while still maintaining a strong
diarization performance. Our main focus is the improvement of recall and
precision in the recognition of PII-related words. Depending on the PII
category, much of the performance degradation can be recovered using our
proposed method.
Comment: Accepted to INTERSPEECH 202
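Step one of the two-step method above (replacing each PII occurrence with a random word sequence of the same category) might look roughly like this; the category inventory, token spans, and helper names are all hypothetical:

```python
import random

# Hypothetical per-category replacement inventories.
REPLACEMENTS = {
    "NAME": ["alex morgan", "sam lee"],
    "DATE": ["march third", "july ninth"],
}

def replace_pii(tokens, spans, rng):
    """spans: list of (start, end, category) token ranges marking PII.
    Each span is replaced with a random same-category phrase."""
    out, i = [], 0
    for start, end, cat in sorted(spans):
        out.extend(tokens[i:start])                     # keep non-PII text
        out.extend(rng.choice(REPLACEMENTS[cat]).split())
        i = end
    out.extend(tokens[i:])
    return out

rng = random.Random(0)
toks = "patient john doe seen on may first".split()
new = replace_pii(toks, [(1, 3, "NAME"), (5, 7, "DATE")], rng)
print(" ".join(new))
```

Step two in the paper then synthesizes matching audio for the replacement text via TTS or audio splicing; only the label-side substitution is sketched here.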
How to Estimate Model Transferability of Pre-Trained Speech Models?
In this work, we introduce a "score-based assessment" framework for
estimating the transferability of pre-trained speech models (PSMs) for
fine-tuning on target tasks. We leverage two representation theories,
Bayesian likelihood estimation and optimal transport, to generate rank scores
for the PSM candidates using the extracted representations. Our framework
efficiently computes transferability scores without actual fine-tuning of
candidate models or layers by making a temporal independence assumption. We
evaluate some popular supervised speech models (e.g., Conformer RNN-Transducer)
and self-supervised speech models (e.g., HuBERT) in cross-layer and cross-model
settings using public data. Experimental results show a high Spearman's rank
correlation and low p-value between our estimation framework and fine-tuning
ground truth. Our proposed transferability framework requires less
computational time and resources, making it a resource-saving and
time-efficient approach for tuning speech foundation models.
Comment: Accepted to Interspeech. Code will be released
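The idea of scoring transferability from frozen representations, without any fine-tuning, can be illustrated with a much simpler proxy: a between-class over within-class variance ratio. The paper's actual scores come from Bayesian likelihood estimation and optimal transport; the function and numbers below are invented for illustration only:

```python
import statistics

def separability_score(features, labels):
    """Toy transferability proxy: between-class variance divided by
    within-class variance of frozen 1-D features. Higher = the frozen
    representation already separates the target classes."""
    by_class = {}
    for x, y in zip(features, labels):
        by_class.setdefault(y, []).append(x)
    overall = statistics.mean(features)
    between = statistics.mean(
        (statistics.mean(v) - overall) ** 2 for v in by_class.values())
    within = statistics.mean(
        statistics.pvariance(v) for v in by_class.values())
    return between / (within + 1e-9)

# Well-separated frozen features score higher than overlapping ones:
good = separability_score([0.0, 0.1, 5.0, 5.1], [0, 0, 1, 1])
bad  = separability_score([0.0, 5.0, 0.1, 5.1], [0, 0, 1, 1])
print(good, bad)
```

Ranking candidate PSMs or layers by such a score, then checking the ranking against fine-tuning ground truth with Spearman's correlation, mirrors the evaluation protocol described in the abstract.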
Damage Control During Domain Adaptation for Transducer Based Automatic Speech Recognition
Automatic speech recognition models are often adapted to improve their
accuracy in a new domain. A potential drawback of model adaptation to new
domains is catastrophic forgetting, where the Word Error Rate on the original
domain is significantly degraded. This paper addresses the situation when we
want to simultaneously adapt automatic speech recognition models to a new
domain and limit the degradation of accuracy on the original domain without
access to the original training dataset. We propose several techniques such as
a limited training strategy and regularized adapter modules for the Transducer
encoder, prediction, and joiner network. We apply these methods to the Google
Speech Commands and to the UK and Ireland English Dialect speech data set and
obtain strong results on the new target domain while limiting the degradation
on the original domain.
Comment: To appear in Proc. SLT 2022, Jan 09-12, 2023, Doha, Qatar
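One way to read "regularized adapter modules" is a residual adapter whose weights are penalized toward the identity mapping, so the adapted network stays close to the original model and forgetting is limited. The 1-D toy below is an invented illustration of that idea, not the paper's implementation:

```python
def adapter_output(x, w):
    # Residual adapter: identity path plus a small learned correction.
    return x + w * x

def regularized_loss(task_loss, w, reg_strength=0.1):
    # L2-penalize deviation from w = 0, i.e. from the identity adapter,
    # which leaves the original model's behavior unchanged.
    return task_loss + reg_strength * w * w

# With w = 0 the adapter is exactly the original model:
print(adapter_output(2.0, 0.0))    # 2.0
print(regularized_loss(1.0, 0.5))
```

In the paper such adapters are inserted into the Transducer encoder, prediction, and joiner networks; here a scalar stands in for those layers.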
End-to-end Lip-reading: A Preliminary Study
Deep lip-reading combines the domains of computer vision and natural language processing: it uses deep neural networks to extract speech from silent videos. Most work in lip-reading uses a multi-stage training approach due to the complex nature of the task. A single-stage, end-to-end, unified training approach, which is an ideal of machine learning, is also the goal in lip-reading. However, pure end-to-end systems have not yet been able to perform as well as non-end-to-end systems; the very recent Temporal Convolutional Network (TCN) based architectures are an exception. This work lays out a preliminary study of deep lip-reading, with a special focus on various end-to-end approaches. The research aims to test whether a purely end-to-end approach is justifiable for a task as complex as deep lip-reading. To achieve this, the meaning of pure end-to-end is first defined, and several lip-reading systems that follow the definition are analysed. The system that most closely matches the definition is then adapted for pure end-to-end experiments. Four main contributions are made: i) an analysis of 9 different end-to-end deep lip-reading systems; ii) creation and public release of a pipeline to adapt the sentence-level Lip Reading Sentences 3 (LRS3) dataset into word level; iii) pure end-to-end training of a TCN-based network and evaluation on the LRS3 word-level dataset as a proof of concept; iv) a public online portal to analyse visemes and experiment with live end-to-end lip-reading inference. The study verifies that pure end-to-end is a sensible approach and an achievable goal for deep machine lip-reading.
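The TCN architectures mentioned above are built from causal (optionally dilated) 1-D convolutions. A minimal single-channel sketch, with invented toy signals and kernels rather than the studied networks:

```python
def temporal_conv(x, kernel, dilation=1):
    """Causal dilated 1-D convolution: output at time t depends only on
    inputs at t, t - dilation, t - 2*dilation, ... (never the future)."""
    out = []
    for t in range(len(x)):
        s = 0.0
        for i, w in enumerate(kernel):
            j = t - i * dilation
            if j >= 0:          # positions before the signal are skipped
                s += w * x[j]
        out.append(s)
    return out

sig = [1.0, 2.0, 3.0, 4.0]
print(temporal_conv(sig, [0.5, 0.5]))      # causal moving average
print(temporal_conv(sig, [1.0, -1.0], 2))  # dilated difference
```

Stacking such layers with growing dilation gives a receptive field that covers long video sequences, which is why TCNs suit frame-level lip-reading.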