A Density Ratio Approach to Language Model Fusion in End-To-End Automatic Speech Recognition
This article describes a density ratio approach to integrating external
Language Models (LMs) into end-to-end models for Automatic Speech Recognition
(ASR). Applied to a Recurrent Neural Network Transducer (RNN-T) ASR model
trained on a given domain, a matched in-domain RNN-LM, and a target domain
RNN-LM, the proposed method uses Bayes' Rule to define RNN-T posteriors for the
target domain, in a manner directly analogous to the classic hybrid model for
ASR based on Deep Neural Networks (DNNs) or LSTMs in the Hidden Markov Model
(HMM) framework (Bourlard & Morgan, 1994). The proposed approach is evaluated
in cross-domain and limited-data scenarios, for which a significant amount of
target domain text data is used for LM training, but only limited (or no)
{audio, transcript} training data pairs are used to train the RNN-T.
Specifically, an RNN-T model trained on paired audio & transcript data from
YouTube is evaluated for its ability to generalize to Voice Search data. The
Density Ratio method was found to consistently outperform the dominant approach
to LM and end-to-end ASR integration, Shallow Fusion.
Comment: 8 pages, 4 figures, presented at 2019 IEEE Automatic Speech
Recognition and Understanding Workshop (ASRU 2019)
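To make the scoring rule concrete, below is a minimal Python sketch of the
per-hypothesis decoding score the abstract describes: the source-domain LM
score is divided out of the E2E posterior (subtracted in log space) and the
target-domain LM score is multiplied in. The function name, argument names,
and default weights are illustrative assumptions, not details from the paper.

def density_ratio_score(log_p_rnnt, log_p_source_lm, log_p_target_lm,
                        lambda_source=0.5, lambda_target=0.5):
    """Log-domain density ratio fusion score for one candidate hypothesis.

    Per Bayes' rule, the target-domain posterior is proportional to the
    source-domain RNN-T posterior times the ratio of target-domain to
    source-domain LM probabilities. The interpolation weights are tuning
    hyperparameters; the values here are placeholders.
    """
    return (log_p_rnnt
            - lambda_source * log_p_source_lm
            + lambda_target * log_p_target_lm)

In practice such a score would be applied incrementally to each partial
hypothesis during beam search rather than once per full transcript.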
Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition
Integrating an external language model (LM) remains a challenging task for
end-to-end (E2E) automatic speech recognition (ASR), which has no clear
division between acoustic and language models. In this work, we propose an
internal LM estimation (ILME) method to facilitate a more effective
integration of the external LM with pre-existing E2E models, including the
most popular recurrent neural network transducer (RNN-T) and attention-based
encoder-decoder (AED) models, without any additional model training. Trained
with audio-transcript
pairs, an E2E model implicitly learns an internal LM that characterizes the
training data in the source domain. With ILME, the internal LM scores of an E2E
model are estimated and subtracted from the log-linear interpolation between
the scores of the E2E model and the external LM. The internal LM scores are
approximated as the output of an E2E model when eliminating its acoustic
components. ILME can alleviate the domain mismatch between training and
testing, or improve multi-domain E2E ASR. In experiments with RNN-T and AED
models trained on 30K hours of audio, ILME achieves up to 15.5% and 6.8%
relative word error rate reductions over Shallow Fusion on out-of-domain
LibriSpeech and in-domain Microsoft production test sets, respectively.
Comment: 8 pages, 2 figures, SLT 2021
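A minimal Python sketch of the resulting decoding score may help. The names
and default weights below are illustrative assumptions; the internal LM term
stands for whatever score the E2E model assigns when its acoustic (encoder)
contribution is eliminated, as the abstract describes.

def ilme_fusion_score(log_p_e2e, log_p_ext_lm, log_p_internal_lm,
                      lambda_ext=0.5, lambda_ilm=0.3):
    """ILME-based fusion score for one candidate hypothesis.

    Starts from standard shallow fusion (a log-linear interpolation of
    the E2E and external LM scores) and subtracts the estimated internal
    LM score, approximated by the E2E model's output with its acoustic
    components zeroed out. Weights here are placeholder values.
    """
    return (log_p_e2e
            + lambda_ext * log_p_ext_lm
            - lambda_ilm * log_p_internal_lm)

Compared with the density ratio method above, ILME needs no separately
trained source-domain LM: the source-domain term is estimated from the E2E
model itself.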