Contextualizing ASR Lattice Rescoring with Hybrid Pointer Network Language Model
Videos uploaded to social media are often accompanied by textual
descriptions. In building automatic speech recognition (ASR) systems for
videos, we can exploit the contextual information provided by such video
metadata. In this paper, we explore ASR lattice rescoring that selectively
attends to the video descriptions. First, we use an attention-based method to
extract contextual vector representations of the video metadata, and feed
these representations as additional inputs to a neural language model during
lattice rescoring. Second, we propose a hybrid pointer network approach that
explicitly interpolates the language model's word probabilities with a pointer
distribution over word occurrences in the metadata. We perform experimental
evaluations on both language modeling and ASR tasks, and demonstrate that both
proposed methods provide performance improvements by selectively leveraging
the video metadata.
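As a rough illustration of the pointer-style interpolation described above, here is a minimal PyTorch sketch; the module name, the dot-product attention, and the sigmoid gate are our own assumptions for illustration, not the paper's exact architecture:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HybridPointerHead(nn.Module):
        # Sketch of pointer-style interpolation: mix the LM's vocabulary
        # distribution with a copy distribution over metadata tokens.
        def __init__(self, hidden_dim, vocab_size):
            super().__init__()
            self.vocab_proj = nn.Linear(hidden_dim, vocab_size)
            self.gate_proj = nn.Linear(hidden_dim, 1)

        def forward(self, hidden, meta_keys, meta_token_ids):
            # hidden: (B, H) LM state; meta_keys: (B, M, H) encoded metadata
            # tokens; meta_token_ids: (B, M) their vocabulary ids.
            p_vocab = F.softmax(self.vocab_proj(hidden), dim=-1)        # (B, V)

            # Dot-product attention of the LM state over metadata tokens.
            scores = torch.bmm(meta_keys, hidden.unsqueeze(-1)).squeeze(-1)
            copy_attn = F.softmax(scores, dim=-1)                       # (B, M)

            # Scatter attention mass onto vocabulary ids; repeated metadata
            # words accumulate copy probability.
            p_copy = torch.zeros_like(p_vocab)
            p_copy.scatter_add_(1, meta_token_ids, copy_attn)

            # Learned gate decides how much to copy vs. generate.
            g = torch.sigmoid(self.gate_proj(hidden))                   # (B, 1)
            return g * p_vocab + (1.0 - g) * p_copy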
Contextual RNN-T For Open Domain ASR
End-to-end (E2E) systems for automatic speech recognition (ASR), such as the
RNN Transducer (RNN-T) and Listen-Attend-Spell (LAS), blend the individual
components of a traditional hybrid ASR system (acoustic model, language model,
and pronunciation model) into a single neural network. While this has some
nice advantages, it means the system can be trained only on paired audio and
text. As a result, E2E models tend to have difficulty correctly recognizing
rare words that are seen infrequently during training, such as entity names.
In this paper, we propose modifications to the RNN-T model that allow it to
utilize additional metadata text, with the objective of improving performance
on these named-entity words. We evaluate our approach on an in-house dataset
sampled from de-identified public social media videos, which represents an
open-domain ASR task. By using an attention model and a biasing model to
leverage the contextual metadata that accompanies a video, we observe a
relative improvement of about 16% in Word Error Rate on Named Entities
(WER-NE) for videos with related metadata.
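To make the biasing idea concrete, the sketch below shows an RNN-T joiner augmented with attention over embedded metadata tokens; the ContextualJoiner name, the dot-product attention, and the joiner wiring are illustrative assumptions, not the paper's published architecture:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ContextualJoiner(nn.Module):
        # Sketch: RNN-T joiner with attention over embedded metadata tokens
        # (the "biasing" context); the wiring is illustrative only.
        def __init__(self, enc_dim, pred_dim, ctx_dim, joint_dim, vocab_size):
            super().__init__()
            self.query = nn.Linear(pred_dim, ctx_dim)
            self.joint = nn.Linear(enc_dim + pred_dim + ctx_dim, joint_dim)
            self.out = nn.Linear(joint_dim, vocab_size)

        def forward(self, enc, pred, ctx):
            # enc: (B, T, E) acoustic frames; pred: (B, U, P) prediction-net
            # states; ctx: (B, M, C) embedded metadata tokens.
            q = self.query(pred)                                         # (B, U, C)
            attn = F.softmax(torch.bmm(q, ctx.transpose(1, 2)), dim=-1)  # (B, U, M)
            bias = torch.bmm(attn, ctx)                                  # (B, U, C)

            # Combine encoder (time) and prediction (label) axes by broadcast.
            enc = enc.unsqueeze(2)                                       # (B, T, 1, E)
            pb = torch.cat([pred, bias], dim=-1).unsqueeze(1)            # (B, 1, U, P+C)
            joint_in = torch.cat(
                [enc.expand(-1, -1, pb.size(2), -1),
                 pb.expand(-1, enc.size(1), -1, -1)],
                dim=-1,
            )
            return self.out(torch.tanh(self.joint(joint_in)))            # (B, T, U, V)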
Benchmarking LF-MMI, CTC and RNN-T Criteria for Streaming ASR
In this work, to measure the accuracy and efficiency of a latency-controlled
streaming automatic speech recognition (ASR) application, we perform
comprehensive evaluations of three popular training criteria: LF-MMI, CTC, and
RNN-T. Transcribing social media videos in 7 languages with 3K-14K hours of
training data, we conduct large-scale controlled experiments on each criterion
using identical datasets and encoder model architectures. We find that RNN-T
consistently wins in ASR accuracy, while CTC models excel in inference
efficiency. Moreover, we selectively examine various modeling strategies for
the different training criteria, including modeling units, encoder
architectures, pre-training, and more. To the best of our knowledge, this is
the first comprehensive benchmark of these three widely used training criteria
on such a large-scale, real-world streaming ASR application across many
languages.
Comment: Accepted for publication at IEEE Spoken Language Technology Workshop
(SLT), 202
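For reference, two of the three criteria are directly available in the PyTorch ecosystem; the snippet below is a minimal usage sketch, where the tensor shapes and blank index are assumptions for illustration (LF-MMI typically requires a dedicated toolkit such as Kaldi or k2 and is omitted here):

    import torch
    import torchaudio.functional as TAF

    B, T, U, C = 4, 50, 10, 32  # batch, frames, target length, classes

    # CTC expects log-probs shaped (T, B, C); blank id 0 by convention here.
    ctc = torch.nn.CTCLoss(blank=0)
    log_probs = torch.randn(T, B, C).log_softmax(-1)
    targets = torch.randint(1, C, (B, U))
    frame_lens = torch.full((B,), T, dtype=torch.long)
    target_lens = torch.full((B,), U, dtype=torch.long)
    ctc_loss = ctc(log_probs, targets, frame_lens, target_lens)

    # RNN-T expects joiner logits shaped (B, T, U + 1, C) and int32 lengths.
    logits = torch.randn(B, T, U + 1, C)
    rnnt_loss = TAF.rnnt_loss(
        logits, targets.int(), frame_lens.int(), target_lens.int(), blank=0
    )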