A Comparison of Semi-Supervised Learning Techniques for Streaming ASR at Scale
Unpaired text and audio injection have emerged as dominant methods for
improving ASR performance in the absence of a large labeled corpus. However,
little guidance exists on deploying these methods to improve production ASR
systems that are trained on very large supervised corpora and with realistic
requirements like a constrained model size and CPU budget, streaming
capability, and a rich lattice for rescoring and for downstream NLU tasks. In
this work, we compare three state-of-the-art semi-supervised methods
encompassing both unpaired text and audio as well as several of their
combinations in a controlled setting using joint training. We find that in our
setting these methods offer many improvements beyond raw WER, including
substantial gains in tail-word WER, decoder computation during inference, and
lattice density.
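A minimal sketch of what such joint training can look like, assuming three weighted loss terms: a supervised loss on paired data, a text-injection loss on unpaired text, and a pseudo-labeling loss on unpaired audio. The stub losses and the weights `lambda_text` and `lambda_audio` are illustrative assumptions, not the paper's recipe.

```python
# Illustrative joint objective; the loss stubs stand in for real model losses.

def supervised_loss(paired_batch):
    """Supervised ASR loss (e.g., RNN-T) on (audio, transcript) pairs."""
    return 0.0  # stub

def text_injection_loss(text_batch):
    """Loss on unpaired text (e.g., training the decoder / internal LM)."""
    return 0.0  # stub

def pseudo_label_loss(audio_batch):
    """Loss on unpaired audio against teacher-generated pseudo-labels."""
    return 0.0  # stub

def joint_loss(paired, text, audio, lambda_text=0.1, lambda_audio=0.3):
    """One objective over all three data sources, optimized jointly."""
    return (supervised_loss(paired)
            + lambda_text * text_injection_loss(text)
            + lambda_audio * pseudo_label_loss(audio))
```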
E2E Segmentation in a Two-Pass Cascaded Encoder ASR Model
We explore unifying a neural segmenter with two-pass cascaded encoder ASR
into a single model. A key challenge is allowing the segmenter (which runs in
real-time, synchronously with the decoder) to finalize the 2nd pass (which runs
900 ms behind real-time) without introducing user-perceived latency or deletion
errors during inference. We propose a design where the neural segmenter is
integrated with the causal 1st pass decoder to emit an end-of-segment (EOS)
signal in real-time. The EOS signal is then used to finalize the non-causal 2nd
pass. We experiment with different ways to finalize the 2nd pass, and find that
a novel dummy frame injection strategy allows for simultaneous high quality 2nd
pass results and low finalization latency. On a real-world long-form captioning
task (YouTube), we achieve a 2.4% relative WER improvement and a 140 ms reduction in EOS latency over a baseline VAD-based segmenter with the same cascaded encoder.
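A rough sketch of the dummy-frame injection idea, assuming the non-causal 2nd-pass encoder expects a fixed window of right context (~900 ms): on EOS, zero-valued dummy frames stand in for future audio so the 2nd pass can finalize immediately. The frame rate, feature dimension, and zero padding here are assumptions, not the paper's exact design.

```python
import numpy as np

FRAME_MS = 10       # assumed feature frame rate
RIGHT_CONTEXT = 90  # ~900 ms of right context the non-causal encoder expects

def finalize_segment(buffered_frames: np.ndarray) -> np.ndarray:
    """On an EOS signal from the causal 1st pass, append dummy (zero) frames
    in place of future audio so the non-causal 2nd pass has the right context
    it expects and can finalize immediately, instead of stalling ~900 ms or
    consuming audio that belongs to the next segment."""
    dummy = np.zeros((RIGHT_CONTEXT, buffered_frames.shape[1]),
                     dtype=buffered_frames.dtype)
    return np.concatenate([buffered_frames, dummy], axis=0)
```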
Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition
Language model fusion helps smart assistants recognize words which are rare
in acoustic data but abundant in text-only corpora (typed search logs).
However, such corpora have properties that hinder downstream performance,
including being (1) too large, (2) beset with domain-mismatched content, and
(3) heavy-headed rather than heavy-tailed (excessively many duplicate search
queries such as "weather"). We show that three simple strategies for selecting
language modeling data can dramatically improve rare-word recognition without
harming overall performance. First, to address the heavy-headedness, we
downsample the data according to a soft log function, which tunably reduces
high-frequency (head) sentences. Second, to encourage rare-word exposure, we
explicitly filter for words rare in the acoustic data. Finally, we tackle
domain-mismatch via perplexity-based contrastive selection, filtering for
examples matched to the target domain. We down-select a large corpus of web
search queries by a factor of 53 and achieve better LM perplexity than
without down-selection. When shallow-fused with a state-of-the-art, production
speech engine, our LM achieves WER reductions of up to 24% relative on
rare-word sentences (without changing overall WER) compared to a baseline LM
trained on the raw corpus. These gains are further validated through favorable
side-by-side evaluations on live voice search traffic.
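A compact sketch of the three selection strategies, assuming Moore-Lewis-style contrastive scoring for the last step. The soft-log form, the rarity threshold, and the keep fraction (about 1/53 here, echoing the 53x down-selection) are illustrative parameters, not the paper's tuned values.

```python
import math
from collections import Counter

def soft_log_downsample(sentence_counts: Counter, scale: float = 1.0) -> Counter:
    """(1) Head reduction: keep roughly log-many copies of each duplicated
    sentence; `scale` tunes how aggressively the head is flattened."""
    return Counter({s: max(1, round(scale * math.log1p(c)))
                    for s, c in sentence_counts.items()})

def rare_word_filter(sentences, acoustic_counts, max_count: int = 5):
    """(2) Rare-word exposure: keep sentences containing at least one word
    that appears at most `max_count` times in the acoustic training data."""
    return [s for s in sentences
            if any(acoustic_counts.get(w, 0) <= max_count for w in s.split())]

def contrastive_select(sentences, logp_in_domain, logp_background,
                       keep_frac: float = 1 / 53):
    """(3) Domain match: rank by in-domain minus background log-probability
    and keep the top fraction. The LM scorers are assumed callables."""
    ranked = sorted(sentences,
                    key=lambda s: logp_in_domain(s) - logp_background(s),
                    reverse=True)
    return ranked[:max(1, int(keep_frac * len(ranked)))]
```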
E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR
Improving the performance of end-to-end ASR models on long utterances ranging
from minutes to hours in length is an ongoing challenge in speech recognition.
A common solution is to segment the audio in advance using a separate voice
activity detector (VAD) that decides segment boundary locations based purely on
acoustic speech/non-speech information. VAD segmenters, however, may be
sub-optimal for real-world speech where, e.g., a complete sentence that should
be taken as a whole may contain hesitations in the middle ("set an alarm for...
5 o'clock").
We propose to replace the VAD with an end-to-end ASR model capable of
predicting segment boundaries in a streaming fashion, allowing the segmentation
decision to be conditioned not only on better acoustic features but also on
semantic features from the decoded text with negligible extra computation. In
experiments on real world long-form audio (YouTube) with lengths of up to 30
minutes, we demonstrate 8.5% relative WER improvement and 250 ms reduction in
median end-of-segment latency compared to the VAD segmenter baseline on a
state-of-the-art Conformer RNN-T model.
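A minimal sketch of the streaming segmentation loop, assuming the ASR model emits a special end-of-segment token alongside ordinary tokens; the `asr_step` callable and the token name are hypothetical stand-ins for the model's decode step.

```python
def stream_decode(frames, asr_step, eos_token="<eos>"):
    """Streaming decode where the ASR model itself emits an end-of-segment
    token, so segmentation can use semantic as well as acoustic evidence.
    `asr_step` maps (state, frame) -> (state, emitted_tokens)."""
    state, segment, segments = None, [], []
    for frame in frames:
        state, tokens = asr_step(state, frame)
        for tok in tokens:
            if tok == eos_token:
                segments.append(" ".join(segment))  # finalize this segment
                segment, state = [], None           # reset for the next one
            else:
                segment.append(tok)
    if segment:
        segments.append(" ".join(segment))
    return segments
```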