Automatic speech recognition (ASR) systems typically rely on an external
endpointer (EP) model to identify speech boundaries. In this work, we propose a
method to jointly train the ASR and EP tasks in a single end-to-end (E2E)
multitask model, improving EP quality by optionally leveraging information from
the ASR audio encoder. We introduce a "switch" connection, which trains the EP
to consume either the audio frames directly or low-level latent representations
from the ASR model. This results in a single E2E model that can be used during
inference both to perform frame filtering at low cost and to make high-quality
end-of-query (EOQ) predictions based on ongoing ASR computation. We present
results on a voice search test set showing that, compared to separate
single-task models, this approach reduces median endpoint latency by 120 ms
(30.8% reduction), and 90th percentile latency by 170 ms (23.0% reduction),
without regressing word error rate (WER). For continuous recognition, WER
improves by 10.6% (relative).

Comment: To be published in Spoken Language Technology Workshop (SLT) 202
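
As a rough illustration of the "switch" connection described above, the following minimal sketch selects whether the endpointer consumes the raw audio frames or low-level latent representations from the ASR encoder. It is not the implementation used in this work. The names `endpointer_input` and `toy_low_level_encoder`, the random sampling of the switch position during training, and the probability `p_latents=0.5` are all assumptions made for illustration. The sketch also assumes both inputs share the same feature dimension.

```python
import random
from typing import Callable, List, Optional

Frames = List[List[float]]  # one feature vector per time step


def toy_low_level_encoder(audio_frames: Frames) -> Frames:
    """Hypothetical stand-in for the first layers of the ASR audio encoder."""
    return [[2.0 * x for x in frame] for frame in audio_frames]


def endpointer_input(
    audio_frames: Frames,
    encoder: Callable[[Frames], Frames],
    use_latents: Optional[bool] = None,
    p_latents: float = 0.5,  # assumed value, not taken from the paper
    rng: Optional[random.Random] = None,
) -> Frames:
    """Select what the endpointer consumes.

    use_latents=None: training mode; the switch position is sampled at
    random (an assumption about how the EP is exposed to both inputs).
    use_latents=True/False: inference mode; the caller fixes the position.
    """
    if use_latents is None:
        rng = rng or random.Random()
        use_latents = rng.random() < p_latents
    # Run the encoder only when its output is needed, so the audio-only
    # path stays cheap.
    return encoder(audio_frames) if use_latents else audio_frames


frames = [[1.0, 2.0], [3.0, 4.0]]
cheap_path = endpointer_input(frames, toy_low_level_encoder, use_latents=False)
latent_path = endpointer_input(frames, toy_low_level_encoder, use_latents=True)
```

Under these assumptions, the audio-only position would correspond to low-cost frame filtering, and the latent position to EOQ prediction that reuses ongoing ASR computation.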