Segmentation for continuous Automatic Speech Recognition (ASR) has
traditionally relied on silence timeouts or voice activity detectors (VADs),
both of which are limited to acoustic features. This segmentation is often
overly aggressive, given that people naturally pause to think as they speak.
Consequently, segmentation happens mid-sentence, hindering both punctuation and
downstream tasks like machine translation for which high-quality segmentation
is critical. Model-based segmentation methods that leverage acoustic features
are powerful, but without an understanding of the language itself, these
approaches are limited. We present a hybrid approach that leverages both
acoustic and language information to improve segmentation. Furthermore, we show
that including one word as a look-ahead boosts segmentation quality. On
average, our models improve the segmentation F0.5 score by 9.8% over the baseline. We
show that this approach works for multiple languages. For the downstream task
of machine translation, it improves the BLEU score by an average of
1.05 points.