Recognizing a word shortly after it is spoken is an important requirement for
automatic speech recognition (ASR) systems in real-world scenarios. As a
result, a large body of work on streaming audio-only ASR models has been
presented in the literature. However, streaming audio-visual automatic speech
recognition (AV-ASR) has received comparatively little attention. In this
work, we propose a streaming AV-ASR system based on a hybrid connectionist
temporal classification (CTC)/attention neural network architecture. The audio
and visual encoders are both based on the Conformer architecture, which is made
streamable using chunk-wise self-attention (CSA) and causal convolution.
Streaming recognition with the attention decoder is realized using the
triggered attention technique, which performs
time-synchronous decoding with joint CTC/attention scoring. For frame-level ASR
criteria, such as CTC, a synchronized response from the audio and visual
encoders is critical for a joint AV decision-making process. We therefore
propose a novel alignment regularization technique that promotes
synchronization of the audio and visual encoders, which in turn results in
better word error rates (WERs) at all signal-to-noise ratio (SNR) levels for
streaming and offline
AV-ASR models. The proposed AV-ASR model achieves WERs of 2.0% and 2.6% on the
Lip Reading Sentences 3 (LRS3) dataset in an offline and online setup,
respectively, both of which are state-of-the-art results when no external
training data are used.
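
A minimal sketch of the two modifications that make the Conformer encoders streamable, namely chunk-wise self-attention and causal convolution, is given below. It is an illustration in PyTorch under assumed settings (chunk size, amount of left context, module structure), not the configuration used in the paper.

# Minimal sketch (not the authors' implementation) of the two ingredients that
# make a Conformer block streamable: a chunk-wise self-attention mask and a
# causal depthwise convolution. Chunk size and history length are illustrative.

import torch
import torch.nn as nn


def chunkwise_attention_mask(num_frames: int, chunk_size: int,
                             left_chunks: int = 1) -> torch.Tensor:
    """Boolean mask where entry (i, j) is True if frame i may attend to frame j.

    Each frame attends to every frame in its own chunk and to `left_chunks`
    preceding chunks, so the look-ahead is bounded by the chunk size.
    """
    chunk_idx = torch.arange(num_frames) // chunk_size      # chunk id per frame
    diff = chunk_idx.unsqueeze(1) - chunk_idx.unsqueeze(0)  # (i_chunk - j_chunk)
    return (diff >= 0) & (diff <= left_chunks)


class CausalDepthwiseConv1d(nn.Module):
    """Depthwise 1-D convolution that only looks at past frames (causal padding)."""

    def __init__(self, channels: int, kernel_size: int):
        super().__init__()
        self.pad = kernel_size - 1                           # pad on the left only
        self.conv = nn.Conv1d(channels, channels, kernel_size, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels) -> (batch, channels, time) for Conv1d
        x = x.transpose(1, 2)
        x = nn.functional.pad(x, (self.pad, 0))              # left padding => causal
        return self.conv(x).transpose(1, 2)


if __name__ == "__main__":
    mask = chunkwise_attention_mask(num_frames=8, chunk_size=2, left_chunks=1)
    print(mask.int())   # frames see their own chunk plus one chunk of history

    x = torch.randn(1, 8, 16)                                # (batch, time, channels)
    y = CausalDepthwiseConv1d(channels=16, kernel_size=3)(x)
    print(y.shape)                                           # torch.Size([1, 8, 16])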
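
The abstract does not specify the exact form of the alignment regularization. Purely as an illustration of how synchronization between the two encoders could be encouraged, the hypothetical loss below penalizes the symmetric KL divergence between the frame-wise CTC posteriors of the audio and visual branches; it should not be read as the authors' formulation.

# Hypothetical stand-in for an alignment regularizer: a frame-level consistency
# loss between the per-frame CTC posteriors of the audio and visual encoders.
# Assumes both encoders produce frame-synchronous outputs at the same frame rate.

import torch
import torch.nn.functional as F


def frame_alignment_loss(audio_logits: torch.Tensor,
                         visual_logits: torch.Tensor) -> torch.Tensor:
    """Symmetric KL divergence between per-frame CTC posteriors.

    Both tensors have shape (batch, time, vocab).
    """
    log_p_a = F.log_softmax(audio_logits, dim=-1)
    log_p_v = F.log_softmax(visual_logits, dim=-1)
    kl_av = F.kl_div(log_p_v, log_p_a, reduction="batchmean", log_target=True)
    kl_va = F.kl_div(log_p_a, log_p_v, reduction="batchmean", log_target=True)
    return 0.5 * (kl_av + kl_va)


if __name__ == "__main__":
    a = torch.randn(2, 50, 30)   # (batch, frames, vocab) from the audio branch
    v = torch.randn(2, 50, 30)   # same shape from the visual branch
    print(frame_alignment_loss(a, v))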