Achieving high accuracy with low latency has always been a challenge in
streaming end-to-end automatic speech recognition (ASR) systems. By attending
to more future contexts, a streaming ASR model achieves higher accuracy but
results in larger latency, which hurts the streaming performance. In the
Mask-CTC framework, an encoder network is trained to learn the feature
representation that anticipates long-term contexts, which is desirable for
streaming ASR. Mask-CTC-based encoder pre-training has been shown beneficial in
achieving low latency and high accuracy for triggered attention-based ASR.
However, the effectiveness of this method has not been demonstrated for various
model architectures, nor has it been verified that the encoder has the expected
look-ahead capability to reduce latency. This study, therefore, examines the
effectiveness of Mask-CTCbased pre-training for models with different
architectures, such as Transformer-Transducer and contextual block streaming
ASR. We also discuss the effect of the proposed pre-training method on
obtaining accurate output spike timing.Comment: Accepted to EUSIPCO 202