Mask-CTC-based Encoder Pre-training for Streaming End-to-End Speech Recognition

Higuchi, Yosuke; Kida, Yusuke; Kobayashi, Tetsunori; Ogawa, Tetsuji; Zhao, Huaibo

Authors: Yosuke Higuchi, Yusuke Kida, Tetsunori Kobayashi, Tetsuji Ogawa, Huaibo Zhao
Publication date: 8 September 2023

Abstract

Achieving high accuracy with low latency has always been a challenge in streaming end-to-end automatic speech recognition (ASR) systems. By attending to more future context, a streaming ASR model achieves higher accuracy but incurs higher latency, which hurts streaming performance. In the Mask-CTC framework, an encoder network is trained to learn a feature representation that anticipates long-term context, which is desirable for streaming ASR. Mask-CTC-based encoder pre-training has been shown to be beneficial in achieving low latency and high accuracy for triggered attention-based ASR. However, the effectiveness of this method has not been demonstrated for various model architectures, nor has it been verified that the encoder has the expected look-ahead capability to reduce latency. This study therefore examines the effectiveness of Mask-CTC-based pre-training for models with different architectures, such as Transformer-Transducer and contextual block streaming ASR. We also discuss the effect of the proposed pre-training method on obtaining accurate output spike timing.

Comment: Accepted to EUSIPCO 202

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2309.04654

Last time updated on 06/10/2023