Mask-CTC-based Encoder Pre-training for Streaming End-to-End Speech Recognition
Achieving high accuracy with low latency has always been a challenge in
streaming end-to-end automatic speech recognition (ASR) systems. By attending
to more future context, a streaming ASR model achieves higher accuracy but
incurs larger latency, which degrades streaming performance. In the
Mask-CTC framework, an encoder network is trained to learn feature
representations that anticipate long-term context, which is desirable for
streaming ASR. Mask-CTC-based encoder pre-training has been shown to be
beneficial for achieving low latency and high accuracy in triggered
attention-based ASR.
However, the effectiveness of this method has not been demonstrated for various
model architectures, nor has it been verified that the encoder has the expected
look-ahead capability to reduce latency. This study therefore examines the
effectiveness of Mask-CTC-based pre-training for models with different
architectures, such as the Transformer-Transducer and contextual block
streaming ASR. We also discuss the effect of the proposed pre-training method
on obtaining accurate output spike timing.
Comment: Accepted to EUSIPCO 202
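To make the pre-training idea concrete, the sketch below shows the general shape of a Mask-CTC-style objective: a CTC branch over the encoder output is combined with a masked-token prediction branch, which pushes the encoder to represent context beyond the current frame. All module choices, layer sizes, and the masking ratio are illustrative assumptions, not the authors' exact recipe.

```python
import torch
import torch.nn as nn

class MaskCTCPretrainer(nn.Module):
    """Joint CTC + masked-token-prediction objective (a sketch;
    real Mask-CTC recipes differ in detail)."""

    def __init__(self, encoder, vocab_size, d_model=256, mask_id=0, blank_id=1):
        super().__init__()
        self.encoder = encoder                    # must output (B, T, d_model)
        self.ctc_head = nn.Linear(d_model, vocab_size)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.mlm_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.mlm_head = nn.Linear(d_model, vocab_size)
        self.mask_id = mask_id
        self.ctc_loss = nn.CTCLoss(blank=blank_id, zero_infinity=True)

    def forward(self, feats, feat_lens, tokens, token_lens, mask_ratio=0.3):
        enc = self.encoder(feats)                 # (B, T, d_model)

        # CTC branch: alignment-free loss on the encoder outputs.
        log_probs = self.ctc_head(enc).log_softmax(-1).transpose(0, 1)
        loss_ctc = self.ctc_loss(log_probs, tokens, feat_lens, token_lens)

        # MLM branch: randomly mask ground-truth tokens and predict them
        # from the surviving tokens plus the encoder output. (A real
        # recipe would avoid masking padding positions.)
        mask = torch.rand(tokens.shape, device=tokens.device) < mask_ratio
        masked = tokens.masked_fill(mask, self.mask_id)
        dec = self.mlm_decoder(self.embed(masked), enc)
        loss_mlm = (nn.functional.cross_entropy(self.mlm_head(dec)[mask],
                                                tokens[mask])
                    if mask.any() else torch.zeros((), device=enc.device))
        return loss_ctc + loss_mlm
```

After pre-training with such a joint objective, the encoder weights would be used to initialize a streaming model such as a Transformer-Transducer or a CBS encoder.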
Conversation-oriented ASR with multi-look-ahead CBS architecture
During conversations, humans can infer the speaker's intention at any point in
the speech and promptly prepare their next action. Such an ability is also key
for conversational systems to achieve rhythmic, natural conversation. To
achieve this, the automatic speech recognition (ASR) system that transcribes
the speech in real time must deliver high accuracy without delay. In streaming
ASR, high accuracy is ensured by attending to look-ahead frames, which
increases delay. To tackle this trade-off, we propose a multi-latency
streaming ASR system that achieves high accuracy with zero look-ahead. The
proposed system contains two encoders that operate in parallel: a primary
encoder generates accurate outputs using look-ahead frames, and an auxiliary
encoder recognizes the look-ahead portion of the primary encoder without
look-ahead. The system is built on the contextual block streaming (CBS)
architecture, which leverages block processing and has a high affinity with
the multi-latency design. We also study various methods for architecting the
system, including shifting the network to act as the different encoders and
generating both encoders' outputs in a single encoding pass.
Comment: Submitted to ICASSP202
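The parallel-encoder idea can be sketched as follows: at each step, the primary encoder's outputs are finalized only for frames whose look-ahead context has already arrived, while the auxiliary encoder immediately covers the trailing look-ahead frames with zero added latency. The function below is a toy illustration under that assumption; the actual CBS block processing, encoder sharing, and output merging in the paper are more involved.

```python
import torch

def dual_encoder_step(feats_so_far, primary, auxiliary, lookahead=8):
    """Emit outputs for all frames received so far.

    Frames whose `lookahead` future frames are already available get
    finalized outputs from the primary encoder; the most recent frames
    get provisional, zero-look-ahead outputs from the auxiliary encoder.
    """
    T = feats_so_far.size(0)
    final_end = max(T - lookahead, 0)
    # The primary encoder consumes everything received so far, but only
    # its outputs for [0, final_end) have seen full look-ahead context.
    final = primary(feats_so_far.unsqueeze(0))[0, :final_end]
    # The auxiliary encoder covers the trailing look-ahead region
    # immediately, trading some accuracy for zero added latency.
    provisional = auxiliary(feats_so_far.unsqueeze(0))[0, final_end:]
    return final, provisional

# Toy usage with identity "encoders" (real CBS encoders process blocks).
feats = torch.randn(20, 80)            # 20 frames of 80-dim features
identity = lambda x: x
final, provisional = dual_encoder_step(feats, identity, identity)
print(final.shape, provisional.shape)  # (12, 80) and (8, 80)
```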
Semi-Autoregressive Streaming ASR With Label Context
Non-autoregressive (NAR) modeling has gained significant interest in speech
processing since these models achieve dramatically lower inference time than
autoregressive (AR) models while also achieving good transcription accuracy.
Since NAR automatic speech recognition (ASR) models must wait for the
completion of the entire utterance before processing, some works explore
streaming NAR models based on blockwise attention for low-latency applications.
However, streaming NAR models significantly lag in accuracy compared to
streaming AR and non-streaming NAR models. To address this, we propose a
streaming "semi-autoregressive" ASR model that incorporates the labels emitted
in previous blocks as additional context using a Language Model (LM)
subnetwork. We also introduce a novel greedy decoding algorithm that addresses
insertion and deletion errors near block boundaries without significantly
increasing the inference time. Experiments show that our method outperforms
the existing streaming NAR model by 19% relative on Tedlium2, 16%/8% on the
Librispeech-100 clean/other test sets, and 19%/8% on the Switchboard (SWB) /
Callhome (CH) test sets. It also reduces the accuracy gap with streaming AR
and non-streaming NAR models while achieving 2.5x lower latency. We also
demonstrate that our approach can effectively utilize external text data to
pre-train the LM subnetwork, further improving streaming ASR accuracy.
Comment: Submitted to ICASSP 202
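As an illustration of the boundary problem, blockwise greedy decoding can emit the same token twice when a word straddles two blocks. The snippet below shows one simple overlap-matching heuristic for the insertion case; it is a hypothetical stand-in, not the decoding algorithm proposed in the paper (which also handles deletions).

```python
def merge_block_hypotheses(prev_tokens, cur_tokens, max_overlap=3):
    """Append a new block's greedy output to the running hypothesis,
    dropping tokens duplicated across the block boundary by matching
    the tail of `prev_tokens` against the head of `cur_tokens`."""
    for k in range(min(max_overlap, len(prev_tokens), len(cur_tokens)), 0, -1):
        if prev_tokens[-k:] == cur_tokens[:k]:
            return prev_tokens + cur_tokens[k:]  # skip the repeated prefix
    return prev_tokens + cur_tokens

# Example: "wor" emitted at the end of one block and again at the start
# of the next is kept only once in the merged hypothesis.
hyp = merge_block_hypotheses(["he", "llo", "wor"], ["wor", "ld"])
assert hyp == ["he", "llo", "wor", "ld"]
```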