Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection
Encoder-decoder models provide a generic architecture for
sequence-to-sequence tasks such as speech recognition and translation. While
offline systems are often evaluated on quality metrics such as word error rate
(WER) and BLEU, latency is also a crucial factor in many practical use cases.
We propose three latency reduction techniques for chunk-based incremental
inference and evaluate their efficiency in terms of accuracy-latency trade-off.
On the 300-hour How2 dataset, we reduce latency by 83% to 0.8 seconds at the
cost of 1% absolute WER (6% relative) compared to offline transcription. Although our
experiments use the Transformer, the hypothesis selection strategies are
applicable to other encoder-decoder models. To avoid expensive re-computation,
we use a unidirectionally-attending encoder. After adapting the model to
partial sequences, the unidirectional model performs on par with the original
model. We further show that our approach is also applicable to low-latency
speech translation. On How2 English-Portuguese speech translation, we reduce
latency to 0.7 seconds (-84% relative) while incurring a loss of 2.4 BLEU
points (5% relative) compared to the offline system.
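One family of hypothesis selection strategies for chunk-based incremental inference can be sketched as committing only the token prefix on which consecutive chunk hypotheses agree, optionally holding back the last few tokens as unstable. The function names and the `hold_n` parameter below are illustrative, not taken from the paper:

```python
def common_prefix(a, b):
    """Longest shared prefix of two token-id sequences."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[:i]

def select_partial_hypothesis(prev_hyp, curr_hyp, hold_n=0):
    """Commit tokens that agree between the hypotheses of two consecutive
    chunks; optionally hold back the last hold_n agreeing tokens, since
    tokens near the chunk boundary are the most likely to be revised."""
    stable = common_prefix(prev_hyp, curr_hyp)
    if hold_n:
        stable = stable[:max(0, len(stable) - hold_n)]
    return stable

# Toy token-id hypotheses from two consecutive chunks: the first three
# tokens agree, and one of those is held back as unstable.
prev = [7, 3, 9, 4]
curr = [7, 3, 9, 5, 2]
print(select_partial_hypothesis(prev, curr, hold_n=1))  # [7, 3]
```

Committing only stable prefixes lets the system emit partial output early (reducing latency) while limiting the risk of later retractions.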
Minimum Latency Training of Sequence Transducers for Streaming End-to-End Speech Recognition
Sequence transducers, such as the RNN-T and the Conformer-T, are among the
most promising models for end-to-end speech recognition, especially in streaming
scenarios where both latency and accuracy are important. Although various
methods, such as alignment-restricted training and FastEmit, have been studied
to reduce latency, the reduction is often accompanied by a significant
degradation in accuracy. We argue that this suboptimal performance arises
because none of the prior methods explicitly models and reduces
the latency. In this paper, we propose a new training method to explicitly
model and reduce the latency of sequence transducer models. First, we define
the expected latency at each diagonal line on the lattice, and show that its
gradient can be computed efficiently within the forward-backward algorithm.
Then we augment the transducer loss with this expected latency, so that an
optimal trade-off between latency and accuracy is achieved. Experimental
results on the WSJ dataset show that the proposed minimum latency training
reduces the latency of causal Conformer-T from 220 ms to 27 ms within a WER
degradation of 0.7%, and outperforms conventional alignment-restricted training
(110 ms) and FastEmit (67 ms) methods.
Comment: Presented at INTERSPEECH 202
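The core idea of augmenting the transducer loss with an expected-latency term can be sketched in miniature. Here `emit_post[u][t]` stands for the posterior probability (assumed already computed from the transducer forward-backward pass) that label `u` is emitted at frame `t`; the function names and the penalty weight are illustrative, not the paper's:

```python
def expected_emit_frame(emit_post):
    """Expected emission frame for each label, given per-label posterior
    emission probabilities over frames (rows sum to 1)."""
    return [sum(t * p for t, p in enumerate(row)) for row in emit_post]

def latency_augmented_loss(nll, emit_post, weight=0.01):
    """Transducer negative log-likelihood plus a weighted expected-latency
    penalty (mean expected emission frame across labels)."""
    exp_frames = expected_emit_frame(emit_post)
    return nll + weight * sum(exp_frames) / len(exp_frames)

# Toy example: label 0 is emitted at frame 1, label 1 at frame 2.
post = [[0.0, 1.0, 0.0],
        [0.0, 0.0, 1.0]]
print(expected_emit_frame(post))  # [1.0, 2.0]
```

Because the penalty is differentiable in the same posterior quantities the forward-backward algorithm already produces, its gradient can be folded into the existing training loop, trading off likelihood against earlier emissions via the weight.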
4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders
The network architecture of end-to-end (E2E) automatic speech recognition
(ASR) can be classified into several models, including connectionist temporal
classification (CTC), recurrent neural network transducer (RNN-T), attention
mechanism, and non-autoregressive mask-predict models. Since each of these
network architectures has pros and cons, a typical use case is to switch
between these separate models depending on the application requirements,
resulting in the increased overhead of maintaining all models. Several methods for integrating
two of these complementary models to mitigate the overhead issue have been
proposed; however, if we integrate more models, we will further benefit from
these complementary models and realize broader applications with a single
system. This paper proposes four-decoder joint modeling (4D) of CTC, attention,
RNN-T, and mask-predict, which has the following three advantages: 1) The four
decoders are jointly trained so that they can be easily switched depending on
the application scenarios. 2) Joint training may bring model regularization and
improve the model robustness thanks to their complementary properties. 3) Novel
one-pass joint decoding methods using CTC, attention, and RNN-T further
improve performance. The experimental results showed that the proposed model
consistently reduced the WER.
Comment: Accepted by INTERSPEECH 202
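The joint-training idea can be sketched as a weighted sum of the four per-decoder losses computed on a shared encoder output. The decoder names and equal weights below are illustrative placeholders, not the paper's actual interpolation values:

```python
def joint_4d_loss(losses, weights=None):
    """Combine per-decoder training losses (CTC, attention, RNN-T,
    mask-predict) computed on a shared encoder output into one
    training objective via a weighted sum."""
    names = ("ctc", "att", "rnnt", "mask")
    if weights is None:
        # Equal weighting as a placeholder; in practice the weights
        # would be tuned per task.
        weights = {n: 0.25 for n in names}
    return sum(weights[n] * losses[n] for n in names)

# Toy per-decoder loss values for one batch.
losses = {"ctc": 2.0, "att": 1.0, "rnnt": 2.0, "mask": 3.0}
print(joint_4d_loss(losses))  # 2.0
```

Since all four decoders share the encoder and a single backward pass covers the combined objective, each decoder remains independently usable at inference time, which is what allows switching between them per application.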