
Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection

Author: Liu Danni
Niehues Jan
Spanakis Gerasimos
Publication date: 01/01/2020

Encoder-decoder models provide a generic architecture for sequence-to-sequence tasks such as speech recognition and translation. While offline systems are often evaluated on quality metrics like word error rate (WER) and BLEU, latency is also a crucial factor in many practical use cases. We propose three latency reduction techniques for chunk-based incremental inference and evaluate their efficiency in terms of the accuracy-latency trade-off. On the 300-hour How2 dataset, we reduce latency by 83% to 0.8 seconds by sacrificing 1% WER (6% rel.) compared to offline transcription. Although our experiments use the Transformer, the hypothesis selection strategies are applicable to other encoder-decoder models. To avoid expensive re-computation, we use a unidirectionally-attending encoder. After an adaptation procedure to partial sequences, the unidirectional model performs on par with the original model. We further show that our approach is also applicable to low-latency speech translation. On How2 English-Portuguese speech translation, we reduce latency to 0.7 seconds (-84% rel.) while incurring a loss of 2.4 BLEU points (5% rel.) compared to the offline system.

arXiv.org e-Print Archive

Maastricht University Research Portal

Crossref
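The abstract above reports quality as word error rate (WER). As a reminder of what that metric measures, here is a minimal sketch, assuming the common definition (word-level edit distance divided by the number of reference words) and plain whitespace tokenization. The function name `word_error_rate` is hypothetical and is not taken from the paper, which may use a different scoring tool or text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: whitespace tokenization with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words with an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion of a reference word
                    curr[j - 1] + 1,                  # insertion of a hypothesis word
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)
```

With this definition, a hypothesis that drops one word from a six-word reference scores 1/6, and WER can exceed 1.0 when the hypothesis contains many insertions.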

State-of-the-art Speech Recognition With Sequence-to-Sequence Models

Author: Bacchiani Michiel
Chen Zhifeng
Chiu Chung-Cheng
Chorowski Jan
Gonina Ekaterina
Jaitly Navdeep
Kannan Anjuli
Li Bo
Nguyen Patrick
Prabhavalkar Rohit
Rao Kanishka
Sainath Tara N.
Weiss Ron J.
Wu Yonghui
Publication date: 23/02/2018

Attention-based encoder-decoder architectures such as Listen, Attend, and Spell (LAS) subsume the acoustic, pronunciation and language model components of a traditional automatic speech recognition (ASR) system into a single neural network. In previous work, we have shown that such architectures are comparable to state-of-the-art ASR systems on dictation tasks, but it was not clear if such architectures would be practical for more challenging tasks such as voice search. In this work, we explore a variety of structural and optimization improvements to our LAS model which significantly improve performance. On the structural side, we show that word piece models can be used instead of graphemes. We also introduce a multi-head attention architecture, which offers improvements over the commonly-used single-head attention. On the optimization side, we explore synchronous training, scheduled sampling, label smoothing, and minimum word error rate optimization, which are all shown to improve accuracy. We present results with a unidirectional LSTM encoder for streaming recognition. On a 12,500-hour voice search task, we find that the proposed changes improve the WER from 9.2% to 5.6%, while the best conventional system achieves 6.7%; on a dictation task our model achieves a WER of 4.1% compared to 5% for the conventional system.

Comment: ICASSP camera-ready version

arXiv.org e-Print Archive

Crossref
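The abstract above gives absolute WER figures only. To put them on the same footing as the relative ("rel.") numbers quoted elsewhere in this listing, here is a small worked example, assuming the usual definition of relative reduction as (baseline - improved) / baseline. The helper name `relative_reduction` is hypothetical, and the percentages in the comments are derived here from the quoted figures rather than stated in the abstract.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (baseline - improved) / baseline


# Voice search: improved LAS model (5.6%) versus the original LAS model (9.2%).
las_gain = relative_reduction(9.2, 5.6)           # about 0.39, i.e. roughly 39% relative

# Voice search: improved LAS model (5.6%) versus the best conventional system (6.7%).
conventional_gain = relative_reduction(6.7, 5.6)  # about 0.16, i.e. roughly 16% relative

# Dictation: LAS model (4.1%) versus the conventional system (5%).
dictation_gain = relative_reduction(5.0, 4.1)     # about 0.18, i.e. roughly 18% relative

print(f"{las_gain:.1%} {conventional_gain:.1%} {dictation_gain:.1%}")
```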

Minimum Latency Training of Sequence Transducers for Streaming End-to-End Speech Recognition

Author: Shinohara Yusuke
Watanabe Shinji
Publication date: 04/11/2022

Sequence transducers, such as the RNN-T and the Conformer-T, are among the most promising models for end-to-end speech recognition, especially in streaming scenarios where both latency and accuracy are important. Although various methods, such as alignment-restricted training and FastEmit, have been studied to reduce the latency, latency reduction is often accompanied by a significant degradation in accuracy. We argue that this suboptimal performance might arise because none of the prior methods explicitly model and reduce the latency. In this paper, we propose a new training method to explicitly model and reduce the latency of sequence transducer models. First, we define the expected latency at each diagonal line on the lattice, and show that its gradient can be computed efficiently within the forward-backward algorithm. Then we augment the transducer loss with this expected latency, so that an optimal trade-off between latency and accuracy is achieved. Experimental results on the WSJ dataset show that the proposed minimum latency training reduces the latency of causal Conformer-T from 220 ms to 27 ms within a WER degradation of 0.7%, and outperforms conventional alignment-restricted training (110 ms) and FastEmit (67 ms) methods.

Comment: Presented at INTERSPEECH 202

arXiv.org e-Print Archive

Alignment Knowledge Distillation for Online Streaming Attention-based Speech Recognition

Author: Inaguma Hirofumi
Kawahara Tatsuya
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2023

This article describes an efficient training method for online streaming attention-based encoder-decoder (AED) automatic speech recognition (ASR) systems. AED models have achieved competitive performance in offline scenarios by jointly optimizing all components. They have recently been extended to an online streaming framework via models such as monotonic chunkwise attention (MoChA). However, the elaborate attention calculation process is not robust against long-form speech utterances. Moreover, the sequence-level training objective and time-restricted streaming encoder cause a nonnegligible delay in token emission during inference. To address these problems, we propose CTC synchronous training (CTC-ST), in which CTC alignments are leveraged as a reference for token boundaries to enable a MoChA model to learn optimal monotonic input-output alignments. We formulate a purely end-to-end training objective to synchronize the boundaries of MoChA to those of CTC. The CTC model shares an encoder with the MoChA model to enhance the encoder representation. Moreover, the proposed method provides alignment information learned in the CTC branch to the attention-based decoder. Therefore, CTC-ST can be regarded as self-distillation of alignment knowledge from CTC to MoChA. Experimental evaluations on a variety of benchmark datasets show that the proposed method significantly reduces recognition errors and emission latency simultaneously. The robustness to long-form and noisy speech is also demonstrated. We compare CTC-ST with several methods that distill alignment knowledge from a hybrid ASR system and show that CTC-ST can achieve a comparable tradeoff of accuracy and latency without relying on external alignment information.

Kyoto University Research Information Repository

Streaming cascade-based speech translation leveraged by a direct segmentation model

Author: Iranzo-Sánchez J.
Jorge-Cano J.
Baquero-Arnal P.
Silvestre Cerdà JA.
Giménez Pastor A.
Civera Saiz J.
Sanchis Navarro JA.
...
Publication venue: Elsevier BV
Publication date: 31/05/2021

The cascade approach to Speech Translation (ST) is based on a pipeline that concatenates an Automatic Speech Recognition (ASR) system followed by a Machine Translation (MT) system. Nowadays, state-of-the-art ST systems are populated with deep neural networks that are conceived to work in an offline setup in which the audio input to be translated is fully available in advance. However, a streaming setup defines a completely different picture, in which an unbounded audio input gradually becomes available and at the same time the translation needs to be generated under real-time constraints. In this work, we present a state-of-the-art streaming ST system in which neural-based models integrated in the ASR and MT components are carefully adapted in terms of their training and decoding procedures in order to run under a streaming setup. In addition, a direct segmentation model that adapts the continuous ASR output to the capacity of simultaneous MT systems trained at the sentence level is introduced to guarantee low latency while preserving the translation quality of the complete ST system. The resulting ST system is thoroughly evaluated on the real-life streaming Europarl-ST benchmark to gauge the trade-off between quality and latency for each component individually as well as for the complete ST system.

The research leading to these results has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement no. 761758 (X5Gon) and 952215 (TAILOR); the Government of Spain's research project Multisub, ref. RTI2018-094879-B-I00 (MCIU/AEI/FEDER,EU) and FPU scholarships FPU14/03981 and FPU18/04135; and the Generalitat Valenciana's research project Classroom Activity Recognition, ref. PROMETEO/2019/111 and predoctoral research scholarship ACIF/2017/055.

Iranzo-Sánchez, J.; Jorge-Cano, J.; Baquero-Arnal, P.; Silvestre Cerdà, JA.; Giménez Pastor, A.; Civera Saiz, J.; Sanchis Navarro, JA.... (2021). Streaming cascade-based speech translation leveraged by a direct segmentation model. Neural Networks. 142:303-315. https://doi.org/10.1016/j.neunet.2021.05.013

Crossref

RiuNet