Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection
Encoder-decoder models provide a generic architecture for
sequence-to-sequence tasks such as speech recognition and translation. While
offline systems are often evaluated on quality metrics like word error rates
(WER) and BLEU, latency is also a crucial factor in many practical use cases.
We propose three latency reduction techniques for chunk-based incremental
inference and evaluate their efficiency in terms of accuracy-latency trade-off.
On the 300-hour How2 dataset, we reduce latency by 83% to 0.8 seconds at the
cost of 1% WER (6% rel.) compared to offline transcription. Although our
experiments use the Transformer, the hypothesis selection strategies are
applicable to other encoder-decoder models. To avoid expensive re-computation,
we use a unidirectionally-attending encoder. After an adaptation procedure to
partial sequences, the unidirectional model performs on-par with the original
model. We further show that our approach is also applicable to low-latency
speech translation. On How2 English-Portuguese speech translation, we reduce
latency to 0.7 seconds (-84% rel.) while incurring a loss of 2.4 BLEU points
(5% rel.) compared to the offline system.
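The chunk-based incremental inference described above can be sketched as follows. The local-agreement rule (commit only the prefix on which two consecutive chunk hypotheses agree), the `decode` callback, and all names here are illustrative assumptions, not the paper's exact hypothesis-selection strategies:

```python
def shared_prefix(a, b):
    """Longest common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]

def incremental_decode(chunks, decode):
    """Chunk-based incremental inference with a simple hypothesis-selection
    rule: after each new audio chunk, re-decode all audio received so far
    and commit (emit) only the prefix shared with the previous hypothesis,
    so later audio can never retract already-emitted output.
    `decode` stands in for a full encoder-decoder pass."""
    committed, prev_hyp, received = [], [], []
    for chunk in chunks:
        received.extend(chunk)
        hyp = decode(received)
        stable = shared_prefix(prev_hyp, hyp)
        # emit only tokens beyond what was already committed
        committed.extend(stable[len(committed):])
        prev_hyp = hyp
    # after the final chunk, commit the remaining hypothesis
    committed.extend(prev_hyp[len(committed):])
    return committed
```

Because output is committed monotonically, latency is bounded by how quickly consecutive hypotheses stabilize rather than by the full utterance length.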
Working Memory in Writing: Empirical Evidence From the Dual-Task Technique
The dual-task paradigm has recently played a major role in understanding the role of working memory in writing. Reviewing recent findings in this field of research, this article highlights how the dual-task technique has allowed researchers to study the processing and short-term storage functions of working memory involved in writing. With respect to the processing functions of working memory (namely, attentional and executive functions), studies have investigated resource allocation, step-by-step management, and parallel coordination of the writing processes. With respect to short-term storage in working memory, experiments have mainly attempted to test Kellogg's (1996) proposals on the relationship between the writing processes and the slave systems of working memory. It is concluded that the dual-task technique has proven fruitful in understanding the relationship between writing and working memory.
Yeah, Right, Uh-Huh: A Deep Learning Backchannel Predictor
Using supporting backchannel (BC) cues can make human-computer interaction
more social. BCs provide feedback from the listener to the speaker, indicating
that the speaker is still being listened to. BCs can be expressed in different
ways depending on the modality of the interaction, for example as gestures or
acoustic cues. In this work, we considered only acoustic cues. We propose an
approach to detecting BC opportunities based on acoustic input features such
as power and pitch. While other works in the field rely on hand-written rule
sets or specialized features, we used artificial neural networks, which are
capable of deriving higher-order features from the input features themselves.
In our setup, we first used a fully connected feed-forward network to
establish an updated baseline in comparison to our previously proposed setup.
We then extended this setup with Long Short-Term Memory (LSTM) networks, which
have been shown to outperform feed-forward setups on various tasks. Our best
system achieved an F1-score of 0.37 using power and pitch features. Adding
linguistic information using word2vec increased the score to 0.39.
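As an illustration of the acoustic power features mentioned above, here is a minimal frame-level log-power extractor in NumPy. The frame and hop sizes (25 ms / 10 ms at 16 kHz) and the function name are assumptions for the sketch, not the paper's exact feature pipeline; such per-frame features would then be fed to the feed-forward or LSTM classifier:

```python
import numpy as np

def frame_power(signal, frame_len=400, hop=160):
    """Log power per frame, a typical acoustic feature for BC prediction.
    Defaults correspond to 25 ms frames with a 10 ms hop at 16 kHz
    (illustrative choices). Returns one value per frame."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # small epsilon avoids log(0) on silent frames
        feats.append(np.log(np.mean(frame ** 2) + 1e-10))
    return np.array(feats)
```

A sequence model like an LSTM would consume such frame features in order, predicting at each step whether a BC opportunity is present.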
End-to-End Evaluation for Low-Latency Simultaneous Speech Translation
The challenge of low-latency speech translation has recently drawn significant
interest in the research community, as shown by several publications and
shared tasks. It is therefore essential to evaluate these different approaches
in realistic scenarios. Currently, however, only specific aspects of the
systems are evaluated, and it is often not possible to compare different
approaches.
In this work, we propose the first framework to perform and evaluate the
various aspects of low-latency speech translation under realistic conditions.
The evaluation is carried out in an end-to-end fashion, accounting for the
segmentation of the audio as well as the run-time of the different components.
Secondly, we compare different approaches to low-latency speech translation
using this framework. We evaluate models that can revise their output as well
as methods with fixed output. Furthermore, we directly compare
state-of-the-art cascaded and end-to-end systems. Finally, the framework
automatically evaluates translation quality as well as latency, and also
provides a web interface to show the low-latency model outputs to the user.
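One simple per-word latency notion such a framework might report can be sketched as follows; the function, its inputs, and this particular definition of lag are hypothetical illustrations, not the metric actually implemented by the framework:

```python
def mean_word_latency(word_audio_times, word_emit_times):
    """Average lag, in seconds, between the time a word is fully spoken
    in the source audio and the time the system emits its translation.
    Both lists are aligned per word; this is one simple latency notion
    among several used in the literature."""
    assert len(word_audio_times) == len(word_emit_times)
    lags = [emit - audio
            for audio, emit in zip(word_audio_times, word_emit_times)]
    return sum(lags) / len(lags)
```

An end-to-end evaluation would feed real wall-clock emission timestamps into such a metric, so that segmentation and component run-times are reflected in the reported latency.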
Visualization: the missing factor in Simultaneous Speech Translation
Simultaneous speech translation (SimulST) is the task in which output
generation has to be performed on partial, incremental speech input. In recent
years, SimulST has become popular due to the spread of cross-lingual
application scenarios, like international live conferences and streaming
lectures, in which on-the-fly speech translation can facilitate users' access
to audio-visual content. In this paper, we analyze the characteristics of the
SimulST systems developed so far, discussing their strengths and weaknesses. We
then concentrate on the evaluation framework required to properly assess
systems' effectiveness. To this end, we raise the need for a broader
performance analysis, also including the user experience standpoint. SimulST
systems, indeed, should be evaluated not only in terms of quality/latency
measures, but also via task-oriented metrics accounting, for instance, for the
visualization strategy adopted. In light of this, we highlight the goals
achieved by the community and what is still missing.
Comment: Accepted at CLIC-it 202