6,100 research outputs found

Streaming cascade-based speech translation leveraged by a direct segmentation model

Author: Iranzo-Sánchez, J.
Jorge-Cano, J.
Baquero-Arnal, P.
Silvestre Cerdà, JA.
Giménez Pastor, A.
Civera Saiz, J.
Sanchis Navarro, JA.
...
Publication venue: 'Elsevier BV'
Publication date: 31/05/2021

The cascade approach to Speech Translation (ST) is based on a pipeline that concatenates an Automatic Speech Recognition (ASR) system followed by a Machine Translation (MT) system. Nowadays, state-of-the-art ST systems are populated with deep neural networks that are conceived to work in an offline setup in which the audio input to be translated is fully available in advance. However, a streaming setup defines a completely different picture, in which an unbounded audio input gradually becomes available and at the same time the translation needs to be generated under real-time constraints. In this work, we present a state-of-the-art streaming ST system in which neural-based models integrated in the ASR and MT components are carefully adapted in terms of their training and decoding procedures in order to run under a streaming setup. In addition, a direct segmentation model that adapts the continuous ASR output to the capacity of simultaneous MT systems trained at the sentence level is introduced to guarantee low latency while preserving the translation quality of the complete ST system. The resulting ST system is thoroughly evaluated on the real-life streaming Europarl-ST benchmark to gauge the trade-off between quality and latency for each component individually as well as for the complete ST system.

The research leading to these results has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement no. 761758 (X5Gon) and 952215 (TAILOR); the Government of Spain's research project Multisub, ref. RTI2018-094879-B-I00 (MCIU/AEI/FEDER,EU) and FPU scholarships FPU14/03981 and FPU18/04135; and the Generalitat Valenciana's research project Classroom Activity Recognition, ref. PROMETEO/2019/111 and predoctoral research scholarship ACIF/2017/055.

Iranzo-Sánchez, J.; Jorge-Cano, J.; Baquero-Arnal, P.; Silvestre Cerdà, JA.; Giménez Pastor, A.; Civera Saiz, J.; Sanchis Navarro, JA.... (2021). Streaming cascade-based speech translation leveraged by a direct segmentation model. Neural Networks. 142:303-315. https://doi.org/10.1016/j.neunet.2021.05.013

Crossref

RiuNet

Learning to Translate in Real-time with Neural Machine Translation

Author: Cho Kyunghyun
Gu Jiatao
Li Victor O. K.
Neubig Graham
Publication date: 01/01/2017

Translating in real-time, a.k.a. simultaneous translation, outputs translation words before the input sentence ends, which is a challenging problem for conventional machine translation methods. We propose a neural machine translation (NMT) framework for simultaneous translation in which an agent learns to make decisions on when to translate from the interaction with a pre-trained NMT environment. To trade off quality and delay, we extensively explore various targets for delay and design a method for beam-search applicable in the simultaneous MT setting. Experiments against state-of-the-art baselines on two language pairs demonstrate the efficacy of the proposed framework both quantitatively and qualitatively. Comment: 10 pages, camera read

arXiv.org e-Print Archive

Crossref

HKU Scholars Hub

Lip Reading Sentences in the Wild

Author: Chung Joon Son
Senior Andrew
Vinyals Oriol
Zisserman Andrew
Publication date: 01/01/2017

The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio. Unlike previous works that have focussed on recognising a limited number of words or phrases, we tackle lip reading as an open-world problem: unconstrained natural language sentences, and in-the-wild videos. Our key contributions are: (1) a 'Watch, Listen, Attend and Spell' (WLAS) network that learns to transcribe videos of mouth motion to characters; (2) a curriculum learning strategy to accelerate training and to reduce overfitting; (3) a 'Lip Reading Sentences' (LRS) dataset for visual speech recognition, consisting of over 100,000 natural sentences from British television. The WLAS model trained on the LRS dataset surpasses the performance of all previous work on standard lip reading benchmark datasets, often by a significant margin. This lip reading performance beats a professional lip reader on videos from BBC television, and we also demonstrate that visual information helps to improve speech recognition performance even when the audio is available.

arXiv.org e-Print Archive

Crossref

Oxford University Research Archive

Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff

Author: Bojar Ondřej
Polák Peter
Waibel Alex
Watanabe Shinji
Yan Brian
Publication date: 20/09/2023

Blockwise self-attentional encoder models have recently emerged as one promising end-to-end approach to simultaneous speech translation. These models employ a blockwise beam search with hypothesis reliability scoring to determine when to wait for more input speech before translating further. However, this method maintains multiple hypotheses until the entire speech input is consumed, so this scheme cannot directly show a single incremental translation to users. Further, this method lacks mechanisms for controlling the quality vs. latency tradeoff. We propose a modified incremental blockwise beam search incorporating local agreement or hold-n policies for quality-latency control. We apply our framework to models trained for online or offline translation and demonstrate that both types can be effectively used in online mode. Experimental results on MuST-C show 0.6-3.6 BLEU improvement without changing latency or 0.8-1.4 s latency improvement without changing quality. Comment: Accepted at INTERSPEECH 202
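The two policies named in this abstract can be illustrated with a short sketch. It assumes the commonly used definitions: hold-n withholds the last n tokens of the current best hypothesis, and local agreement commits only the longest common prefix of the best hypotheses from two consecutive input blocks. The function names and the example token lists are hypothetical and are not taken from the paper's code.

```python
from typing import List, Sequence


def hold_n(hypothesis: Sequence[str], n: int) -> List[str]:
    """Commit all but the last n tokens of the current best hypothesis."""
    if n <= 0:
        return list(hypothesis)
    return list(hypothesis[:-n])


def local_agreement(previous: Sequence[str], current: Sequence[str]) -> List[str]:
    """Commit the longest common prefix of two consecutive hypotheses."""
    stable: List[str] = []
    for prev_tok, cur_tok in zip(previous, current):
        if prev_tok != cur_tok:
            break
        stable.append(cur_tok)
    return stable


# Hypothetical best hypotheses after two consecutive speech blocks.
hyp_block1 = ["the", "cat", "sat", "in"]
hyp_block2 = ["the", "cat", "sat", "on", "the", "mat"]

committed_hold = hold_n(hyp_block2, 2)
committed_agree = local_agreement(hyp_block1, hyp_block2)
```

In this sketch, a larger n (or requiring agreement across blocks) shows fewer tokens to the user at each step, which raises latency but lowers the chance of displaying a token that later changes.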

arXiv.org e-Print Archive

Towards Stream Translation: Adaptive Computation Time for Simultaneous Machine Translation

Author: Schneider Felix
Waibel Alexander
Publication venue: Association for Computational Linguistics
Publication date: 01/01/2020

Simultaneous machine translation systems rely on a policy to schedule read and write operations in order to begin translating a source sentence before it is complete. In this paper, we demonstrate the use of Adaptive Computation Time (ACT) as an adaptive, learned policy for simultaneous machine translation using the transformer model and as a more numerically stable alternative to Monotonic Infinite Lookback Attention (MILk). We achieve state-of-the-art results in terms of latency-quality tradeoffs. We also propose a method to use our model on unsegmented input, i.e. without sentence boundaries, simulating the condition of translating output from automatic speech recognition. We present first benchmark results on this task.

Crossref

KITopen

Segmentation-Free Streaming Machine Translation

Author: Civera Jorge
Giménez Adrià
Iranzo-Sánchez Javier
Iranzo-Sánchez Jorge
Juan Alfons
Publication date: 26/09/2023

Streaming Machine Translation (MT) is the task of translating an unbounded input text stream in real-time. The traditional cascade approach, which combines an Automatic Speech Recognition (ASR) and an MT system, relies on an intermediate segmentation step which splits the transcription stream into sentence-like units. However, the incorporation of a hard segmentation constrains the MT system and is a source of errors. This paper proposes a Segmentation-Free framework that enables the model to translate an unsegmented source stream by delaying the segmentation decision until the translation has been generated. Extensive experiments show that the proposed Segmentation-Free framework achieves a better quality-latency trade-off than competing approaches that use an independent segmentation model. Software, data and models will be released upon paper acceptance. Comment: 11 pages, 5 figure

arXiv.org e-Print Archive