Efficient Wait-k Models for Simultaneous Machine Translation
Simultaneous machine translation consists in starting output generation
before the entire input sequence is available. Wait-k decoders offer a simple
but efficient approach for this problem. They first read k source tokens, after
which they alternate between producing a target token and reading another
source token. We investigate the behavior of wait-k decoding in low resource
settings for spoken corpora using IWSLT datasets. We improve training of these
models using unidirectional encoders, and training across multiple values of k.
Experiments with Transformer and 2D-convolutional architectures show that our
wait-k models generalize well across a wide range of latency levels. We also
show that the 2D-convolution architecture is competitive with Transformers for
simultaneous translation of spoken language.
Comment: Accepted at INTERSPEECH 202
Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection
Encoder-decoder models provide a generic architecture for
sequence-to-sequence tasks such as speech recognition and translation. While
offline systems are often evaluated on quality metrics like word error rates
(WER) and BLEU, latency is also a crucial factor in many practical use-cases.
We propose three latency reduction techniques for chunk-based incremental
inference and evaluate their efficiency in terms of accuracy-latency trade-off.
On the 300-hour How2 dataset, we reduce latency by 83% to 0.8 second by
sacrificing 1% WER (6% rel.) compared to offline transcription. Although our
experiments use the Transformer, the hypothesis selection strategies are
applicable to other encoder-decoder models. To avoid expensive re-computation,
we use a unidirectionally-attending encoder. After an adaptation procedure to
partial sequences, the unidirectional model performs on par with the original
model. We further show that our approach is also applicable to low-latency
speech translation. On How2 English-Portuguese speech translation, we reduce
latency to 0.7 second (-84% rel.) while incurring a loss of 2.4 BLEU points (5%
rel.) compared to the offline system.
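One simple chunk-based selection strategy of this kind is "hold-n": after each incremental decoding pass, withhold the last n tokens of the hypothesis, since they are the most likely to change when more audio arrives. A minimal sketch of the idea (an illustration, not the authors' implementation):

```python
def hold_n_commit(hypotheses_per_chunk, n=2):
    """'Hold-n' partial hypothesis selection (a sketch of the idea):
    after each incremental decoding pass over the audio received so
    far, commit the hypothesis except for its last n tokens, which
    are the most likely to be revised when more audio arrives."""
    committed = []
    for hyp in hypotheses_per_chunk:
        cut = max(len(committed), len(hyp) - n)
        if cut > len(committed):
            # extend the committed prefix (a real system would also
            # check that it matches what was already displayed)
            committed = hyp[:cut]
    return committed
```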
Non-autoregressive Streaming Transformer for Simultaneous Translation
Simultaneous machine translation (SiMT) models are trained to strike a
balance between latency and translation quality. However, training these models
to achieve high quality while maintaining low latency often leads to a tendency
for aggressive anticipation. We argue that this issue stems from the
autoregressive architecture upon which most existing SiMT models are built. To
address it, we propose the non-autoregressive streaming Transformer (NAST),
which comprises a unidirectional encoder and a non-autoregressive
decoder with intra-chunk parallelism. We enable NAST to generate the blank
token or repetitive tokens to adjust its READ/WRITE strategy flexibly, and
train it to maximize the non-monotonic latent alignment with an alignment-based
latency loss. Experiments on various SiMT benchmarks demonstrate that NAST
outperforms previous strong autoregressive SiMT baselines.
Comment: EMNLP 2023 main conference; source code is available at https://github.com/ictnlp/NAS
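NAST's blank and repeated tokens work like CTC-style output collapsing: consecutive repeats are merged and blanks are dropped, so emitting a blank effectively defers a WRITE. A minimal sketch of that collapse step (illustrative, not the released code):

```python
def collapse_ctc(tokens, blank="<blank>"):
    """CTC-style collapse of a non-autoregressive chunk output:
    merge consecutive repeated tokens, then drop blank tokens.
    Emitting a blank lets the model effectively defer WRITE;
    emitting repeats lets parallel positions agree on one token."""
    out = []
    prev = None
    for t in tokens:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out
```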
Enhanced Simultaneous Machine Translation with Word-level Policies
Recent years have seen remarkable advances in the field of Simultaneous
Machine Translation (SiMT) due to the introduction of innovative policies that
dictate whether to READ or WRITE at each step of the translation process.
However, a common assumption in many existing studies is that operations are
carried out at the subword level, even though the standard unit for input and
output in most practical scenarios is typically at the word level. This paper
demonstrates that policies devised and validated at the subword level are
surpassed by those operating at the word level, which process multiple subwords
to form a complete word in a single step. Additionally, we suggest a method to
boost SiMT models using language models (LMs), wherein the proposed word-level
policy plays a vital role in addressing the subword disparity between LMs and
SiMT models. Code is available at https://github.com/xl8-ai/WordSiMT.
Comment: EMNLP 2023 Finding
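A word-level policy needs to know where subword boundaries fall so that one READ consumes all subwords of a word. Assuming SentencePiece-style tokens where "▁" marks a word start (the marker choice is an assumption for illustration), grouping subwords into word-level READ units looks like:

```python
def word_level_reads(subwords):
    """Group a subword stream into word-level units, assuming
    SentencePiece-style tokens where '▁' marks a word start.
    A word-level policy issues one READ per group rather than
    one READ per subword."""
    words, current = [], []
    for sw in subwords:
        if sw.startswith("▁") and current:
            words.append(current)  # previous word is complete
            current = []
        current.append(sw)
    if current:
        words.append(current)
    return words
```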
An Empirical Study of End-to-end Simultaneous Speech Translation Decoding Strategies
This paper proposes a decoding strategy for end-to-end simultaneous speech
translation. We leverage end-to-end models trained in offline mode and conduct
an empirical study for two language pairs (English-to-German and
English-to-Portuguese). We also investigate different output token
granularities including characters and Byte Pair Encoding (BPE) units. The
results show that the proposed decoding approach makes it possible to control the BLEU/Average
Lagging trade-off along different latency regimes. Our best decoding settings
achieve comparable results with a strong cascade model evaluated on the
simultaneous translation track of the IWSLT 2020 shared task.
Comment: This paper has been accepted for presentation at IEEE ICASSP 202
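Average Lagging (AL), the latency half of the BLEU/AL trade-off above, is commonly computed from per-token read delays; a minimal sketch, where `delays[t]` is the number of source tokens read before emitting target token t:

```python
def average_lagging(delays, src_len, tgt_len):
    """Average Lagging (AL) latency metric, sketched: average how far
    the decoder lags behind an ideal diagonal policy, summed up to
    the first target position at which the full source has been read.
    `delays` lists, per target token, the source tokens read so far."""
    gamma = tgt_len / src_len          # target/source length ratio
    al, tau = 0.0, 0
    for t, g in enumerate(delays, start=1):
        al += g - (t - 1) / gamma      # lag behind the ideal diagonal
        tau = t
        if g >= src_len:               # full source consumed: stop summing
            break
    return al / tau
```

For a wait-k run with equal source and target lengths, AL reduces to k, which matches the intuition that the decoder is always k tokens behind.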
Streaming cascade-based speech translation leveraged by a direct segmentation model
The cascade approach to Speech Translation (ST) is based on a pipeline that concatenates an Automatic Speech Recognition (ASR) system followed by a Machine Translation (MT) system. Nowadays, state-of-the-art ST systems are populated with deep neural networks that are conceived to work in an offline setup in which the audio input to be translated is fully available in advance. However, a streaming setup defines a completely different picture, in which an unbounded audio input gradually becomes available and at the same time the translation needs to be generated under real-time constraints. In this work, we present a state-of-the-art streaming ST system in which neural-based models integrated in the ASR and MT components are carefully adapted in terms of their training and decoding procedures in order to run under a streaming setup. In addition, a direct segmentation model that adapts the continuous ASR output to the capacity of simultaneous MT systems trained at the sentence level is introduced to guarantee low latency while preserving the translation quality of the complete ST system. The resulting ST system is thoroughly evaluated on the real-life streaming Europarl-ST benchmark to gauge the trade-off between quality and latency for each component individually as well as for the complete ST system.

The research leading to these results has received funding from the European Union's Horizon 2020 research and innovation program under grant agreements no. 761758 (X5Gon) and 952215 (TAILOR); the Government of Spain's research project Multisub, ref. RTI2018-094879-B-I00 (MCIU/AEI/FEDER, EU), and FPU scholarships FPU14/03981 and FPU18/04135; and the Generalitat Valenciana's research project Classroom Activity Recognition, ref. PROMETEO/2019/111, and predoctoral research scholarship ACIF/2017/055.

Iranzo-Sánchez, J.; Jorge-Cano, J.; Baquero-Arnal, P.; Silvestre Cerdà, JA.; Giménez Pastor, A.; Civera Saiz, J.; Sanchis Navarro, JA. ... (2021). Streaming cascade-based speech translation leveraged by a direct segmentation model. Neural Networks, 142:303-315. https://doi.org/10.1016/j.neunet.2021.05.013
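The cascade pipeline described above, with a segmentation model bridging the continuous ASR output and a sentence-level simultaneous MT system, can be sketched with hypothetical component interfaces (all three callables are stand-ins, not the paper's actual components):

```python
def cascade_streaming_st(audio_chunks, asr_step, segmenter, mt_translate):
    """Sketch of a streaming cascade ST pipeline: incremental ASR
    produces a growing transcript, a segmentation model cuts it into
    sentence-like units, and each closed unit is handed to a
    sentence-level MT system. All component interfaces are
    hypothetical stand-ins for illustration."""
    transcript, translations = [], []
    for chunk in audio_chunks:
        transcript.extend(asr_step(chunk))    # incremental ASR output
        boundary = segmenter(transcript)      # index closing a segment, or None
        if boundary is not None:
            translations.append(mt_translate(transcript[:boundary]))
            transcript = transcript[boundary:]  # keep the open remainder
    return translations
```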
Impact of Encoding and Segmentation Strategies on End-to-End Simultaneous Speech Translation
Boosted by the simultaneous translation shared task at IWSLT 2020, promising
end-to-end online speech translation approaches were recently proposed. They
consist in incrementally encoding a speech input (in a source language) and
decoding the corresponding text (in a target language) with the best possible
trade-off between latency and translation quality. This paper investigates two
key aspects of end-to-end simultaneous speech translation: (a) how to encode
efficiently the continuous speech flow, and (b) how to segment the speech flow
in order to alternate optimally between reading (R: encoding input) and writing
(W: decoding output) operations. We extend our previously proposed end-to-end
online decoding strategy and show that while replacing BLSTM by ULSTM encoding
degrades performance in offline mode, it actually improves both efficiency and
performance in online mode. We also measure the impact of different methods to
segment the speech signal (using fixed interval boundaries, oracle word
boundaries or randomly set boundaries) and show that our best end-to-end online
decoding strategy is surprisingly the one that alternates R/W operations on
fixed-size blocks in our English-German speech translation setup.
Comment: Accepted for presentation at Interspeech 202
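The fixed-interval R/W alternation that performs best in these experiments can be sketched as follows; `decode_step` is a hypothetical model call that returns the next token, or None when it prefers to wait for more input:

```python
def fixed_block_online_decode(frames, block_size, decode_step):
    """Sketch of fixed-interval R/W alternation: READ a fixed-size
    block of speech frames, then WRITE tokens until the (hypothetical)
    `decode_step` declines to emit, then READ the next block, and so
    on until the input is exhausted."""
    encoded, output = [], []
    for i in range(0, len(frames), block_size):
        encoded.extend(frames[i:i + block_size])  # READ one fixed block
        while True:                               # WRITE phase
            tok = decode_step(encoded, output)
            if tok is None:
                break
            output.append(tok)
    return output
```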
Context Consistency between Training and Testing in Simultaneous Machine Translation
Simultaneous Machine Translation (SiMT) aims to yield a real-time partial
translation with a monotonically growing source-side context. However,
there is a counterintuitive phenomenon regarding context usage between
training and testing: for example, a model tested with wait-k performs much
worse, in terms of translation quality, when it is consistently trained with
wait-k than when it is inconsistently trained with wait-k' (k' ≠ k). We first investigate the
underlying reasons behind this phenomenon and uncover the following two
factors: 1) the limited correlation between translation quality and training
(cross-entropy) loss; 2) exposure bias between training and testing. Based on
both reasons, we then propose an effective training approach called context
consistency training, which aligns context usage between training and testing
by optimizing translation quality and latency as bi-objectives and by exposing
the model to its own predictions during training. Experiments on three
language pairs confirm this intuition: with the help of the context
consistency training approach, our context-consistent system outperforms
existing context-inconsistent systems for the first time.
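"Exposing the predictions to the model during the training" is in the spirit of scheduled sampling; a generic sketch of that mechanism (not the authors' exact procedure), mixing gold and model tokens in the decoder input:

```python
import random

def mix_teacher_forcing(gold_prefix, model_prefix, p_model=0.3):
    """Scheduled-sampling-style mixing, sketched generically: with
    probability `p_model`, feed the model's own previous prediction
    into the decoder instead of the gold token, so training-time
    context resembles test-time context and exposure bias shrinks."""
    return [m if random.random() < p_model else g
            for g, m in zip(gold_prefix, model_prefix)]
```

At p_model = 0 this degenerates to plain teacher forcing; at p_model = 1 the decoder always sees its own predictions, as it would at test time.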
Visualization: the missing factor in Simultaneous Speech Translation
Simultaneous speech translation (SimulST) is the task in which output
generation has to be performed on partial, incremental speech input. In recent
years, SimulST has become popular due to the spread of cross-lingual
application scenarios, like international live conferences and streaming
lectures, in which on-the-fly speech translation can facilitate users' access
to audio-visual content. In this paper, we analyze the characteristics of the
SimulST systems developed so far, discussing their strengths and weaknesses. We
then concentrate on the evaluation framework required to properly assess
systems' effectiveness. To this end, we argue for a broader performance
analysis that also includes the user-experience standpoint. SimulST
systems, indeed, should be evaluated not only in terms of quality/latency
measures, but also via task-oriented metrics accounting, for instance, for the
visualization strategy adopted. In light of this, we highlight which are the
goals achieved by the community and what is still missing.
Comment: Accepted at CLIC-it 202