Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection
Encoder-decoder models provide a generic architecture for
sequence-to-sequence tasks such as speech recognition and translation. While
offline systems are often evaluated on quality metrics like word error rates
(WER) and BLEU, latency is also a crucial factor in many practical use-cases.
We propose three latency reduction techniques for chunk-based incremental
inference and evaluate their efficiency in terms of accuracy-latency trade-off.
On the 300-hour How2 dataset, we reduce latency by 83% to 0.8 seconds by
sacrificing 1% WER (6% rel.) compared to offline transcription. Although our
experiments use the Transformer, the hypothesis selection strategies are
applicable to other encoder-decoder models. To avoid expensive re-computation,
we use a unidirectionally-attending encoder. After an adaptation procedure to
partial sequences, the unidirectional model performs on-par with the original
model. We further show that our approach is also applicable to low-latency
speech translation. On How2 English-Portuguese speech translation, we reduce
latency to 0.7 seconds (-84% rel.) while incurring a loss of 2.4 BLEU points
(5% rel.) compared to the offline system.
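The chunk-based incremental inference described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `decode_fn` stands in for any encoder-decoder model, and the hold-back rule shown (committing only the prefix on which two consecutive partial hypotheses agree) is one plausible instance of a partial hypothesis selection strategy.

```python
def common_prefix(a, b):
    """Longest shared token prefix of two hypothesis lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]

def incremental_decode(chunks, decode_fn):
    """Chunk-based incremental inference: re-decode on the growing
    input after each chunk, but commit (display) only the prefix
    that two consecutive partial hypotheses agree on, so unstable
    tokens are held back until they stop changing."""
    committed, prev = [], []
    for i in range(1, len(chunks) + 1):
        hyp = decode_fn(chunks[:i])          # hypothesis for the input so far
        stable = common_prefix(prev, hyp)
        if len(stable) > len(committed):
            committed = stable               # emit newly stabilized tokens
        prev = hyp
    return prev                              # last chunk: flush everything
```

Because each step re-decodes the full prefix, a unidirectionally-attending encoder (as in the abstract) avoids recomputing encoder states for audio already seen.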
From Simultaneous to Streaming Machine Translation by Leveraging Streaming History
Simultaneous Machine Translation is the task of incrementally translating an
input sentence before it is fully available. Currently, simultaneous
translation is carried out by translating each sentence independently of the
previously translated text. More generally, Streaming MT can be understood as
an extension of Simultaneous MT to the incremental translation of a continuous
input text stream. In this work, a state-of-the-art simultaneous sentence-level
MT system is extended to the streaming setup by leveraging the streaming
history. Extensive empirical results are reported on IWSLT Translation Tasks,
showing that leveraging the streaming history leads to significant quality
gains. In particular, the proposed system compares favorably with the
best-performing systems.
Comment: ACL 2022 - Camera ready; v3: expanded data pre-processing
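As a rough illustration of the idea, the following sketch keeps a bounded window of previously translated (source, target) pairs and passes it to the model with every new sentence. The `translate_fn` interface and all names here are assumptions for illustration, not the paper's implementation.

```python
from collections import deque

def translate_stream(sentences, translate_fn, history_len=2):
    """Streaming MT sketch: instead of translating each sentence
    independently, condition every call on a bounded window of
    previous (source, translation) pairs -- the streaming history."""
    history = deque(maxlen=history_len)   # oldest pairs fall off automatically
    outputs = []
    for src in sentences:
        tgt = translate_fn(src, list(history))
        history.append((src, tgt))
        outputs.append(tgt)
    return outputs
```

The bounded window matters in practice: an unbounded history would grow without limit on a continuous stream, so some truncation or caching policy is needed.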
Visualization: the missing factor in Simultaneous Speech Translation
Simultaneous speech translation (SimulST) is the task in which output
generation has to be performed on partial, incremental speech input. In recent
years, SimulST has become popular due to the spread of cross-lingual
application scenarios, like international live conferences and streaming
lectures, in which on-the-fly speech translation can facilitate users' access
to audio-visual content. In this paper, we analyze the characteristics of the
SimulST systems developed so far, discussing their strengths and weaknesses. We
then concentrate on the evaluation framework required to properly assess
systems' effectiveness. To this end, we call for a broader performance
analysis that also includes the user-experience standpoint. SimulST
systems, indeed, should be evaluated not only in terms of quality/latency
measures, but also via task-oriented metrics that account, for instance, for
the visualization strategy adopted. In light of this, we highlight the goals
the community has achieved and what is still missing.
Comment: Accepted at CLIC-it 202
Rethinking the Reasonability of the Test Set for Simultaneous Machine Translation
Simultaneous machine translation (SimulMT) models start translation before
the end of the source sentence, making the translation monotonically aligned
with the source sentence. However, the general full-sentence translation test
set is acquired by offline translation of the entire source sentence and is
not designed for SimulMT evaluation, prompting us to ask whether it
underestimates the performance of SimulMT models. In this paper, we manually
annotate a monotonic test set based on the MuST-C English-Chinese test set,
denoted as SiMuST-C. Our human evaluation confirms the acceptability of our
annotated test set. Evaluations on three different SimulMT models verify that
the underestimation problem can be alleviated on our test set. Further
experiments show that finetuning on an automatically extracted monotonic
training set improves SimulMT models by up to 3 BLEU points.
Comment: Accepted by the 48th IEEE International Conference on Acoustics,
Speech, and Signal Processing (ICASSP 2023)
Segmentation-Free Streaming Machine Translation
Streaming Machine Translation (MT) is the task of translating an unbounded
input text stream in real-time. The traditional cascade approach, which
combines an Automatic Speech Recognition (ASR) and an MT system, relies on an
intermediate segmentation step which splits the transcription stream into
sentence-like units. However, the incorporation of a hard segmentation
constrains the MT system and is a source of errors. This paper proposes a
Segmentation-Free framework that enables the model to translate an unsegmented
source stream by delaying the segmentation decision until the translation has
been generated. Extensive experiments show how the proposed Segmentation-Free
framework has better quality-latency trade-off than competing approaches that
use an independent segmentation model. Software, data and models will be
released upon paper acceptance.
Comment: 11 pages, 5 figures
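A minimal sketch of the delayed-segmentation idea: source tokens accumulate in a buffer, and only after the model produces output does it fix how much of the buffer the segment it just translated covered. The `step_fn` interface is an assumption for illustration, not the paper's model.

```python
def segmentation_free_translate(token_stream, step_fn):
    """Segmentation-free streaming sketch: no up-front sentence
    splitting. At each step the model sees the whole unconsumed
    source buffer; if it emits output, it also reports how many
    source tokens that output covered, and only then is the
    segmentation boundary fixed."""
    source_buf, outputs = [], []
    for tok in token_stream:
        source_buf.append(tok)
        out_tokens, consumed = step_fn(source_buf)
        if out_tokens:
            outputs.extend(out_tokens)
            source_buf = source_buf[consumed:]   # drop only what was committed
    return outputs
```

Contrast with the cascade approach: a hard segmenter would have to commit to boundaries before translation, so any segmentation error propagates into the MT system; here the boundary is a by-product of the translation itself.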