Findings of the IWSLT 2022 Evaluation Campaign.
The evaluation campaign of the 19th International Conference on Spoken Language Translation featured eight shared tasks: (i) Simultaneous speech translation, (ii) Offline speech translation, (iii) Speech to speech translation, (iv) Low-resource speech translation, (v) Multilingual speech translation, (vi) Dialect speech translation, (vii) Formality control for speech translation, (viii) Isometric speech translation. A total of 27 teams participated in at least one of the shared tasks. This paper details, for each shared task, the purpose of the task, the data that were released, the evaluation metrics that were applied, the submissions that were received and the results that were achieved.
End-to-End Speech Translation with Pre-trained Models and Adapters: UPC at IWSLT 2021
This paper describes the submission to the IWSLT 2021 offline speech
translation task by the UPC Machine Translation group. The task consists of
building a system capable of translating English audio recordings extracted
from TED talks into German text. Submitted systems can be either cascade or
end-to-end and use a custom or given segmentation. Our submission is an
end-to-end speech translation system, which combines pre-trained models
(Wav2Vec 2.0 and mBART) with coupling modules between the encoder and decoder,
and uses an efficient fine-tuning technique, which trains only 20% of its total
parameters. We show that adding an Adapter to the system and pre-training it
can increase the convergence speed and improve the final result, achieving
a BLEU score of 27.3 on the MuST-C test set. Our final model is an ensemble
that obtains a BLEU score of 28.22 on the same set. Our submission also uses a
custom segmentation algorithm that employs pre-trained Wav2Vec 2.0 for
identifying periods of untranscribable text and can bring improvements of 2.5
to 3 BLEU score on the IWSLT 2019 test set, as compared to the result with the
given segmentation.
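The efficient fine-tuning described above freezes the large pre-trained components (Wav2Vec 2.0 and mBART) and trains only small coupling modules. As a rough illustration of the bottleneck-adapter idea (a Houlsby-style adapter is a common choice; the paper's exact module, dimensions, and initialization are not given in the abstract, so everything below is an assumption), a minimal pure-Python sketch:

```python
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def adapter(x, W_down, W_up):
    """Bottleneck adapter: down-project, ReLU, up-project, residual add."""
    h = [max(0.0, v) for v in matvec(W_down, x)]   # d -> r, with ReLU
    u = matvec(W_up, h)                            # r -> d
    return [xi + ui for xi, ui in zip(x, u)]       # residual connection

d, r = 8, 2  # hidden size d, bottleneck size r (r << d keeps it cheap)
random.seed(0)
W_down = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(r)]
W_up   = [[0.0] * r for _ in range(d)]  # zero-init: adapter starts as identity

x = [random.gauss(0, 1) for _ in range(d)]
assert adapter(x, W_down, W_up) == x  # identity at initialization
```

Because the up-projection is zero-initialized, the adapter initially passes its input through unchanged, so inserting it does not perturb the frozen pre-trained model; only the 2·d·r adapter weights (plus any coupling layers) need gradients, which is how trainable parameters stay at a small fraction of the total.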
Dealing with training and test segmentation mismatch: FBK@IWSLT2021
This paper describes FBK's system submission to the IWSLT 2021 Offline Speech
Translation task. We participated with a direct model, which is a
Transformer-based architecture trained to translate English speech audio data
into German texts. The training pipeline is characterized by knowledge
distillation and a two-step fine-tuning procedure. Both knowledge distillation
and the first fine-tuning step are carried out on manually segmented real and
synthetic data, the latter being generated with an MT system trained on the
available corpora. In contrast, the second fine-tuning step is carried out on a
random segmentation of the MuST-C v2 En-De dataset. Its main goal is to reduce
the performance drops occurring when a speech translation model trained on
manually segmented data (i.e. an ideal, sentence-like segmentation) is
evaluated on automatically segmented audio (i.e. actual, more realistic testing
conditions). For the same purpose, a custom hybrid segmentation procedure that
accounts for both audio content (pauses) and for the length of the produced
segments is applied to the test data before passing them to the system. At
inference time, we compared this procedure with a baseline segmentation method
based on Voice Activity Detection (VAD). Our results indicate the effectiveness
of the proposed hybrid approach, shown by a reduction of the gap with manual
segmentation from 8.3 to 1.4 BLEU points.
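The hybrid segmentation procedure above combines an audio-content criterion (pauses) with a length criterion. A minimal greedy sketch of that combination follows; the thresholds, the forced-cut policy, and the input representation (pause timestamps in seconds) are assumptions for illustration, not FBK's actual algorithm:

```python
def hybrid_segment(pauses, total, max_len=20.0, min_len=5.0):
    """Greedy split: prefer cutting at detected pauses, but enforce
    segment-length bounds by forcing or skipping cuts.
    pauses: candidate cut times (s); total: audio duration (s)."""
    bounds, start = [], 0.0
    for p in sorted(pauses) + [total]:
        while p - start > max_len:        # no usable pause in reach: force a cut
            start = start + max_len
            bounds.append(start)
        if p == total:
            break
        if p - start >= min_len:          # pause is far enough from last cut
            bounds.append(p)
            start = p
    return bounds  # cut points; segments lie between consecutive cuts

cuts = hybrid_segment(pauses=[6.0, 9.0, 31.0], total=40.0)
print(cuts)  # → [6.0, 26.0, 31.0]
```

The pause at 9.0 s is skipped (segment would be too short) and a cut is forced at 26.0 s (no pause within the 20 s limit), so every resulting segment respects the length bounds. A pure VAD baseline, by contrast, would cut at every detected pause regardless of segment length.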
Long-Form End-to-End Speech Translation via Latent Alignment Segmentation
Current simultaneous speech translation models can process only a few seconds
of audio at a time. Contemporary datasets provide an oracle segmentation into
sentences based on human-annotated transcripts and translations. However, the
segmentation into sentences is not available in the real world. Current speech
segmentation approaches either offer poor segmentation quality or have to trade
latency for quality. In this paper, we propose a novel segmentation approach
for low-latency end-to-end speech translation. We leverage the existing
speech translation encoder-decoder architecture with ST CTC and show that it
can perform the segmentation task without supervision or additional parameters.
To the best of our knowledge, our method is the first that allows an actual
end-to-end simultaneous speech translation, as the same model is used for
translation and segmentation at the same time. On a diverse set of language
pairs and in- and out-of-domain data, we show that the proposed approach
achieves state-of-the-art quality at no additional computational cost.
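The abstract does not spell out how segment boundaries are read off the ST CTC output, so the following is only a rough illustration of one plausible decoding rule: threshold the per-frame posterior of a sentence-final token and enforce a minimum gap between boundaries. The `eos_probs` input, the threshold, and the gap are all invented for this sketch:

```python
def ctc_boundaries(eos_probs, threshold=0.5, min_gap=10):
    """Pick segment boundaries from per-frame posteriors of a
    (hypothetical) sentence-final CTC token: take threshold crossings,
    enforcing a minimum frame distance between consecutive boundaries."""
    bounds, last = [], -min_gap
    for t, p in enumerate(eos_probs):
        if p >= threshold and t - last >= min_gap:
            bounds.append(t)
            last = t
    return bounds

probs = [0.0] * 50
probs[12] = 0.9   # strong boundary evidence
probs[14] = 0.8   # suppressed: too close to frame 12
probs[40] = 0.7
print(ctc_boundaries(probs))  # → [12, 40]
```

The appeal of this family of approaches, as the abstract argues, is that the same encoder that produces the translation also yields the boundary evidence, so segmentation adds no parameters and no extra forward pass.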
Efficient yet Competitive Speech Translation: FBK@IWSLT2022
The primary goal of FBK’s systems submitted to the IWSLT 2022 offline and simultaneous speech translation tasks is to reduce model training costs without sacrificing translation quality. As such, we first question the need for ASR pre-training, showing that it is not essential to achieve competitive results. Second, we focus on data filtering, showing that a simple method that looks at the ratio between source and target characters yields a quality improvement of 1 BLEU. Third, we compare different methods to reduce the detrimental effect of the segmentation mismatch between training data, which is manually segmented at sentence level, and inference data, which is automatically segmented. Towards the same goal of training cost reduction, we participate in the simultaneous task with the same model trained for offline ST. The effectiveness of our lightweight training strategy is shown by the high score obtained on the MuST-C en-de corpus (26.7 BLEU) and is confirmed in high-resource data conditions by a 1.6 BLEU improvement on the IWSLT2020 test set over last year’s winning system.
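The character-ratio filtering described above is simple enough to sketch directly; the thresholds and the example sentence pairs below are illustrative assumptions, not FBK's actual values or data:

```python
def ratio_filter(pairs, low=0.5, high=2.0):
    """Keep (source, target) pairs whose character-length ratio is
    plausible; pairs far outside the band are likely misaligned."""
    kept = []
    for src, tgt in pairs:
        if not src or not tgt:
            continue  # drop empty sides outright
        r = len(src) / len(tgt)
        if low <= r <= high:
            kept.append((src, tgt))
    return kept

data = [
    ("Hello, how are you?", "Hallo, wie geht es dir?"),  # plausible pair
    ("Thanks.", "Vielen Dank fuer Ihre Aufmerksamkeit und auf Wiedersehen."),
]
print(ratio_filter(data))  # keeps only the first pair
```

The second pair is dropped because its target is several times longer than its source, a typical symptom of sentence misalignment; the filter needs no models or scoring, which fits the paper's goal of cutting training cost.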