Dealing with training and test segmentation mismatch: FBK@IWSLT2021
This paper describes FBK's system submission to the IWSLT 2021 Offline Speech
Translation task. We participated with a direct model, which is a
Transformer-based architecture trained to translate English speech audio data
into German texts. The training pipeline is characterized by knowledge
distillation and a two-step fine-tuning procedure. Both knowledge distillation
and the first fine-tuning step are carried out on manually segmented real and
synthetic data, the latter being generated with an MT system trained on the
available corpora. In contrast, the second fine-tuning step is carried out on a
random segmentation of the MuST-C v2 En-De dataset. Its main goal is to reduce
the performance drops occurring when a speech translation model trained on
manually segmented data (i.e. an ideal, sentence-like segmentation) is
evaluated on automatically segmented audio (i.e. actual, more realistic testing
conditions). For the same purpose, a custom hybrid segmentation procedure that
accounts for both audio content (pauses) and for the length of the produced
segments is applied to the test data before passing them to the system. At
inference time, we compared this procedure with a baseline segmentation method
based on Voice Activity Detection (VAD). Our results indicate the effectiveness
of the proposed hybrid approach, shown by a reduction of the gap with manual
segmentation from 8.3 to 1.4 BLEU points.
Comment: Accepted at IWSLT202
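The idea of a segmentation that looks at both pauses and segment length can be illustrated with a small sketch. This is not the FBK procedure, whose details the abstract does not give; it is a generic divide-and-conquer strategy, assumed here purely for illustration. The function name `split_on_pauses`, the pause list format, and the `max_len` parameter are all hypothetical.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start, end) in seconds


def split_on_pauses(start: float, end: float,
                    pauses: List[Span], max_len: float) -> List[Span]:
    """Split [start, end] until every piece lasts at most max_len seconds.

    Illustrative assumption: cut at the longest pause lying strictly inside
    the span; if no pause is available, cut at the midpoint.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if end - start <= max_len:
        return [(start, end)]
    inside = [(s, e) for s, e in pauses if start < s and e < end]
    if inside:
        # Audio content criterion: prefer the longest pause as the cut point.
        s, e = max(inside, key=lambda p: p[1] - p[0])
        return (split_on_pauses(start, s, pauses, max_len)
                + split_on_pauses(e, end, pauses, max_len))
    # Length criterion only: no pause to exploit, so halve the span.
    mid = (start + end) / 2
    return (split_on_pauses(start, mid, pauses, max_len)
            + split_on_pauses(mid, end, pauses, max_len))


# Hypothetical pauses, e.g. as detected by a VAD tool.
pauses = [(4.0, 4.2), (9.0, 10.0), (15.0, 15.3)]
segments = split_on_pauses(0.0, 20.0, pauses, max_len=10.0)
```

With a smaller `max_len`, the shorter pauses are used as well, so segments stay closer to sentence-like units than with fixed-length chunking.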
End-to-End Speech Translation with Pre-trained Models and Adapters: UPC at IWSLT 2021
This paper describes the submission to the IWSLT 2021 offline speech
translation task by the UPC Machine Translation group. The task consists of
building a system capable of translating English audio recordings extracted
from TED talks into German text. Submitted systems can be either cascade or
end-to-end and use a custom or given segmentation. Our submission is an
end-to-end speech translation system, which combines pre-trained models
(Wav2Vec 2.0 and mBART) with coupling modules between the encoder and decoder,
and uses an efficient fine-tuning technique, which trains only 20% of its total
parameters. We show that adding an Adapter to the system and pre-training it
can increase the convergence speed and the final result, with which we achieve
a BLEU score of 27.3 on the MuST-C test set. Our final model is an ensemble
that obtains a BLEU score of 28.22 on the same set. Our submission also uses a
custom segmentation algorithm that employs pre-trained Wav2Vec 2.0 for
identifying periods of untranscribable text and can bring improvements of 2.5
to 3 BLEU points on the IWSLT 2019 test set, as compared to the result with the
given segmentation.
Comment: Submitted to IWSLT 2021; changed the title and added submission result
Towards Automatic Subtitling: Assessing the Quality of Old and New Resources
Growing needs in localising multimedia content for global audiences have resulted in Neural Machine Translation (NMT) gradually becoming an established practice in the field of subtitling in order to reduce costs and turn-around times. Contrary to text translation, subtitling is subject to spatial and temporal constraints, which greatly increase the post-processing effort required to restore the NMT output to a proper subtitle format. In our previous work (Karakanta, Negri, and Turchi 2019), we identified several missing elements in the corpora available for training NMT systems specifically tailored for subtitling. In this work, we compare the previously studied corpora with MuST-Cinema, a corpus enabling end-to-end speech to subtitles translation, in terms of the conformity to the constraints of: 1) length and reading speed; and 2) proper line breaks. We show that MuST-Cinema conforms to these constraints and discuss the recent progress the corpus has facilitated in end-to-end speech to subtitles translation
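The two kinds of constraints named in this abstract (length and reading speed) can be checked mechanically. The sketch below is only an illustration: the thresholds of 42 characters per line, 2 lines per subtitle, and 21 characters per second are assumed defaults, not values stated in the abstract, and the function name `check_subtitle` is hypothetical. Spaces are counted as characters here; conventions differ on this point.

```python
from typing import Dict, List


def check_subtitle(lines: List[str], start: float, end: float,
                   max_chars: int = 42, max_lines: int = 2,
                   max_cps: float = 21.0) -> Dict[str, bool]:
    """Check one subtitle block against length and reading-speed limits.

    start and end are display times in seconds; all limits are assumptions.
    """
    duration = end - start
    if duration <= 0:
        raise ValueError("end must be later than start")
    n_chars = sum(len(line) for line in lines)
    return {
        # Spatial constraint: number of lines and characters per line.
        "length_ok": len(lines) <= max_lines
                     and all(len(line) <= max_chars for line in lines),
        # Temporal constraint: characters per second on screen.
        "reading_speed_ok": n_chars / duration <= max_cps,
    }


block = ["Growing needs in localising", "multimedia content"]
result = check_subtitle(block, start=0.0, end=3.0)
```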
An Empirical Study of End-to-end Simultaneous Speech Translation Decoding Strategies
This paper proposes a decoding strategy for end-to-end simultaneous speech
translation. We leverage end-to-end models trained in offline mode and conduct
an empirical study for two language pairs (English-to-German and
English-to-Portuguese). We also investigate different output token
granularities including characters and Byte Pair Encoding (BPE) units. The
results show that the proposed decoding approach makes it possible to control the
BLEU/Average Lagging trade-off across different latency regimes. Our best decoding settings
achieve comparable results with a strong cascade model evaluated on the
simultaneous translation track of the IWSLT 2020 shared task.
Comment: This paper has been accepted for presentation at IEEE ICASSP 202
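The Average Lagging metric mentioned in this abstract can be illustrated with its commonly used token-level definition: AL is the mean of g(t) - (t - 1) / gamma over t = 1..tau, where g(t) is the number of source tokens read before target token t is written, gamma = |y| / |x|, and tau is the first t at which the whole source has been read. The sketch below assumes that text-based form; speech systems usually measure delays in time rather than tokens, which is not shown. The name `average_lagging` is hypothetical.

```python
from typing import Sequence


def average_lagging(delays: Sequence[int], src_len: int) -> float:
    """Token-level Average Lagging.

    delays[t-1] is g(t): source tokens read before writing target token t.
    """
    if not delays or src_len <= 0:
        raise ValueError("need at least one target token and a non-empty source")
    gamma = len(delays) / src_len  # target-to-source length ratio
    total = 0.0
    for t, g in enumerate(delays, start=1):
        total += g - (t - 1) / gamma
        if g >= src_len:
            # tau reached: the full source has been read, stop averaging here.
            return total / t
    # Assumption: if the source is never fully read, average over all steps.
    return total / len(delays)


# A wait-3 policy on a 6-token source with a 6-token target.
al_wait3 = average_lagging([3, 4, 5, 6, 6, 6], src_len=6)
# An offline system reads the whole source before writing anything.
al_offline = average_lagging([6, 6, 6, 6, 6, 6], src_len=6)
```

For equal source and target lengths, a wait-k policy lags by k tokens and an offline system by the full source length, which matches the intuition behind the metric.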
KIT’s IWSLT 2020 SLT Translation System
This paper describes KIT’s submissions to the IWSLT 2020 Speech Translation evaluation campaign. We first participate in the simultaneous translation task, in which our simultaneous models are Transformer-based and can be efficiently trained to obtain low latency with minimal compromise in quality. On the offline speech translation task, we applied our new Speech Transformer architecture to end-to-end speech translation. The obtained model can provide translation quality that is competitive with a complicated cascade. The latter still has the upper hand, thanks to its ability to transparently access the transcription and resegment the inputs to avoid fragmentation.
Impact of Encoding and Segmentation Strategies on End-to-End Simultaneous Speech Translation
Boosted by the simultaneous translation shared task at IWSLT 2020, promising
end-to-end online speech translation approaches were recently proposed. They
consist in incrementally encoding a speech input (in a source language) and
decoding the corresponding text (in a target language) with the best possible
trade-off between latency and translation quality. This paper investigates two
key aspects of end-to-end simultaneous speech translation: (a) how to efficiently
encode the continuous speech flow, and (b) how to segment the speech flow
in order to alternate optimally between reading (R: encoding input) and writing
(W: decoding output) operations. We extend our previously proposed end-to-end
online decoding strategy and show that while replacing BLSTM by ULSTM encoding
degrades performance in offline mode, it actually improves both efficiency and
performance in online mode. We also measure the impact of different methods to
segment the speech signal (using fixed interval boundaries, oracle word
boundaries or randomly set boundaries) and show that our best end-to-end online
decoding strategy is surprisingly the one that alternates R/W operations on
fixed size blocks on our English-German speech translation setup.
Comment: Accepted for presentation at Interspeech 202
CTC-based Compression for Direct Speech Translation
Previous studies demonstrated that a dynamic phone-informed compression of
the input audio is beneficial for speech translation (ST). However, they
required a dedicated model for phone recognition and did not test this solution
for direct ST, in which a single model translates the input audio into the
target language without intermediate representations. In this work, we propose
the first method able to perform a dynamic compression of the input in direct ST
models. In particular, we exploit the Connectionist Temporal Classification
(CTC) to compress the input sequence according to its phonetic characteristics.
Our experiments demonstrate that our solution brings a 1.3-1.5 BLEU improvement
over a strong baseline on two language pairs (English-Italian and
English-German), while also reducing the memory footprint by more than 10%.
Comment: Accepted at EACL202
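The core idea of compressing the input according to CTC predictions can be sketched as follows: consecutive frames that receive the same CTC label are merged into a single vector. Averaging is assumed here as the merge operation, and blank frames are treated like any other label; the abstract does not specify either choice, so both are illustrative assumptions. The function name `ctc_compress` is hypothetical, and plain Python lists stand in for encoder states.

```python
from itertools import groupby
from typing import List, Sequence


def ctc_compress(frames: Sequence[Sequence[float]],
                 labels: Sequence[str]) -> List[List[float]]:
    """Merge runs of frames that share the same per-frame CTC label.

    frames[i] is the feature vector of frame i and labels[i] its most likely
    CTC label. Each run of identical labels is replaced by its mean vector.
    """
    if len(frames) != len(labels):
        raise ValueError("frames and labels must have the same length")
    compressed = []
    idx = 0
    for _, run in groupby(labels):
        n = len(list(run))           # length of the current run
        chunk = frames[idx:idx + n]  # frames belonging to that run
        dim = len(chunk[0])
        compressed.append([sum(vec[d] for vec in chunk) / n for d in range(dim)])
        idx += n
    return compressed


frames = [[1.0, 1.0], [3.0, 5.0], [2.0, 2.0], [4.0, 0.0], [6.0, 2.0], [8.0, 4.0]]
labels = ["a", "a", "_", "b", "b", "b"]  # "_" stands for the CTC blank
shorter = ctc_compress(frames, labels)   # 6 frames become 3 vectors
```

Because the output length depends on how many label changes the CTC module predicts, the compression is dynamic rather than a fixed downsampling factor.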