KIT's Multilingual Speech Translation System for IWSLT 2023
Many existing speech translation benchmarks focus on native-English speech in
high-quality recording conditions, which often do not match the conditions of
real-life use cases. In this paper, we describe our speech translation system
for the multilingual track of IWSLT 2023, which focuses on the translation of
scientific conference talks. The test condition features accented input speech
and terminology-dense content. The task requires translation into 10 languages
with varying amounts of resources. In the absence of training data from the
target domain, we use a retrieval-based approach (kNN-MT) for effective
adaptation (+0.8 BLEU for speech translation). We also use adapters to easily
integrate incremental training data from data augmentation, and show that this
matches the performance of re-training. We observe that cascaded systems are
more easily adaptable to specific target domains, due to their separate
modules. Our cascaded speech system substantially outperforms its end-to-end
counterpart on scientific talk translation, although their performance remains
similar on TED talks.
Comment: IWSLT 2023
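As a rough illustration of the retrieval-based adaptation mentioned above, the
following numpy sketch implements the core kNN-MT step: retrieving nearest
neighbors of a decoder state from a datastore and interpolating their
target-token distribution with the model's. The datastore layout, temperature,
and interpolation weight lam are illustrative assumptions, not the system's
actual configuration.

```python
import numpy as np

def knn_mt_distribution(model_probs, keys, values, query,
                        k=8, temperature=10.0, lam=0.5):
    """Interpolate the model's next-token distribution with a kNN
    distribution retrieved from a (key, value) datastore, as in kNN-MT.

    keys:   (N, d) array of stored decoder hidden states
    values: (N,) array of the target-token ids that followed each key
    query:  (d,) decoder hidden state at the current step
    """
    # L2 distance from the query to every stored key
    dists = np.linalg.norm(keys - query, axis=1)
    nn = np.argsort(dists)[:k]
    # Softmax over negative distances of the k nearest neighbours
    w = np.exp(-dists[nn] / temperature)
    w /= w.sum()
    # Scatter neighbour weights onto their target-token ids
    knn_probs = np.zeros_like(model_probs)
    np.add.at(knn_probs, values[nn], w)
    # Fixed-weight interpolation of the two distributions
    return lam * knn_probs + (1.0 - lam) * model_probs
```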
End-to-End Evaluation for Low-Latency Simultaneous Speech Translation
The challenge of low-latency speech translation has recently drawn significant
interest in the research community as shown by several publications and shared
tasks. Therefore, it is essential to evaluate these different approaches in
realistic scenarios. However, current evaluations cover only specific aspects
of these systems, and it is often not possible to compare different approaches.
In this work, we propose the first framework to perform and evaluate the
various aspects of low-latency speech translation under realistic conditions.
The evaluation is carried out in an end-to-end fashion. This includes the
segmentation of the audio as well as the run-time of the different components.
We then compare different approaches to low-latency speech translation
using this framework. We evaluate models with the option to revise the output
as well as methods with fixed output. Furthermore, we directly compare
state-of-the-art cascaded as well as end-to-end systems. Finally, the framework
automatically evaluates translation quality as well as latency, and also
provides a web interface that displays the low-latency model outputs to the
user.
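Since latency evaluation is central here, a small self-contained sketch of one
standard latency metric such a framework could report, Average Lagging (Ma et
al., 2019), may be useful. The delay representation below (source tokens read
per emitted target token) is an assumption for illustration.

```python
def average_lagging(delays, src_len, tgt_len):
    """Average Lagging (Ma et al., 2019): mean number of source tokens by
    which the system trails an ideally paced simultaneous translator.

    delays[t-1]: number of source tokens read when target token t was emitted.
    """
    gamma = tgt_len / src_len  # ideal emission rate
    # tau: first target position emitted after the full source was consumed
    tau = next((t for t, g in enumerate(delays, start=1) if g >= src_len),
               len(delays))
    lags = [g - (t - 1) / gamma for t, g in enumerate(delays[:tau], start=1)]
    return sum(lags) / tau

# Example: a wait-3 policy on a 10-token source and 10-token target lags by 3
print(average_lagging([3, 4, 5, 6, 7, 8, 9, 10, 10, 10], 10, 10))  # 3.0
```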
CUNI-KIT System for Simultaneous Speech Translation Task at IWSLT 2022
In this paper, we describe our submission to the Simultaneous Speech Translation task at IWSLT 2022. We explore strategies to utilize an offline model in a simultaneous setting without the need to modify the original model. In our experiments, we show that our onlinization algorithm is almost on par with the offline setting while being 3x faster than offline in terms of latency on the test set. We also show that the onlinized offline model outperforms the best IWSLT 2021 simultaneous system in the medium and high latency regimes and is almost on par in the low latency regime. We make our system publicly available.
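The abstract leaves the onlinization algorithm unspecified; the sketch below
shows one common way to run an unmodified offline model in a simultaneous
setting, a "local agreement" stable-prefix policy. Here offline_translate is a
hypothetical function wrapping the offline model, not the submission's actual
code.

```python
def onlinize(offline_translate, audio_chunks):
    """Incrementally decode with an unmodified offline model: after each new
    audio chunk, re-translate the whole prefix and emit only the tokens on
    which the last two hypotheses agree (a stable-prefix policy)."""
    committed, prev_hyp, audio = 0, [], []
    for chunk in audio_chunks:
        audio.extend(chunk)
        hyp = offline_translate(audio)  # full re-decode of the audio prefix
        # Length of the longest common prefix of the two latest hypotheses
        agree = 0
        for a, b in zip(prev_hyp, hyp):
            if a != b:
                break
            agree += 1
        # Emit tokens that just became stable; never retract committed output
        if agree > committed:
            yield hyp[committed:agree]
            committed = agree
        prev_hyp = hyp
    # Input finished: flush the remainder of the final hypothesis
    yield prev_hyp[committed:]
```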
Effective combination of pretrained models - KIT@IWSLT2022
Pretrained models in the acoustic and textual modalities can potentially improve speech translation for both cascaded and end-to-end approaches. In this evaluation, we investigate this question empirically by using the wav2vec, mBART50 and DeltaLM models to improve our text and speech translation models. The experiments show that combining these models with an advanced audio segmentation method improves over the previous end-to-end system by up to 7 BLEU points. More importantly, they show that, given enough data and modeling capacity to overcome the training difficulty, end-to-end systems can outperform even very competitive cascaded systems. In our experiments, this gap can be as large as 2.0 BLEU points, the same margin by which cascaded systems have often led over the years.
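As a sketch of the general pattern, not the submission's actual setup, one can
pair a pretrained speech encoder with a pretrained text decoder via HuggingFace
Transformers; the checkpoint choices below are assumptions, and the
cross-attention connecting the two models is freshly initialized, so it still
has to be trained on speech-translation data.

```python
from transformers import SpeechEncoderDecoderModel

# Illustrative pairing of pretrained acoustic and textual models; the
# checkpoints are assumptions, not necessarily those used in the submission.
model = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained(
    "facebook/wav2vec2-large-xlsr-53",  # pretrained speech encoder
    "facebook/mbart-large-50",          # pretrained multilingual text decoder
)
# The decoder's cross-attention weights are randomly initialized here, so the
# combined model must be fine-tuned before it is useful. mBART decoding
# conventionally starts from the end-of-sequence token.
model.config.pad_token_id = model.decoder.config.pad_token_id
model.config.decoder_start_token_id = model.decoder.config.eos_token_id
```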
Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos
In this paper, we propose a neural end-to-end system for voice preserving,
lip-synchronous translation of videos. The system is designed to combine
multiple component models and produces a video of the original speaker speaking
in the target language that is lip-synchronous with the target speech, yet
maintains the speech emphases, voice characteristics, and face video of the
original speaker. The pipeline starts with automatic speech recognition,
including
emphasis detection, followed by a translation model. The translated text is
then synthesized by a Text-to-Speech model that recreates the original emphases
mapped from the original sentence. The resulting synthetic voice is then mapped
back to the original speaker's voice using a voice conversion model. Finally,
to synchronize the lips of the speaker with the translated audio, a conditional
generative adversarial network-based model generates frames of adapted lip
movements with respect to the input face image as well as the output of the
voice conversion model. In the end, the system combines the generated video
with the converted audio to produce the final output. The result is a video of
a speaker speaking in another language without actually knowing it. To evaluate
our design, we present a user study of the complete system as well as separate
evaluations of the single components. Since there is no available dataset to
evaluate our whole system, we collect a test set and evaluate our system on
this test set. The results indicate that our system is able to generate
convincing videos of the original speaker speaking the target language while
preserving the original speaker's characteristics. The collected dataset will
be shared.
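To make the component flow concrete, here is a minimal skeleton of the
described pipeline; every interface below is a hypothetical stand-in for the
actual component models.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class FaceDubbingPipeline:
    """Skeleton of the described pipeline; each field stands in for one
    component model (all interfaces here are hypothetical)."""
    asr: Callable            # audio -> (transcript, emphasis_marks)
    translate: Callable      # transcript -> target-language text
    tts: Callable            # (text, emphasis_marks) -> synthetic speech
    voice_convert: Callable  # synthetic speech -> speech in original voice
    lip_sync: Callable       # (face_frames, speech) -> lip-adapted frames

    def run(self, audio, face_frames):
        transcript, emphasis = self.asr(audio)      # ASR + emphasis detection
        text = self.translate(transcript)           # machine translation
        synth = self.tts(text, emphasis)            # TTS with mapped emphases
        voice = self.voice_convert(synth)           # back to original voice
        frames = self.lip_sync(face_frames, voice)  # conditional GAN lip-sync
        return frames, voice  # mux into the final output video downstream
```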