Who Are We Talking About? Handling Person Names in Speech Translation
Recent work has shown that systems for speech translation (ST), like those for automatic speech recognition (ASR), handle person names poorly. This shortcoming not only leads to errors that can seriously distort the meaning of the input, but also hinders the adoption of such systems in application scenarios (like computer-assisted interpreting) where the translation of named entities, such as person names, is crucial. In this paper, we first analyse the outputs of ASR/ST systems to identify the reasons for failures in person name transcription/translation. Besides their frequency in the training data, we pinpoint the nationality of the referred person as a key factor. We then mitigate the problem by creating multilingual models, and further improve our ST systems by forcing them to jointly generate transcripts and translations, prioritising the former over the latter. Overall, our solutions yield an average relative improvement of 47.8% in token-level person name accuracy across three language pairs (en->es, fr, it).
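The metric reported above (token-level person name accuracy) can be computed in a few lines. The following is a minimal sketch, not the authors' evaluation script: the function name, input format, and whitespace tokenization are illustrative assumptions.

```python
def name_token_accuracy(hypotheses, name_annotations):
    """Fraction of annotated person-name tokens that appear in the
    corresponding output. `name_annotations[i]` lists the person names
    (strings) expected in `hypotheses[i]`."""
    correct, total = 0, 0
    for hyp, names in zip(hypotheses, name_annotations):
        hyp_tokens = set(hyp.lower().split())
        for name in names:
            for tok in name.lower().split():
                total += 1
                correct += tok in hyp_tokens
    return correct / total if total else 0.0

# Example: a hypothesis containing only the surname scores 0.5 for "Marie Curie"
# name_token_accuracy(["she spoke with curie"], [["Marie Curie"]])  -> 0.5
```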
CTC-based Compression for Direct Speech Translation
Previous studies demonstrated that a dynamic phone-informed compression of
the input audio is beneficial for speech translation (ST). However, they
required a dedicated model for phone recognition and did not test this solution
for direct ST, in which a single model translates the input audio into the
target language without intermediate representations. In this work, we propose
the first method able to perform a dynamic compression of the input in direct ST
models. In particular, we exploit the Connectionist Temporal Classification
(CTC) to compress the input sequence according to its phonetic characteristics.
Our experiments demonstrate that our solution brings a 1.3-1.5 BLEU improvement
over a strong baseline on two language pairs (English-Italian and
English-German), while also reducing the memory footprint by more than 10%.
Comment: Accepted at EACL 2021.
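To make the compression step concrete, below is a minimal PyTorch sketch of the averaging variant: consecutive encoder states that receive the same CTC prediction (including blank) are merged into a single vector. Tensor shapes, the merging strategy, and the function name are assumptions, not the authors' released code.

```python
import torch

def ctc_compress(encoder_states, ctc_logits):
    """Collapse consecutive time steps that receive the same CTC prediction
    (argmax over the CTC vocabulary, blank included) by averaging their
    encoder states. Shapes: encoder_states (T, d), ctc_logits (T, V)."""
    preds = ctc_logits.argmax(dim=-1)                 # (T,) frame-level predictions
    compressed, start = [], 0
    for t in range(1, len(preds) + 1):
        # close the current run when the prediction changes or the input ends
        if t == len(preds) or preds[t] != preds[start]:
            compressed.append(encoder_states[start:t].mean(dim=0))
            start = t
    return torch.stack(compressed)                    # (T', d), with T' <= T
```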
Does Simultaneous Speech Translation need Simultaneous Models?
In simultaneous speech translation (SimulST), finding the best trade-off
between high translation quality and low latency is a challenging task. To meet
the latency constraints posed by the different application scenarios, multiple
dedicated SimulST models are usually trained and maintained, generating high
computational costs. In this paper, motivated by the increased social and
environmental impact caused by these costs, we investigate whether a single
model trained offline can serve not only the offline but also the simultaneous
task without the need for any additional training or adaptation. Experiments on
en->{de, es} indicate that, aside from facilitating the adoption of
well-established offline techniques and architectures without affecting
latency, the offline solution achieves similar or better translation quality
compared to the same model trained in simultaneous settings, as well as being
competitive with the SimulST state of the art.
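For illustration only, the sketch below shows one way an offline-trained ST model could be driven incrementally at inference time. `model.encode`, `model.decode_next_token`, the chunking scheme, and the wait-k-style policy are hypothetical placeholders; the abstract does not specify the actual decoding strategy used in the paper.

```python
# All model methods below are hypothetical placeholders; this only illustrates
# reusing an offline-trained model for incremental (simultaneous) inference.

def simultaneous_decode(model, audio_stream, wait_chunks=3):
    audio, output = [], []
    for chunk in audio_stream:                     # audio arrives chunk by chunk
        audio.append(chunk)
        if len(audio) < wait_chunks:               # wait-k-style: read before writing
            continue
        states = model.encode(audio)               # re-encode the growing prefix
        token = model.decode_next_token(states, output)
        if token not in (None, "<eos>"):
            output.append(token)                   # emit at most one token per chunk
    states = model.encode(audio)                   # input finished: flush the rest
    while (token := model.decode_next_token(states, output)) not in (None, "<eos>"):
        output.append(token)
    return output
```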
Over-Generation Cannot Be Rewarded: Length-Adaptive Average Lagging for Simultaneous Speech Translation
Simultaneous speech translation (SimulST) systems aim at generating their output with the lowest possible latency, which is normally computed in terms of Average Lagging (AL). In this paper, we highlight that, despite its widespread adoption, AL provides underestimated scores for systems that generate longer predictions compared to the corresponding references. We also show that this problem has practical relevance, as recent SimulST systems have indeed a tendency to over-generate. As a solution, we propose LAAL (Length-Adaptive Average Lagging), a modified version of the metric that takes into account the over-generation phenomenon and allows for the unbiased evaluation of both under- and over-generating systems.
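A minimal sketch of how the proposed metric can be computed from token emission delays is given below. Variable names, units, and the cut-off convention are assumptions loosely following SimulEval-style latency evaluation; this is not the official implementation.

```python
def laal(delays, src_duration, ref_len):
    """Length-Adaptive Average Lagging (sketch). `delays[i]` is the amount of
    source audio (e.g., in ms) read when the (i+1)-th output token was emitted,
    `src_duration` is the full source duration, `ref_len` the reference length
    in tokens."""
    hyp_len = len(delays)
    # sum over tokens up to (and including) the first one emitted after the
    # whole source has been read
    tau = next((i + 1 for i, d in enumerate(delays) if d >= src_duration), hyp_len)
    # length-adaptive correction: the oracle generates max(|Y|, |Y*|) tokens,
    # so over-generation is no longer rewarded
    denom = max(hyp_len, ref_len)
    return sum(delays[i] - i * src_duration / denom for i in range(tau)) / tau
```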
Dealing with training and test segmentation mismatch: FBK@IWSLT2021
This paper describes FBK's system submission to the IWSLT 2021 Offline Speech
Translation task. We participated with a direct model, which is a
Transformer-based architecture trained to translate English speech audio data
into German texts. The training pipeline is characterized by knowledge
distillation and a two-step fine-tuning procedure. Both knowledge distillation
and the first fine-tuning step are carried out on manually segmented real and
synthetic data, the latter being generated with an MT system trained on the
available corpora. In contrast, the second fine-tuning step is carried out on a
random segmentation of the MuST-C v2 En-De dataset. Its main goal is to reduce
the performance drops occurring when a speech translation model trained on
manually segmented data (i.e. an ideal, sentence-like segmentation) is
evaluated on automatically segmented audio (i.e. actual, more realistic testing
conditions). For the same purpose, a custom hybrid segmentation procedure that
accounts for both the audio content (pauses) and the length of the produced
segments is applied to the test data before passing them to the system. At
inference time, we compared this procedure with a baseline segmentation method
based on Voice Activity Detection (VAD). Our results indicate the effectiveness
of the proposed hybrid approach, shown by a reduction of the gap with manual
segmentation from 8.3 to 1.4 BLEU points.
Comment: Accepted at IWSLT 2021.
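As a rough illustration of a hybrid (content- and length-aware) segmentation policy, the sketch below places cuts inside detected pauses while keeping segments within a length range. All thresholds and rules are assumptions; the procedure actually used in the paper may differ.

```python
def hybrid_segment(pauses, total_dur, max_len=20.0, min_len=5.0, min_pause=0.2):
    """Cut at detected pauses (content), but keep segments roughly between
    `min_len` and `max_len` seconds (length). `pauses` is a list of
    (start, end) silence intervals in seconds; cuts are only placed inside
    pauses, so very long pause-free stretches remain uncut in this sketch."""
    cuts, seg_start = [], 0.0
    for p_start, p_end in pauses:
        too_short = p_start - seg_start < min_len     # would give a tiny segment
        weak_pause = p_end - p_start < min_pause      # probably not a real pause
        must_cut = p_start - seg_start >= max_len     # segment is getting too long
        if (too_short or weak_pause) and not must_cut:
            continue
        cut = (p_start + p_end) / 2                   # cut in the middle of the pause
        cuts.append((seg_start, cut))
        seg_start = cut
    cuts.append((seg_start, total_dur))
    return cuts
```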
End-to-End Speech-Translation with Knowledge Distillation: FBK@IWSLT2020
This paper describes FBK's participation in the IWSLT 2020 offline speech
translation (ST) task. The task evaluates systems' ability to translate English
TED talks audio into German texts. The test talks are provided in two versions:
one contains the data already segmented with automatic tools and the other is
the raw data without any segmentation. Participants can decide whether to work
on custom segmentation or not. We used the provided segmentation. Our system is
an end-to-end model based on an adaptation of the Transformer for speech data.
Its training process is the main focus of this paper and it is based on: i)
transfer learning (ASR pretraining and knowledge distillation), ii) data
augmentation (SpecAugment, time stretch and synthetic data), iii) combining
synthetic and real data marked as different domains, and iv) multi-task
learning using the CTC loss. Finally, after the training with word-level
knowledge distillation is complete, our ST models are fine-tuned using label
smoothed cross entropy. Our best model scored 29 BLEU on the MuST-C En-De test
set, a strong result compared to those reported in recent papers, and 23.7 BLEU on the same data segmented with VAD, showing the need for research into solutions addressing this specific data condition.
Comment: Accepted at IWSLT 2020.
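Among the listed ingredients, word-level knowledge distillation is easy to sketch: at every target position, the ST student is pushed towards the output distribution of an MT teacher. The PyTorch snippet below is a generic formulation; the temperature, reduction, and the omission of padding masking are assumptions.

```python
import torch.nn.functional as F

def word_level_kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Word-level knowledge distillation: match the ST student's per-token
    distribution to the MT teacher's. Shapes: (batch, tgt_len, vocab).
    Padding positions are not masked here for brevity."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    # KL(teacher || student), averaged over the batch
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * (t * t)
```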
On the Dynamics of Gender Learning in Speech Translation
Due to the complexity of bias and the opaque nature of current neural approaches, there is a rising interest in auditing language technologies. In this work, we contribute to this line of inquiry by exploring the emergence of gender bias in Speech Translation (ST). As a new perspective, rather than focusing on the final systems only, we examine their evolution over the course of training. In this way, we are able to account for different variables related to the learning dynamics of gender translation, and investigate when and how gender divides emerge in ST. Accordingly, for three language pairs (en→es, fr, it) we compare how ST systems behave for masculine and feminine translation at several levels of granularity. We find that masculine and feminine curves are dissimilar, with the feminine one being characterized by more erratic behaviour and late improvements over the course of training. Also, depending on the considered phenomena, their learning trends can be either in antiphase or parallel. Overall, we show how such a progressive analysis can shed light on the reliability and time-wise acquisition of gender, which is concealed by static evaluations and standard metrics.
On Target Segmentation for Direct Speech Translation
Recent studies on direct speech translation show continuous improvements by
means of data augmentation techniques and bigger deep learning models. While
these methods are helping to close the gap between this new approach and the
more traditional cascaded one, there are many incongruities among different
studies that make it difficult to assess the state of the art. Surprisingly,
one point of discussion is the segmentation of the target text. Character-level
segmentation was initially proposed to obtain an open vocabulary, but it
results in long sequences and long training times. Then, subword-level
segmentation became the state of the art in neural machine translation as it
produces shorter sequences that reduce the training time, while being superior
to word-level models. As such, recent works on speech translation started using
target subwords despite the initial use of characters and some recent claims of
better results at the character level. In this work, we perform an extensive
comparison of the two methods on three benchmarks covering 8 language
directions and multilingual training. Subword-level segmentation compares
favorably in all settings, outperforming its character-level counterpart by 1 to 3 BLEU points.
Comment: 14 pages, single column, 4 figures; accepted for presentation at the AMTA 2020 research track.
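A toy example of the two target segmentations under comparison; the subword split shown is a plausible BPE-style segmentation chosen by hand, not the output of any trained model.

```python
# Character-level vs subword-level segmentation of the same German target.
sentence = "Die Katze schläft"

char_tokens = list(sentence.replace(" ", "▁"))          # character-level
subword_tokens = ["▁Die", "▁Katze", "▁schl", "äft"]     # assumed BPE-style split

print(len(char_tokens), char_tokens)        # 17 tokens -> longer sequences
print(len(subword_tokens), subword_tokens)  # 4 tokens  -> shorter sequences
```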
Reproducibility is Nothing without Correctness: The Importance of Testing Code in NLP
Despite its pivotal role in research experiments, code correctness is often
presumed only on the basis of the perceived quality of the results. This comes
with the risk of erroneous outcomes and potentially misleading findings. To
address this issue, we posit that the current focus on result reproducibility
should go hand in hand with the emphasis on coding best practices. We bolster
our call to the NLP community by presenting a case study, in which we identify
(and correct) three bugs in widely used open-source implementations of the
state-of-the-art Conformer architecture. Through comparative experiments on
automatic speech recognition and translation in various language settings, we
demonstrate that the existence of bugs does not prevent the achievement of good and reproducible results, yet can lead to incorrect conclusions that potentially misguide future research. In response, this study is a call to action toward the adoption of coding best practices aimed at fostering correctness and improving the quality of the developed software.
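In the spirit of the testing practices advocated above, the snippet below sketches a unit test that would catch one common class of bugs: padding changing the encoding of an utterance. The `encoder(feats, lengths) -> (out, out_lengths)` interface is an assumed placeholder, not any specific library's API, and the test is not tied to the particular bugs identified in the paper.

```python
import torch

def test_padding_invariance(encoder, feat_dim=80):
    """The encoding of an utterance must not change when it is batched with a
    longer, zero-padded one."""
    torch.manual_seed(0)
    short_utt = torch.randn(50, feat_dim)
    long_utt = torch.randn(120, feat_dim)
    batch = torch.zeros(2, 120, feat_dim)          # zero-pad the short utterance
    batch[0, :50], batch[1] = short_utt, long_utt
    out_b, len_b = encoder(batch, torch.tensor([50, 120]))
    out_s, len_s = encoder(short_utt.unsqueeze(0), torch.tensor([50]))
    n = int(len_s[0])
    assert int(len_b[0]) == n, "output lengths differ with and without padding"
    assert torch.allclose(out_b[0, :n], out_s[0, :n], atol=1e-5), \
        "padding changed the encoder output"
```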
Test Suites Task: Evaluation of Gender Fairness in MT with MuST-SHE and INES
As part of the WMT-2023 "Test suites" shared task, in this paper we summarize
the results of two test suites evaluations: MuST-SHE-WMT23 and INES. By
focusing on the en-de and de-en language pairs, we rely on these newly created
test suites to investigate systems' ability to translate feminine and masculine
gender and produce gender-inclusive translations. Furthermore, we discuss
metrics associated with our test suites and validate them by means of human
evaluations. Our results indicate that systems achieve reasonable and
comparable performance in correctly translating both feminine and masculine
gender forms for naturalistic gender phenomena. In contrast, the generation of inclusive language forms in translation emerges as a challenging task for all the evaluated MT models, indicating room for future improvements and research on the topic.
Comment: Accepted at WMT 2023.
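As a rough sketch of the kind of coverage-based metric such test suites rely on, the function below checks, for each annotated term, whether the correctly gendered or the wrongly gendered form appears in the output. Field names, the single-word forms, and the matching rules are simplifying assumptions, not the official MuST-SHE or INES scoring scripts.

```python
def gender_accuracy(hypotheses, annotations):
    """Coverage-based gender accuracy (sketch). `annotations[i]` lists
    (correct_form, wrong_form) pairs for the i-th sentence; a term is
    measured only if exactly one of the two forms appears in the output."""
    correct = wrong = 0
    for hyp, terms in zip(hypotheses, annotations):
        tokens = hyp.lower().split()
        for right_form, wrong_form in terms:
            found_right = right_form.lower() in tokens
            found_wrong = wrong_form.lower() in tokens
            if found_right and not found_wrong:
                correct += 1
            elif found_wrong and not found_right:
                wrong += 1
    measured = correct + wrong
    return correct / measured if measured else 0.0
```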