127 research outputs found
Good, but not always Fair: An Evaluation of Gender Bias for three commercial Machine Translation Systems
Machine Translation (MT) continues to make significant strides in quality and
is increasingly adopted on a larger scale. Consequently, analyses have been
redirected to more nuanced aspects, intricate phenomena, as well as potential
risks that may arise from the widespread use of MT tools. Along this line, this
paper offers a meticulous assessment of three commercial MT systems - Google
Translate, DeepL, and Modern MT - with a specific focus on gender translation
and bias. For three language pairs (English/Spanish, English/Italian, and
English/French), we scrutinize the behavior of such systems at several levels
of granularity and on a variety of naturally occurring gender phenomena in
translation. Our study takes stock of the current state of online MT tools, by
revealing significant discrepancies in the gender translation of the three
systems, with each system displaying varying degrees of bias despite their
overall translation quality. Comment: Under review at HERMES Journal
Good, but not always Fair: An Evaluation of Gender Bias for three Commercial Machine Translation Systems
Machine Translation (MT) continues to make significant strides in quality and is increasingly adopted on a larger scale. Consequently, analyses have been redirected to more nuanced aspects, intricate phenomena, as well as potential risks that may arise from the widespread use of MT tools. Along this line, this paper offers a meticulous assessment of three commercial MT systems - Google Translate, DeepL, and Modern MT - with a specific focus on gender translation and bias. For three language pairs (English-Spanish, English-Italian, and English-French), we scrutinize the behavior of such systems at several levels of granularity and on a variety of naturally occurring gender phenomena in translation. Our study takes stock of the current state of online MT tools, by revealing significant discrepancies in the gender translation of the three systems, with each system displaying varying degrees of bias despite their overall translation quality
How To Build Competitive Multi-gender Speech Translation Models For Controlling Speaker Gender Translation
When translating from notional gender languages (e.g., English) into
grammatical gender languages (e.g., Italian), the generated translation
requires explicit gender assignments for various words, including those
referring to the speaker. When the source sentence does not convey the
speaker's gender, speech translation (ST) models either rely on the
possibly-misleading vocal traits of the speaker or default to the masculine
gender, the most frequent in existing training corpora. To avoid such biased
and not inclusive behaviors, the gender assignment of speaker-related
expressions should be guided by externally-provided metadata about the
speaker's gender. While previous work has shown that the most effective
solution is represented by separate, dedicated gender-specific models, the goal
of this paper is to achieve the same results by integrating the speaker's
gender metadata into a single "multi-gender" neural ST model, easier to
maintain. Our experiments demonstrate that a single multi-gender model
outperforms gender-specialized ones when trained from scratch (with gender
accuracy gains up to 12.9 for feminine forms), while fine-tuning from existing
ST models does not lead to competitive results. Comment: To appear in CLiC-it 202
MAGMATic: A Multi-domain Academic Gold Standard with Manual Annotation of Terminology for Machine Translation Evaluation
This paper presents MAGMATic (Multidomain Academic Gold Standard with Manual Annotation of Terminology), a novel Italian–English benchmark which allows MT evaluation focused on terminology translation. The data set comprises 2,056 parallel sentences extracted from institutional academic texts, namely course unit and degree program descriptions. This text type is particularly interesting since it contains terminology from multiple domains, e.g. education and different academic disciplines described in the texts. All terms in the English target side of the data set were manually identified and annotated with a domain label, for a total of 7,517 annotated terms. Due to their peculiar features, institutional academic texts represent an interesting test bed for MT. As a further contribution of this paper, we investigate the feasibility of exploiting MT for the translation of this type of document. To this end, we evaluate two state-of-the-art Neural MT systems on MAGMATic, focusing on their ability to translate domain-specific terminology
Do translator trainees trust machine translation? An experiment on post-editing and revision
Despite the importance of trust in any work environment, this concept has rarely been investigated for MT. The present contribution aims at filling this gap by presenting a post-editing experiment carried out with translator trainees. An institutional academic text was translated from Italian into English. All participants worked on the same target text. Half of them were told that the text was a human translation needing revision, while the other half was told that it was an MT output to be post-edited. Temporal and technical effort were measured based on words per second and HTER. Results were complemented with a manual analysis of a subset of the observations
Test Suites Task: Evaluation of Gender Fairness in MT with MuST-SHE and INES
As part of the WMT-2023 "Test suites" shared task, in this paper we summarize
the results of two test suite evaluations: MuST-SHE-WMT23 and INES. By
focusing on the en-de and de-en language pairs, we rely on these newly created
test suites to investigate systems' ability to translate feminine and masculine
gender and produce gender-inclusive translations. Furthermore, we discuss
metrics associated with our test suites and validate them by means of human
evaluations. Our results indicate that systems achieve reasonable and
comparable performance in correctly translating both feminine and masculine
gender forms for naturalistic gender phenomena. Instead, the generation of
inclusive language forms in translation emerges as a challenging task for all
the evaluated MT models, indicating room for future improvements and research
on the topic. Comment: Accepted at WMT 202
Test Suites Task: Evaluation of Gender Fairness in MT with MuST-SHE and INES
As part of the WMT-2023 “Test suites” shared task, in this paper we summarize the results of two test suite evaluations: MuST-SHE-WMT23 and INES. By focusing on the en-de and de-en language pairs, we rely on these newly created test suites to investigate systems’ ability to translate feminine and masculine gender and produce gender-inclusive translations. Furthermore, we discuss metrics associated with our test suites and validate them by means of human evaluations. Our results indicate that systems achieve reasonable and comparable performance in correctly translating both feminine and masculine gender forms for naturalistic gender phenomena. Instead, the generation of inclusive language forms in translation emerges as a challenging task for all the evaluated MT models, indicating room for future improvements and research on the topic. We make MuST-SHE-WMT23 and INES freely available
How do Hyenas deal with Human Speech? Speech Recognition and Translation with ConfHyena
The attention mechanism, a cornerstone of state-of-the-art neural models,
faces computational hurdles in processing long sequences due to its quadratic
complexity. Consequently, research efforts in the last few years focused on
finding more efficient alternatives. Among them, Hyena (Poli et al., 2023)
stands out for achieving competitive results in both language modeling and
image classification, while offering sub-quadratic memory and computational
complexity. Building on these promising results, we propose ConfHyena, a
Conformer whose encoder self-attentions are replaced with an adaptation of
Hyena for speech processing, where the long input sequences cause high
computational costs. Through experiments in automatic speech recognition (for
English) and translation (from English into 8 target languages), we show that
our best ConfHyena model significantly reduces the training time by 27%, at the
cost of minimal quality degradation (~1%), which, in most cases, is not
statistically significant. Comment: Accepted at LREC-COLING 202
Report on the 11th IWSLT Evaluation Campaign
The paper overviews the 11th evaluation campaign organized by the IWSLT workshop. The 2014 evaluation offered multiple tracks on lecture transcription and translation based on the TED Talks corpus. In particular, this year IWSLT included three automatic speech recognition tracks, on English, German and Italian, five speech translation tracks, from English to French, English to German, German to English, English to Italian, and Italian to English, and five text translation tracks, also from English to French, English to German, German to English, English to Italian, and Italian to English. In addition to the official tracks, speech and text translation optional tracks were offered, globally involving 12 other languages: Arabic, Spanish, Portuguese (B), Hebrew, Chinese, Polish, Persian, Slovenian, Turkish, Dutch, Romanian, Russian. Overall, 21 teams participated in the evaluation, for a total of 76 primary runs submitted. Participants were also asked to submit runs on the 2013 test set (progress test set), in order to measure the progress of systems with respect to the previous year. All runs were evaluated with objective metrics, and submissions for two of the official text translation tracks were also evaluated with human post-editing