9 research outputs found
Are the Best Multilingual Document Embeddings simply Based on Sentence Embeddings?
Dense vector representations for textual data are crucial in modern NLP. Word
embeddings and sentence embeddings estimated from raw texts are key in
achieving state-of-the-art results in various tasks requiring semantic
understanding. However, obtaining embeddings at the document level is
challenging due to computational requirements and lack of appropriate data.
Instead, most approaches fall back on computing document embeddings based on
sentence representations. Although there exist architectures and models to
encode documents fully, they are in general limited to English and few other
high-resourced languages. In this work, we provide a systematic comparison of
methods to produce document-level representations from sentences based on
LASER, LaBSE, and Sentence BERT pre-trained multilingual models. We compare
input token number truncation, sentence averaging as well as some simple
windowing and in some cases new augmented and learnable approaches, on 3 multi-
and cross-lingual tasks in 8 languages belonging to 3 different language
families. Our task-based extrinsic evaluations show that, independently of the
language, a clever combination of sentence embeddings is usually better than
encoding the full document as a single unit, even when this is possible. We
demonstrate that while a simple sentence average results in a strong baseline
for classification tasks, more complex combinations are necessary for semantic
tasks. Comment: EACL 2023 Findings paper, to present at LoResM
The effect of domain and diacritics in Yorùbá-English neural machine translation
Massively multilingual machine translation (MT) has shown impressive capabilities, including zero and few-shot translation between low-resource language pairs. However, these models are often evaluated on high-resource languages with the assumption that they generalize to low-resource ones. The difficulty of evaluating MT models on low-resource pairs is often due to lack of standardized evaluation datasets. In this paper, we present MENYO-20k, the first multi-domain parallel corpus with a special focus on clean orthography for Yorùbá-English with standardized train-test splits for benchmarking. We provide several neural MT benchmarks and compare them to the performance of popular pre-trained (massively multilingual) MT models both for the heterogeneous test set and its subdomains. Since these pre-trained models use huge amounts of data with uncertain quality, we also analyze the effect of diacritics, a major characteristic of Yorùbá, in the training data. We investigate how and when this training condition affects the final quality and intelligibility of a translation. Our models outperform massively multilingual models such as Google (+8.7 BLEU) and Facebook M2M (+9.1 BLEU) when translating to Yorùbá, setting a high quality benchmark for future research.
Findings of the Second WMT Shared Task on Sign Language Translation (WMT-SLT23)
This paper presents the results of the Second WMT Shared Task on Sign Language Translation (WMT-SLT23). This shared task is concerned with automatic translation between signed and spoken languages. The task is unusual in the sense that it requires processing visual information (such as video frames or human pose estimation) beyond the well-known paradigm of text-to-text machine translation (MT). The task offers four tracks involving the following languages: Swiss German Sign Language (DSGS), French Sign Language of Switzerland (LSF-CH), Italian Sign Language of Switzerland (LIS-CH), German, French and Italian. Four teams (including one working on a baseline submission) participated in this second edition of the task, all submitting to the DSGS-to-German track. Besides a system ranking and system papers describing state-of-the-art techniques, this shared task makes the following scientific contributions: novel corpora and reproducible baseline systems. Finally, the task also resulted in publicly available sets of system outputs and more human evaluation scores for sign language translation.
First WMT Shared Task on Sign Language Translation (WMT-SLT22)
This paper is a brief summary of the First WMT Shared Task on Sign Language Translation (WMT-SLT22), a project partly funded by EAMT. The focus of this shared task is automatic translation between signed and spoken languages. Details can be found on our website or in the findings paper (Müller et al., 2022).
Role of age and comorbidities in mortality of patients with infective endocarditis.
The aim of this study was to analyse the characteristics of patients with IE in three groups of age and to assess the ability of age and the Charlson Comorbidity Index (CCI) to predict mortality. Prospective cohort study of all patients with IE included in the GAMES Spanish database between 2008 and 2015. Patients were stratified into three age groups: A total of 3120 patients with IE (1327 There were no differences in the clinical presentation of IE between the groups. Age ≥ 80 years, high comorbidity (measured by CCI), and non-performance of surgery were independent predictors of mortality in patients with IE. CCI could help to identify those patients with IE and surgical indication who present a lower risk of in-hospital and 1-year mortality after surgery, especially in th