9 research outputs found

    Are the Best Multilingual Document Embeddings simply Based on Sentence Embeddings?

    Dense vector representations for textual data are crucial in modern NLP. Word embeddings and sentence embeddings estimated from raw texts are key in achieving state-of-the-art results in various tasks requiring semantic understanding. However, obtaining embeddings at the document level is challenging due to computational requirements and lack of appropriate data. Instead, most approaches fall back on computing document embeddings based on sentence representations. Although there exist architectures and models to encode documents fully, they are in general limited to English and a few other high-resource languages. In this work, we provide a systematic comparison of methods to produce document-level representations from sentences based on LASER, LaBSE, and Sentence BERT pre-trained multilingual models. We compare truncation by number of input tokens, sentence averaging, some simple windowing and, in some cases, new augmented and learnable approaches, on 3 multi- and cross-lingual tasks in 8 languages belonging to 3 different language families. Our task-based extrinsic evaluations show that, independently of the language, a clever combination of sentence embeddings is usually better than encoding the full document as a single unit, even when this is possible. We demonstrate that while a simple sentence average results in a strong baseline for classification tasks, more complex combinations are necessary for semantic tasks. Comment: EACL 2023 Findings paper, to present at LoResM
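The sentence-averaging baseline mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the sentence vectors have already been produced by some encoder (LASER, LaBSE or Sentence BERT in the paper); the toy 3-dimensional vectors and the function names `average_sentence_embeddings` and `cosine_similarity` are hypothetical and only serve the illustration. This is not the authors' implementation.

```python
import math


def average_sentence_embeddings(sentence_vectors):
    """Mean-pool sentence vectors into one L2-normalised document vector."""
    if not sentence_vectors:
        raise ValueError("expected at least one sentence vector")
    dim = len(sentence_vectors[0])
    if any(len(vec) != dim for vec in sentence_vectors):
        raise ValueError("all sentence vectors must have the same dimension")
    count = len(sentence_vectors)
    # Component-wise mean over all sentences in the document.
    mean = [sum(vec[i] for vec in sentence_vectors) / count for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    # Normalise so that documents of different lengths are comparable.
    return [x / norm for x in mean] if norm > 0 else mean


def cosine_similarity(a, b):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy example: two "documents", each given as a list of made-up sentence vectors.
doc_a = average_sentence_embeddings([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
doc_b = average_sentence_embeddings([[1.0, 0.2, 0.0], [0.1, 0.9, 0.0]])
similarity = cosine_similarity(doc_a, doc_b)
```

In a cross-lingual setting, the same pooling would be applied to documents in different languages and the resulting vectors compared by cosine similarity, assuming the underlying sentence encoder maps the languages into a shared space.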

    The effect of domain and diacritics in Yorùbá-English neural machine translation

    Massively multilingual machine translation (MT) has shown impressive capabilities, including zero and few-shot translation between low-resource language pairs. However, these models are often evaluated on high-resource languages with the assumption that they generalize to low-resource ones. The difficulty of evaluating MT models on low-resource pairs is often due to lack of standardized evaluation datasets. In this paper, we present MENYO-20k, the first multi-domain parallel corpus with a special focus on clean orthography for Yorùbá-English with standardized train-test splits for benchmarking. We provide several neural MT benchmarks and compare them to the performance of popular pre-trained (massively multilingual) MT models both for the heterogeneous test set and its subdomains. Since these pre-trained models use huge amounts of data with uncertain quality, we also analyze the effect of diacritics, a major characteristic of Yorùbá, in the training data. We investigate how and when this training condition affects the final quality and intelligibility of a translation. Our models outperform massively multilingual models such as Google (+8.7 BLEU) and Facebook M2M (+9.1 BLEU) when translating to Yorùbá, setting a high quality benchmark for future research.

    Findings of the Second WMT Shared Task on Sign Language Translation (WMT-SLT23)

    This paper presents the results of the Second WMT Shared Task on Sign Language Translation (WMT-SLT23). This shared task is concerned with automatic translation between signed and spoken languages. The task is unusual in the sense that it requires processing visual information (such as video frames or human pose estimation) beyond the well-known paradigm of text-to-text machine translation (MT). The task offers four tracks involving the following languages: Swiss German Sign Language (DSGS), French Sign Language of Switzerland (LSF-CH), Italian Sign Language of Switzerland (LIS-CH), German, French and Italian. Four teams (including one working on a baseline submission) participated in this second edition of the task, all submitting to the DSGS-to-German track. Besides a system ranking and system papers describing state-of-the-art techniques, this shared task makes the following scientific contributions: novel corpora and reproducible baseline systems. Finally, the task also resulted in publicly available sets of system outputs and more human evaluation scores for sign language translation.

    First WMT Shared Task on Sign Language Translation (WMT-SLT22)

    This paper is a brief summary of the First WMT Shared Task on Sign Language Translation (WMT-SLT22), a project partly funded by EAMT. The focus of this shared task is automatic translation between signed and spoken languages. Details can be found on our website or in the findings paper (Müller et al., 2022).

    Role of age and comorbidities in mortality of patients with infective endocarditis.

    The aim of this study was to analyse the characteristics of patients with IE in three age groups and to assess the ability of age and the Charlson Comorbidity Index (CCI) to predict mortality. Prospective cohort study of all patients with IE included in the GAMES Spanish database between 2008 and 2015. Patients were stratified into three age groups: A total of 3120 patients with IE (1327  There were no differences in the clinical presentation of IE between the groups. Age ≥ 80 years, high comorbidity (measured by CCI), and non-performance of surgery were independent predictors of mortality in patients with IE. CCI could help to identify those patients with IE and surgical indication who present a lower risk of in-hospital and 1-year mortality after surgery, especially in th