Are the Best Multilingual Document Embeddings simply Based on Sentence Embeddings?
Dense vector representations for textual data are crucial in modern NLP. Word embeddings and sentence embeddings estimated from raw texts are key to achieving state-of-the-art results in various tasks requiring semantic understanding. However, obtaining embeddings at the document level is challenging due to computational requirements and a lack of appropriate data. Instead, most approaches fall back on computing document embeddings from sentence representations. Although architectures and models exist to encode documents fully, they are generally limited to English and a few other high-resource languages. In this work, we provide a systematic comparison of methods for producing document-level representations from sentences, based on the LASER, LaBSE, and Sentence-BERT pre-trained multilingual models. We compare input-token truncation, sentence averaging, simple windowing, and, in some cases, new augmented and learnable approaches on three multi- and cross-lingual tasks in eight languages belonging to three different language families. Our task-based extrinsic evaluations show that, independently of the language, a clever combination of sentence embeddings is usually better than encoding the full document as a single unit, even when this is possible. We demonstrate that while a simple sentence average yields a strong baseline for classification tasks, more complex combinations are necessary for semantic tasks.
Comment: EACL 2023 Findings paper, to be presented at LoResMT
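The simplest of the combination strategies mentioned above, averaging sentence embeddings into a single document vector, can be sketched as follows. This is only an illustrative baseline, assuming the sentence-transformers library and its publicly available LaBSE checkpoint; the checkpoint name, helper function, and example sentences are assumptions made for the sketch, not the authors' implementation.

# Illustrative baseline: average multilingual sentence embeddings to obtain a
# document embedding. Assumes the sentence-transformers library and its LaBSE
# checkpoint; any multilingual sentence encoder could be substituted.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")  # assumed checkpoint name

def document_embedding(sentences: list[str]) -> np.ndarray:
    """Encode each sentence and average the vectors into one document vector."""
    sentence_vectors = model.encode(sentences)  # shape: (n_sentences, dim)
    return sentence_vectors.mean(axis=0)        # shape: (dim,)

doc_sentences = [
    "Dense vector representations are crucial in modern NLP.",
    "Document embeddings are often built from sentence embeddings.",
]
print(document_embedding(doc_sentences).shape)

The more involved combinations compared in the paper (windowing, augmented and learnable approaches) would replace the plain mean above with a different pooling step.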
Investigating Lexical Sharing in Multilingual Machine Translation for Indian Languages
Multilingual language models have shown impressive cross-lingual transfer ability across a diverse set of languages and tasks. To improve the cross-lingual ability of these models, strategies include transliteration and finer-grained segmentation into characters as opposed to subwords. In this work, we investigate lexical sharing in multilingual machine translation (MT) from Hindi, Gujarati, and Nepali into English. We explore the trade-offs in translation performance between data sampling and vocabulary size, and we examine whether transliteration is useful in encouraging cross-script generalisation. We also verify how the different settings generalise to unseen languages (Marathi and Bengali). We find that transliteration does not give pronounced improvements, and our analysis suggests that our multilingual MT models trained on original scripts are already robust to cross-script differences, even for relatively low-resource languages. Our code will be made publicly available.
Comment: EAMT main conference
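To make the transliteration setting concrete, here is a rough sketch that maps Gujarati text onto the Devanagari block by exploiting the fact that most Indic Unicode blocks are offset-aligned. This is a simplified illustration under that assumption, not the transliteration pipeline used in the paper; the function name and example string are invented for the sketch, and a real setup would use a dedicated transliteration tool that handles script-specific exceptions.

# Simplified sketch: shift Gujarati code points onto the Devanagari block so
# that related languages share surface forms (one way to encourage cross-script
# lexical sharing). Most Indic Unicode blocks are offset-aligned, so a plain
# code-point shift works for illustration; it ignores script-specific exceptions.
GUJARATI_START, GUJARATI_END = 0x0A80, 0x0AFF
DEVANAGARI_START = 0x0900

def gujarati_to_devanagari(text: str) -> str:
    """Map each Gujarati code point to its Devanagari counterpart."""
    chars = []
    for ch in text:
        cp = ord(ch)
        if GUJARATI_START <= cp <= GUJARATI_END:
            chars.append(chr(cp - GUJARATI_START + DEVANAGARI_START))
        else:
            chars.append(ch)  # leave Latin text, digits, punctuation untouched
    return "".join(chars)

print(gujarati_to_devanagari("ગુજરાતી"))  # -> "गुजराती"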