Are the Best Multilingual Document Embeddings simply Based on Sentence Embeddings?
Dense vector representations for textual data are crucial in modern NLP. Word embeddings and sentence embeddings estimated from raw texts are key to achieving state-of-the-art results in various tasks requiring semantic understanding. However, obtaining embeddings at the document level is challenging due to computational requirements and a lack of appropriate data. Instead, most approaches fall back on computing document embeddings from sentence representations. Although architectures and models exist to encode documents fully, they are generally limited to English and a few other high-resource languages. In this work, we provide a systematic comparison of methods for producing document-level representations from sentences, based on the LASER, LaBSE, and Sentence-BERT pre-trained multilingual models. We compare input-token truncation, sentence averaging, simple windowing, and, in some cases, new augmented and learnable approaches on three multi- and cross-lingual tasks in eight languages belonging to three different language families. Our task-based extrinsic evaluations show that, independently of the language, a clever combination of sentence embeddings is usually better than encoding the full document as a single unit, even when this is possible. We demonstrate that while a simple sentence average yields a strong baseline for classification tasks, more complex combinations are necessary for semantic tasks.
Comment: EACL 2023 Findings paper, to be presented at LoResMT
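The simplest of the combination strategies mentioned above, averaging sentence embeddings into a single document vector, can be sketched as follows. This is only an illustrative baseline, assuming the sentence-transformers library and its publicly available LaBSE checkpoint; the checkpoint name, helper function, and example sentences are assumptions made for the sketch, not the authors' implementation.

# Illustrative baseline: average multilingual sentence embeddings to obtain a
# document embedding. Assumes the sentence-transformers library and its LaBSE
# checkpoint; any multilingual sentence encoder could be substituted.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")  # assumed checkpoint name

def document_embedding(sentences: list[str]) -> np.ndarray:
    """Encode each sentence and average the vectors into one document vector."""
    sentence_vectors = model.encode(sentences)  # shape: (n_sentences, dim)
    return sentence_vectors.mean(axis=0)        # shape: (dim,)

doc_sentences = [
    "Dense vector representations are crucial in modern NLP.",
    "Document embeddings are often built from sentence embeddings.",
]
print(document_embedding(doc_sentences).shape)

The more involved combinations compared in the paper (windowing, augmented and learnable approaches) would replace the plain mean above with a different pooling step.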
Investigating Lexical Sharing in Multilingual Machine Translation for Indian Languages
Multilingual language models have shown impressive cross-lingual transfer ability across a diverse set of languages and tasks. To improve the cross-lingual ability of these models, strategies include transliteration and finer-grained segmentation into characters as opposed to subwords. In this work, we investigate lexical sharing in multilingual machine translation (MT) from Hindi, Gujarati, and Nepali into English. We explore the trade-offs in translation performance between data sampling and vocabulary size, and we examine whether transliteration is useful in encouraging cross-script generalisation. We also verify how the different settings generalise to unseen languages (Marathi and Bengali). We find that transliteration does not give pronounced improvements, and our analysis suggests that our multilingual MT models trained on original scripts are already robust to cross-script differences, even for relatively low-resource languages. Our code will be made publicly available.
Comment: EAMT main conference
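To make the transliteration setting concrete, here is a rough sketch that maps Gujarati text onto the Devanagari block by exploiting the fact that most Indic Unicode blocks are offset-aligned. This is a simplified illustration under that assumption, not the transliteration pipeline used in the paper; the function name and example string are invented for the sketch, and a real setup would use a dedicated transliteration tool that handles script-specific exceptions.

# Simplified sketch: shift Gujarati code points onto the Devanagari block so
# that related languages share surface forms (one way to encourage cross-script
# lexical sharing). Most Indic Unicode blocks are offset-aligned, so a plain
# code-point shift works for illustration; it ignores script-specific exceptions.
GUJARATI_START, GUJARATI_END = 0x0A80, 0x0AFF
DEVANAGARI_START = 0x0900

def gujarati_to_devanagari(text: str) -> str:
    """Map each Gujarati code point to its Devanagari counterpart."""
    chars = []
    for ch in text:
        cp = ord(ch)
        if GUJARATI_START <= cp <= GUJARATI_END:
            chars.append(chr(cp - GUJARATI_START + DEVANAGARI_START))
        else:
            chars.append(ch)  # leave Latin text, digits, punctuation untouched
    return "".join(chars)

print(gujarati_to_devanagari("ગુજરાતી"))  # -> "गुजराती"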