Subword Evenness (SuE) as a Predictor of Cross-lingual Transfer to Low-resource Languages
Pre-trained multilingual models, such as mBERT, XLM-R and mT5, are used to improve performance on various tasks in low-resource languages via cross-lingual transfer. In this framework, English is usually seen as the most natural choice of transfer language (for fine-tuning or continued training of a multilingual pre-trained model), but recent work has revealed that it is often not the best choice. The success of cross-lingual transfer seems to depend on properties of languages that are currently hard to explain: successful transfer often happens between unrelated languages and often cannot be explained by data-dependent factors. In this study, we show that languages written in non-Latin and non-alphabetic scripts (mostly Asian languages) are the best choices for improving performance on the task of Masked Language Modelling (MLM) in a diverse set of 30 low-resource languages, and that the success of the transfer is well predicted by our novel measure of Subword Evenness (SuE). Transferring language models from languages that score low on our measure results in the lowest average perplexity on the target low-resource languages. Our correlation coefficients, obtained with three different pre-trained multilingual models, are consistently higher than those of all other predictors, including text-based measures (type-token ratio, entropy) and linguistically motivated choices (genealogical and typological proximity).
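The abstract does not define SuE formally, so the sketch below only illustrates the surrounding comparison framework it describes: computing the text-based baseline predictors it names (type-token ratio, entropy) for candidate transfer languages and correlating them with measured average perplexities on target languages. All function names and input structures are hypothetical placeholders, not the authors' implementation.

```python
# Hypothetical sketch: correlate simple text-based predictors of a transfer
# language (type-token ratio, character entropy) with the average perplexity
# achieved on target low-resource languages. The SuE measure itself is not
# reimplemented here because its definition is not given in the abstract.
import math
from collections import Counter
from scipy.stats import spearmanr

def type_token_ratio(text: str) -> float:
    tokens = text.split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def char_entropy(text: str) -> float:
    counts = Counter(text)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def correlate_predictors(transfer_corpora, avg_perplexity):
    """transfer_corpora: {language: raw text}; avg_perplexity: {language: float},
    both assumed to come from the user's own transfer experiments."""
    langs = sorted(set(transfer_corpora) & set(avg_perplexity))
    ppl = [avg_perplexity[lang] for lang in langs]
    results = {}
    for name, fn in [("ttr", type_token_ratio), ("entropy", char_entropy)]:
        scores = [fn(transfer_corpora[lang]) for lang in langs]
        rho, p = spearmanr(scores, ppl)
        results[name] = (rho, p)
    return results
```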
Bridging linguistic typology and multilingual machine translation with multi-view language representations
Sparse language vectors from linguistic typology databases and learned embeddings from tasks like multilingual machine translation have been investigated in isolation, without analysing how they could benefit from each other's language characterisation. We propose to fuse both views using singular vector canonical correlation analysis and study what kind of information is induced from each source. By inferring typological features and language phylogenies, we observe that our representations embed typology and strengthen correlations with language relationships. We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy in tasks that require information about language similarities, such as language clustering and ranking candidates for multilingual transfer. With our method, we can easily project and assess new languages without expensive retraining of massive multilingual or ranking models, which are major disadvantages of related approaches.
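A minimal sketch of the fusion step described above, under assumptions: each view (sparse typological vectors and learned task embeddings over the same set of languages) is reduced with SVD, the reduced views are aligned with canonical correlation analysis, and the aligned projections are concatenated. The dimensionalities and the use of scikit-learn's CCA are illustrative choices, not the authors' implementation.

```python
# Hypothetical sketch of singular vector canonical correlation analysis (SVCCA)
# for fusing two language representations: typological feature vectors and
# embeddings learned from multilingual machine translation.
import numpy as np
from sklearn.cross_decomposition import CCA

def svcca_fuse(typology: np.ndarray, learned: np.ndarray,
               var_kept: float = 0.99, n_components: int = 32) -> np.ndarray:
    """Both inputs are (n_languages, dim) matrices over the same language set."""
    def svd_reduce(X):
        Xc = X - X.mean(axis=0)
        U, S, _ = np.linalg.svd(Xc, full_matrices=False)
        # keep the smallest number of singular vectors explaining `var_kept`
        ratio = np.cumsum(S ** 2) / np.sum(S ** 2)
        k = int(np.searchsorted(ratio, var_kept)) + 1
        return U[:, :k] * S[:k]

    A, B = svd_reduce(typology), svd_reduce(learned)
    k = min(n_components, A.shape[0], A.shape[1], B.shape[1])
    cca = CCA(n_components=k, max_iter=1000)
    A_c, B_c = cca.fit_transform(A, B)
    # fused multi-view space: concatenation of the two aligned projections
    return np.hstack([A_c, B_c])
```

New languages can then be projected into the fused space with the fitted SVD and CCA transforms alone, which is what makes the approach cheap compared with retraining a massive multilingual model.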
When Is Multilinguality a Curse? Language Modeling for 250 High- and Low-Resource Languages
Multilingual language models are widely used to extend NLP systems to low-resource languages. However, concrete evidence for the effects of multilinguality on language modeling performance in individual languages remains scarce. Here, we pre-train over 10,000 monolingual and multilingual language models for over 250 languages, including multiple language families that are under-studied in NLP. We assess how language modeling performance in each language varies as a function of (1) monolingual dataset size, (2) added multilingual dataset size, (3) linguistic similarity of the added languages, and (4) model size (up to 45M parameters). We find that in moderation, adding multilingual data improves low-resource language modeling performance, similar to increasing low-resource dataset sizes by up to 33%. Improvements depend on the syntactic similarity of the added multilingual data, with marginal additional effects of vocabulary overlap. However, high-resource languages consistently perform worse in multilingual pre-training scenarios. As dataset sizes increase, adding multilingual data begins to hurt performance for both low-resource and high-resource languages, likely due to limited model capacity (the "curse of multilinguality"). These results suggest that massively multilingual pre-training may not be optimal for any languages involved, but that more targeted models can significantly improve performance.
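A hedged sketch of the kind of analysis this abstract describes: regressing each language's modeling performance on monolingual data size, added multilingual data size, a linguistic-similarity score, and model size. The variable names and the use of ordinary least squares are illustrative assumptions, not the paper's actual methodology.

```python
# Hypothetical sketch: fit a simple linear model of evaluation perplexity as a
# function of the four factors listed above, to see which ones matter.
import numpy as np

def fit_perplexity_model(rows):
    """rows: list of dicts with keys 'mono_tokens', 'multi_tokens',
    'syntactic_similarity', 'model_params', 'eval_perplexity',
    one row per (language, model) configuration."""
    X = np.array([[np.log1p(r["mono_tokens"]),
                   np.log1p(r["multi_tokens"]),
                   r["syntactic_similarity"],
                   np.log(r["model_params"])] for r in rows])
    y = np.log(np.array([r["eval_perplexity"] for r in rows]))
    X = np.hstack([X, np.ones((len(rows), 1))])  # intercept term
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    names = ["log_mono_tokens", "log_multi_tokens",
             "syntactic_similarity", "log_model_params", "intercept"]
    return dict(zip(names, coef))
```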
On the relation between linguistic typology and (limitations of) multilingual language modeling
A key challenge in cross-lingual NLP is developing general language-independent architectures that are equally applicable to any language. However, this ambition is largely hampered by the variation in structural and semantic properties, i.e., the typological profiles of the world's languages. In this work, we analyse the implications of this variation for the language modeling (LM) task. We present a large-scale study of state-of-the-art n-gram-based and neural language models on 50 typologically diverse languages covering a wide variety of morphological systems. Operating in the full-vocabulary LM setup focused on word-level prediction, we demonstrate that a coarse typology of morphological systems is predictive of absolute LM performance. Moreover, fine-grained typological features such as exponence, flexivity, fusion, and inflectional synthesis turn out to be responsible for the proliferation of low-frequency phenomena, which are inherently difficult for statistical architectures to model, or for the meaning ambiguity of character n-grams. Our study strongly suggests that these features have to be taken into consideration when constructing the next generation of language-agnostic LM architectures capable of handling morphologically complex languages such as Tamil or Korean.
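To make the claim that "a coarse typology of morphological systems is predictive of absolute LM performance" concrete, here is a hedged sketch that groups per-language perplexities by a coarse morphological label and tests whether the groups differ. The label set and the Kruskal-Wallis test are illustrative assumptions, not the paper's analysis.

```python
# Hypothetical sketch: check whether coarse morphological type (e.g. isolating,
# fusional, agglutinative, introflexive) predicts per-language LM perplexity.
from collections import defaultdict
from statistics import mean
from scipy.stats import kruskal

def perplexity_by_morphology(perplexities, morph_type):
    """perplexities: {language: perplexity}; morph_type: {language: label}."""
    groups = defaultdict(list)
    for lang, ppl in perplexities.items():
        groups[morph_type[lang]].append(ppl)
    group_means = {label: mean(vals) for label, vals in groups.items()}
    # non-parametric test: do perplexity distributions differ across types?
    stat, p_value = kruskal(*groups.values())
    return group_means, stat, p_value
```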
- …