Evaluating Semantic Vectors for Norwegian
In this article, we present two benchmark data sets for evaluating models of semantic word similarity for Norwegian. While such resources are available for English, none existed for Norwegian prior to this work. Furthermore, we produce wide-coverage semantic vectors trained on the Norwegian Newspaper Corpus using several popular word embedding frameworks. Finally, we demonstrate the usefulness of the created resources by evaluating the performance of different word embedding models on the tasks of analogical reasoning and synonym detection. The benchmark data sets and word embeddings are all made freely available.
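The analogical-reasoning evaluation mentioned in the abstract can be sketched with plain cosine arithmetic over word vectors. The tiny embedding table below is purely illustrative and is not taken from the released Norwegian vectors; in practice the vectors would come from one of the trained embedding models.

```python
import numpy as np

# Illustrative toy embeddings (NOT the released Norwegian vectors).
emb = {
    "konge":    np.array([0.9, 0.8, 0.1]),   # "king"
    "mann":     np.array([0.8, 0.1, 0.0]),   # "man"
    "kvinne":   np.array([0.1, 0.2, 0.9]),   # "woman"
    "dronning": np.array([0.2, 0.9, 1.0]),   # "queen"
    "hus":      np.array([1.0, 0.0, 0.1]),   # "house" (distractor)
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c, vocab):
    """Return the word d maximizing cos(d, b - a + c), excluding a, b, c."""
    target = vocab[b] - vocab[a] + vocab[c]
    candidates = {w: cosine(target, vec) for w, vec in vocab.items()
                  if w not in (a, b, c)}
    return max(candidates, key=candidates.get)

# "mann" is to "konge" as "kvinne" is to ...
answer = analogy("mann", "konge", "kvinne", emb)
```

Synonym detection can be evaluated with the same `cosine` function by ranking candidate words against a query word.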
Does Manipulating Tokenization Aid Cross-Lingual Transfer? A Study on POS Tagging for Non-Standardized Languages
One of the challenges with finetuning pretrained language models (PLMs) is that their tokenizer is optimized for the language(s) it was pretrained on, but brittle when it comes to previously unseen variations in the data. This can, for instance, be observed when finetuning PLMs on one language and evaluating them on data in a closely related language variety with no standardized orthography. Despite the high linguistic similarity, tokenization no longer corresponds to meaningful representations of the target data, leading to low performance in, e.g., part-of-speech tagging. In this work, we finetune PLMs on seven languages from three different families and analyze their zero-shot performance on closely related, non-standardized varieties. We consider different measures of the divergence in the tokenization of the source and target data, and the ways they can be adjusted by manipulating the tokenization during the finetuning step. Overall, we find that the similarity between the percentage of words that get split into subwords in the source and target data (the split word ratio difference) is the strongest predictor of model performance on target data.
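The split word ratio difference described above can be sketched as follows. The toy length-based tokenizer is a hypothetical stand-in for a real PLM subword tokenizer (e.g. WordPiece or BPE), and the example word lists are invented:

```python
def split_word_ratio(words, tokenize):
    """Fraction of words that the subword tokenizer splits into >1 piece."""
    split = sum(1 for w in words if len(tokenize(w)) > 1)
    return split / len(words)

# Hypothetical stand-in tokenizer: splits any word longer than 5 characters.
# A real PLM subword tokenizer would be used here instead.
def toy_tokenize(w):
    return [w[:5], w[5:]] if len(w) > 5 else [w]

source_words = ["the", "tokenizer", "splits", "words"]
target_words = ["de", "tokeniseerder", "splitst", "woorden"]

# The predictor: absolute difference of split ratios on source vs. target data.
diff = abs(split_word_ratio(source_words, toy_tokenize)
           - split_word_ratio(target_words, toy_tokenize))
```

A small `diff` indicates that the tokenizer fragments source and target text to a similar degree, which the paper finds to correlate with better zero-shot transfer.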
Investigating multilingual approaches for parsing universal dependencies
Multilingual dependency parsing encapsulates any attempt to parse multiple languages. It can involve parsing multiple languages in isolation (poly-monolingual), leveraging training data from multiple languages to process any of the included languages (polyglot), or training on one or multiple languages to process a low-resource language with no training data (zero-shot). In this thesis, we explore multilingual dependency parsing across all three paradigms, first analysing whether polyglot training on a number of source languages is beneficial for processing a target language in a zero-shot cross-lingual dependency parsing experiment using annotation projection. The results of this experiment show that polyglot training produces an overall trend of better results on the target language, but a highly related single source language can still be better for transfer.
We then look at the role of pretrained language models in processing a moderately low-resource language, Irish. Here, we develop our own monolingual Irish BERT model, gaBERT, from scratch and compare it to a number of multilingual baselines, showing that developing a monolingual language model for Irish is worthwhile. We then turn to the topic of parsing Enhanced Universal Dependencies (EUD) graphs, which are an extension of basic Universal Dependencies trees, and describe the DCU-EPFL submission to the 2021 IWPT shared task on EUD parsing. Here, we develop a multitask model to jointly learn the tasks of basic dependency parsing and EUD graph parsing, showing improvements over a single-task basic dependency parser. Lastly, we revisit the topic of polyglot parsing and investigate whether multiview learning can be applied to the problem of multilingual dependency parsing, learning different views based on the dataset source. We show that multiview learning can be used to train parsers with multiple datasets, yielding a general improvement over single-view baselines.
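A minimal sketch of the kind of joint objective such a multitask parser might optimize: a weighted sum of a head-selection cross-entropy for basic trees (each token has exactly one head) and an edge-wise binary cross-entropy for EUD graphs (tokens may have multiple heads). The weight `alpha` and the toy score matrices are illustrative assumptions, not the actual DCU-EPFL configuration.

```python
import numpy as np

def head_cross_entropy(scores, gold_heads):
    """NLL of gold head indices under row-wise softmaxed head scores (trees)."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    return -np.mean(np.log(probs[np.arange(len(gold_heads)), gold_heads]))

def edge_bce(scores, gold_edges):
    """Binary cross-entropy over an edge score matrix (graphs allow >1 head)."""
    probs = 1.0 / (1.0 + np.exp(-scores))
    return -np.mean(gold_edges * np.log(probs)
                    + (1 - gold_edges) * np.log(1 - probs))

def multitask_loss(tree_scores, tree_gold, graph_scores, graph_gold, alpha=0.5):
    # Hypothetical interpolation weight between the two task losses.
    return (alpha * head_cross_entropy(tree_scores, tree_gold)
            + (1 - alpha) * edge_bce(graph_scores, graph_gold))

# Toy scores for a 2-token sentence with 3 head candidates (incl. root).
tree_scores = np.array([[2.0, 0.1, 0.1], [0.1, 2.0, 0.1]])
tree_gold = np.array([0, 1])
graph_scores = np.array([[2.0, -2.0], [-2.0, 2.0]])
graph_gold = np.array([[1.0, 0.0], [0.0, 1.0]])
loss = multitask_loss(tree_scores, tree_gold, graph_scores, graph_gold)
```

In a real system both heads would share a pretrained encoder, so gradients from each task regularize the shared representations.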
Explaining landscape preference heterogeneity using machine learning-based survey analysis
We conducted a national survey on a high-quality internet panel to study landscape preferences in Norway, using photos as stimuli. We examined preference heterogeneity with respect to socio-demographic characteristics and latent topics brought up by the respondents, using ordinal logistic regression and structural topic modelling (STM), a machine learning-based analysis. We found that pasture landscapes are the most favoured (55%), while densely planted spruce forests are the least favoured (8%). The contrast was particularly strong between eastern and western Norway, between men and women, and between young and old. STM revealed that the choices were mainly driven by a preference for landscape openness, especially among women. Other important drivers were concerns regarding reforestation of former farmlands, aesthetic properties, forest management, biodiversity issues, and cultural values. Our results suggest that landscape policies may clash with socio-cultural preferences, and failure to account for these may undermine the success of a policy.
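The ordinal logistic regression component can be sketched with the standard proportional-odds model, where each rating category's probability is a difference of adjacent cumulative logits. The covariates, coefficients, and cutpoints below are invented for illustration and do not reflect the survey's fitted values.

```python
import numpy as np

def ordinal_probs(x, beta, cutpoints):
    """P(Y = k) for ordered categories 1..K under the proportional-odds model:
    P(Y <= k) = sigmoid(theta_k - x.beta), with P(Y <= K) = 1."""
    eta = x @ beta
    cdf = 1.0 / (1.0 + np.exp(-(np.asarray(cutpoints) - eta)))
    cdf = np.concatenate([cdf, [1.0]])            # top category closes the scale
    return np.diff(np.concatenate([[0.0], cdf]))  # adjacent differences

x = np.array([1.0, 0.0])      # hypothetical indicators, e.g. [female, eastern]
beta = np.array([0.8, -0.3])  # illustrative coefficients
probs = ordinal_probs(x, beta, cutpoints=[-1.0, 0.5, 2.0])  # 4 rating levels
```

A single coefficient vector shifts all cumulative thresholds equally, which is the "proportional odds" assumption; fitting in practice is typically done with a statistics package rather than by hand.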
Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese
Multilingual language models have pushed the state of the art in cross-lingual NLP transfer. Most zero-shot cross-lingual transfer approaches, however, use one and the same massively multilingual transformer (e.g., mBERT or XLM-R) to transfer to all target languages, irrespective of their typological, etymological, and phylogenetic relations to other languages. In particular, readily available data and models of resource-rich sibling languages are often ignored. In this work, we show empirically, in a case study on Faroese, a low-resource language from a high-resource language family, that by leveraging phylogenetic information and departing from the 'one-size-fits-all' paradigm, one can improve cross-lingual transfer to low-resource languages. In particular, we leverage the abundant resources of other Scandinavian languages (i.e., Danish, Norwegian, Swedish, and Icelandic) for the benefit of Faroese. Our evaluation results show that we can substantially improve transfer performance to Faroese by exploiting data and models of closely related high-resource languages. Further, we release a new web corpus of Faroese, Faroese datasets for named entity recognition (NER) and semantic text similarity (STS), and new language models trained on all Scandinavian languages.