Understanding the effects of language-specific class imbalance in multilingual fine-tuning
We study the effect of one type of imbalance often present in real-life
multilingual classification datasets: an uneven distribution of labels across
languages. We show evidence that fine-tuning a transformer-based Large Language
Model (LLM) on a dataset with this imbalance leads to worse performance, a more
pronounced separation of languages in the latent space, and the promotion of
uninformative features. We modify the traditional class weighting approach to
imbalance by calculating class weights separately for each language and show
that this helps mitigate those detrimental effects. These results create
awareness of the negative effects of language-specific class imbalance in
multilingual fine-tuning and the way in which the model learns to rely on the
separation of languages to perform the task.
Comment: To be published in: Findings of the Association for Computational Linguistics: EACL 202
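The per-language class weighting described in the abstract can be sketched as below. The function name and the inverse-frequency ("balanced") normalisation are our assumptions for illustration, not details taken from the paper; the key point is only that weights are computed within each language rather than over the pooled dataset.

```python
from collections import Counter

def per_language_class_weights(labels, languages):
    """Compute inverse-frequency class weights separately for each language.

    Returns a dict mapping (language, label) -> weight. Within each language,
    a label's weight is n_samples / (n_classes * n_label), so a perfectly
    balanced language gets 1.0 for every class.
    """
    by_lang = {}
    for label, lang in zip(labels, languages):
        by_lang.setdefault(lang, []).append(label)

    weights = {}
    for lang, lang_labels in by_lang.items():
        counts = Counter(lang_labels)
        n, k = len(lang_labels), len(counts)
        for label, count in counts.items():
            # rare classes in this language get proportionally larger weights
            weights[(lang, label)] = n / (k * count)
    return weights
```

These weights would then be looked up per example during loss computation, so that a label that is rare only in one language is up-weighted only for that language.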
Learning to Predict Novel Noun-Noun Compounds
We introduce temporally and contextually-aware models for the novel task of
predicting unseen but plausible concepts, as conveyed by noun-noun compounds in
a time-stamped corpus. We train compositional models on observed compounds,
more specifically the composed distributed representations of their
constituents across a time-stamped corpus, while giving them corrupted instances
(where head or modifier are replaced by a random constituent) as negative
evidence. The model captures generalisations over this data and learns what
combinations give rise to plausible compounds and which ones do not. After
training, we query the model for the plausibility of automatically generated
novel combinations and verify whether the classifications are accurate. For our
best model, we find that in around 85% of the cases, the novel compounds
generated are attested in previously unseen data. An additional estimated 5%
are plausible despite not being attested in the recent corpus, based on
judgments from independent human raters.
Comment: 9 pages, 3 figures, To appear at Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019) at ACL 2019. V3 - Fixed some typos and updated the Data Preprocessing section
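The negative-evidence scheme described above — replacing the head or modifier of an attested compound with a random constituent — can be sketched as follows. The function name and the re-sampling loop are our assumptions for illustration; the abstract specifies only that one constituent is swapped for a random one.

```python
import random

def corrupt_compound(compound, vocabulary, rng=random):
    """Build a negative training instance from an attested noun-noun compound
    by replacing either the modifier or the head with a random noun.

    `compound` is a (modifier, head) pair; `vocabulary` is a list of candidate
    nouns. Re-samples until the corrupted pair differs from the original.
    """
    modifier, head = compound
    while True:
        slot = rng.randrange(2)  # 0: replace the modifier, 1: replace the head
        replacement = rng.choice(vocabulary)
        corrupted = (replacement, head) if slot == 0 else (modifier, replacement)
        if corrupted != compound:
            return corrupted
```

A compositional model can then be trained to score attested pairs above such corrupted ones, which is what lets it judge the plausibility of unseen combinations at query time.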
Can language models learn analogical reasoning? Investigating training objectives and comparisons to human performance
While analogies are a common way to evaluate word embeddings in NLP, it is
also of interest to investigate whether or not analogical reasoning is a task
in itself that can be learned. In this paper, we test several ways to learn
basic analogical reasoning, specifically focusing on analogies that are more
typical of what is used to evaluate analogical reasoning in humans than those
in commonly used NLP benchmarks. Our experiments find that models are able to
learn analogical reasoning, even with a small amount of data. We additionally
compare our models to a dataset with a human baseline, and find that after
training, models approach human performance.
Measuring the Compositionality of Noun-Noun Compounds over Time
We present work in progress on the temporal progression of compositionality
in noun-noun compounds. Previous work has proposed computational methods for
determining the compositionality of compounds. These methods try to
automatically determine how transparent the meaning of the compound as a whole
is with respect to the meaning of its parts. We hypothesize that such a
property might change over time. We use the time-stamped Google Books corpus
for our diachronic investigations, and first examine whether the vector-based
semantic spaces extracted from this corpus are able to predict compositionality
ratings, despite their inherent limitations. We find that using temporal
information helps predict the ratings, although correlation with the ratings
is lower than reported for other corpora. Finally, we show changes in
compositionality over time for a selection of compounds.
Comment: 6 pages, 3 figures, To appear in the proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change 2019 @ ACL 2019, Fixed typos, Increased figure size
The Scope and the Sources of Variation in Verbal Predicates in English and French
Proceedings of the Ninth International Workshop
on Treebanks and Linguistic Theories.
Editors: Markus Dickinson, Kaili Müürisep and Marco Passarotti.
NEALT Proceedings Series, Vol. 9 (2010), 199-210.
© 2010 The editors and contributors.
Published by the Northern European Association for Language Technology (NEALT), http://omilia.uio.no/nealt.
Electronically published at Tartu University Library (Estonia), http://hdl.handle.net/10062/15891
Analysis of Data Augmentation Methods for Low-Resource Maltese ASR
Recent years have seen an increased interest in the computational speech
processing of Maltese, but resources remain sparse. In this paper, we consider
data augmentation techniques for improving speech recognition for low-resource
languages, focusing on Maltese as a test case. We consider three different
types of data augmentation: unsupervised training, multilingual training and
the use of synthesized speech as training data. The goal is to determine which
of these techniques, or combination of them, is the most effective to improve
speech recognition for languages where the starting point is a small corpus of
approximately 7 hours of transcribed speech. Our results show that combining the data augmentation techniques studied here leads to an absolute WER improvement of 15% without the use of a language model.
Comment: 12 pages
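The word error rate (WER) reported above is the standard edit-distance metric for speech recognition. As a minimal sketch of how it is computed (this is the textbook definition, not code from the paper):

```python
def word_error_rate(reference, hypothesis):
    """WER: word-level Levenshtein distance divided by reference length.

    Counts substitutions, insertions, and deletions needed to turn the
    hypothesis transcript into the reference transcript.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)
```

An absolute improvement of 15% means this ratio drops by 0.15 between the baseline system and the augmented one.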
Annotating for Hate Speech: The MaNeCo Corpus and Some Input from Critical Discourse Analysis
This paper presents a novel scheme for the annotation of hate speech in
corpora of Web 2.0 commentary. The proposed scheme is motivated by the critical
analysis of posts made in reaction to news reports on the Mediterranean
migration crisis and LGBTIQ+ matters in Malta, which was conducted under the
auspices of the EU-funded C.O.N.T.A.C.T. project. Based on the realization that
hate speech is not a clear-cut category to begin with, appears to belong to a
continuum of discriminatory discourse and is often realized through the use of
indirect linguistic means, it is argued that annotation schemes for its
detection should refrain from directly including the label 'hate speech,' as
different annotators might have different thresholds as to what constitutes
hate speech and what does not. In view of this, we suggest a multi-layer annotation
scheme, which is pilot-tested against a binary +/- hate speech classification
and appears to yield higher inter-annotator agreement. Motivating the
postulation of our scheme, we then present the MaNeCo corpus on which it will
eventually be used; a substantial corpus of on-line newspaper comments spanning
10 years.
Comment: 10 pages, 1 table. Appears in Proceedings of the 12th edition of the Language Resources and Evaluation Conference (LREC'20)