Understanding the effects of language-specific class imbalance in multilingual fine-tuning
We study the effect of one type of imbalance often present in real-life
multilingual classification datasets: an uneven distribution of labels across
languages. We show evidence that fine-tuning a transformer-based Large Language
Model (LLM) on a dataset with this imbalance leads to worse performance, a more
pronounced separation of languages in the latent space, and the promotion of
uninformative features. We modify the traditional class weighting approach to
imbalance by calculating class weights separately for each language and show
that this helps mitigate those detrimental effects. These results create
awareness of the negative effects of language-specific class imbalance in
multilingual fine-tuning and the way in which the model learns to rely on the
separation of languages to perform the task.
Comment: To be published in: Findings of the Association for Computational Linguistics: EACL 202
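The per-language class weighting described in the abstract can be sketched as below. The function name and the inverse-frequency ("balanced") normalisation are our assumptions for illustration, not details taken from the paper; the key point is only that weights are computed within each language rather than over the pooled dataset.

```python
from collections import Counter

def per_language_class_weights(labels, languages):
    """Compute inverse-frequency class weights separately for each language.

    Returns a dict mapping (language, label) -> weight. Within each language,
    a label's weight is n_samples / (n_classes * n_label), so a perfectly
    balanced language gets 1.0 for every class.
    """
    by_lang = {}
    for label, lang in zip(labels, languages):
        by_lang.setdefault(lang, []).append(label)

    weights = {}
    for lang, lang_labels in by_lang.items():
        counts = Counter(lang_labels)
        n, k = len(lang_labels), len(counts)
        for label, count in counts.items():
            # rare classes in this language get proportionally larger weights
            weights[(lang, label)] = n / (k * count)
    return weights
```

These weights would then be looked up per example during loss computation, so that a label that is rare only in one language is up-weighted only for that language.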
Learning to Predict Novel Noun-Noun Compounds
We introduce temporally and contextually-aware models for the novel task of
predicting unseen but plausible concepts, as conveyed by noun-noun compounds in
a time-stamped corpus. We train compositional models on observed compounds,
more specifically the composed distributed representations of their
constituents across a time-stamped corpus, while giving them corrupted instances
(where head or modifier are replaced by a random constituent) as negative
evidence. The model captures generalisations over this data and learns what
combinations give rise to plausible compounds and which ones do not. After
training, we query the model for the plausibility of automatically generated
novel combinations and verify whether the classifications are accurate. For our
best model, we find that in around 85% of the cases, the novel compounds
generated are attested in previously unseen data. An additional estimated 5%
are plausible despite not being attested in the recent corpus, based on
judgments from independent human raters.
Comment: 9 pages, 3 figures, To appear at Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019) at ACL 2019. V3 - Fixed some typos and updated the Data Preprocessing section
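The negative-evidence scheme described above — replacing the head or modifier of an attested compound with a random constituent — can be sketched as follows. The function name and the re-sampling loop are our assumptions for illustration; the abstract specifies only that one constituent is swapped for a random one.

```python
import random

def corrupt_compound(compound, vocabulary, rng=random):
    """Build a negative training instance from an attested noun-noun compound
    by replacing either the modifier or the head with a random noun.

    `compound` is a (modifier, head) pair; `vocabulary` is a list of candidate
    nouns. Re-samples until the corrupted pair differs from the original.
    """
    modifier, head = compound
    while True:
        slot = rng.randrange(2)  # 0: replace the modifier, 1: replace the head
        replacement = rng.choice(vocabulary)
        corrupted = (replacement, head) if slot == 0 else (modifier, replacement)
        if corrupted != compound:
            return corrupted
```

A compositional model can then be trained to score attested pairs above such corrupted ones, which is what lets it judge the plausibility of unseen combinations at query time.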
Can language models learn analogical reasoning? Investigating training objectives and comparisons to human performance
While analogies are a common way to evaluate word embeddings in NLP, it is
also of interest to investigate whether or not analogical reasoning is a task
in itself that can be learned. In this paper, we test several ways to learn
basic analogical reasoning, specifically focusing on analogies that are more
typical of what is used to evaluate analogical reasoning in humans than those
in commonly used NLP benchmarks. Our experiments find that models are able to
learn analogical reasoning, even with a small amount of data. We additionally
compare our models to a dataset with a human baseline, and find that after
training, models approach human performance.
Measuring the Compositionality of Noun-Noun Compounds over Time
We present work in progress on the temporal progression of compositionality
in noun-noun compounds. Previous work has proposed computational methods for
determining the compositionality of compounds. These methods try to
automatically determine how transparent the meaning of the compound as a whole
is with respect to the meaning of its parts. We hypothesize that such a
property might change over time. We use the time-stamped Google Books corpus
for our diachronic investigations, and first examine whether the vector-based
semantic spaces extracted from this corpus are able to predict compositionality
ratings, despite their inherent limitations. We find that using temporal
information helps predict the ratings, although correlation with the ratings
is lower than reported for other corpora. Finally, we show changes in
compositionality over time for a selection of compounds.
Comment: 6 pages, 3 figures, To appear in the proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change 2019 @ ACL 2019, Fixed typos, Increased figure size
The Scope and the Sources of Variation in Verbal Predicates in English and French
Proceedings of the Ninth International Workshop
on Treebanks and Linguistic Theories.
Editors: Markus Dickinson, Kaili Müürisep and Marco Passarotti.
NEALT Proceedings Series, Vol. 9 (2010), 199-210.
© 2010 The editors and contributors.
Published by the Northern European Association for Language Technology (NEALT), http://omilia.uio.no/nealt.
Electronically published at Tartu University Library (Estonia), http://hdl.handle.net/10062/15891
Analysis of Data Augmentation Methods for Low-Resource Maltese ASR
Recent years have seen an increased interest in the computational speech
processing of Maltese, but resources remain sparse. In this paper, we consider
data augmentation techniques for improving speech recognition for low-resource
languages, focusing on Maltese as a test case. We consider three different
types of data augmentation: unsupervised training, multilingual training and
the use of synthesized speech as training data. The goal is to determine which
of these techniques, or combination of them, is the most effective to improve
speech recognition for languages where the starting point is a small corpus of
approximately 7 hours of transcribed speech. Our results show that combining the data augmentation techniques studied here leads to an absolute WER improvement of 15% without the use of a language model.
Comment: 12 pages
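The word error rate (WER) reported above is the standard edit-distance metric for speech recognition. As a minimal sketch of how it is computed (this is the textbook definition, not code from the paper):

```python
def word_error_rate(reference, hypothesis):
    """WER: word-level Levenshtein distance divided by reference length.

    Counts substitutions, insertions, and deletions needed to turn the
    hypothesis transcript into the reference transcript.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)
```

An absolute improvement of 15% means this ratio drops by 0.15 between the baseline system and the augmented one.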
Annotating for Hate Speech: The MaNeCo Corpus and Some Input from Critical Discourse Analysis
This paper presents a novel scheme for the annotation of hate speech in
corpora of Web 2.0 commentary. The proposed scheme is motivated by the critical
analysis of posts made in reaction to news reports on the Mediterranean
migration crisis and LGBTIQ+ matters in Malta, which was conducted under the
auspices of the EU-funded C.O.N.T.A.C.T. project. Based on the realization that
hate speech is not a clear-cut category to begin with, appears to belong to a
continuum of discriminatory discourse and is often realized through the use of
indirect linguistic means, it is argued that annotation schemes for its
detection should refrain from directly including the label 'hate speech,' as
different annotators might have different thresholds as to what constitutes
hate speech and what does not. In view of this, we suggest a multi-layer annotation
scheme, which is pilot-tested against a binary +/- hate speech classification
and appears to yield higher inter-annotator agreement. Motivating the
postulation of our scheme, we then present the MaNeCo corpus on which it will
eventually be used; a substantial corpus of on-line newspaper comments spanning
10 years.
Comment: 10 pages, 1 table. Appears in Proceedings of the 12th edition of the Language Resources and Evaluation Conference (LREC'20)