FEED PETs: Further Experimentation and Expansion on the Disambiguation of Potentially Euphemistic Terms
Transformers have been shown to work well for the task of English euphemism
disambiguation, in which a potentially euphemistic term (PET) is classified as
euphemistic or non-euphemistic in a particular context. In this study, we
expand on the task in two ways. First, we annotate PETs for vagueness, a
linguistic property associated with euphemisms, and find that transformers are
generally better at classifying vague PETs, suggesting linguistic differences
in the data that impact performance. Second, we present novel euphemism corpora
in three different languages: Yoruba, Spanish, and Mandarin Chinese. We perform
euphemism disambiguation experiments in each language using multilingual
transformer models mBERT and XLM-RoBERTa, establishing preliminary results from
which to launch future work.
AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages
In recent years, multilingual pre-trained language models have gained
prominence due to their remarkable performance on numerous downstream Natural
Language Processing (NLP) tasks. However, pre-training these large multilingual
language models requires a lot of training data, which is not available for
African Languages. Active learning is a semi-supervised learning algorithm, in
which a model consistently and dynamically learns to identify the most
beneficial samples to train itself on, in order to achieve better optimization
and performance on downstream tasks. Furthermore, active learning effectively
and practically addresses real-world data scarcity. Despite all its benefits,
active learning, in the context of NLP and especially multilingual language
models pretraining, has received little consideration. In this paper, we
present AfroLM, a multilingual language model pretrained from scratch on 23
African languages (the largest effort to date) using our novel self-active
learning framework. Pretrained on a dataset significantly (14x) smaller than
existing baselines, AfroLM outperforms many multilingual pretrained language
models (AfriBERTa, XLMR-base, mBERT) on various NLP downstream tasks (NER, text
classification, and sentiment analysis). Additional out-of-domain sentiment
analysis experiments show that AfroLM is able to generalize well
across various domains. We release the source code and the datasets used in
our framework at https://github.com/bonaventuredossou/MLM_AL.
Comment: Third Workshop on Simple and Efficient Natural Language Processing,
EMNLP 202
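The AfroLM abstract above describes active learning only in general terms: a model picks the samples it would benefit most from training on. As a rough illustration, one common generic selection rule is uncertainty sampling by prediction entropy. This is a sketch under that assumption, not AfroLM's self-active learning framework; the function names `entropy` and `select_most_uncertain` and the toy probabilities are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_uncertain(predictions, k):
    """Return the indices of the k samples with the highest predictive entropy."""
    ranked = sorted(
        range(len(predictions)),
        key=lambda i: entropy(predictions[i]),
        reverse=True,
    )
    return ranked[:k]

# Made-up class distributions predicted for four unlabeled samples.
pool = [
    [0.98, 0.01, 0.01],  # confident prediction, low entropy
    [0.34, 0.33, 0.33],  # near-uniform prediction, high entropy
    [0.70, 0.20, 0.10],
    [0.50, 0.49, 0.01],
]
chosen = select_most_uncertain(pool, k=2)
print(chosen)
```

The selected samples would then be added to the training set for the next round; how AfroLM scores and selects samples is documented in the linked repository rather than in this abstract.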
MEDs for PETs: Multilingual Euphemism Disambiguation for Potentially Euphemistic Terms
This study investigates the computational processing of euphemisms, a
universal linguistic phenomenon, across multiple languages. We train a
multilingual transformer model (XLM-RoBERTa) to disambiguate potentially
euphemistic terms (PETs) in multilingual and cross-lingual settings. In line
with current trends, we demonstrate that zero-shot learning across languages
takes place. We also show cases where multilingual models perform better on the
task compared to monolingual models by a statistically significant margin,
indicating that multilingual data presents additional opportunities for models
to learn about cross-lingual, computational properties of euphemisms. In a
follow-up analysis, we focus on universal euphemistic "categories" such as
death and bodily functions among others. We test to see whether cross-lingual
data of the same domain is more important than within-language data of other
domains to further understand the nature of the cross-lingual transfer.
Masakhane-Afrisenti at SemEval-2023 Task 12: Sentiment Analysis using Afro-centric Language Models and Adapters for Low-resource African Languages
AfriSenti-SemEval is Shared Task 12 of SemEval-2023. The task aims to perform
monolingual sentiment classification (sub-task A) for 12 African languages,
multilingual sentiment classification (sub-task B), and zero-shot sentiment
classification (sub-task C). For sub-task A, we conducted experiments using
classical machine learning classifiers, Afro-centric language models, and
language-specific models. For sub-task B, we fine-tuned multilingual pre-trained
language models that support many of the languages in the task. For sub-task C, we
used a parameter-efficient Adapter approach that leverages
monolingual texts in the target language for effective zero-shot transfer. Our
findings suggest that using pre-trained Afro-centric language models improves
performance for low-resource African languages. We also ran experiments using
adapters for zero-shot tasks, and the results suggest that we can obtain
promising results by using adapters with a limited amount of resources.
Comment: SemEval 202
MasakhaNEWS: News Topic Classification for African languages
African languages are severely under-represented in NLP research due to lack
of datasets covering several NLP tasks. While there are individual language
specific datasets that are being expanded to different tasks, only a handful of
NLP tasks (e.g. named entity recognition and machine translation) have
standardized benchmark datasets covering several geographically and
typologically diverse African languages. In this paper, we develop MasakhaNEWS
-- a new benchmark dataset for news topic classification covering 16 languages
widely spoken in Africa. We provide an evaluation of baseline models by
training classical machine learning models and fine-tuning several language
models. Furthermore, we explore several alternatives to full fine-tuning of
language models that are better suited for zero-shot and few-shot learning such
as cross-lingual parameter-efficient fine-tuning (like MAD-X), pattern
exploiting training (PET), prompting language models (like ChatGPT), and
prompt-free sentence transformer fine-tuning (SetFit and Cohere Embedding API).
Our evaluation in the zero-shot setting shows the potential of prompting ChatGPT
for news topic classification in low-resource African languages, achieving an
average performance of 70 F1 points without leveraging additional supervision
like MAD-X. In the few-shot setting, we show that with as few as 10 examples per
label, we achieved more than 90% (i.e. 86.0 F1 points) of the performance of
full supervised training (92.6 F1 points) leveraging the PET approach.
Comment: Accepted to IJCNLP-AACL 2023 (main conference
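The "more than 90%" figure in the MasakhaNEWS abstract can be checked directly from the two F1 scores it reports. The short sketch below only redoes that arithmetic; the variable names are illustrative.

```python
# F1 scores as reported in the abstract.
few_shot_f1 = 86.0    # PET with 10 examples per label
full_f1 = 92.6        # full supervised training

# Fraction of fully supervised performance reached in the few-shot setting.
ratio = few_shot_f1 / full_f1
print(f"{ratio:.1%}")
```

The ratio comes out just under 93%, consistent with the abstract's claim.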
AfriQA: Cross-lingual Open-Retrieval Question Answering for African Languages
African languages have far less in-language content available digitally, making it challenging for question answering systems to satisfy the information needs of users. Cross-lingual open-retrieval question answering (XOR QA) systems -- those that retrieve answer content from other languages while serving people in their native language -- offer a means of filling this gap. To this end, we create AfriQA, the first cross-lingual QA dataset with a focus on African languages. AfriQA includes 12,000+ XOR QA examples across 10 African languages. While previous datasets have focused primarily on languages where cross-lingual QA augments coverage from the target language, AfriQA focuses on languages where cross-lingual answer content is the only high-coverage source of answer content. Because of this, we argue that African languages are one of the most important and realistic use cases for XOR QA. Our experiments demonstrate the poor performance of automatic translation and multilingual retrieval methods. Overall, AfriQA proves challenging for state-of-the-art QA models. We hope that the dataset enables the development of more equitable QA technology.
AfriMTE and AfriCOMET: Empowering COMET to Embrace Under-resourced African Languages
Despite the progress we have recorded in scaling multilingual machine translation (MT) models and evaluation data to several under-resourced African languages, it is difficult to measure accurately the progress we have made on these languages because evaluation is often performed with n-gram matching metrics like BLEU, which often correlate worse with human judgments. Embedding-based metrics such as COMET correlate better; however, lack of evaluation data with human ratings for under-resourced languages, complexity of annotation guidelines like Multidimensional Quality Metrics (MQM), and limited language coverage of multilingual encoders have hampered their applicability to African languages. In this paper, we address these challenges by creating high-quality human evaluation data with a simplified MQM guideline for error-span annotation and direct assessment (DA) scoring for 13 typologically diverse African languages. Furthermore, we develop AfriCOMET, a COMET evaluation metric for African languages, by leveraging DA training data from high-resource languages and an African-centric multilingual encoder (AfroXLM-Roberta) to create the state-of-the-art evaluation metric for African-language MT with respect to Spearman-rank correlation with human judgments (+0.406).
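The AfriCOMET abstract scores metrics by their Spearman-rank correlation with human judgments. As a minimal sketch of how that statistic is computed (Pearson correlation applied to ranks, with ties given their average rank), the following uses made-up metric and human scores; the helper names are hypothetical and this is not the paper's evaluation code.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    norm_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (norm_x * norm_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up scores for five translations: automatic metric vs. human DA rating.
metric_scores = [0.10, 0.40, 0.35, 0.80, 0.65]
human_scores = [20, 55, 40, 90, 95]
rho = spearman(metric_scores, human_scores)
print(round(rho, 3))  # 0.9
```

In this toy data the metric and the human raters disagree only on the order of the two best translations, so the correlation is high but not perfect.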
NollySenti: Leveraging Transfer Learning and Machine Translation for Nigerian Movie Sentiment Classification
Africa has over 2000 indigenous languages but they are under-represented in
NLP research due to lack of datasets. In recent years, there has been progress
in developing labeled corpora for African languages. However, they are often
available in a single domain and may not generalize to other domains. In this
paper, we focus on the task of sentiment classification for cross-domain
adaptation. We create a new dataset, NollySenti, based on Nollywood movie
reviews in five languages widely spoken in Nigeria (English, Hausa, Igbo,
Nigerian-Pidgin, and Yoruba). We provide an extensive empirical evaluation using
classical machine learning methods and pre-trained language models. Leveraging
transfer learning, we compare the performance of cross-domain adaptation from
the Twitter domain and cross-lingual adaptation from English. Our
evaluation shows that transfer from English in the same target domain leads to
more than 5% improvement in accuracy compared to transfer from Twitter in the
same language. To further mitigate the domain difference, we leverage machine
translation (MT) from English to other Nigerian languages, which leads to a
further improvement of 7% over cross-lingual evaluation. While MT to
low-resource languages is often of low quality, through human evaluation, we
show that most of the translated sentences preserve the sentiment of the
original English reviews.