Automatic Translation of Hate Speech to Non-hate Speech in Social Media Texts
In this paper, we investigate the issue of hate speech by presenting a novel
task of translating hate speech into non-hate speech text while preserving its
meaning. As a case study, we use Spanish texts. We provide a dataset and
several baselines as a starting point for further research on the task. We
evaluate our baseline results using multiple metrics, including BLEU scores.
The aim of this study is to contribute to the development of more effective
methods for reducing the spread of hate speech in online communities.
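As a rough illustration of the evaluation setup, the sketch below scores system outputs against references with corpus-level BLEU using the sacrebleu Python library; the sentences are invented for illustration and are not from the released dataset.

# Minimal sketch: corpus-level BLEU scoring with sacrebleu.
# The hypothesis/reference sentences are illustrative placeholders only.
import sacrebleu

# Outputs of a (hypothetical) hate-speech-to-non-hate-speech rewriting model.
hypotheses = [
    "No estoy de acuerdo con las ideas de ese grupo.",
    "Prefiero no hablar con esa persona.",
]
# One set of human-written non-hate references, aligned with the hypotheses.
references = [[
    "No comparto las ideas de ese grupo.",
    "Prefiero evitar a esa persona.",
]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")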
First Attempt at Building Parallel Corpora for Machine Translation of Northeast India's Very Low-Resource Languages
This paper presents the creation of initial bilingual corpora for thirteen
very low-resource languages of India, all from Northeast India, along with the
results of initial translation efforts in these languages. It creates the
first-ever parallel corpora for these languages and provides initial benchmark
neural machine translation results for them. We intend to extend these corpora
to include a large number of low-resource Indian languages and to integrate the
effort with our prior work on African and American-Indian languages, creating
corpora that cover a large number of languages from across the world.
Comment: Accepted to ICON 202
Enhancing Translation for Indigenous Languages: Experiments with Multilingual Models
This paper describes CIC NLP's submission to the AmericasNLP 2023 Shared Task
on machine translation systems for indigenous languages of the Americas. We
present the system descriptions for three methods. We used two multilingual
models, namely M2M-100 and mBART50, and one bilingual (one-to-one) model, the
Helsinki-NLP Spanish-English translation model, and experimented with different
transfer learning setups. We experimented with 11 indigenous languages of the
Americas and report the setups we used as well as the results we achieved.
Overall, the mBART setup was able to improve upon the baseline for three out of
the eleven languages.
Comment: Accepted to Third Workshop on NLP for Indigenous Languages of the Americas
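For orientation, the sketch below shows how one of the multilingual models mentioned above, M2M-100, is driven for translation through the Hugging Face transformers API; the checkpoint name and the Spanish-to-English direction are illustrative choices, and the indigenous target languages require additional fine-tuning since they are not in the pretrained language inventory.

# Sketch: translation with a pretrained M2M-100 checkpoint via transformers.
# Checkpoint and language pair are illustrative; the indigenous languages are
# not covered by the pretrained model and need fine-tuning on parallel data.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model_name = "facebook/m2m100_418M"  # smaller public checkpoint, for illustration
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

tokenizer.src_lang = "es"
encoded = tokenizer("El clima está cambiando rápidamente.", return_tensors="pt")
generated = model.generate(
    **encoded, forced_bos_token_id=tokenizer.get_lang_id("en")
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])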
Parallel Corpus for Indigenous Language Translation: Spanish-Mazatec and Spanish-Mixtec
In this paper, we present a parallel Spanish-Mazatec and Spanish-Mixtec
corpus for machine translation (MT) tasks, where Mazatec and Mixtec are two
indigenous Mexican languages. We evaluated the usability of the collected
corpus using three different approaches: transformer, transfer learning, and
fine-tuning pre-trained multilingual MT models. Fine-tuning the Facebook
M2M100-48 model outperformed the other approaches, with BLEU scores of 12.09
and 22.25 for Mazatec-Spanish and Spanish-Mazatec translations, respectively,
and 16.75 and 22.15 for Mixtec-Spanish and Spanish-Mixtec translations,
respectively. The findings show that the dataset size (9,799 sentences in
Mazatec and 13,235 sentences in Mixtec) affects translation performance and
that translation quality is higher when the indigenous languages are used as
the target languages. The findings emphasize the importance of creating
parallel corpora for indigenous languages and of fine-tuning models for
low-resource translation tasks. Future research will investigate zero-shot and
few-shot learning approaches to further improve translation performance in
low-resource settings. The dataset and scripts are available at
\url{https://github.com/atnafuatx/Machine-Translation-Resources}.
Comment: Accepted to Third Workshop on NLP for Indigenous Languages of the Americas
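A rough sketch of the best-performing approach above, fine-tuning M2M-100 with the Hugging Face Seq2SeqTrainer, is given below; the sentence pairs, language-code handling, and hyperparameters are placeholders rather than the paper's actual configuration, since Mazatec and Mixtec have no codes in the pretrained M2M-100 inventory.

# Sketch: fine-tuning M2M-100 on a small parallel corpus with transformers.
# Data, language-code handling, and hyperparameters are placeholders and do
# not reproduce the paper's setup.
from datasets import Dataset
from transformers import (
    DataCollatorForSeq2Seq,
    M2M100ForConditionalGeneration,
    M2M100Tokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

# Hypothetical parallel pairs: Spanish source, indigenous-language target.
pairs = {"src": ["Buenos días.", "¿Cómo estás?"],
         "tgt": ["<target sentence 1>", "<target sentence 2>"]}
dataset = Dataset.from_dict(pairs)

tokenizer.src_lang = "es"
tokenizer.tgt_lang = "es"  # stand-in code: M2M-100 has no Mazatec/Mixtec tag

def preprocess(batch):
    model_inputs = tokenizer(batch["src"], truncation=True, max_length=128)
    labels = tokenizer(text_target=batch["tgt"], truncation=True, max_length=128)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=["src", "tgt"])

args = Seq2SeqTrainingArguments(
    output_dir="m2m100-es-indigenous",  # placeholder output directory
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=5e-5,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()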
AfriNames: Most ASR models "butcher" African Names
Useful conversational agents must accurately capture named entities to
minimize error for downstream tasks, for example, asking a voice assistant to
play a track from a certain artist, initiating navigation to a specific
location, or documenting a laboratory result for a patient. However, where
named entities such as "Ukachukwu" (Igbo), "Lakicia" (Swahili), or
"Ingabire" (Rwandan) are spoken, automatic speech recognition (ASR) models'
performance degrades significantly, propagating errors to downstream systems.
We model this problem as a distribution shift and demonstrate that such model
bias can be mitigated through multilingual pre-training, intelligent data
augmentation strategies to increase the representation of African-named
entities, and fine-tuning multilingual ASR models on multiple African accents.
The resulting fine-tuned models show an 81.5\% relative WER improvement
compared with the baseline on samples with African-named entities.
Comment: Accepted at Interspeech 2023 (Main Conference)
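For reference, word error rate and the relative improvement reported above can be computed as in the sketch below, which uses the jiwer library; the transcripts are invented for illustration.

# Sketch: WER and relative WER improvement with jiwer (invented transcripts).
import jiwer

reference     = "please play the new song by ukachukwu"
baseline_hyp  = "please play the new song by you catch you"
finetuned_hyp = "please play the new song by ukachukwu"

baseline_wer  = jiwer.wer(reference, baseline_hyp)
finetuned_wer = jiwer.wer(reference, finetuned_hyp)

# Relative improvement: the fraction of the baseline's error that is removed.
relative_improvement = (baseline_wer - finetuned_wer) / baseline_wer
print(f"baseline WER={baseline_wer:.2f}, fine-tuned WER={finetuned_wer:.2f}, "
      f"relative improvement={relative_improvement:.1%}")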
Adapting Pretrained ASR Models to Low-resource Clinical Speech using Epistemic Uncertainty-based Data Selection
While there has been significant progress in ASR, African-accented clinical
ASR has been understudied due to a lack of training datasets. Building robust
ASR systems in this domain requires large amounts of annotated or labeled data,
for a wide variety of linguistically and morphologically rich accents, which
are expensive to create. Our study aims to address this problem by reducing
annotation expenses through informative uncertainty-based data selection. We
show that incorporating epistemic uncertainty into our adaptation rounds
outperforms several baseline results, established using state-of-the-art (SOTA)
ASR models, while reducing the required amount of labeled data, and hence
reducing annotation costs. Our approach also improves out-of-distribution
generalization for very low-resource accents, demonstrating the viability of
our approach for building generalizable ASR models in the context of accented
African clinical ASR, where training datasets are predominantly scarce.
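The general idea of uncertainty-based data selection can be sketched as below, using Monte Carlo dropout as one common way to estimate epistemic uncertainty; the model interface and scoring are placeholders and do not reproduce the paper's exact procedure.

# Sketch: pick the unlabeled utterances the model is most uncertain about, so
# annotation effort goes to the most informative samples. Monte Carlo dropout
# is one generic way to approximate epistemic uncertainty; the paper's exact
# scoring procedure may differ.
import torch

def epistemic_uncertainty(model, inputs, n_passes: int = 10) -> torch.Tensor:
    """Variance of per-utterance confidence across stochastic forward passes."""
    model.train()  # keep dropout layers active for MC sampling
    scores = []
    with torch.no_grad():
        for _ in range(n_passes):
            logits = model(inputs)                 # placeholder forward pass
            probs = torch.softmax(logits, dim=-1)
            # Mean max-probability over time steps as a simple confidence score.
            scores.append(probs.max(dim=-1).values.mean(dim=-1))
    return torch.stack(scores).var(dim=0)          # higher variance = more uncertain

def select_for_annotation(model, unlabeled_batches, budget: int) -> torch.Tensor:
    """Indices of the `budget` most uncertain utterances across all batches."""
    uncertainty = torch.cat([epistemic_uncertainty(model, b) for b in unlabeled_batches])
    return torch.topk(uncertainty, k=budget).indices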
AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages
In recent years, multilingual pre-trained language models have gained
prominence due to their remarkable performance on numerous downstream Natural
Language Processing (NLP) tasks. However, pre-training these large multilingual
language models requires a lot of training data, which is not available for
African languages. Active learning is a semi-supervised learning algorithm in
which a model consistently and dynamically learns to identify the most
beneficial samples to train itself on, in order to achieve better optimization
and performance on downstream tasks. Furthermore, active learning effectively
and practically addresses real-world data scarcity. Despite all its benefits,
active learning, in the context of NLP and especially multilingual language
models pretraining, has received little consideration. In this paper, we
present AfroLM, a multilingual language model pretrained from scratch on 23
African languages (the largest effort to date) using our novel self-active
learning framework. Pretrained on a dataset significantly (14x) smaller than
existing baselines, AfroLM outperforms many multilingual pretrained language
models (AfriBERTa, XLMR-base, mBERT) on various NLP downstream tasks (NER, text
classification, and sentiment analysis). Additional out-of-domain sentiment
analysis experiments show that \textbf{AfroLM} is able to generalize well
across various domains. We release the source code and the datasets used in our
framework at https://github.com/bonaventuredossou/MLM_AL.
Comment: Third Workshop on Simple and Efficient Natural Language Processing, EMNLP 202
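As a schematic view of self-active learning during pretraining (a generic illustration, not AfroLM's actual framework; the scoring and training callbacks are placeholders), each round trains the model, scores the remaining pool, and promotes the most informative samples into the training set:

# Schematic active-learning pretraining loop (generic illustration, not the
# AfroLM framework). score_fn and train_fn are caller-supplied placeholders.
def active_pretraining(model, pool, initial_set, score_fn, train_fn,
                       rounds: int = 5, per_round: int = 10_000):
    """score_fn(model, sample) -> informativeness; train_fn(model, data) trains in place."""
    train_set = list(initial_set)
    pool = list(pool)
    for _ in range(rounds):
        train_fn(model, train_set)
        # Rank remaining texts by how poorly the model currently handles them,
        # e.g. a high masked-language-modeling loss.
        pool.sort(key=lambda s: score_fn(model, s), reverse=True)
        selected, pool = pool[:per_round], pool[per_round:]
        train_set.extend(selected)
    return model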
The Effect of Normalization for Bi-directional Amharic-English Neural Machine Translation
Machine translation (MT) is one of the main tasks in natural language
processing whose objective is to translate texts automatically from one natural
language to another. Nowadays, using deep neural networks for MT tasks has
received great attention. These networks require large amounts of data to learn
abstract representations of the input and store them in continuous vectors. This paper
presents the first relatively large-scale Amharic-English parallel sentence
dataset. Using these compiled data, we build bi-directional Amharic-English
translation models by fine-tuning the existing Facebook M2M100 pre-trained
model, achieving BLEU scores of 37.79 for Amharic-English and 32.74 for
English-Amharic translation. Additionally, we explore the effects of Amharic
homophone normalization on the machine translation task. The results show that
the normalization of Amharic homophone characters increases the performance of
Amharic-English machine translation in both directions.
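The homophone normalization step can be sketched as a simple character mapping; the table below collapses a few of the well-known homophone series to one canonical character and is only a partial, illustrative mapping (a full normalizer covers every vowel order of each series).

# Sketch: normalize Amharic homophone characters to one canonical form before
# training/evaluation. Partial, illustrative mapping; a complete normalizer
# covers all orders (vowel forms) of each homophone series.
HOMOPHONE_MAP = str.maketrans({
    "ሐ": "ሀ", "ኀ": "ሀ",  # "ha" series collapse to ሀ
    "ሠ": "ሰ",             # "se" series collapses to ሰ
    "ዐ": "አ",             # glottal "a" series collapses to አ
    "ፀ": "ጸ",             # "tse" series collapses to ጸ
})

def normalize_homophones(text: str) -> str:
    return text.translate(HOMOPHONE_MAP)

print(normalize_homophones("ዐለም"))  # -> "አለም" ("world")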
Masakhane-Afrisenti at SemEval-2023 Task 12: Sentiment Analysis using Afro-centric Language Models and Adapters for Low-resource African Languages
This paper describes our submission to the AfriSenti-SemEval Shared Task 12 of
SemEval-2023. The task aims to perform monolingual sentiment classification
(sub-task A) for 12 African languages, multilingual sentiment classification
(sub-task B), and zero-shot sentiment classification (sub-task C). For sub-task
A, we conducted experiments using classical machine learning classifiers,
Afro-centric language models, and language-specific models. For sub-task B, we
fine-tuned multilingual pre-trained language models that support many of the
languages in the task. For sub-task C, we made use of a parameter-efficient
Adapter approach that leverages
monolingual texts in the target language for effective zero-shot transfer. Our
findings suggest that using pre-trained Afro-centric language models improves
performance for low-resource African languages. We also ran experiments using
adapters for zero-shot tasks, and the results suggest that we can obtain
promising results by using adapters with a limited amount of resources.
Comment: SemEval 2023
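A minimal sketch of the adapter-based zero-shot setup, in the spirit of MAD-X and written against the adapters (formerly adapter-transformers) library, is shown below; the adapter identifiers are placeholders that would have to be replaced with real pretrained language adapters, and this is not the exact configuration used in the submission.

# Sketch: MAD-X-style zero-shot transfer with the `adapters` library. A task
# adapter trained on a source language is stacked on a swappable language
# adapter at inference time. Adapter identifiers below are placeholders.
from adapters import AutoAdapterModel
from adapters.composition import Stack

model = AutoAdapterModel.from_pretrained("xlm-roberta-base")

# Pretrained language adapters (placeholders: substitute real Hub identifiers).
src_lang = model.load_adapter("<source-language-adapter>")
tgt_lang = model.load_adapter("<target-language-adapter>")

# Sentiment task adapter and classification head, trained on source-language data.
model.add_adapter("sentiment")
model.add_classification_head("sentiment", num_labels=3)
model.train_adapter("sentiment")                    # freeze everything else
model.active_adapters = Stack(src_lang, "sentiment")
# ... train with the usual Trainer loop on labeled source-language data ...

# Zero-shot inference: swap in the target-language adapter, keep the task adapter.
model.active_adapters = Stack(tgt_lang, "sentiment")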
The African Stopwords project: curating stopwords for African languages
Stopwords are fundamental in Natural Language Processing (NLP) techniques for information retrieval. One of the common tasks in preprocessing of text data is the removal of stopwords. Currently, while high-resource languages like English benefit from the availability of several stopword lists, low-resource languages, such as those found on the African continent, have none that are standardized and available for use in NLP packages. Stopwords in the context of African languages are understudied and can reveal information about the crossover between languages. The \textit{African Stopwords} project aims to study and curate stopwords for African languages. In this paper, we present our current progress on ten African languages as well as future plans for the project.
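Once such lists exist, the preprocessing step the project targets is straightforward; the sketch below removes stopwords against a small, hypothetical Yoruba word set that stands in for a curated list.

# Sketch: stopword removal against a curated list. The words below are a
# hypothetical stand-in, not the project's curated Yoruba stopword list.
YORUBA_STOPWORDS = {"ni", "ti", "si", "fun", "pe"}

def remove_stopwords(text: str, stopwords: set[str]) -> str:
    return " ".join(tok for tok in text.split() if tok.lower() not in stopwords)

print(remove_stopwords("mo ti lọ si ilé", YORUBA_STOPWORDS))  # -> "mo lọ ilé"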