16 research outputs found

Automatic Translation of Hate Speech to Non-hate Speech in Social Media Texts

Author: Kolesnikova Olga
Kostiuk Yevhen
Sidorov Grigori
Tonja Atnafu Lambebo
Publication date: 02/06/2023

In this paper, we investigate the issue of hate speech by presenting a novel task of translating hate speech into non-hate speech text while preserving its meaning. As a case study, we use Spanish texts. We provide a dataset and several baselines as a starting point for further research in the task. We evaluated our baseline results using multiple metrics, including BLEU scores. The aim of this study is to contribute to the development of more effective methods for reducing the spread of hate speech in online communities.

arXiv.org e-Print Archive

First Attempt at Building Parallel Corpora for Machine Translation of Northeast India's Very Low-Resource Languages

Author: Kalita Ananya
Kalita Jugal
Kolesnikova Olga
Mersha Melkamu
Tonja Atnafu Lambebo
Publication date: 07/12/2023

This paper presents the creation of initial bilingual corpora for thirteen very low-resource languages of India, all from Northeast India. It also presents the results of initial translation efforts in these languages. It creates the first-ever parallel corpora for these languages and provides initial benchmark neural machine translation results for these languages. We intend to extend these corpora to include a large number of low-resource Indian languages and integrate the effort with our prior work with African and American-Indian languages to create corpora covering a large number of languages from across the world. Comment: Accepted to ICON 202

arXiv.org e-Print Archive

Enhancing Translation for Indigenous Languages: Experiments with Multilingual Models

Author: Gelbukh Alexander
Kalita Jugal
Kolesnikova Olga
Nigatu Hellina Hailu
Sidorov Grigori
Tonja Atnafu Lambebo
Publication date: 27/05/2023

This paper describes CIC NLP's submission to the AmericasNLP 2023 Shared Task on machine translation systems for indigenous languages of the Americas. We present the system descriptions for three methods. We used two multilingual models, namely M2M-100 and mBART50, and one bilingual (one-to-one) model, the Helsinki NLP Spanish-English translation model, and experimented with different transfer learning setups. We experimented with 11 languages of the Americas and report the setups we used as well as the results we achieved. Overall, the mBART setup was able to improve upon the baseline for three out of the eleven languages. Comment: Accepted to Third Workshop on NLP for Indigenous Languages of the America

arXiv.org e-Print Archive

Parallel Corpus for Indigenous Language Translation: Spanish-Mazatec and Spanish-Mixtec

Author: Castillo David Alejandro Mendoza
Castro-Sánchez Noé
Gelbukh Alexander
Kolesnikova Olga
Maldonado-Sifuentes Christian
Sidorov Grigori
Tonja Atnafu Lambebo
Publication date: 27/05/2023

In this paper, we present a parallel Spanish-Mazatec and Spanish-Mixtec corpus for machine translation (MT) tasks, where Mazatec and Mixtec are two indigenous Mexican languages. We evaluated the usability of the collected corpus using three different approaches: transformer, transfer learning, and fine-tuning pre-trained multilingual MT models. Fine-tuning the Facebook M2M100-48 model outperformed the other approaches, with BLEU scores of 12.09 and 22.25 for Mazatec-Spanish and Spanish-Mazatec translations, respectively, and 16.75 and 22.15 for Mixtec-Spanish and Spanish-Mixtec translations, respectively. The findings show that the dataset size (9,799 sentences in Mazatec and 13,235 sentences in Mixtec) affects translation performance and that indigenous languages work better when used as target languages. The findings emphasize the importance of creating parallel corpora for indigenous languages and fine-tuning models for low-resource translation tasks. Future research will investigate zero-shot and few-shot learning approaches to further improve translation performance in low-resource settings. The dataset and scripts are available at https://github.com/atnafuatx/Machine-Translation-Resources. Comment: Accepted to Third Workshop on NLP for Indigenous Languages of the America

arXiv.org e-Print Archive

AfriNames: Most ASR models "butcher" African Names

Author: Afonja Tejumade
Dossou Bonaventure F. P.
Emezue Chris Chinenye
Olatunji Tobi
Rufai Amina Mardiyyah
Singh Sahib
Tonja Atnafu Lambebo
Publication date: 02/06/2023

Useful conversational agents must accurately capture named entities to minimize error for downstream tasks, for example, asking a voice assistant to play a track from a certain artist, initiating navigation to a specific location, or documenting a laboratory result for a patient. However, where named entities such as "Ukachukwu" (Igbo), "Lakicia" (Swahili), or "Ingabire" (Rwandan) are spoken, automatic speech recognition (ASR) models' performance degrades significantly, propagating errors to downstream systems. We model this problem as a distribution shift and demonstrate that such model bias can be mitigated through multilingual pre-training, intelligent data augmentation strategies to increase the representation of African-named entities, and fine-tuning multilingual ASR models on multiple African accents. The resulting fine-tuned models show an 81.5% relative WER improvement compared with the baseline on samples with African-named entities. Comment: Accepted at Interspeech 2023 (Main Conference

arXiv.org e-Print Archive

Adapting Pretrained ASR Models to Low-resource Clinical Speech using Epistemic Uncertainty-based Data Selection

Author: Adewumi Tosin
Dossou Bonaventure F. P.
Emezue Chris Chinenye
Etori Naome A
Olatunji Tobi
Osei Salomey
Singh Sahib
Tonja Atnafu Lambebo
Publication date: 03/06/2023

While there has been significant progress in ASR, African-accented clinical ASR has been understudied due to a lack of training datasets. Building robust ASR systems in this domain requires large amounts of annotated or labeled data, for a wide variety of linguistically and morphologically rich accents, which are expensive to create. Our study aims to address this problem by reducing annotation expenses through informative uncertainty-based data selection. We show that incorporating epistemic uncertainty into our adaptation rounds outperforms several baseline results, established using state-of-the-art (SOTA) ASR models, while reducing the required amount of labeled data, and hence reducing annotation costs. Our approach also improves out-of-distribution generalization for very low-resource accents, demonstrating the viability of our approach for building generalizable ASR models in the context of accented African clinical ASR, where training datasets are predominantly scarce.

arXiv.org e-Print Archive

AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages

Author: Awoyomi Oluwabusayo Olufunke
Dossou Bonaventure F. P.
Emezue Chris Chinenye
Oppong Abigail
Osei Salomey
Shode Iyanuoluwa
Tonja Atnafu Lambebo
Yousuf Oreen
Publication date: 23/11/2022

In recent years, multilingual pre-trained language models have gained prominence due to their remarkable performance on numerous downstream Natural Language Processing (NLP) tasks. However, pre-training these large multilingual language models requires a lot of training data, which is not available for African languages. Active learning is a semi-supervised learning algorithm, in which a model consistently and dynamically learns to identify the most beneficial samples to train itself on, in order to achieve better optimization and performance on downstream tasks. Furthermore, active learning effectively and practically addresses real-world data scarcity. Despite all its benefits, active learning, in the context of NLP and especially multilingual language model pretraining, has received little consideration. In this paper, we present AfroLM, a multilingual language model pretrained from scratch on 23 African languages (the largest effort to date) using our novel self-active learning framework. Pretrained on a dataset significantly (14x) smaller than existing baselines, AfroLM outperforms many multilingual pretrained language models (AfriBERTa, XLMR-base, mBERT) on various NLP downstream tasks (NER, text classification, and sentiment analysis). Additional out-of-domain sentiment analysis experiments show that AfroLM is able to generalize well across various domains. We release the source code and the datasets used in our framework at https://github.com/bonaventuredossou/MLM_AL. Comment: Third Workshop on Simple and Efficient Natural Language Processing, EMNLP 202

arXiv.org e-Print Archive

The Effect of Normalization for Bi-directional Amharic-English Neural Machine Translation

Author: Ayele Abinew Ali
Belay Tadesse Destaw
Gelbukh Alexander
Haile Silesh Bogale
Kolesnikova Olga
Sidorov Grigori
Tonja Atnafu Lambebo
Yimam Seid Muhie
Publication date: 27/10/2022

Machine translation (MT) is one of the main tasks in natural language processing whose objective is to translate texts automatically from one natural language to another. Nowadays, using deep neural networks for MT tasks has received great attention. These networks require lots of data to learn abstract representations of the input and store it in continuous vectors. This paper presents the first relatively large-scale Amharic-English parallel sentence dataset. Using these compiled data, we build bi-directional Amharic-English translation models by fine-tuning the existing Facebook M2M100 pre-trained model, achieving a BLEU score of 37.79 for Amharic-English and 32.74 for English-Amharic translation. Additionally, we explore the effects of Amharic homophone normalization on the machine translation task. The results show that the normalization of Amharic homophone characters increases the performance of Amharic-English machine translation in both directions.

arXiv.org e-Print Archive

Masakhane-Afrisenti at SemEval-2023 Task 12: Sentiment Analysis using Afro-centric Language Models and Adapters for Low-resource African Languages

Author: Adewumi Tosin
Al-Azzawi Sana Sabah
Alabi Jesujoba
Awokoya Ayodele
Awosan Oyinkansola
Azime Israel Abebe
Fanijo Samuel
Oduwole Mardiyyah
Shode Iyanuoluwa
Tonja Atnafu Lambebo
Yousuf Oreen
Publication date: 13/04/2023

AfriSenti-SemEval Shared Task 12 of SemEval-2023. The task aims to perform monolingual sentiment classification (sub-task A) for 12 African languages, multilingual sentiment classification (sub-task B), and zero-shot sentiment classification (sub-task C). For sub-task A, we conducted experiments using classical machine learning classifiers, Afro-centric language models, and language-specific models. For sub-task B, we fine-tuned multilingual pre-trained language models that support many of the languages in the task. For sub-task C, we used a parameter-efficient Adapter approach that leverages monolingual texts in the target language for effective zero-shot transfer. Our findings suggest that using pre-trained Afro-centric language models improves performance for low-resource African languages. We also ran experiments using adapters for zero-shot tasks, and the results suggest that we can obtain promising results by using adapters with a limited amount of resources. Comment: SemEval 202

arXiv.org e-Print Archive

The African Stopwords project: curating stopwords for African languages

Author: Abdulmumin Idris
Aina Kaosarat
Ajibade Benjamin
Chukwuneke Chiamaka
David Davis
Dossou Bonaventure F. P.
Emezue Chris
Emezue Handel
Emmanuel Mbonu Chinedu
Etori Naome A.
Ige Ifeoluwatayo A.
Joshua Oviawe
Louis Lerato
Muhammad Shamsuddeen
Nigatu Hellina
Onwuegbuzia Emeka
Oyerinde Samuel
Samuel Olanrewaju
Thinwa Cynthia
Tonja Atnafu Lambebo
Yousuf Oreen
Zhou Helper
Publication venue: 'Center for Open Science'
Publication date: 21/03/2023

Stopwords are fundamental in Natural Language Processing (NLP) techniques for information retrieval. One of the common tasks in preprocessing of text data is the removal of stopwords. Currently, while high-resource languages like English benefit from the availability of several stopwords, low-resource languages, such as those found in the African continent, have none that are standardized and available for use in NLP packages. Stopwords in the context of African languages are understudied and can reveal information about the crossover between languages. The African Stopwords project aims to study and curate stopwords for African languages. In this paper, we present our current progress on ten African languages as well as future plans for the project.

Lancaster E-Prints