Text Categorization Can Enhance Domain-Agnostic Stopword Extraction
This paper investigates the role of text categorization in streamlining
stopword extraction in natural language processing (NLP), specifically focusing
on nine African languages alongside French. By leveraging the MasakhaNEWS,
African Stopwords Project, and MasakhaPOS datasets, our findings emphasize that
text categorization effectively identifies domain-agnostic stopwords with over
80% detection success rate for most examined languages. Nevertheless,
linguistic variation results in lower detection rates for certain languages.
Interestingly, we find that while over 40% of stopwords are common across news
categories, less than 15% are unique to a single category. Uncommon stopwords
add depth to text, but their classification as stopwords depends on context.
Therefore, combining statistical and linguistic approaches creates
comprehensive stopword lists, highlighting the value of our hybrid method.
This research enhances NLP for African languages and underscores the
importance of text categorization in stopword extraction.
Comment: A Project Report for the Masakhane Research Community
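The overlap finding above (over 40% of stopwords shared across news categories, under 15% unique to one) suggests a simple analysis: rank tokens by frequency within each category and compare the resulting candidate sets. A minimal sketch of that idea follows; the data, function names, and top_k cutoff are illustrative, not the authors' code.

```python
from collections import Counter

def stopword_candidates(texts, top_k=100):
    """Rank tokens by frequency; the most frequent tokens are
    stopword candidates (a common statistical heuristic)."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    return {tok for tok, _ in counts.most_common(top_k)}

def overlap_report(texts_by_category, top_k=100):
    """Fraction of candidates shared by all categories vs. unique to one."""
    per_cat = {c: stopword_candidates(t, top_k)
               for c, t in texts_by_category.items()}
    all_cands = set().union(*per_cat.values())
    shared = set.intersection(*per_cat.values())
    unique = {tok for tok in all_cands
              if sum(tok in s for s in per_cat.values()) == 1}
    return len(shared) / len(all_cands), len(unique) / len(all_cands)

# Toy data standing in for MasakhaNEWS categories:
texts = {"sports": ["the team won the match"],
         "politics": ["the vote on the bill"],
         "health": ["the clinic and the vaccine"]}
shared_frac, unique_frac = overlap_report(texts, top_k=5)
print(f"shared: {shared_frac:.0%}, unique to one category: {unique_frac:.0%}")
```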
Masakhane-Afrisenti at SemEval-2023 Task 12: Sentiment Analysis using Afro-centric Language Models and Adapters for Low-resource African Languages
This paper presents our participation in the AfriSenti-SemEval Shared Task 12
of SemEval-2023. The task aims to perform monolingual sentiment classification
(sub-task A) for 12 African languages, multilingual sentiment classification
(sub-task B), and zero-shot sentiment classification (sub-task C). For
sub-task A, we conducted experiments using classical machine learning
classifiers, Afro-centric language models, and language-specific models. For
sub-task B, we fine-tuned multilingual pre-trained language models that
support many of the languages in the task. For sub-task C, we made use of a
parameter-efficient Adapter approach that leverages monolingual texts in the
target language for effective zero-shot transfer. Our findings suggest that
using pre-trained Afro-centric language models improves performance for
low-resource African languages. We also ran experiments using adapters for
zero-shot tasks, and the results suggest that we can obtain promising results
by using adapters with a limited amount of resources.
Comment: SemEval 2023
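The adapter approach mentioned for sub-task C is typically a small bottleneck module inserted into a frozen pre-trained model, so only a few parameters are trained per language or task. A minimal PyTorch sketch of such a module follows; the dimensions and training recipe are illustrative, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """A standard bottleneck adapter: project down, apply a nonlinearity,
    project back up, and add a residual connection. Only these few
    parameters are trained; the frozen backbone stays untouched."""
    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# Zero-shot recipe (MAD-X style): train a language adapter on monolingual
# target-language text, train a task adapter on a high-resource language,
# then stack the two at inference time. Shapes only:
adapter = BottleneckAdapter(hidden_size=768, bottleneck=64)
x = torch.randn(2, 16, 768)   # (batch, sequence, hidden)
print(adapter(x).shape)       # torch.Size([2, 16, 768])
```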
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
With the success of large-scale pre-training and multilingual modeling in
Natural Language Processing (NLP), recent years have seen a proliferation of
large, web-mined text datasets covering hundreds of languages. We manually
audit the quality of 205 language-specific corpora released with five major
public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource
corpora have systematic issues: At least 15 corpora have no usable text, and a
significant fraction contains less than 50% sentences of acceptable quality. In
addition, many are mislabeled or use nonstandard/ambiguous language codes. We
demonstrate that these issues are easy to detect even for non-proficient
speakers, and supplement the human audit with automatic analyses. Finally, we
recommend techniques to evaluate and improve multilingual corpora and discuss
potential risks that come with low-quality data releases.
Comment: Accepted at TACL; pre-MIT Press publication version
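The automatic analyses that supplement the human audit can be as simple as heuristic red-flag filters over each corpus line. The sketch below shows the general shape of such checks; the thresholds and heuristics are illustrative, not the audit's actual criteria.

```python
import re

def quality_flags(line, min_words=3, min_alpha_ratio=0.6):
    """Heuristic red flags for one corpus line. Thresholds are
    illustrative, not the audit's actual criteria."""
    flags = []
    if len(line.split()) < min_words:
        flags.append("too_short")
    alpha = sum(ch.isalpha() for ch in line)
    if alpha / max(len(line), 1) < min_alpha_ratio:
        flags.append("low_alpha_ratio")   # likely markup/boilerplate debris
    if re.search(r"(https?://|www\.)", line):
        flags.append("contains_url")
    return flags

def audit(corpus_lines):
    """Share of lines with at least one red flag, plus exact-duplicate rate."""
    flagged = sum(bool(quality_flags(l)) for l in corpus_lines)
    dupes = len(corpus_lines) - len(set(corpus_lines))
    n = max(len(corpus_lines), 1)
    return {"flagged": flagged / n, "duplicates": dupes / n}

print(audit(["A normal sentence in some language.",
             "%%% 123 @@@", "click here", "click here"]))
```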
The effect of domain and diacritics in Yorùbá-English neural machine translation
Massively multilingual machine translation (MT) has shown impressive capabilities, including zero- and few-shot translation between low-resource language pairs. However, these models are often evaluated on high-resource languages with the assumption that they generalize to low-resource ones. The difficulty of evaluating MT models on low-resource pairs is often due to the lack of standardized evaluation datasets. In this paper, we present MENYO-20k, the first multi-domain parallel corpus with a special focus on clean orthography for Yorùbá-English with standardized train-test splits for benchmarking. We provide several neural MT benchmarks and compare them to the performance of popular pre-trained (massively multilingual) MT models both for the heterogeneous test set and its subdomains. Since these pre-trained models use huge amounts of data with uncertain quality, we also analyze the effect of diacritics, a major characteristic of Yorùbá, in the training data. We investigate how and when this training condition affects the final quality and intelligibility of a translation. Our models outperform massively multilingual models such as Google (+8.7 BLEU) and Facebook M2M (+9.1 BLEU) when translating to Yorùbá, setting a high-quality benchmark for future research.
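The diacritics ablation described above can be reproduced in spirit with a few lines of standard-library Python: decompose text to Unicode NFD and drop combining marks. This is a simplification of whatever preprocessing the authors actually used, but it shows why the ablation matters; Yorùbá tone and under-dot marks are meaning-bearing, so stripping them simulates noisy, undiacritized training data.

```python
import unicodedata

def strip_diacritics(text):
    """Remove combining marks via Unicode NFD decomposition.
    In Yorùbá this collapses tonal and under-dot distinctions,
    simulating training data with inconsistent orthography."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed
                       if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)

print(strip_diacritics("Yorùbá"))   # -> "Yoruba"
```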
AfriQA: Cross-lingual Open-Retrieval Question Answering for African Languages
African languages have far less in-language content available digitally,
making it challenging for question answering systems to satisfy the information
needs of users. Cross-lingual open-retrieval question answering (XOR QA)
systems -- those that retrieve answer content from other languages while
serving people in their native language -- offer a means of filling this gap.
To this end, we create AfriQA, the first cross-lingual QA dataset with a focus
on African languages. AfriQA includes 12,000+ XOR QA examples across 10 African
languages. While previous datasets have focused primarily on languages where
cross-lingual QA augments coverage from the target language, AfriQA focuses on
languages where cross-lingual answer content is the only high-coverage source
of answer content. Because of this, we argue that African languages are one of
the most important and realistic use cases for XOR QA. Our experiments
demonstrate the poor performance of automatic translation and multilingual
retrieval methods. Overall, AfriQA proves challenging for state-of-the-art QA
models. We hope that the dataset enables the development of more equitable QA
technology.
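The core of an XOR QA system is retrieving answer content in a language other than the query's. A toy sketch of that retrieval step follows, using TF-IDF over an English passage pool with a query assumed to be already machine-translated; the data and variable names are illustrative, and real systems use stronger multilingual retrievers.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# English passage pool standing in for the cross-lingual answer source.
passages = [
    "Mount Kilimanjaro is the highest mountain in Africa.",
    "The Niger River flows through Guinea, Mali, Niger, and Nigeria.",
    "Nairobi is the capital city of Kenya.",
]

# In a full XOR QA pipeline the user's question would first be
# machine-translated from the source language into English;
# here we start from the already-translated query.
translated_query = "What is the highest mountain in Africa?"

vectorizer = TfidfVectorizer().fit(passages)
scores = cosine_similarity(vectorizer.transform([translated_query]),
                           vectorizer.transform(passages))[0]
best = scores.argmax()
print(f"top passage ({scores[best]:.2f}): {passages[best]}")
# A reader model would then extract the answer span and translate
# it back into the user's language.
```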
MasakhaNER: Named entity recognition for African languages
We take a step towards addressing the underrepresentation of the African continent in NLP research by bringing together different stakeholders to create the first large, publicly available, high-quality dataset for named entity recognition (NER) in ten African languages. We detail the characteristics of these languages to help researchers and practitioners better understand the challenges they pose for NER tasks. We analyze our datasets and conduct an extensive empirical evaluation of state-of-the-art methods across both supervised and transfer learning settings. Finally, we release the data, code, and models to inspire future research on African NLP.
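Since the paper releases models alongside the data, inference with a MasakhaNER-style checkpoint is a few lines with the Hugging Face pipeline API. The model identifier below is a placeholder, not an actual released checkpoint name; substitute one of the authors' published models.

```python
from transformers import pipeline

# Model identifier is a placeholder: substitute a checkpoint fine-tuned
# on MasakhaNER (the authors release theirs; see the paper's repository).
ner = pipeline("token-classification",
               model="your-org/your-masakhaner-checkpoint",
               aggregation_strategy="simple")

# Toy Yorùbá sentence; output groups tokens into typed entity spans.
for entity in ner("Ẹgbẹ́ Masakhane ṣe ìpàdé ní Lagos."):
    print(entity["entity_group"], entity["word"], round(entity["score"], 2))
```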