
    Intriguing Properties of Compression on Multilingual Models

    Multilingual models are often particularly dependent on scaling to generalize to a growing number of languages. Compression techniques are widely relied upon to reconcile the growth in model size with real-world resource constraints, but compression can have a disparate effect on model performance for low-resource languages. It is thus crucial to understand the trade-offs between scale, multilingualism, and compression. In this work, we propose an experimental framework to characterize the impact of sparsifying multilingual pre-trained language models during fine-tuning. Applying this framework to mBERT named entity recognition models across 40 languages, we find that compression confers several intriguing and previously unknown generalization properties. In contrast to prior findings, we find that compression may improve model robustness over dense models. We additionally observe that, under certain sparsification regimes, compression may aid, rather than disproportionately impact, the performance of low-resource languages.
    Comment: Accepted to EMNLP 2022
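    The sparsification regime described above can be approximated with plain magnitude pruning during fine-tuning. The sketch below is a minimal illustration assuming a Hugging Face mBERT token-classification model and PyTorch's built-in pruning utilities; it is not the paper's exact experimental framework.

```python
# Minimal sketch: magnitude-prune an mBERT NER model around fine-tuning.
# Assumes the `transformers` and `torch` packages; not the paper's exact setup.
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=9
)

# Prune 50% of the smallest-magnitude weights in every linear layer.
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)

# ... fine-tune on NER data as usual; the pruning masks keep the layers sparse ...

# Make the sparsity permanent before saving or evaluating.
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.remove(module, "weight")
```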

    That was the last straw, we need more: Are Translation Systems Sensitive to Disambiguating Context?

    The translation of ambiguous text presents a challenge for translation systems, as it requires using the surrounding context to disambiguate the intended meaning as much as possible. While prior work has studied ambiguities that result from different grammatical features of the source and target language, we study semantic ambiguities that exist in the source (English in this work) itself. In particular, we focus on idioms that are open to both literal and figurative interpretations (e.g., goose egg), and collect TIDE, a dataset of 512 pairs of English sentences containing idioms with disambiguating context such that one is literal (it laid a goose egg) and the other is figurative (they scored a goose egg, as in a score of zero). In experiments, we compare MT-specific models and language models for (i) their preference when given an ambiguous subsentence, (ii) their sensitivity to disambiguating context, and (iii) the performance disparity between figurative and literal source sentences. We find that current MT models consistently translate English idioms literally, even when the context suggests a figurative interpretation. On the other hand, LMs are far more context-aware, although there remain disparities across target languages. Our findings underline the potential of LMs as a strong backbone for context-aware translation.
    Comment: EMNLP 2023 Findings
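    A rough version of the first comparison (model preference given the idiom) can be run with any off-the-shelf MT model by translating the literal and figurative TIDE-style sentences and checking for a word-for-word rendering of the idiom. The sketch below assumes a Hugging Face Helsinki-NLP English-French model and a crude keyword heuristic; neither is the paper's actual evaluation protocol.

```python
# Minimal sketch: check whether an MT model renders an idiom literally.
# The Helsinki-NLP English-French model and the keyword heuristic are
# assumptions for illustration, not TIDE's evaluation protocol.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")

pair = {
    "literal": "The farmer was thrilled when the bird laid a goose egg.",
    "figurative": "The team scored a goose egg in the final quarter.",
}
literal_markers = ["oie", "œuf"]  # literal French renderings of "goose"/"egg"

for label, sentence in pair.items():
    translation = translator(sentence)[0]["translation_text"]
    rendered_literally = any(m in translation.lower() for m in literal_markers)
    print(f"{label:10s} -> {translation!r}  literal rendering: {rendered_literally}")
```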

    Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

    With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: at least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.
    Comment: Accepted at TACL; pre-MIT Press publication version
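    The automatic analyses that supplement the human audit can be approximated with simple per-sentence checks such as language identification and length filters. The sketch below uses the `langdetect` package and ad-hoc thresholds as stand-ins; it only illustrates the kind of screening described, not the paper's methodology.

```python
# Minimal sketch: a crude automatic audit of one language-specific corpus.
# The `langdetect` package and the heuristics below are stand-ins for the
# paper's automatic analyses, chosen only to illustrate the idea.
from langdetect import detect

def usable_fraction(sentences, expected_lang):
    """Fraction of sentences that are non-trivial and in the expected language."""
    usable = 0
    for sentence in sentences:
        sentence = sentence.strip()
        if len(sentence.split()) < 3:   # boilerplate or fragments
            continue
        try:
            if detect(sentence) == expected_lang:
                usable += 1
        except Exception:               # detection fails on odd, short strings
            continue
    return usable / max(len(sentences), 1)

sample = [
    "Dies ist ein vollständiger deutscher Satz über Sprachdaten.",
    "click here to download",
    "???",
]
print(f"usable fraction: {usable_fraction(sample, 'de'):.2f}")
```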

    Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages

    Research in NLP lacks geographic diversity, and the question of how NLP can be scaled to low-resourced languages has not yet been adequately solved. "Low-resourced"-ness is a complex problem going beyond data availability and reflects systemic problems in society. In this paper, we focus on the task of Machine Translation (MT), which plays a crucial role for information accessibility and communication worldwide. Despite immense improvements in MT over the past decade, MT is centered around a few high-resourced languages. As MT researchers cannot solve the problem of low-resourcedness alone, we propose participatory research as a means to involve all necessary agents required in the MT development process. We demonstrate the feasibility and scalability of participatory research with a case study on MT for African languages. Its implementation leads to a collection of novel translation datasets, MT benchmarks for over 30 languages, with human evaluations for a third of them, and enables participants without formal training to make a unique scientific contribution. Benchmarks, models, data, code, and evaluation results are released at https://github.com/masakhane-io/masakhane-mt

    AfriQA: Cross-lingual Open-Retrieval Question Answering for African Languages

    African languages have far less in-language content available digitally, making it challenging for question answering systems to satisfy the information needs of users. Cross-lingual open-retrieval question answering (XOR QA) systems -- those that retrieve answer content from other languages while serving people in their native language -- offer a means of filling this gap. To this end, we create AfriQA, the first cross-lingual QA dataset with a focus on African languages. AfriQA includes 12,000+ XOR QA examples across 10 African languages. While previous datasets have focused primarily on languages where cross-lingual QA augments coverage from the target language, AfriQA focuses on languages where cross-lingual answer content is the only high-coverage source of answer content. Because of this, we argue that African languages are one of the most important and realistic use cases for XOR QA. Our experiments demonstrate the poor performance of automatic translation and multilingual retrieval methods. Overall, AfriQA proves challenging for state-of-the-art QA models. We hope that the dataset enables the development of more equitable QA technology.
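    The XOR QA setting described above follows a translate-retrieve-answer pattern. The snippet below wires that pattern together from off-the-shelf parts; the model names and toy corpus are assumptions for illustration, not AfriQA's baseline system.

```python
# Minimal sketch of a translate-retrieve-answer XOR QA pipeline built from
# off-the-shelf parts. The model names and toy corpus are assumptions for
# illustration; this is not AfriQA's baseline system.
from rank_bm25 import BM25Okapi
from transformers import pipeline

# 1) Translate the question into the high-coverage language (Swahili -> English).
translate = pipeline("translation", model="Helsinki-NLP/opus-mt-sw-en")
question_sw = "Mji mkuu wa Kenya ni upi?"
question_en = translate(question_sw)[0]["translation_text"]

# 2) Retrieve the most relevant English passage with BM25 over a toy corpus.
corpus = [
    "Nairobi is the capital and largest city of Kenya.",
    "Mount Kilimanjaro is the highest mountain in Africa.",
]
bm25 = BM25Okapi([passage.lower().split() for passage in corpus])
best_passage = corpus[bm25.get_scores(question_en.lower().split()).argmax()]

# 3) Extract an answer span from the retrieved English passage.
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
print(qa(question=question_en, context=best_passage)["answer"])
```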

    What a Creole Wants, What a Creole Needs

    In recent years, the natural language processing (NLP) community has given increased attention to the disparity of efforts directed towards high-resource languages over low-resource ones. Efforts to remedy this delta often begin with translations of existing English datasets into other languages. However, this approach ignores that different language communities have different needs. We consider a group of low-resource languages: Creole languages. Creoles are largely absent from the NLP literature and often ignored by society at large due to stigma, despite having sizable and vibrant communities. We demonstrate, through conversations with Creole experts and surveys of Creole-speaking communities, how the things needed from language technology can change dramatically from one language to another, even when the languages are considered to be very similar to each other, as with Creoles. We discuss the prominent themes arising from these conversations, and ultimately demonstrate that useful language technology cannot be built without involving the relevant community.

    MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition

    African languages are spoken by over a billion people, but are underrepresented in NLP research and development. The challenges impeding progress include the limited availability of annotated datasets, as well as a lack of understanding of the settings where current methods are effective. In this paper, we make progress towards solutions for these challenges, focusing on the task of named entity recognition (NER). We create the largest human-annotated NER dataset for 20 African languages, and we study the behavior of state-of-the-art cross-lingual transfer methods in an Africa-centric setting, demonstrating that the choice of source language significantly affects performance. We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points across 20 languages compared to using English. Our results highlight the need for benchmark datasets and models that cover typologically diverse African languages.
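    The "best transfer language" result reduces to a simple selection over per-source zero-shot F1 scores for each target language. The sketch below shows only that selection step; the scores are invented placeholders, not MasakhaNER 2.0 numbers.

```python
# Minimal sketch: choose the best transfer (source) language per target from
# zero-shot F1 scores. The numbers are invented placeholders, not MasakhaNER
# 2.0 results; only the selection logic is illustrated.
zero_shot_f1 = {
    # target language: {source language: F1}
    "hau": {"eng": 52.0, "swa": 68.5, "yor": 61.2},
    "ibo": {"eng": 49.3, "swa": 55.1, "yor": 63.8},
}

for target, scores in zero_shot_f1.items():
    best_source = max(scores, key=scores.get)
    gain_over_english = scores[best_source] - scores["eng"]
    print(f"{target}: best source = {best_source} "
          f"(+{gain_over_english:.1f} F1 over English)")
```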