7 research outputs found

ChatGPT MT: Competitive for High- (but not Low-) Resource Languages

Author: Mortensen David R.
Neubig Graham
Ogayo Perez
Robinson Nathaniel R.
Publication date: 14/09/2023

Large language models (LLMs) implicitly learn to perform a range of language tasks, including machine translation (MT). Previous studies explore aspects of LLMs' MT capabilities. However, there exist a wide variety of languages for which recent LLM MT performance has never before been evaluated. Without published experimental evidence on the matter, it is difficult for speakers of the world's diverse languages to know how and whether they can use LLMs for their languages. We present the first experimental evidence for an expansive set of 204 languages, along with MT cost analysis, using the FLORES-200 benchmark. Trends reveal that GPT models approach or exceed traditional MT model performance for some high-resource languages (HRLs) but consistently lag for low-resource languages (LRLs), under-performing traditional MT for 84.1% of languages we covered. Our analysis reveals that a language's resource level is the most important feature in determining ChatGPT's relative ability to translate it, and suggests that ChatGPT is especially disadvantaged for LRLs and African languages. Comment: 27 pages, 9 figures, 14 tables

arXiv.org e-Print Archive

Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages

Author: Abbott Jade
Adeyemi Mofe
Ahia Orevaoghene
Akinfaderin Adewale
Akinola Solomon Oluwole
Ali Jamiil Toure
Bashir Abdallah
Bassey Blessing Itoro
Biljon Elan van
Dangana Idris Abdulkabir
Degila Kevin
Dossou Bonaventure
Duru Goodness
Elsahar Hady
Emezue Chris
Ezeani Ignatius
Fagbohungbe Taiwo
Fasubaa Timi
Freshia Sackey
Kabongo Salomon
Kamper Herman
Kioko Ghollah
Kolawole Tajudeen
Kreutzer Julia
Macharm Ricky
Marivate Vukosi
Martinus Laura Jane
Matsila Tshinondiwa
Meressa Musie
Mokgesi-Selinga Masabata
Muhammad Shamsuddeen Hassan
Murhabazi Espoir
Nekoto Wilhelmina
Niyongabo Rubungo Andre
Ogayo Perez
Ogueji Kelechi
Okegbemi Lawrence
Olabiyi Ayodele
Onyefuluchi Christopher
Orife Iroro
Osei Salomey
Ramkilowan Arshath
Sibanda Blessing
Siminyu Kathleen
Tajudeen Kolawole
Webster Jason
Whitenack Daniel
Öktem Alp
Publication date: 01/01/2020

Research in NLP lacks geographic diversity, and the question of how NLP can be scaled to low-resourced languages has not yet been adequately solved. "Low-resourced"-ness is a complex problem going beyond data availability and reflects systemic problems in society. In this paper, we focus on the task of Machine Translation (MT), which plays a crucial role for information accessibility and communication worldwide. Despite immense improvements in MT over the past decade, MT is centered around a few high-resourced languages. As MT researchers cannot solve the problem of low-resourcedness alone, we propose participatory research as a means to involve all necessary agents required in the MT development process. We demonstrate the feasibility and scalability of participatory research with a case study on MT for African languages. Its implementation leads to a collection of novel translation datasets, MT benchmarks for over 30 languages, with human evaluations for a third of them, and enables participants without formal training to make a unique scientific contribution. Benchmarks, models, data, code, and evaluation results are released under https://github.com/masakhane-io/masakhane-mt

arXiv.org e-Print Archive

Crossref

Lancaster E-Prints

AfriMTE and AfriCOMET: Empowering COMET to Embrace Under-resourced African Languages

Author: Abdullahi Saheed S.
Abolade Daud
Adelani David Ifeoluwa
Adewumi Tosin
Afolabi Abeeb
Agrawal Sweta
Ajao Simbiat
Akinjobi Zainab
Al-Azzawi Sana
Alkhaled Lama
Anigri Salma El
Aremu Anuoluwapo
Awoyomi Oluwabusayo Olufunke
Bourhim Sofia
Briakou Eleftheria
Brian Sam
Bukula Andiswa
Carpuat Marine
Chukwuneke Chiamaka
Etori Naome A.
Hassan Ayinde
He Xuanli
Hourrane Oumaima
Iro Ruqayya Nasir
Kimotho Wangari
Kimotho Wangui
Macharm Ricky
Mangwana Thabiso
Masiak Marek
Mbonu Chinedu Emmanuel
Mohamed Muhidin
Mohamed Shafie Abdi
Mokayed Hamam
Momo Lyse Naomi Wamba
Moore Stephen E.
Muchiri Eric
Muhammad Shamsuddeen Hassan
Mwase Christine
Ndolela Lolwethu
Njoroge Samuel
Obiefuna Nnaemeka
Ochieng Millicent
Ogayo Perez
Ogbu Onyekachi Raphael
Ojo Jessica
Olatoye Temitayo
Omotayo Abdul-Hakeem
Opoku Bernard
Osei Salomey
Otiende Verrah Akinyi
Rei Ricardo
Sari Sakayo Toadoum
Shode Iyanuoluwa
Siro Clemencia
Stenetorp Pontus
Wang Jiayi
Yuehgoh Foutse
Publication venue: Center for Open Science
Publication date: 16/11/2023

Despite the progress we have recorded in scaling multilingual machine translation (MT) models and evaluation data to several under-resourced African languages, it is difficult to measure accurately the progress we have made on these languages because evaluation is often performed on n-gram matching metrics like BLEU that often have worse correlation with human judgments. Embedding-based metrics such as COMET correlate better; however, lack of evaluation data with human ratings for under-resourced languages, complexity of annotation guidelines like Multidimensional Quality Metrics (MQM), and limited language coverage of multilingual encoders have hampered their applicability to African languages. In this paper, we address these challenges by creating high-quality human evaluation data with a simplified MQM guideline for error-span annotation and direct assessment (DA) scoring for 13 typologically diverse African languages. Furthermore, we develop AfriCOMET, a COMET evaluation metric for African languages, by leveraging DA training data from high-resource languages and an African-centric multilingual encoder (AfroXLM-Roberta) to create the state-of-the-art evaluation metric for African languages MT with respect to Spearman-rank correlation with human judgments (+0.406).

Lancaster E-Prints

MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition

Author: Abdulmumin Idris
Adelani David Ifeoluwa
Adewumi Tosin P.
Adeyemi Mofetoluwa
Ahia Orevaoghene
Alabi Jesujoba O.
Anuoluwapo Aremu
Beukman Michael
Bukula Andiswa
Buzaaba Happy
Chukwuneke Chiamaka
Dione Cheikh M. Bamba
Dossou Bonaventure F. P.
Emezue Chris Chinenye
Ezeani Ignatius
Gitau Catherine
Gwadabe Tajuddeen
Hacheme Gilles
Kabore Fatoumata Ouoba
Kalipe Godson
Klakow Dietrich
Koagne Victoire Memdjokam
Lignos Constantine
Mabuya Rooweither
Macucwa Tebogo
Marivate Vukosi
Mbaye Derguene
Mboning Elvis
Mokono Neo L.
Muhammad Shamsuddeen Hassan
Mukiibi Jonathan
Munkoh-Buabeng Edwin
Nabende Peter
Nakatumba-Nabende Joyce
Neubig Graham
Ngoli Tatiana Moteu
Ogayo Perez
Ogundepo Odunayo
Palen-Michel Chester
Rijhwani Shruti
Ruder Sebastian
Sibanda Blessing
Tapo Allahsera Auguste
Taylor Amelia
Yousuf Oreen
Publication date: 15/11/2022

African languages are spoken by over a billion people, but are underrepresented in NLP research and development. The challenges impeding progress include the limited availability of annotated datasets, as well as a lack of understanding of the settings where current methods are effective. In this paper, we make progress towards solutions for these challenges, focusing on the task of named entity recognition (NER). We create the largest human-annotated NER dataset for 20 African languages, and we study the behavior of state-of-the-art cross-lingual transfer methods in an Africa-centric setting, demonstrating that the choice of source language significantly affects performance. We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points across 20 languages compared to using English. Our results highlight the need for benchmark datasets and models that cover typologically-diverse African languages.

Lancaster E-Prints

AfriMTE and AfriCOMET: Enhancing COMET to Embrace Under-resourced African Languages

Author: Abdullahi Saheed S.
Abolade Daud
Adelani David Ifeoluwa
Adewumi Tosin
Afolabi Abeeb
Agrawal Sweta
Ajao Simbiat
Akinjobi Zainab
Al-Azzawi Sana
Alkhaled Lama
Anigri Salma El
Aremu Anuoluwapo
Awoyomi Oluwabusayo Olufunke
Bourhim Sofia
Briakou Eleftheria
Brian Sam
Bukula Andiswa
Carpuat Marine
Chukwuneke Chiamaka
Etori Naome A.
Hassan Ayinde
He Xuanli
Hourrane Oumaima
Iro Ruqayya Nasir
Kimotho Wangari
Kimotho Wangui
Lu Yao
Macharm Ricky
Mangwana Thabiso
Masiak Marek
Mbonu Chinedu Emmanuel
Mohamed Muhidin
Mohamed Shafie Abdi
Mokayed Hamam
Momo Lyse Naomi Wamba
Moore Stephen E.
Muchiri Eric
Muhammad Shamsuddeen Hassan
Mwase Christine
Ndolela Lolwethu
Njoroge Samuel
Obiefuna Nnaemeka
Ochieng Millicent
Ogayo Perez
Ogbu Onyekachi Raphael
Ojo Jessica
Olatoye Temitayo
Omotayo Abdul-Hakeem
Opoku Bernard
Osei Salomey
Otiende Verrah Akinyi
Rei Ricardo
Sari Sakayo Toadoum
Shode Iyanuoluwa
Siro Clemencia
Stenetorp Pontus
Wang Jiayi
Yuehgoh Foutse
Publication venue: arXiv.org
Publication date: 16/11/2023

Despite the recent progress on scaling multilingual machine translation (MT) to several under-resourced African languages, accurately measuring this progress remains challenging, since evaluation is often performed on n-gram matching metrics such as BLEU, which typically show a weaker correlation with human judgments. Learned metrics such as COMET have higher correlation; however, the lack of evaluation data with human ratings for under-resourced languages, complexity of annotation guidelines like Multidimensional Quality Metrics (MQM), and limited language coverage of multilingual encoders have hampered their applicability to African languages. In this paper, we address these challenges by creating high-quality human evaluation data with simplified MQM guidelines for error detection and direct assessment (DA) scoring for 13 typologically diverse African languages. Furthermore, we develop AfriCOMET: COMET evaluation metrics for African languages by leveraging DA data from well-resourced languages and an African-centric multilingual encoder (AfroXLM-R) to create the state-of-the-art MT evaluation metrics for African languages with respect to Spearman-rank correlation with human judgments (0.441).

Aston Publications Explorer

MasakhaNER: Named entity recognition for African languages

Author: Abbott Jade
Abebe Azime Israel
Adelani David
Adewumi Tosin
Adeyemi Mofetoluwa
Ahia Orevaoghene
Akinfaderin Adewale
Akinode Victor
Alabi Jesujoba
Anebi Emmanuel
Aremu Anuoluwapo
Awokoya Ayodele
Buzaaba Happy
Chinenye Emezue Chris
Chukwuneke Chiamaka
d'Souza Daniel
David Davis
Diallo Abdoulaye
Dossou Bonaventure
Ezeani Ignatius
Faye Abdoulaye
Gebreyohannes Dibora
Gitau Catherine
Katusiime Maurice
Kreutzer Julia
Lignos Constantine
Marengereke Tendai
Mayhew Stephen
Mbaye Derguene
Mboup Mouhamadane
Muhammad Shamsuddeen
Mukiibi Jonathan
Muriuki Gerald
Nabagereka Deborah
Nakatumba-Nabende Joyce
Neubig Graham
Ngom Samba
Niyongabo Rubungo
Nwaike Kelechi
Odu Nkiruka
Ogayo Perez
Ogueji Kelechi
Oloyede Temilola
Orife Iroro
Osei Salomey
Otiende Verrah
Oyerinde Samuel
Palen-Michel Chester
Rabiu Gwadabe Tajuddeen
Rayson Paul
Rijhwani Shruti
Ruder Sebastian
Saul Bateesa Tobius
Sibanda Blessing
Siro Clemencia
Thierno Ibrahima
Tilaye Henok
Wairagala Eric
Wambui Yvonne
Wolde Degaga
Yimam Seid
Publication venue: MIT Press - Journals
Publication date: 14/06/2021

We take a step towards addressing the underrepresentation of the African continent in NLP research by bringing together different stakeholders to create the first large, publicly available, high-quality dataset for named entity recognition (NER) in ten African languages. We detail the characteristics of these languages to help researchers and practitioners better understand the challenges they pose for NER tasks. We analyze our datasets and conduct an extensive empirical evaluation of state-of-the-art methods across both supervised and transfer learning settings. Finally, we release the data, code, and models to inspire future research on African NLP.

Hal-Diderot