17 research outputs found

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

Author: Adeyemi Mofetoluwa
Agrawal Sweta
Ahia Oghenefego
Ahia Orevaoghene
Ataman Duygu
Awokoya Ayodele
Azime Israel Abebe
Baljekar Pallavi
Ballı Sakine Çabuk
Bapna Ankur
Baruwa Ahmed
Battisti Alessia
Biderman Stella
Caswell Isaac
de Silva Nisansa
Dlamini Sakhile
Dossou Bonaventure F. P.
Firat Orhan
Jenny Mathias
Jernite Yacine
Kreutzer Julia
Kudugunta Sneha
Lawson Nze
Leong Colin
Matangira Tapiwanashe
Mirzakhalov Jamshidbek
Mnyakeni Ayanda
Muhammad Nanda
Muhammad Shamsuddeen Hassan
Müller André
Müller Mathias
Nguyen Toan Q.
Ogueji Kelechi
Orife Iroro
Osei Salomey
Papadimitriou Isabel
Rios Annette
Rivera Clara
Rubungo Andre Niyongabo
Sagot Benoît
Samb Sokhar
Sarin Supheakmungkol
Setyawan Monang
Sikasote Claytone
Sokolov Artem
Subramani Nishant
Suárez Pedro Ortiz
Tapo Allahsera
Ulzii-Orshikh Nasanbayar
van Esch Daan
Wahab Ahsan
Wang Lisa
Publication date: 23/03/2021

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases. Comment: Accepted at TACL; pre-MIT Press publication version.

arXiv.org e-Print Archive

Directory of Open Access Journals

INRIA a CCSD electronic archive server

HAL: Hyper Article en Ligne

Portail HAL UNIV-RENNES

MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition

Author: Abdulmumin I
Adelani DI
Adewumi T
Adeyemi M
Ahia O
Alabi JO
Aremu A
Bamba Dione CM
Beukman M
Bukula A
Buzaaba H
Chukwuneke C
Dossou BFP
Emezue CC
Ezeani I
Gitau C
Gwadabe T
Hacheme GQ
Kabore F
Kalipe G
Klakow D
Koagne VM
Lignos C
Mabuya R
Macucwa T
Marivate V
Mbaye D
Mboning E
Mokono NL
Muhammad SH
Mukiibi J
Munkoh-Buabeng E
Nabende P
Nakatumba-Nabende J
Neubig G
Ngoli TM
Ogayo P
Ogundepo O
Palen-Michel C
Rijhwani S
Ruder S
Sibanda B
Tapo AA
Taylor A
Yousuf O
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/12/2022

African languages are spoken by over a billion people, but are underrepresented in NLP research and development. The challenges impeding progress include the limited availability of annotated datasets, as well as a lack of understanding of the settings where current methods are effective. In this paper, we make progress towards solutions for these challenges, focusing on the task of named entity recognition (NER). We create the largest human-annotated NER dataset for 20 African languages, and we study the behavior of state-of-the-art cross-lingual transfer methods in an Africa-centric setting, demonstrating that the choice of source language significantly affects performance. We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points across 20 languages compared to using English. Our results highlight the need for benchmark datasets and models that cover typologically-diverse African languages.

UCL Discovery

AfriQA: Cross-lingual Open-Retrieval Question Answering for African Languages

Author: Abdou Aziz DIOP
Adelani David Ifeoluwa
Adeyemi Mofetoluwa
Adhiambo Sonia
Ahia Orevaoghene
Ahmad Ibrahim Said
Ajayi Tunde Oluwaseyi
Ajisafe Daniel A.
Alabi Jesujoba O.
Amuok Priscilla A.
Anuoluwapo Aremu
Arthur Steven
Asai Akari
Awosan Oyinkansola
Ayodele Awokoya
Buzaaba Happy
Chinedu Mbonu
Chukwuneke Chiamaka
Clark Jonathan H.
Dossou Bonaventure F. P.
Emezue Chris
Ezeani Ignatius
Gwadabe Tajuddeen R.
Hacheme Gilles
Iro Ruqayya Nasir
Kahira Albert Njoroge
Lawan Falalu Ibrahim
Mabuya Rooweither
Mbow Habib
Mngoma Ndumiso
Muhammad Shamsuddeen H.
Mukonde Eunice
Mwase Christine
Namukombo Martin
Niyomutabazi Emile
Ogundepo Odunayo
Oladipo Akintunde
Onwuegbuzia Emeka Felix
Opoku Bernard
Osei Salomey
Otiende Verrah
Owodunni Abraham Toluwase
Phiri Mofya
Putini Neo
Rivera Clara E.
Rubungo Andre Niyongabo
Ruder Sebastian
Shode Iyanuoluwa
Sikasote Claytone
Sinkala Boyd
Siro Clemencia
Tonja Atnafu Lambebo
Publication venue: 'Center for Open Science'
Publication date: 11/05/2023

African languages have far less in-language content available digitally, making it challenging for question answering systems to satisfy the information needs of users. Cross-lingual open-retrieval question answering (XOR QA) systems -- those that retrieve answer content from other languages while serving people in their native language -- offer a means of filling this gap. To this end, we create AfriQA, the first cross-lingual QA dataset with a focus on African languages. AfriQA includes 12,000+ XOR QA examples across 10 African languages. While previous datasets have focused primarily on languages where cross-lingual QA augments coverage from the target language, AfriQA focuses on languages where cross-lingual answer content is the only high-coverage source of answer content. Because of this, we argue that African languages are one of the most important and realistic use cases for XOR QA. Our experiments demonstrate the poor performance of automatic translation and multilingual retrieval methods. Overall, AfriQA proves challenging for state-of-the-art QA models. We hope that the dataset enables the development of more equitable QA technology.

Lancaster E-Prints

Gazing Mathematics and Science Education in Ghana

Author: E. Fredua-Kwarteng
F. Ahia
Publication venue: SensePublishers
Publication date: 01/01/2012

Crossref

Factors Associated with Clinician Perception of Improved Patient Safety Following Transition to a Comprehensive Electronic Health Record

Author: Ahia C.
Holt E. W.
Krousel-Wood M.
Luo Q.
McCoy A. B.
Milani R. V.
Price-Haywood E.
Sittig D. F.
Thomas E. J.
Publication venue: 'BMJ'
Publication date: 01/01/2016

UQ eSpace (University of Queensland)

Implementing electronic health records (EHRs): health care provider perceptions before and after transition from a local basic EHR to a commercial comprehensive EHR

Author: Ahia Chad
Holt Elizabeth W
Krousel-Wood Marie
Luo Qingyang
McCoy Allison B
Milani Richard V
Price-Haywood Eboni G
Sittig Dean F
Thomas Eric J
Trapani Donnalee N
Publication venue: 'Oxford University Press (OUP)'
Publication date: 23/09/2017

We assessed changes in the percentage of providers with positive perceptions of electronic health record (EHR) benefit before and after transition from a local basic to a commercial comprehensive EHR. Changes in the percentage of providers with positive perceptions of EHR benefit were captured via a survey of academic health care providers before (baseline) and at 6-12 months (short term) and 12-24 months (long term) after the transition. We analyzed 32 items for the overall group and by practice setting, provider age, and specialty using separate multivariable-adjusted random effects logistic regression models. A total of 223 providers completed all 3 surveys (30% response rate): 85.6% had outpatient practices, 56.5% were >45 years old, and 23.8% were primary care providers. The percentage of providers with positive perceptions significantly increased from baseline to long-term follow-up for patient communication, hospital transitions - access to clinical information, preventive care delivery, preventive care prompt, preventive lab prompt, satisfaction with system reliability, and sharing medical information (P

UQ eSpace (University of Queensland)

The Agmon spectral function for molecular Hamiltonians with symmetry restrictions

Author: Ahia F.
Bach V.
Evans W. D.
Hunziker W.
Ismagilov R.
Lieb E. H.
Morgan J. D.
Sigal I. M.
Simon B.
van Winter C.
Vugal'ter S. A.
Zhislin G. M.
Publication venue: 'The Royal Society'

Crossref

Protecting Children through Mandated Child-Abuse Reporting

Author: Bryant S.
C. Emmanuel Ahia
Cahill L.
Ferrara F.
Foreman T.
Gil E.
Kathleen McQuillan
Kuest D.
Levin P.
Merali N.
Monteleone J.
Myers J.
Perlis S.
Sebold J.
Shengold L.
Shoop R.
Sorenson T.
Stefan C. Dombrowski
Sue D.
Wolfe D.
Publication venue: 'Informa UK Limited'

Crossref

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

Author: Ahmed Baruwa
Ahsan Wahab
Alessia Battisti
Allahsera Tapo
Andre Niyongabo Rubungo
André Müller
Ankur Bapna
Annette Rios
Artem Sokolov
Ayanda Mnyakeni
Ayodele Awokoya
Benoît Sagot
Bonaventure F. P. Dossou
Clara Rivera
Claytone Sikasote
Colin Leong
Daan van Esch
Duygu Ataman
Iroro Orife
Isaac Caswell
Isabel Papadimitriou
Israel Abebe Azime
Jamshidbek Mirzakhalov
Julia Kreutzer
Kelechi Ogueji
Lisa Wang
Mathias Jenny
Mathias Müller
Mofetoluwa Adeyemi
Monang Setyawan
Nanda Muhammad
Nasanbayar Ulzii-Orshikh
Nisansa de Silva
Nishant Subramani
Nze Lawson
Oghenefego Ahia
Orevaoghene Ahia
Orhan Firat
Pallavi Baljekar
Pedro Ortiz Suarez
Sakhile Dlamini
Sakine Çabuk Ballı
Salomey Osei
Shamsuddeen Hassan Muhammad
Sneha Kudugunta
Sokhar Samb
Stella Biderman
Supheakmungkol Sarin
Sweta Agrawal
Tapiwanashe Matangira
Toan Q. Nguyen
Yacine Jernite
Publication venue: MIT Press
Publication date: 01/01/2022

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.

Crossref