Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
With the success of large-scale pre-training and multilingual modeling in
Natural Language Processing (NLP), recent years have seen a proliferation of
large, web-mined text datasets covering hundreds of languages. We manually
audit the quality of 205 language-specific corpora released with five major
public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource
corpora have systematic issues: at least 15 corpora contain no usable text, and a
significant fraction contains fewer than 50% sentences of acceptable quality. In
addition, many are mislabeled or use nonstandard/ambiguous language codes. We
demonstrate that these issues are easy to detect even for non-proficient
speakers, and supplement the human audit with automatic analyses. Finally, we
recommend techniques to evaluate and improve multilingual corpora and discuss
potential risks that come with low-quality data releases.
Comment: Accepted at TACL; pre-MIT Press publication version.
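The automatic analyses mentioned above can be approximated with simple language-identification checks. Below is a minimal sketch, not the paper's actual pipeline, that estimates the fraction of sentences in a corpus sample whose detected language matches the corpus's declared language code; the `langdetect` library and the sample inputs are illustrative assumptions.

```python
# Minimal sketch of an automatic corpus-quality check (illustrative, not the audit's pipeline).
# Assumes the `langdetect` package is installed: pip install langdetect
from langdetect import detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # make langdetect's results deterministic


def in_language_ratio(sentences, expected_lang):
    """Return the fraction of non-empty sentences whose detected language matches expected_lang."""
    matches, total = 0, 0
    for sent in sentences:
        sent = sent.strip()
        if not sent:
            continue  # skip empty lines rather than counting them as mismatches
        total += 1
        try:
            if detect(sent) == expected_lang:
                matches += 1
        except LangDetectException:
            pass  # non-linguistic content (numbers, markup) counts as a mismatch
    return matches / total if total else 0.0


# Hypothetical usage: audit a small sample from a corpus labelled as German ("de").
sample = [
    "Das ist ein ganz normaler deutscher Satz.",
    "This sentence is English and should not be in a German corpus.",
    "12345 !!! <html>",
]
print(f"in-language ratio: {in_language_ratio(sample, 'de'):.2f}")
```

A low in-language ratio on a random sample is one cheap signal of the mislabeling and quality problems the audit reports, though it cannot replace inspection by speakers of the language.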
Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages
Research in NLP lacks geographic diversity, and the question of how NLP can be scaled to low-resourced languages has not yet been adequately solved. "Low-resourced"-ness is a complex problem that goes beyond data availability and reflects systemic problems in society. In this paper, we focus on the task of Machine Translation (MT), which plays a crucial role in information accessibility and communication worldwide. Despite immense improvements in MT over the past decade, MT remains centered around a few high-resourced languages. As MT researchers cannot solve the problem of low-resourcedness alone, we propose participatory research as a means to involve all agents required in the MT development process. We demonstrate the feasibility and scalability of participatory research with a case study on MT for African languages. Its implementation leads to a collection of novel translation datasets and MT benchmarks for over 30 languages, with human evaluations for a third of them, and enables participants without formal training to make a unique scientific contribution. Benchmarks, models, data, code, and evaluation results are released at https://github.com/masakhane-io/masakhane-mt