Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
With the success of large-scale pre-training and multilingual modeling in
Natural Language Processing (NLP), recent years have seen a proliferation of
large, web-mined text datasets covering hundreds of languages. We manually
audit the quality of 205 language-specific corpora released with five major
public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource
corpora have systematic issues: at least 15 corpora contain no usable text, and a
significant fraction contains fewer than 50% sentences of acceptable quality. In
addition, many are mislabeled or use nonstandard/ambiguous language codes. We
demonstrate that these issues are easy to detect even for non-proficient
speakers, and supplement the human audit with automatic analyses. Finally, we
recommend techniques to evaluate and improve multilingual corpora and discuss
potential risks that come with low-quality data releases.
Comment: Accepted at TACL; pre-MIT Press publication version.
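The automatic analyses mentioned above can be approximated with simple language-identification checks. Below is a minimal sketch, not the paper's actual pipeline, that estimates the fraction of sentences in a corpus sample whose detected language matches the corpus's declared language code; the `langdetect` library and the sample inputs are illustrative assumptions.

```python
# Minimal sketch of an automatic corpus-quality check (illustrative, not the audit's pipeline).
# Assumes the `langdetect` package is installed: pip install langdetect
from langdetect import detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # make langdetect's results deterministic


def in_language_ratio(sentences, expected_lang):
    """Return the fraction of non-empty sentences whose detected language matches expected_lang."""
    matches, total = 0, 0
    for sent in sentences:
        sent = sent.strip()
        if not sent:
            continue  # skip empty lines rather than counting them as mismatches
        total += 1
        try:
            if detect(sent) == expected_lang:
                matches += 1
        except LangDetectException:
            pass  # non-linguistic content (numbers, markup) counts as a mismatch
    return matches / total if total else 0.0


# Hypothetical usage: audit a small sample from a corpus labelled as German ("de").
sample = [
    "Das ist ein ganz normaler deutscher Satz.",
    "This sentence is English and should not be in a German corpus.",
    "12345 !!! <html>",
]
print(f"in-language ratio: {in_language_ratio(sample, 'de'):.2f}")
```

A low in-language ratio on a random sample is one cheap signal of the mislabeling and quality problems the audit reports, though it cannot replace inspection by speakers of the language.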
Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages
Research in NLP lacks geographic diversity, and the question of how NLP can be scaled to low-resourced languages has not yet been adequately solved. "Low-resourced"-ness is a complex problem that goes beyond data availability and reflects systemic problems in society. In this paper, we focus on the task of Machine Translation (MT), which plays a crucial role in information accessibility and communication worldwide. Despite immense improvements in MT over the past decade, MT remains centered around a few high-resourced languages. As MT researchers cannot solve the problem of low-resourcedness alone, we propose participatory research as a means to involve all agents required in the MT development process. We demonstrate the feasibility and scalability of participatory research with a case study on MT for African languages. Its implementation leads to a collection of novel translation datasets and MT benchmarks for over 30 languages, with human evaluations for a third of them, and enables participants without formal training to make a unique scientific contribution. Benchmarks, models, data, code, and evaluation results are released at https://github.com/masakhane-io/masakhane-mt