AfriNames: Most ASR models "butcher" African Names
Useful conversational agents must accurately capture named entities to
minimize error for downstream tasks, for example, asking a voice assistant to
play a track from a certain artist, initiating navigation to a specific
location, or documenting a laboratory result for a patient. However, where
named entities such as "Ukachukwu" (Igbo), "Lakicia" (Swahili), or
"Ingabire" (Rwandan) are spoken, automatic speech recognition (ASR) models'
performance degrades significantly, propagating errors to downstream systems.
We model this problem as a distribution shift and demonstrate that such model
bias can be mitigated through multilingual pre-training, intelligent data
augmentation strategies to increase the representation of African-named
entities, and fine-tuning multilingual ASR models on multiple African accents.
The resulting fine-tuned models show an 81.5% relative WER improvement
compared with the baseline on samples with African-named entities.
Comment: Accepted at Interspeech 2023 (Main Conference)
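For readers unfamiliar with the metric, the sketch below shows how word error rate (WER) and a relative WER improvement are commonly computed. It is a minimal illustration, not the authors' evaluation code: the function names and the example transcripts and WER values are hypothetical and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER that the new model removes."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical example: a misrecognized name costs three word errors.
wer = word_error_rate("play a song by ukachukwu",
                      "play a song by you could chew")
# Hypothetical WER values, chosen only to illustrate the formula.
gain = relative_wer_improvement(baseline_wer=0.50, new_wer=0.25)
```

Under this definition, a relative improvement of 81.5% means the fine-tuned model's WER is 18.5% of the baseline WER on the same samples.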
Adapting Pretrained ASR Models to Low-resource Clinical Speech using Epistemic Uncertainty-based Data Selection
While there has been significant progress in ASR, African-accented clinical
ASR has been understudied due to a lack of training datasets. Building robust
ASR systems in this domain requires large amounts of annotated or labeled data,
for a wide variety of linguistically and morphologically rich accents, which
are expensive to create. Our study aims to address this problem by reducing
annotation expenses through informative uncertainty-based data selection. We
show that incorporating epistemic uncertainty into our adaptation rounds
outperforms several baseline results, established using state-of-the-art (SOTA)
ASR models, while reducing the required amount of labeled data, and hence
reducing annotation costs. Our approach also improves out-of-distribution
generalization for very low-resource accents, demonstrating its viability
for building generalizable ASR models in the context of accented
African clinical ASR, where training datasets are scarce.
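The abstract does not say how epistemic uncertainty is estimated or how samples are chosen, so the following is only a sketch of one common pattern for uncertainty-based data selection. It assumes each unlabeled sample has been scored over several stochastic forward passes (for example, with dropout left on), uses the variance of those scores as an uncertainty proxy, and selects the most uncertain samples for annotation. The function name, the sample IDs, and the scores are all hypothetical.

```python
from statistics import pvariance


def select_most_uncertain(pass_scores: dict[str, list[float]], k: int) -> list[str]:
    """Return the k sample IDs whose scores vary most across passes.

    pass_scores maps a sample ID to one score per stochastic forward pass.
    The variance across passes is an assumed proxy for epistemic uncertainty.
    """
    uncertainty = {sid: pvariance(scores) for sid, scores in pass_scores.items()}
    # Sort by descending uncertainty; break ties by ID for determinism.
    ranked = sorted(uncertainty, key=lambda sid: (-uncertainty[sid], sid))
    return ranked[:k]


# Hypothetical scores from three stochastic passes per sample.
scores = {
    "clip_a": [0.10, 0.10, 0.10],  # model is consistent
    "clip_b": [0.00, 0.50, 1.00],  # model disagrees with itself
    "clip_c": [0.20, 0.30, 0.20],
}
to_annotate = select_most_uncertain(scores, k=2)
```

In an adaptation loop of this kind, the selected samples would be labeled, added to the training set, and the model fine-tuned again before the next selection round.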
AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages
In recent years, multilingual pre-trained language models have gained
prominence due to their remarkable performance on numerous downstream Natural
Language Processing (NLP) tasks. However, pre-training these large multilingual
language models requires a lot of training data, which is not available for
African Languages. Active learning is a semi-supervised learning algorithm, in
which a model consistently and dynamically learns to identify the most
beneficial samples to train itself on, in order to achieve better optimization
and performance on downstream tasks. Furthermore, active learning effectively
and practically addresses real-world data scarcity. Despite all its benefits,
active learning, in the context of NLP and especially multilingual language
model pretraining, has received little consideration. In this paper, we
present AfroLM, a multilingual language model pretrained from scratch on 23
African languages (the largest effort to date) using our novel self-active
learning framework. Pretrained on a dataset significantly (14x) smaller than
existing baselines, AfroLM outperforms many multilingual pretrained language
models (AfriBERTa, XLMR-base, mBERT) on various NLP downstream tasks (NER, text
classification, and sentiment analysis). Additional out-of-domain sentiment
analysis experiments show that AfroLM is able to generalize well
across various domains. We release the source code and the datasets used in
our framework at https://github.com/bonaventuredossou/MLM_AL.
Comment: Third Workshop on Simple and Efficient Natural Language Processing,
EMNLP 202
The African Stopwords project: curating stopwords for African languages
Stopwords are fundamental in Natural Language Processing (NLP) techniques for information retrieval. One of the common tasks in preprocessing text data is the removal of stopwords. Currently, while high-resource languages like English benefit from the availability of several stopword lists, low-resource languages, such as those found on the African continent, have none that are standardized and available for use in NLP packages. Stopwords in the context of African languages are understudied and can reveal information about the crossover between languages. The African Stopwords project aims to study and curate stopwords for African languages. In this paper, we present our current progress on ten African languages as well as future plans for the project.
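The stopword-removal step mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, tokenization is plain whitespace splitting, and the tiny English stopword set is a placeholder rather than a list curated by the project.

```python
def remove_stopwords(text: str, stopwords: set[str]) -> list[str]:
    """Lowercase the text, split on whitespace, and drop stopword tokens."""
    return [token for token in text.lower().split() if token not in stopwords]


# Placeholder stopword set for illustration only; a real pipeline would load
# a curated, language-specific list.
toy_stopwords = {"the", "of", "is"}
tokens = remove_stopwords("The price of maize is rising", toy_stopwords)
# tokens == ["price", "maize", "rising"]
```

The same function works for any language once a suitable stopword set exists, which is the gap the project describes for African languages.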
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
With the success of large-scale pre-training and multilingual modeling in
Natural Language Processing (NLP), recent years have seen a proliferation of
large, web-mined text datasets covering hundreds of languages. We manually
audit the quality of 205 language-specific corpora released with five major
public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource
corpora have systematic issues: At least 15 corpora have no usable text, and a
significant fraction contains less than 50% sentences of acceptable quality. In
addition, many are mislabeled or use nonstandard/ambiguous language codes. We
demonstrate that these issues are easy to detect even for non-proficient
speakers, and supplement the human audit with automatic analyses. Finally, we
recommend techniques to evaluate and improve multilingual corpora and discuss
potential risks that come with low-quality data releases.
Comment: Accepted at TACL; pre-MIT Press publication version