AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages
Africa is home to over 2000 languages from over six language families and has
the highest linguistic diversity among all continents. This includes 75
languages with at least one million speakers each. Yet, there is little NLP
research conducted on African languages. Crucial in enabling such research is
the availability of high-quality annotated datasets. In this paper, we
introduce AfriSenti, which consists of 14 sentiment datasets of 110,000+ tweets
in 14 African languages (Amharic, Algerian Arabic, Hausa, Igbo, Kinyarwanda,
Moroccan Arabic, Mozambican Portuguese, Nigerian Pidgin, Oromo, Swahili,
Tigrinya, Twi, Xitsonga, and Yorùbá) from four language families annotated
by native speakers. The data is used in SemEval 2023 Task 12, the first
Afro-centric SemEval shared task. We describe the data collection methodology,
annotation process, and related challenges when curating each of the datasets.
We conduct experiments with different sentiment classification baselines and
discuss their usefulness. We hope AfriSenti enables new work on
under-represented languages. The dataset is available at
https://github.com/afrisenti-semeval/afrisent-semeval-2023 and can also be
loaded as a Hugging Face dataset
(https://huggingface.co/datasets/shmuhammad/AfriSenti).
Comment: 15 pages, 6 Figures, 9 Tables