Automatic identification methods on a corpus of twenty five fine-grained Arabic dialects

J Li; JC Watson; K Spärck Jones; OF Zaidan; S Hochreiter; S Kullback; S Malmasi; Subarno Pal

Publication date: 4 October 2019
Publisher: Springer Science and Business Media LLC

Abstract

This research deals with Arabic dialect identification, a challenging problem in Arabic NLP. The increasing use of Arabic dialects in written form, especially on social media, generates new needs in the area of Arabic dialect processing. To discriminate between dialects in a multi-dialect context, we explored several approaches based on machine learning techniques. We used a classification method based on the symmetric Kullback-Leibler divergence, and we experimented with classical classification methods such as Naive Bayes classifiers as well as more sophisticated methods such as Word2Vec and Long Short-Term Memory (LSTM) neural networks. We tested our approaches on a large database of 25 Arabic dialects in addition to Modern Standard Arabic (MSA).
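To make the idea of classification by symmetric Kullback-Leibler divergence concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not specify the features or the smoothing, so the character trigram profiles, the additive smoothing constant `alpha`, and the function names `train` and `classify` are all assumptions chosen for illustration. Each dialect is represented by an n-gram distribution, and a sentence is assigned to the dialect whose distribution is closest under KL(p || q) + KL(q || p).

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the list of character n-grams of `text` (assumed feature type)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def distribution(counts, vocab, alpha=1.0):
    """Additively smoothed probability distribution over `vocab`.

    Smoothing keeps every probability positive, so the log ratios below
    are always defined. `alpha` is an illustrative choice.
    """
    total = sum(counts.values()) + alpha * len(vocab)
    return {g: (counts.get(g, 0) + alpha) / total for g in vocab}


def symmetric_kl(p, q):
    """Symmetric KL divergence: KL(p || q) + KL(q || p).

    The two sums combine into sum_g (p_g - q_g) * log(p_g / q_g).
    """
    return sum((p[g] - q[g]) * math.log(p[g] / q[g]) for g in p)


def train(samples, n=3):
    """Build one n-gram count profile per label from (text, label) pairs."""
    profiles = {}
    for text, label in samples:
        profiles.setdefault(label, Counter()).update(char_ngrams(text, n))
    return profiles


def classify(text, profiles, n=3, alpha=1.0):
    """Return the label whose profile is closest to `text` in symmetric KL."""
    query = Counter(char_ngrams(text, n))
    vocab = set(query)
    for counts in profiles.values():
        vocab |= set(counts)
    p = distribution(query, vocab, alpha)
    return min(
        profiles,
        key=lambda label: symmetric_kl(
            p, distribution(profiles[label], vocab, alpha)
        ),
    )
```

With a list of (sentence, dialect) pairs, `train(pairs)` builds the profiles and `classify(sentence, profiles)` returns the nearest dialect label. Working over a shared, smoothed vocabulary is one simple way to avoid zero probabilities; other smoothing schemes or word-level features would fit the same structure.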