115 research outputs found

    Automatic identification methods on a corpus of twenty five fine-grained Arabic dialects

    This research deals with Arabic dialect identification, a challenging problem in Arabic NLP. The increasing use of written Arabic dialects, especially on social media, generates new needs in the area of Arabic dialect processing. To discriminate between dialects in a multi-dialect context, we explored several machine learning approaches: a classification method based on the symmetric Kullback-Leibler divergence, classical classification methods such as Naive Bayes classifiers, and more sophisticated methods such as Word2Vec and Long Short-Term Memory neural networks. We tested our approaches on a large database of 25 Arabic dialects in addition to MSA.
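To make the symmetric Kullback-Leibler idea in the abstract above concrete, here is a minimal sketch. It is not the authors' implementation: the character-trigram features, the add-one smoothing, and the names `char_ngrams`, `distribution`, `symmetric_kl`, and `classify` are all assumptions chosen for illustration, and the training strings are synthetic placeholders rather than real dialect data.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of a space-padded string (assumed feature set)."""
    padded = f" {text} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def distribution(texts, vocab, n=3, alpha=1.0):
    """Add-alpha smoothed n-gram distribution over a fixed vocabulary."""
    counts = Counter(g for t in texts for g in char_ngrams(t, n))
    total = sum(counts[g] for g in vocab) + alpha * len(vocab)
    return {g: (counts[g] + alpha) / total for g in vocab}


def symmetric_kl(p, q):
    """KL(p||q) + KL(q||p), which simplifies to sum((p - q) * log(p / q))."""
    return sum((p[g] - q[g]) * math.log(p[g] / q[g]) for g in p)


def classify(sentence, training, n=3):
    """Pick the label whose n-gram profile is closest to the sentence."""
    vocab = set(char_ngrams(sentence, n))
    for texts in training.values():
        for t in texts:
            vocab.update(char_ngrams(t, n))
    query = distribution([sentence], vocab, n)
    profiles = {label: distribution(texts, vocab, n)
                for label, texts in training.items()}
    return min(profiles, key=lambda label: symmetric_kl(query, profiles[label]))


# Synthetic placeholder "dialects"; real use would need labeled dialect text.
training = {
    "dialect_A": ["aaa aab aba"],
    "dialect_B": ["zzz zzy zyz"],
}
predicted = classify("aab aaa", training)
```

Smoothing matters here: without it, any n-gram missing from one distribution would make the divergence undefined, so both distributions are estimated over the same shared vocabulary.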

    PADIC: extension and new experiments

    PADIC is a multidialectal parallel Arabic corpus. It initially comprised five Arabic dialects, three from the Maghreb and two from the Middle East, in addition to standard Arabic. In this paper, we present an augmented version of PADIC that adds a Moroccan dialect. We also evaluate, using the σ–index, the computerization level of the Arabic dialects present in PADIC, which reveals that these languages are severely under-resourced. We also discuss several machine translation experiments, in both directions, between all combinations of language pairs. For each language, we interpolated the corresponding Language Model (LM) with an LM based on a large Arabic corpus. The results show that this interpolation in some cases has no effect on the performance of the translation systems and in others is rather penalizing.
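The language model interpolation described in the abstract above can be illustrated with a small sketch. The abstract does not say which interpolation scheme or model order was used, so this assumes plain linear interpolation of two smoothed unigram models; the weight `lam`, the function names, and the toy tokens `w1` to `w4` are hypothetical.

```python
from collections import Counter


def unigram_lm(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram model over a fixed vocabulary."""
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def interpolate(p_dialect, p_background, lam):
    """Linear interpolation: lam * P_dialect(w) + (1 - lam) * P_background(w)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return {w: lam * p_dialect[w] + (1.0 - lam) * p_background[w]
            for w in p_dialect}


# Toy token streams standing in for a small dialect corpus and a large
# background Arabic corpus (placeholders, not real data).
dialect_tokens = "w1 w2 w1 w3".split()
background_tokens = "w2 w3 w3 w4 w4 w4".split()
vocab = set(dialect_tokens) | set(background_tokens)

p_dialect = unigram_lm(dialect_tokens, vocab)
p_background = unigram_lm(background_tokens, vocab)
p_mix = interpolate(p_dialect, p_background, lam=0.7)
```

Because both inputs are proper distributions over the same vocabulary, the mixture is one too. In practice the weight would be tuned on held-out dialect text, which is one way to see when interpolation helps and when it does not.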

    Arabic dialects annotation using an online game

    Modern Standard Arabic is the written standard across the Arab world, but Arabic dialects are increasingly used in social media, which makes social media an appropriate source for a corpus for research on classifying Arabic dialect texts with machine learning algorithms. An important first step is annotating the text corpus with correct dialect tags. We collected tweets from Twitter and comments from Facebook and online newspapers, aiming for representative samples of five groups of Arabic dialects: Gulf, Iraqi, Egyptian, Levantine, and North African. We then explored a crowdsourcing approach to corpus annotation. The annotation task was developed as an online game in which players can test their dialect classification skills and receive a score reflecting their knowledge. This approach has so far yielded 24K annotated documents containing 587K tokens; 16,179 are tagged as dialect and 7,821 as Modern Standard Arabic.