
    Automatic identification methods on a corpus of twenty five fine-grained Arabic dialects

    This research deals with Arabic dialect identification, a challenging problem in Arabic NLP. The increasing use of Arabic dialects in written form, especially on social media, creates new needs in the area of Arabic dialect processing. To discriminate between dialects in a multi-dialect context, we explored several approaches based on machine learning techniques: a classification method based on the symmetric Kullback-Leibler divergence, classical classifiers such as Naive Bayes, and more sophisticated methods such as Word2Vec and Long Short-Term Memory (LSTM) neural networks. We tested our approaches on a large database of 25 Arabic dialects in addition to Modern Standard Arabic (MSA).
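    The symmetric Kullback-Leibler approach mentioned above can be sketched as follows: build a character n-gram distribution per dialect, then assign a new text to the dialect whose distribution is closest under the symmetrized divergence. This is a minimal illustration, not the authors' exact feature set or smoothing scheme; the epsilon floor for unseen n-grams is an assumption.

    ```python
    import math
    from collections import Counter

    def char_ngram_dist(text, n=2):
        """Relative frequencies of character n-grams in a text."""
        grams = [text[i:i + n] for i in range(len(text) - n + 1)]
        counts = Counter(grams)
        total = sum(counts.values())
        return {g: c / total for g, c in counts.items()}

    def symmetric_kl(p, q, eps=1e-9):
        """Symmetrized divergence KL(p||q) + KL(q||p).
        eps stands in for n-grams missing from one distribution
        (an illustrative choice, not the paper's smoothing)."""
        keys = set(p) | set(q)
        kl_pq = sum(p.get(k, eps) * math.log(p.get(k, eps) / q.get(k, eps)) for k in keys)
        kl_qp = sum(q.get(k, eps) * math.log(q.get(k, eps) / p.get(k, eps)) for k in keys)
        return kl_pq + kl_qp

    def classify(sample, profiles, n=2):
        """Assign the sample to the dialect profile with the smallest divergence."""
        dist = char_ngram_dist(sample, n)
        return min(profiles, key=lambda d: symmetric_kl(dist, profiles[d]))
    ```

    In practice the per-dialect profiles would be estimated from large training corpora; `classify` then ranks all 25 dialect profiles (plus MSA) for each input text.
    
    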

    Atar: Attention-based LSTM for Arabizi transliteration

    A non-standard romanization of Arabic script, known as Arabizi, is widely used in Arabic online and SMS/chat communities. However, since state-of-the-art tools and applications for Arabic NLP expect Arabic to be written in Arabic script, handling content written in Arabizi requires special attention, either by building customized tools or by transliterating it into Arabic script. The latter approach is the more common one, and this work presents two significant contributions in that direction. The first is to collect and publicly release the first large-scale “Arabizi to Arabic script” parallel corpus, focused on the Jordanian dialect and consisting of more than 25k pairs carefully created and inspected by native speakers to ensure the highest quality. The second is Atar, an attention-based encoder-decoder model for Arabizi transliteration. Training and testing this model on our dataset yields an accuracy of 79% and a BLEU score of 88.49.
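    The attention step at the heart of such an encoder-decoder can be sketched in isolation: at each decoding step, the decoder state is scored against every encoder state, the scores are softmax-normalized into weights, and a weighted sum of encoder states forms the context vector. This is a generic dot-product-attention illustration under assumed vector representations, not Atar's actual LSTM implementation or scoring function.

    ```python
    import math

    def attend(decoder_state, encoder_states):
        """One attention step: dot-product scores, softmax weights,
        and the resulting context vector over the encoder states."""
        # Score each encoder hidden state against the current decoder state.
        scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
        # Numerically stable softmax over the scores.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [x / z for x in exps]
        # Context vector: attention-weighted sum of encoder states.
        dim = len(encoder_states[0])
        context = [sum(w * h[i] for w, h in zip(weights, encoder_states)) for i in range(dim)]
        return weights, context
    ```

    In a full transliteration model, the encoder states would come from an LSTM run over the Arabizi characters, and the context vector would be fed, together with the decoder state, into the layer that predicts the next Arabic-script character.
    
    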