Arabic dialect identification in the context of bivalency and code-switching
In this paper we present a novel approach to Arabic dialect identification using language bivalency and written code-switching. Bivalency between languages or dialects occurs when a word or element is treated by language users as having fundamentally similar semantic content in more than one language or dialect. Arabic dialect identification in writing is a difficult task even for humans because words are used interchangeably between dialects. The task of automatically identifying dialect is harder still, and classifiers trained using only n-grams will perform poorly when tested on unseen data. Such approaches require significant amounts of annotated training data, which is costly and time-consuming to produce. Currently available Arabic dialect datasets do not exceed a few hundred thousand sentences, so we need to extract features other than word and character n-grams. In our work we present experimental results from automatically identifying dialects from the four main Arabic dialect regions (Egypt, North Africa, Gulf and Levant) in addition to Standard Arabic. We extend previous work by incorporating additional grammatical and stylistic features and define a subtractive bivalency profiling approach to address issues of bivalent words across the examined Arabic dialects. The results show that our new method's classification accuracy can reach more than 76% and that it scores well (66%) when tested on completely unseen data.
Using Prosody and Phonotactics in Arabic Dialect Identification
While Modern Standard Arabic is the formal spoken and written language of the Arab world, dialects are the major communication mode for everyday life; identifying a speaker’s dialect is thus critical to speech processing tasks such as automatic speech recognition, as well as speaker identification. We examine the role of prosodic features (intonation and rhythm) across four Arabic dialects: Gulf, Iraqi, Levantine, and Egyptian, for the purpose of automatic dialect identification. We show that prosodic features can significantly improve identification over a purely phonotactic approach, with an identification accuracy of 86.33% for 2m utterances.
Automatic identification methods on a corpus of twenty five fine-grained Arabic dialects
This research deals with Arabic dialect identification, a challenging issue in Arabic NLP. The increasing use of Arabic dialects in written form, especially in social media, generates new needs in the area of Arabic dialect processing. To discriminate between dialects in a multi-dialect context, we explored several approaches based on machine learning techniques. We used a classification method based on symmetric Kullback-Leibler divergence, and we experimented with classical classification methods such as Naive Bayes classifiers and with more sophisticated methods such as Word2Vec and Long Short-Term Memory neural networks. We tested our approaches on a large database of 25 Arabic dialects in addition to MSA.
Speech Recognition Challenge in the Wild: Arabic MGB-3
This paper describes the Arabic MGB-3 Challenge - Arabic Speech Recognition
in the Wild. Unlike last year's Arabic MGB-2 Challenge, for which the
recognition task was based on more than 1,200 hours of broadcast TV news
recordings from Aljazeera Arabic TV programs, MGB-3 emphasises dialectal Arabic
using a multi-genre collection of Egyptian YouTube videos. Seven genres were
used for the data collection: comedy, cooking, family/kids, fashion, drama,
sports, and science (TEDx). A total of 16 hours of videos, split evenly across
the different genres, were divided into adaptation, development and evaluation
data sets. The Arabic MGB-Challenge comprised two tasks: A) Speech
transcription, evaluated on the MGB-3 test set, along with the 10-hour MGB-2
test set to report progress on the MGB-2 evaluation; B) Arabic dialect
identification, introduced this year in order to distinguish between four major
Arabic dialects - Egyptian, Levantine, North African, Gulf, as well as Modern
Standard Arabic. Two hours of audio per dialect were released for development
and a further two hours were used for evaluation. For dialect identification,
both lexical features and i-vector bottleneck features were shared with
participants in addition to the raw audio recordings. Overall, thirteen teams
submitted ten systems to the challenge. We outline the approaches adopted in
each system, and summarise the evaluation results.