Modern Standard Arabic is the written standard across the Arab world, but Arabic dialects are increasingly used in social media, which makes social media an appropriate source of a corpus for research on classifying Arabic dialect texts using machine learning algorithms. An important first step is annotation of the text corpus with correct dialect tags. We collected tweets from Twitter and comments from Facebook and online newspapers, aiming for representative samples of five groups of Arabic dialects: Gulf, Iraqi, Egyptian, Levantine, and North African. We then explored an approach to crowdsourcing corpus annotation. The annotation task was developed as an online game in which players can test their dialect classification skills and receive a score reflecting their knowledge. This approach has so far yielded 24K annotated documents containing 587K tokens: 16,179 tagged as dialect and 7,821 as Modern Standard Arabic.

Alshutayri, A

Atwell, E

White Rose Research Online

This is a repository copy of Arabic dialects annotation using an online game.

White Rose Research Online URL for this paper: http://eprints.whiterose.ac.uk/131819/

Version: Accepted Version

Proceedings Paper: Alshutayri, A orcid.org/0000-0001-8550-0597 and Atwell, E orcid.org/0000-0001-9395-3764 (2018) Arabic dialects annotation using an online game. In: ICNLSP 2018: 2nd International Conference on Natural Language and Speech Processing. 2nd International Conference on Natural Language and Speech Processing (ICNLSP 2018), 25-26 Apr 2018, Algiers, Algeria. IEEE. ISBN 978-1-5386-4543-7

https://doi.org/10.1109/ICNLSP.2018.8374371

© 2018 IEEE. This is an author produced version of a paper published in ICNLSP 2018: 2nd International Conference on Natural Language and Speech Processing. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Uploaded in accordance with the publisher's self-archiving policy.

eprints@whiterose.ac.uk
https://eprints.whiterose.ac.uk/

Reuse: Items deposited in White Rose Research Online are protected by copyright, with all rights reserved unless indicated otherwise. They may be downloaded and/or printed for private study, or other acts as permitted by national copyright laws. The publisher or other rights holders may allow further reproduction and re-use of the full text version. This is indicated by the licence information on the White Rose Research Online record for the item.

Takedown: If you consider content in White Rose Research Online to be in breach of UK law, please notify us by emailing eprints@whiterose.ac.uk including the URL of the record and the reason for the withdrawal request.
Arabic Dialects Annotation using an Online Game

1st Areej Alshutayri
Faculty of Computing and Information Technology, King Abdul Aziz University, Jeddah, Saudi Arabia
aalshetary@kau.edu.sa
School of Computing, University of Leeds, Leeds, United Kingdom
ml14aooa@leeds.ac.uk

2nd Eric Atwell
School of Computing, University of Leeds, Leeds, United Kingdom
E.S.Atwell@leeds.ac.uk

Abstract: Modern Standard Arabic is the written standard across the Arab world, but Arabic dialects are increasingly used in social media, which makes social media an appropriate source of a corpus for research on classifying Arabic dialect texts using machine learning algorithms. An important first step is annotation of the text corpus with correct dialect tags. We collected tweets from Twitter and comments from Facebook and online newspapers, aiming for representative samples of five groups of Arabic dialects: Gulf, Iraqi, Egyptian, Levantine, and North African. We then explored an approach to crowdsourcing corpus annotation. The annotation task was developed as an online game in which players can test their dialect classification skills and receive a score reflecting their knowledge. This approach has so far yielded 24K annotated documents containing 587K tokens: 16,179 tagged as dialect and 7,821 as Modern Standard Arabic.

Index Terms: Arabic, Dialects, Corpus, Annotation, Crowdsourcing

I. INTRODUCTION

Modern Standard Arabic (MSA) is the formal written standard across the Arab world, but there is an increasing use of Arabic dialects in a range of informal text sources. The classification of dialects has become an important pre-processing step for other tasks, such as machine translation, dialect-to-dialect lexicons, and information retrieval [1]. To improve the classification of Arabic dialects, we developed a new approach to annotating Arabic dialect texts. We used two social media sources, tweets from Twitter [2] and comments from Facebook, in addition to readers’ comments from online newspapers as a web source.
The corpus contains dialectal Arabic texts collected from Arab countries to cover the main Arabic dialects: the Gulf Dialect (GLF), the Iraqi Dialect (IRQ), the Levantine Dialect (LEV), the Egyptian Dialect (EGY), and the North African (Maghrebi) Dialect (NOR). GLF is spoken in countries around the Arabian Gulf, and includes the dialects of Saudi Arabia, Kuwait, Qatar, United Arab Emirates, Bahrain, Oman and Yemen. IRQ is spoken in Iraq, and it is a sub-dialect of GLF. LEV is spoken in countries around the eastern Mediterranean coast, and covers the dialects of Lebanon, Syria, Jordan and Palestine. EGY includes the dialects of Egypt and Sudan. Finally, NOR includes the dialects of Morocco, Algeria, Tunisia and Libya [3]–[5], as shown in Fig. 1.

Fig. 1. Arab World Map.

Some tweets were collected based on location points and others based on seed terms, which are distinctive words that are very common in one dialect and not used in any other dialect [2]. In total we collected 280K tweets and 2M Facebook comments, plus 10K comments obtained by crawling newspaper websites over a period of two months. Table I shows the total number of words for each source of text.

TABLE I
THE TOTAL NUMBER OF WORDS FROM EACH SOURCE OF TEXT

Source       Number of Words
Twitter      6,827,733
Facebook     7,056,812
Newspaper    3,318,717

In [6], annotation was carried out by workers on Amazon's Mechanical Turk. Ten sentences were shown per screen, and the worker was asked to give each sentence two labels: the amount of dialect in the sentence and the type of the dialect. They collected 330K labelled documents in about 4.5 months. In contrast to our method, however, workers were paid a reward of $0.10 per screen; the total cost of the annotation process was $2,773.20, in addition to $277.32 for Amazon's commission.

In this paper, the second section explains why the annotation process is important. The third section describes the method used to annotate the collected dataset to build a corpus of Arabic dialect texts.
The fourth section shows how we evaluated the annotation results. The fifth section presents the results and the number of annotated documents. Finally, the last section presents the conclusion and future work.

II. IMPORTANCE OF THE ANNOTATION PROCESS

We participated in the Discriminating Similar Languages (DSL) shared task at the VarDial 2016 workshop at COLING 2016 [7]. The shared task offered two subtasks: the first addressed the identification of very similar languages in newswire texts, and the second focused on Arabic dialect identification in speech transcripts [8]. The Arabic dialect texts used for training and testing were developed using the QCRI Automatic Speech Recognition (ASR) QATS system [9] to label each document with a dialect [10]. We found some evidently mislabelled documents, which affected classification accuracy; to avoid this problem, we created a new text corpus and labelling method.

In the first step of labelling the corpus, we initially assumed that each tweet could be labelled based on the location that appears in the user’s profile and the location points used to collect the tweets from Twitter. Each comment collected from online newspapers was labelled based on the country where the newspaper is published. Finally, each comment collected from Facebook posts was labelled based on the country of the Facebook page, which was determined by the nationality of the page owner if the page belongs to a famous public group or person. However, on inspecting the corpus, we noticed some mislabelled documents, caused by disagreement between users’ locations and their dialects. It must therefore be verified that each document is labelled with the correct dialect. Figs. 2 and 3 give examples of the mismatch between user location and dialect.

The user location in Fig. 2 is England, while the tweets are written in Arabic, not English.
Similarly, for Facebook comments, the country of the Facebook page, based on the nationality of the page owner, is Saudi Arabia, but some comments were not written in the GLF dialect, such as the shaded comment in Fig. 3.

Fig. 2. Example of user location and his tweets.
Fig. 3. An example of the Facebook page’s country and the user’s comment.

III. METHOD

To annotate each document with the correct dialect, we randomly selected 100K documents (tweets and comments) from the corpus, then created an annotation tool and hosted it on a website.

In the annotation tool, the player annotates 15 documents (tweets and comments) per screen. Each document is given four labels, so the player must read the document and make four judgments about it. The first judgment is the level of dialectal content in the document. The second judgment is the type of dialect, if the document is not MSA. The third judgment is the reason that led the player to select this dialect. Finally, if the reason selected in the third judgment is dialectal terms, then in the fourth judgment the player needs to write the dialectal words found in the document.

The following list shows the options the player can choose from under each judgment.

• The level of dialectal content
  – MSA (for a document written in MSA)
  – Little bit of dialect (for a document written in MSA that contains some dialect words)
  – Mix of MSA and dialect (for a document written in both MSA and dialect (code switching))
  – Dialect (for a document written in dialect)
• The type of dialect, if the document is written in dialect
  – Egyptian
  – Gulf
  – Iraqi
  – Levantine
  – North Africa
  – Not Sure
• The reason that makes this document dialectal
  – Sentence Structure
  – Dialectal Terms
• The words which identify the dialect (we need to use these words as a dictionary for each dialect)

To annotate the collected data, an interface was built as a web page to display a group of Arabic documents randomly selected from the dataset. Fig. 4 shows the interface of the annotation tool on the website http://www.alshutayri.com/index.jsp.

Each page displays 15 documents randomly selected from the dataset. As shown in Fig. 5, the first label indicates the amount of dialectal content in the document, to decide whether the document is MSA or contains dialectal content. If the document is MSA, the other labels are inactive and the player moves on to the next document. If the document is not MSA, all labels are required. The second label specifies the document's dialect if it is one of the five dialects (EGY, GLF, LEV, IRQ, and NOR), or Not Sure if the document is written in dialect but it is difficult to decide which one. The third and fourth labels explain the reasons for choosing the selected dialect: for example, sentence structure, if the words in the document are all MSA words but the structure of the sentence does not follow MSA grammar rules, and/or dialectal terms, which are well-known words that help to identify the dialect. In fact, there is no agreed standard for writing Arabic dialects, because MSA is the formal standard form of written Arabic [11]; therefore, some documents apparently contain only MSA vocabulary but are annotated as dialect based on non-standard sentence structure.

Before submitting the annotated documents, the player must choose his or her mother dialect. This may help to decide which annotation to accept if one document has received different annotations. Fig. 5 shows an example of one annotated document. Finally, when the annotated documents are submitted, the player's score is shown on the screen; it is computed by comparing the labelled documents with our pre-labelled sample, as shown in Fig. 6.

As a control to ensure that the player reads each document before selecting the options, three MSA documents collected from newspaper articles [12] were mixed with 12 documents selected from the dataset. These three documents must be labelled as MSA, so if the player labels all three as dialect, the player’s submitted documents are not counted in the annotated corpus. Furthermore, to verify the annotation process, each document is annotated redundantly three times.

Fig. 4. The annotation interface.
Fig. 5. Example of the annotated document.
Fig. 6. Example of the player’s score.

IV. EVALUATION

To ensure that each document received a correct label, every document was annotated by three players, in addition to the gold standard, which is the initial label assigned to each document based on the source of the comments and tweets, as described in Section II. The mother dialect of each player also helps to decide which label should be counted as correct if the players gave different labels to one document. The annotated documents were evaluated in two cases:

• Agreement between annotators: all the players give a document the same label, as in Figs. 7 and 8. The agreed label is considered correct even if it differs from the original label because, as mentioned in Section II, the initial label is sometimes incorrect.
• Disagreement between annotators: some players give the document a label different from that of the other players, as in Fig. 9. In this case, the mother dialect can help to decide which label should be accepted as correct for this document.

Fig. 7. Example 1 of the agreement between annotators.
Fig. 8. Example 2 of the agreement between annotators.
Fig. 9. Example of the disagreement between annotators.

To evaluate the quality of the annotation, the inter-annotator agreement was calculated using Fleiss' kappa [13], which measures agreement among more than two annotators. The result is 0.787 (around 79%), which is substantial agreement according to [14].

V. RESULT

The result of the annotation tool is a set of documents, each labelled with four labels. The first label is the dialect level, which is one of three choices: little bit of dialect, mix of MSA and dialect, or dialect. The second label is the specific dialect, which is one of the five dialects: GLF, EGY, LEV, IRQ, or NOR. The third label shows the reasons that help to identify the document’s dialect. The last label shows the dialectal words that help to identify the document’s dialect. Fig. 10 shows the result of one annotated document in the corpus.

We launched the website via Twitter and WhatsApp at the beginning of August 2017. At the time of paper submission, we had been running the annotation website for around four months and had accumulated 24,000 annotated documents, with a total of 586,952 words. The distribution of dialects in the annotated corpus is shown in Fig. 11: the GLF dialect accounts for 5K documents, EGY for 4K, NOR for 2K, LEV for 3K, and IRQ for 2K. The number of users (players) is 1,575, from different countries around the world; Fig. 12 shows the distribution of users over the days. For our immediate research on Arabic dialect classification, the annotated documents we have already collected could be sufficient, but we decided to continue the experiment to collect a large annotated Arabic dialect text corpus and to make the corpus available for other research by the end of 2018.

Fig. 10. Result of the annotated document.
Fig. 11. The distribution of labels (dialects) of the annotated corpus.
Fig. 12. The distribution of the number of users during days.

VI. CONCLUSION AND FUTURE WORK

In this paper, we presented a new approach to annotating a dataset collected from Twitter, Facebook, and online newspapers for the five main Arabic dialects: Gulf, Iraqi, Egyptian, Levantine and North African. The annotation website was created as an online game to attract more users who speak different Arabic dialects, and it incurs no annotation cost, in contrast to paid crowdsourcing websites. This experiment is a new approach that helps to annotate a dataset sufficient for text research in Arabic dialect classification. The number of users has now decreased compared with the beginning, because the website needs to be publicised again more widely. In future work, we could modify the interface to be more attractive and easier to explore. In addition, we could turn this annotation game into an application that can be downloaded onto smartphones and tablets.

REFERENCES

[1] S. Malmasi, E. Refaee, M. Dras, “Arabic dialect identification using a parallel multidialectal corpus,” Pacific Association for Computational Linguistics, 2015, pp. 203–211.
[2] A. Alshutayri, E. Atwell, “Exploring Twitter as a source of an Arabic dialect corpus,” International Journal of Computational Linguistics (IJCL), vol. 8, 2017, pp. 37–44.
[3] F. S. Alorifi, “Automatic identification of Arabic dialects using hidden Markov models,” Doctor of Philosophy thesis, University of Pittsburgh, 2008.
[4] F. Biadsy, J. Hirschberg, N. Habash, “Spoken Arabic dialect identification using phonotactic modeling,” in Proceedings of the EACL Workshop on Computational Approaches to Semitic Languages, Association for Computational Linguistics, Athens 2009, pp. 53–61.
[5] N. Habash, Introduction to Arabic Natural Language Processing. Morgan and Claypool, 2010.
[6] F.O. Zaidan, C. Callison-Burch, “Arabic dialect identification,” Computational Linguistics, vol. 40, 2014, pp. 171–202.
[7] A. Alshutayri, E. Atwell, A. AlOsaimy, J. Dickins, M. Ingleby, J. Watson, “Arabic language WEKA-Based dialect classifier for Arabic automatic speech recognition transcripts,” Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects, 2016, pp. 204–211.
[8] S. Malmasi, M. Zampieri, N. Ljubešić, P. Nakov, A. Ali, J. Tiedemann, “Discriminating between similar languages and Arabic dialect identification: a report on the third DSL shared task,” Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), Osaka, Japan 2016, pp. 1–14.
[9] S. Khurana, A. Ali, “QCRI advanced transcription system (QATS) for the Arabic multi-dialect broadcast media recognition: MGB-2 challenge,” IEEE Spoken Language Technology Workshop (SLT), 2016, pp. 292–298.
[10] A. Ali, N. Dehak, P. Cardinal, S. Khurana, S. Yella, J. Glass, P. Bell, S. Renals, “Automatic dialect detection in Arabic broadcast speech,” Interspeech 2016, 2016, pp. 2934–2938.
[11] H. Elfardy, M. Diab, “Token level identification of linguistic code switching,” in Proceedings of COLING, 2016, pp. 287–296.
[12] L. Al-Sulaiti, E. Atwell, “Designing and developing a corpus of contemporary Arabic,” Proceedings of TALC 2004: the sixth Teaching and Language Corpora conference, Granada 2004, pp. 92–93.
[13] J. L. Fleiss, “Measuring nominal scale agreement among many raters,” Psychological Bulletin, vol. 76(5), 1971, pp. 378–382.
[14] J. R. Landis, G. G. Koch, “The measurement of observer agreement for categorical data,” Biometrics, Wiley, International Biometric Society, vol. 33(1), 1977, pp. 159–174.
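
As an illustration of the agreement measure used in Section IV, the following is a minimal sketch of Fleiss' kappa [13] for documents that each receive the same number of labels (three players per document in our setting), together with the agreement bands of [14]. It is not the code used to obtain the reported value of 0.787: the function names `fleiss_kappa` and `interpret_kappa` and the toy ratings are assumptions made only for this example.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of per-document label lists.

    Every document must have been labelled by the same number of raters
    (at least two). Labels can be any hashable values, e.g. "GLF" or "MSA".
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("each document needs the same number (>= 2) of ratings")

    counts = [Counter(r) for r in ratings]  # label counts per document
    categories = set().union(*counts)       # all labels seen in the data

    # Observed agreement: mean over documents of the share of agreeing rater pairs.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall label proportions.
    total = n_items * n_raters
    p_expected = sum((sum(c[cat] for c in counts) / total) ** 2 for cat in categories)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one label is ever used")
    return (p_observed - p_expected) / (1.0 - p_expected)


def interpret_kappa(kappa):
    """Agreement bands proposed by Landis and Koch [14]."""
    if kappa < 0:
        return "poor"
    for upper, name in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                        (0.80, "substantial"), (1.00, "almost perfect")]:
        if kappa <= upper:
            return name
    raise ValueError("kappa cannot exceed 1")


# Toy data (hypothetical, not taken from the corpus): five documents, three players each.
toy_ratings = [
    ["GLF", "GLF", "GLF"],
    ["EGY", "EGY", "EGY"],
    ["LEV", "LEV", "IRQ"],  # one player disagrees
    ["MSA", "MSA", "MSA"],
    ["NOR", "NOR", "NOR"],
]
toy_kappa = fleiss_kappa(toy_ratings)
print(round(toy_kappa, 3), interpret_kappa(toy_kappa))  # 0.837 almost perfect
```

Under the bands of [14], the reported value of 0.787 falls in the 0.61 to 0.80 range, which is why it is described as substantial agreement.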
