Semantic Classification of Multidialectal Arabic Social Media

Abstract

Arabic is one of the most widely used languages in the world, but due in part to its morphological and syntactic richness, resources for automated processing of Arabic are relatively rare. Arabic takes three primary forms: Classical Arabic, as seen in the Qur’an and other classical texts; Modern Standard Arabic (MSA), as seen in newspapers, formal documents, and other written text intended for widespread distribution; and dialectal Arabic, as used in common speech and informal communication. Social media posts are often written in informal language and may include non-standard spellings, abbreviations, emoticons, hashtags, and emojis, and dialectal Arabic is commonly used in them. Semantic classification is the task of assigning a label to a text based on its primary semantic content. Given the increased use of dialectal Arabic on social media platforms in recent years, there is an urgent need for semantic classification of dialectal Arabic. Even compared to MSA, there are few resources for automated processing of dialectal Arabic, and prior work on automated processing of dialectal Arabic is limited to one or two dialects. One of the major obstacles to semantic classification of multi-dialectal Arabic is the lack of a large, multi-dialectal, tagged corpus. To the best of our knowledge, there are no automated processes for semantic classification of multi-dialectal Arabic social media texts. We gather a data set of more than one million tweets collected from 449 accounts located in 12 Arabic-speaking countries. We group those tweets into 21,791 documents by country, account, and month. We first construct a query to represent a particular semantic concept. Then, using Latent Semantic Analysis (LSA), we rank the documents by semantic similarity to the query. Next, we use that ranking to train a deep neural network classifier to identify documents whose text is semantically similar to the query.
Experiments demonstrate that this approach to semantic classification of multi-dialectal Arabic achieves an overall accuracy of 98.075% and a positive accuracy of 88.178%. The source code and the data set are provided on GitHub at https://github.com/therishel/ArabLeader
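
The LSA ranking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify tokenization, term weighting, or the number of latent dimensions, so the sketch assumes whitespace tokens, raw term counts, and a small hypothetical `k`. The function name `lsa_rank` and the toy English documents are placeholders chosen for illustration only.

```python
import numpy as np


def lsa_rank(documents, query, k=2):
    """Rank documents by cosine similarity to a query in a k-dimensional LSA space.

    Assumptions (not taken from the paper): whitespace tokenization,
    raw term counts, and folding the query into the space via U_k.
    """
    # Build a vocabulary from all whitespace-separated tokens.
    vocab = sorted({tok for doc in documents for tok in doc.split()})
    index = {tok: i for i, tok in enumerate(vocab)}

    def vectorize(text):
        # Count vector over the vocabulary; unknown tokens are ignored.
        vec = np.zeros(len(vocab))
        for tok in text.split():
            if tok in index:
                vec[index[tok]] += 1.0
        return vec

    # Term-document matrix: rows are terms, columns are documents.
    A = np.column_stack([vectorize(d) for d in documents])

    # Truncated SVD: keep the k largest singular directions.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(s))
    U_k = U[:, :k]

    # Document coordinates in the latent space (equal to A^T U_k = V_k S_k).
    doc_vecs = Vt[:k, :].T * s[:k]
    # Project the query into the same space.
    q_vec = vectorize(query) @ U_k

    # Cosine similarity, guarding against zero-length vectors.
    denom = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec)
    sims = np.divide(doc_vecs @ q_vec, denom,
                     out=np.zeros(len(documents)), where=denom > 0)

    # Highest similarity first, as (document index, similarity) pairs.
    order = np.argsort(-sims)
    return [(int(i), float(sims[i])) for i in order]


# Toy usage with placeholder English tokens standing in for tweet documents.
docs = [
    "president election vote government",
    "election vote parliament government",
    "football match goal team",
    "team goal league football",
]
for doc_id, score in lsa_rank(docs, "election government", k=2):
    print(doc_id, round(score, 3))
```

In a pipeline like the one described, the top and bottom of such a ranking could serve as positive and negative training examples for the classifier. How the ranking is actually converted into training labels is not stated in the abstract and is not shown here.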
