87 research outputs found
Sentiment Analysis for the Low-Resourced Latinised Arabic "Arabizi"
The expansion of digital communication mediums from private mobile messaging into the public through social media presented an opportunity for the data science research and industry to mine the generated big data for artificial information extraction. A popular information extraction task is sentiment analysis, which aims at extracting polarity opinions, positive, negative, or neutral, from the written natural language. This science helped organisations better understand the public’s opinion towards events, news, public figures, and products.
However, sentiment analysis has advanced for the English language ahead of Arabic. While sentiment analysis for Arabic is developing in the literature of Natural Language Processing (NLP), a popular variety of Arabic, Arabizi, has been overlooked for sentiment analysis advancements.
Arabizi is an informal transcription of the spoken dialectal Arabic in Latin script used for social texting. It is known to be common among the Arab youth, yet it is overlooked in efforts on Arabic sentiment analysis for its linguistic complexities.
Like Arabic, Arabizi is rich in inflectional morphology, but it is also codeswitched with English or French and transcribed without adhering to a standard orthography. The rich morphology, inconsistent orthography, and codeswitching compound one another and multiply the lexical sparsity of the language: each Arabizi word can be spelled in many ways, and other languages are mixed into the same textual context. The resulting high degree of lexical sparsity undermines the very basis of sentiment analysis, the classification of positive and negative words. Arabizi also faces a severe shortage of the data resources required for any sentiment analysis approach.
In this thesis, we tackle this gap by conducting research on sentiment analysis for Arabizi. We addressed the sparsity challenge by harvesting Arabizi data from multi-lingual social media text using deep learning to build Arabizi resources for sentiment analysis. We developed six new morphologically and orthographically rich Arabizi sentiment lexicons and set the baseline for Arabizi sentiment analysis on social media.
Code-choice on Twitter: How stance-taking and linguistic accommodation reflect the identity of polyglossic Egyptian users
This study examines the online identity of polyglossic Egyptian users of Twitter. It is descriptive and exploratory, using a qualitative design supplemented by frequency counts that add descriptive data. Data were collected using a Discourse Completion Task (DCT) where the participants were presented with a number of tweets and were asked to type another tweet in response to each. The findings from the study suggest that polyglossic Egyptians, those who are proficient in English as well as Arabic, exhibited an assertive identity on Twitter. This identity was constructed through the choice of code, the linguistic accommodation to the tweet authors, and the stance they took. Polyglossic Egyptians were found to use English more than any other code, followed by Arabizi, and then Arabic. They linguistically accommodated the tweet authors in their replies to some extent by choosing the same code in replying as that used in the original tweet. Further, and using Du Bois’ (2007) stance triangle framework, it was also found that they expressed their (dis)alignment quite bluntly by taking an epistemic stance achieved through the use of boosters (very few hedges were used), sarcasm, simple present tense (to express an opinion as if stating a fact), and modals (to offer advice). By doing that, polyglossic Egyptians were found to be assertive in expressing their opinions, often showing themselves as informative, superior people who are guided by facts about topics rather than feelings.
A review of sentiment analysis research in Arabic language
Sentiment analysis is a task of natural language processing which has
recently attracted increasing attention. However, sentiment analysis research
has mainly been carried out for the English language. Although Arabic is
ramping up as one of the most used languages on the Internet, only a few
studies have focused on Arabic sentiment analysis so far. In this paper, we
carry out an in-depth qualitative study of the most important research works in
this context by presenting limits and strengths of existing approaches. In
particular, we survey both approaches that leverage machine translation or
transfer learning to adapt English resources to Arabic and approaches that stem
directly from the Arabic language.
Writing Arabizi: Orthographic Variation In Romanized Lebanese Arabic on Twitter
How does technology influence the script in which a language is written? Over the past few decades, a new form of writing has emerged across the Arab world. Known as Arabizi, it is a type of Romanized Arabic that uses Latin characters instead of Arabic script. It is mainly used by youth in technology-related contexts such as social media and texting, and has made many older Arabic speakers fear that more standard forms of Arabic may be in danger because of its use. Prior work on Arabizi suggests that although it is used frequently on social media, its orthography is not yet standardized (Palfreyman and Khalil, 2003; Abdel-Ghaffar et al., 2011). Therefore, this thesis aimed to examine orthographic variation in Romanized Lebanese Arabic, which has rarely been studied as a Romanized dialect. It was interested in how often Arabizi is used on Twitter in Lebanon and the extent of its orthographic variation. Using Twitter data collected from Beirut, tweets were analyzed to discover the most common orthographic variants in Arabizi for each Arabic letter, as well as the overall rate of Arabizi use. Results show that Arabizi was not used as frequently as hypothesized on Twitter, probably because of its low prestige and increased globalization. However, its consonants are relatively standardized, while its vowels show more variation. This thesis adds to the existing conversation about Romanized Arabic by presenting a detailed study of orthographic variation in Lebanese Arabic. The results could have useful implications for Arabic language ideology and technological endeavors, such as natural language processing or translation programs.
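The observation that consonants are relatively standardized while vowels vary suggests one simple way to group spelling variants: compare words by their consonant skeleton. The snippet below is only an illustrative sketch and is not taken from the thesis. The function names, the example spellings, and the one-entry digit map (treating "7" as a variant of "h") are assumptions, and dropping every vowel will also merge some unrelated words.

```python
import re
from collections import defaultdict

# Assumption: a one-entry digit map for illustration; real Arabizi uses more digits.
DIGIT_MAP = {"7": "h"}
# Simplification: "y" is treated as a vowel everywhere, including word-initially.
VOWELS = set("aeiouy")


def skeleton(word: str) -> str:
    """Reduce a Latin-script word to a rough consonant skeleton."""
    word = word.lower()
    word = "".join(DIGIT_MAP.get(ch, ch) for ch in word)
    word = re.sub(r"(.)\1+", r"\1", word)  # collapse runs of a repeated character
    return "".join(ch for ch in word if ch not in VOWELS)


def group_variants(words):
    """Group words that share the same consonant skeleton."""
    groups = defaultdict(list)
    for w in words:
        groups[skeleton(w)].append(w)
    return dict(groups)


# Hypothetical spelling variants of one word, chosen for the example.
variants = ["habibi", "7abibi", "habiby", "7abeebi"]
groups = group_variants(variants)
print(groups)
```

All four example spellings fall into a single group here, which shows how a normalization step of this kind could reduce the number of distinct word forms a downstream model has to handle.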
SenZi: A Sentiment Analysis Lexicon for the Latinised Arabic (Arabizi)
Arabizi is an informal written form of dialectal Arabic transcribed in Latin alphanumeric characters. It has a proven popularity on chat platforms and social media, yet it suffers from a severe lack of natural language processing (NLP) resources. As such, texts written in Arabizi are often disregarded in sentiment analysis tasks for Arabic. In this paper we describe the creation of a sentiment lexicon for Arabizi that was enriched with word embeddings. The result is a new Arabizi lexicon consisting of 11.3K positive and 13.3K negative words. We evaluated this lexicon by classifying the sentiment of Arabizi tweets achieving an F1-score of 0.72. We provide a detailed error analysis to present the challenges that impact the sentiment analysis of Arabizi.
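To make the idea of lexicon-based classification and F1 evaluation concrete, here is a minimal sketch. It is not the SenZi lexicon or the authors' evaluation code: the tiny word lists, the example tweets, the gold labels, and the function names are all hypothetical, and the scoring rule (count of positive words minus count of negative words) is an assumed simplification.

```python
# Hypothetical toy lexicon entries; NOT drawn from SenZi.
POSITIVE = {"7elo", "mabrouk", "habibi"}
NEGATIVE = {"wa7esh", "zift"}


def classify(text, positive=POSITIVE, negative=NEGATIVE):
    """Label a text by counting lexicon hits (positive minus negative)."""
    tokens = text.lower().split()
    score = sum(t in positive for t in tokens) - sum(t in negative for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def f1_score(gold, predicted, label):
    """F1 for one label: harmonic mean of precision and recall."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical tweets with assumed gold labels.
tweets = [
    "el film 7elo awi",
    "el akl wa7esh",
    "mabrouk ya habibi",
    "el gaw zift bas el film 7elo",
]
gold = ["positive", "negative", "positive", "negative"]
predicted = [classify(t) for t in tweets]

print(predicted)
print(f1_score(gold, predicted, "positive"), f1_score(gold, predicted, "negative"))
```

The last example contains one positive and one negative lexicon word, so the counts cancel and the sketch returns "neutral". Mixed-polarity messages like this are one kind of case that pure word counting cannot resolve.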
An Experimental Study on Sentiment Classification of Moroccan dialect texts in the web
With the rapid growth of the use of social media websites, obtaining the
users' feedback automatically became a crucial task to evaluate their
tendencies and behaviors online. Despite this great availability of
information and the increasing number of Arabic users, only a few studies have
managed to treat Arabic dialects. The purpose of this paper is to study the
opinion and emotion expressed in real Moroccan texts, specifically in YouTube
comments, using some well-known and commonly used methods for sentiment
analysis. In this paper, we present our work of Moroccan dialect comments
classification using Machine Learning (ML) models and based on our collected
and manually annotated YouTube Moroccan dialect dataset. By employing many text
preprocessing and data representation techniques we aim to compare our
classification results utilizing the most commonly used supervised classifiers:
k-nearest neighbors (KNN), Support Vector Machine (SVM), Naive Bayes (NB), and
deep learning (DL) classifiers such as Convolutional Neural Network (CNN) and
Long Short-Term Memory (LSTM). Experiments were performed using both raw and
preprocessed data to show the importance of the preprocessing. In fact, the
experimental results prove that DL models have a better performance for
Moroccan Dialect than classical approaches and we achieved an accuracy of 90%.
Comment: 13 pages, 5 tables, 2 figures
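Of the classical classifiers named above, Naive Bayes is the easiest to show in a few lines. The following is a minimal multinomial Naive Bayes sketch with add-one smoothing. It is not the paper's implementation or data: the function names are hypothetical, and the training comments are English placeholders standing in for annotated Moroccan dialect comments.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Collect per-class document counts, per-class word counts, and the vocabulary."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, doc):
    """Return the class with the highest log-posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_logp = None, -math.inf
    for label, n_docs in class_counts.items():
        logp = math.log(n_docs / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for tok in doc.lower().split():
            if tok not in vocab:
                continue  # ignore words never seen in training
            logp += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


# Placeholder training data (English stand-ins, not the YouTube dataset).
train_docs = [
    "great video love it",
    "love this song",
    "bad audio hate it",
    "hate this bad video",
]
train_labels = ["pos", "pos", "neg", "neg"]
model = train_nb(train_docs, train_labels)

print(predict_nb(model, "love this video"))
print(predict_nb(model, "bad audio"))
```

A raw-versus-preprocessed comparison like the one the abstract describes could be run with this sketch by passing the same comments through a normalization function before `train_nb` and `predict_nb`.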
Atar: Attention-based LSTM for Arabizi transliteration
A non-standard romanization of Arabic script, known as Arabizi, is widely used in Arabic online and SMS/chat communities. However, since state-of-the-art tools and applications for Arabic NLP expect Arabic to be written in Arabic script, handling content written in Arabizi requires special attention, either by building customized tools or by transliterating it into Arabic script. The latter approach is the more common one, and this work presents two significant contributions in this direction. The first is to collect and publicly release the first large-scale “Arabizi to Arabic script” parallel corpus focusing on the Jordanian dialect and consisting of more than 25k pairs carefully created and inspected by native speakers to ensure the highest quality. Second, we present Atar, an attention-based encoder-decoder model for Arabizi transliteration. Training and testing this model on our dataset yields impressive accuracy (79%) and BLEU score (88.49).
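For readers unfamiliar with attention in encoder-decoder models, the sketch below shows the core computation for a single decoder step. It is not Atar's architecture: the abstract does not say which attention variant is used, so dot-product scoring is an assumption, and the function names and vectors are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Dot-product attention (assumed variant) for one decoder step."""
    scores = [
        sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states
    ]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Context vector: attention-weighted sum of the encoder states.
    context = [
        sum(w * h[i] for w, h in zip(weights, encoder_states)) for i in range(dim)
    ]
    return weights, context


# Hypothetical 2-dimensional states for three input characters.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, encoder_states)
print(weights, context)
```

In a transliteration model, weights of this kind let each output character draw on the input characters most relevant to it instead of relying on one fixed summary of the whole input word.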
SentiALG: Automated Corpus Annotation for Algerian Sentiment Analysis
Data annotation is an important but time-consuming and costly procedure. To
sort a text into two classes, the very first thing we need is a good annotation
guideline, establishing what is required to qualify for each class. In the
literature, the difficulties associated with appropriate data annotation have
been underestimated. In this paper, we present a novel approach to
automatically construct an annotated sentiment corpus for Algerian dialect (a
Maghrebi Arabic dialect). The construction of this corpus is based on an
Algerian sentiment lexicon that is also constructed automatically. The
presented work deals with the two widely used scripts on Arabic social media:
Arabic and Arabizi. The proposed approach automatically constructs a sentiment
corpus containing 8000 messages (where 4000 are dedicated to Arabic and 4000 to
Arabizi). The achieved F1-score is up to 72% and 78% for the Arabic and Arabizi
test sets, respectively. Ongoing work is aimed at integrating a transliteration
process for Arabizi messages to further improve the obtained results.
Comment: To appear in the 9th International Conference on Brain Inspired
Cognitive Systems (BICS 2018)
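The general idea of annotating messages automatically from a lexicon, across two scripts, can be sketched as follows. This is not the SentiALG pipeline: the script test, the toy lexicon entries and their assumed polarities, the rule of discarding messages with a zero score, and all function names are assumptions made for illustration. Latin script alone does not establish that a message is Arabizi, since it could equally be French or English.

```python
# Hypothetical toy lexicons with assumed polarities; NOT drawn from SentiALG.
LEXICONS = {
    "arabic": {"جميل": 1, "سيء": -1},
    "arabizi": {"mli7": 1, "khayeb": -1},
}


def detect_script(text):
    """Guess the script by counting Arabic-block versus ASCII letters."""
    arabic = sum("\u0600" <= ch <= "\u06FF" for ch in text)
    latin = sum(ch.isascii() and ch.isalpha() for ch in text)
    if arabic > latin:
        return "arabic"
    if latin > arabic:
        return "arabizi"  # simplification: any Latin-script text
    return "unknown"


def annotate(messages):
    """Label messages from the lexicon; drop those with no clear polarity."""
    corpus = []
    for msg in messages:
        script = detect_script(msg)
        lexicon = LEXICONS.get(script, {})
        score = sum(lexicon.get(tok, 0) for tok in msg.lower().split())
        if score == 0:
            continue  # no lexicon hit, or hits that cancel out
        label = "positive" if score > 0 else "negative"
        corpus.append((msg, script, label))
    return corpus


# Hypothetical messages for the example.
messages = ["الفيلم جميل", "el film mli7", "el film khayeb", "bonjour"]
corpus = annotate(messages)
for row in corpus:
    print(row)
```

Dropping zero-score messages trades corpus size for label reliability, which is one plausible design choice when no human annotator checks the output.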