1,000 research outputs found
Arabic dialects annotation using an online game
Modern Standard Arabic is the written standard across the Arab world; but there is an increasing use of Arabic dialects in social media, so this is appropriate as a source of a corpus for research on classifying Arabic dialect texts using machine learning algorithms. An important first step is annotation of the text corpus with correct dialect tags. We collected tweets from Twitter and comments from Facebook and online newspapers, aiming for representative samples of five groups of Arabic dialects: Gulf, Iraqi, Egyptian, Levantine, and North African. Then, we explored an approach to crowdsourcing corpus annotation. The task of annotation was developed as an online game, where players can test their dialect classification skills and get a score of their knowledge. This approach has so far achieved 24K annotated documents containing 587K tokens; 16,179 tagged as a dialect and 7,821 as Modern Standard Arabic
Arabic Dialect Texts Classification
This study investigates how to classify Arabic dialects in text by extracting features which show the differences between dialects. There has been a lack of research about classification of Arabic dialect texts, in comparison to English and some other languages, due to the lack of Arabic dialect text corpora in comparison with what is available for dialects of English and some other languages. What is more, there is an increasing use of Arabic dialects in social media, so this text is now considered quite appropriate as a medium of communication and as a source of a corpus. We collected tweets from Twitter, comments from Facebook and online newspapers from five groups of Arabic dialects: Gulf, Iraqi, Egyptian, Levantine, and North African. The research sought to: 1) create a dataset of Arabic dialect texts to use in training and testing the system of classification, 2) find appropriate features to classify Arabic dialects: lexical (word and multi-word-unit) and grammatical variation across dialects, 3) build a more sophisticated filter to extract features from Arabic-character written dialect text files.
In this thesis, the first part describes the research motivation to show the reason for choosing the Arabic dialects as a research topic. The second part presents some background information about the Arabic language and its dialects, and the literature review shows previous research about this subject. The research methodology part shows the initial experiment to classify Arabic dialects. The results of this experiment showed the need to create an Arabic dialect text corpus, by exploring Twitter and online newspaper. The corpus used to train the ensemble classifier and to improve the accuracy of classification the corpus was extended by collecting tweets from Twitter based on the spatial coordinate points and comments from Facebook posts. The corpus was annotated with dialect labels and used in automatic dialect classification experiments. The last part of this thesis presents the results of classification, conclusions and future work
Computational Sociolinguistics: A Survey
Language is a social phenomenon and variation is inherent to its social
nature. Recently, there has been a surge of interest within the computational
linguistics (CL) community in the social dimension of language. In this article
we present a survey of the emerging field of "Computational Sociolinguistics"
that reflects this increased interest. We aim to provide a comprehensive
overview of CL research on sociolinguistic themes, featuring topics such as the
relation between language and social identity, language use in social
interaction and multilingual communication. Moreover, we demonstrate the
potential for synergy between the research communities involved, by showing how
the large-scale data-driven methods that are widely used in CL can complement
existing sociolinguistic studies, and how sociolinguistics can inform and
challenge the methods and assumptions employed in CL studies. We hope to convey
the possible benefits of a closer collaboration between the two communities and
conclude with a discussion of open challenges.Comment: To appear in Computational Linguistics. Accepted for publication:
18th February, 201
A review of sentiment analysis research in Arabic language
Sentiment analysis is a task of natural language processing which has
recently attracted increasing attention. However, sentiment analysis research
has mainly been carried out for the English language. Although Arabic is
ramping up as one of the most used languages on the Internet, only a few
studies have focused on Arabic sentiment analysis so far. In this paper, we
carry out an in-depth qualitative study of the most important research works in
this context by presenting limits and strengths of existing approaches. In
particular, we survey both approaches that leverage machine translation or
transfer learning to adapt English resources to Arabic and approaches that stem
directly from the Arabic language
ArAutoSenti: Automatic annotation and new tendencies for sentiment classification of Arabic messages
The file attached to this record is the author's final peer reviewed version.A corpus-based sentiment analysis approach for messages written in Arabic and its dialects is presented and implemented. The originality of this approach resides in the automation construction of the annotated sentiment corpus, which relies mainly on a sentiment lexicon that is also constructed automatically. For the classification step, shallow and deep classifiers are used with features being extracted applying word embedding models. For the validation of the constructed corpus, we proceed with a manual reviewing and it was found that 85.17% were correctly annotated. This approach is applied on the under-resourced Algerian dialect and the approach is tested on two external test corpora presented in the literature. The obtained results are very
encouraging with an F1-score that is up to 88% (on the first test corpus) and up to 81% (on the second test corpus). These results respectively represent a 20% and a 6% improvement, respectively, when compared with existing work in the research literature
Recommended from our members
Sentiment Analysis for the Low-Resourced Latinised Arabic "Arabizi"
The expansion of digital communication mediums from private mobile messaging into the public through social media presented an opportunity for the data science research and industry to mine the generated big data for artificial information extraction. A popular information extraction task is sentiment analysis, which aims at extracting polarity opinions, positive, negative, or neutral, from the written natural language. This science helped organisations better understand the public’s opinion towards events, news, public figures, and products.
However, sentiment analysis has advanced for the English language ahead of Arabic. While sentiment analysis for Arabic is developing in the literature of Natural Language Processing (NLP), a popular variety of Arabic, Arabizi, has been overlooked for sentiment analysis advancements.
Arabizi is an informal transcription of the spoken dialectal Arabic in Latin script used for social texting. It is known to be common among the Arab youth, yet it is overlooked in efforts on Arabic sentiment analysis for its linguistic complexities.
As to Arabic, Arabizi is rich in inflectional morphology, but also codeswitched with English or French, and distinctively transcribed without adhering to a standard orthography. The rich morphology, inconsistent orthography, and codeswitching challenges are compounded together to have a multiplied effect on the lexical sparsity of the language, where each Arabizi word becomes eligible to be spelled in many ways, that, in addition to the mixing of other languages within the same textual context. The resulting high degree of lexical sparsity defies the very basics of sentiment analysis, classification of positive and negative words. Arabizi is even faced with a severe shortage of data resources that are required to set out any sentiment analysis approach.
In this thesis, we tackle this gap by conducting research on sentiment analysis for Arabizi. We addressed the sparsity challenge by harvesting Arabizi data from multi-lingual social media text using deep learning to build Arabizi resources for sentiment analysis. We developed six new morphologically and orthographically rich Arabizi sentiment lexicons and set the baseline for Arabizi sentiment analysis on social media
Benchmarking Arabic AI with Large Language Models
With large Foundation Models (FMs), language technologies (AI in general) are
entering a new paradigm: eliminating the need for developing large-scale
task-specific datasets and supporting a variety of tasks through set-ups
ranging from zero-shot to few-shot learning. However, understanding FMs
capabilities requires a systematic benchmarking effort by comparing FMs
performance with the state-of-the-art (SOTA) task-specific models. With that
goal, past work focused on the English language and included a few efforts with
multiple languages. Our study contributes to ongoing research by evaluating FMs
performance for standard Arabic NLP and Speech processing, including a range of
tasks from sequence tagging to content classification across diverse domains.
We start with zero-shot learning using GPT-3.5-turbo, Whisper, and USM,
addressing 33 unique tasks using 59 publicly available datasets resulting in 96
test setups. For a few tasks, FMs performs on par or exceeds the performance of
the SOTA models but for the majority it under-performs. Given the importance of
prompt for the FMs performance, we discuss our prompt strategies in detail and
elaborate on our findings. Our future work on Arabic AI will explore few-shot
prompting, expand the range of tasks, and investigate additional open-source
models.Comment: Foundation Models, Large Language Models, Arabic NLP, Arabic Speech,
Arabic AI, , CHatGPT Evaluation, USM Evaluation, Whisper Evaluatio
Recommended from our members
Perspective Identification in Informal Text
This dissertation studies the problem of identifying the ideological perspective of people as expressed in their written text. One's perspective is often expressed in his/her stance towards polarizing topics. We are interested in studying how nuanced linguistic cues can be used to identify the perspective of a person in informal genres. Moreover, we are interested in exploring the problem from a multilingual perspective comparing and contrasting linguistics devices used in both English informal genres datasets discussing American ideological issues and Arabic discussion fora posts related to Egyptian politics. %In doing so, we solve several challenges.
Our first and utmost goal is building computational systems that can successfully identify the perspective from which a given informal text is written while studying what linguistic cues work best for each language and drawing insights into the similarities and differences between the notion of perspective in both studied languages. We build computational systems that can successfully identify the stance of a person in English informal text that deal with different topics that are determined by one's perspective, such as legalization of abortion, feminist movement, gay and gun rights; additionally, we are able to identify a more general notion of perspective–namely the 2012 choice of presidential candidate–as well as build systems for automatically identifying different elements of a person's perspective given an Egyptian discussion forum comment. The systems utilize several lexical and semantic features for both languages. Specifically, for English we explore the use of word sense disambiguation, opinion features, latent and frame semantics as well; as Linguistic Inquiry and Word Count features; in Arabic, however, in addition to using sentiment and latent semantics, we study whether linguistic code-switching (LCS) between the standard and dialectal forms for the language can help as a cue for uncovering the perspective from which a comment was written.
This leads us to the challenge of devising computational systems that can handle LCS in Arabic. The Arabic language has a diglossic nature where the standard form of the language (MSA) coexists with the regional dialects (DA) corresponding to the native mother tongue of Arabic speakers in different parts of the Arab world. DA is ubiquitously prevalent in written informal genres and in most cases it is code-switched with MSA. The presence of code-switching degrades the performance of almost any MSA-only trained Natural Language Processing tool when applied to DA or to code-switched MSA-DA content. In order to solve this challenge, we build a state-of-the-art system–AIDA–to computationally handle token and sentence-level code-switching.
On a conceptual level, for handling and processing Egyptian ideological perspectives, we note the lack of a taxonomy for the most common perspectives among Egyptians and the lack of corresponding annotated corpora. In solving this challenge, we develop a taxonomy for the most common community perspectives among Egyptians and use an iterative feedback-loop process to devise guidelines on how to successfully annotate a given online discussion forum post with different elements of a person's perspective. Using the proposed taxonomy and annotation guidelines, we annotate a large set of Egyptian discussion fora posts to identify a comment's perspective as conveyed in the priority expressed by the comment, as well as the stance on major political entities
- …