40 research outputs found
The Effect of Arabism of Romanic Alphabets on the Development of 9th Grade English as a Foreign Language Students' Writing Skills at Secondary School Level
This paper investigates the effect of Arabization of Romanic alphabets on the development of 9th Grade English as a Foreign Language students' composition writing skills at secondary school level. This experimental study includes 25 secondary school students in the 9th Grade, in which English is taught as a foreign language, at Al-Husainieh Secondary School for boys. The findings indicate that students usually tend to write and compose English sentences by Romanizing Arabic letters. This may be related to different reasons, such as weakness in writing, lack of awareness of specific aspects of sentence structure, and a limited vocabulary, but even good students tend to use the Romanic alphabet in writing. The study recommends that students become familiar with the meaning of English words. Key Words: Arabization, Romanic, Alphabets, Writing
Atar: Attention-based LSTM for Arabizi transliteration
A non-standard romanization of Arabic script, known as Arabizi, is widely used in Arabic online and SMS/chat communities. However, since state-of-the-art tools and applications for Arabic NLP expect Arabic to be written in Arabic script, handling content written in Arabizi requires special attention, either by building customized tools or by transliterating it into Arabic script. The latter approach is the more common one, and this work presents two significant contributions in this direction. The first is to collect and publicly release the first large-scale "Arabizi to Arabic script" parallel corpus, focusing on the Jordanian dialect and consisting of more than 25k pairs carefully created and inspected by native speakers to ensure the highest quality. Second, we present Atar, an attention-based encoder-decoder model for Arabizi transliteration. Training and testing this model on our dataset yields impressive accuracy (79%) and BLEU score (88.49).
Multi-Task sequence prediction for Tunisian Arabizi multi-level annotation
In this paper we propose a multi-task sequence prediction system, based on recurrent neural networks, used to annotate a Tunisian Arabizi corpus on multiple levels. The annotations performed are text classification, tokenization, PoS tagging and encoding of Tunisian Arabizi into CODA* Arabic orthography. The system is trained to predict all the annotation levels in cascade, starting from Arabizi input. We evaluate the system on the TIGER German corpus, suitably converting the data into a multi-task problem, in order to show the effectiveness of our neural architecture. We also show how we used the system to annotate a Tunisian Arabizi corpus, which was afterwards manually corrected and used to further evaluate sequence models on Tunisian data. Our system is developed for the Fairseq framework, which allows fast and easy use for any other sequence prediction problem.
La fréquence de l'alternance codique dans les groupes WhatsApp des étudiants libanais
The means of computer-mediated communication (CMC), and specifically the WhatsApp application, have led to innovative language practices in written communication. Among these practices is the high frequency of Code-Switching (CS), which is defined in this study as a switch from one written code to another within the same message. This quantitative study aims to automatically identify occurrences of Code-Switching in WhatsApp group chats. Over 14 months, we collected 168,219 messages from 30 WhatsApp groups. The study sample encompasses 1,482 bilingual students from 7 Lebanese universities. A computer tool, "DACA" (automatic detection of Code-Switching and Arabizi), has been developed to detect the frequency of this phenomenon resulting from language contact. The results show that the corpus contains 15,342 occurrences of CS, or 9.1% of the total number of messages. 70.5% of these CS occurrences are detected in messages in Arabizi, 17.9% in messages in English, 10.6% in messages in Arabic and 1% in messages in French. The results also reveal that CS in messages composed in Arabizi is quite often towards English (91.3% of the total number of these CS occurrences) and towards Arabizi in messages composed in English, with the same percentage.
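The abstract above does not describe how DACA works internally, so the following is only a minimal sketch of the stated definition of Code-Switching (a switch from one written code to another within the same message). It assumes, for illustration, that a "code" can be approximated by Unicode script, so it only detects switches between Arabic script and Latin script; it cannot separate Arabizi from English or French, which would need lexical resources. The function names are hypothetical. The last lines recompute the reported share of CS occurrences from the two counts given in the abstract.

```python
import re

# Assumption: Arabic script is approximated by the basic Arabic Unicode block,
# and Latin script by unaccented ASCII letters.
ARABIC_CHARS = re.compile(r"[\u0600-\u06FF]")
LATIN_CHARS = re.compile(r"[A-Za-z]")


def token_script(token):
    """Return 'arabic', 'latin', or None for a whitespace-delimited token."""
    if ARABIC_CHARS.search(token):
        return "arabic"
    if LATIN_CHARS.search(token):
        return "latin"
    return None  # digits, emoji, punctuation


def has_script_switch(message):
    """True if the message mixes Arabic-script and Latin-script tokens."""
    scripts = {token_script(t) for t in message.split()}
    scripts.discard(None)
    return len(scripts) > 1


# Figures reported in the abstract.
total_messages = 168_219
cs_occurrences = 15_342
cs_share = cs_occurrences / total_messages * 100
print(f"CS share of messages: {cs_share:.1f}%")
```

A message written entirely in Latin characters, whether Arabizi or English, is never flagged by this sketch, which is exactly the case where a dedicated tool is needed.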
SenZi: A Sentiment Analysis Lexicon for the Latinised Arabic (Arabizi)
Arabizi is an informal written form of dialectal Arabic transcribed in Latin alphanumeric characters. It has proven popular on chat platforms and social media, yet it suffers from a severe lack of natural language processing (NLP) resources. As such, texts written in Arabizi are often disregarded in sentiment analysis tasks for Arabic. In this paper we describe the creation of a sentiment lexicon for Arabizi that was enriched with word embeddings. The result is a new Arabizi lexicon consisting of 11.3K positive and 13.3K negative words. We evaluated this lexicon by classifying the sentiment of Arabizi tweets, achieving an F1-score of 0.72. We provide a detailed error analysis to present the challenges that impact the sentiment analysis of Arabizi.
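As a rough illustration of lexicon-based sentiment classification and of how an F1-score is computed, here is a minimal sketch. It is not the SenZi pipeline: the two tiny word sets are hypothetical stand-ins rather than entries taken from the lexicon, the counting rule is an assumed baseline, and the function names are invented for this example.

```python
# Hypothetical toy lexicon; the real lexicon has thousands of entries.
POSITIVE_WORDS = {"7elo", "mni7"}
NEGATIVE_WORDS = {"bashe3", "ta3ban"}


def classify(tweet):
    """Label a tweet by counting lexicon hits (assumed baseline rule)."""
    tokens = tweet.lower().split()
    score = sum(t in POSITIVE_WORDS for t in tokens) - sum(
        t in NEGATIVE_WORDS for t in tokens
    )
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def f1_score(gold, predicted, label):
    """F1 for one label: harmonic mean of precision and recall."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(g == label and p == label for g, p in pairs)
    false_pos = sum(g != label and p == label for g, p in pairs)
    false_neg = sum(g == label and p != label for g, p in pairs)
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)
```

Because Arabizi spelling is not standardized, exact token matching like this misses spelling variants of the same word, which is one reason a lexicon benefits from expansion.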
A review of sentiment analysis research in Arabic language
Sentiment analysis is a task of natural language processing which has recently attracted increasing attention. However, sentiment analysis research has mainly been carried out for the English language. Although Arabic is ramping up as one of the most used languages on the Internet, only a few studies have focused on Arabic sentiment analysis so far. In this paper, we carry out an in-depth qualitative study of the most important research works in this context by presenting limits and strengths of existing approaches. In particular, we survey both approaches that leverage machine translation or transfer learning to adapt English resources to Arabic and approaches that stem directly from the Arabic language.
Writing Arabizi: Orthographic Variation In Romanized Lebanese Arabic on Twitter
How does technology influence the script in which a language is written? Over the past few decades, a new form of writing has emerged across the Arab world. Known as Arabizi, it is a type of Romanized Arabic that uses Latin characters instead of Arabic script. It is mainly used by youth in technology-related contexts such as social media and texting, and has made many older Arabic speakers fear that more standard forms of Arabic may be in danger because of its use. Prior work on Arabizi suggests that although it is used frequently on social media, its orthography is not yet standardized (Palfreyman and Khalil, 2003; Abdel-Ghaffar et al., 2011). Therefore, this thesis aimed to examine orthographic variation in Romanized Lebanese Arabic, which has rarely been studied as a Romanized dialect. It examined how often Arabizi is used on Twitter in Lebanon and the extent of its orthographic variation. Using Twitter data collected from Beirut, tweets were analyzed to discover the most common orthographic variants in Arabizi for each Arabic letter, as well as the overall rate of Arabizi use. Results show that Arabizi was not used as frequently as hypothesized on Twitter, probably because of its low prestige and increased globalization. However, its consonants are relatively standardized, while its vowels show more variation. This thesis adds to the existing conversation about Romanized Arabic by presenting a detailed study of orthographic variation in Lebanese Arabic. The results could have useful implications for Arabic language ideology and technological endeavors, such as natural language processing or translation programs.