Towards Understanding Egyptian Arabic Dialogues
Labelling a user's utterances to understand their intent, known as Dialogue
Act (DA) classification, is a key component of the language understanding layer
in automatic dialogue systems. In this paper, we propose a novel approach to
labelling users' utterances in Egyptian spontaneous dialogues and instant
messages using Machine Learning (ML), without relying on any special lexicons,
cues, or rules. Due to the lack of an Egyptian-dialect dialogue corpus, the
system is evaluated on a multi-genre corpus of 4725 utterances spanning three
domains, collected and annotated manually from Egyptian call-centers. The
system achieves an F1 score of 70.36% across all domains.
Comment: arXiv admin note: substantial text overlap with arXiv:1505.0308
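The lexicon-free ML approach described above can be illustrated with a minimal sketch: treating DA classification as plain text classification over utterances. The labels, toy utterances, and the TF-IDF/logistic-regression pipeline here are illustrative assumptions, not the paper's actual features, classifier, or Egyptian-dialect data.

```python
# Minimal sketch of dialogue-act classification as plain text classification,
# with no hand-crafted lexicons, cues, or rules. Utterances and DA labels
# below are invented toy examples, not the paper's corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

utterances = [
    "what time do you open",      # Question
    "i want to cancel my order",  # Request
    "thanks a lot",               # Thanking
    "when is the delivery",       # Question
    "please change my address",   # Request
    "thank you so much",          # Thanking
]
labels = ["Question", "Request", "Thanking",
          "Question", "Request", "Thanking"]

# Character n-grams are one common choice for dialectal text where
# tokenization is unreliable (an assumption here, not the paper's setup).
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(utterances, labels)

print(clf.predict(["could you cancel it please"])[0])
```

A real system would of course train on thousands of annotated utterances per domain rather than a handful of toy lines.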
Holy Tweets: Exploring the Sharing of the Quran on Twitter
While social media offer users a platform for self-expression, identity
exploration, and community management, among other functions, they also offer
space for religious practice and expression. In this paper, we explore social
media spaces as they subtend new forms of religious experiences and rituals. We
present a mixed-method study to understand the practice of sharing Quran verses
on Arabic Twitter in their cultural context by combining a quantitative
analysis of the most shared Quran verses, the topics covered by these verses,
and the modalities of sharing, with a qualitative study of users' goals. This
analysis of a set of 2.6 million tweets containing Quran verses demonstrates
that online religious expression in the form of sharing Quran verses both
extends offline religious life and supports new forms of religious expression
including goals such as doing good deeds, giving charity, holding memorials,
and showing solidarity. By analysing responses to a survey, we found that
our Arab Muslim respondents conceptualize social media platforms as
everlasting, or at least as outlasting their own lifetimes, and consider them
effective for certain religious practices, such as reciting the Quran,
supplication (dua), and ceaseless charity. Our quantitative analysis of the most shared
verses of the Quran underlines this commitment to religious expression as an
act of worship, highlighting topics such as the hereafter, God's mercy, and
sharia law. We note that verses on topics such as jihad are shared much less
often, contradicting some media representation of Muslim social media use and
practice.
Comment: Paper accepted to The 23rd ACM Conference on Computer-Supported
Cooperative Work and Social Computing (CSCW) 202
On the Robustness of Arabic Speech Dialect Identification
Arabic dialect identification (ADI) tools are an important part of the
large-scale data collection pipelines necessary for training speech recognition
models. As these pipelines require application of ADI tools to potentially
out-of-domain data, we aim to investigate how vulnerable the tools may be to
this domain shift. With self-supervised learning (SSL) models as a starting
point, we evaluate transfer learning and direct classification from SSL
features. We undertake our evaluation under rich conditions, with a goal to
develop ADI systems from pretrained models and ultimately evaluate performance
on newly collected data. In order to understand what factors contribute to
model decisions, we carry out a careful human study of a subset of our data.
Our analysis confirms that domain shift is a major challenge for ADI models. We
also find that while self-training does alleviate these challenges, it may be
insufficient for realistic conditions.
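"Direct classification from SSL features" can be sketched as fitting a lightweight classifier on frozen, pooled embeddings. In this toy version, random vectors stand in for real SSL features (e.g. from a wav2vec-style model); the dialect labels, embedding dimension, and cluster structure are illustrative assumptions, not the paper's data.

```python
# Sketch of direct classification from frozen SSL features: a self-supervised
# speech model yields one pooled embedding per utterance, and a lightweight
# classifier predicts the dialect. Random vectors simulate the embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dialects = ["EGY", "GLF", "LEV", "MSA", "NOR"]  # illustrative label set
dim = 64  # stand-in for the SSL embedding size

# Simulate mean-pooled utterance embeddings: one cluster per dialect.
centers = rng.normal(size=(len(dialects), dim))
X = np.vstack([centers[i] + 0.1 * rng.normal(size=(50, dim))
               for i in range(len(dialects))])
y = np.repeat(dialects, 50)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# In-domain accuracy is near-perfect on these well-separated clusters;
# the abstract's point is that accuracy can drop sharply under domain shift,
# which this simulation does not model.
print(round(clf.score(X, y), 2))
```

Because the simulated clusters are cleanly separated, the linear probe scores near 1.0 in-domain; real ADI evaluation would compare this against out-of-domain test sets.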
Octopus: A Multitask Model and Toolkit for Arabic Natural Language Generation
Understanding Arabic text and generating human-like responses is a
challenging endeavor. While many researchers have proposed models and solutions
for individual problems, there is an acute shortage of a comprehensive Arabic
natural language generation toolkit that is capable of handling a wide range of
tasks. In this work, we present a novel Arabic text-to-text Transformer model,
namely AraT5v2. Our new model is methodically trained on extensive and diverse
data, utilizing an extended sequence length of 2,048 tokens. We explore various
pretraining strategies including unsupervised, supervised, and joint
pretraining, under both single and multitask settings. Our models outperform
competitive baselines with large margins. We take our work one step further by
developing and publicly releasing Octopus, a Python-based package and
command-line toolkit tailored for eight Arabic generation tasks, all exploiting
a single model. We release the models and the toolkit in our public repository.
SERENGETI: Massively Multilingual Language Models for Africa
Multilingual pretrained language models (mPLMs) acquire valuable,
generalizable linguistic information during pretraining and have advanced the
state of the art on task-specific finetuning. To date, only ~31 out of ~2,000
African languages are covered in existing language models. We ameliorate this
limitation by developing SERENGETI, a massively multilingual language model
that covers 517 African languages and language varieties. We evaluate our novel
models on eight natural language understanding tasks across 20 datasets,
comparing to 4 mPLMs that cover 4-23 African languages. SERENGETI outperforms
the other models on 11 datasets across the eight tasks, achieving an average
F_1 of 82.27. We also perform analyses of errors from our models, which allow us to
investigate the influence of language genealogy and linguistic similarity when
the models are applied under zero-shot settings. We will publicly release our
models for
research.\footnote{\href{https://github.com/UBC-NLP/serengeti}{https://github.com/UBC-NLP/serengeti}}
Comment: To appear in Findings of ACL 202