
    Towards Understanding Egyptian Arabic Dialogues

    Labelling a user's utterances to understand their intents, known as Dialogue Act (DA) classification, is a key component of the language understanding layer in automatic dialogue systems. In this paper, we propose a novel approach to labelling users' utterances in Egyptian spontaneous dialogues and instant messages using a Machine Learning (ML) approach, without relying on any special lexicons, cues, or rules. Due to the lack of an Egyptian dialect dialogue corpus, the system is evaluated on a multi-genre corpus of 4,725 utterances spanning three domains, collected and annotated manually from Egyptian call centers. The system achieves an F1 score of 70.36% over all domains. Comment: arXiv admin note: substantial text overlap with arXiv:1505.0308
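
    The abstract does not include code, but the kind of lexicon-free ML pipeline it describes can be illustrated with a minimal sketch: character n-gram TF-IDF features feeding a linear classifier. The toy utterances, labels, and model choice below are illustrative assumptions, not the paper's actual corpus or system.

```python
# Minimal sketch of lexicon-free dialogue act (DA) classification.
# Toy data and model choice are illustrative assumptions only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical Egyptian Arabic utterances with dialogue-act labels.
utterances = [
    "عايز أعرف رصيدي",            # "I want to know my balance"
    "شكراً جداً",                  # "Thank you very much"
    "ممكن أكلم خدمة العملاء؟",     # "Can I speak to customer service?"
    "تمام، خلاص",                  # "OK, done"
]
labels = ["request", "thanking", "request", "acknowledge"]

# Character n-grams avoid any dependence on special lexicons, cues, or rules.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(utterances, labels)
print(clf.predict(["ممكن أعرف الرصيد؟"]))  # likely 'request', given shared n-grams
```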

    Holy Tweets: Exploring the Sharing of the Quran on Twitter

    While social media offer users a platform for self-expression, identity exploration, and community management, among other functions, they also offer space for religious practice and expression. In this paper, we explore social media spaces as they subtend new forms of religious experiences and rituals. We present a mixed-method study to understand the practice of sharing Quran verses on Arabic Twitter in their cultural context by combining a quantitative analysis of the most shared Quran verses, the topics covered by these verses, and the modalities of sharing, with a qualitative study of users' goals. This analysis of a set of 2.6 million tweets containing Quran verses demonstrates that online religious expression in the form of sharing Quran verses both extends offline religious life and supports new forms of religious expression, including goals such as doing good deeds, giving charity, holding memorials, and showing solidarity. By analysing the responses to a survey, we found that our Arab Muslim respondents conceptualize social media platforms as everlasting, at least beyond their lifetimes, and consider them to be effective for certain religious practices, such as reciting Quran, supplication (dua), and ceaseless charity. Our quantitative analysis of the most shared verses of the Quran underlines this commitment to religious expression as an act of worship, highlighting topics such as the hereafter, God's mercy, and sharia law. We note that verses on topics such as jihad are shared much less often, contradicting some media representations of Muslim social media use and practice. Comment: Paper accepted to The 23rd ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW) 2020
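
    The quantitative side of such a study, ranking the most shared verses in a large tweet collection, comes down to a frequency analysis. A minimal sketch is below; the column names, verse keys, and toy rows are assumptions, not the paper's 2.6-million-tweet dataset.

```python
# Sketch of tallying the most shared Quran verses in a tweet collection.
# Column names and toy rows are illustrative assumptions.
import pandas as pd

tweets = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4, 5],
    "verse": ["2:255", "112:1", "2:255", "1:1", "2:255"],  # surah:ayah keys
})

# Rank verses by how often they were shared, analogous to the ranking described above.
top_verses = tweets["verse"].value_counts()
print(top_verses.head(3))
```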

    On the Robustness of Arabic Speech Dialect Identification

    Arabic dialect identification (ADI) tools are an important part of the large-scale data collection pipelines necessary for training speech recognition models. As these pipelines require application of ADI tools to potentially out-of-domain data, we aim to investigate how vulnerable the tools may be to this domain shift. With self-supervised learning (SSL) models as a starting point, we evaluate transfer learning and direct classification from SSL features. We undertake our evaluation under rich conditions, with a goal to develop ADI systems from pretrained models and ultimately evaluate performance on newly collected data. In order to understand what factors contribute to model decisions, we carry out a careful human study of a subset of our data. Our analysis confirms that domain shift is a major challenge for ADI models. We also find that while self-training does alleviate this challenge, it may be insufficient for realistic conditions.
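
    As a concrete illustration of "direct classification from SSL features", the sketch below mean-pools frozen features from a self-supervised speech encoder and fits a linear probe on top. The checkpoint name, dialect labels, and random stand-in audio are assumptions, not the models or data used in the paper.

```python
# Sketch of direct dialect classification from frozen SSL speech features.
# The checkpoint, dialect labels, and random waveforms are illustrative assumptions.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

checkpoint = "facebook/wav2vec2-base"  # assumed; any SSL speech encoder would do
extractor = Wav2Vec2FeatureExtractor.from_pretrained(checkpoint)
encoder = Wav2Vec2Model.from_pretrained(checkpoint).eval()

def embed(waveform: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Mean-pool frozen SSL frame features into one utterance-level vector."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        frames = encoder(**inputs).last_hidden_state  # (1, T, 768)
    return frames.mean(dim=1).squeeze(0).numpy()

# Toy "utterances": random audio standing in for real dialect recordings.
rng = np.random.default_rng(0)
X = np.stack([embed(rng.standard_normal(16000).astype(np.float32)) for _ in range(8)])
y = ["EGY", "GLF", "LEV", "MSA"] * 2  # assumed dialect labels

# A linear probe over frozen features, i.e. direct classification.
LogisticRegression(max_iter=1000).fit(X, y)
```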

    Octopus: A Multitask Model and Toolkit for Arabic Natural Language Generation

    Understanding Arabic text and generating human-like responses is a challenging endeavor. While many researchers have proposed models and solutions for individual problems, there is an acute shortage of a comprehensive Arabic natural language generation toolkit that is capable of handling a wide range of tasks. In this work, we present a novel Arabic text-to-text Transformer model, namely AraT5v2. Our new model is methodically trained on extensive and diverse data, utilizing an extended sequence length of 2,048 tokens. We explore various pretraining strategies including unsupervised, supervised, and joint pretraining, under both single and multitask settings. Our models outperform competitive baselines by large margins. We take our work one step further by developing and publicly releasing Octopus, a Python-based package and command-line toolkit tailored for eight Arabic generation tasks, all exploiting a single model. We release the models and the toolkit on our public repository.
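
    A sketch of how a single Arabic text-to-text model can be driven through the standard Hugging Face interface is shown below. The model identifier and the title-generation-style use are assumptions; the actual Octopus toolkit and released checkpoints should be obtained from the authors' repository.

```python
# Sketch of prompting an Arabic text-to-text (T5-style) model for generation.
# The model identifier and the task framing are illustrative assumptions.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "UBC-NLP/AraT5v2-base-1024"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# One multitask model can serve several generation tasks; here a
# hypothetical title-generation style input is used as an example.
text = "ولد العالم ابن الهيثم في البصرة ودرس البصريات والرياضيات."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
output_ids = model.generate(**inputs, max_new_tokens=32, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```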

    SERENGETI: Massively Multilingual Language Models for Africa

    Multilingual pretrained language models (mPLMs) acquire valuable, generalizable linguistic information during pretraining and have advanced the state of the art on task-specific finetuning. To date, only ~31 out of ~2,000 African languages are covered in existing language models. We ameliorate this limitation by developing SERENGETI, a massively multilingual language model that covers 517 African languages and language varieties. We evaluate our novel models on eight natural language understanding tasks across 20 datasets, comparing to 4 mPLMs that cover 4-23 African languages. SERENGETI outperforms other models on 11 datasets across the eight tasks, achieving an average F_1 of 82.27. We also perform analyses of errors from our models, which allow us to investigate the influence of language genealogy and linguistic similarity when the models are applied under zero-shot settings. We will publicly release our models for research (https://github.com/UBC-NLP/serengeti). Comment: To appear in Findings of ACL 2023
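
    To show what task-specific finetuning of such an encoder looks like in practice, the sketch below attaches a sequence-classification head to an assumed checkpoint name. The identifier, label count, and Swahili example are all illustrative, not the released models or evaluation datasets.

```python
# Sketch of adding a task head to a massively multilingual encoder for an
# NLU task. The checkpoint name, label set, and example are assumptions.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "UBC-NLP/serengeti-E250"  # assumed checkpoint name; see the linked repo
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=3)

# Encode a hypothetical Swahili sentence for, e.g., news classification,
# then finetune with a standard training loop or the Trainer API.
batch = tokenizer("Timu ya taifa imeshinda mechi ya jana.", return_tensors="pt")
logits = model(**batch).logits
print(logits.shape)  # torch.Size([1, 3])
```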