Towards Understanding Egyptian Arabic Dialogues
Labelling a user's utterances to understand their intent, known as Dialogue
Act (DA) classification, is a key component of the language understanding layer
in automatic dialogue systems. In this paper, we propose a novel approach to
labelling users' utterances in Egyptian spontaneous dialogues and instant
messages using Machine Learning (ML), without relying on any special lexicons,
cues, or rules. Due to the lack of an Egyptian-dialect dialogue corpus, the
system is evaluated on a multi-genre corpus of 4725 utterances spanning three
domains, collected and annotated manually from Egyptian call-centers. The
system achieves an F1 score of 70.36% across all domains.
Comment: arXiv admin note: substantial text overlap with arXiv:1505.0308
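The lexicon-free ML approach described above can be illustrated with a minimal sketch: treating DA classification as plain text classification over utterances. The labels, toy utterances, and the TF-IDF/logistic-regression pipeline here are illustrative assumptions, not the paper's actual features, classifier, or Egyptian-dialect data.

```python
# Minimal sketch of dialogue-act classification as plain text classification,
# with no hand-crafted lexicons, cues, or rules. Utterances and DA labels
# below are invented toy examples, not the paper's corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

utterances = [
    "what time do you open",      # Question
    "i want to cancel my order",  # Request
    "thanks a lot",               # Thanking
    "when is the delivery",       # Question
    "please change my address",   # Request
    "thank you so much",          # Thanking
]
labels = ["Question", "Request", "Thanking",
          "Question", "Request", "Thanking"]

# Character n-grams are one common choice for dialectal text where
# tokenization is unreliable (an assumption here, not the paper's setup).
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(utterances, labels)

print(clf.predict(["could you cancel it please"])[0])
```

A real system would of course train on thousands of annotated utterances per domain rather than a handful of toy lines.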
Holy Tweets: Exploring the Sharing of the Quran on Twitter
While social media offer users a platform for self-expression, identity
exploration, and community management, among other functions, they also offer
space for religious practice and expression. In this paper, we explore social
media spaces as they subtend new forms of religious experiences and rituals. We
present a mixed-method study to understand the practice of sharing Quran verses
on Arabic Twitter in their cultural context by combining a quantitative
analysis of the most shared Quran verses, the topics covered by these verses,
and the modalities of sharing, with a qualitative study of users' goals. This
analysis of a set of 2.6 million tweets containing Quran verses demonstrates
that online religious expression in the form of sharing Quran verses both
extends offline religious life and supports new forms of religious expression
including goals such as doing good deeds, giving charity, holding memorials,
and showing solidarity. By analysing responses to a survey, we found that
our Arab Muslim respondents conceptualize social media platforms as
everlasting, or at least as outlasting their own lifetimes, and consider them
effective for certain religious practices, such as reciting the Quran,
supplication (dua), and ceaseless charity. Our quantitative analysis of the most shared
verses of the Quran underlines this commitment to religious expression as an
act of worship, highlighting topics such as the hereafter, God's mercy, and
sharia law. We note that verses on topics such as jihad are shared much less
often, contradicting some media representation of Muslim social media use and
practice.
Comment: Paper accepted to The 23rd ACM Conference on Computer-Supported
Cooperative Work and Social Computing (CSCW) 202
On the Robustness of Arabic Speech Dialect Identification
Arabic dialect identification (ADI) tools are an important part of the
large-scale data collection pipelines necessary for training speech recognition
models. As these pipelines require application of ADI tools to potentially
out-of-domain data, we aim to investigate how vulnerable the tools may be to
this domain shift. With self-supervised learning (SSL) models as a starting
point, we evaluate transfer learning and direct classification from SSL
features. We undertake our evaluation under rich conditions, with a goal to
develop ADI systems from pretrained models and ultimately evaluate performance
on newly collected data. In order to understand what factors contribute to
model decisions, we carry out a careful human study of a subset of our data.
Our analysis confirms that domain shift is a major challenge for ADI models. We
also find that while self-training does alleviate these challenges, it may be
insufficient for realistic conditions.
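"Direct classification from SSL features" can be sketched as fitting a lightweight classifier on frozen, pooled embeddings. In this toy version, random vectors stand in for real SSL features (e.g. from a wav2vec-style model); the dialect labels, embedding dimension, and cluster structure are illustrative assumptions, not the paper's data.

```python
# Sketch of direct classification from frozen SSL features: a self-supervised
# speech model yields one pooled embedding per utterance, and a lightweight
# classifier predicts the dialect. Random vectors simulate the embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dialects = ["EGY", "GLF", "LEV", "MSA", "NOR"]  # illustrative label set
dim = 64  # stand-in for the SSL embedding size

# Simulate mean-pooled utterance embeddings: one cluster per dialect.
centers = rng.normal(size=(len(dialects), dim))
X = np.vstack([centers[i] + 0.1 * rng.normal(size=(50, dim))
               for i in range(len(dialects))])
y = np.repeat(dialects, 50)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# In-domain accuracy is near-perfect on these well-separated clusters;
# the abstract's point is that accuracy can drop sharply under domain shift,
# which this simulation does not model.
print(round(clf.score(X, y), 2))
```

Because the simulated clusters are cleanly separated, the linear probe scores near 1.0 in-domain; real ADI evaluation would compare this against out-of-domain test sets.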
Octopus: A Multitask Model and Toolkit for Arabic Natural Language Generation
Understanding Arabic text and generating human-like responses is a
challenging endeavor. While many researchers have proposed models and solutions
for individual problems, there is an acute shortage of a comprehensive Arabic
natural language generation toolkit that is capable of handling a wide range of
tasks. In this work, we present a novel Arabic text-to-text Transformer model,
namely AraT5v2. Our new model is methodically trained on extensive and diverse
data, utilizing an extended sequence length of 2,048 tokens. We explore various
pretraining strategies including unsupervised, supervised, and joint
pretraining, under both single and multitask settings. Our models outperform
competitive baselines with large margins. We take our work one step further by
developing and publicly releasing Octopus, a Python-based package and
command-line toolkit tailored for eight Arabic generation tasks, all exploiting
a single model. We release the models and the toolkit in our public repository.
SERENGETI: Massively Multilingual Language Models for Africa
Multilingual pretrained language models (mPLMs) acquire valuable,
generalizable linguistic information during pretraining and have advanced the
state of the art on task-specific finetuning. To date, only ~31 out of ~2,000
African languages are covered in existing language models. We ameliorate this
limitation by developing SERENGETI, a massively multilingual language model
that covers 517 African languages and language varieties. We evaluate our novel
models on eight natural language understanding tasks across 20 datasets,
comparing to 4 mPLMs that cover 4-23 African languages. SERENGETI outperforms
the other models on 11 datasets across the eight tasks, achieving an average
F_1 of 82.27. We also perform analyses of errors from our models, which allow us to
investigate the influence of language genealogy and linguistic similarity when
the models are applied under zero-shot settings. We will publicly release our
models for
research.\footnote{\href{https://github.com/UBC-NLP/serengeti}{https://github.com/UBC-NLP/serengeti}}
Comment: To appear in Findings of ACL 202