Transcribed speech and user-generated text in Arabic typically contain a
mixture of Modern Standard Arabic (MSA), the standardized language taught in
schools, and Dialectal Arabic (DA), used in daily communications. To handle
this variation, previous work in Arabic NLP has focused on Dialect
Identification (DI) on the sentence or the token level. However, DI treats the
task as binary, whereas we argue that Arabic speakers perceive a spectrum of
dialectness, which we operationalize at the sentence level as the Arabic Level
of Dialectness (ALDi), a continuous linguistic variable. We introduce the
AOC-ALDi dataset (derived from the AOC dataset), containing 127,835 sentences
(17% from news articles and 83% from user comments on those articles) which are
manually labeled with their level of dialectness. We provide a detailed
analysis of AOC-ALDi and show that a model trained on it can effectively
identify levels of dialectness on a range of other corpora (including dialects
and genres not included in AOC-ALDi), providing a more nuanced picture than
traditional DI systems. Through case studies, we illustrate how ALDi can reveal
Arabic speakers' stylistic choices in different situations, a useful property
for sociolinguistic analyses.Comment: Accepted to EMNLP 202