228 research outputs found
ALDi: Quantifying the Arabic Level of Dialectness of Text
Transcribed speech and user-generated text in Arabic typically contain a mixture of Modern Standard Arabic (MSA), the standardized language taught in schools, and Dialectal Arabic (DA), used in daily communications. To handle this variation, previous work in Arabic NLP has focused on Dialect Identification (DI) on the sentence or the token level. However, DI treats the task as binary, whereas we argue that Arabic speakers perceive a spectrum of dialectness, which we operationalize at the sentence level as the Arabic Level of Dialectness (ALDi), a continuous linguistic variable. We introduce the AOC-ALDi dataset (derived from the AOC dataset), containing 127,835 sentences (17% from news articles and 83% from user comments on those articles) which are manually labeled with their level of dialectness. We provide a detailed analysis of AOC-ALDi and show that a model trained on it can effectively identify levels of dialectness on a range of other corpora (including dialects and genres not included in AOC-ALDi), providing a more nuanced picture than traditional DI systems. Through case studies, we illustrate how ALDi can reveal Arabic speakers' stylistic choices in different situations, a useful property for sociolinguistic analyses
ALDi: Quantifying the Arabic Level of Dialectness of Text
Transcribed speech and user-generated text in Arabic typically contain a
mixture of Modern Standard Arabic (MSA), the standardized language taught in
schools, and Dialectal Arabic (DA), used in daily communications. To handle
this variation, previous work in Arabic NLP has focused on Dialect
Identification (DI) on the sentence or the token level. However, DI treats the
task as binary, whereas we argue that Arabic speakers perceive a spectrum of
dialectness, which we operationalize at the sentence level as the Arabic Level
of Dialectness (ALDi), a continuous linguistic variable. We introduce the
AOC-ALDi dataset (derived from the AOC dataset), containing 127,835 sentences
(17% from news articles and 83% from user comments on those articles) which are
manually labeled with their level of dialectness. We provide a detailed
analysis of AOC-ALDi and show that a model trained on it can effectively
identify levels of dialectness on a range of other corpora (including dialects
and genres not included in AOC-ALDi), providing a more nuanced picture than
traditional DI systems. Through case studies, we illustrate how ALDi can reveal
Arabic speakers' stylistic choices in different situations, a useful property
for sociolinguistic analyses.Comment: Accepted to EMNLP 202
Perspective Identification in Informal Text
This dissertation studies the problem of identifying the ideological perspective of people as expressed in their written text. One's perspective is often expressed in his/her stance towards polarizing topics. We are interested in studying how nuanced linguistic cues can be used to identify the perspective of a person in informal genres. Moreover, we are interested in exploring the problem from a multilingual perspective comparing and contrasting linguistics devices used in both English informal genres datasets discussing American ideological issues and Arabic discussion fora posts related to Egyptian politics. %In doing so, we solve several challenges.
Our first and utmost goal is building computational systems that can successfully identify the perspective from which a given informal text is written while studying what linguistic cues work best for each language and drawing insights into the similarities and differences between the notion of perspective in both studied languages. We build computational systems that can successfully identify the stance of a person in English informal text that deal with different topics that are determined by one's perspective, such as legalization of abortion, feminist movement, gay and gun rights; additionally, we are able to identify a more general notion of perspective–namely the 2012 choice of presidential candidate–as well as build systems for automatically identifying different elements of a person's perspective given an Egyptian discussion forum comment. The systems utilize several lexical and semantic features for both languages. Specifically, for English we explore the use of word sense disambiguation, opinion features, latent and frame semantics as well; as Linguistic Inquiry and Word Count features; in Arabic, however, in addition to using sentiment and latent semantics, we study whether linguistic code-switching (LCS) between the standard and dialectal forms for the language can help as a cue for uncovering the perspective from which a comment was written.
This leads us to the challenge of devising computational systems that can handle LCS in Arabic. The Arabic language has a diglossic nature where the standard form of the language (MSA) coexists with the regional dialects (DA) corresponding to the native mother tongue of Arabic speakers in different parts of the Arab world. DA is ubiquitously prevalent in written informal genres and in most cases it is code-switched with MSA. The presence of code-switching degrades the performance of almost any MSA-only trained Natural Language Processing tool when applied to DA or to code-switched MSA-DA content. In order to solve this challenge, we build a state-of-the-art system–AIDA–to computationally handle token and sentence-level code-switching.
On a conceptual level, for handling and processing Egyptian ideological perspectives, we note the lack of a taxonomy for the most common perspectives among Egyptians and the lack of corresponding annotated corpora. In solving this challenge, we develop a taxonomy for the most common community perspectives among Egyptians and use an iterative feedback-loop process to devise guidelines on how to successfully annotate a given online discussion forum post with different elements of a person's perspective. Using the proposed taxonomy and annotation guidelines, we annotate a large set of Egyptian discussion fora posts to identify a comment's perspective as conveyed in the priority expressed by the comment, as well as the stance on major political entities
Computational Sociolinguistics: A Survey
Language is a social phenomenon and variation is inherent to its social
nature. Recently, there has been a surge of interest within the computational
linguistics (CL) community in the social dimension of language. In this article
we present a survey of the emerging field of "Computational Sociolinguistics"
that reflects this increased interest. We aim to provide a comprehensive
overview of CL research on sociolinguistic themes, featuring topics such as the
relation between language and social identity, language use in social
interaction and multilingual communication. Moreover, we demonstrate the
potential for synergy between the research communities involved, by showing how
the large-scale data-driven methods that are widely used in CL can complement
existing sociolinguistic studies, and how sociolinguistics can inform and
challenge the methods and assumptions employed in CL studies. We hope to convey
the possible benefits of a closer collaboration between the two communities and
conclude with a discussion of open challenges.Comment: To appear in Computational Linguistics. Accepted for publication:
18th February, 201
Sentiment Analysis for micro-blogging platforms in Arabic
Sentiment Analysis (SA) concerns the automatic extraction and classification of
sentiments conveyed in a given text, i.e. labelling a text instance as positive, negative
or neutral. SA research has attracted increasing interest in the past few years due
to its numerous real-world applications. The recent interest in SA is also fuelled
by the growing popularity of social media platforms (e.g. Twitter), as they provide
large amounts of freely available and highly subjective content that can be readily
crawled.
Most previous SA work has focused on English with considerable success. In
this work, we focus on studying SA in Arabic, as a less-resourced language. This
work reports on a wide set of investigations for SA in Arabic tweets, systematically
comparing three existing approaches that have been shown successful in English.
Specifically, we report experiments evaluating fully-supervised-based (SL), distantsupervision-
based (DS), and machine-translation-based (MT) approaches for SA.
The investigations cover training SA models on manually-labelled (i.e. in SL methods)
and automatically-labelled (i.e. in DS methods) data-sets. In addition, we
explored an MT-based approach that utilises existing off-the-shelf SA systems for
English with no need for training data, assessing the impact of translation errors on
the performance of SA models, which has not been previously addressed for Arabic
tweets. Unlike previous work, we benchmark the trained models against an independent
test-set of >3.5k instances collected at different points in time to account
for topic-shifts issues in the Twitter stream. Despite the challenging noisy medium
of Twitter and the mixture use of Dialectal and Standard forms of Arabic, we show
that our SA systems are able to attain performance scores on Arabic tweets that
are comparable to the state-of-the-art SA systems for English tweets.
The thesis also investigates the role of a wide set of features, including syntactic,
semantic, morphological, language-style and Twitter-specific features. We introduce
a set of affective-cues/social-signals features that capture information about the
presence of contextual cues (e.g. prayers, laughter, etc.) to correlate them with the
sentiment conveyed in an instance. Our investigations reveal a generally positive
impact for utilising these features for SA in Arabic. Specifically, we show that a rich
set of morphological features, which has not been previously used, extracted using
a publicly-available morphological analyser for Arabic can significantly improve the
performance of SA classifiers. We also demonstrate the usefulness of languageindependent
features (e.g. Twitter-specific) for SA. Our feature-sets outperform
results reported in previous work on a previously built data-set
Final FLaReNet deliverable: Language Resources for the Future - The Future of Language Resources
Language Technologies (LT), together with their backbone, Language Resources (LR), provide an essential support to the challenge of Multilingualism and ICT of the future. The main task of language technologies is to bridge language barriers and to help creating a new environment where information flows smoothly across frontiers and languages, no matter the country, and the language, of origin. To achieve this goal, all players involved need to act as a community able to join forces on a set of shared priorities. However, until now the field of Language Resources and Technology has long suffered from an excess of individuality and fragmentation, with a lack of coherence concerning the priorities for the field, the direction to move, not to mention a common timeframe. The context encountered by the FLaReNet project was thus represented by an active field needing a coherence that can only be given by sharing common priorities and endeavours. FLaReNet has contributed to the creation of this coherence by gathering a wide community of experts and making them participate in the definition of an exhaustive set of recommendations
- …