9,002 research outputs found
Computational Sociolinguistics: A Survey
Language is a social phenomenon and variation is inherent to its social
nature. Recently, there has been a surge of interest within the computational
linguistics (CL) community in the social dimension of language. In this article
we present a survey of the emerging field of "Computational Sociolinguistics"
that reflects this increased interest. We aim to provide a comprehensive
overview of CL research on sociolinguistic themes, featuring topics such as the
relation between language and social identity, language use in social
interaction and multilingual communication. Moreover, we demonstrate the
potential for synergy between the research communities involved, by showing how
the large-scale data-driven methods that are widely used in CL can complement
existing sociolinguistic studies, and how sociolinguistics can inform and
challenge the methods and assumptions employed in CL studies. We hope to convey
the possible benefits of a closer collaboration between the two communities and
conclude with a discussion of open challenges.Comment: To appear in Computational Linguistics. Accepted for publication:
18th February, 201
Lexical Normalization for Code-switched Data and its Effect on POS Tagging
Lexical normalization, the translation of non-canonical data to standard
language, has shown to improve the performance of manynatural language
processing tasks on social media. Yet, using multiple languages in one
utterance, also called code-switching (CS), is frequently overlooked by these
normalization systems, despite its common use in social media. In this paper,
we propose three normalization models specifically designed to handle
code-switched data which we evaluate for two language pairs: Indonesian-English
(Id-En) and Turkish-German (Tr-De). For the latter, we introduce novel
normalization layers and their corresponding language ID and POS tags for the
dataset, and evaluate the downstream effect of normalization on POS tagging.
Results show that our CS-tailored normalization models outperform Id-En state
of the art and Tr-De monolingual models, and lead to 5.4% relative performance
increase for POS tagging as compared to unnormalized input
- …