Confounds and Consequences in Geotagged Twitter Data
Twitter is often used in quantitative studies that identify
geographically-preferred topics, writing styles, and entities. These studies
rely on either GPS coordinates attached to individual messages, or on the
user-supplied location field in each profile. In this paper, we compare these
data acquisition techniques and quantify the biases that they introduce; we
also measure their effects on linguistic analysis and text-based geolocation.
GPS-tagging and self-reported locations yield measurably different corpora, and
these linguistic differences are partially attributable to differences in
dataset composition by age and gender. Using a latent variable model to induce
age and gender, we show how these demographic variables interact with geography
to affect language use. We also show that the accuracy of text-based
geolocation varies with population demographics, giving the best results for
men above the age of 40.
Comment: final version for EMNLP 201
Diffusion of Lexical Change in Social Media
Computer-mediated communication is driving fundamental changes in the nature
of written language. We investigate these changes by statistical analysis of a
dataset comprising 107 million Twitter messages (authored by 2.7 million unique
user accounts). Using a latent vector autoregressive model to aggregate across
thousands of words, we identify high-level patterns in diffusion of linguistic
change over the United States. Our model is robust to unpredictable changes in
Twitter's sampling rate, and provides a probabilistic characterization of the
relationship of macro-scale linguistic influence to a set of demographic and
geographic predictors. The results of this analysis offer support for prior
arguments that focus on geographical proximity and population size. However,
demographic similarity -- especially with regard to race -- plays an even more
central role, as cities with similar racial demographics are far more likely to
share linguistic influence. Rather than moving towards a single unified
"netspeak" dialect, language evolution in computer-mediated communication
reproduces existing fault lines in spoken American English.
Comment: preprint of PLOS-ONE paper from November 2014; PLoS ONE 9(11) e11311
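The latent vector autoregressive model in this paper is more involved than a short sketch allows, but its core autoregressive idea — each city's usage of a word at time t is a linear function of all cities' usage at t-1, with off-diagonal coefficients capturing cross-city influence — can be illustrated with a plain VAR(1) fit by least squares. All data below is synthetic and all names are hypothetical; the paper's actual model additionally handles latent variables and changes in Twitter's sampling rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 5 "cities", 200 time steps of a word's usage level.
n_cities, T = 5, 200
A_true = 0.4 * rng.random((n_cities, n_cities)) / n_cities + 0.5 * np.eye(n_cities)

# Simulate x[t] = A_true @ x[t-1] + noise.
x = np.zeros((T, n_cities))
x[0] = rng.random(n_cities)
for t in range(1, T):
    x[t] = A_true @ x[t - 1] + 0.01 * rng.standard_normal(n_cities)

# VAR(1) fit: solve X_next ≈ X_prev @ A.T by least squares.
X_prev, X_next = x[:-1], x[1:]
B, *_ = np.linalg.lstsq(X_prev, X_next, rcond=None)
A_hat = B.T

# Off-diagonal entries of A_hat estimate cross-city influence;
# the diagonal captures each city's own persistence.
influence = A_hat - np.diag(np.diag(A_hat))
```

The least-squares fit recovers the influence matrix well when the system is stable (spectral radius of A below 1), which is why the simulated coefficients above are kept small.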
Distributed Representations of Geographically Situated Language
We introduce a model for incorporating contextual information (such as geography) in learning vector-space representations of situated language. In contrast to approaches to multimodal representation learning that have used properties of the object being described (such as its color), our model includes information about the subject (i.e., the speaker), allowing us to learn the contours of a word's meaning that are shaped by the context in which it is uttered. In a quantitative evaluation on the task of judging geographically informed semantic similarity between representations learned from 1.1 billion words of geo-located tweets, our joint model outperforms comparable independent models that learn meaning in isolation.
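One simple way to realize the idea described above is additively: a word's in-context vector is a shared embedding plus a deviation tied to the speaker's region, so that the same word can drift in meaning across regions while sharing most of its representation. The sketch below uses made-up dimensions and random parameters purely to show the structure; the paper's actual model and training objective differ in detail.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, vocab, regions = 8, 100, 3

# Shared ("main") embedding per word, plus a per-region deviation table.
main = 0.1 * rng.standard_normal((vocab, dim))
region_dev = 0.01 * rng.standard_normal((regions, vocab, dim))

def situated_vector(word_id, region_id):
    """Representation of a word as uttered by a speaker in a given region."""
    return main[word_id] + region_dev[region_id, word_id]

# Geographically informed similarity: compare the same word across regions.
v_a = situated_vector(7, 0)
v_b = situated_vector(7, 1)
cos = v_a @ v_b / (np.linalg.norm(v_a) * np.linalg.norm(v_b))
```

Because the region-specific deviations are small relative to the shared embedding, words start out meaning nearly the same thing everywhere, and training can push regional vectors apart only where the data warrants it.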
Modeling Ideological Salience and Framing in Polarized Online Groups with Graph Neural Networks and Structured Sparsity
The increasing polarization of online political discourse calls for computational tools that automatically detect and monitor ideological divides in social media. We introduce a minimally supervised method that leverages the network structure of online discussion forums, specifically Reddit, to detect polarized concepts. We model polarization along the dimensions of salience and framing, drawing upon insights from moral psychology. Our architecture combines graph neural networks with structured sparsity learning and results in representations for concepts and subreddits that capture temporal ideological dynamics such as right-wing and left-wing radicalization.
Geographic Adaptation of Pretrained Language Models
While pretrained language models (PLMs) have been shown to possess a plethora
of linguistic knowledge, the existing body of research has largely neglected
extralinguistic knowledge, which is generally difficult to obtain by
pretraining on text alone. Here, we contribute to closing this gap by examining
geolinguistic knowledge, i.e., knowledge about geographic variation in
language. We introduce geoadaptation, an intermediate training step that
couples language modeling with geolocation prediction in a multi-task learning
setup. We geoadapt four PLMs, covering language groups from three geographic
areas, and evaluate them on five different tasks: fine-tuned (i.e., supervised)
geolocation prediction, zero-shot (i.e., unsupervised) geolocation prediction,
fine-tuned language identification, zero-shot language identification, and
zero-shot prediction of dialect features. Geoadaptation is very successful at
injecting geolinguistic knowledge into the PLMs: the geoadapted PLMs
consistently outperform PLMs adapted using only language modeling (by
especially wide margins on zero-shot prediction tasks), and we obtain new
state-of-the-art results on two benchmarks for geolocation prediction and
language identification. Furthermore, we show that the effectiveness of
geoadaptation stems from its ability to geographically retrofit the
representation space of the PLMs.
Comment: TACL 2024 (pre-MIT Press publication version)
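The multi-task setup described above combines a language-modeling objective with a geolocation-prediction objective during intermediate training. A minimal numeric sketch of such a combined loss, with synthetic logits and coordinates and an assumed weighting scheme (the paper's actual architecture, losses, and weighting are not reproduced here):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)

# Hypothetical batch: logits over a 50-token vocabulary at 4 masked
# positions, plus predicted and gold (lat, lon) pairs per example.
lm_logits = rng.standard_normal((4, 50))
lm_targets = np.array([3, 17, 17, 42])
pred_coords = rng.standard_normal((4, 2))
gold_coords = rng.standard_normal((4, 2))

# Language-modeling loss: cross-entropy at the masked positions.
probs = softmax(lm_logits)
lm_loss = -np.mean(np.log(probs[np.arange(4), lm_targets]))

# Geolocation loss: here a simple mean squared error on coordinates.
geo_loss = np.mean((pred_coords - gold_coords) ** 2)

# Multi-task objective: weighted sum of the two losses
# (the weight lam = 0.5 is an assumption for illustration).
lam = 0.5
total_loss = lm_loss + lam * geo_loss
```

Both tasks share the encoder's parameters, so gradients from the geolocation term are what "retrofit" the representation space with geographic structure while the language-modeling term preserves linguistic knowledge.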
Computational Sociolinguistics: A Survey
Language is a social phenomenon and variation is inherent to its social
nature. Recently, there has been a surge of interest within the computational
linguistics (CL) community in the social dimension of language. In this article
we present a survey of the emerging field of "Computational Sociolinguistics"
that reflects this increased interest. We aim to provide a comprehensive
overview of CL research on sociolinguistic themes, featuring topics such as the
relation between language and social identity, language use in social
interaction and multilingual communication. Moreover, we demonstrate the
potential for synergy between the research communities involved, by showing how
the large-scale data-driven methods that are widely used in CL can complement
existing sociolinguistic studies, and how sociolinguistics can inform and
challenge the methods and assumptions employed in CL studies. We hope to convey
the possible benefits of a closer collaboration between the two communities and
conclude with a discussion of open challenges.Comment: To appear in Computational Linguistics. Accepted for publication:
18th February, 201