6,967 research outputs found
Understanding U.S. regional linguistic variation with Twitter data analysis
We analyze a Big Data set of geo-tagged tweets for a year (Oct. 2013–Oct. 2014) to understand the regional linguistic variation in the U.S. Prior work on regional linguistic variations usually took a long time to collect data and focused on either rural or urban areas. Geo-tagged Twitter data offers an unprecedented database with rich linguistic representation of fine spatiotemporal resolution and continuity. From the one-year Twitter corpus, we extract lexical characteristics for twitter users by summarizing the frequencies of a set of lexical alternations that each user has used. We spatially aggregate and smooth each lexical characteristic to derive county-based linguistic variables, from which orthogonal dimensions are extracted using the principal component analysis (PCA). Finally a regionalization method is used to discover hierarchical dialect regions using the PCA components. The regionalization results reveal interesting linguistic regional variations in the U.S. The discovered regions not only confirm past research findings in the literature but also provide new insights and a more detailed understanding of very recent linguistic patterns in the U.S
Mapping the Americanization of English in Space and Time
As global political preeminence gradually shifted from the United Kingdom to
the United States, so did the capacity to culturally influence the rest of the
world. In this work, we analyze how the world-wide varieties of written English
are evolving. We study both the spatial and temporal variations of vocabulary
and spelling of English using a large corpus of geolocated tweets and the
Google Books datasets corresponding to books published in the US and the UK.
The advantage of our approach is that we can address both standard written
language (Google Books) and the more colloquial forms of microblogging messages
(Twitter). We find that American English is the dominant form of English
outside the UK and that its influence is felt even within the UK borders.
Finally, we analyze how this trend has evolved over time and the impact that
some cultural events have had in shaping it.Comment: 16 pages, 6 figures, 2 tables. Published versio
Dialectometric analysis of language variation in Twitter
In the last few years, microblogging platforms such as Twitter have given
rise to a deluge of textual data that can be used for the analysis of informal
communication between millions of individuals. In this work, we propose an
information-theoretic approach to geographic language variation using a corpus
based on Twitter. We test our models with tens of concepts and their associated
keywords detected in Spanish tweets geolocated in Spain. We employ
dialectometric measures (cosine similarity and Jensen-Shannon divergence) to
quantify the linguistic distance on the lexical level between cells created in
a uniform grid over the map. This can be done for a single concept or in the
general case taking into account an average of the considered variants. The
latter permits an analysis of the dialects that naturally emerge from the data.
Interestingly, our results reveal the existence of two dialect macrovarieties.
The first group includes a region-specific speech spoken in small towns and
rural areas whereas the second cluster encompasses cities that tend to use a
more uniform variety. Since the results obtained with the two different metrics
qualitatively agree, our work suggests that social media corpora can be
efficiently used for dialectometric analyses.Comment: 10 pages, 7 figures, 1 table. Accepted to VarDial 201
Continuous Representation of Location for Geolocation and Lexical Dialectology using Mixture Density Networks
We propose a method for embedding two-dimensional locations in a continuous
vector space using a neural network-based model incorporating mixtures of
Gaussian distributions, presenting two model variants for text-based
geolocation and lexical dialectology. Evaluated over Twitter data, the proposed
model outperforms conventional regression-based geolocation and provides a
better estimate of uncertainty. We also show the effectiveness of the
representation for predicting words from location in lexical dialectology, and
evaluate it using the DARE dataset.Comment: Conference on Empirical Methods in Natural Language Processing (EMNLP
2017) September 2017, Copenhagen, Denmar
Understanding and Measuring Psychological Stress using Social Media
A body of literature has demonstrated that users' mental health conditions,
such as depression and anxiety, can be predicted from their social media
language. There is still a gap in the scientific understanding of how
psychological stress is expressed on social media. Stress is one of the primary
underlying causes and correlates of chronic physical illnesses and mental
health conditions. In this paper, we explore the language of psychological
stress with a dataset of 601 social media users, who answered the Perceived
Stress Scale questionnaire and also consented to share their Facebook and
Twitter data. Firstly, we find that stressed users post about exhaustion,
losing control, increased self-focus and physical pain as compared to posts
about breakfast, family-time, and travel by users who are not stressed.
Secondly, we find that Facebook language is more predictive of stress than
Twitter language. Thirdly, we demonstrate how the language based models thus
developed can be adapted and be scaled to measure county-level trends. Since
county-level language is easily available on Twitter using the Streaming API,
we explore multiple domain adaptation algorithms to adapt user-level Facebook
models to Twitter language. We find that domain-adapted and scaled social
media-based measurements of stress outperform sociodemographic variables (age,
gender, race, education, and income), against ground-truth survey-based stress
measurements, both at the user- and the county-level in the U.S. Twitter
language that scores higher in stress is also predictive of poorer health, less
access to facilities and lower socioeconomic status in counties. We conclude
with a discussion of the implications of using social media as a new tool for
monitoring stress levels of both individuals and counties.Comment: Accepted for publication in the proceedings of ICWSM 201
Continue playing: examining language change in discourse about binge-watching on Twitter
2021 Spring.Includes bibliographical references.Utilizing data from Twitter, this study characterized the change in the use of the term binge and its variants from 2009-2019. While there is a significant amount of literature looking at either language change or digital media, this research considered the two as inextricable forces on each other. To examine this and the proposed research questions, a textual analysis was conducted of tweets containing the word binge. Overall, the findings suggest that the December 2013 press release published by Netflix deeming binge-watching as the "new normal" in media consumption, may have pushed binge-watching into the mainstream lexicon. Language use about binge-watching was typically positively connotated in contrast to the negative connotations associated with binge-eating and binge-drinking. The connotative change appears to align with a widening of the definition of "watch" to account for the normality of binge-watching. As the use of binge-watching spread throughout the United States, the pattern of the geographic diffusion of binge-watching did not follow traditional theories of the diffusion of language change. The difference in spread may derive from the corporate origins of the term. Lastly, Twitter enabled and reinforced the spread of binge-watching through the facilitation of the social aspect of binge-watching. The findings of this study provide rich ground for future study
Recommended from our members
Sociolinguistically Driven Approaches for Just Natural Language Processing
Natural language processing (NLP) systems are now ubiquitous. Yet the benefits of these language technologies do not accrue evenly to all users, and indeed they can be harmful; NLP systems reproduce stereotypes, prevent speakers of non-standard language varieties from participating fully in public discourse, and re-inscribe historical patterns of linguistic stigmatization and discrimination. How harms arise in NLP systems, and who is harmed by them, can only be understood at the intersection of work on NLP, fairness and justice in machine learning, and the relationships between language and social justice. In this thesis, we propose to address two questions at this intersection: i) How can we conceptualize harms arising from NLP systems?, and ii) How can we quantify such harms?
We propose the following contributions. First, we contribute a model in order to collect the first large dataset of African American Language (AAL)-like social media text. We use the dataset to quantify the performance of two types of NLP systems, identifying disparities in model performance between Mainstream U.S. English (MUSE)- and AAL-like text. Turning to the landscape of bias in NLP more broadly, we then provide a critical survey of the emerging literature on bias in NLP and identify its limitations. Drawing on work across sociology, sociolinguistics, linguistic anthropology, social psychology, and education, we provide an account of the relationships between language and injustice, propose a taxonomy of harms arising from NLP systems grounded in those relationships, and propose a set of guiding research questions for work on bias in NLP. Finally, we adapt the measurement modeling framework from the quantitative social sciences to effectively evaluate approaches for quantifying bias in NLP systems. We conclude with a discussion of recent work on bias through the lens of style in NLP, raising a set of normative questions for future work
- …