6,967 research outputs found

Understanding U.S. regional linguistic variation with Twitter data analysis

Author: Alice Kasakoff
Diansheng Guo
Jack Grieve
Yuan Huang
Publication venue: 'Elsevier BV'
Publication date: 01/09/2016

We analyze a Big Data set of geo-tagged tweets for a year (Oct. 2013–Oct. 2014) to understand regional linguistic variation in the U.S. Prior work on regional linguistic variation usually took a long time to collect data and focused on either rural or urban areas. Geo-tagged Twitter data offers an unprecedented database with rich linguistic representation of fine spatiotemporal resolution and continuity. From the one-year Twitter corpus, we extract lexical characteristics for Twitter users by summarizing the frequencies of a set of lexical alternations that each user has used. We spatially aggregate and smooth each lexical characteristic to derive county-based linguistic variables, from which orthogonal dimensions are extracted using principal component analysis (PCA). Finally, a regionalization method is used to discover hierarchical dialect regions using the PCA components. The regionalization results reveal interesting regional linguistic variations in the U.S. The discovered regions not only confirm past research findings in the literature but also provide new insights and a more detailed understanding of very recent linguistic patterns in the U.S.
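The first two steps of this abstract (per-user lexical alternation frequencies, then county-level aggregation) can be sketched roughly as below. This is only an illustration under stated assumptions: the function names, the two-variant alternation, and the plain per-county mean are hypothetical, and the spatial smoothing, PCA, and regionalization steps described in the abstract are not reproduced.

```python
from collections import defaultdict


def user_alternation_share(count_a, count_b):
    """Share of variant A among a user's uses of an A/B lexical alternation.

    Hypothetical measure: returns None when the user used neither variant.
    """
    total = count_a + count_b
    if total == 0:
        return None
    return count_a / total


def county_means(users):
    """Average the per-user shares within each county.

    `users` is a list of (county, count_a, count_b) tuples (assumed layout).
    A plain mean stands in for the aggregation and smoothing in the abstract.
    """
    shares = defaultdict(list)
    for county, count_a, count_b in users:
        share = user_alternation_share(count_a, count_b)
        if share is not None:  # skip users with no evidence for this alternation
            shares[county].append(share)
    return {county: sum(values) / len(values) for county, values in shares.items()}


# Made-up counts for one alternation in two hypothetical counties.
example_users = [("A", 3, 1), ("A", 1, 3), ("B", 0, 0), ("B", 2, 0)]
county_variable = county_means(example_users)
```

Each alternation would yield one such county-level variable; a table of these variables is the kind of input the abstract passes on to PCA.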

Crossref

University of Birmingham Research Portal

Aston Publications Explorer

Mapping the Americanization of English in Space and Time

Author: Gonçalves Bruno
Loureiro-Porto Lucía
Ramasco José J.
Sánchez David
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 25/05/2018

As global political preeminence gradually shifted from the United Kingdom to the United States, so did the capacity to culturally influence the rest of the world. In this work, we analyze how the world-wide varieties of written English are evolving. We study both the spatial and temporal variations of vocabulary and spelling of English using a large corpus of geolocated tweets and the Google Books datasets corresponding to books published in the US and the UK. The advantage of our approach is that we can address both standard written language (Google Books) and the more colloquial forms of microblogging messages (Twitter). We find that American English is the dominant form of English outside the UK and that its influence is felt even within the UK borders. Finally, we analyze how this trend has evolved over time and the impact that some cultural events have had in shaping it. Comment: 16 pages, 6 figures, 2 tables. Published version

arXiv.org e-Print Archive

Directory of Open Access Journals

Digital.CSIC

FigShare

Dialectometric analysis of language variation in Twitter

Author: Donoso Gonzalo
Sanchez David
Publication date: 01/01/2017

In the last few years, microblogging platforms such as Twitter have given rise to a deluge of textual data that can be used for the analysis of informal communication between millions of individuals. In this work, we propose an information-theoretic approach to geographic language variation using a corpus based on Twitter. We test our models with tens of concepts and their associated keywords detected in Spanish tweets geolocated in Spain. We employ dialectometric measures (cosine similarity and Jensen-Shannon divergence) to quantify the linguistic distance on the lexical level between cells created in a uniform grid over the map. This can be done for a single concept or, in the general case, by averaging over the considered variants. The latter permits an analysis of the dialects that naturally emerge from the data. Interestingly, our results reveal the existence of two dialect macrovarieties. The first group includes region-specific speech spoken in small towns and rural areas, whereas the second cluster encompasses cities that tend to use a more uniform variety. Since the results obtained with the two different metrics qualitatively agree, our work suggests that social media corpora can be efficiently used for dialectometric analyses. Comment: 10 pages, 7 figures, 1 table. Accepted to VarDial 201
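The two measures named in this abstract, cosine similarity and Jensen-Shannon divergence, can be illustrated for a pair of grid cells as follows. This is a minimal sketch, not the authors' implementation: the function names, the base-2 logarithm, and the example counts are assumptions made for illustration.

```python
import math


def normalize(counts):
    """Turn raw keyword counts for one cell into a probability distribution."""
    total = sum(counts)
    if total == 0:
        raise ValueError("cell has no observations for this concept")
    return [c / total for c in counts]


def cosine_similarity(p, q):
    """Cosine of the angle between two variant-frequency vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def kl_divergence(p, m):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, m) if a > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric divergence between two distributions, in [0, 1] for base 2."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Made-up counts of three keywords for one concept in two hypothetical cells.
cell_1 = normalize([8, 1, 1])
cell_2 = normalize([2, 6, 2])
similarity = cosine_similarity(cell_1, cell_2)
distance = jensen_shannon_divergence(cell_1, cell_2)
```

Averaging such per-concept distances over all concepts would give the general-case distance the abstract mentions.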

arXiv.org e-Print Archive

Crossref

Digital.CSIC

Continuous Representation of Location for Geolocation and Lexical Dialectology using Mixture Density Networks

Author: Baldwin Timothy
Cohn Trevor
Rahimi Afshin
Publication date: 01/01/2017

We propose a method for embedding two-dimensional locations in a continuous vector space using a neural network-based model incorporating mixtures of Gaussian distributions, presenting two model variants for text-based geolocation and lexical dialectology. Evaluated over Twitter data, the proposed model outperforms conventional regression-based geolocation and provides a better estimate of uncertainty. We also show the effectiveness of the representation for predicting words from location in lexical dialectology, and evaluate it using the DARE dataset. Comment: Conference on Empirical Methods in Natural Language Processing (EMNLP 2017) September 2017, Copenhagen, Denmark
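A mixture density network outputs the parameters of a Gaussian mixture rather than a single point. The sketch below only evaluates such a mixture at a two-dimensional location. It assumes isotropic components and hand-picked parameters, both of which are simplifications for illustration; the network that would produce the parameters from text is not shown, and the function names are hypothetical.

```python
import math


def isotropic_gaussian_pdf(point, mean, sigma):
    """Density of a 2-D Gaussian with covariance sigma^2 * I at `point`."""
    dx = point[0] - mean[0]
    dy = point[1] - mean[1]
    squared_distance = dx * dx + dy * dy
    return math.exp(-squared_distance / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)


def mixture_density(point, weights, means, sigmas):
    """Weighted sum of component densities; weights must sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return sum(
        w * isotropic_gaussian_pdf(point, m, s)
        for w, m, s in zip(weights, means, sigmas)
    )


# Made-up mixture: two equally weighted components on either side of the origin.
density_at_origin = mixture_density(
    (0.0, 0.0),
    weights=[0.5, 0.5],
    means=[(-1.0, 0.0), (1.0, 0.0)],
    sigmas=[1.0, 1.0],
)
```

Because the output is a full density, the spread of the components gives a measure of uncertainty, which a single regressed coordinate cannot provide.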

arXiv.org e-Print Archive

Crossref

University of Queensland eSpace

Understanding and Measuring Psychological Stress using Social Media

Author: Buffone Anneke
Eichstaedt Johannes
Guntuku Sharath Chandra
Jaidka Kokil
Ungar Lyle
Publication date: 04/04/2019

A body of literature has demonstrated that users' mental health conditions, such as depression and anxiety, can be predicted from their social media language. There is still a gap in the scientific understanding of how psychological stress is expressed on social media. Stress is one of the primary underlying causes and correlates of chronic physical illnesses and mental health conditions. In this paper, we explore the language of psychological stress with a dataset of 601 social media users, who answered the Perceived Stress Scale questionnaire and also consented to share their Facebook and Twitter data. Firstly, we find that stressed users post about exhaustion, losing control, increased self-focus and physical pain, whereas users who are not stressed post about breakfast, family-time, and travel. Secondly, we find that Facebook language is more predictive of stress than Twitter language. Thirdly, we demonstrate how the language-based models thus developed can be adapted and scaled to measure county-level trends. Since county-level language is easily available on Twitter using the Streaming API, we explore multiple domain adaptation algorithms to adapt user-level Facebook models to Twitter language. We find that domain-adapted and scaled social media-based measurements of stress outperform sociodemographic variables (age, gender, race, education, and income), against ground-truth survey-based stress measurements, both at the user- and the county-level in the U.S. Twitter language that scores higher in stress is also predictive of poorer health, less access to facilities and lower socioeconomic status in counties. We conclude with a discussion of the implications of using social media as a new tool for monitoring stress levels of both individuals and counties. Comment: Accepted for publication in the proceedings of ICWSM 201

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

Understanding US regional linguistic variation with Twitter data analysis

Author: Grieve Jack
Guo Diansheng
Huang Yuan
Kasakoff Alice

University of Birmingham Research Portal

Continue playing: examining language change in discourse about binge-watching on Twitter

Author: Peterman Katharyn Alison Marjorie
Publication venue: Colorado State University. Libraries
Publication date: 01/01/2021

2021 Spring. Includes bibliographical references. Utilizing data from Twitter, this study characterized the change in the use of the term binge and its variants from 2009-2019. While there is a significant amount of literature looking at either language change or digital media, this research considered the two as inextricable forces on each other. To examine this and the proposed research questions, a textual analysis was conducted of tweets containing the word binge. Overall, the findings suggest that the December 2013 press release published by Netflix deeming binge-watching the "new normal" in media consumption may have pushed binge-watching into the mainstream lexicon. Language use about binge-watching typically carried positive connotations, in contrast to the negative connotations associated with binge-eating and binge-drinking. The connotative change appears to align with a widening of the definition of "watch" to account for the normality of binge-watching. As the use of binge-watching spread throughout the United States, the pattern of its geographic diffusion did not follow traditional theories of the diffusion of language change. The difference in spread may derive from the corporate origins of the term. Lastly, Twitter enabled and reinforced the spread of binge-watching by facilitating its social aspect. The findings of this study provide rich ground for future study.

Mountain Scholar (Digital Collections of Colorado and Wyoming)

Sociolinguistically Driven Approaches for Just Natural Language Processing

Author: Blodgett Su Lin
Publication venue: ScholarWorks@UMass Amherst
Publication date: 06/04/2021

Natural language processing (NLP) systems are now ubiquitous. Yet the benefits of these language technologies do not accrue evenly to all users, and indeed they can be harmful; NLP systems reproduce stereotypes, prevent speakers of non-standard language varieties from participating fully in public discourse, and re-inscribe historical patterns of linguistic stigmatization and discrimination. How harms arise in NLP systems, and who is harmed by them, can only be understood at the intersection of work on NLP, fairness and justice in machine learning, and the relationships between language and social justice. In this thesis, we propose to address two questions at this intersection: i) How can we conceptualize harms arising from NLP systems? and ii) How can we quantify such harms? We propose the following contributions. First, we contribute a model in order to collect the first large dataset of African American Language (AAL)-like social media text. We use the dataset to quantify the performance of two types of NLP systems, identifying disparities in model performance between Mainstream U.S. English (MUSE)- and AAL-like text. Turning to the landscape of bias in NLP more broadly, we then provide a critical survey of the emerging literature on bias in NLP and identify its limitations. Drawing on work across sociology, sociolinguistics, linguistic anthropology, social psychology, and education, we provide an account of the relationships between language and injustice, propose a taxonomy of harms arising from NLP systems grounded in those relationships, and propose a set of guiding research questions for work on bias in NLP. Finally, we adapt the measurement modeling framework from the quantitative social sciences to effectively evaluate approaches for quantifying bias in NLP systems. We conclude with a discussion of recent work on bias through the lens of style in NLP, raising a set of normative questions for future work.

ScholarWorks@UMass Amherst