Developing natural language processing instruments to study sociotechnical systems
Identifying temporal linguistic patterns and tracing social amplification across communities has always been vital to understanding modern sociotechnical systems. Now, well into the age of information technology, the growing digitization of text archives powered by machine learning systems has enabled an enormous number of interdisciplinary studies to examine the coevolution of language and culture. However, most research in that domain investigates formal textual records, such as books and newspapers. In this work, I argue that the study of conversational text derived from social media is just as important. I present four case studies to identify and investigate societal developments in longitudinal social media streams with high temporal resolution spanning over 100 languages. These case studies show how everyday conversations on social media encode a unique perspective that is often complementary to observations derived from more formal texts. This unique perspective improves our understanding of modern sociotechnical systems and enables future research in computational linguistics, social science, and behavioral science.
The sociospatial factors of death: Analyzing effects of geospatially-distributed variables in a Bayesian mortality model for Hong Kong
Human mortality is in part a function of multiple socioeconomic factors that
differ both spatially and temporally. Adjusting for other covariates, the human
lifespan is positively associated with household wealth. However, the extent to
which mortality in a geographical region is a function of socioeconomic factors
in both that region and its neighbors is unclear. There is also little
information on the temporal components of this relationship. Using the
districts of Hong Kong over multiple census years as a case study, we
demonstrate that there are differences in how wealth indicator variables are
associated with longevity in (a) areas that are affluent but neighbored by
socially deprived districts versus (b) wealthy areas surrounded by similarly
wealthy districts. We also show that the inclusion of spatially-distributed
variables reduces uncertainty in mortality rate predictions in each census year
when compared with a baseline model. Our results suggest that geographic
mortality models should incorporate nonlocal information (e.g., spatial
neighbors) to lower the variance of their mortality estimates, and point to a
more in-depth analysis of sociospatial spillover effects on mortality rates.
Comment: 26 pages (15 main, 11 appendix), 22 figures (6 main, 11 appendix), 2 tables
The shocklet transform: a decomposition method for the identification of local, mechanism-driven dynamics in sociotechnical time series
We introduce a qualitative, shape-based, timescale-independent time-domain transform used to extract local dynamics from sociotechnical time series, termed the Discrete Shocklet Transform (DST), and an associated similarity search routine, the Shocklet Transform And Ranking (STAR) algorithm, that indicates time windows during which panels of time series display qualitatively similar anomalous behavior. After distinguishing our algorithms from other methods used in anomaly detection and time series similarity search, such as the matrix profile, seasonal-hybrid ESD, and discrete wavelet transform-based procedures, we demonstrate the DST’s ability to identify mechanism-driven dynamics at a wide range of timescales and its relative insensitivity to functional parameterization. As an application, we analyze a sociotechnical data source (usage frequencies for a subset of words on Twitter) and highlight our algorithms’ utility by using them to extract both a typology of mechanistic local dynamics and a data-driven narrative of socially important events as perceived by English-language Twitter.
Hurricanes and hashtags: Characterizing online collective attention for natural disasters
We study collective attention paid towards hurricanes through the lens of
n-grams on Twitter, a social media platform with global reach. Using
hurricane name mentions as a proxy for awareness, we find that the exogenous
temporal dynamics are remarkably similar across storms, but that overall
collective attention varies widely even among storms causing comparable deaths
and damage. We construct 'hurricane attention maps' and observe that hurricanes
causing deaths on (or economic damage to) the continental United States
generate substantially more attention in English-language tweets than those
that do not. We find that a hurricane's Saffir-Simpson wind scale category
assignment is strongly associated with the amount of attention it receives.
Higher-category storms receive larger proportional increases in attention per
proportional increase in number of deaths or dollars of damage than
lower-category storms. The most damaging and deadly storms of the 2010s, Hurricanes
Harvey and Maria, generated the most attention and were remembered the longest,
respectively. On average, a category 5 storm receives 4.6 times more attention
than a category 1 storm causing the same number of deaths and economic damage.
Comment: 31 pages (14 main, 17 Supplemental), 19 figures (5 main, 14 appendix)
Storywrangler: A massive exploratorium for sociolinguistic, cultural, socioeconomic, and political timelines using Twitter
In real time, Twitter strongly imprints world events, popular culture, and the day-to-day; Twitter records an ever-growing compendium of language use and change; and Twitter has been shown to enable certain kinds of prediction. Vitally, and absent from many standard corpora such as books and news archives, Twitter also encodes popularity and spreading through retweets. Here, we describe Storywrangler, an ongoing, day-scale curation of over 100 billion tweets containing around 1 trillion 1-grams from 2008 to 2020. For each day, we break tweets into 1-, 2-, and 3-grams across 150+ languages, record usage frequencies, and generate Zipf distributions. We make the data set available through an interactive time series viewer, and as downloadable time series and daily distributions. We showcase a few examples of the many possible avenues of study we aim to enable, including how social amplification can be visualized through ‘contagiograms’.
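The per-day processing described in the abstract above (splitting tweets into n-grams, recording usage frequencies, and ranking them into a Zipf distribution) can be illustrated with a minimal sketch. This is not the Storywrangler implementation: the function names `ngrams`, `daily_ngram_counts`, and `rank_frequency` are hypothetical, and lowercasing plus whitespace tokenization is a simplifying assumption that would not hold across 150+ languages.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def daily_ngram_counts(tweets, n):
    """Count n-grams over one day's tweets.

    Assumption: lowercasing and whitespace splitting stand in for a real,
    language-aware tokenizer.
    """
    counts = Counter()
    for text in tweets:
        tokens = text.lower().split()
        counts.update(ngrams(tokens, n))
    return counts


def rank_frequency(counts):
    """Build a rank-ordered (Zipf-style) table: (rank, n-gram, count, relative frequency).

    Ties in count are broken alphabetically so the ordering is deterministic.
    """
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        (rank, gram, count, count / total)
        for rank, (gram, count) in enumerate(ranked, start=1)
    ]


if __name__ == "__main__":
    # Toy input standing in for one day of tweets.
    day = ["the storm is here", "the storm passed"]
    for n in (1, 2, 3):
        print(n, rank_frequency(daily_ngram_counts(day, n))[:3])
```

Repeating this for each day and each n would yield one rank-frequency table per day, from which a time series for any single n-gram could be read off by looking up its daily rank or relative frequency.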