311 research outputs found

    English verb regularization in books and tweets

    Get PDF
    The English language has evolved dramatically throughout its lifespan, to the extent that a modern speaker of Old English would be incomprehensible without translation. One concrete indicator of this process is the movement from irregular to regular (-ed) forms for the past tense of verbs. In this study we quantify the extent of verb regularization using two vastly disparate datasets: (1) Six years of published books scanned by Google (2003-2008), and (2) A decade of social media messages posted to Twitter (2008-2017). We find that the extent of verb regularization is greater on Twitter, taken as a whole, than in English Fiction books. Regularization is also greater for tweets geotagged in the United States relative to American English books, but the opposite is true for tweets geotagged in the United Kingdom relative to British English books. We also find interesting regional variations in regularization across counties in the United States. However, once differences in population are accounted for, we do not identify strong correlations with socio-demographic variables such as education or income

    Measuring Linguistic and Cultural Evolution Using Books and Tweets

    Get PDF
    Written language provides a snapshot of linguistic, cultural, and current events information for a given time period. Aggregating these snapshots by studying many texts over time reveals trends in the evolution of language, culture, and society. The ever-increasing amount of electronic text, both from the digitization of books and other paper documents to the increasing frequency with which electronic text is used as a means of communication, has given us an unprecedented opportunity to study these trends. In this dissertation, we use hundreds of thousands of books spanning two centuries scanned by Google, and over 100 billion messages, or ‘tweets’, posted to the social media platform, Twitter, over the course of a decade to study the English language, as well as study the evolution of culture and society as inferred from the changes in language. We begin by studying the current state of verb regularization and how this compares between the more formal writing of books and the more colloquial writing of tweets on Twitter. We find that the extent of verb regularization is greater on Twitter, taken as a whole, than in English Fiction books, and also for tweets geotagged in the United States relative to American English books, but the opposite is true for tweets geotagged in the United Kingdom relative to British English books. We also find interesting regional variations in regularization across counties in the United States. However, once differences in population are accounted for, we do not identify strong correlations with socio-demographic variables. Next, we study stretchable words, a fundamental aspect of spoken language that, until the advent of social media, was rarely observed within written language. We examine the frequency distributions of stretchable words and introduce two central parameters that capture their main characteristics of balance and stretch. We explore their dynamics by creating visual tools we call ‘balance plots’ and ‘spelling trees’. We also discuss how the tools and methods we develop could be used to study mistypings and misspellings, and may have further applications both within and beyond language. Finally, we take a closer look at the English Fiction n-gram dataset created by Google. We begin by explaining why using token counts as a proxy of word, or more generally, ‘n-gram’, importance is fundamentally flawed. We then devise a method to rebuild the Google Books corpus so that meaningful linguistic and cultural trends may be reliably discerned. We use book counts as the primary ranking for an n-gram and use subsampling to normalize across time to mitigate the extraneous results created by the underlying exponential increase in data volume over time. We also combine the subsampled data over a number of years as a method of smoothing. We then use these improved methods to study linguistic and cultural evolution across the last two centuries. We examine the dynamics of Zipf distributions for n-grams by measuring the churn of language reflected in the flux of n-grams across rank boundaries. Finally, we examine linguistic change using wordshift plots and a rank divergence measure with a tunable parameter to compare the language of two different time periods. Our results address several methodological shortcomings associated with the raw Google Books data, strengthening the potential for cultural inference by word changes

    Unsupervised Discovery of Gendered Language through Latent-Variable Modeling

    Get PDF
    Studying the ways in which language is gendered has long been an area of interest in sociolinguistics. Studies have explored, for example, the speech of male and female characters in film and the language used to describe male and female politicians. In this paper, we aim not to merely study this phenomenon qualitatively, but instead to quantify the degree to which the language used to describe men and women is different and, moreover, different in a positive or negative way. To that end, we introduce a generative latent-variable model that jointly represents adjective (or verb) choice, with its sentiment, given the natural gender of a head (or dependent) noun. We find that there are significant differences between descriptions of male and female nouns and that these differences align with common gender stereotypes: Positive adjectives used to describe women are more often related to their bodies than adjectives used to describe men.Comment: To appear in ACL 201

    Identifying Token-Level Dialectal Features in Social Media

    Get PDF
    Dialectal variation is present in many human languages and is attracting a growing interest in NLP. Most previous work concentrated on either (1) classifying dialectal varieties at the document or sentence level or (2) performing standard NLP tasks on dialectal data. In this paper, we propose the novel task of token-level dialectal feature prediction. We present a set of fine-grained annotation guidelines for Norwegian dialects, expand a corpus of dialectal tweets, and manually annotate them using the introduced guidelines. Furthermore, to evaluate the learnability of our task, we conduct labeling experiments using a collection of baselines, weakly supervised and supervised sequence labeling models. The obtained results show that, despite the difficulty of the task and the scarcity of training data, many dialectal features can be predicted with reasonably high accuracy.publishedVersio

    Statistical Laws Governing Fluctuations in Word Use from Word Birth to Word Death

    Get PDF
    We analyze the dynamic properties of 10^7 words recorded in English, Spanish and Hebrew over the period 1800--2008 in order to gain insight into the coevolution of language and culture. We report language independent patterns useful as benchmarks for theoretical models of language evolution. A significantly decreasing (increasing) trend in the birth (death) rate of words indicates a recent shift in the selection laws governing word use. For new words, we observe a peak in the growth-rate fluctuations around 40 years after introduction, consistent with the typical entry time into standard dictionaries and the human generational timescale. Pronounced changes in the dynamics of language during periods of war shows that word correlations, occurring across time and between words, are largely influenced by coevolutionary social, technological, and political factors. We quantify cultural memory by analyzing the long-term correlations in the use of individual words using detrended fluctuation analysis.Comment: Version 1: 31 pages, 17 figures, 3 tables. Version 2 is streamlined, eliminates substantial material and incorporates referee comments: 19 pages, 14 figures, 3 table

    Towards a Computational Model of Narrative on Social Media

    Get PDF
    This thesis describes a variety of approaches to developing a computational model of narrative on social media. Our goal is to use such a narrative model to identify efforts to manipulate public opinion on social media platforms like Twitter. We present a model in which narratives in a collection of tweets are represented as a graph. Elements from each tweet that are relevant to potential narratives are made into nodes in the graph; for this thesis, we populate graph nodes with tweets’ authors, hashtags, named entities (people, locations, organizations, etc.,), and moral foundations (central moral values framing the discussion). Two nodes are connected with an edge if the narrative elements they represent appear together in one or more tweets, with the edge weight corresponding to the number of tweets in which these elements coincide. We then explore multiple possible deep learning and graph analysis methods for identifying narratives in a collection of tweets, including clustering of language embeddings, topic modeling, community detection and random walks on our narrative graph, training a graph neural network to identify narratives in the graph, and training a graph embedding model to generate vector embeddings of graph nodes. While much work still remains to be done in this area, several of our techniques, especially the generation and clustering of graph embeddings, were able to identify groups of related and connected nodes that might form the beginnings of narratives. Further study of these or other techniques could allow for the reliable identification of full narratives and information operations on social media
    • …
    corecore