38 research outputs found

Confounds and Consequences in Geotagged Twitter Data

Author: Jacob Eisenstein
Umashanthi Pavalanathan
Publication date: 01/01/2015

Twitter is often used in quantitative studies that identify geographically-preferred topics, writing styles, and entities. These studies rely on either GPS coordinates attached to individual messages, or on the user-supplied location field in each profile. In this paper, we compare these data acquisition techniques and quantify the biases that they introduce; we also measure their effects on linguistic analysis and text-based geolocation. GPS-tagging and self-reported locations yield measurably different corpora, and these linguistic differences are partially attributable to differences in dataset composition by age and gender. Using a latent variable model to induce age and gender, we show how these demographic variables interact with geography to affect language use. We also show that the accuracy of text-based geolocation varies with population demographics, giving the best results for men above the age of 40. Comment: final version for EMNLP 201

arXiv.org e-Print Archive

CiteSeerX

Crossref

Generalisation in named entity recognition: A quantitative analysis

Author: Isabelle Augenstein
Kalina Bontcheva
Leon Derczynski
Publication venue: Elsevier BV
Publication date: 15/02/2017

Named Entity Recognition (NER) is a key NLP task, which is all the more challenging on Web and user-generated content with their diverse and continuously changing language. This paper aims to quantify how this diversity impacts state-of-the-art NER methods, by measuring named entity (NE) and context variability, feature sparsity, and their effects on precision and recall. In particular, our findings indicate that NER approaches struggle to generalise in diverse genres with limited training data. Unseen NEs, in particular, play an important role; they have a higher incidence in diverse genres such as social media than in more regular genres such as newswire. Coupled with a higher incidence of unseen features more generally and the lack of large training corpora, this leads to significantly lower F1 scores for diverse genres as compared to more regular ones. We also find that leading systems rely heavily on surface forms found in training data, having problems generalising beyond these, and offer explanations for this observation.

arXiv.org e-Print Archive

Crossref

UCL Discovery

White Rose Research Online

Audience-Modulated Variation in Online Social Media

Author: Jacob Eisenstein
Umashanthi Pavalanathan
Publication venue: Duke University Press
Publication date: 01/05/2015

Stylistic variation in online social media writing is well attested: for example, geographical analysis of the social media service Twitter has replicated isoglosses for many known lexical variables from speech, while simultaneously revealing a wealth of new geographical lexical variables, including emoticons, phonetic spellings, and phrasal abbreviations. However, less is known about the social role of variation in online writing. This article examines online writing variation in the context of audience design, focusing on affordances offered by Twitter that allow users to modulate a message's intended audience. We find that the frequency of nonstandard lexical variables is inversely related to the size of the intended audience: as writers target smaller audiences, the frequency of lexical variables increases. In addition, these variables are more often used in messages that are addressed to individuals who are known to be geographically local. This phenomenon holds not only for geographically differentiated lexical variables, but also for nonstandard variables that are widely used throughout the United States. These findings suggest that users of social media are attuned to both the nature of their audience and the social meaning of lexical variation and that they customize their self-presentation accordingly.

Crossref

More emojis, less :) The competition for paralinguistic function in microblog writing

Author: Jacob Eisenstein
Umashanthi Pavalanathan
Publication venue: University of Illinois Libraries
Publication date: 20/10/2016

Many non-standard elements of ‘netspeak’ writing can be viewed as efforts to replicate the linguistic role played by nonverbal modalities in speech, conveying contextual information such as affect and interpersonal stance. Recently, a new non-standard communicative tool has emerged in online writing: emojis. These Unicode characters contain a standardized set of pictographs, some of which are visually similar to well-known emoticons. Do emojis play the same linguistic role as emoticons and other ASCII-based writing innovations? If so, might the introduction of emojis eventually displace the earlier, user-created forms of contextual expression? Using a matching approach to causal statistical inference, we show that as social media users adopt emojis, they dramatically reduce their use of emoticons, suggesting that these linguistic resources compete for the same communicative function. Furthermore, we demonstrate that the adoption of emojis leads to a corresponding increase in the use of standard spellings, suggesting that all forms of non-standard writing are losing out in a competition with emojis. Finally, we identify specific textual features that make some emoticons especially likely to be replaced by emojis.

University of Illinois at Chicago: Journals@UIC

Crossref

Identity Management and Mental Health Discourse in Social Media

Author: Munmun De Choudhury
Umashanthi Pavalanathan
Publication venue: ACM
Publication date: 18/05/2015

Crossref

More emojis, less :) The competition for paralinguistic function in microblog writing

Author: Jacob Eisenstein
Umashanthi Pavalanathan
Publication venue: University of Illinois at Chicago University Library
Publication date: 20/10/2016

University of Illinois at Chicago: Journals@UIC

A Multidimensional Lexicon for Interpersonal Stancetaking

Author: Jacob Eisenstein
Jim Fitzpatrick
Scott Kiesling
Umashanthi Pavalanathan
Publication venue: Association for Computational Linguistics
Publication date: 01/01/2017

Crossref

Modeling heterogeneous data resources for social-ecological research

Author: Beth Plale
Miao Chen
Scott Jensen
Umashanthi Pavalanathan
Publication venue: ACM
Publication date: 22/07/2013

Crossref

Mind Your POV

Author: Jacob Eisenstein
Umashanthi Pavalanathan
Xiaochuang Han
Publication venue: Association for Computing Machinery (ACM)
Publication date: 18/09/2018

Wikipedia has a strong norm of writing in a 'neutral point of view' (NPOV). Articles that violate this norm are tagged, and editors are encouraged to make corrections. But the impact of this tagging system has not been quantitatively measured. Does NPOV tagging help articles to converge to the desired style? Do NPOV corrections encourage editors to adopt this style? We study these questions using a corpus of NPOV-tagged articles and a set of lexicons associated with biased language. An interrupted time series analysis shows that after an article is tagged for NPOV, there is a significant decrease in biased language in the article, as measured by several lexicons. However, for individual editors, NPOV corrections and talk page discussions yield no significant change in the usage of words in most of these lexicons, including Wikipedia's own list of 'words to watch.' This suggests that NPOV tagging and discussion does improve content, but has less success enculturating editors to the site's linguistic norms. Comment: ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW), 201

arXiv.org e-Print Archive

Crossref