Confounds and Consequences in Geotagged Twitter Data
Twitter is often used in quantitative studies that identify
geographically-preferred topics, writing styles, and entities. These studies
rely on either GPS coordinates attached to individual messages, or on the
user-supplied location field in each profile. In this paper, we compare these
data acquisition techniques and quantify the biases that they introduce; we
also measure their effects on linguistic analysis and text-based geolocation.
GPS-tagging and self-reported locations yield measurably different corpora, and
these linguistic differences are partially attributable to differences in
dataset composition by age and gender. Using a latent variable model to induce
age and gender, we show how these demographic variables interact with geography
to affect language use. We also show that the accuracy of text-based
geolocation varies with population demographics, giving the best results for
men above the age of 40. Comment: final version for EMNLP 201
Generalisation in named entity recognition: A quantitative analysis
Named Entity Recognition (NER) is a key NLP task, which is all the more challenging on Web and user-generated content with their diverse and continuously changing language. This paper aims to quantify how this diversity impacts state-of-the-art NER methods, by measuring named entity (NE) and context variability, feature sparsity, and their effects on precision and recall. In particular, our findings indicate that NER approaches struggle to generalise in diverse genres with limited training data. Unseen NEs play an especially important role: they have a higher incidence in diverse genres such as social media than in more regular genres such as newswire. Coupled with a higher incidence of unseen features more generally and the lack of large training corpora, this leads to significantly lower F1 scores for diverse genres as compared to more regular ones. We also find that leading systems rely heavily on surface forms found in training data and have problems generalising beyond these, and we offer explanations for this observation.
Audience-Modulated Variation in Online Social Media
Stylistic variation in online social media writing is well attested: for example, geographical analysis of the social media service Twitter has replicated isoglosses for many known lexical variables from speech, while simultaneously revealing a wealth of new geographical lexical variables, including emoticons, phonetic spellings, and phrasal abbreviations. However, less is known about the social role of variation in online writing. This article examines online writing variation in the context of audience design, focusing on affordances offered by Twitter that allow users to modulate a message's intended audience. We find that the frequency of nonstandard lexical variables is inversely related to the size of the intended audience: as writers target smaller audiences, the frequency of lexical variables increases. In addition, these variables are more often used in messages that are addressed to individuals who are known to be geographically local. This phenomenon holds not only for geographically differentiated lexical variables, but also for nonstandard variables that are widely used throughout the United States. These findings suggest that users of social media are attuned to both the nature of their audience and the social meaning of lexical variation and that they customize their self-presentation accordingly.
More emojis, less :) The competition for paralinguistic function in microblog writing
Many non-standard elements of ‘netspeak’ writing can be viewed as efforts to replicate the linguistic role played by nonverbal modalities in speech, conveying contextual information such as affect and interpersonal stance. Recently, a new non-standard communicative tool has emerged in online writing: emojis. These Unicode characters contain a standardized set of pictographs, some of which are visually similar to well-known emoticons. Do emojis play the same linguistic role as emoticons and other ASCII-based writing innovations? If so, might the introduction of emojis eventually displace the earlier, user-created forms of contextual expression? Using a matching approach to causal statistical inference, we show that as social media users adopt emojis, they dramatically reduce their use of emoticons, suggesting that these linguistic resources compete for the same communicative function. Furthermore, we demonstrate that the adoption of emojis leads to a corresponding increase in the use of standard spellings, suggesting that all forms of non-standard writing are losing out in a competition with emojis. Finally, we identify specific textual features that make some emoticons especially likely to be replaced by emojis.
Mind Your POV
Wikipedia has a strong norm of writing in a 'neutral point of view' (NPOV).
Articles that violate this norm are tagged, and editors are encouraged to make
corrections. But the impact of this tagging system has not been quantitatively
measured. Does NPOV tagging help articles to converge to the desired style? Do
NPOV corrections encourage editors to adopt this style? We study these
questions using a corpus of NPOV-tagged articles and a set of lexicons
associated with biased language. An interrupted time series analysis shows that
after an article is tagged for NPOV, there is a significant decrease in biased
language in the article, as measured by several lexicons. However, for
individual editors, NPOV corrections and talk page discussions yield no
significant change in the usage of words in most of these lexicons, including
Wikipedia's own list of 'words to watch.' This suggests that NPOV tagging and
discussion does improve content, but has less success enculturating editors to
the site's linguistic norms. Comment: ACM Conference on Computer-Supported Cooperative Work and Social
Computing (CSCW), 201
