7 research outputs found

    Confounds and Consequences in Geotagged Twitter Data

    Twitter is often used in quantitative studies that identify geographically preferred topics, writing styles, and entities. These studies rely on either GPS coordinates attached to individual messages, or on the user-supplied location field in each profile. In this paper, we compare these data acquisition techniques and quantify the biases that they introduce; we also measure their effects on linguistic analysis and text-based geolocation. GPS-tagging and self-reported locations yield measurably different corpora, and these linguistic differences are partially attributable to differences in dataset composition by age and gender. Using a latent variable model to induce age and gender, we show how these demographic variables interact with geography to affect language use. We also show that the accuracy of text-based geolocation varies with population demographics, giving the best results for men above the age of 40. Comment: final version for EMNLP 201

    Computational approaches to understanding stylistic variation in online writing

    Language use in online interactions varies from community to community, from individual to individual, and even for individuals in different contexts. While prior work has identified these differences, far less is understood about why these differences have arisen in online writing. My dissertation focuses on this why question. The reasons for linguistic diversity in online writing could be multifold. As more and more interpersonal social interactions are conducted through technology-mediated channels, there is an increasing need to express multiple social meanings in varied social situations through linguistic means. In the absence of non-verbal cues, the technology-mediated channels provide several affordances to conduct interpersonal interactions. How do factors that are unique to online writing, such as the need to convey varied social meanings and the affordances in technology-mediated channels, shape online writing? My dissertation investigates this interplay through a series of large-scale computational studies of linguistic style variation in online writing. Using unsupervised methods and causal statistical analysis, I have investigated the social meaning of varied non-standard language usage in social media and the effects of new technological affordances in online social platforms on individuals' writing style. To quantitatively study community-level stylistic variation at scale, I have developed a multi-dimensional style lexicon using unsupervised techniques and used it to study style-shifting in online multi-communities. Further, I have investigated how writing style norm enforcement in online platforms affects stylistic variation in online writing. My dissertation will advance our understanding of how individuals utilize the affordances in online social platforms and shift style to achieve varied social goals in online interpersonal interactions. 
    Understanding the social dimensions of linguistic style variation in online writing has important consequences for the design of language technology and social computing systems, and beyond. Ph.D

    More emojis, less :) The competition for paralinguistic function in microblog writing

    Many non-standard elements of ‘netspeak’ writing can be viewed as efforts to replicate the linguistic role played by nonverbal modalities in speech, conveying contextual information such as affect and interpersonal stance. Recently, a new non-standard communicative tool has emerged in online writing: emojis. These Unicode characters contain a standardized set of pictographs, some of which are visually similar to well-known emoticons. Do emojis play the same linguistic role as emoticons and other ASCII-based writing innovations? If so, might the introduction of emojis eventually displace the earlier, user-created forms of contextual expression? Using a matching approach to causal statistical inference, we show that as social media users adopt emojis, they dramatically reduce their use of emoticons, suggesting that these linguistic resources compete for the same communicative function. Furthermore, we demonstrate that the adoption of emojis leads to a corresponding increase in the use of standard spellings, suggesting that all forms of non-standard writing are losing out in a competition with emojis. Finally, we identify specific textual features that make some emoticons especially likely to be replaced by emojis.

    Towards Influenza Surveillance in Military Populations Using Novel and Traditional Sources

    U.S. military influenza surveillance utilizes electronic reporting of clinical diagnoses to monitor the health of military personnel and detect naturally occurring and bioterrorism-related epidemics. While accurate, these systems lack timeliness. More recently, researchers have used novel data sources to detect influenza in real time and capture non-traditional populations. With data-mining techniques, military social media users are identified and influenza-related discourse is integrated along with medical data into a comprehensive disease model. By leveraging heterogeneous data streams and developing dashboard biosurveillance analytics, the researchers hope to increase the speed at which outbreaks are detected and provide accurate disease forecasting among military personnel.

    Social Data: Biases, Methodological Pitfalls, and Ethical Boundaries
