    Learning Representations of Social Media Users

    User representations are routinely used in recommendation systems by platform developers, in targeted advertising by marketers, and by public policy researchers to gauge public opinion across demographic groups. Computer scientists consider the problem of inferring user representations more abstractly: how does one extract a stable user representation - effective for many downstream tasks - from a medium as noisy and complicated as social media? The quality of a user representation is ultimately task-dependent (e.g. does it improve classifier performance, or make a recommendation system more accurate?), but there are proxies that are less sensitive to the specific task. Is the representation predictive of latent properties such as a person's demographic features, socioeconomic class, or mental health state? Is it predictive of the user's future behavior? In this thesis, we begin by showing how user representations can be learned from multiple types of user behavior on social media. We apply several extensions of generalized canonical correlation analysis to learn these representations and evaluate them on three tasks: predicting future hashtag mentions, friending behavior, and demographic features. We then show how user features can be employed as distant supervision to improve topic model fit. Finally, we show how user features can be integrated into existing classifiers, and improve them, within the multitask learning framework. We treat user representations - ground truth gender and mental health features - as auxiliary tasks to improve mental health state prediction. We also use distributed user representations learned in the first chapter to improve tweet-level stance classifiers, showing that distant user information can inform classification tasks at the granularity of a single message. Comment: PhD thesis

    Is the web being used to speak our language?

    This paper presents results from extensive surveys of the usage of the Maori language on the World Wide Web (WWW, Web) conducted in 1998 and 2002. Issues both supportive of and detrimental to the use and publication of indigenous languages on the WWW will be highlighted. Specifically: how is the WWW being used to articulate the Maori language?

    Social TV: Linking TV Content to Buzz and Sales

    “Social TV” is a term that broadly describes the online social interactions occurring between viewers while watching television. In this paper, we show that TV networks can derive value from social media content placed in shows because it leads to increased word of mouth via online posts, and it correlates highly with TV-show-related sales. In short, we show that TV event triggers change the online behavior of viewers. We first show that using social media content on the televised American reality singing competition, The Voice, led to increased social media engagement during the TV broadcast. We then illustrate that social media buzz about a contestant after a performance is highly correlated with song sales from that contestant’s performance. We believe this to be the first study linking TV content to buzz and sales in real time.

    WHEN DOES SOCIAL NETWORK-BASED PREDICTION WORK? A LARGE SCALE ANALYSIS OF BRAND AND TV AUDIENCE ENGAGEMENT BY TWITTER USERS

    Social network-based prediction, more specifically targeting friends and contacts of existing customers, has proven successful in various domains like retail banking, telecommunications, and online advertising. However, little is known about the product categories and brands for which social network-based marketing is especially effective at predicting brand engagement, both in absolute terms and compared to demographic targeting or collaborative filtering. In this work, we compare the performance of a social network-based recommendation engine against a product network-based recommendation engine of the kind used in collaborative filtering. We do so over 700 brands and 223,000 consumers in a novel data set collected from Twitter. We compare the performance of the two approaches by product and user features. Preliminary results indicate that the variance in performance within and across methods is related to differences in brand and user popularity as well as brand audience

    An Improved Method for 21cm Foreground Removal

    21 cm tomography is expected to be difficult in part because of serious foreground contamination. Previous studies have found that line-of-sight approaches are capable of cleaning foregrounds to an acceptable level on large spatial scales, but not on small spatial scales. In this paper, we introduce a Fourier-space formalism for describing the line-of-sight methods, and use it to introduce an improved method for 21 cm foreground cleaning. Heuristically, this method involves fitting foregrounds in Fourier space using weighted polynomial fits, with each pixel weighted according to its information content. We show that the new method reproduces the old one on large angular scales, and gives marked improvements on small scales at essentially no extra computational cost. Comment: 6 pages, 5 figures, replaced to match accepted MNRAS version
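
The weighted polynomial fitting mentioned in this abstract can be illustrated with a minimal sketch. This is not the paper's Fourier-space method: it fits a single line of sight across frequency, and it uses inverse-noise weights as a stand-in for "information content". The function name, the choice of log-frequency as the fit variable, and the polynomial degree are all assumptions made for illustration.

```python
import numpy as np


def remove_smooth_foreground(freqs_mhz, spectrum, sigma, degree=2):
    """Subtract a weighted low-order polynomial fit from one line of sight.

    Hypothetical helper, not taken from the paper. Foregrounds are assumed
    to be spectrally smooth, so a low-order polynomial in log-frequency
    should absorb most of them and leave the residual signal behind.
    """
    freqs_mhz = np.asarray(freqs_mhz, dtype=float)
    spectrum = np.asarray(spectrum, dtype=float)
    sigma = np.asarray(sigma, dtype=float)

    # Fit variable: log-frequency, centred to keep the fit well conditioned.
    x = np.log(freqs_mhz / freqs_mhz.mean())

    # np.polyfit expects weights of 1/sigma (not 1/sigma**2) for
    # Gaussian uncertainties, so noisier channels count for less.
    coeffs = np.polyfit(x, spectrum, deg=degree, w=1.0 / sigma)

    foreground_model = np.polyval(coeffs, x)
    return spectrum - foreground_model


# Toy usage: a foreground that is exactly quadratic in log-frequency
# should be removed almost perfectly, whatever the weights are.
freqs = np.linspace(100.0, 200.0, 50)
x_demo = np.log(freqs / freqs.mean())
foreground = 100.0 - 30.0 * x_demo + 5.0 * x_demo**2
noise_level = np.linspace(1.0, 2.0, freqs.size)  # made-up per-channel noise
residual = remove_smooth_foreground(freqs, foreground, noise_level)
```

In the setting the abstract describes, a fit of this kind would be applied per Fourier mode rather than per sky pixel, with the weights chosen by the paper's own criterion.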

    A system for de-identifying medical message board text

    There are millions of public posts to medical message boards by users seeking support and information on a wide range of medical conditions. It has been shown that these posts can be used to gain a greater understanding of patients’ experiences and concerns. As investigators continue to explore large corpora of medical discussion board data for research purposes, protecting the privacy of the members of these online communities becomes an important challenge that needs to be met. Extant entity recognition methods used for more structured text are not sufficient because message posts present additional challenges: the posts contain many typographical errors, a larger variety of possible names, terms and abbreviations specific to Internet posts or a particular message board, and mentions of the authors’ personal lives. The main contribution of this paper is a system to de-identify the authors of message board posts automatically, taking into account the aforementioned challenges. We demonstrate our system on two different message board corpora, one on breast cancer and another on arthritis. We show that our approach significantly outperforms other publicly available named entity recognition and de-identification systems, which have been tuned for more structured text like operative reports, pathology reports, discharge summaries, or newswire.

    Identifying Potential Adverse Effects Using the Web: A New Approach to Medical Hypothesis Generation

    Medical message boards are online resources where users with a particular condition exchange information, some of which they might not otherwise share with medical providers. Many of these boards contain a large number of posts, with patient opinions and experiences that would be potentially useful to clinicians and researchers. We present an approach that is able to collect a corpus of medical message board posts, de-identify the corpus, and extract information on potential adverse drug effects discussed by users. Using a corpus of posts to breast cancer message boards, we identified drug-event pairs using co-occurrence statistics. We then compared the identified drug-event pairs with adverse effects listed on the package labels of tamoxifen, anastrozole, exemestane, and letrozole. Of the pairs identified by our system, 75–80% were documented on the drug labels. Some of the undocumented pairs may represent previously unidentified adverse drug effects.
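
The abstract does not say which co-occurrence statistic was used. As a rough sketch of the general idea, the example below scores drug-event pairs by pointwise mutual information (PMI) over posts; PMI, the function name, the whitespace tokenisation, the `min_count` threshold, and the toy posts are all assumptions for illustration, not details from the paper.

```python
import math
from collections import Counter
from itertools import product


def drug_event_pmi(posts, drugs, events, min_count=2):
    """Score (drug, event) pairs by post-level PMI.

    Hypothetical helper. `drugs` and `events` are sets of lowercase terms;
    a pair co-occurs when both terms appear in the same post.
    """
    n_posts = len(posts)
    drug_counts = Counter()
    event_counts = Counter()
    pair_counts = Counter()

    for post in posts:
        tokens = set(post.lower().split())  # naive whitespace tokenisation
        found_drugs = tokens & drugs
        found_events = tokens & events
        drug_counts.update(found_drugs)
        event_counts.update(found_events)
        pair_counts.update(product(found_drugs, found_events))

    scores = {}
    for (drug, event), count in pair_counts.items():
        if count < min_count:
            continue  # skip pairs too rare to score reliably
        # PMI = log( P(drug, event) / (P(drug) * P(event)) )
        scores[(drug, event)] = math.log(
            (count * n_posts) / (drug_counts[drug] * event_counts[event])
        )
    return scores


# Invented toy posts, for illustration only.
toy_posts = [
    "tamoxifen gave me nausea",
    "nausea again on tamoxifen",
    "letrozole and fatigue",
    "letrozole fatigue every day",
]
pmi_scores = drug_event_pmi(
    toy_posts,
    drugs={"tamoxifen", "letrozole"},
    events={"nausea", "fatigue"},
)
```

Pairs with high scores would then be checked against the adverse effects listed on the drug labels, as the abstract describes.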