1,332 research outputs found

    SmokEng: Towards Fine-grained Classification of Tobacco-related Social Media Text

    Contemporary datasets on tobacco consumption focus on one of two topics: either public health mentions and disease surveillance, or sentiment analysis on topical tobacco products and services. However, two primary considerations are not accounted for: the language of the demographic affected, and a combination of the topics mentioned above in a fine-grained classification mechanism. In this paper, we create a dataset of 3144 tweets, which are selected based on the presence of colloquial slang related to smoking, and analyze it based on the semantics of the tweet. Each class is created and annotated based on the content of the tweets such that further hierarchical methods can be easily applied. Further, we prove the efficacy of standard text classification methods on this dataset by designing experiments which do both binary and multi-class classification. Our experiments tackle the identification of either a specific topic (such as tobacco product promotion), a general mention (cigarettes and related products) or a more fine-grained classification. This methodology paves the way for further analysis, such as understanding sentiment or style, which makes this dataset a vital contribution to both disease surveillance and tobacco use research. Comment: Accepted at the Workshop on Noisy User-generated Text (W-NUT) at EMNLP-IJCNLP 201

    Identification and characterization of diseases on social web

    [no abstract]

    A customisable pipeline for continuously harvesting socially-minded Twitter users

    On social media platforms and Twitter in particular, specific classes of users such as influencers have been given satisfactory operational definitions in terms of network and content metrics. Others, for instance online activists, are no less important, but their characterisation still requires experimenting. We hypothesise that such interesting users can be found within temporally and spatially localised contexts, i.e., small but topical fragments of the network containing interactions about social events or campaigns with a significant footprint on Twitter. To explore this hypothesis, we have designed a continuous user profile discovery pipeline that produces an ever-growing dataset of user profiles by harvesting and analysing contexts from the Twitter stream. The profiles dataset includes key network and content-based user metrics, enabling experimentation with user-defined score functions that characterise specific classes of online users. The paper describes the design and implementation of the pipeline and its empirical evaluation on a case study consisting of healthcare-related campaigns in the UK, showing how it supports the operational definitions of online activism by comparing three experimental ranking functions. The code is publicly available. Comment: Procs. ICWE 2019, June 2019, Kore

    Text Mining Methods for Analyzing Online Health Information and Communication

    The Internet provides an alternative way to share health information. Specifically, social network systems such as Twitter, Facebook, Reddit, and disease specific online support forums are increasingly being used to share information on health related topics. This could be in the form of personal health information disclosure to seek suggestions or answering other patients' questions based on their history. This social media uptake gives a new angle to improve the current health communication landscape with consumer generated content from social platforms. With these online modes of communication, health providers can offer more immediate support to the people seeking advice. Non-profit organizations and federal agencies can also diffuse preventative information in such networks for better outcomes. Researchers in health communication can mine user generated content on social networks to understand themes and derive insights into patient experiences that may be impractical to glean through traditional surveys. The main difficulty in mining social health data is in separating the signal from the noise. Social data is characterized by informal nature of content, typos, emoticons, tonal variations (e.g. sarcasm), and ambiguities arising from polysemous words, all of which make it difficult in building automated systems for deriving insights from such sources. In this dissertation, we present four efforts to mine health related insights from user generated social data. In the first effort, we build a model to identify marketing tweets on electronic cigarettes (e-cigs) and assess different topics in marketing and non-marketing messages on e-cigs on Twitter. In our next effort, we build ensemble models to classify messages on a mental health forum for triaging posts whose authors need immediate attention from trained moderators to prevent self-harm. The third effort deals with models from our participation in a shared task on identifying tweets that discuss adverse drug reactions and those that mention medication intake. In the final task, we build a classifier that identifies whether a particular tweet about the popular Juul e-cig indicates the tweeter actually using the product. Our methods range from linear classifiers (e.g., logistic regression), classical nonlinear models (e.g., nearest neighbors), recent deep neural networks (e.g., convolutional neural networks), and ensembles of all these models in using different supervised training regimens (e.g., co-training). The focus is more on task specific system building than on building specific individual models. Overall, we demonstrate that it is possible to glean insights from social data on health related topics through natural language processing and machine learning with use-cases from substance use and mental health.

    Learning Representations of Social Media Users

    User representations are routinely used in recommendation systems by platform developers, targeted advertisements by marketers, and by public policy researchers to gauge public opinion across demographic groups. Computer scientists consider the problem of inferring user representations more abstractly; how does one extract a stable user representation - effective for many downstream tasks - from a medium as noisy and complicated as social media? The quality of a user representation is ultimately task-dependent (e.g. does it improve classifier performance, make more accurate recommendations in a recommendation system) but there are proxies that are less sensitive to the specific task. Is the representation predictive of latent properties such as a person's demographic features, socioeconomic class, or mental health state? Is it predictive of the user's future behavior? In this thesis, we begin by showing how user representations can be learned from multiple types of user behavior on social media. We apply several extensions of generalized canonical correlation analysis to learn these representations and evaluate them at three tasks: predicting future hashtag mentions, friending behavior, and demographic features. We then show how user features can be employed as distant supervision to improve topic model fit. Finally, we show how user features can be integrated into and improve existing classifiers in the multitask learning framework. We treat user representations - ground truth gender and mental health features - as auxiliary tasks to improve mental health state prediction. We also use distributed user representations learned in the first chapter to improve tweet-level stance classifiers, showing that distant user information can inform classification tasks at the granularity of a single message. Comment: PhD thesis

    Vaporous marketing: Uncovering pervasive electronic cigarette advertisements on twitter

    Background Twitter has become the wild-west of marketing and promotional strategies for advertisement agencies. Electronic cigarettes have been heavily marketed across Twitter feeds, offering discounts, kid-friendly flavors, algorithmically generated false testimonials, and free samples. Methods All electronic cigarette keyword related tweets from a 10% sample of Twitter spanning January 2012 through December 2014 (approximately 850,000 total tweets) were identified and categorized as Automated or Organic by combining a keyword classification and a machine trained Human Detection algorithm. A sentiment analysis using Hedonometrics was performed on Organic tweets to quantify the change in consumer sentiments over time. Commercialized tweets were topically categorized with key phrasal pattern matching. Results The overwhelming majority (80%) of tweets were classified as automated or promotional in nature. The majority of these tweets were coded as commercialized (83.65% in 2013), up to 33% of which offered discounts or free samples and appeared on over a billion Twitter feeds as impressions. The positivity of Organic (human) classified tweets has decreased over time (5.84 in 2013 to 5.77 in 2014) due to a relative increase in the negative words 'ban', 'tobacco', 'doesn't', 'drug', 'against', 'poison', 'tax' and a relative decrease in the positive words like 'haha', 'good', 'cool'. Automated tweets are more positive than organic (6.17 versus 5.84) due to a relative increase in the marketing words like 'best', 'win', 'buy', 'sale', 'health', 'discount' and a relative decrease in negative words like 'bad', 'hate', 'stupid', 'don't'. Conclusions Due to the youth presence on Twitter and the clinical uncertainty of the long term health complications of electronic cigarette consumption, the protection of public health warrants scrutiny and potential regulation of social media marketing.