SmokEng: Towards Fine-grained Classification of Tobacco-related Social Media Text
Contemporary datasets on tobacco consumption focus on one of two topics,
either public health mentions and disease surveillance, or sentiment analysis
on topical tobacco products and services. However, two primary considerations
are not accounted for: the language of the demographic affected and a
combination of the topics mentioned above in a fine-grained classification
mechanism. In this paper, we create a dataset of 3144 tweets, which are
selected based on the presence of colloquial slang related to smoking and
analyze it based on the semantics of the tweet. Each class is created and
annotated based on the content of the tweets such that further hierarchical
methods can be easily applied.
Further, we demonstrate the efficacy of standard text classification methods on
this dataset by designing experiments that perform both binary and
multi-class classification. Our experiments tackle the identification of either
a specific topic (such as tobacco product promotion), a general mention
(cigarettes and related products) or a more fine-grained classification. This
methodology paves the way for further analysis, such as understanding sentiment
or style, which makes this dataset a vital contribution to both disease
surveillance and tobacco use research.
Comment: Accepted at the Workshop on Noisy User-generated Text (W-NUT) at EMNLP-IJCNLP 201
A customisable pipeline for continuously harvesting socially-minded Twitter users
On social media platforms and Twitter in particular, specific classes of
users such as influencers have been given satisfactory operational definitions
in terms of network and content metrics.
Others, for instance online activists, are no less important, but their
characterisation still requires experimentation.
We hypothesise that such interesting users can be found within
temporally and spatially localised contexts, i.e., small but topical fragments
of the network containing interactions about social events or campaigns with a
significant footprint on Twitter.
To explore this hypothesis, we have designed a continuous user profile
discovery pipeline that produces an ever-growing dataset of user profiles by
harvesting and analysing contexts from the Twitter stream.
The profiles dataset includes key network and content-based user metrics,
enabling experimentation with user-defined score functions that characterise
specific classes of online users.
The paper describes the design and implementation of the pipeline and its
empirical evaluation on a case study consisting of healthcare-related campaigns
in the UK, showing how it supports the operational definitions of online
activism, by comparing three experimental ranking functions. The code is
publicly available.
Comment: Procs. ICWE 2019, June 2019, Kore
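As an illustration of what "user-defined score functions" over profile metrics could look like, here is a minimal Python sketch. It is not the pipeline's code: the `UserProfile` fields, the two score functions, and the sample profiles are all hypothetical, chosen only to show how different scoring choices can be compared as ranking functions over the same profiles dataset.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class UserProfile:
    # Hypothetical network and content metrics; the paper's actual metrics may differ.
    handle: str
    followers: int
    retweets_received: int
    on_topic_tweets: int
    total_tweets: int


def engagement_score(p: UserProfile) -> float:
    # Retweets received per follower (assumed proxy for reach relative to audience size).
    return p.retweets_received / p.followers if p.followers else 0.0


def topical_focus_score(p: UserProfile) -> float:
    # Share of the user's tweets that fall inside the harvested context.
    return p.on_topic_tweets / p.total_tweets if p.total_tweets else 0.0


def rank_users(profiles: List[UserProfile],
               score: Callable[[UserProfile], float]) -> List[UserProfile]:
    # Highest score first; any function of a profile can be plugged in.
    return sorted(profiles, key=score, reverse=True)


# Invented sample data for illustration only.
profiles = [
    UserProfile("broadcaster", 100_000, 500, 5, 1000),
    UserProfile("campaigner", 800, 240, 90, 120),
    UserProfile("bystander", 300, 3, 2, 200),
]

by_engagement = [p.handle for p in rank_users(profiles, engagement_score)]
by_focus = [p.handle for p in rank_users(profiles, topical_focus_score)]
print(by_engagement)
print(by_focus)
```

Keeping the score function as a plain callable means new candidate definitions of a user class can be tried against the same stored profiles without re-harvesting anything.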
Text Mining Methods for Analyzing Online Health Information and Communication
The Internet provides an alternative way to share health information. Specifically, social network systems such as Twitter, Facebook, Reddit, and disease-specific online support forums are increasingly being used to share information on health-related topics. This could take the form of disclosing personal health information to seek suggestions, or of answering other patients' questions based on one's own history. This social media uptake gives a new angle from which to improve the current health communication landscape with consumer-generated content from social platforms. With these online modes of communication, health providers can offer more immediate support to people seeking advice. Non-profit organizations and federal agencies can also diffuse preventative information in such networks for better outcomes. Researchers in health communication can mine user-generated content on social networks to understand themes and derive insights into patient experiences that may be impractical to glean through traditional surveys. The main difficulty in mining social health data is separating the signal from the noise. Social data is characterized by the informal nature of its content, typos, emoticons, tonal variations (e.g., sarcasm), and ambiguities arising from polysemous words, all of which make it difficult to build automated systems for deriving insights from such sources.
In this dissertation, we present four efforts to mine health-related insights from user-generated social data. In the first effort, we build a model to identify marketing tweets on electronic cigarettes (e-cigs) and assess different topics in marketing and non-marketing messages on e-cigs on Twitter. In our next effort, we build ensemble models to classify messages on a mental health forum for triaging posts whose authors need immediate attention from trained moderators to prevent self-harm. The third effort deals with models from our participation in a shared task on identifying tweets that discuss adverse drug reactions and those that mention medication intake. In the final task, we build a classifier that identifies whether a particular tweet about the popular Juul e-cig indicates that the tweeter actually uses the product. Our methods include linear classifiers (e.g., logistic regression), classical nonlinear models (e.g., nearest neighbors), recent deep neural networks (e.g., convolutional neural networks), and ensembles of all these models using different supervised training regimens (e.g., co-training). The focus is more on task-specific system building than on building specific individual models. Overall, we demonstrate that it is possible to glean insights from social data on health-related topics through natural language processing and machine learning, with use cases from substance use and mental health.
Learning Representations of Social Media Users
User representations are routinely used in recommendation systems by platform
developers, targeted advertisements by marketers, and by public policy
researchers to gauge public opinion across demographic groups. Computer
scientists consider the problem of inferring user representations more
abstractly; how does one extract a stable user representation - effective for
many downstream tasks - from a medium as noisy and complicated as social media?
The quality of a user representation is ultimately task-dependent (e.g. does
it improve classifier performance, make more accurate recommendations in a
recommendation system) but there are proxies that are less sensitive to the
specific task. Is the representation predictive of latent properties such as a
person's demographic features, socioeconomic class, or mental health state? Is
it predictive of the user's future behavior?
In this thesis, we begin by showing how user representations can be learned
from multiple types of user behavior on social media. We apply several
extensions of generalized canonical correlation analysis to learn these
representations and evaluate them at three tasks: predicting future hashtag
mentions, friending behavior, and demographic features. We then show how user
features can be employed as distant supervision to improve topic model fit.
Finally, we show how user features can be integrated into and improve existing
classifiers in the multitask learning framework. We treat user representations
- ground truth gender and mental health features - as auxiliary tasks to
improve mental health state prediction. We also use distributed user
representations learned in the first chapter to improve tweet-level stance
classifiers, showing that distant user information can inform classification
tasks at the granularity of a single message.
Comment: PhD thesi
Vaporous marketing: Uncovering pervasive electronic cigarette advertisements on twitter
Background: Twitter has become the wild west of marketing and promotional strategies for advertisement agencies. Electronic cigarettes have been heavily marketed across Twitter feeds, offering discounts, kid-friendly flavors, algorithmically generated false testimonials, and free samples.
Methods: All electronic cigarette keyword related tweets from a 10% sample of Twitter spanning January 2012 through December 2014 (approximately 850,000 total tweets) were identified and categorized as Automated or Organic by combining a keyword classification and a machine trained Human Detection algorithm. A sentiment analysis using Hedonometrics was performed on Organic tweets to quantify the change in consumer sentiments over time. Commercialized tweets were topically categorized with key phrasal pattern matching.
Results: The overwhelming majority (80%) of tweets were classified as automated or promotional in nature. The majority of these tweets were coded as commercialized (83.65% in 2013), up to 33% of which offered discounts or free samples and appeared on over a billion Twitter feeds as impressions. The positivity of Organic (human) classified tweets has decreased over time (5.84 in 2013 to 5.77 in 2014) due to a relative increase in negative words such as 'ban', 'tobacco', 'doesn't', 'drug', 'against', 'poison', 'tax' and a relative decrease in positive words like 'haha', 'good', 'cool'. Automated tweets are more positive than organic (6.17 versus 5.84) due to a relative increase in marketing words like 'best', 'win', 'buy', 'sale', 'health', 'discount' and a relative decrease in negative words like 'bad', 'hate', 'stupid', 'don't'.
Conclusions: Due to the youth presence on Twitter and the clinical uncertainty of the long term health complications of electronic cigarette consumption, the protection of public health warrants scrutiny and potential regulation of social media marketing.
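To make the positivity figures above more concrete: a hedonometer-style score is usually computed as a frequency-weighted average of per-word happiness ratings, so a shift in word usage moves the average. The Python sketch below illustrates that calculation only. The word ratings, the helper name `average_happiness`, and the sample tweets are invented for illustration and are not the lexicon or data used in the study.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

# Invented ratings on a 1-9 scale for illustration; not the study's actual lexicon values.
WORD_HAPPINESS: Dict[str, float] = {
    "good": 7.5, "win": 7.0, "sale": 6.5,
    "ban": 3.0, "poison": 2.0, "tax": 3.5,
}


def average_happiness(tweets: Iterable[str],
                      scores: Dict[str, float] = WORD_HAPPINESS) -> Optional[float]:
    # Count only words that have a rating; unrated words are ignored.
    counts = Counter(
        word
        for tweet in tweets
        for word in tweet.lower().split()
        if word in scores
    )
    total = sum(counts.values())
    if total == 0:
        return None  # no rated words, so no score can be computed
    # Frequency-weighted mean of the per-word ratings.
    return sum(scores[word] * n for word, n in counts.items()) / total


# Hypothetical sample tweets.
organic = ["ban the poison tax", "good vibes"]
automated = ["win big sale", "good sale"]

organic_score = average_happiness(organic)
automated_score = average_happiness(automated)
print(organic_score, automated_score)
```

In this toy example the automated set scores higher purely because marketing words with high ratings appear more often, which mirrors the kind of word-frequency explanation given in the Results.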