17 research outputs found
Semi-supervised Text Regression with Conditional Generative Adversarial Networks
Enormous online textual information provides intriguing opportunities for
understandings of social and economic semantics. In this paper, we propose a
novel text regression model based on a conditional generative adversarial
network (GAN), with an attempt to associate textual data and social outcomes in
a semi-supervised manner. Besides promising potential of predicting
capabilities, our superiorities are twofold: (i) the model works with
unbalanced datasets of limited labelled data, which align with real-world
scenarios; and (ii) predictions are obtained by an end-to-end framework,
without explicitly selecting high-level representations. Finally we point out
related datasets for experiments and future research directions
Classifying Tweet Level Judgements of Rumours in Social Media
Social media is a rich source of rumours and corresponding community
reactions. Rumours reflect different characteristics, some shared and some
individual. We formulate the problem of classifying tweet level judgements of
rumours as a supervised learning task. Both supervised and unsupervised domain
adaptation are considered, in which tweets from a rumour are classified on the
basis of other annotated rumours. We demonstrate how multi-task learning helps
achieve good results on rumours from the 2011 England riots
What Twitter Profile and Posted Images Reveal About Depression and Anxiety
Previous work has found strong links between the choice of social media
images and users' emotions, demographics and personality traits. In this study,
we examine which attributes of profile and posted images are associated with
depression and anxiety of Twitter users. We used a sample of 28,749 Facebook
users to build a language prediction model of survey-reported depression and
anxiety, and validated it on Twitter on a sample of 887 users who had taken
anxiety and depression surveys. We then applied it to a different set of 4,132
Twitter users to impute language-based depression and anxiety labels, and
extracted interpretable features of posted and profile pictures to uncover the
associations with users' depression and anxiety, controlling for demographics.
For depression, we find that profile pictures suppress positive emotions rather
than display more negative emotions, likely because of social media
self-presentation biases. They also tend to show the single face of the user
(rather than show her in groups of friends), marking increased focus on the
self, emblematic for depression. Posted images are dominated by grayscale and
low aesthetic cohesion across a variety of image features. Profile images of
anxious users are similarly marked by grayscale and low aesthetic cohesion, but
less so than those of depressed users. Finally, we show that image features can
be used to predict depression and anxiety, and that multitask learning that
includes a joint modeling of demographics improves prediction performance.
Overall, we find that the image attributes that mark depression and anxiety
offer a rich lens into these conditions largely congruent with the
psychological literature, and that images on Twitter allow inferences about the
mental health status of users.Comment: ICWSM 201
Longitudinal Modeling of Social Media with Hawkes Process based on Users and Networks
Online social networks provide a platform for
sharing information at an unprecedented scale. Users generate
information which propagates across the network resulting in
information cascades. In this paper, we study the evolution of
information cascades in Twitter using a point process model
of user activity. We develop several Hawkes process models
considering various properties including conversational structure,
users’ connections and general features of users including the
textual information, and show how they are helpful in modeling
the social network activity. We consider low-rank embeddings
of users and user features, and learn the features helpful in
identifying the influence and susceptibility of users. Evaluation
on Twitter data sets associated with civil unrest shows that
incorporating richer properties improves the performance in
predicting future activity of users and memes
Semi-supervised Text Regression with Conditional Generative Adversarial Networks
Enormous online textual information provides intriguing opportunities for understandings of social and economic semantics. In this paper, we propose a novel text regression model based on a conditional generative adversarial network (GAN), with an attempt to associate textual data and social outcomes in a semi-supervised manner. Besides promising potential of predicting capabilities, our superiorities are twofold: (i) the model works with unbalanced datasets of limited labelled data, which align with real-world scenarios; and (ii) predictions are obtained by an end-to-end framework, without explicitly selecting high-level representations. Finally we point out related datasets for experiments and future research directions
Advances in nowcasting influenza-like illness rates using search query logs
User-generated content can assist epidemiological surveillance in the early detection and prevalence estimation of infectious diseases, such as influenza. Google Flu Trends embodies the first public platform for transforming search queries to indications about the current state of flu in various places all over the world. However, the original model significantly mispredicted influenza-like illness rates in the US during the 2012–13 flu season. In this work, we build on the previous modeling attempt, proposing substantial improvements. Firstly, we investigate the performance of a widely used linear regularized regression solver, known as the Elastic Net. Then, we expand on this model by incorporating the queries selected by the Elastic Net into a nonlinear regression framework, based on a composite Gaussian Process. Finally, we augment the query-only predictions with an autoregressive model, injecting prior knowledge about the disease. We assess predictive performance using five consecutive flu seasons spanning from 2008 to 2013 and qualitatively explain certain shortcomings of the previous approach. Our results indicate that a nonlinear query modeling approach delivers the lowest cumulative nowcasting error, and also suggest that query information significantly improves autoregressive inferences, obtaining state-of-the-art performance
Enhancing Feature Selection Using Word Embeddings: The Case of Flu Surveillance
Health surveillance systems based on online user-generated content often rely on the identification of textual markers that are related to a target disease. Given the high volume of available data, these systems benefit from an automatic feature selection process. This is accomplished either by applying statistical learning techniques, which do not consider the semantic relationship between the selected features and the inference task, or by developing labour-intensive text classifiers. In this paper, we use neural word embeddings, trained on social media content from Twitter, to determine, in an unsupervised manner, how strongly textual features are semantically linked to an underlying health concept. We then refine conventional feature selection methods by a priori operating on textual variables that are sufficiently close to a target concept. Our experiments focus on the supervised learning problem of estimating influenza-like illness rates from Google search queries. A "flu infection" concept is formulated and used to reduce spurious and potentially confounding features that were selected by previously applied approaches. In this way, we also address forms of scepticism regarding the appropriateness of the feature space, alleviating potential cases of overfitting. Ultimately, the proposed hybrid feature selection method creates a more reliable model that, according to our empirical analysis, improves the inference performance (Mean Absolute Error) of linear and nonlinear regressors by 12% and 28.7%, respectively
Multi-Task Learning Improves Disease Models from Web Search
We investigate the utility of multi-task learning to disease surveillance using Web search data. Our motivation is two-fold. Firstly, we assess whether concurrently training models for various geographies - inside a country or across different countries - can improve accuracy. We also test the ability of such models to assist health systems that are producing sporadic disease surveillance reports that reduce the quantity of available training data. We explore both linear and nonlinear models, specifically a multi-task expansion of elastic net and a multi-task Gaussian Process, and compare them to their respective single task formulations. We use influenza-like illness as a case study and conduct experiments on the United States (US) as well as England, where both health and Google search data were obtained. Our empirical results indicate that multi-task learning improves regional as well as national models for the US. The percentage of improvement on mean absolute error increases up to 14.8% as the historical training data is reduced from 5 to 1 year(s), illustrating that accurate models can be obtained, even by training on relatively short time intervals. Furthermore, in simulated scenarios, where only a few health reports (training data) are available, we show that multi-task learning helps to maintain a stable performance across all the affected locations. Finally, we present results from a cross-country experiment, where data from the US improves the estimates for England. As the historical training data for England is reduced, the benefits of multi-task learning increase, reducing mean absolute error by up to 40%