983 research outputs found
Tracking Dengue Epidemics using Twitter Content Classification and Topic Modelling
Detecting and preventing outbreaks of mosquito-borne diseases such as Dengue
and Zika in Brasil and other tropical regions has long been a priority for
governments in affected areas. Streaming social media content, such as Twitter,
is increasingly being used for health vigilance applications such as flu
detection. However, previous work has not addressed the complexity of drastic
seasonal changes on Twitter content across multiple epidemic outbreaks. In
order to address this gap, this paper contrasts two complementary approaches to
detecting Twitter content that is relevant for Dengue outbreak detection,
namely supervised classification and unsupervised clustering using topic
modelling. Each approach has benefits and shortcomings. Our classifier achieves
a prediction accuracy of about 80\% based on a small training set of about
1,000 instances, but the need for manual annotation makes it hard to track
seasonal changes in the nature of the epidemics, such as the emergence of new
types of virus in certain geographical locations. In contrast, LDA-based topic
modelling scales well, generating cohesive and well-separated clusters from
larger samples. While clusters can be easily re-generated following changes in
epidemics, however, this approach makes it hard to clearly segregate relevant
tweets into well-defined clusters.Comment: Procs. SoWeMine - co-located with ICWE 2016. 2016, Lugano,
Switzerlan
Unsupervised keyword extraction from microblog posts via hashtags
© River Publishers. Nowadays, huge amounts of texts are being generated for social networking purposes on Web. Keyword extraction from such texts like microblog posts benefits many applications such as advertising, search, and content filtering. Unlike traditional web pages, a microblog post usually has some special social feature like a hashtag that is topical in nature and generated by users. Extracting keywords related to hashtags can reflect the intents of users and thus provides us better understanding on post content. In this paper, we propose a novel unsupervised keyword extraction approach for microblog posts by treating hashtags as topical indicators. Our approach consists of two hashtag enhanced algorithms. One is a topic model algorithm that infers topic distributions biased to hashtags on a collection of microblog posts. The words are ranked by their average topic probabilities. Our topic model algorithm can not only find the topics of a collection, but also extract hashtag-related keywords. The other is a random walk based algorithm. It first builds a word-post weighted graph by taking into account posts themselves. Then, a hashtag biased random walk is applied on this graph, which guides the algorithm to extract keywords according to hashtag topics. Last, the final ranking score of a word is determined by the stationary probability after a number of iterations. We evaluate our proposed approach on a collection of real Chinese microblog posts. Experiments show that our approach is more effective in terms of precision than traditional approaches considering no hashtag. The result achieved by the combination of two algorithms performs even better than each individual algorithm
Characterizing Geo-located Tweets in Brazilian Megacities
This work presents a framework for collecting, processing and mining
geo-located tweets in order to extract meaningful and actionable knowledge in
the context of smart cities. We collected and characterized more than 9M tweets
from the two biggest cities in Brazil, Rio de Janeiro and S\~ao Paulo. We
performed topic modeling using the Latent Dirichlet Allocation model to produce
an unsupervised distribution of semantic topics over the stream of geo-located
tweets as well as a distribution of words over those topics. We manually
labeled and aggregated similar topics obtaining a total of 29 different topics
across both cities. Results showed similarities in the majority of topics for
both cities, reflecting similar interests and concerns among the population of
Rio de Janeiro and S\~ao Paulo. Nevertheless, some specific topics are more
predominant in one of the cities
Characterizing Geo-located Tweets in Brazilian Megacities
This work presents a framework for collecting, processing and mining
geo-located tweets in order to extract meaningful and actionable knowledge in
the context of smart cities. We collected and characterized more than 9M tweets
from the two biggest cities in Brazil, Rio de Janeiro and S\~ao Paulo. We
performed topic modeling using the Latent Dirichlet Allocation model to produce
an unsupervised distribution of semantic topics over the stream of geo-located
tweets as well as a distribution of words over those topics. We manually
labeled and aggregated similar topics obtaining a total of 29 different topics
across both cities. Results showed similarities in the majority of topics for
both cities, reflecting similar interests and concerns among the population of
Rio de Janeiro and S\~ao Paulo. Nevertheless, some specific topics are more
predominant in one of the cities
- …