14,002 research outputs found
Enhancing Twitter Data Analysis with Simple Semantic Filtering: Example in Tracking Influenza-Like Illnesses
Systems that exploit publicly available user generated content such as
Twitter messages have been successful in tracking seasonal influenza. We
developed a novel filtering method for Influenza-Like-Illnesses (ILI)-related
messages using 587 million messages from Twitter micro-blogs. We first filtered
messages based on syndrome keywords from the BioCaster Ontology, an extant
knowledge model of laymen's terms. We then filtered the messages according to
semantic features such as negation, hashtags, emoticons, humor and geography.
The data covered 36 weeks for the US 2009 influenza season from 30th August
2009 to 8th May 2010. Results showed that our system achieved the highest
Pearson correlation coefficient of 98.46% (p-value<2.2e-16), an improvement of
3.98% over the previous state-of-the-art method. The results indicate that
simple NLP-based enhancements to existing approaches to mine Twitter data can
increase the value of this inexpensive resource.Comment: 10 pages, 5 figures, IEEE HISB 2012 conference, Sept 27-28, 2012, La
Jolla, California, U
Characterizing Geo-located Tweets in Brazilian Megacities
This work presents a framework for collecting, processing and mining
geo-located tweets in order to extract meaningful and actionable knowledge in
the context of smart cities. We collected and characterized more than 9M tweets
from the two biggest cities in Brazil, Rio de Janeiro and S\~ao Paulo. We
performed topic modeling using the Latent Dirichlet Allocation model to produce
an unsupervised distribution of semantic topics over the stream of geo-located
tweets as well as a distribution of words over those topics. We manually
labeled and aggregated similar topics obtaining a total of 29 different topics
across both cities. Results showed similarities in the majority of topics for
both cities, reflecting similar interests and concerns among the population of
Rio de Janeiro and S\~ao Paulo. Nevertheless, some specific topics are more
predominant in one of the cities
Characterizing Geo-located Tweets in Brazilian Megacities
This work presents a framework for collecting, processing and mining
geo-located tweets in order to extract meaningful and actionable knowledge in
the context of smart cities. We collected and characterized more than 9M tweets
from the two biggest cities in Brazil, Rio de Janeiro and S\~ao Paulo. We
performed topic modeling using the Latent Dirichlet Allocation model to produce
an unsupervised distribution of semantic topics over the stream of geo-located
tweets as well as a distribution of words over those topics. We manually
labeled and aggregated similar topics obtaining a total of 29 different topics
across both cities. Results showed similarities in the majority of topics for
both cities, reflecting similar interests and concerns among the population of
Rio de Janeiro and S\~ao Paulo. Nevertheless, some specific topics are more
predominant in one of the cities
Recommended from our members
New topic detection in microblogs and topic model evaluation using topical alignment
textThis thesis deals with topic model evaluation and new topic detection in microblogs. Microblogs are short and thus may not carry any contextual clues. Hence it becomes challenging to apply traditional natural language processing algorithms on such data. Graphical models have been traditionally used for topic discovery and text clustering on sets of text-based documents. Their unsupervised nature allows topic models to be trained easily on datasets meant for specific domains. However the advantage of not requiring annotated data comes with a drawback with respect to evaluation difficulties. The problem aggravates when the data comprises microblogs which are unstructured and noisy.
We demonstrate the application of three types of such models to microblogs - the Latent Dirichlet Allocation, the Author-Topic and the Author-Recipient-Topic model. We extensively evaluate these models under different settings, and our results show that the Author-Recipient-Topic model extracts the most coherent topics. We also addressed the problem of topic modeling on short text by using clustering techniques. This technique helps in boosting the performance of our models.
Topical alignment is used for large scale assessment of topical relevance by comparing topics to manually generated domain specific concepts. In this thesis we use this idea to evaluate topic models by measuring misalignments between topics. Our study on comparing topic models reveals interesting traits about Twitter messages, users and their interactions and establishes that joint modeling on author-recipient pairs and on the content of tweet leads to qualitatively better topic discovery.
This thesis gives a new direction to the well known problem of topic discovery in microblogs. Trend prediction or topic discovery for microblogs is an extensive research area. We propose the idea of using topical alignment to detect new topics by comparing topics from the current week to those of the previous week. We measure correspondence between a set of topics from the current week and a set of topics from the previous week to quantify five types of misalignments: \textit{junk, fused, missing} and \textit{repeated}. Our analysis compares three types of topic models under different settings and demonstrates how our framework can detect new topics from topical misalignments. In particular so-called \textit{junk} topics are more likely to be new topics and the \textit{missing} topics are likely to have died or die out.
To get more insights into the nature of microblogs we apply topical alignment to hashtags. Comparing topics to hashtags enables us to make interesting inferences about Twitter messages and their content. Our study revealed that although a very small proportion of Twitter messages explicitly contain hashtags, the proportion of tweets that discuss topics related to hashtags is much higher.Computer Science
- …