1,429 research outputs found
Semantics-driven event clustering in Twitter feeds
Detecting events using social media such as Twitter has many useful applications in real-life situations. Many algorithms which all use different information sources - either textual, temporal, geographic or community features - have been developed to achieve this task. Semantic information is often added at the end of the event detection to classify events into semantic topics. But semantic information can also be used to drive the actual event detection, which is less covered by academic research. We therefore supplemented an existing baseline event clustering algorithm with semantic information about the tweets in order to improve its performance. This paper lays out the details of the semantics-driven event clustering algorithms developed, discusses a novel method to aid in the creation of a ground truth for event detection purposes, and analyses how well the algorithms improve over baseline. We find that assigning semantic information to every individual tweet results in just a worse performance in F1 measure compared to baseline. If however semantics are assigned on a coarser, hashtag level the improvement over baseline is substantial and significant in both precision and recall
Recommended from our members
New topic detection in microblogs and topic model evaluation using topical alignment
textThis thesis deals with topic model evaluation and new topic detection in microblogs. Microblogs are short and thus may not carry any contextual clues. Hence it becomes challenging to apply traditional natural language processing algorithms on such data. Graphical models have been traditionally used for topic discovery and text clustering on sets of text-based documents. Their unsupervised nature allows topic models to be trained easily on datasets meant for specific domains. However the advantage of not requiring annotated data comes with a drawback with respect to evaluation difficulties. The problem aggravates when the data comprises microblogs which are unstructured and noisy.
We demonstrate the application of three types of such models to microblogs - the Latent Dirichlet Allocation, the Author-Topic and the Author-Recipient-Topic model. We extensively evaluate these models under different settings, and our results show that the Author-Recipient-Topic model extracts the most coherent topics. We also addressed the problem of topic modeling on short text by using clustering techniques. This technique helps in boosting the performance of our models.
Topical alignment is used for large scale assessment of topical relevance by comparing topics to manually generated domain specific concepts. In this thesis we use this idea to evaluate topic models by measuring misalignments between topics. Our study on comparing topic models reveals interesting traits about Twitter messages, users and their interactions and establishes that joint modeling on author-recipient pairs and on the content of tweet leads to qualitatively better topic discovery.
This thesis gives a new direction to the well known problem of topic discovery in microblogs. Trend prediction or topic discovery for microblogs is an extensive research area. We propose the idea of using topical alignment to detect new topics by comparing topics from the current week to those of the previous week. We measure correspondence between a set of topics from the current week and a set of topics from the previous week to quantify five types of misalignments: \textit{junk, fused, missing} and \textit{repeated}. Our analysis compares three types of topic models under different settings and demonstrates how our framework can detect new topics from topical misalignments. In particular so-called \textit{junk} topics are more likely to be new topics and the \textit{missing} topics are likely to have died or die out.
To get more insights into the nature of microblogs we apply topical alignment to hashtags. Comparing topics to hashtags enables us to make interesting inferences about Twitter messages and their content. Our study revealed that although a very small proportion of Twitter messages explicitly contain hashtags, the proportion of tweets that discuss topics related to hashtags is much higher.Computer Science
Tracking Dengue Epidemics using Twitter Content Classification and Topic Modelling
Detecting and preventing outbreaks of mosquito-borne diseases such as Dengue
and Zika in Brasil and other tropical regions has long been a priority for
governments in affected areas. Streaming social media content, such as Twitter,
is increasingly being used for health vigilance applications such as flu
detection. However, previous work has not addressed the complexity of drastic
seasonal changes on Twitter content across multiple epidemic outbreaks. In
order to address this gap, this paper contrasts two complementary approaches to
detecting Twitter content that is relevant for Dengue outbreak detection,
namely supervised classification and unsupervised clustering using topic
modelling. Each approach has benefits and shortcomings. Our classifier achieves
a prediction accuracy of about 80\% based on a small training set of about
1,000 instances, but the need for manual annotation makes it hard to track
seasonal changes in the nature of the epidemics, such as the emergence of new
types of virus in certain geographical locations. In contrast, LDA-based topic
modelling scales well, generating cohesive and well-separated clusters from
larger samples. While clusters can be easily re-generated following changes in
epidemics, however, this approach makes it hard to clearly segregate relevant
tweets into well-defined clusters.Comment: Procs. SoWeMine - co-located with ICWE 2016. 2016, Lugano,
Switzerlan
Viewpoint Discovery and Understanding in Social Networks
The Web has evolved to a dominant platform where everyone has the opportunity
to express their opinions, to interact with other users, and to debate on
emerging events happening around the world. On the one hand, this has enabled
the presence of different viewpoints and opinions about a - usually
controversial - topic (like Brexit), but at the same time, it has led to
phenomena like media bias, echo chambers and filter bubbles, where users are
exposed to only one point of view on the same topic. Therefore, there is the
need for methods that are able to detect and explain the different viewpoints.
In this paper, we propose a graph partitioning method that exploits social
interactions to enable the discovery of different communities (representing
different viewpoints) discussing about a controversial topic in a social
network like Twitter. To explain the discovered viewpoints, we describe a
method, called Iterative Rank Difference (IRD), which allows detecting
descriptive terms that characterize the different viewpoints as well as
understanding how a specific term is related to a viewpoint (by detecting other
related descriptive terms). The results of an experimental evaluation showed
that our approach outperforms state-of-the-art methods on viewpoint discovery,
while a qualitative analysis of the proposed IRD method on three different
controversial topics showed that IRD provides comprehensive and deep
representations of the different viewpoints
Language in Our Time: An Empirical Analysis of Hashtags
Hashtags in online social networks have gained tremendous popularity during
the past five years. The resulting large quantity of data has provided a new
lens into modern society. Previously, researchers mainly rely on data collected
from Twitter to study either a certain type of hashtags or a certain property
of hashtags. In this paper, we perform the first large-scale empirical analysis
of hashtags shared on Instagram, the major platform for hashtag-sharing. We
study hashtags from three different dimensions including the temporal-spatial
dimension, the semantic dimension, and the social dimension. Extensive
experiments performed on three large-scale datasets with more than 7 million
hashtags in total provide a series of interesting observations. First, we show
that the temporal patterns of hashtags can be categorized into four different
clusters, and people tend to share fewer hashtags at certain places and more
hashtags at others. Second, we observe that a non-negligible proportion of
hashtags exhibit large semantic displacement. We demonstrate hashtags that are
more uniformly shared among users, as quantified by the proposed hashtag
entropy, are less prone to semantic displacement. In the end, we propose a
bipartite graph embedding model to summarize users' hashtag profiles, and rely
on these profiles to perform friendship prediction. Evaluation results show
that our approach achieves an effective prediction with AUC (area under the ROC
curve) above 0.8 which demonstrates the strong social signals possessed in
hashtags.Comment: WWW 201
A survey of data mining techniques for social media analysis
Social network has gained remarkable attention in the last decade. Accessing social network sites such as Twitter, Facebook LinkedIn and Google+ through the internet and the web 2.0 technologies has become more affordable. People are becoming more interested in and relying on social network for information, news and opinion of other users on diverse subject matters. The heavy reliance on social network sites causes them to generate massive data characterised by three computational issues namely; size, noise and dynamism. These issues often make social network data very complex to analyse manually, resulting in the pertinent use of computational means of analysing them. Data mining provides a wide range of techniques for detecting useful knowledge from massive datasets like trends, patterns and rules [44]. Data mining techniques are used for information retrieval, statistical modelling and machine learning. These techniques employ data pre-processing, data analysis, and data interpretation processes in the course of data analysis. This survey discusses different data mining techniques used in mining diverse aspects of the social network over decades going from the historical techniques to the up-to-date models, including our novel technique named TRCM. All the techniques covered in this survey are listed in the Table.1 including the tools employed as well as names of their authors
Unsupervised improvement of named entity extraction in short informal context using disambiguation clues
Short context messages (like tweets and SMSâs) are a potentially rich source of continuously and instantly updated information. Shortness and informality of such messages are challenges for Natural Language Processing tasks. Most efforts done in this direction rely on machine learning techniques which are expensive in terms of data collection and training. In this paper we present an unsupervised Semantic Web-driven approach to improve the extraction process by using clues from the disambiguation process. For extraction we used a simple Knowledge-Base matching technique combined with a clustering-based approach for disambiguation. Experimental results on a self-collected set of tweets (as an example of short context messages) show improvement in extraction results when using unsupervised feedback from the disambiguation process
- âŠ