34 research outputs found

    Building a Test Collection for Significant-Event Detection in Arabic Tweets

    With the increasing popularity of microblogging services like Twitter, researchers discovered a rich medium for tackling real-life problems like event detection. However, event detection in Twitter is often obstructed by the lack of public evaluation mechanisms such as test collections (set of tweets, labels, and queries to measure the effectiveness of an information retrieval system). The problem is more evident when non-English languages, e.g., Arabic, are concerned. With the recent surge of significant events in the Arab world, news agencies and decision makers rely on Twitter's microblogging service to obtain recent information on events. In this thesis, we address the problem of building a test collection of Arabic tweets (named EveTAR) for the task of event detection. To build EveTAR, we first adopted an adequate definition of an event, which is a significant occurrence that takes place at a certain time. An occurrence is significant if there are news articles about it. We collected Arabic tweets using Twitter's streaming API. Then, we identified a set of events from the Arabic data collection using Wikipedia's current events portal. Corresponding tweets were extracted by querying the Arabic data collection with a set of manually-constructed queries. To obtain relevance judgments for those tweets, we leveraged CrowdFlower's crowdsourcing platform. Over a period of 4 weeks, we crawled over 590M tweets, from which we identified 66 events that cover 8 different categories and gathered more than 134k relevance judgments. Each event contains an average of 779 relevant tweets. Over all events, we got an average Kappa of 0.6, which is a substantially acceptable value. EveTAR was used to evaluate three state-of-the-art event detection algorithms. The best performing algorithms achieved 0.60 in F1 measure and 0.80 in both precision and recall. 
We plan to make our test collection available for research, including event descriptions, manually-crafted queries to extract potentially-relevant tweets, and all judgments per tweet. EveTAR is the first Arabic test collection built from scratch for the task of event detection. Additionally, we show in our experiments that it supports other tasks like ad-hoc search

    An enhanced binary bat and Markov clustering algorithms to improve event detection for heterogeneous news text documents

    Event Detection (ED) identifies events from various types of data. Building an ED model for news text documents greatly helps decision-makers in various disciplines improve their strategies. However, identifying and summarizing events from such data is a non-trivial task due to the large volume of published heterogeneous news text documents. Such documents create a high-dimensional feature space that degrades the overall performance of the baseline methods in the ED model. To address this problem, this research presents an enhanced ED model that includes improved methods for the crucial phases of the ED model: Feature Selection (FS), ED, and summarization. This work focuses on the FS problem by automatically detecting events through a novel wrapper FS method based on the Adapted Binary Bat Algorithm (ABBA) and the Adapted Markov Clustering Algorithm (AMCL), termed ABBA-AMCL. These adaptive techniques were developed to overcome the premature convergence in BBA and the fast convergence rate in MCL. Furthermore, this study proposes four summarizing methods to generate informative summaries. The enhanced ED model was tested on 10 benchmark datasets and 2 Facebook news datasets. The effectiveness of ABBA-AMCL was compared to 8 FS methods based on meta-heuristic algorithms and 6 graph-based ED methods. The empirical and statistical results showed that ABBA-AMCL surpassed other methods on most datasets. The key representative features demonstrated that the ABBA-AMCL method successfully detects real-world events from Facebook news datasets with 0.96 Precision and 1 Recall for dataset 11, while for dataset 12, the Precision is 1 and Recall is 0.76. To conclude, the novel ABBA-AMCL presented in this research has bridged the research gap and addressed the curse of dimensionality in the high-dimensional feature space of heterogeneous news text documents. 
Hence, the enhanced ED model can organize news documents into distinct events and provide policymakers with valuable information for decision making
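
The abstract above reports Precision and Recall for the two Facebook news datasets but no combined score. As a small worked example, the sketch below computes the standard F1 measure (the harmonic mean of precision and recall) from those reported values. The function and variable names are illustrative assumptions, and the resulting F1 values are derived here rather than taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as stated in the abstract for the Facebook datasets.
reported = {"dataset 11": (0.96, 1.0), "dataset 12": (1.0, 0.76)}

# F1 is derived here for illustration; the abstract itself does not report it.
f1_by_dataset = {name: f1_score(p, r) for name, (p, r) in reported.items()}

for name, f1 in f1_by_dataset.items():
    print(f"{name}: F1 = {f1:.3f}")
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two values, which is why the lower recall on dataset 12 costs more than the slightly lower precision on dataset 11.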

    Understanding Bots on Social Media - An Application in Disaster Response

    abstract: Social media has become a primary platform for real-time information sharing among users. News on social media spreads faster than through traditional outlets, and millions of users turn to this platform to receive the latest updates on major events, especially disasters. Social media bridges the gap between the people who are affected by disasters, volunteers who offer contributions, and first responders. On the other hand, social media is fertile ground for malicious users who purposefully disturb the relief processes facilitated on social media. These malicious users take advantage of social bots to overrun social media posts with fake images, rumors, and false information. This process causes distress and prevents actionable information from reaching the affected people. Social bots are automated accounts controlled by a malicious user, and these bots have become prevalent on social media in recent years. In spite of existing efforts towards understanding and removing bots on social media, there are at least two drawbacks associated with current bot detection algorithms: general-purpose bot detection methods are designed to be conservative, not labeling a user as a bot unless the algorithm is highly confident, and they overlook the effect of users who are manipulated by bots and (unintentionally) spread their content. This study is threefold. First, I design a Machine Learning model that uses the content and context of social media posts to detect actionable ones among them; it specifically focuses on tweets in which people ask for help after major disasters. Second, I focus on bots that can facilitate the spread of malicious content during disasters. I propose two methods for detecting bots on social media with a focus on the recall of the detection. Third, I study the characteristics of users who spread the content of malicious actors. 
These features have the potential to improve methods that detect malicious content such as fake news.
Dissertation/Thesis: Doctoral Dissertation, Computer Science 201

    Can we predict a riot? Disruptive event detection using Twitter

    In recent years, there has been increased interest in real-world event detection using publicly accessible data made available through Internet technology such as Twitter, Facebook, and YouTube. In these highly interactive systems, the general public are able to post real-time reactions to “real world” events, thereby acting as social sensors of terrestrial activity. Automatically detecting and categorizing events, particularly small-scale incidents, using streamed data is a non-trivial task but would be of high value to public safety organisations such as local police, who need to respond accordingly. To address this challenge, we present an end-to-end integrated event detection framework that comprises five main components: data collection, pre-processing, classification, online clustering, and summarization. The integration between classification and clustering enables events to be detected, as well as related smaller-scale “disruptive events,” smaller incidents that threaten social safety and security or could disrupt social order. We present an evaluation of the effectiveness of detecting events using a variety of features derived from Twitter posts, namely temporal, spatial, and textual content. We evaluate our framework on a large-scale, real-world dataset from Twitter. Furthermore, we apply our event detection system to a large corpus of tweets posted during the August 2011 riots in England. We use ground-truth data based on intelligence gathered by the London Metropolitan Police Service, which provides a record of actual terrestrial events and incidents during the riots, and show that our system can perform as well as terrestrial sources, and even better in some cases


    The acceptance and popularity of social media platforms for the dispersion and proliferation of news articles have led to the spread of questionable and untrusted information, in part due to the ease with which misleading content can be created and shared among communities. While prior research has attempted to automatically classify news articles and tweets as credible or non-credible, this work complements such research by proposing an approach that combines Natural Language Processing (NLP) and Deep Learning techniques such as Long Short-Term Memory (LSTM). Moreover, in the Information Systems paradigm, design science research methodology (DSRM) has become the major stream that focuses on building and evaluating an artifact to solve emerging problems. Hence, DSRM can accommodate deep learning-based models when adequate datasets are available. Two publicly available datasets that contain labeled news articles and tweets have been used to validate the proposed model's effectiveness. This work presents two distinct experiments, and the results demonstrate that the proposed model works well for both long-sequence news articles and short-sequence texts such as tweets. Finally, the findings suggest that sentiment, tagging, linguistic, syntactic, and text-embedding features have the potential to foster fake news detection by training the proposed model at various dimensionalities to learn the contextual meaning of the news content

    Discovering and Mitigating Social Data Bias

    abstract: Exabytes of data are created online every day. This deluge of data is nowhere more apparent than on social media. Naturally, finding ways to leverage this unprecedented source of human information is an active area of research. Social media platforms have become laboratories for conducting experiments about people at scales thought unimaginable only a few years ago. Researchers and practitioners use social media to extract actionable patterns such as where aid should be distributed in a crisis. However, the validity of these patterns relies on having a representative dataset. As this dissertation shows, the data collected from social media is seldom representative of the activity of the site itself, and less so of human activity. This means that the results of many studies are limited by the quality of the data they collect. The finding that social media data is biased inspires the main challenge addressed by this thesis. I introduce three sets of methodologies to correct for bias. First, I design methods to deal with data collection bias. I offer a methodology which can find bias within a social media dataset. This methodology works by comparing the collected data with other sources to find bias in a stream. The dissertation also outlines a data collection strategy which minimizes the amount of bias that will appear in a given dataset. It introduces a crawling strategy which mitigates the amount of bias in the resulting dataset. Second, I introduce a methodology to identify bots and shills within a social media dataset. This directly addresses the concern that the users of a social media site are not representative. Applying these methodologies allows the population under study on a social media site to better match that of the real world. Finally, the dissertation discusses perceptual biases, explains how they affect analysis, and introduces computational approaches to mitigate them. 
The results of the dissertation allow for the discovery and removal of different levels of bias within a social media dataset. This has important implications for social media mining, namely that the behavioral patterns and insights extracted from social media will be more representative of the populations under study.
Dissertation/Thesis: Doctoral Dissertation, Computer Science 201

    Sentiment Analysis, Quantification, and Shift Detection

    This dissertation focuses on event detection within streams of Tweets based on sentiment quantification. Sentiment quantification extends sentiment analysis, the analysis of the sentiment of individual documents, to analyze the sentiment of an aggregated collection of documents. Although the former has been widely researched, the latter has drawn less attention but offers greater potential to enhance current business intelligence systems. Indeed, knowing the proportion of positive and negative Tweets is much more valuable than knowing which individual Tweets are positive or negative. We also extend our sentiment quantification research to analyze the evolution of sentiment over time to automatically detect a shift in sentiment with respect to a topic or entity. We introduce a probabilistic approach to create a paired sentiment lexicon that models the positivity and the negativity of words separately. We show that such a lexicon can be used to predict the sentiment features for a Tweet more accurately than univalued lexicons. In addition, we show that employing these features with a multivariate Support Vector Machine (SVM) that optimizes the Hellinger Distance improves sentiment quantification accuracy versus other distance metrics. Furthermore, we introduce a means of representing sentiment over time through sentiment signals built from the aforementioned sentiment quantifier and show that sentiment shift can be detected using geometric change-point detection algorithms. Finally, our evaluation shows that, of the methods implemented, a two-dimensional Euclidean distance measure, analyzed using the first and second order statistical moments, was the most accurate in detecting sentiment shift
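
To make the quantification setting concrete, the sketch below computes the Hellinger distance between a true and a predicted class-prevalence distribution, which is the kind of quantity a quantifier is scored on. It is only an illustration of the distance itself, not the dissertation's multivariate SVM. The helper names, the label values, and the toy data are all assumptions made for this example.

```python
import math
from typing import Sequence


def hellinger_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of classes")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)


def class_prevalence(labels: Sequence[str], classes: Sequence[str]) -> list[float]:
    """Fraction of items carrying each class label, in the order of `classes`."""
    total = len(labels)
    return [sum(1 for label in labels if label == c) / total for c in classes]


# Hypothetical toy data: 10 tweets with true and predicted sentiment labels.
classes = ["pos", "neg"]
true_labels = ["pos"] * 6 + ["neg"] * 4
predicted_labels = ["pos"] * 5 + ["neg"] * 5

true_prev = class_prevalence(true_labels, classes)
pred_prev = class_prevalence(predicted_labels, classes)
distance = hellinger_distance(true_prev, pred_prev)
print(f"true={true_prev} predicted={pred_prev} Hellinger={distance:.4f}")
```

Note that the distance depends only on the aggregate proportions: a quantifier can misclassify individual Tweets and still score well if its errors cancel out, which is the distinction between quantification and per-document classification drawn in the abstract.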