
    Local News And Event Detection In Twitter

    Twitter, one of the most popular micro-blogging services, allows users to publish short messages, called tweets, on a wide variety of subjects such as news, events, stories, ideas, and opinions. The popularity of Twitter arises, to some extent, from its capability of letting users promptly and conveniently contribute tweets to convey diverse information. Specifically, with people discussing what is happening in the real world by posting tweets, Twitter captures invaluable information about real-world news and events, spanning a wide scale from large national or international stories, like a presidential election, to small local stories, such as a local farmers market. Detecting and extracting small news and events for a local place is a challenging problem and is the focus of this thesis. In particular, we explore several directions to detect and extract local news and events using tweets in Twitter: a) how to identify locally influential people on Twitter as potential news seeders; b) how to recognize unusualness in tweet volume as a signal of potential local events; and c) how to overcome the data sparsity of local tweets to detect more, and smaller, ongoing local news and events. Additionally, we try to uncover implicit correlations between location, time, and text in tweets by learning embeddings for them in a universal representation under the same semantic space.

    In the first part, we investigate how to measure the spatial influence of Twitter users through their interactions and thereby identify locally influential users, who in practice are usually good news and event seeders. To do this, we built a large-scale directed interaction graph of Twitter users. Such a graph allows us to exploit PageRank-based ranking procedures to select top locally influential people after incorporating geographical distance into the transition matrix used for the random walk.
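    A minimal sketch of that last idea, assuming an exponential distance decay with scale `sigma` (the thesis's exact weighting scheme may differ): interaction edges are damped by the geographic distance between the two users before the usual power iteration, so the walk prefers to stay local.

```python
import numpy as np

def distance_weighted_pagerank(adj, dist, alpha=0.85, sigma=50.0, iters=100):
    """adj[i, j] = 1 if user i interacts with user j; dist[i, j] in km."""
    w = adj * np.exp(-dist / sigma)              # damp long-distance edges
    rows = w.sum(axis=1, keepdims=True)
    # Row-normalize; users with no outgoing edges jump uniformly.
    p = np.divide(w, rows, out=np.full_like(w, 1.0 / len(w)), where=rows > 0)
    r = np.full(len(w), 1.0 / len(w))            # uniform starting distribution
    for _ in range(iters):                       # standard power iteration
        r = (1 - alpha) / len(w) + alpha * r @ p
    return r                                     # higher = more locally influential
```

    Ranking users by this score within a region then surfaces candidate news seeders for that place.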
    In the second part, we study how to recognize unusualness in tweet volume at a local place as a signal of potential ongoing local events. The intuition is that a sudden, abnormal change in the number of tweets at a location (e.g., a significant increase) may imply a local event. We therefore present DeLLe, a methodology for automatically Detecting Latest Local Events from geotagged tweet streams (i.e., tweets that contain GPS points). With the help of novel spatiotemporal tweet-count prediction models, DeLLe first finds unusual locations that have aggregated an unexpected number of tweets in the latest time period, and then calculates, for each such location, a ranking score to identify the ones most likely to have ongoing local events, addressing temporal burstiness, spatial burstiness, and topical coherence.

    In the third part, we explore how to overcome the data sparsity of local tweets when trying to discover more, and smaller, local news or events. Local tweets are those whose locations fall inside a local place. They are very sparse in Twitter, which hinders the detection of small local news or events that generate only a handful of tweets. We propose a system, called Firefly, that enhances the local live tweet stream by tracking the tweets of a large body of local people, and then performs locality-aware keyword-based clustering for event detection. The intuition is that local tweets are published by local people, so tracking their tweets naturally yields a source of local tweets. In practice, however, only 20% of Twitter users provide information about where they come from, so we propose a social-network-based geotagging procedure to estimate locations for users whose locations are missing.

    Finally, in order to discover correlations between location, time, and text in geotagged tweets, e.g., "find which locations are most related to the given topics" and "find which locations are similar to a given location", we present LeGo, a methodology for Learning embeddings of Geotagged tweets with respect to entities such as locations, time units (hour-of-day and day-of-week), and textual words in tweets. The resulting compact vector representations of these entities make it easy to measure the relatedness between locations, times, and words in tweets. LeGo comprises two working modes, cross-modal search (LeGo-CM) and location-similarity search (LeGo-LS), to answer these two types of queries. In LeGo-CM, we first build a graph of entities extracted from tweets, in which each edge carries the weight of co-occurrences between two entities. The embeddings of the graph nodes are then learned in the same latent space, guided by approximating the stationary residing probabilities between nodes, which are computed using personalized random walk procedures. In LeGo-LS, we additionally supplement edges between locations to capture their underlying spatial proximity and topical likeness, supporting location-similarity search queries.
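    A sketch of the graph-plus-walks idea behind LeGo. This stand-in trains DeepWalk-style skip-gram embeddings over random walks on the entity co-occurrence graph, rather than LeGo's own objective of approximating residing probabilities; the entity names are illustrative.

```python
import random
from collections import defaultdict
from gensim.models import Word2Vec  # gensim 4.x API

def build_graph(tweets):
    """tweets: iterable of entity lists, e.g. ['loc:dupont_circle', 'hour:18', 'word:jazz']."""
    g = defaultdict(list)
    for ents in tweets:
        for a in ents:
            for b in ents:
                if a != b:
                    g[a].append(b)   # repeated neighbors encode co-occurrence weight
    return g

def random_walks(g, num_walks=10, length=20, seed=0):
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in g:
            walk = [start]
            while len(walk) < length:
                walk.append(rng.choice(g[walk[-1]]))   # weighted by multiplicity
            walks.append(walk)
    return walks

g = build_graph([['loc:dupont_circle', 'hour:18', 'word:jazz'],
                 ['loc:dupont_circle', 'day:sat', 'word:market']])
model = Word2Vec(random_walks(g), vector_size=64, window=5, min_count=1, sg=1)
print(model.wv.most_similar('loc:dupont_circle'))  # neighbors in the shared space
```

    Because locations, time units, and words land in one space, both query types reduce to nearest-neighbor lookups.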

    On Relevance Filtering for Real-Time Tweet Summarization

    Real-time tweet summarization (RTS) systems require mechanisms for capturing relevant tweets, identifying novel tweets, and capturing timely tweets. In this thesis, we tackle the RTS problem with a main focus on relevance filtering. We experimented with different traditional retrieval models. Additionally, we propose two extensions to alleviate the sparsity and topic-drift challenges that affect relevance filtering. For sparsity, we propose leveraging word embeddings in Vector Space Model (VSM) term weighting, letting the system use semantic similarity alongside lexical matching. To mitigate the effect of topic drift, we exploit explicit relevance feedback to enhance the profile representation so that it copes with the topic's development in the stream over time. We conducted extensive experiments over three standard English TREC test collections built specifically for RTS. Although the extensions do not generally exhibit better performance, they are comparable to the baselines used. Moreover, we extended EveTAR, an Arabic tweets test collection for event detection, to support tasks that require novelty in the system's output. We collected novelty judgments using in-house annotators, used the collection to test our RTS system, and report preliminary results on EveTAR using different models of the system. This work was made possible by NPRP grants # NPRP 7-1313-1-245 and # NPRP 7-1330-2-483 from the Qatar National Research Fund (a member of Qatar Foundation).
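    One plausible way to realize the embedding-based term weighting described above (a sketch under assumed details, not necessarily the thesis's exact scheme): keep the lexical TF-IDF match, and add a discounted credit for profile terms whose best tweet-term match is only semantic.

```python
import numpy as np

def soft_match_score(profile_terms, tweet_terms, idf, emb, lam=0.5):
    """idf: term -> IDF weight; emb: term -> unit-normalized vector."""
    lexical = sum(idf.get(t, 0.0) for t in profile_terms if t in tweet_terms)
    semantic = 0.0
    for t in profile_terms:
        if t in tweet_terms or t not in emb:
            continue                          # exact matches credited above
        sims = [float(emb[t] @ emb[u]) for u in tweet_terms if u in emb]
        if sims:
            semantic += idf.get(t, 0.0) * max(0.0, max(sims))
    return lexical + lam * semantic           # lam trades lexical vs. semantic

# Toy demo with 2-d "embeddings" (hypothetical numbers).
emb = {w: v / np.linalg.norm(v) for w, v in
       {"quake": np.array([1.0, 0.2]), "earthquake": np.array([0.9, 0.3])}.items()}
idf = {"earthquake": 2.5}
print(soft_match_score(["earthquake"], ["quake"], idf, emb))  # > 0 via soft match
```

    Tweets scoring above a calibrated threshold would then be pushed to the user.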

    Enhancing the Performance of Text Mining

    The amount of text data produced in science, finance, social media, and medicine is growing at an unprecedented pace. Raw text data typically introduces major computational and analytical obstacles (e.g., extremely high dimensionality) to data mining and machine learning algorithms. Moreover, the growth in the size of text data makes the search process harder for information retrieval systems, so retrieving results relevant to users' queries becomes challenging. In addition, the availability of text data in different languages creates the need for new methods to analyze multilingual topics, helping policymakers in governmental and health systems make risk decisions and create policies in response to public health crises, natural disasters, and political or social movements. The goal of this thesis is to develop new methods that handle the computational and analytical problems of complex high-dimensional text data, to develop a new query expansion approach that enhances the performance of information retrieval systems, and to present new techniques for analyzing multilingual topics using a translation service.

    First, in the field of dimensionality reduction, we develop a new method for detecting and eliminating domain-based words. We use three different datasets and five classifiers to evaluate the performance of our approach before and after eliminating domain-based words, and we compare it with other feature selection methods. We find that the new approach improves the performance of the binary classifier while reducing the dimensionality of the feature space by 90%. It also reduces the execution time of the classifier and outperforms one of the feature selection methods.

    Second, in the field of information retrieval, we design and implement a method that integrates words from a current stream with external data sources in order to predict the occurrence of relevant words that have not yet appeared in the primary source. This algorithm enables the construction of new queries that effectively capture emergent events that a user may not have anticipated when initiating the data collection stream. The added value of the external data sources lies in predicting what has not yet happened, instead of relying only on the stream, which is limited to the information available at a specific time. We compare our method with two alternative approaches: a static approach that expands user queries with words extracted from a probabilistic topic model of the stream, and an emergent approach that reinforces user queries with emergent words extracted from the stream. Our method outperforms both, exhibiting particularly good results in identifying future emergent topics.
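    A simplified sketch of such external-source expansion (the association score here is an assumed PMI-flavored heuristic, not the thesis's algorithm): rank external-corpus words by how strongly they co-occur with the stream's current terms, then append the top candidates to the query.

```python
from collections import Counter
from itertools import combinations
import math

def expansion_candidates(stream_terms, external_docs, k=10):
    """external_docs: iterable of token lists drawn from the external sources."""
    cooc, freq = Counter(), Counter()
    for doc in external_docs:
        terms = set(doc)
        freq.update(terms)
        for pair in combinations(sorted(terms), 2):
            cooc[pair] += 1                      # document-level co-occurrence

    def assoc(w, s):
        # PMI-flavored association, penalizing globally frequent words.
        pair = tuple(sorted((w, s)))
        return math.log1p(cooc[pair]) - 0.5 * math.log1p(freq[w])

    scores = {w: sum(assoc(w, s) for s in stream_terms)
              for w in freq if w not in stream_terms}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

    Candidates absent from the stream but tightly associated with it are exactly the words that may surface in the stream next.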
    Third, in the field of multilingual text, we present a strategy to analyze the similarity between multilingual topics in English and Arabic tweets surrounding the 2020 COVID-19 pandemic. We make a descriptive comparison between topics in Arabic and English tweets about COVID-19, using tweets collected in the same way and filtered with the same keywords. We analyze the Twitter discussion to understand the evolution of topics over time and to reveal topic similarity across the datasets, using probabilistic topic modeling to identify and extract the key topics in both languages. We use two methods to analyze the similarity between multilingual topics. The first (the full-text topic modeling approach) translates all text to English and then runs topic modeling to find similar topics. The second (the term-based topic modeling approach) runs topic modeling on the untranslated text and then translates the top keywords of each topic to find similar topics. We find similar topics related to the COVID-19 pandemic covered in English and Arabic tweets during certain time intervals. Results indicate that the term-based approach can reduce cost compared to the full-text approach while finding comparable sets of similar topics: the computational time to translate only the terms is significantly lower than translating the full text.
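    A sketch of the term-based approach (using gensim's LDA; `translate` is a placeholder for whatever translation service is used): topic-model each language separately, translate only each topic's top keywords, then match topics across languages by keyword overlap.

```python
from gensim import corpora
from gensim.models import LdaModel

def top_terms(docs, num_topics=10, topn=10):
    """docs: lists of tokens in one language."""
    vocab = corpora.Dictionary(docs)
    bow = [vocab.doc2bow(d) for d in docs]
    lda = LdaModel(bow, num_topics=num_topics, id2word=vocab, passes=5)
    return [[w for w, _ in lda.show_topic(t, topn=topn)] for t in range(num_topics)]

def match_topics(ar_docs, en_docs, translate):
    ar_topics = [set(map(translate, terms)) for terms in top_terms(ar_docs)]
    en_topics = [set(terms) for terms in top_terms(en_docs)]
    pairs = []
    for i, a in enumerate(ar_topics):
        for j, e in enumerate(en_topics):
            jac = len(a & e) / len(a | e)    # Jaccard similarity of keyword sets
            if jac > 0:
                pairs.append((jac, i, j))
    return sorted(pairs, reverse=True)       # best cross-lingual matches first
```

    Only `num_topics * topn` keywords per language hit the translation service, which is where the cost saving over full-text translation comes from.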

    POISED: Spotting Twitter Spam Off the Beaten Paths

    Cybercriminals have found in online social networks a propitious medium to spread spam and malicious content. Existing techniques for detecting spam include predicting the trustworthiness of accounts and analyzing the content of these messages. However, advanced attackers can still successfully evade these defenses. Online social networks bring together people who have personal connections or share common interests, forming communities. In this paper, we first show that users within a networked community share some topics of interest. Moreover, content shared on these social networks tends to propagate according to the interests of people: dissemination paths emerge in which some communities post similar messages, based on those communities' interests. Spam and other malicious content, on the other hand, follow different spreading patterns. Following this insight, we present POISED, a system that leverages the differences in propagation between benign and malicious messages on social networks to identify spam and other unwanted content. We test our system on a dataset of 1.3M tweets collected from 64K users, and we show that our approach is effective in detecting malicious messages, reaching 91% precision and 93% recall. We also show that POISED's detection is more comprehensive than previous systems by comparing it to three state-of-the-art spam detection systems proposed by the research community; POISED significantly outperforms each of them. Moreover, through simulations, we show that POISED is effective in the early detection of spam messages and resilient against two well-known adversarial machine learning attacks.
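    A toy sketch of that core intuition (an assumed feature design, not the paper's implementation): score a message's propagation path by how typical its community-to-community transitions are among benign messages, so paths that keep crossing unrelated communities score low.

```python
from collections import Counter

def path_score(path, benign_paths):
    """path: list of community ids a message traversed, in order."""
    transitions = Counter()
    for p in benign_paths:
        transitions.update(zip(p, p[1:]))        # benign community transitions
    total = sum(transitions.values()) or 1
    score = 1.0
    for step in zip(path, path[1:]):
        score *= (transitions[step] + 1) / (total + 1)   # add-one smoothing
    return score                                 # low score => spam-like path

benign = [[0, 0, 1], [0, 0, 0], [1, 1, 0]]
print(path_score([0, 0, 1], benign) > path_score([2, 5, 9], benign))  # True
```

    In a real system, such path likelihoods would be one feature among several fed to a classifier.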

    An Association Rule Dynamics and Classification Approach to Event Detection and Tracking in Twitter

    Twitter is a microblogging application used for sending and retrieving instant on-line messages of not more than 140 characters. There has been a surge in Twitter activity since its launch in 2006, as well as a steady increase in event detection research on Twitter data (tweets) in recent years. With 284 million monthly active users, Twitter has continued to grow both in size and activity. The network is rapidly changing the way a global audience sources information, and it influences the process of journalism [Newman, 2009]. Twitter is now perceived as an information network in addition to being a social network, which explains why traditional news media follow activities on Twitter to enhance their news reports and updates. Knowing the significance of the network as an information dissemination platform, news media maintain Twitter accounts where they post their headlines together with links to the full stories on their websites. In some cases, Twitter users post breaking news on the network before it is published by traditional news media, which can be ascribed to those users' nearness to the location of events. The use of Twitter as a network for information dissemination, as well as for opinion expression, by different entities is now common. This has also brought the computational challenge of extracting newsworthy content from Twitter's noisy data. Given the enormous volume of data Twitter generates, users append the hashtag (#) symbol as a prefix to keywords in tweets. Hashtag labels describe the content of tweets, and they also make it easy to search for and read tweets of interest. The volume of Twitter streaming data makes it imperative to derive Topic Detection and Tracking methods to extract newsworthy topics from tweets. Since hashtags describe and enhance the readability of tweets, this research shows how the appropriate use of hashtag keywords in tweets can expose the temporal evolution of related real-life topics and consequently enhance Topic Detection and Tracking on the Twitter network. We chose to apply our method to Twitter because of its restricted number of characters per message, because it allows sharing data publicly, and, most importantly, because hashtags are an inherent component of Twitter. To this end, the aim of this research is to develop, implement, and validate a new approach that extracts newsworthy topics from the hashtags of tweets about real-life topics over a specified period using Association Rule Mining. We term our novel methodology Transaction-based Rule Change Mining (TRCM). TRCM is a system built on top of the Apriori method of Association Rule Mining that extracts patterns of change in the Association Rules of tweet hashtag keywords at different periods of time and maps the extracted keywords to related real-life topics or scenarios. To the best of our knowledge, the dynamics of Association Rules of hashtag co-occurrences have not been explored as a Topic Detection and Tracking method on Twitter. Applying Apriori to the hashtags present in tweets at two consecutive periods t and t + 1 produces two association rulesets, which represent rule evolution in the context of this research. A change in rules is discovered by matching every rule in the ruleset at time t with those in the ruleset at time t + 1.
    The changes are grouped into four identified rule types, namely 'New' rules, 'Unexpected Consequent' and 'Unexpected Conditional' rules, 'Emerging' rules, and 'Dead' rules. The four rule types represent different kinds of real-life topic evolution. For example, an emerging rule represents a very important occurrence such as breaking news, while an unexpected rule represents an unexpected twist in an ongoing topic. A new rule captures the dissimilarity between the rulesets at times t and t + 1, and a dead rule represents a topic that is no longer present on the Twitter network. TRCM reveals the dynamics of the Association Rules present in tweets and demonstrates the linkage between the different types of rule dynamics and targeted real-life topics/events. We conducted experimental studies on tweets from different domains, such as sports and politics, to test the effectiveness of our method, and we validated TRCM against carefully chosen ground truth. The outcomes of our experiments include: identification of the four rule dynamics in tweets' hashtags (New, Emerging, Unexpected, and Dead rules) using Association Rule Mining, which signify how news and events evolve in real-life scenarios; identification of rule evolution on the Twitter network using Rule Trend Analysis and Rule Trace; detection and tracking of topic evolution on Twitter using TRCM; and identification of how the peculiar features of each TRCM rule type affect its performance on real datasets.
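    A condensed sketch of the ruleset-matching step (simplified labels: the thesis distinguishes unexpected-consequent from unexpected-conditional rules and defines 'emerging' via changes in rule measures), using the mlxtend implementation of Apriori:

```python
from mlxtend.frequent_patterns import apriori, association_rules

def mine_rules(onehot_df, min_support=0.05, min_conf=0.5):
    """onehot_df: one row per tweet, one boolean column per hashtag (pandas)."""
    freq = apriori(onehot_df, min_support=min_support, use_colnames=True)
    rules = association_rules(freq, metric="confidence", min_threshold=min_conf)
    return set(zip(rules["antecedents"], rules["consequents"]))

def label_changes(rules_t, rules_t1):
    """Label each rule by how it changed between periods t and t + 1."""
    labels = {}
    for r in rules_t1 - rules_t:
        # 'unexpected' if one side matches an old rule, otherwise 'new'
        shared = any(r[0] == s[0] or r[1] == s[1] for s in rules_t)
        labels[r] = "unexpected" if shared else "new"
    for r in rules_t & rules_t1:
        labels[r] = "emerging"   # persists across both periods (simplified)
    for r in rules_t - rules_t1:
        labels[r] = "dead"       # vanished from the stream
    return labels
```

    Running `label_changes(mine_rules(df_t), mine_rules(df_t1))` over successive windows yields the rule dynamics that TRCM maps to real-life topic evolution.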

    Detecting, Modeling, and Predicting User Temporal Intention

    The content of social media has grown exponentially in recent years, and its role has evolved from narrating life events to actually shaping them. Unfortunately, content posted and shared on social networks is vulnerable and prone to loss or change, rendering the context associated with it (a tweet, post, status, or other item) meaningless. There is an inherent value in maintaining the consistency of such social records, as in some cases they take over the task of being the first draft of history: collections of these social posts narrate the pulse of the street during historic events, protests, riots, elections, wars, disasters, and more, as shown in this work. The user sharing a resource has an implicit temporal intent: either the state of the resource at the time of sharing, or the current state of the resource at the time of the reader's clicking. In this research, we propose a model to detect and predict this temporal intention, both of the author upon sharing content in the social network and of the reader upon resolving this content. To build the model, we first examine the three aspects of the problem: the resource, time, and the user.

    For the resource, we start by analyzing content on the live web and its persistence. We noticed that a portion of the resources shared in social media disappear, and further analysis revealed a relationship between this disappearance and time: we lose around 11% of the resources after one year of sharing and a steady 7% every following year. Turning to the public archives, our analysis reveals that not all posted resources are archived; even when they are, an average of 8% per year disappears from the archives, and in some cases the archived content is heavily damaged. These observations show that the archives are not well enough populated to consistently and reliably reconstruct a missing resource as it existed at the time of sharing. To analyze the concept of time, we devised several experiments to estimate the creation date of shared resources, developing Carbon Date, a tool that successfully estimated the correct creation dates for 76% of the test sets. Beyond a resource's creation, we also wanted to measure if and how it changes with time, so we conducted a longitudinal study on a dataset of very recently published tweet-resource pairs, recording observations hourly. We found that after just one hour, ~4% of the resources had changed by ≥30%, while after a day the rate slowed, with ~12% of the resources changed by ≥40%.
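    One way such percent-change figures could be computed for the longitudinal study (an assumed metric; the thesis may use a different similarity measure) is a sequence-similarity ratio between successive snapshots of a page's text:

```python
import difflib

def percent_changed(snapshot_a: str, snapshot_b: str) -> float:
    """How much the text changed between two snapshots, 0.0 (identical) to 100.0."""
    sim = difflib.SequenceMatcher(None, snapshot_a, snapshot_b).ratio()
    return 100.0 * (1.0 - sim)

hour0 = "City council announces new park downtown."
hour1 = "City council announces new park and library downtown."
print(percent_changed(hour0, hour1))  # e.g., flag resources scoring >= 30.0
```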
    For the third and final component of the problem, we conducted user behavioral analysis experiments and built a dataset of 1,124 instances manually labeled by test subjects. Temporal intention proved to be a difficult concept for average users to understand, so we developed our Temporal Intention Relevancy Model (TIRM) to transform the highly subjective temporal intention problem into the more easily understood idea of relevancy between a tweet and the resource it links to, along with the change of the resource through time. On our collected dataset, TIRM produced a significant 90.27% success rate. Furthermore, we extended TIRM to build a time-based model that predicts temporal intention change or steadiness at the time of posting with 77% accuracy, and we built a service API around this model to provide predictions, along with a few prototypes. Future tools could implement TIRM to assist users in pushing copies of shared resources into public web archives to ensure the integrity of the historical record. Additional tools could assist the mining of the existing social media corpus by dereferencing the intended version of a shared resource based on the intention strength and the time between tweeting and mining.