1,123 research outputs found

    Scalable Topic Detection Approaches from Twitter Streams

    Real-time topic detection in Twitter streams is an important task: it helps discover natural disasters in real time from users’ posts and helps political parties and companies understand users’ opinions and needs. In 2014, the number of active users on Twitter was reported to be more than 288 million, posting around 500 million tweets daily. Detecting topics from Twitter streams in real time is therefore a challenging task that needs scalable and efficient techniques to handle this large amount of data. In this work, we scale an Exemplar-based technique that detects topics from Twitter streams, where each detected topic is represented by one tweet (i.e., an exemplar). Using exemplar tweets to represent the detected topics makes these topics easier to interpret than representing them by uncorrelated terms, as other topic detection algorithms do. The approach is implemented using Apache Giraph and is extended here to efficiently support sliding windows. Experimental results on four datasets show that the optimized Giraph implementation achieves a speedup of up to nineteen times over the native implementation, while maintaining good quality of the detected topics. In addition, the Giraph Exemplar-based approach achieves the best topic recall and term precision against K-means, Latent Dirichlet Allocation (LDA), Non-negative Matrix Factorization (NMF) and Latent Semantic Analysis (LSA), while maintaining good term recall and running time. The approach is also deployed for detecting topics from real-time Twitter streams, and its scalability is demonstrated. Moreover, another clustering technique called Local Variance-based Clustering (LVC) is proposed in this thesis for detecting topics from Twitter streams. LVC defines the densities of data points based on their similarities. The proposed local variance measure is calculated from the variance of the data points’ similarity histogram and is shown to distinguish well between core, border, connecting and outlier points. Experimental results show that LVC outperforms spectral clustering and affinity propagation in clustering quality using the control charts, Ecoli and image datasets, while maintaining a good running time. In addition, results show that LVC can detect topics from Twitter with 15% higher topic recall and 3% higher term precision than DBSCAN.

    Crowdsourcing Cybersecurity: Cyber Attack Detection using Social Media

    Social media is often viewed as a sensor into various societal events such as disease outbreaks, protests, and elections. We describe the use of social media as a crowdsourced sensor to gain insight into ongoing cyber-attacks. Our approach detects a broad range of cyber-attacks (e.g., distributed denial of service (DDoS) attacks, data breaches, and account hijacking) in an unsupervised manner using just a limited fixed set of seed event triggers. A new query expansion strategy based on convolutional kernels and dependency parses helps model reporting structure and aids in identifying key event characteristics. Through a large-scale analysis over Twitter, we demonstrate that our approach consistently identifies and encodes events, outperforming existing methods. Comment: 13 single column pages, 5 figures, submitted to KDD 201

    Enhanced Heartbeat Graph for emerging event detection on Twitter using time series networks

    © 2019 Elsevier Ltd. With the increasing popularity of social media, Twitter has become one of the leading platforms for reporting events in real time. Detecting events from the Twitter stream requires complex techniques. Event-related trending topics consist of a group of words which successfully detect and identify events. Event detection techniques must be scalable and robust, so that they can deal with the huge volume and noise associated with social media. Existing event detection methods mostly rely on burstiness, mainly the frequency of words and their co-occurrences. However, burstiness sometimes dominates other relevant details in the data which could be equally significant. Moreover, the topological and temporal relationships in the data are often ignored. In this work, we propose a novel graph-based approach, called the Enhanced Heartbeat Graph (EHG), which detects events efficiently. EHG suppresses dominating topics in the subsequent data stream after their first detection. Experimental results on three real-world datasets (i.e., Football Association Challenge Cup Final, Super Tuesday, and the US Election 2012) show superior performance of the proposed approach in comparison to state-of-the-art techniques.

    Exemplar-based Kernel Preserving Embedding

    With the rapid increase of available data, it becomes computationally harder to extract useful information, especially in the case of high-dimensional data. Choosing a representative subset of the data can help overcome this challenge, as these representatives can be used by data analysts or presented to end users to give them a grasp of the nature and structure of the data. In this dissertation, an Exemplar-based approach for topic detection is first proposed, in which detected topics are represented using a few selected tweets. Using exemplar tweets instead of a set of keywords allows for an easy interpretation of the meaning of the detected topics. The approach is then extended to detect topics that emerge in new epochs of data. Experimental evaluation on benchmark Twitter datasets shows that the proposed topic detection approach achieves the best term precision, while maintaining good topic recall and running times compared to other approaches for topic detection. Moreover, the proposed emerging extension achieves higher topic recall with improved running times when compared to recent emerging topic detection approaches. To overcome the challenge of high-dimensional data, several techniques, like PCA and NMF, were proposed to embed high-dimensional data into a low-dimensional latent space. However, data represented in a latent space is difficult for data analysts to understand, and the information encoded in it is hard to grasp. In addition, these techniques do not take the relations between the data points into account. This motivated the development of other techniques like MDS, LLE and ISOMAP, which preserve the relations between the data instances but still use latent features. In this dissertation, a new embedding technique is proposed to mitigate the previous problems by projecting the data to a space described by a few points (i.e., the exemplars), which preserves the relations between the data points. The proposed method, Exemplar-based Kernel Preserving (EBEK) embedding, is shown theoretically to achieve the lowest reconstruction error of the kernel matrix. EBEK achieves a running time complexity that is linear in the number of samples. Using EBEK in the approximate nearest neighbor search task shows its ability to outperform related work by up to 60% in recall while maintaining a good running time. In addition, empirical evaluation on clustering shows that EBEK achieves higher NMI than LLE and NMF, by differences of up to 40% and 15% respectively. It also achieves cluster quality comparable to ISOMAP, with a difference of up to 3% in NMI and F-measure and a speedup of up to 15×. In addition, our interpretability experiments show that EBEK’s selected bases are more understandable than the latent bases in image datasets.

    A Semantic Modular Framework for Events Topic Modeling in Social Media

    The advancement of social media contributes to the growing amount of content that users share frequently. These platforms provide a sophisticated place for people to report various real-life events. Detecting these events with the help of natural language processing has received researchers' attention, and various algorithms have been developed for this goal. In this paper, we propose a Semantic Modular Model (SMM) consisting of 5 different modules, namely Distributional Denoising Autoencoder, Incremental Clustering, Semantic Denoising, Defragmentation, and Ranking and Processing. The proposed model aims to (1) cluster various documents and ignore the documents that might not contribute to the identification of events, and (2) identify more important and descriptive keywords. Compared to the state-of-the-art methods, the results show that the proposed model has a higher performance in identifying events with lower ranks and extracting keywords for more important events in three English Twitter datasets: FACup, SuperTuesday, and USElection. The proposed method outperformed the best reported results in the mean keyword-precision metric by 7.9%. Comment: 32 pages, 2 figures

    Discovering the core semantics of event from social media

    © 2015 Elsevier B.V. As social media platforms such as Twitter and Sina Weibo open up, large volumes of short texts are flooding the Web. This ocean of short texts dilutes the limited core semantics of an event in cyberspace with redundancy, noise and irrelevant content, which makes it difficult to discover the core semantics of the event. The major challenges include how to efficiently learn the semantic association distribution from small-scale association relations and how to maximize the coverage of the semantic association distribution with the minimum number of redundancy-free short texts. To address these issues, we explore a Markov random field based method for discovering the core semantics of an event. This method performs semantic collaborative computation to learn the association relation distribution and performs information gradient computation to discover k redundancy-free texts as the core semantics of the event. We evaluate our method by comparing it with two state-of-the-art methods on the TAC dataset and the microblog dataset. The results show that our method outperforms the other methods in extracting core semantics accurately and efficiently. The proposed method can be applied to automatic short text generation, event discovery and summarization for big data analysis.