3,112 research outputs found

    Efficient Distributed Outlier Detection in Data Streams

    Get PDF
    Anomaly detection is one of the major data mining tasks in modern applications. An element that shows significant deviation from the "usual" behavior is marked as an outlier. This means that this element either corresponds to noise or it requires more careful examination because it may be important. Also, many clustering algorithms are very sensitive to outliers. In any case, outliers must be identified and explored further, meaning that efficient outlier mining techniques are required. In this paper, we focus on distributed density-based outlier detection over multi-dimensional data streams. In particular, we focus on the approximation method for computing the Local Correlation Integral (LOCI) of multi-dimensional points. Each object p is assigned a score score(p) which represents the outlier score of p. Thus, one can select the top-k elements from the dataset that have the highest outlier scores. Our proposal has been implemented in Apache Spark using Scala and experiments have been conducted in a physical cluster running Apache Hadoop 2.7 and Apache Spark 2.4.0. Performance evaluation results demonstrate that the proposed algorithm is efficient and scalable and therefore it can be used to mine outliers in large distributed datasets

    Event detection in location-based social networks

    Get PDF
    With the advent of social networks and the rise of mobile technologies, users have become ubiquitous sensors capable of monitoring various real-world events in a crowd-sourced manner. Location-based social networks have proven to be faster than traditional media channels in reporting and geo-locating breaking news, i.e. Osama Bin Laden’s death was first confirmed on Twitter even before the announcement from the communication department at the White House. However, the deluge of user-generated data on these networks requires intelligent systems capable of identifying and characterizing such events in a comprehensive manner. The data mining community coined the term, event detection , to refer to the task of uncovering emerging patterns in data streams . Nonetheless, most data mining techniques do not reproduce the underlying data generation process, hampering to self-adapt in fast-changing scenarios. Because of this, we propose a probabilistic machine learning approach to event detection which explicitly models the data generation process and enables reasoning about the discovered events. With the aim to set forth the differences between both approaches, we present two techniques for the problem of event detection in Twitter : a data mining technique called Tweet-SCAN and a machine learning technique called Warble. We assess and compare both techniques in a dataset of tweets geo-located in the city of Barcelona during its annual festivities. Last but not least, we present the algorithmic changes and data processing frameworks to scale up the proposed techniques to big data workloads.This work is partially supported by Obra Social “la Caixa”, by the Spanish Ministry of Science and Innovation under contract (TIN2015-65316), by the Severo Ochoa Program (SEV2015-0493), by SGR programs of the Catalan Government (2014-SGR-1051, 2014-SGR-118), Collectiveware (TIN2015-66863-C2-1-R) and BSC/UPC NVIDIA GPU Center of Excellence.We would also like to thank the reviewers for their constructive feedback.Peer ReviewedPostprint (author's final draft

    DENCAST: distributed density-based clustering for multi-target regression

    Get PDF
    Recent developments in sensor networks and mobile computing led to a huge increase in data generated that need to be processed and analyzed efficiently. In this context, many distributed data mining algorithms have recently been proposed. Following this line of research, we propose the DENCAST system, a novel distributed algorithm implemented in Apache Spark, which performs density-based clustering and exploits the identified clusters to solve both single- and multi-target regression tasks (and thus, solves complex tasks such as time series prediction). Contrary to existing distributed methods, DENCAST does not require a final merging step (usually performed on a single machine) and is able to handle large-scale, high-dimensional data by taking advantage of locality sensitive hashing. Experiments show that DENCAST performs clustering more efficiently than a state-of-the-art distributed clustering algorithm, especially when the number of objects increases significantly. The quality of the extracted clusters is confirmed by the predictive capabilities of DENCAST on several datasets: It is able to significantly outperform (p-value <0.05<0.05 ) state-of-the-art distributed regression methods, in both single and multi-target settings

    Noise Resilient Learning for Attack Detection in Smart Grid Pmu Infrastructure

    Get PDF
    Falsified data from compromised Phasor Measurement Units (PMUs) in a smart grid induce Energy Management Systems (EMS) to have an inaccurate estimation of the state of the grid, disrupting various operations of the power grid. Moreover, the PMUs deployed at the distribution layer of a smart grid show dynamic fluctuations in their data streams, which make it extremely challenging to design effective learning frameworks for anomaly-based attack detection. In this paper, we propose a noise resilient learning framework for anomaly-based attack detection specifically for distribution layer PMU infrastructure, that show real time indicators of data falsifications attacks while offsetting the effect of false alarms caused by the noise. Specifically, we propose a feature extraction framework that uses some Pythagorean Means of the active power from a cluster of PMUs, reducing multi-dimensional nature of the PMU data streams via quick big data summarization. We also propose a robust and noise resilient methodology for learning thresholds based on generalized robust estimation theory of our invariant feature. We experimentally validate our approach and demonstrate improved reliability performance using two completely different datasets collected from real distribution level PMU infrastructures

    A Clustering System for Dynamic Data Streams Based on Metaheuristic Optimisation

    Get PDF
    This article presents the Optimised Stream clustering algorithm (OpStream), a novel approach to cluster dynamic data streams. The proposed system displays desirable features, such as a low number of parameters and good scalability capabilities to both high-dimensional data and numbers of clusters in the dataset, and it is based on a hybrid structure using deterministic clustering methods and stochastic optimisation approaches to optimally centre the clusters. Similar to other state-of-the-art methods available in the literature, it uses “microclusters” and other established techniques, such as density based clustering. Unlike other methods, it makes use of metaheuristic optimisation to maximise performances during the initialisation phase, which precedes the classic online phase. Experimental results show that OpStream outperforms the state-of-the-art methods in several cases, and it is always competitive against other comparison algorithms regardless of the chosen optimisation method. Three variants of OpStream, each coming with a different optimisation algorithm, are presented in this study. A thorough sensitive analysis is performed by using the best variant to point out OpStream’s robustness to noise and resiliency to parameter change
    • …
    corecore