233 research outputs found

    Randomizing Ensemble-based approaches for Outlier

    Data volumes grow dramatically every day, creating a need to detect abnormal behaviors that can seriously harm our systems. Outlier detection is the process of identifying outlying activities that diverge from the rest of the data. This process, an integral part of the data mining field, has recently attracted substantial interest from the data mining community. An outlier is a data point that deviates significantly from, and appears inconsistent with, the other data members. Ensemble-based outlier detection is a line of research that aims to reduce a model's dependence on particular datasets or data locality by making the data mining procedure more robust. The key principle of an ensemble approach is to combine individual detection results, which do not contain the same list of outliers, in order to reach a consensus finding. In this paper, we propose a novel strategy for constructing a randomized ensemble for outlier detection. The approach extends the heuristic greedy ensemble construction previously developed by the research community. We focus on the core components of constructing an ensemble-based algorithm for outlier detection. Randomization is introduced by modifying the pseudocode of the greedy ensemble and implementing the change in the corresponding Java code on the ELKI data-mining platform. The key purpose of our approach is to improve the greedy ensemble and to overcome its local maxima problem. To induce diversity, the search is initialized with a random outlier detector from the pool of detectors. Finally, the paper provides strong insights into the ongoing work on our randomized ensemble-based approach for outlier detection. Empirical results indicate that, because diversity is induced by employing various outlier detection algorithms, the randomized ensemble approach performs better than a single outlier detector.
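
The idea of starting a greedy search from a randomly chosen detector can be sketched as follows. This is a minimal illustration, not the authors' ELKI/Java implementation: the function names, the `target` vector (a hypothetical pseudo ground truth), and the use of Pearson correlation as the selection criterion are all assumptions made for the sketch, since the abstract does not specify them.

```python
import random
from statistics import mean


def pearson(a, b):
    """Pearson correlation of two equal-length score vectors (0.0 if either is constant)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / (va * vb) ** 0.5


def combine(score_lists):
    """Combine detectors by averaging their scores point by point."""
    return [mean(col) for col in zip(*score_lists)]


def randomized_greedy_ensemble(scores, target, seed=None):
    """Build an ensemble starting from a random detector.

    scores: dict mapping a detector name to its list of normalized scores.
    target: hypothetical pseudo ground-truth vector (an assumption of this sketch).
    A detector is added only if it raises the correlation between the
    combined scores and the target.
    """
    rng = random.Random(seed)
    names = sorted(scores)
    start = rng.choice(names)  # the randomized initialization
    selected = [start]
    best = pearson(combine([scores[start]]), target)
    improved = True
    while improved:
        improved = False
        for name in names:
            if name in selected:
                continue
            trial = pearson(combine([scores[n] for n in selected + [name]]), target)
            if trial > best:
                selected.append(name)
                best = trial
                improved = True
    return selected, best
```

Running the function with several different seeds and keeping the best-scoring ensemble is one plausible way to use the random start against local maxima; whether the paper does this is not stated in the abstract.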

    Data Stream Clustering for Real-Time Anomaly Detection: An Application to Insider Threats

    Insider threat detection is an emerging concern for academia, industry, and governments due to the growing number of insider incidents in recent years. The continuous streaming of unbounded data from various sources in an organisation, typically at high velocity, leads to a typical Big Data computational problem. The malicious insider threat refers to anomalous behaviours (outliers) that deviate from the normal baseline of a data stream. The absence of previously logged activities executed by users turns insider threat detection into an unsupervised anomaly detection problem over a data stream. A common shortcoming of existing data mining approaches to insider threat detection is the high number of false alarms, or false positives (FPs). To handle the Big Data issue and to address this shortcoming, we propose a streaming anomaly detection approach, namely Ensemble of Random subspace Anomaly detectors In Data Streams (E-RAIDS), for insider threat detection. E-RAIDS learns an ensemble of p established outlier detection techniques [Micro-cluster-based Continuous Outlier Detection (MCOD) or Anytime Outlier Detection (AnyOut)], which employ clustering over continuous data streams. Each of the p models learns from a random feature subspace to detect local outliers, which might not be detected over the whole feature space. E-RAIDS introduces an aggregate component that combines the results from the p feature subspaces in order to decide whether to generate an alarm at each window iteration. The merit of E-RAIDS is that it defines a survival factor and a vote factor to address the high number of FPs. Experiments on E-RAIDS-MCOD and E-RAIDS-AnyOut are carried out on synthetic data sets, including malicious insider threat scenarios generated at Carnegie Mellon University, to test the effectiveness of voting feature subspaces and the capability to detect (more than one)-behaviour-all-threat in real-time. The results show that E-RAIDS-MCOD reports a higher F1 measure and fewer false alarms (FPs = 0) than E-RAIDS-AnyOut, and that it detects approximately all the insider threats in real-time.
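
The aggregation step described above can be illustrated with a small sketch. It assumes one possible reading of the two factors, which the abstract does not define: a subspace "votes" once some point has stayed flagged for `survival_factor` consecutive windows, and an alarm is raised when at least `vote_factor` subspaces vote. The class and function names are hypothetical, and the base detectors (MCOD or AnyOut) are not implemented; their per-window flagged point ids are taken as input.

```python
import random
from collections import defaultdict


def random_subspaces(n_features, p, size, seed=None):
    """Draw p random feature subsets, each containing `size` feature indices."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_features), size)) for _ in range(p)]


class VoteAggregator:
    """Combine per-subspace outlier flags into one alarm decision per window."""

    def __init__(self, p, survival_factor, vote_factor):
        self.survival_factor = survival_factor
        self.vote_factor = vote_factor
        # One streak counter per subspace: point id -> consecutive flagged windows.
        self.streaks = [defaultdict(int) for _ in range(p)]

    def update(self, flagged_per_subspace):
        """flagged_per_subspace: list of p sets of point ids flagged in this window.

        Returns True if an alarm should be generated for the window.
        """
        votes = 0
        for streak, flagged in zip(self.streaks, flagged_per_subspace):
            # A point that is no longer flagged loses its streak.
            for pid in list(streak):
                if pid not in flagged:
                    del streak[pid]
            for pid in flagged:
                streak[pid] += 1
            # The subspace votes if any point survived long enough.
            if any(count >= self.survival_factor for count in streak.values()):
                votes += 1
        return votes >= self.vote_factor
```

Requiring both persistence across windows and agreement across subspaces is what suppresses one-off flags in this sketch, which is the intuition behind reducing false alarms.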

    Homophily Outlier Detection in Non-IID Categorical Data

    Most existing outlier detection methods assume that the outlier factors (i.e., outlierness scoring measures) of data entities (e.g., feature values and data objects) are Independent and Identically Distributed (IID). This assumption does not hold in real-world applications, where the outlierness of different entities depends on each other and/or is drawn from different probability distributions (non-IID). As a result, important outliers that are too subtle to be identified without considering the non-IID nature may go undetected. The issue is intensified in more challenging contexts, e.g., high-dimensional data with many noisy features. This work introduces a novel outlier detection framework and two instances of it to identify outliers in categorical data by capturing non-IID outlier factors. Our approach first defines distribution-sensitive outlier factors and incorporates them, together with their interdependence, into a value-value graph-based representation. It then models an outlierness propagation process in the value graph to learn the outlierness of feature values. The learned value outlierness allows for either direct outlier detection or outlying feature selection. The graph representation and mining approach is employed here to capture the rich non-IID characteristics. Our empirical results on 15 real-world data sets with different levels of data complexity show that (i) the proposed outlier detection methods significantly outperform five state-of-the-art methods at the 95%/99% confidence level, achieving 10%-28% AUC improvement on the 10 most complex data sets; and (ii) the proposed feature selection methods significantly outperform three competing methods in enabling subsequent outlier detection by two different existing detectors.
    Comment: To appear in Data Mining and Knowledge Discovery Journal
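
To make the idea of propagating outlierness over a value-value graph concrete, here is a small sketch. It is a stand-in, not the paper's framework: it uses a simple propagation-with-restart update, a frequency-based initial factor (deviation from the feature's most frequent value), and co-occurrence counts as edge weights. All of these choices, and the function names, are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def propagate_value_outlierness(rows, alpha=0.5, iters=50):
    """Score each (feature index, value) node of a categorical data set.

    rows: list of equal-length tuples of categorical values.
    Each node blends its own rarity with the co-occurrence-weighted
    average score of the values it appears with.
    """
    n_features = len(rows[0])
    freq = Counter((f, row[f]) for row in rows for f in range(n_features))

    # Edge weights: how often two values of different features co-occur.
    co = defaultdict(Counter)
    for row in rows:
        for f, g in combinations(range(n_features), 2):
            u, v = (f, row[f]), (g, row[g])
            co[u][v] += 1
            co[v][u] += 1

    # Initial factor: relative gap to the most frequent value of the same feature.
    mode = Counter()
    for (f, _), count in freq.items():
        mode[f] = max(mode[f], count)
    base = {v: (mode[v[0]] - count) / mode[v[0]] for v, count in freq.items()}

    score = dict(base)
    for _ in range(iters):
        new = {}
        for v in freq:
            total = sum(co[v].values())
            if total:
                neigh = sum(w * score[u] for u, w in co[v].items()) / total
            else:
                neigh = score[v]  # no neighbours (single-feature data)
            new[v] = (1 - alpha) * base[v] + alpha * neigh
        score = new
    return score


def object_scores(rows, score):
    """Score a data object as the sum of the scores of its feature values."""
    return [sum(score[(f, val)] for f, val in enumerate(row)) for row in rows]
```

In this sketch a frequent value that mostly co-occurs with rare values inherits some of their outlierness, which is a simple form of the interdependence between outlier factors that the abstract refers to.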