
    Local Subspace-Based Outlier Detection using Global Neighbourhoods

    Outlier detection in high-dimensional data is a challenging yet important task, as it has applications in, e.g., fraud detection and quality control. State-of-the-art density-based algorithms perform well because they 1) take the local neighbourhoods of data points into account and 2) consider feature subspaces. In highly complex and high-dimensional data, however, existing methods are likely to overlook important outliers because they do not explicitly take into account that the data is often a mixture distribution of multiple components. We therefore introduce GLOSS, an algorithm that performs local subspace outlier detection using global neighbourhoods. Experiments on synthetic data demonstrate that GLOSS more accurately detects local outliers in mixed data than its competitors. Moreover, experiments on real-world data show that our approach identifies relevant outliers overlooked by existing methods, confirming that one should keep an eye on the global perspective even when doing local outlier detection. Comment: Short version accepted at IEEE BigData 201

    A survey of outlier detection methodologies

    Outlier detection has been used for centuries to detect and, where appropriate, remove anomalous observations from data. Outliers arise due to mechanical faults, changes in system behaviour, fraudulent behaviour, human error, instrument error or simply through natural deviations in populations. Their detection can identify system faults and fraud before they escalate with potentially catastrophic consequences. It can also identify errors and remove their contaminating effect on the data set, thereby purifying the data for processing. The original outlier detection methods were arbitrary, but principled and systematic techniques are now used, drawn from the full gamut of Computer Science and Statistics. In this paper, we present a survey of contemporary techniques for outlier detection. We identify their respective motivations and distinguish their advantages and disadvantages in a comparative review.

    On Algorithms Selection for Unsupervised Anomaly Detection


    LSCP: Locally Selective Combination in Parallel Outlier Ensembles

    In unsupervised outlier ensembles, the absence of ground truth makes the combination of base outlier detectors a challenging task. Specifically, existing parallel outlier ensembles lack a reliable way of selecting competent base detectors during model combination, which affects accuracy and stability. In this paper, we propose a framework, called Locally Selective Combination in Parallel Outlier Ensembles (LSCP), which addresses this issue by defining a local region around a test instance using the consensus of its nearest neighbors in randomly selected feature subspaces. The top-performing base detectors in this local region are selected and combined as the model's final output. Four variants of the LSCP framework are compared with seven widely used parallel frameworks. Experimental results demonstrate that one of these variants, LSCP_AOM, consistently outperforms baselines on the majority of twenty real-world datasets. Comment: Proceedings of the 2019 SIAM International Conference on Data Mining (SDM)
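    The selection step described above can be illustrated in a few lines: score each base detector in the neighbourhood of a test point against a pseudo ground truth (here, the average of all detectors' scores in that region) and trust the locally most competent detector. The sketch below is a simplified illustration, not the authors' reference implementation: it builds the local region with plain nearest neighbours in the full feature space rather than the paper's consensus over random feature subspaces, it keeps only the single best detector (in the spirit of an LSCP_A-style variant rather than LSCP_AOM), and the choice of base detectors (scikit-learn's LOF and Isolation Forest) is arbitrary.

```python
# Simplified sketch of locally selective detector combination (not the LSCP reference code).
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors

rng = np.random.RandomState(42)
X_train = rng.randn(500, 10)
X_test = np.vstack([rng.randn(20, 10), rng.randn(5, 10) * 4 + 6])  # last 5 rows: outliers

# Two arbitrary base detectors; any detector exposing decision scores would do.
lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train)
iforest = IsolationForest(random_state=0).fit(X_train)
detectors = [lof, iforest]

# Outlier scores of each detector on the training data (negated so higher = more outlying).
train_scores = np.column_stack([-d.decision_function(X_train) for d in detectors])
nn = NearestNeighbors(n_neighbors=30).fit(X_train)

def locally_selective_score(x):
    """Score one test point with the locally most competent base detector."""
    local_idx = nn.kneighbors(x.reshape(1, -1), return_distance=False)[0]
    local_scores = train_scores[local_idx]          # shape: (30, n_detectors)
    pseudo_target = local_scores.mean(axis=1)       # pseudo ground truth in the local region
    competence = [pearsonr(local_scores[:, j], pseudo_target)[0]
                  for j in range(local_scores.shape[1])]
    best = int(np.argmax(competence))               # most competent detector locally
    return -detectors[best].decision_function(x.reshape(1, -1))[0]

scores = np.array([locally_selective_score(x) for x in X_test])
print(scores.round(2))                              # the last 5 scores stand out
```

    A maintained implementation of LSCP itself is available in the PyOD library (pyod.models.lscp).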

    Estimating Dependency, Monitoring and Knowledge Discovery in High-Dimensional Data Streams

    Data Mining – known as the process of extracting knowledge from massive data sets – has a phenomenal impact on our society, and now affects nearly every aspect of our lives: from the layout of our local grocery store, to the ads and product recommendations we receive, the availability of treatments for common diseases, the prevention of crime, and the efficiency of industrial production processes. However, Data Mining remains difficult when (1) data is high-dimensional, i.e., has many attributes, and when (2) data comes as a stream. Extracting knowledge from high-dimensional data streams is particularly difficult because one must cope with two orthogonal sets of challenges. On the one hand, the effects of the so-called "curse of dimensionality" bog down the performance of statistical methods and lead to increasingly complex Data Mining problems. On the other hand, the statistical properties of data streams may evolve in unexpected ways, a phenomenon known in the community as "concept drift". Thus, one needs to update one's knowledge about the data over time, i.e., to monitor the stream. While previous work addresses high-dimensional data sets and data streams to some extent, the intersection of both has received much less attention. Nevertheless, extracting knowledge in this setting is advantageous for many industrial applications: identifying patterns from high-dimensional data streams in real time may lead to larger production volumes or reduce operational costs. The goal of this dissertation is to bridge this gap.

    We first focus on dependency estimation, a fundamental task of Data Mining. Typically, one estimates dependency by quantifying the strength of statistical relationships. We identify the requirements for dependency estimation in high-dimensional data streams and propose a new estimation framework, Monte Carlo Dependency Estimation (MCDE), which fulfils them all. We show that MCDE leads to efficient dependency monitoring. Then, we generalise the task of monitoring by introducing the Scaling Multi-Armed Bandit (S-MAB) algorithms, which extend the Multi-Armed Bandit (MAB) model. We show that our algorithms can efficiently monitor statistics by leveraging user-specific criteria.

    Finally, we describe applications of our contributions to Knowledge Discovery. We propose an algorithm, Streaming Greedy Maximum Random Deviation (SGMRD), which exploits our new methods to extract patterns, e.g., outliers, in high-dimensional data streams. We also present a new approach, which we name kj-Nearest Neighbours (kj-NN), to detect outlying documents within massive text corpora. We support our algorithmic contributions with theoretical guarantees, as well as extensive experiments on both synthetic and real-world data, and we demonstrate the benefits of our methods in real-world use cases.

    Overall, this dissertation establishes fundamental tools for Knowledge Discovery in high-dimensional data streams, which help with many applications in industry, e.g., anomaly detection or predictive maintenance. To facilitate the application of our results and future research, we publicly release our implementations, experiments, and benchmark data via open-source platforms.
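    To make the dependency-estimation idea concrete, the following is a minimal sketch of a Monte Carlo contrast in the spirit of MCDE: repeatedly condition on random slices of all but one dimension, compare the conditional distribution of the remaining dimension with its marginal via a statistical test, and average the test statistics. The function name mc_dependency, the choice of the Kolmogorov-Smirnov test, and all parameter values are illustrative assumptions; the dissertation's estimators (e.g., the MWP instantiation of MCDE) differ in their test and slicing details.

```python
# Minimal Monte Carlo contrast for dependency estimation (illustrative, not the MCDE estimator).
import numpy as np
from scipy.stats import ks_2samp

def mc_dependency(X, n_iterations=200, slice_fraction=0.5, seed=0):
    """Return a dependency estimate in [0, 1] for a sample X of shape (n, d)."""
    rng = np.random.RandomState(seed)
    n, d = X.shape
    # Slice each conditioning dimension so roughly slice_fraction of the points survive overall.
    per_dim = max(2, int(n * slice_fraction ** (1.0 / (d - 1))))
    stats = []
    for _ in range(n_iterations):
        ref = rng.randint(d)                       # reference dimension
        mask = np.ones(n, dtype=bool)
        for j in range(d):
            if j == ref:
                continue
            start = rng.randint(n - per_dim + 1)   # random contiguous slice along dimension j
            keep = np.argsort(X[:, j])[start:start + per_dim]
            dim_mask = np.zeros(n, dtype=bool)
            dim_mask[keep] = True
            mask &= dim_mask
        if mask.sum() < 2:
            continue
        # Contrast the conditional distribution of the reference dimension with its marginal.
        stat, _ = ks_2samp(X[mask, ref], X[:, ref])
        stats.append(stat)
    return float(np.mean(stats)) if stats else 0.0

rng = np.random.RandomState(1)
x = rng.rand(1000)
independent = np.column_stack([x, rng.rand(1000)])
dependent = np.column_stack([x, x + 0.05 * rng.randn(1000)])
print(mc_dependency(independent))   # close to 0: slicing barely changes the distribution
print(mc_dependency(dependent))     # clearly larger: conditioning narrows the distribution
```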

    Outlier detection and robust normal-curvature estimation in mobile laser scanning 3D point cloud data

    This paper proposes two robust statistical techniques for outlier detection and for robust estimation of saliency features, such as surface normal and curvature, in laser scanning 3D point cloud data. One is based on a robust z-score and the other uses a Mahalanobis-type robust distance. The methods couple the ideas of point-to-plane orthogonal distance and local surface point consistency to get Maximum Consistency with Minimum Distance (MCMD). The methods estimate the best-fit plane based on the most probable outlier-free and most consistent point set in a local neighbourhood. The normal and curvature obtained from this best-fit plane are then highly robust to noise and outliers. Experiments are performed to show the performance of the algorithms compared to several existing well-known methods (from computer vision, data mining, machine learning and statistics) using synthetic and real laser scanning datasets of complex (planar and non-planar) objects. Results for plane fitting, denoising, sharp-feature preservation and segmentation are significantly improved. The algorithms are demonstrated to be significantly faster, more accurate and more robust. Quantitatively, for a sample size of 50 with 20% outliers, the proposed MCMD_Z is approximately 5, 15 and 98 times faster than the existing methods: uLSIF, RANSAC and RPCA, respectively. The proposed MCMD_MD method can tolerate 75% clustered outliers, whereas RPCA and RANSAC can only tolerate 47% and 64% outliers, respectively. In terms of outlier detection, for the same dataset, MCMD_Z has an accuracy of 99.72%, a 0.4% false positive rate and a 0% false negative rate; for RPCA, RANSAC and uLSIF, the accuracies are 97.05%, 47.06% and 94.54%, respectively, and their misclassification rates are higher than those of the proposed methods. The new methods have potential for local surface reconstruction, fitting, and other point cloud processing tasks.
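    The robust z-score variant lends itself to a short illustration: fit an initial plane to a local neighbourhood, compute point-to-plane orthogonal distances, flag points whose median/MAD-based z-score is large, and refit the plane on the consistent subset to obtain a robust normal and curvature. This is a generic sketch of that recipe, not the authors' MCMD_Z code; the threshold, the PCA-based plane fit and the surface-variation curvature proxy are assumptions made for the example.

```python
# Sketch of a robust (median/MAD z-score) local plane fit in the spirit of MCMD_Z
# (illustrative only; not the authors' algorithm or code).
import numpy as np

def robust_plane_normal(points, z_threshold=2.5):
    """Fit a plane to a local neighbourhood while rejecting outliers.

    points: (n, 3) array holding one local neighbourhood of the point cloud.
    Returns (normal, curvature, inlier_mask).
    """
    # Initial plane via PCA: the normal is the direction of smallest variance.
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]

    # Point-to-plane orthogonal distances and robust z-scores (median / MAD).
    dist = (points - centroid) @ normal
    mad = np.median(np.abs(dist - np.median(dist))) + 1e-12
    z = 0.6745 * (dist - np.median(dist)) / mad
    inliers = np.abs(z) < z_threshold

    # Refit on the consistent subset; normal and curvature are now robust to the outliers.
    centroid = points[inliers].mean(axis=0)
    _, s, vt = np.linalg.svd(points[inliers] - centroid)
    eig = s ** 2
    curvature = eig[-1] / eig.sum()     # surface-variation proxy for curvature
    return vt[-1], curvature, inliers

rng = np.random.RandomState(0)
plane_pts = np.column_stack([rng.rand(50), rng.rand(50), 0.01 * rng.randn(50)])
gross = rng.rand(10, 3) + np.array([0.0, 0.0, 1.0])      # 20% gross outliers above the plane
normal, curvature, mask = robust_plane_normal(np.vstack([plane_pts, gross]))
print(normal.round(3), round(curvature, 4), int(mask.sum()))
```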

    Explainable Contextual Anomaly Detection using Quantile Regression Forests

    Traditional anomaly detection methods aim to identify objects that deviate from most other objects by treating all features equally. In contrast, contextual anomaly detection methods aim to detect objects that deviate from other objects within a context of similar objects by dividing the features into contextual features and behavioral features. In this paper, we develop connections between dependency-based traditional anomaly detection methods and contextual anomaly detection methods. Based on resulting insights, we propose a novel approach to inherently interpretable contextual anomaly detection that uses Quantile Regression Forests to model dependencies between features. Extensive experiments on various synthetic and real-world datasets demonstrate that our method outperforms state-of-the-art anomaly detection methods in identifying contextual anomalies in terms of accuracy and interpretability. Comment: Manuscript submitted to Data Mining and Knowledge Discovery in October 2022 for possible publication. This is the revised version submitted in April 202
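    The contextual set-up described above can be sketched as follows: regress a behavioral feature on the contextual features with quantile models and flag objects whose observed behavior falls outside the predicted central interval. Since scikit-learn ships no Quantile Regression Forest, the sketch substitutes gradient boosting with quantile loss; the feature split, the 5%/95% interval, and the interval-based score are illustrative choices, not the method proposed in the paper.

```python
# Sketch of contextual anomaly scoring with quantile regression; the paper uses
# Quantile Regression Forests, here sklearn's quantile gradient boosting stands in.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
context = rng.uniform(0, 10, size=(500, 1))         # contextual feature(s)
behavior = 2.0 * context[:, 0] + rng.randn(500)     # behavioral feature
behavior[:5] += 15.0                                 # inject 5 contextual anomalies

# Model the conditional 5% and 95% quantiles of the behavior given the context.
lo = GradientBoostingRegressor(loss="quantile", alpha=0.05, random_state=0).fit(context, behavior)
hi = GradientBoostingRegressor(loss="quantile", alpha=0.95, random_state=0).fit(context, behavior)
q_lo, q_hi = lo.predict(context), hi.predict(context)

# A point is contextually anomalous if its behavior leaves the predicted interval;
# the distance to the interval, scaled by the interval width, gives a graded score.
width = np.maximum(q_hi - q_lo, 1e-9)
score = np.clip(np.maximum(behavior - q_hi, q_lo - behavior) / width, 0.0, None)
print("most anomalous indices:", np.argsort(-score)[:5])   # recovers the injected rows 0-4
```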