
    Randomized outlier detection with trees

    Isolation forest (IF) is a popular outlier detection algorithm that isolates outlier observations from regular observations by building multiple random isolation trees. The average number of comparisons required to isolate a given observation can then be used as a measure of its outlierness. Multiple extensions of this approach have been proposed in the literature, including the extended isolation forest (EIF) and SCiForest. However, we find a lack of theoretical explanation of why IF, EIF, and SCiForest offer such good practical performance. In this paper, we present a theoretical framework that views these approaches from a distributional viewpoint. Using this viewpoint, we show that isolation-based approaches first accurately approximate the data distribution and then approximate the coefficients of its mixture components via the average path length. Using this framework, we derive the generalized isolation forest (GIF), which also trains random isolation trees but combines them in a way that moves beyond the average path length: GIF splits the data into multiple sub-spaces by sampling random splits, as the original IF variants do, and directly estimates the mixture coefficients of a mixture distribution to score the outlierness of entire regions of data. In an extensive evaluation, we compare GIF with 18 state-of-the-art outlier detection methods on 14 different datasets. We show that GIF outperforms three competing tree-based methods and achieves competitive performance with nearest-neighbor approaches while having a lower runtime. Last, we highlight a use-case study that uses GIF to detect transaction fraud in financial data.
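    The average-path-length mechanism that GIF generalises can be sketched in a few lines. The following is a minimal illustration of classic IF-style scoring (uniform axis-parallel splits, score 2^(-E[h(x)]/c(n))), not the GIF method itself; the function names and the depth limit are my own choices for the sketch.

    ```python
    import numpy as np

    def c(n):
        # Average path length of an unsuccessful BST search; the IF normaliser.
        if n <= 1:
            return 0.0
        return 2.0 * (np.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

    def path_length(x, X, depth=0, limit=8, rng=None):
        # Depth at which x is isolated by uniformly random axis-parallel splits;
        # truncated trees are extended by c(|X|), as in the original IF.
        rng = rng if rng is not None else np.random.default_rng(0)
        if depth >= limit or len(X) <= 1:
            return depth + c(len(X))
        q = rng.integers(X.shape[1])
        lo, hi = X[:, q].min(), X[:, q].max()
        if lo == hi:
            return depth + c(len(X))
        p = rng.uniform(lo, hi)
        side = X[:, q] < p
        X_next = X[side] if x[q] < p else X[~side]
        return path_length(x, X_next, depth + 1, limit, rng)

    def if_score(x, X, n_trees=50):
        # Anomaly score in (0, 1]; values near 1 indicate outliers.
        rngs = [np.random.default_rng(t) for t in range(n_trees)]
        h = np.mean([path_length(x, X, rng=r) for r in rngs])
        return 2.0 ** (-h / c(len(X)))
    ```

    A point far from the data mass is isolated in few splits, so its average path length is short and its score approaches 1; GIF replaces this path-length aggregation with direct estimation of mixture coefficients over regions.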

    Systematic construction of anomaly detection benchmarks from real data

    Research in anomaly detection suffers from a lack of realistic and publicly-available problem sets. This paper discusses what properties such problem sets should possess. It then introduces a methodology for transforming existing classification data sets into ground-truthed benchmark data sets for anomaly detection. The methodology produces data sets that vary along three important dimensions: (a) point difficulty, (b) relative frequency of anomalies, and (c) clusteredness. We apply our generated datasets to benchmark several popular anomaly detection algorithms under a range of different conditions.
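    As a rough illustration of one of the three dimensions above, relative frequency of anomalies, the sketch below turns a labelled classification set into a ground-truthed anomaly benchmark by subsampling one class as the anomaly class. The paper's full methodology also controls point difficulty and clusteredness, which this sketch does not; the function name and rate parameter are my own.

    ```python
    import numpy as np

    def make_anomaly_benchmark(X, y, anomaly_class, anomaly_rate=0.05, seed=0):
        # Keep all points outside `anomaly_class` as normals and subsample the
        # chosen class so anomalies make up roughly `anomaly_rate` of the result.
        rng = np.random.default_rng(seed)
        normal = X[y != anomaly_class]
        candidates = X[y == anomaly_class]
        n_anom = max(1, int(anomaly_rate * len(normal) / (1 - anomaly_rate)))
        n_anom = min(n_anom, len(candidates))
        picked = candidates[rng.choice(len(candidates), n_anom, replace=False)]
        data = np.vstack([normal, picked])
        labels = np.r_[np.zeros(len(normal), int), np.ones(n_anom, int)]
        return data, labels
    ```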

    OptIForest: Optimal Isolation Forest for Anomaly Detection

    Anomaly detection plays an increasingly important role in various fields for critical tasks such as intrusion detection in cybersecurity, financial risk detection, and human health monitoring. A variety of anomaly detection methods have been proposed, and a category based on the isolation forest mechanism stands out due to its simplicity, effectiveness, and efficiency; e.g., iForest is often employed as a state-of-the-art detector for real deployment. While the majority of isolation forests use the binary structure, the LSHiForest framework has demonstrated that a multi-fork isolation tree structure can lead to better detection performance. However, there is no theoretical work answering the fundamentally and practically important question of the optimal tree structure for an isolation forest with respect to the branching factor. In this paper, we establish a theory of isolation efficiency to answer this question and determine the optimal branching factor for an isolation tree. Based on this theoretical underpinning, we design a practical optimal isolation forest, OptIForest, incorporating clustering-based learning to hash, which enables more information to be learned from data for better isolation quality. The rationale of our approach relies on a better bias-variance trade-off achieved by bias reduction in OptIForest. Extensive experiments on a series of benchmarking datasets for comparative and ablation studies demonstrate that our approach can efficiently and robustly achieve better detection performance in general than the state of the art, including deep learning based methods. Comment: This paper has been accepted by the International Joint Conference on Artificial Intelligence (IJCAI-23).
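    To see why the branching factor involves a trade-off at all, consider a perfectly balanced b-ary tree: a larger b shortens the tree but makes each split do more work. The toy cost model below (b - 1 comparisons per b-way split along a root-to-leaf path) is an assumption for illustration only, not the paper's isolation-efficiency theory.

    ```python
    import math

    def isolation_depth(n, b):
        # Depth needed for a perfectly balanced b-ary tree to isolate n points.
        return math.ceil(math.log(n, b))

    def toy_efficiency(n, b):
        # Assumed toy metric (not the paper's): isolation gained per unit of
        # splitting work along a root-to-leaf path of the balanced b-ary tree.
        return math.log(n) / ((b - 1) * isolation_depth(n, b))
    ```

    Under this toy model, raising b from 2 shortens the tree (log_b n shrinks) while the per-level cost grows linearly, so an intermediate branching factor balances the two; the paper formalises and answers this question rigorously.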

    Anomaly detection in wireless mesh lighting networks


    An efficient framework for mining outlying aspects

    In the era of big data, an immense volume of information is being continuously generated, and it is common to encounter errors or anomalies within datasets. These anomalies can arise from system malfunctions or human errors, resulting in data points that deviate from expected patterns or values. Anomaly detection algorithms have been developed to identify such anomalies effectively. However, these algorithms often fall short in explaining why a particular data point is considered an anomaly: they cannot identify the specific feature subset(s) in which a data point significantly differs from the majority of the data. To address this limitation, researchers have recently turned their attention to a new research area called outlying aspect mining. This area focuses on discovering feature subset(s), known as aspects or subspaces, in which anomalous data points exhibit significant deviations from the remaining data. Outlying aspect mining algorithms aim to provide a more detailed understanding of the characteristics that make a data point anomalous. Although outlying aspect mining is an emerging area of research, only a few studies have been published so far. One of the key challenges in this field is scaling these algorithms to large datasets, characterised by either a large data size or high dimensionality. Many existing outlying aspect mining algorithms are not well suited to such datasets, as they exhaustively enumerate all possible subspaces and use density- or distance-based anomaly scores to rank them. As a result, most of these algorithms struggle with datasets of more than 20 dimensions. Addressing this scalability issue and developing efficient algorithms for outlying aspect mining in large datasets remains an active area of research.
The ability to identify and understand the specific feature subsets contributing to anomalies in big data holds great potential for various applications, including fraud detection, network intrusion detection, and anomaly-based decision support systems. Existing outlying aspect mining methods suffer from three main problems. First, their scoring measures often rely on distance or density calculations, which are biased with respect to dimensionality: as the dimensionality of a subspace increases, density tends to decrease, making it difficult to accurately assess the outlyingness of data points within specific subspaces. Second, distance- and density-based measures require computing pairwise distances, which makes them computationally expensive, especially for large-scale datasets containing millions of data points. Moreover, existing work uses Z-Score normalisation to make density-based scoring measures dimensionally unbiased, which adds further computational overhead to already expensive measures. Last, existing outlying aspect mining algorithms use brute-force methods to search subspaces. Tackling this efficiency issue is essential because, when the dimensionality of the data is high, the number of candidate subspaces grows exponentially, quickly exceeding available computational resources. This research project aims to solve these challenges by developing efficient and effective methods for mining outlying aspects in high-dimensional and large datasets. I have explored and designed different scoring measures to quantify the outlyingness of a given data point in each subspace. The effectiveness and efficiency of the proposed measures have been verified with extensive experiments on synthetic and real-world datasets.
To overcome the first problem, this thesis identifies and analyses the conditions under which Z-Score-normalised scoring measures fail to find the most outlying aspects, and proposes two approaches, HMass and sGrid++. Both measures are dimensionally unbiased in their raw form, which means they do not require any additional normalisation. sGrid++ is a simpler version of sGrid that is not only efficient and effective but also dimensionally unbiased; it does not require Z-Score normalisation. HMass is a simple but effective and efficient histogram-based solution for ranking the outlying aspects of a given query in each subspace. In addition to detecting anomalies, HMass provides explanations of why the points are anomalous. Neither sGrid++ nor HMass requires pairwise calculations like distance- or density-based measures; both are therefore computationally faster, which solves the second issue of existing work. The effectiveness and efficiency of sGrid++ and HMass are evaluated using synthetic and real-world datasets. In addition, I present an application of outlying aspect mining in the cybersecurity domain. To tackle the third problem, this thesis proposes an efficient and effective outlying aspect mining framework named OIMiner (for Outlying-Inlying Aspect Miner). It introduces a new scoring measure for computing the outlying degree, called Simple Isolation score using Nearest Neighbor Ensemble (SiNNE), which not only detects outliers but also explains why a selected point is an outlier. SiNNE is a dimensionally unbiased measure in its raw form, meaning the scores it produces can be compared directly across subspaces of different dimensionality; it therefore requires no normalisation to make the score unbiased.
Our experimental results on synthetic and publicly available real-world datasets revealed that (i) SiNNE produces better or at least comparable results to existing scores, and (ii) it improves the run time of the existing beam-search-based outlying aspect mining algorithm by at least two orders of magnitude. SiNNE allows the existing outlying aspect mining algorithm to run on datasets with hundreds of thousands of instances and thousands of dimensions, which was not possible before.
Doctor of Philosophy
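The brute-force subspace search that the thesis identifies as its third bottleneck can be sketched as follows. The grid-mass score here is a naive stand-in of my own, not the exact HMass or SiNNE definition, and note that raw cell mass still shrinks as dimensionality grows, which is precisely the bias the proposed measures are designed to remove; this sketch only illustrates the exhaustive search structure.

```python
import numpy as np
from itertools import combinations

def cell_mass(X, q, dims, bins=8):
    # Fraction of points sharing the query q's grid cell within subspace `dims`;
    # lower mass means q sits in a sparser region of that subspace.
    in_cell = np.ones(len(X), dtype=bool)
    for d in dims:
        lo, hi = X[:, d].min(), X[:, d].max()
        edges = np.linspace(lo, hi, bins + 1)
        cell = np.clip(np.searchsorted(edges, q[d], side="right") - 1, 0, bins - 1)
        idx = np.clip(np.searchsorted(edges, X[:, d], side="right") - 1, 0, bins - 1)
        in_cell &= idx == cell
    return in_cell.mean()

def most_outlying_aspect(X, q, max_dim=2, bins=8):
    # Exhaustively enumerate all subspaces up to `max_dim` dimensions and
    # return the one where q's cell mass is lowest: the brute-force search
    # whose exponential cost motivates the thesis's efficient alternatives.
    best, best_mass = None, 1.0
    for k in range(1, max_dim + 1):
        for dims in combinations(range(X.shape[1]), k):
            m = cell_mass(X, q, dims, bins)
            if m < best_mass:
                best, best_mass = dims, m
    return best, best_mass
```

With d features, the loop visits sum over k of C(d, k) subspaces, which grows exponentially in max_dim; that combinatorial blow-up, plus the cost of each score evaluation, is what the proposed measures and search framework address.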