
    Hiding Outliers in High-Dimensional Data Spaces

    Detecting outliers in high-dimensional data is crucial in many domains. Due to the curse of dimensionality, one typically does not detect outliers in the full space, but in subspaces of it. More specifically, since the number of subspaces is huge, the detection takes place in only some subspaces. In consequence, one might miss hidden outliers, i.e., outliers only detectable in certain subspaces. In this paper, we take the opposite perspective, which is of practical relevance as well, and study how to hide outliers in high-dimensional data spaces. We formally prove characteristics of hidden outliers. We also propose an algorithm to place them in the data. It focuses on the regions close to existing data objects and is more efficient than an exhaustive approach. In experiments, we both evaluate our formal results and show the usefulness of our algorithm using different subspace selection schemes, outlier detection methods, and data sets.
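The notion of a hidden outlier in this abstract can be illustrated with a small toy sketch: a point that looks unremarkable in every one-dimensional subspace can still be a clear outlier in the full space. The detectors below (a per-dimension z-score test and a full-space Mahalanobis test) and all thresholds are illustrative stand-ins, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Strongly correlated inliers: x2 is approximately x1 plus small noise.
x1 = rng.normal(0.0, 1.0, 500)
x2 = x1 + rng.normal(0.0, 0.1, 500)
data = np.column_stack([x1, x2])

# Candidate point: typical in each marginal, but far off the correlation line.
point = np.array([1.0, -1.0])

def zscore_outlier(values, v, threshold=3.0):
    """Flag v as an outlier in a 1-D subspace via a simple z-score test."""
    return abs(v - values.mean()) / values.std() > threshold

def mahalanobis_outlier(X, v, threshold=3.0):
    """Flag v as an outlier in the full space via its Mahalanobis distance."""
    diff = v - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    return np.sqrt(diff @ cov_inv @ diff) > threshold

hidden_1d = [zscore_outlier(data[:, j], point[j]) for j in range(2)]
print("outlier in 1-D subspaces:", hidden_1d)                      # [False, False]
print("outlier in full space:", mahalanobis_outlier(data, point))  # True
```

A subspace-based detector that only ever inspects the two one-dimensional subspaces would therefore miss this point entirely, which is exactly the blind spot the abstract exploits.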

    On the Nature and Types of Anomalies: A Review

    Anomalies are occurrences in a dataset that are in some way unusual and do not fit the general patterns. The concept of the anomaly is generally ill-defined and perceived as vague and domain-dependent. Moreover, despite some 250 years of publications on the topic, no comprehensive and concrete overview of the different types of anomalies has hitherto been published. By means of an extensive literature review, this study therefore offers the first theoretically principled and domain-independent typology of data anomalies, and presents a full overview of anomaly types and subtypes. To concretely define the concept of the anomaly and its different manifestations, the typology employs five dimensions: data type, cardinality of relationship, anomaly level, data structure, and data distribution. These fundamental and data-centric dimensions naturally yield 3 broad groups, 9 basic types, and 61 subtypes of anomalies. The typology facilitates the evaluation of the functional capabilities of anomaly detection algorithms, contributes to explainable data science, and provides insights into relevant topics such as local versus global anomalies.

    Calibration and Evaluation of Outlier Detection with Generated Data

    Outlier detection is an essential part of data science --- an area with increasing relevance in a plethora of domains. While numerous approaches for the detection of outliers already exist, some significant challenges remain. Two prominent ones are that outliers are rare and not precisely defined. Both have serious consequences, especially for the calibration and evaluation of detection methods. This thesis is concerned with a possible way of dealing with these challenges: the generation of outliers. It discusses existing techniques for generating outliers, and specifically their use in tackling the aforementioned challenges. In the literature, the topic of outlier generation has had little general structure so far, despite the many techniques that have already been proposed. The first contribution of this thesis is therefore a unified and crisp description of the state of the art in outlier generation and its usages. Given the variety of characteristics of the generated outliers and the variety of methods designed for the detection of real outliers, it becomes apparent that comparisons of detection performance should be more distinctive than state-of-the-art comparisons are. Such a distinctive comparison is tackled in the second central contribution of this thesis: a general process for the distinctive evaluation of outlier detection methods with generated data. The process developed in this thesis uses entirely artificial data in which the inliers are realistic representations of some real-world data and the outliers are deviations from these inliers with specific characteristics. The realism of the inliers allows performance evaluations to generalize to many other data domains. The carefully designed generation techniques for outliers allow insights into the effects of outlier characteristics.
    So-called hidden outliers represent a special type of outlier: they are defined with respect to a set of subspaces, i.e., selections of data attributes. Hidden outliers are detectable only in a particular set of subspaces; in the subspaces they hide from, they are not detectable. For outlier detection methods that make use of subspaces, hidden outliers are a blind spot: they evade detection if they hide from exactly the subspaces searched for outliers. Thus, hidden outliers are interesting to study, in particular for the evaluation of detection methods that use subspaces. The third central contribution of this thesis is a technique for the generation of hidden outliers, together with an analysis of the characteristics of such instances. For this analysis, the concept of hidden outliers is first approached theoretically. The developed technique is then used to validate the theoretical findings in more realistic contexts, for example to show that hidden outliers can appear in many real-world data sets. All in all, this dissertation gives the field of outlier generation needed structure and shows its usefulness in tackling prominent challenges of the outlier detection problem.
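The generation idea described above can be sketched in a toy form: perturb existing data objects and keep those candidates that a full-space detector flags while no one-dimensional subspace detector does. The sampling scheme, detectors, and thresholds below are illustrative assumptions, not the thesis's algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)

def is_outlier_1d(X, v, j, threshold=3.0):
    """Simple z-score test in the 1-D subspace given by attribute j."""
    col = X[:, j]
    return abs(v[j] - col.mean()) / col.std() > threshold

def is_outlier_full(X, v, threshold=3.0):
    """Mahalanobis-distance test in the full space."""
    diff = v - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    return np.sqrt(diff @ cov_inv @ diff) > threshold

def generate_hidden_outliers(X, n_candidates=500, scale=1.0):
    """Sample near existing points; keep candidates that are hidden in
    every 1-D subspace but visible to the full-space detector."""
    hidden = []
    for _ in range(n_candidates):
        base = X[rng.integers(len(X))]          # stay close to existing data
        cand = base + rng.normal(0.0, scale, X.shape[1])
        if is_outlier_full(X, cand) and not any(
            is_outlier_1d(X, cand, j) for j in range(X.shape[1])
        ):
            hidden.append(cand)
    return np.array(hidden)

# Correlated toy data: hidden outliers lie off the correlation line.
x1 = rng.normal(0.0, 1.0, 500)
X = np.column_stack([x1, x1 + rng.normal(0.0, 0.1, 500)])
hidden = generate_hidden_outliers(X)
print("hidden-outlier candidates found:", len(hidden))
```

Restricting candidates to the vicinity of existing objects mirrors the focus, mentioned above, on regions close to the data, which keeps the search far cheaper than exhaustively probing the whole space.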

    User-Centric Active Learning for Outlier Detection

    Outlier detection searches for unusual, rare observations in large, often high-dimensional data sets. One of the fundamental challenges of outlier detection is that "unusual" typically depends on the perception of a user, the recipient of the detection result. This makes it difficult to find a formal definition of "unusual" that matches user expectations. One way to deal with this issue is active learning, i.e., methods that ask users to provide auxiliary information, such as class label annotations, in order to return algorithmic results that are more in line with the user input. Active learning is well-suited for outlier detection, and many respective methods have been proposed in recent years. However, existing methods build upon strong assumptions. One example is the assumption that users can always provide accurate feedback, regardless of how algorithmic results are presented to them -- an assumption which is unlikely to hold when data is high-dimensional. It is an open question to what extent existing assumptions stand in the way of realizing active learning in practice. In this thesis, we study this question from different perspectives with a differentiated, user-centric view on active learning. We begin by structuring and unifying the research area of active learning for outlier detection. Specifically, we present a rigorous specification of the learning setup, structure the basic building blocks, and propose novel evaluation standards. Throughout our work, this structure has turned out to be essential for selecting a suitable active learning method and for assessing novel contributions in this field. We then present two algorithmic contributions to make active learning for outlier detection user-centric. First, we bring together two research areas that have so far been looked at independently: outlier detection in subspaces and active learning.
    Subspace outlier detection methods improve outlier detection quality in high-dimensional data and make detection results easier to interpret. Our approach combines them with active learning such that one can balance detection quality against annotation effort. Second, we address one of the fundamental difficulties of adapting active learning to specific applications: selecting good hyperparameter values. Existing methods to estimate hyperparameter values are heuristics, and it is unclear in which settings they work well. In this thesis, we therefore propose the first principled method to estimate hyperparameter values. Our approach relies on active learning to estimate hyperparameter values and returns a quality estimate of the values selected. In the last part of the thesis, we look at validating active learning for outlier detection in practice. There, we have identified several technical and conceptual challenges which we have experienced firsthand in our research. We structure and document them, and finally derive a roadmap towards validating active learning for outlier detection with user studies.
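The active learning setup this abstract describes can be sketched as a simple query loop, with a simulated oracle standing in for the user. The score function, query strategy, and threshold refit below are illustrative choices, not the thesis's method.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy 1-D data: inliers near 0, a few true outliers far away.
inliers = rng.normal(0.0, 1.0, 95)
outliers = rng.normal(8.0, 1.0, 5)
data = np.concatenate([inliers, outliers])
true_labels = np.array([0] * 95 + [1] * 5)   # hidden from the learner

scores = np.abs(data - np.median(data))      # simple, illustrative outlier score
threshold = scores.mean()                    # crude initial decision boundary
labeled = {}                                 # index -> user-provided label

for _ in range(10):                          # annotation budget
    # Query the most anomalous-looking unlabeled point.
    unlabeled = [i for i in range(len(data)) if i not in labeled]
    query = max(unlabeled, key=lambda i: scores[i])
    labeled[query] = true_labels[query]      # oracle stands in for the user
    # Refit: midpoint between the highest-scoring labeled inlier and the
    # lowest-scoring labeled outlier, once both classes have been seen.
    in_s = [scores[i] for i, y in labeled.items() if y == 0]
    out_s = [scores[i] for i, y in labeled.items() if y == 1]
    if in_s and out_s:
        threshold = (max(in_s) + min(out_s)) / 2

predicted = (scores > threshold).astype(int)
```

On this toy data the ten labels are enough to place the threshold between the two classes; the thesis's point is that realizing such loops with real users, rather than an oracle, raises exactly the assumptions and validation challenges discussed above.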
