
    A new dimensionality-unbiased score for efficient and effective outlying aspect mining

    The main aim of an outlying aspect mining algorithm is to automatically detect the subspace(s) (a.k.a. aspect(s)) in which a given data point is dramatically different from the rest of the data. To rank the subspaces for a given data point, a scoring measure is required to compute the outlying degree of the point in each subspace. In this paper, we introduce a new measure of outlying degree, called Simple Isolation score using Nearest Neighbor Ensemble (SiNNE), which not only detects outliers but also explains why the selected point is an outlier. SiNNE is a dimensionally unbiased measure in its raw form, which means the scores it produces can be compared directly across subspaces of different dimensionality. Thus, it does not require any normalization to make the score unbiased. Our experimental results on synthetic and publicly available real-world datasets reveal that (i) SiNNE produces better or at least the same results as existing scores, and (ii) it improves the run time of the existing beam-search-based outlying aspect mining algorithm by at least two orders of magnitude. SiNNE allows the existing outlying aspect mining algorithm to run on datasets with hundreds of thousands of instances and thousands of dimensions, which was not possible before. © 2022, The Author(s)
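    To make the isolation mechanism behind the score concrete, here is a minimal Python sketch of an iNNE-style nearest-neighbour-ensemble score; the function name, the subsample size `psi`, and the ensemble size are illustrative assumptions, not the authors' published parameterisation.

```python
import numpy as np

def sinne_score(X, q, n_members=100, psi=8, rng=None):
    """Hedged sketch of a SiNNE-style isolation score.

    X : (n, d) data projected onto the subspace under test.
    q : (d,) query point whose outlying degree we want.
    Returns the fraction of ensemble members in which q is 'isolated',
    i.e. falls outside every nearest-neighbour hypersphere.
    """
    rng = np.random.default_rng(rng)
    n = len(X)
    isolated = 0
    for _ in range(n_members):
        # Subsample psi points; each defines a ball whose radius is the
        # distance to its nearest neighbour within the subsample.
        idx = rng.choice(n, size=psi, replace=False)
        S = X[idx]
        pair = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=-1)
        np.fill_diagonal(pair, np.inf)
        radii = pair.min(axis=1)
        # q is covered if it lies inside at least one ball.
        covered = (np.linalg.norm(S - q, axis=1) <= radii).any()
        isolated += int(not covered)
    return isolated / n_members
```

    Because the score is simply the fraction of ensemble members in which the query escapes every hypersphere, it lies in [0, 1] regardless of the subspace's dimensionality, which is what allows scores to be compared across subspaces without normalization.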

    Advances in knowledge discovery and data mining Part II

    19th Pacific-Asia Conference, PAKDD 2015, Ho Chi Minh City, Vietnam, May 19-22, 2015, Proceedings, Part II

    Why is this an anomaly? Explaining anomalies using sequential explanations

    In most applications, anomaly detection operates in an unsupervised mode, looking for outliers in the hope that they are anomalies. Unfortunately, most anomaly detectors do not come with explanations of which features make a detected outlier point anomalous. Human analysts must therefore manually browse through each detected outlier point’s feature space to obtain the subset of features that will help them determine whether the point is genuinely anomalous or not. This paper introduces sequential explanation (SE) methods that sequentially explain to the analyst which features make the detected outlier anomalous. We present two methods for computing SEs, called outlier-based and sample-based SEs, that work alongside any anomaly detector. The outlier-based SE methods use an anomaly detector’s outlier scoring measure, guided by a search algorithm, to compute the SEs. Meanwhile, the sample-based SE methods employ sampling to turn the problem into a classical feature selection problem. In our experiments, we compare the performances of the different outlier- and sample-based SEs. Our results show that both the outlier- and sample-based methods compute SEs that perform well and outperform sequential feature explanations.
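    As an illustration of how an outlier-based SE might be computed, the following hedged sketch greedily orders features by how much each one raises the detector's outlier score for the point being explained; the `score_fn` helper and the greedy strategy are assumptions standing in for whichever detector and search algorithm are actually used.

```python
def greedy_sequential_explanation(X, i, score_fn, k=3):
    """Hedged sketch of an outlier-score-guided sequential explanation.

    X        : (n, d) NumPy data matrix.
    i        : index of the detected outlier to explain.
    score_fn : callable(X_sub, i) -> outlier score of point i in the
               subspace X_sub (any anomaly detector's scoring measure).
    k        : length of the explanation.
    Returns feature indices ordered by how anomalous they make point i.
    """
    d = X.shape[1]
    chosen = []
    while len(chosen) < k:
        remaining = [f for f in range(d) if f not in chosen]
        # Greedily pick the feature whose addition maximises the score.
        best = max(remaining, key=lambda f: score_fn(X[:, chosen + [f]], i))
        chosen.append(best)
    return chosen
```

    The analyst can then inspect the features in the returned order, stopping as soon as the evidence is sufficient to judge whether the point is genuinely anomalous.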

    User-Centric Active Learning for Outlier Detection

    Outlier detection searches for unusual, rare observations in large, often high-dimensional data sets. One of the fundamental challenges of outlier detection is that "unusual" typically depends on the perception of a user, the recipient of the detection result. This makes it difficult to find a formal definition of "unusual" that matches user expectations. One way to deal with this issue is active learning, i.e., methods that ask users to provide auxiliary information, such as class label annotations, to return algorithmic results that are more in line with the user input. Active learning is well-suited for outlier detection, and many such methods have been proposed in recent years. However, existing methods build upon strong assumptions. One example is the assumption that users can always provide accurate feedback, regardless of how algorithmic results are presented to them -- an assumption which is unlikely to hold when data is high-dimensional. It is an open question to what extent existing assumptions stand in the way of realizing active learning in practice. In this thesis, we study this question from different perspectives with a differentiated, user-centric view on active learning. In the beginning, we structure and unify the research area of active learning for outlier detection. Specifically, we present a rigorous specification of the learning setup, structure the basic building blocks, and propose novel evaluation standards. Throughout our work, this structure has turned out to be essential for selecting a suitable active learning method and for assessing novel contributions in this field. We then present two algorithmic contributions to make active learning for outlier detection user-centric. First, we bring together two research areas that have so far been looked at independently: outlier detection in subspaces and active learning. Subspace outlier detection methods improve outlier detection quality in high-dimensional data and make detection results easier to interpret. Our approach combines them with active learning such that one can balance between detection quality and annotation effort. Second, we address one of the fundamental difficulties in adapting active learning to specific applications: selecting good hyperparameter values. Existing methods to estimate hyperparameter values are heuristics, and it is unclear in which settings they work well. In this thesis, we therefore propose the first principled method to estimate hyperparameter values. Our approach relies on active learning to estimate hyperparameter values, and returns a quality estimate of the values selected. In the last part of the thesis, we look at validating active learning for outlier detection in practice. There, we have identified several technical and conceptual challenges which we have experienced firsthand in our research. We structure and document them, and finally derive a roadmap towards validating active learning for outlier detection with user studies.
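    A minimal sketch of the basic loop this line of work builds on, assuming scikit-learn and an `oracle` callable that stands in for the user's feedback; the isolation-forest warm start and the margin-based query strategy are illustrative choices, not the thesis's method.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import SVC

def active_outlier_loop(X, oracle, budget=20):
    """Hedged sketch of an active-learning loop for outlier detection.

    oracle : callable(index) -> True/False, standing in for the user who
             annotates whether a queried point is an outlier.
    Starts from unsupervised scores and queries the points the current
    model is least certain about, within a fixed label budget.
    """
    # Unsupervised warm start: higher score == more anomalous.
    scores = -IsolationForest(random_state=0).fit(X).score_samples(X)
    labelled, labels = [], []
    clf = None
    for _ in range(budget):
        if clf is None:
            # No usable feedback yet: query the most anomalous point.
            candidates = np.argsort(-scores)
        else:
            # Query the point closest to the decision boundary.
            margin = np.abs(clf.decision_function(X))
            candidates = np.argsort(margin)
        q = next(i for i in candidates if i not in labelled)
        labelled.append(q)
        labels.append(int(oracle(q)))
        if len(set(labels)) == 2:  # need both classes before fitting
            clf = SVC(kernel="rbf").fit(X[labelled], labels)
    return clf, labelled, labels
```

    Even this toy loop exposes the assumption the thesis questions: the oracle is presumed to answer every query accurately, however the queried point is presented.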

    Functional data methods for climatological processes

    Many climatological and environmental processes take the form of trajectories or surfaces. In the language of statistics, these observations can be considered functional data, and the tools for studying their behavior define a framework known as Functional Data Analysis (FDA). In the following chapters we propose three FDA methods to model three different climatological phenomena. Chapter 1 develops a robust test statistic for differentiating between two ensembles of spatial processes. We use this method to test for significant influence of historical proxy observations in paleoclimate reconstructions. Chapter 2 introduces a new class of functional data depths and a rigorous shape outlier detector based on elastic distance. This method handles functional data observed on nonlinear manifolds, such as spheres, which allows us to identify anomalously shaped hurricane trajectories in the Atlantic. Finally, in Chapter 3 we propose a computationally efficient and robust changepoint detector for functional data. We use this to test for, and estimate, changepoints in a long sequence of atmospheric interferometer profile measurements.
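    As a rough illustration of the distance-based depth idea that Chapter 2 builds on, the sketch below ranks curves by their average distance to the rest of the sample; a plain L2 distance is substituted here for the elastic shape distance, so this is only a schematic stand-in for the actual detector.

```python
import numpy as np

def distance_depth(curves, dist=None):
    """Hedged sketch of a distance-based functional depth.

    curves : (n, T) array, each row a function sampled on a common grid.
    dist   : pairwise distance; the thesis uses an elastic (shape)
             distance, for which a plain L2 distance is substituted here.
    A curve far from the bulk of the sample gets low depth, so the depth
    ranking doubles as an outlier ranking.
    """
    n = len(curves)
    if dist is None:
        dist = lambda f, g: np.linalg.norm(f - g)
    depths = np.empty(n)
    for i in range(n):
        # Average distance from curve i to every other curve.
        d_bar = np.mean([dist(curves[i], curves[j])
                         for j in range(n) if j != i])
        depths[i] = 1.0 / (1.0 + d_bar)
    return depths  # low depth == candidate shape outlier
```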

    End-to-end anomaly detection in stream data

    Nowadays, huge volumes of data are generated with increasing velocity through various systems, applications, and activities. This increases the demand for stream and time series analysis to react to changing conditions in real time, for enhanced efficiency and quality of service delivery as well as upgraded safety and security in private and public sectors. Despite its very rich history, time series anomaly detection is still one of the vital topics in machine learning research and is receiving increasing attention. Identifying hidden patterns and selecting an appropriate model that fits the observed data well, and also carries over to unobserved data, is not a trivial task. Due to the increasing diversity of data sources and associated stochastic processes, this pivotal data analysis topic is loaded with various challenges, like complex latent patterns, concept drift, and overfitting, that may mislead the model and cause a high false alarm rate. Handling these challenges leads advanced anomaly detection methods to develop sophisticated decision logic, which turns them into mysterious and inexplicable black boxes. Contrary to this trend, end users expect transparency and verifiability in order to trust a model and the outcomes it produces. Also, pointing users to the most anomalous/malicious areas of a time series, and to the causal features, could save them time, energy, and money. For these reasons, this thesis addresses the crucial challenges in an end-to-end pipeline of stream-based anomaly detection through three essential phases: behavior prediction, inference, and interpretation. The first step is focused on devising a time series model that yields high average accuracy as well as small error deviation. On this basis, we propose higher-quality anomaly detection and scoring techniques that utilize the related context to reclassify observations and post-prune unjustified events. Last but not least, we make the predictive process transparent and verifiable by providing meaningful reasoning behind its generated results, based on concepts understandable by a human. The provided insight can pinpoint the anomalous regions of a time series and explain why the current status of a system has been flagged as anomalous. Stream-based anomaly detection research is a principal area of innovation to support our economy, security, and even the safety and health of societies worldwide. We believe our proposed analysis techniques can contribute to building a situational awareness platform and open new perspectives in a variety of domains such as cybersecurity and health.
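    To make the predict-then-score pipeline concrete, here is a minimal sketch in which a rolling forecast stands in for the thesis's time series model and the anomaly score is the standardised residual; the moving-average predictor and the k-sigma threshold are illustrative assumptions, not the proposed method.

```python
import numpy as np

def residual_anomaly_scores(series, window=24, k=3.0):
    """Hedged sketch of predict-then-score anomaly detection on a stream.

    A rolling forecast (here a trivial moving average) predicts each new
    value; the anomaly score is the standardised residual, and a point is
    flagged when the score exceeds k sigma.
    """
    series = np.asarray(series, dtype=float)
    scores, flags = [], []
    for t in range(window, len(series)):
        hist = series[t - window:t]
        pred = hist.mean()                   # behaviour prediction
        sigma = hist.std() + 1e-9            # guard against zero variance
        z = abs(series[t] - pred) / sigma    # inference: anomaly scoring
        scores.append(z)
        flags.append(z > k)                  # flagged region to explain
    return np.array(scores), np.array(flags)
```

    The returned scores point the user to the anomalous regions; the interpretation phase the abstract describes would then attach human-understandable reasons to each flagged region.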

    Towards Probabilistic and Partially-Supervised Structural Health Monitoring

    One of the most significant challenges for signal processing in data-based structural health monitoring (SHM) is a lack of comprehensive data; in particular, recorded labels that describe what each of the measured signals represents. For example, consider an offshore wind turbine monitored by an SHM strategy. It is infeasible to artificially damage such a high-value asset to collect signals that might relate to the damaged structure in situ; additionally, signals that correspond to abnormal wave-loading, or unusually low temperatures, could take several years to be recorded. Regular inspections of the turbine in operation, to describe (and label) what the measured data represent, would also prove impracticable -- conventionally, it is only possible to check various components (such as the turbine blades) following manual inspection; this involves travelling to a remote, offshore location, which is a high-cost procedure. Therefore, the collection of labelled data is generally limited by some expense incurred when investigating the signals; this might include direct costs, or loss of income due to down-time. Conventionally, incomplete label information forces a dependence on unsupervised machine learning, limiting SHM strategies to damage (i.e. novelty) detection. However, while comprehensive and fully labelled data can be rare, it is often possible to provide labels for a limited subset of data, given a label budget. In this scenario, partially-supervised machine learning becomes relevant. The associated algorithms offer an alternative approach to monitoring measured data, as they can utilise both labelled and unlabelled signals within a unifying training scheme. In consequence, this work introduces (and adapts) partially-supervised algorithms for SHM; specifically, semi-supervised and active learning methods. Through applications to experimental data, semi-supervised learning is shown to utilise information in the unlabelled signals, alongside a limited set of labelled data, to further update a predictive model. On the other hand, active learning improves the predictive performance by querying specific signals to investigate, which are assumed to be the most informative. Both discriminative and generative methods are investigated, leading towards a novel, probabilistic framework to classify, investigate, and label signals for online SHM. The findings indicate that, through partially-supervised learning, the cost associated with labelling data can be managed, as the information in a selected subset of labelled signals can be combined with larger sets of unlabelled data -- increasing the potential scope and predictive performance for data-driven SHM.
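    A hedged sketch of how a label budget might be spent under such a partially-supervised scheme, combining uncertainty-based querying with scikit-learn's self-training; the `oracle` callable stands in for a costly manual inspection, and every modelling choice below is an assumption rather than the thesis's framework.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

def partially_supervised_shm(X, y, oracle, budget=10):
    """Hedged sketch of partially-supervised learning under a label budget.

    y      : labels with -1 for unlabelled signals (sklearn convention);
             assumed to start with a few labelled signals of each class.
    oracle : callable(index) -> class label, standing in for the costly
             inspection that annotates a queried signal.
    Active step: spend the budget on the signals the current model is
    least certain about. Semi-supervised step: self-training then also
    exploits the remaining unlabelled signals.
    """
    y = np.asarray(y).copy()
    base = LogisticRegression(max_iter=1000)
    for _ in range(budget):
        model = SelfTrainingClassifier(base).fit(X, y)
        uncertainty = 1.0 - model.predict_proba(X).max(axis=1)
        uncertainty[y != -1] = -np.inf     # only query unlabelled signals
        q = int(np.argmax(uncertainty))
        y[q] = oracle(q)                   # pay the inspection cost
    return SelfTrainingClassifier(base).fit(X, y)
```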