2,356 research outputs found

    A taxonomy framework for unsupervised outlier detection techniques for multi-type data sets

    Get PDF
    The term "outlier" can generally be defined as an observation that is significantly different from the other values in a data set. The outliers may be instances of error or indicate events. The task of outlier detection aims at identifying such outliers in order to improve the analysis of data and further discover interesting and useful knowledge about unusual events within numerous applications domains. In this paper, we report on contemporary unsupervised outlier detection techniques for multiple types of data sets and provide a comprehensive taxonomy framework and two decision trees to select the most suitable technique based on data set. Furthermore, we highlight the advantages, disadvantages and performance issues of each class of outlier detection techniques under this taxonomy framework

    Automatic Bayesian Density Analysis

    Full text link
    Making sense of a dataset in an automatic and unsupervised fashion is a challenging problem in statistics and AI. Classical approaches for {exploratory data analysis} are usually not flexible enough to deal with the uncertainty inherent to real-world data: they are often restricted to fixed latent interaction models and homogeneous likelihoods; they are sensitive to missing, corrupt and anomalous data; moreover, their expressiveness generally comes at the price of intractable inference. As a result, supervision from statisticians is usually needed to find the right model for the data. However, since domain experts are not necessarily also experts in statistics, we propose Automatic Bayesian Density Analysis (ABDA) to make exploratory data analysis accessible at large. Specifically, ABDA allows for automatic and efficient missing value estimation, statistical data type and likelihood discovery, anomaly detection and dependency structure mining, on top of providing accurate density estimation. Extensive empirical evidence shows that ABDA is a suitable tool for automatic exploratory analysis of mixed continuous and discrete tabular data.Comment: In proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19

    Classification and Anomaly Detection for Astronomical Datasets

    No full text
    This work develops two new statistical techniques for astronomical problems: a star / galaxy separator for the UKIRT Infrared Deep Sky Survey (UKIDSS) and a novel anomaly detection method for cross-matched astronomical datasets. The star / galaxy separator is a statistical classification method which outputs class membership probabilities rather than class labels and allows the use of prior knowledge about the source populations. Deep Sloan Digital Sky Survey (SDSS) data from the multiply imaged Stripe 82 region is used to check the results from our classifier, which compares favourably with the UKIDSS pipeline classification algorithm. The anomaly detection method addresses the problem posed by objects having different sets of recorded variables in cross-matched datasets. This prevents the use of methods unable to handle missing values and makes direct comparison between objects difficult. For each source, our method computes anomaly scores in subspaces of the observed feature space and combines them to an overall anomaly score. The proposed technique is very general and can easily be used in applications other than astronomy. The properties and performance of our method are investigated using both real and simulated datasets

    Ensemble Methods for Anomaly Detection

    Get PDF
    Anomaly detection has many applications in numerous areas such as intrusion detection, fraud detection, and medical diagnosis. Most current techniques are specialized for detecting one type of anomaly, and work well on specific domains and when the data satisfies specific assumptions. We address this problem, proposing ensemble anomaly detection techniques that perform well in many applications, with four major contributions: using bootstrapping to better detect anomalies on multiple subsamples, sequential application of diverse detection algorithms, a novel adaptive sampling and learning algorithm in which the anomalies are iteratively examined, and improving the random forest algorithms for detecting anomalies in streaming data. We design and evaluate multiple ensemble strategies using score normalization, rank aggregation and majority voting, to combine the results from six well-known base algorithms. We propose a bootstrapping algorithm in which anomalies are evaluated from multiple subsets of the data. Results show that our independent ensemble performs better than the base algorithms, and using bootstrapping achieves competitive quality and faster runtime compared with existing works. We develop new sequential ensemble algorithms in which the second algorithm performs anomaly detection based on the first algorithm\u27s outputs; best results are obtained by combining algorithms that are substantially different. We propose a novel adaptive sampling algorithm which uses the score output of the base algorithm to determine the hard-to-detect examples, and iteratively resamples more points from such examples in a complete unsupervised context. On streaming datasets, we analyze the impact of parameters used in random trees, and propose new algorithms that work well with high-dimensional data, improving performance without increasing the number of trees or their heights. We show that further improvements can be obtained with an Evolutionary Algorithm

    Applications and Modeling Techniques of Wind Turbine Power Curve for Wind Farms - A Review

    Get PDF
    In the wind energy industry, the power curve represents the relationship between the “wind speed” at the hub height and the corresponding “active power” to be generated. It is the most versatile condition indicator and of vital importance in several key applications, such as wind turbine selection, capacity factor estimation, wind energy assessment and forecasting, and condition monitoring, among others. Ensuring an effective implementation of the aforementioned applications mostly requires a modeling technique that best approximates the normal properties of an optimal wind turbines operation in a particular wind farm. This challenge has drawn the attention of wind farm operators and researchers towards the “state of the art” in wind energy technology. This paper provides an exhaustive and updated review on power curve based applications, the most common anomaly and fault types including their root-causes, along with data preprocessing and correction schemes (i.e., filtering, clustering, isolation, and others), and modeling techniques (i.e., parametric and non-parametric) which cover a wide range of algorithms. More than 100 references, for the most part selected from recently published journal articles, were carefully compiled to properly assess the past, present, and future research directions in this active domain
    corecore