A taxonomy framework for unsupervised outlier detection techniques for multi-type data sets
The term "outlier" can generally be defined as an observation that is significantly different from
the other values in a data set. The outliers may be instances of error or indicate events. The
task of outlier detection aims at identifying such outliers in order to improve the analysis of
data and further discover interesting and useful knowledge about unusual events within numerous
application domains. In this paper, we report on contemporary unsupervised outlier detection
techniques for multiple types of data sets and provide a comprehensive taxonomy framework and
two decision trees for selecting the most suitable technique based on the data set. Furthermore, we
highlight the advantages, disadvantages and performance issues of each class of outlier detection
techniques under this taxonomy framework.
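To make the definition of an outlier above concrete, here is a minimal sketch of one simple unsupervised detector for numeric data. It is not taken from the paper: the modified z-score based on the median absolute deviation (MAD), the 3.5 threshold, and the sample readings are all assumptions chosen for illustration.

```python
import statistics

def mad_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold.

    Illustrative assumption: a MAD-based score, not a method from the paper.
    """
    med = statistics.median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = statistics.median(abs_dev)
    if mad == 0:
        return []  # no spread: the score is undefined, so flag nothing
    # 0.6745 rescales the MAD to be comparable to a standard deviation
    # when the data are roughly normal.
    return [i for i, d in enumerate(abs_dev) if 0.6745 * d / mad > threshold]

# Hypothetical sensor readings with one value far from the rest.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 25.0, 10.2]
flagged = mad_outliers(readings)
print(flagged)
```

A median-based score is used here because the mean and standard deviation are themselves distorted by the outliers being searched for.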
Automatic Bayesian Density Analysis
Making sense of a dataset in an automatic and unsupervised fashion is a
challenging problem in statistics and AI. Classical approaches for exploratory
data analysis are usually not flexible enough to deal with the uncertainty
inherent to real-world data: they are often restricted to fixed latent
interaction models and homogeneous likelihoods; they are sensitive to missing,
corrupt and anomalous data; moreover, their expressiveness generally comes at
the price of intractable inference. As a result, supervision from statisticians
is usually needed to find the right model for the data. However, since domain
experts are not necessarily also experts in statistics, we propose Automatic
Bayesian Density Analysis (ABDA) to make exploratory data analysis accessible
at large. Specifically, ABDA allows for automatic and efficient missing value
estimation, statistical data type and likelihood discovery, anomaly detection
and dependency structure mining, on top of providing accurate density
estimation. Extensive empirical evidence shows that ABDA is a suitable tool for
automatic exploratory analysis of mixed continuous and discrete tabular data.
Comment: In proceedings of the Thirty-Third AAAI Conference on Artificial
Intelligence (AAAI-19)
Classification and Anomaly Detection for Astronomical Datasets
This work develops two new statistical techniques for astronomical problems: a star /
galaxy separator for the UKIRT Infrared Deep Sky Survey (UKIDSS) and a novel anomaly
detection method for cross-matched astronomical datasets.
The star / galaxy separator is a statistical classification method which outputs class
membership probabilities rather than class labels and allows the use of prior knowledge
about the source populations. Deep Sloan Digital Sky Survey (SDSS) data from the multiply
imaged Stripe 82 region is used to check the results from our classifier, which compares
favourably with the UKIDSS pipeline classification algorithm.
The anomaly detection method addresses the problem posed by objects having different
sets of recorded variables in cross-matched datasets. This prevents the use of methods
unable to handle missing values and makes direct comparison between objects difficult.
For each source, our method computes anomaly scores in subspaces of the observed feature
space and combines them into an overall anomaly score. The proposed technique is very
general and can easily be used in applications other than astronomy. The properties and
performance of our method are investigated using both real and simulated datasets.
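The idea of scoring each source only in the subspaces it has observed, then combining those scores, can be sketched as follows. The abstract does not specify the per-subspace score or the combination rule, so this sketch assumes one-dimensional subspaces, absolute z-scores, and a plain average. The feature names (`j_mag`, `k_mag`, `r_mag`) and values are hypothetical.

```python
import statistics

def subspace_scores(objects):
    """Score each object using only the features it has actually recorded.

    Assumptions (not from the source): one-dimensional subspaces,
    absolute z-scores, and averaging as the combination rule.
    """
    features = sorted({k for obj in objects for k, v in obj.items() if v is not None})
    # Reference statistics per feature, from the objects that observed it.
    stats = {}
    for f in features:
        observed = [obj[f] for obj in objects if obj.get(f) is not None]
        if len(observed) >= 2:
            stats[f] = (statistics.mean(observed), statistics.pstdev(observed))
    scores = []
    for obj in objects:
        per_subspace = []
        for f, (mu, sigma) in stats.items():
            value = obj.get(f)
            if value is not None and sigma > 0:
                per_subspace.append(abs(value - mu) / sigma)
        # Combine by averaging over the subspaces available for this object.
        scores.append(sum(per_subspace) / len(per_subspace) if per_subspace else 0.0)
    return scores

# Hypothetical cross-matched catalogue; None marks a variable that was not recorded.
catalogue = [
    {"j_mag": 15.0, "k_mag": 14.0, "r_mag": 16.0},
    {"j_mag": 15.2, "k_mag": 14.2, "r_mag": None},
    {"j_mag": 15.1, "k_mag": None, "r_mag": 16.2},
    {"j_mag": 14.9, "k_mag": 13.9, "r_mag": 15.9},
    {"j_mag": 15.1, "k_mag": 14.1, "r_mag": None},
    {"j_mag": 21.0, "k_mag": 14.1, "r_mag": None},
]
scores = subspace_scores(catalogue)
most_anomalous = max(range(len(scores)), key=scores.__getitem__)
print(most_anomalous, [round(s, 2) for s in scores])
```

Averaging keeps objects with many recorded variables comparable to objects with few. Taking the maximum instead would favour objects that are extreme in any single subspace.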
Ensemble Methods for Anomaly Detection
Anomaly detection has many applications in numerous areas such as intrusion detection, fraud detection, and medical diagnosis. Most current techniques are specialized for detecting one type of anomaly, and work well on specific domains and when the data satisfies specific assumptions.
We address this problem, proposing ensemble anomaly detection techniques that perform well in many applications, with four major contributions: using bootstrapping to better detect anomalies on multiple subsamples, sequential application of diverse detection algorithms, a novel adaptive sampling and learning algorithm in which the anomalies are iteratively examined, and improving the random forest algorithms for detecting anomalies in streaming data.
We design and evaluate multiple ensemble strategies using score normalization, rank aggregation and majority voting, to combine the results from six well-known base algorithms. We propose a bootstrapping algorithm in which anomalies are evaluated from multiple subsets of the data. Results show that our independent ensemble performs better than the base algorithms, and using bootstrapping achieves competitive quality and faster runtime compared with existing works.
We develop new sequential ensemble algorithms in which the second algorithm performs anomaly detection based on the first algorithm's outputs; best results are obtained by combining algorithms that are substantially different. We propose a novel adaptive sampling algorithm which uses the score output of the base algorithm to determine the hard-to-detect examples, and iteratively resamples more points from such examples in a completely unsupervised context.
On streaming datasets, we analyze the impact of parameters used in random trees, and propose new algorithms that work well with high-dimensional data, improving performance without increasing the number of trees or their heights. We show that further improvements can be obtained with an Evolutionary Algorithm.
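The three combination strategies named in this abstract (score normalization, rank aggregation and majority voting) can be sketched generically as below. This is not the authors' implementation: min-max normalization, mean-rank aggregation, a top-`k` voting rule, and the three hypothetical detectors with their scores are all assumptions made for illustration.

```python
def min_max(scores):
    """Rescale one detector's scores to [0, 1] so detectors become comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def ranks(scores):
    """Rank 1 is the most anomalous (highest score); ties are broken by position."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def combine(detector_scores, top_k=1):
    """Combine per-detector scores in three ways (all rules are assumptions)."""
    normalized = [min_max(s) for s in detector_scores]
    mean_score = [sum(col) / len(col) for col in zip(*normalized)]
    all_ranks = [ranks(s) for s in detector_scores]
    mean_rank = [sum(col) / len(col) for col in zip(*all_ranks)]
    # A detector "votes" for a point if it ranks the point within its top_k.
    votes = [sum(1 for r in col if r <= top_k) for col in zip(*all_ranks)]
    majority = [v > len(detector_scores) / 2 for v in votes]
    return mean_score, mean_rank, majority

# Hypothetical raw scores from three detectors on four points (higher = more anomalous).
detector_scores = [
    [0.1, 0.2, 0.9, 0.3],
    [5.0, 7.0, 30.0, 6.0],
    [0.4, 0.9, 0.8, 0.3],
]
mean_score, mean_rank, majority = combine(detector_scores)
print(mean_rank, majority)
```

Rank aggregation ignores the scale of each detector's scores entirely, while score normalization keeps the relative margins. Which behaves better depends on how well calibrated the base detectors are.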
Applications and Modeling Techniques of Wind Turbine Power Curve for Wind Farms - A Review
In the wind energy industry, the power curve represents the relationship between the “wind speed” at the hub height and the corresponding “active power” to be generated. It is the most versatile condition indicator and of vital importance in several key applications, such as wind turbine selection, capacity factor estimation, wind energy assessment and forecasting, and condition monitoring, among others. Implementing these applications effectively mostly requires a modeling technique that best approximates the normal properties of optimal wind turbine operation in a particular wind farm. This challenge has drawn the attention of wind farm operators and researchers towards the “state of the art” in wind energy technology. This paper provides an exhaustive and updated review of power curve based applications, the most common anomaly and fault types including their root causes, along with data preprocessing and correction schemes (e.g., filtering, clustering, isolation, and others), and modeling techniques (i.e., parametric and non-parametric) which cover a wide range of algorithms. More than 100 references, for the most part selected from recently published journal articles, were carefully compiled to properly assess the past, present, and future research directions in this active domain.
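As a rough illustration of a parametric power curve and its use in condition monitoring, the sketch below uses a common simplified piecewise cubic model: zero output below cut-in and above cut-out, cubic growth up to the rated speed, and constant rated power beyond it. The review covers many models, and this is only one assumed example. The turbine parameters, the 20% tolerance, and the sample measurements are hypothetical.

```python
def cubic_power_curve(v, cut_in=3.0, rated_speed=12.0, cut_out=25.0, rated_power=2000.0):
    """Expected active power (kW) at hub-height wind speed v (m/s).

    Simplified piecewise cubic model; all default parameters are hypothetical.
    """
    if v < cut_in or v > cut_out:
        return 0.0  # turbine not producing
    if v >= rated_speed:
        return rated_power  # output is capped at rated power
    return rated_power * (v**3 - cut_in**3) / (rated_speed**3 - cut_in**3)

def flag_deviations(samples, tolerance=0.2, rated_power=2000.0):
    """Return indices of (wind_speed, power) samples far from the model curve.

    A sample is flagged when it deviates by more than tolerance * rated_power.
    """
    flagged = []
    for i, (v, p) in enumerate(samples):
        expected = cubic_power_curve(v, rated_power=rated_power)
        if abs(p - expected) > tolerance * rated_power:
            flagged.append(i)
    return flagged

# Hypothetical measurements: (wind speed in m/s, active power in kW).
measurements = [(2.0, 0.0), (8.0, 570.0), (10.0, 300.0), (14.0, 1990.0)]
print(flag_deviations(measurements))
```

A fixed tolerance band is the simplest possible rule. In practice the acceptable deviation usually varies with wind speed, which is one reason non-parametric models are also used.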