A taxonomy framework for unsupervised outlier detection techniques for multi-type data sets
The term "outlier" can generally be defined as an observation that is significantly different from
the other values in a data set. The outliers may be instances of error or indicate events. The
task of outlier detection aims at identifying such outliers in order to improve the analysis of
data and further discover interesting and useful knowledge about unusual events within numerous
application domains. In this paper, we report on contemporary unsupervised outlier detection
techniques for multiple types of data sets and provide a comprehensive taxonomy framework and
two decision trees for selecting the most suitable technique for a given data set. Furthermore, we
highlight the advantages, disadvantages and performance issues of each class of outlier detection
techniques under this taxonomy framework.
Detecting Outliers in Data with Correlated Measures
Advances in sensor technology have enabled the collection of large-scale
datasets. Such datasets can be extremely noisy and often contain a significant
number of outliers that result from sensor malfunction or human operation
faults. In order to utilize such data for real-world applications, it is
critical to detect outliers so that models built from these datasets will not
be skewed by outliers.
In this paper, we propose a new outlier detection method that utilizes the
correlations in the data (e.g., taxi trip distance vs. trip time). Unlike
existing outlier detection methods, we build a robust regression model
that explicitly models the outliers and detects outliers simultaneously with
the model fitting.
We validate our approach on real-world datasets against methods specifically
designed for each dataset as well as state-of-the-art outlier detectors.
Our outlier detection method achieves better performance, demonstrating the
robustness and generality of our method. Last, we report interesting case
studies on some outliers that result from atypical events.
Comment: 10 pages
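The abstract does not spell out the regression model, so the following is only a minimal sketch of one generic way to model outliers explicitly while fitting: a mean-shift formulation in which each observation gets its own shift term that is kept only when its residual is large. The function name `fit_mean_shift`, the `threshold` parameter, and the synthetic distance/time data are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def fit_mean_shift(x, y, threshold, n_iter=50):
    """Fit y ~ A @ beta + gamma, where gamma is a sparse per-point outlier shift.

    Hypothetical illustration: alternates ordinary least squares on the
    shift-corrected targets with hard thresholding of the residuals.
    """
    n = len(y)
    A = np.column_stack([x, np.ones(n)])  # predictors plus an intercept column
    gamma = np.zeros(n)
    for _ in range(n_iter):
        # Least-squares fit after subtracting the current outlier shifts.
        beta, *_ = np.linalg.lstsq(A, y - gamma, rcond=None)
        residual = y - A @ beta
        # Keep a shift only where the residual exceeds the threshold.
        gamma = np.where(np.abs(residual) > threshold, residual, 0.0)
    return beta, gamma

# Synthetic stand-in for correlated measures (trip distance vs. trip time).
rng = np.random.default_rng(0)
n = 200
distance = rng.uniform(1.0, 20.0, size=n)
time = 3.0 * distance + 2.0 + rng.normal(scale=0.5, size=n)
outlier_idx = rng.choice(n, size=10, replace=False)
time[outlier_idx] += 40.0  # inject gross outliers

beta, gamma = fit_mean_shift(distance, time, threshold=15.0)
flagged = np.flatnonzero(gamma)  # indices with a non-zero shift are flagged
```

In this sketch the non-zero entries of `gamma` serve as the detected outliers, so detection falls out of the fit itself rather than being a separate post-processing step.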
Scalable and Interpretable One-class SVMs with Deep Learning and Random Fourier features
One-class support vector machine (OC-SVM) has long been one of the
most effective anomaly detection methods and is extensively adopted in both
research and industrial applications. The biggest remaining issue for OC-SVM is
its limited capability to operate with large and high-dimensional datasets due to
optimization complexity. Those problems might be mitigated via dimensionality
reduction techniques such as manifold learning or autoencoders. However,
previous work often treats representation learning and anomaly prediction
separately. In this paper, we propose an autoencoder-based one-class support
vector machine (AE-1SVM) that brings OC-SVM, with the aid of random Fourier
features to approximate the radial basis kernel, into the deep learning context by
combining it with a representation learning architecture and jointly exploiting
stochastic gradient descent to obtain end-to-end training. Interestingly, this
also opens up the possible use of gradient-based attribution methods to explain
the decision making for anomaly detection, which has long been challenging as a
result of the implicit mappings between the input space and the kernel space.
To the best of our knowledge, this is the first work to study the
interpretability of deep learning in anomaly detection. We evaluate our method
on a wide range of unsupervised anomaly detection tasks in which our end-to-end
training architecture achieves significantly better performance than
previous work using separate training.
Comment: Accepted at European Conference on Machine Learning and Principles
and Practice of Knowledge Discovery in Databases (ECML-PKDD) 201
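As a rough illustration of the random Fourier feature idea mentioned in the abstract above, the sketch below builds an explicit feature map whose inner products approximate the radial basis kernel exp(-gamma * ||x - y||^2). It covers only the kernel approximation step, not the AE-1SVM architecture or its training; the function names, the `gamma` value, and the random data are assumptions chosen for the example.

```python
import numpy as np

def rff_features(X, n_features, gamma, rng):
    """Map X of shape (n, d) to random Fourier features for the RBF kernel."""
    d = X.shape[1]
    # Frequencies drawn from the kernel's spectral density, N(0, 2 * gamma * I).
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    # Random phases drawn uniformly from [0, 2 * pi).
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

def rbf_kernel(X, gamma):
    """Exact RBF kernel matrix exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
gamma = 0.1

Z = rff_features(X, n_features=5000, gamma=gamma, rng=rng)
K_exact = rbf_kernel(X, gamma)
K_approx = Z @ Z.T  # inner products of the explicit features
max_err = np.max(np.abs(K_exact - K_approx))
```

Because the feature map is explicit and differentiable, a linear model on `Z` can be trained with stochastic gradient descent, which is what makes this kind of approximation convenient to combine with a neural encoder.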
Classification methods for Hilbert data based on surrogate density
An unsupervised and a supervised classification approach for Hilbert random
curves are studied. Both rest on the use of a surrogate of the probability
density which is defined, in a distribution-free mixture context, from an
asymptotic factorization of the small-ball probability. That surrogate density
is estimated by a kernel approach from the principal components of the data.
The focus is on the illustration of the classification algorithms and the
computational implications, with particular attention to the tuning of the
parameters involved. Some asymptotic results are sketched. Applications to
simulated and real datasets show how the proposed methods work.
Comment: 33 pages, 11 figures, 6 tables