Detecting Outliers in Data with Correlated Measures
Advances in sensor technology have enabled the collection of large-scale
datasets. Such datasets can be extremely noisy and often contain a significant
number of outliers that result from sensor malfunction or human operation
faults. In order to utilize such data for real-world applications, it is
critical to detect outliers so that models built from these datasets will not
be skewed by outliers.
In this paper, we propose a new outlier detection method that utilizes the
correlations in the data (e.g., taxi trip distance vs. trip time). Unlike
existing outlier detection methods, we build a robust regression model
that explicitly models the outliers and detects outliers simultaneously with
the model fitting.
We validate our approach on real-world datasets against methods specifically
designed for each dataset as well as state-of-the-art outlier detectors.
Our outlier detection method achieves better performance, demonstrating the
robustness and generality of our method. Last, we report interesting case
studies on some outliers that result from atypical events.
Comment: 10 pages
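The idea of modeling outliers explicitly while fitting a regression can be illustrated with a minimal sketch. This is not the paper's method, whose details the abstract does not give. It assumes a generic formulation in which each observation gets its own outlier term `o` penalized by an L1 norm, solved by alternating least squares and soft-thresholding. The function names, the penalty `lam`, and the synthetic distance/duration data are all hypothetical.

```python
import numpy as np

def soft_threshold(r, lam):
    # Shrink residuals toward zero; entries smaller than lam become exactly 0.
    return np.sign(r) * np.maximum(np.abs(r) - lam, 0.0)

def fit_with_outlier_term(x, y, lam, n_iter=100):
    # Assumed model: y = slope * x + intercept + o + noise, with o sparse.
    # Minimizes 0.5 * ||y - X1 w - o||^2 + lam * ||o||_1 by alternating updates.
    X1 = np.column_stack([x, np.ones(len(y))])
    o = np.zeros_like(y, dtype=float)
    for _ in range(n_iter):
        w, *_ = np.linalg.lstsq(X1, y - o, rcond=None)  # refit on cleaned targets
        o = soft_threshold(y - X1 @ w, lam)             # re-estimate outlier terms
    return w, o

# Synthetic stand-in for correlated measures such as trip distance and trip time.
rng = np.random.default_rng(0)
distance = rng.uniform(1.0, 20.0, size=200)
duration = 3.0 * distance + 2.0 + rng.normal(0.0, 0.5, size=200)
duration[:5] += 40.0  # inject five gross outliers

w, o = fit_with_outlier_term(distance, duration, lam=3.0)
flagged = np.flatnonzero(o)  # observations with a nonzero outlier term
print("slope, intercept:", w)
print("flagged indices:", flagged)
```

In this sketch, detection and fitting happen in the same loop: a point is flagged exactly when its outlier term is nonzero, and the regression coefficients are estimated on targets with those terms subtracted.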
DxNAT - Deep Neural Networks for Explaining Non-Recurring Traffic Congestion
Non-recurring traffic congestion is caused by temporary disruptions, such as
accidents, sports games, adverse weather, etc. We use data related to real-time
traffic speed, jam factors (a traffic congestion indicator), and events
collected over a year from Nashville, TN to train a multi-layered deep neural
network. The traffic dataset contains over 900 million data records. The
network is thereafter used to classify the real-time data and identify
anomalous operations. Compared with traditional approaches of using statistical
or machine learning techniques, our model reaches an accuracy of 98.73 percent
when identifying traffic congestion caused by football games. Our approach
first encodes the traffic across a region as a scaled image. After that, the
image data from different timestamps is fused with event- and time-related
data. Then a crossover operator is used as a data augmentation method to
generate training datasets with more balanced classes. Finally, we use the
receiver operating characteristic (ROC) analysis to tune the sensitivity of the
classifier. We present the analysis of the training time and the inference time
separately.
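The ROC-based sensitivity tuning mentioned above can be sketched as follows. The abstract does not say which operating point the authors choose, so this example assumes one common criterion, maximizing Youden's J statistic (true positive rate minus false positive rate). The function names and the toy labels and scores are hypothetical.

```python
import numpy as np

def roc_points(labels, scores):
    # Sort by descending score and accumulate true/false positive counts.
    order = np.argsort(-scores, kind="stable")
    labels, scores = labels[order], scores[order]
    tp = np.cumsum(labels == 1)
    fp = np.cumsum(labels == 0)
    # Keep the last index of each distinct score so tied scores move together.
    last = np.r_[np.flatnonzero(np.diff(scores)), len(scores) - 1]
    tpr = tp[last] / max(tp[-1], 1)
    fpr = fp[last] / max(fp[-1], 1)
    return fpr, tpr, scores[last]

def pick_threshold(labels, scores):
    # Assumed criterion: maximize Youden's J = TPR - FPR.
    fpr, tpr, thresholds = roc_points(labels, scores)
    best = np.argmax(tpr - fpr)
    return thresholds[best], tpr[best], fpr[best]

# Toy classifier outputs: 1 = event-related congestion, 0 = everything else.
labels = np.array([0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.80, 0.90])

thr, tpr, fpr = pick_threshold(labels, scores)
print(thr, tpr, fpr)  # samples with score >= thr are classified as positive
```

Lowering the threshold raises sensitivity at the cost of more false alarms. A different criterion, such as a fixed false positive budget, would select a different point on the same curve.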
Anomaly Detection in Categorical Datasets with Artificial Contrasts
An anomaly is a deviation from the normal behavior of a system, and anomaly detection techniques try to identify unusual instances based on their deviation from normal data. In this work, I propose a machine-learning algorithm, referred to as Artificial Contrasts, for anomaly detection in categorical data in which neither the dimension, the specific attributes involved, nor the form of the pattern is known a priori. I use the Random Forest (RF) technique as an effective learner for artificial contrasts. RF is a powerful algorithm that can handle relations among attributes in high-dimensional data and detect anomalies while providing probability estimates for risk decisions.
I apply the model to two simulated data sets and one real data set. The model detected anomalies with very high accuracy. Finally, by comparing the proposed model with other models in the literature, I demonstrate the superior performance of the proposed model.
Dissertation/Thesis: Masters Thesis, Industrial Engineering 201
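A rough sketch of how an artificial contrast data set might be built is given below. The abstract does not describe the construction, so this assumes one common choice: sample every attribute independently from its empirical marginal, which preserves per-attribute frequencies but destroys relations between attributes. A supervised learner (Random Forest in the thesis) would then be trained to separate real rows from artificial ones. That training step is omitted here, and all names and the toy data are hypothetical.

```python
import numpy as np

def make_artificial_contrast(data, rng):
    # Assumed construction: resample each column on its own, with replacement,
    # so marginals are kept while dependencies between columns are broken.
    n, d = data.shape
    contrast = np.empty_like(data)
    for j in range(d):
        contrast[:, j] = rng.choice(data[:, j], size=n, replace=True)
    return contrast

rng = np.random.default_rng(0)

# Toy categorical data with a hidden pattern: column 1 always equals column 0.
a = rng.integers(0, 3, size=500)
real = np.column_stack([a, a])

fake = make_artificial_contrast(real, rng)

# Labeled training set for a real-vs-artificial classifier (1 = real, 0 = artificial).
X = np.vstack([real, fake])
y = np.r_[np.ones(len(real), dtype=int), np.zeros(len(fake), dtype=int)]

print("pattern holds in real data:", (real[:, 0] == real[:, 1]).mean())
print("pattern holds in contrast: ", (fake[:, 0] == fake[:, 1]).mean())
```

Because the pattern holds in the real rows but only by chance in the artificial ones, a classifier can learn it without being told which attributes are involved. New rows that the classifier scores as likely artificial are then candidates for anomalies.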
A taxonomy framework for unsupervised outlier detection techniques for multi-type data sets
The term "outlier" can generally be defined as an observation that is significantly different from
the other values in a data set. The outliers may be instances of error or indicate events. The
task of outlier detection aims at identifying such outliers in order to improve the analysis of
data and further discover interesting and useful knowledge about unusual events within numerous
application domains. In this paper, we report on contemporary unsupervised outlier detection
techniques for multiple types of data sets and provide a comprehensive taxonomy framework and
two decision trees to select the most suitable technique based on the data set. Furthermore, we
highlight the advantages, disadvantages and performance issues of each class of outlier detection
techniques under this taxonomy framework.