1,131 research outputs found
A taxonomy framework for unsupervised outlier detection techniques for multi-type data sets
The term "outlier" can generally be defined as an observation that is significantly different from
the other values in a data set. The outliers may be instances of error or indicate events. The
task of outlier detection aims at identifying such outliers in order to improve the analysis of
data and further discover interesting and useful knowledge about unusual events within numerous
applications domains. In this paper, we report on contemporary unsupervised outlier detection
techniques for multiple types of data sets and provide a comprehensive taxonomy framework and
two decision trees to select the most suitable technique based on data set. Furthermore, we
highlight the advantages, disadvantages and performance issues of each class of outlier detection
techniques under this taxonomy framework
Feature discovery and visualization of robot mission data using convolutional autoencoders and Bayesian nonparametric topic models
The gap between our ability to collect interesting data and our ability to
analyze these data is growing at an unprecedented rate. Recent algorithmic
attempts to fill this gap have employed unsupervised tools to discover
structure in data. Some of the most successful approaches have used
probabilistic models to uncover latent thematic structure in discrete data.
Despite the success of these models on textual data, they have not generalized
as well to image data, in part because of the spatial and temporal structure
that may exist in an image stream.
We introduce a novel unsupervised machine learning framework that
incorporates the ability of convolutional autoencoders to discover features
from images that directly encode spatial information, within a Bayesian
nonparametric topic model that discovers meaningful latent patterns within
discrete data. By using this hybrid framework, we overcome the fundamental
dependency of traditional topic models on rigidly hand-coded data
representations, while simultaneously encoding spatial dependency in our topics
without adding model complexity. We apply this model to the motivating
application of high-level scene understanding and mission summarization for
exploratory marine robots. Our experiments on a seafloor dataset collected by a
marine robot show that the proposed hybrid framework outperforms current
state-of-the-art approaches on the task of unsupervised seafloor terrain
characterization.Comment: 8 page
On the Nature and Types of Anomalies: A Review
Anomalies are occurrences in a dataset that are in some way unusual and do
not fit the general patterns. The concept of the anomaly is generally
ill-defined and perceived as vague and domain-dependent. Moreover, despite some
250 years of publications on the topic, no comprehensive and concrete overviews
of the different types of anomalies have hitherto been published. By means of
an extensive literature review this study therefore offers the first
theoretically principled and domain-independent typology of data anomalies, and
presents a full overview of anomaly types and subtypes. To concretely define
the concept of the anomaly and its different manifestations, the typology
employs five dimensions: data type, cardinality of relationship, anomaly level,
data structure and data distribution. These fundamental and data-centric
dimensions naturally yield 3 broad groups, 9 basic types and 61 subtypes of
anomalies. The typology facilitates the evaluation of the functional
capabilities of anomaly detection algorithms, contributes to explainable data
science, and provides insights into relevant topics such as local versus global
anomalies.Comment: 38 pages (30 pages content), 10 figures, 3 tables. Preprint; review
comments will be appreciated. Improvements in version 2: Explicit mention of
fifth anomaly dimension; Added section on explainable anomaly detection;
Added section on variations on the anomaly concept; Various minor additions
and improvement
Mining sensor datasets with spatiotemporal neighborhoods
Many spatiotemporal data mining methods are dependent on how relationships between a spatiotemporal unit and its neighbors are defined. These relationships are often termed the neighborhood of a spatiotemporal object. The focus of this paper is the discovery of spatiotemporal neighborhoods to find automatically spatiotemporal sub-regions in a sensor dataset. This research is motivated by the need to characterize large sensor datasets like those found in oceanographic and meteorological research. The approach presented in this paper finds spatiotemporal neighborhoods in sensor datasets by combining an agglomerative method to create temporal intervals and a graph-based method to find spatial neighborhoods within each temporal interval. These methods were tested on real-world datasets including (a) sea surface temperature data from the Tropical Atmospheric Ocean Project (TAO) array in the Equatorial Pacific Ocean and (b) NEXRAD precipitation data from the Hydro-NEXRAD system. The results were evaluated based on known patterns of the phenomenon being measured. Furthermore the results were quantified by performing hypothesis testing to establish the statistical significance using Monte Carlo simulations. The approach was also compared with existing approaches using validation metrics namely spatial autocorrelation and temporal interval dissimilarity. The results of these experiments show that our approach indeed identifies highly refined spatiotemporal neighborhoods
Effective anomaly detection in sensor networks data streams
This paper addresses a major challenge in data mining applications where the full information about the underlying processes, such as sensor networks or large online database, cannot be practically obtained due to physical limitations such as low bandwidth or memory, storage, or computing power. Motivated by the recent theory on direct information sampling called compressed sensing (CS), we propose a framework for detecting anomalies from these largescale data mining applications where the full information is not practically possible to obtain. Exploiting the fact that the intrinsic dimension of the data in these applications are typically small relative to the raw dimension and the fact that compressed sensing is capable of capturing most information with few measurements, our work show that spectral methods that used for volume anomaly detection can be directly applied to the CS data with guarantee on performance. Our theoretical contributions are supported by extensive experimental results on large datasets which show satisfactory performance.<br /
- …