6 research outputs found
Wisdom of the Contexts: Active Ensemble Learning for Contextual Anomaly Detection
In contextual anomaly detection (CAD), an object is only considered anomalous
within a specific context. Most existing methods for CAD use a single context
based on a set of user-specified contextual features. However, identifying the
right context can be very challenging in practice, especially in datasets with
a large number of attributes. Furthermore, in real-world systems, there might
be multiple anomalies that occur in different contexts and, therefore, require
a combination of several "useful" contexts to unveil them. In this work, we
leverage active learning and ensembles to effectively detect complex contextual
anomalies in situations where the true contextual and behavioral attributes are
unknown. We propose a novel approach, called WisCon (Wisdom of the Contexts),
that automatically creates contexts from the feature set. Our method constructs
an ensemble of multiple contexts, with varying importance scores, based on the
assumption that not all useful contexts are equally informative. Experiments show that
WisCon significantly outperforms existing baselines in different categories
(i.e., active classifiers, unsupervised contextual and non-contextual anomaly
detectors, and supervised classifiers) on seven datasets. Furthermore, the
results support our initial hypothesis that there is no single perfect context
that successfully uncovers all kinds of contextual anomalies, and leveraging
the "wisdom" of multiple contexts is necessary. Comment: Submitted to IEEE TKD
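The core ensemble idea, scoring each object against reference groups defined by several candidate contexts and combining the per-context scores with importance weights, can be sketched as follows. This is a minimal illustration, not the WisCon algorithm itself: the z-score-based context scorer, the fixed weights, and all helper names are assumptions for the sketch (WisCon learns context importances via active learning).

```python
# Minimal sketch of a multi-context anomaly ensemble (illustrative only;
# helper names, the z-score scorer, and the fixed weights are assumptions).
from statistics import mean, pstdev
from collections import defaultdict

def context_scores(data, ctx_key, beh_key):
    """Anomaly score = |z-score| of the behavioral value within the
    group of objects sharing the same contextual value."""
    groups = defaultdict(list)
    for row in data:
        groups[row[ctx_key]].append(row[beh_key])
    stats = {k: (mean(v), pstdev(v) or 1.0) for k, v in groups.items()}
    return [abs(row[beh_key] - stats[row[ctx_key]][0]) / stats[row[ctx_key]][1]
            for row in data]

def ensemble_scores(data, contexts, beh_key, weights):
    """Importance-weighted combination of per-context scores."""
    per_ctx = [context_scores(data, c, beh_key) for c in contexts]
    total = sum(weights)
    return [sum(w * s[i] for w, s in zip(weights, per_ctx)) / total
            for i in range(len(data))]

data = [
    {"season": "summer", "region": "north", "temp": 30},
    {"season": "summer", "region": "north", "temp": 31},
    {"season": "summer", "region": "north", "temp": 5},   # contextual anomaly
    {"season": "winter", "region": "north", "temp": 4},
    {"season": "winter", "region": "north", "temp": 6},
]
scores = ensemble_scores(data, ["season", "region"], "temp", weights=[0.8, 0.2])
top = max(range(len(scores)), key=scores.__getitem__)
print(top)
```

Note that 5 degrees is unremarkable globally (the "region" context alone misses it) but anomalous given the "season" context, which is why a single fixed context can fail.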
Improved spatial outlier detection method within a river network
A spatial outlier refers to an observation whose non-spatial attribute values are significantly different from those of its neighbours. Such observations can also be found in water quality data at monitoring stations within a river network. However, existing spatial outlier detection procedures based on distance measures such as the Euclidean distance between monitoring stations do not take into account the river network topology. In general, water quality levels in lower streams will be affected by the flow from the upper streams. Similarly, the water quality at some tributaries may have little influence on the other tributaries. Hence, a method for identifying spatial outliers in a river network, taking into account the effect of river flow connectivity on the determination of the neighbours of the monitoring stations, is proposed. While the robust Mahalanobis distance is used in both methods, the proposed method uses river distance instead of the Euclidean distance. The performance of the proposed method is shown, through simulation on a synthetic river dataset, to be superior. For illustration, we apply the proposed method to the water quality data from Sg. Klang Basin in 2016 provided by the Department of Environment, Malaysia. The method provides a better identification of stations whose water quality significantly differs from that of their neighbouring stations. Such information is useful for the authorities in their planning of the environmental monitoring of water quality in the areas.
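The key structural idea above, restricting a station's neighbourhood to flow-connected stations rather than Euclidean-nearest ones before scoring deviations, can be sketched as below. This is a simplified stand-in: the toy network, the single water-quality attribute, and the median/MAD score are illustrative assumptions, whereas the paper uses a multivariate robust Mahalanobis distance with river-distance-based neighbours.

```python
# Sketch: spatial outlier scoring with neighbours defined by river flow
# connectivity (toy network and median/MAD scoring are assumptions; the
# paper itself uses a robust Mahalanobis distance).
from statistics import median

def mad(xs):
    """Median absolute deviation, floored to avoid division by zero."""
    m = median(xs)
    return median(abs(x - m) for x in xs) or 1.0

def spatial_outlier_scores(values, upstream):
    """Score each station by how far its value lies from its
    flow-connected (upstream) neighbours, in robust units."""
    scores = {}
    for station, nbrs in upstream.items():
        if not nbrs:
            scores[station] = 0.0  # headwater station: no upstream reference
            continue
        nbr_vals = [values[n] for n in nbrs]
        scores[station] = abs(values[station] - median(nbr_vals)) / mad(nbr_vals)
    return scores

# Toy network: stations A and B flow into C; C flows into D.
upstream = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"]}
do_mg_l = {"A": 7.9, "B": 8.1, "C": 3.0, "D": 3.1}  # dissolved oxygen readings
scores = spatial_outlier_scores(do_mg_l, upstream)
print(scores)
```

Here C is flagged: its reading drops sharply relative to its upstream neighbours, while D is unremarkable because its only flow-connected neighbour, C, already carries the degraded water. A Euclidean neighbourhood could instead compare C against an unconnected tributary and mask this.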
Homophily Outlier Detection in Non-IID Categorical Data
Most existing outlier detection methods assume that the outlier factors
(i.e., outlierness scoring measures) of data entities (e.g., feature values and
data objects) are Independent and Identically Distributed (IID). This
assumption does not hold in real-world applications where the outlierness of
different entities is dependent on each other and/or taken from different
probability distributions (non-IID). This can lead to a failure to detect
important outliers that are too subtle to be identified without considering the
non-IID nature. The issue is further intensified in more challenging contexts,
e.g., high-dimensional data with many noisy features. This work introduces a
novel outlier detection framework and its two instances to identify outliers in
categorical data by capturing non-IID outlier factors. Our approach first
defines and incorporates distribution-sensitive outlier factors and their
interdependence into a value-value graph-based representation. It then models
an outlierness propagation process in the value graph to learn the outlierness
of feature values. The learned value outlierness allows for either direct
outlier detection or outlying feature selection. The graph representation and
mining approach is employed here to effectively capture the rich non-IID
characteristics. Our empirical results on 15 real-world data sets with
different levels of data complexities show that (i) the proposed outlier
detection methods significantly outperform five state-of-the-art methods at the
95%/99% confidence level, achieving 10%-28% AUC improvement on the 10 most
complex data sets; and (ii) the proposed feature selection methods
significantly outperform three competing methods in enabling subsequent outlier
detection by two different existing detectors. Comment: To appear in the Data Mining and Knowledge Discovery Journal
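The propagation step described above, learning the outlierness of categorical feature values over a value-value co-occurrence graph, can be sketched as follows. This is an illustrative simplification, not either of the paper's two instances: the frequency-based initialisation, the damped averaging update, and the damping factor are assumptions made for the sketch.

```python
# Sketch of outlierness propagation on a value-value graph (simplified
# stand-in for the framework above; the initialisation, update rule, and
# alpha=0.5 damping factor are illustrative assumptions).
from collections import Counter, defaultdict

def value_outlierness(records, iters=20, alpha=0.5):
    n = len(records)
    freq = Counter(v for rec in records for v in rec)
    # Initial outlierness: rarer feature values are more outlying.
    score = {v: 1.0 - freq[v] / n for v in freq}
    # Edge weights: co-occurrence counts between feature values.
    co = defaultdict(Counter)
    for rec in records:
        for a in rec:
            for b in rec:
                if a != b:
                    co[a][b] += 1
    # Propagate: each value's score mixes its own score with the
    # weighted average score of its co-occurring neighbours.
    for _ in range(iters):
        new = {}
        for v in score:
            total = sum(co[v].values()) or 1
            nbr = sum(w * score[u] for u, w in co[v].items()) / total
            new[v] = (1 - alpha) * score[v] + alpha * nbr
        score = new
    return score

records = [("red", "round")] * 3 + [("blue", "square")] * 4 + [("green", "round")]
s = value_outlierness(records)
# An object's score can then be, e.g., the mean outlierness of its values.
obj = {rec: sum(s[v] for v in rec) / len(rec) for rec in set(records)}
top = max(obj, key=obj.get)
print(top)
```

The rare value "green" stays outlying after propagation, and the interdependence between values is what raw per-value frequency alone cannot capture; this non-IID coupling between outlier factors is the point of the graph representation.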