506 research outputs found
Efficient Computation of Multiple Density-Based Clustering Hierarchies
HDBSCAN*, a state-of-the-art density-based hierarchical clustering method,
produces a hierarchical organization of clusters in a dataset w.r.t. a
parameter mpts. While the performance of HDBSCAN* is robust w.r.t. mpts in the
sense that a small change in mpts typically leads to only a small or no change
in the clustering structure, choosing a "good" mpts value can be challenging:
depending on the data distribution, a high or low value for mpts may be more
appropriate, and certain data clusters may reveal themselves at different
values of mpts. To explore results for a range of mpts values, however, one has
to run HDBSCAN* for each value in the range independently, which is
computationally inefficient. In this paper, we propose an efficient approach to
compute all HDBSCAN* hierarchies for a range of mpts values by replacing the
graph used by HDBSCAN* with a much smaller graph that is guaranteed to contain
the required information. An extensive experimental evaluation shows that with
our approach one can obtain over one hundred hierarchies for the computational
cost equivalent to running HDBSCAN* about 2 times.Comment: A short version of this paper appears at IEEE ICDM 2017. Corrected
typos. Revised abstrac
Unsupervised Anomaly Detection of High Dimensional Data with Low Dimensional Embedded Manifold
Anomaly detection techniques are supposed to identify anomalies from loads of seemingly homogeneous data and being able to do so can lead us to timely, pivotal and actionable decisions, saving us from potential human, financial and informational loss. In anomaly detection, an often encountered situation is the absence of prior knowledge about the nature of anomalies. Such circumstances advocate for āunsupervisedā learning-based anomaly detection techniques. Compared to its āsupervisedā counterpart, which possesses the luxury to utilize a labeled training dataset containing both normal and anomalous samples, unsupervised problems are far more difficult. Moreover, high dimensional streaming data from tons of interconnected sensors present in modern day industries makes the task more challenging. To carry out an investigative effort to address these challenges is the overarching theme of this dissertation.
In this dissertation, the fundamental issue of similarity measure among observations, which is a central piece in any anomaly detection techniques, is reassessed. Manifold hypotheses suggests the possibility of low dimensional manifold structure embedded in high dimensional data. In the presence of such structured space, traditional similarity measures fail to measure the true intrinsic similarity. In light of this revelation, reevaluating the notion of similarity measure seems more pressing rather than providing incremental improvements over any of the existing techniques. A graph theoretic similarity measure is proposed to differentiate and thus identify the anomalies from normal observations. Specifically, the minimum spanning tree (MST), a graph-based approach is proposed to approximate the similarities among data points in the presence of high dimensional structured space. It can track the structure of the embedded manifold better than the existing measures and help to distinguish the anomalies from normal observations. This dissertation investigates further three different aspects of the anomaly detection problem and develops three sets of solution approaches with all of them revolving around the newly proposed MST based similarity measure.
In the first part of the dissertation, a local MST (LoMST) based anomaly detection approach is proposed to detect anomalies using the data in the original space. A two-step procedure is developed to detect both cluster and point anomalies. The next two sets of methods are proposed in the subsequent two parts of the dissertation, for anomaly detection in reduced data space. In the second part of the dissertation, a neighborhood structure assisted version of the nonnegative matrix factorization approach (NS-NMF) is proposed. To detect anomalies, it uses the neighborhood information captured by a sparse MST similarity matrix along with the original attribute information. To meet the industry demands, the online version of both LoMST and NS-NMF is also developed for real-time anomaly detection. In the last part of the dissertation, a graph regularized autoencoder is proposed which uses an MST regularizer in addition to the original loss function and is thus capable of maintaining the local invariance property. All of the approaches proposed in the dissertation are tested on 20 benchmark datasets and one real-life hydropower dataset. When compared with the state of art approaches, all three approaches produce statistically significant better outcomes.
āIndustry 4.0ā is a reality now and it calls for anomaly detection techniques capable of processing a large amount of high dimensional data generated in real-time. The proposed MST based similarity measure followed by the individual techniques developed in this dissertation are equipped to tackle each of these issues and provide an effective and reliable real-time anomaly identification platform
Testing for voter rigging in small polling stations
Since the 1970s there has been a large number of countries that combine
formal democratic institutions with authoritarian practices. Although in such
countries the ruling elites may receive considerable voter support they often
employ several manipulation tools to control election outcomes. A common
practice of these regimes is the coercion and mobilization of a significant
amount of voters to guarantee the electoral victory. This electoral
irregularity is known as voter rigging, distinguishing it from vote rigging,
which involves ballot stuffing or stealing. Here we develop a statistical test
to quantify to which extent the results of a particular election display traces
of voter rigging. Our key hypothesis is that small polling stations are more
susceptible to voter rigging, because it is easier to identify opposing
individuals, there are less eye witnesses, and supposedly less visits from
election observers. We devise a general statistical method for testing whether
voting behavior in small polling stations is significantly different from the
behavior of their neighbor stations in a way that is consistent with the
widespread occurrence of voter rigging. Based on a comparative analysis, the
method enables to rule out whether observed differences in voting behavior
might be explained by geographic heterogeneities in vote preferences. We
analyze 21 elections in ten different countries and find significant anomalies
compatible with voter rigging in Russia from 2007-2011, in Venezuela from
2006-2013, and in Uganda 2011. Particularly disturbing is the case of Venezuela
where these distortions have been outcome-determinative in the 2013
presidential elections
Towards outlier detection for high-dimensional data streams using projected outlier analysis strategy
[Abstract]: Outlier detection is an important research problem in data mining that aims to discover useful abnormal and irregular patterns hidden in large data sets. Most existing outlier detection methods only deal with static data with relatively low dimensionality.
Recently, outlier detection for high-dimensional stream data became a new emerging research problem. A key observation that motivates this research is that outliers
in high-dimensional data are projected outliers, i.e., they are embedded in lower-dimensional subspaces. Detecting projected outliers from high-dimensional stream
data is a very challenging task for several reasons. First, detecting projected outliers is difficult even for high-dimensional static data. The exhaustive search for the out-lying subspaces where projected outliers are embedded is a NP problem. Second, the algorithms for handling data streams are constrained to take only one pass to process the streaming data with the conditions of space limitation and time criticality. The currently existing methods for outlier detection are found to be ineffective for detecting projected outliers in high-dimensional data streams.
In this thesis, we present a new technique, called the Stream Project Outlier deTector (SPOT), which attempts to detect projected outliers in high-dimensional
data streams. SPOT employs an innovative window-based time model in capturing dynamic statistics from stream data, and a novel data structure containing a set of
top sparse subspaces to detect projected outliers effectively. SPOT also employs a multi-objective genetic algorithm as an effective search method for finding the
outlying subspaces where most projected outliers are embedded. The experimental results demonstrate that SPOT is efficient and effective in detecting projected outliers
for high-dimensional data streams. The main contribution of this thesis is that it provides a backbone in tackling the challenging problem of outlier detection for high-
dimensional data streams. SPOT can facilitate the discovery of useful abnormal patterns and can be potentially applied to a variety of high demand applications, such as for sensor network data monitoring, online transaction protection, etc
- ā¦