On Parallelization of Categorical Data Clustering
We study parallelization of categorical data clustering algorithms on an MPI platform. Clustering such data has been a daunting task even for sequential algorithms, mainly due to the challenges in finding suitable similarity/distance measures. We propose a parallel version of the k-modes algorithm, called PV3, which maintains the same clustering quality as produced by the sequential approach while achieving reasonable speed-ups. PV3 is programmed to ensure deterministic processing in a parallel environment. To produce better clustering results, we then develop an initialization method called Revised Density Method (RDM) based on the notion of density. Additionally, we develop variants of the RDM method to further enhance its performance. We then study effective ways to parallelize RDM and its variants. To further exploit parallelism opportunities, we develop an Ensemble Parallelizing Process (EPP) framework. This framework can be used with any desired initialization/clustering algorithms with different levels of parallelism. Using our different RDM initialization techniques along with the PV3 algorithm in the EPP framework, we then build an RDM realization of EPP, called RDM EPP. The results of our numerous experiments using benchmark categorical datasets indicate that the quality metric of RDM EPP places it among the top three when ranked against sequential k-modes based clustering algorithms. In terms of speed-up, the results indicate that RDM EPP is 7 times faster for some datasets, though much larger datasets are required for a more comprehensive scalability study of RDM EPP.
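For readers unfamiliar with k-modes, the following is a minimal sequential sketch in Python, not the PV3 implementation described in the abstract. It assumes Hamming distance as the dissimilarity measure and uses fixed tie-breaking rules so that repeated runs give identical results; the function names (`hamming`, `assign`, `update_modes`, `k_modes`) and the tie-breaking choices are illustrative assumptions, and the initial modes are supplied by the caller rather than computed by RDM.

```python
from collections import Counter

def hamming(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def assign(records, modes):
    """Assign each record to its nearest mode; ties go to the lowest mode index."""
    return [min(range(len(modes)), key=lambda j: (hamming(r, modes[j]), j))
            for r in records]

def update_modes(records, labels, modes):
    """Recompute each mode as the most frequent value of every attribute."""
    new_modes = []
    for j, old in enumerate(modes):
        members = [r for r, label in zip(records, labels) if label == j]
        if not members:
            new_modes.append(old)  # keep the old mode for an empty cluster
            continue
        mode = []
        for column in zip(*members):
            counts = Counter(column)
            best = max(counts.values())
            # Assumed tie-break: the smallest value among the most frequent ones.
            mode.append(min(v for v, c in counts.items() if c == best))
        new_modes.append(tuple(mode))
    return new_modes

def k_modes(records, initial_modes, max_iter=100):
    """Alternate assignment and mode update until nothing changes."""
    modes = [tuple(m) for m in initial_modes]
    labels = assign(records, modes)
    for _ in range(max_iter):
        new_modes = update_modes(records, labels, modes)
        new_labels = assign(records, new_modes)
        if new_labels == labels and new_modes == modes:
            break
        modes, labels = new_modes, new_labels
    return labels, modes
```

The assignment step treats every record independently, which is the part a data-parallel version would most naturally split across processes; how PV3 actually partitions the work is not stated in the abstract.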
HIERARCHICAL CLUSTERING USING LEVEL SETS
Over the past several decades, clustering algorithms have earned their place as a go-to solution for database mining. This paper introduces a new concept which is used to develop a new recursive version of DBSCAN that can successfully perform hierarchical clustering, called Level-Set Clustering (LSC). A level-set is a subset of points of a data-set whose densities are greater than some threshold, ‘t’. When the size of each level-set is graphed against its respective ‘t’, indents appear in the line graph that correspond to clusters in the data-set, as the points in a cluster have very similar densities. This new algorithm is able to produce the clustering result with the same O(n log n) time complexity as DBSCAN and OPTICS, while catching clusters the others missed.
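The level-set definition above can be illustrated with a short Python sketch. It is not the LSC algorithm itself: the density estimate (the number of other points within a radius `eps`) is an assumption borrowed from DBSCAN-style neighbourhood counting, the function names are hypothetical, and the brute-force neighbour search runs in O(n^2) rather than the O(n log n) quoted in the abstract.

```python
from math import dist

def densities(points, eps):
    """Assumed density estimate: number of other points within distance eps."""
    return [sum(1 for j, q in enumerate(points) if j != i and dist(p, q) <= eps)
            for i, p in enumerate(points)]

def level_set(points, dens, t):
    """Points whose density is strictly greater than the threshold t."""
    return [p for p, d in zip(points, dens) if d > t]

def level_set_sizes(dens):
    """(t, size of level set) for every integer t from 0 to the maximum density.

    Assumes dens is non-empty.
    """
    return [(t, sum(d > t for d in dens)) for t in range(max(dens) + 1)]
```

Plotting the pairs returned by `level_set_sizes` gives the line graph the abstract refers to; a run of thresholds over which the size barely changes suggests a group of points with similar densities.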
A taxonomy framework for unsupervised outlier detection techniques for multi-type data sets
The term "outlier" can generally be defined as an observation that is significantly different from
the other values in a data set. The outliers may be instances of error or indicate events. The
task of outlier detection aims at identifying such outliers in order to improve the analysis of
data and further discover interesting and useful knowledge about unusual events within numerous
application domains. In this paper, we report on contemporary unsupervised outlier detection
techniques for multiple types of data sets and provide a comprehensive taxonomy framework and
two decision trees to select the most suitable technique based on the data set. Furthermore, we
highlight the advantages, disadvantages and performance issues of each class of outlier detection
techniques under this taxonomy framework.
Optimal Clustering Framework for Hyperspectral Band Selection
Band selection, by choosing a set of representative bands in a hyperspectral
image (HSI), is an effective method to reduce the redundant information without
compromising the original contents. Recently, various unsupervised band
selection methods have been proposed, but most of them are based on
approximation algorithms which can only obtain suboptimal solutions toward a
specific objective function. This paper focuses on clustering-based band
selection, and proposes a new framework to solve the above dilemma, claiming
the following contributions: 1) An optimal clustering framework (OCF), which
can obtain the optimal clustering result for a particular form of objective
function under a reasonable constraint. 2) A rank on clusters strategy (RCS),
which provides an effective criterion to select bands on an existing clustering
structure. 3) An automatic method to determine the number of the required
bands, which can better evaluate the distinctive information produced by
a certain number of bands. In experiments, the proposed algorithm is compared to
some state-of-the-art competitors. According to the experimental results, the
proposed algorithm is robust and significantly outperforms the other methods on
various data sets.