99,118 research outputs found
Improved k-means clustering using principal component analysis and imputation methods for breast cancer dataset
Data mining techniques have been used to analyse pattern from data sets in order to derive useful information. Classification of data sets into clusters is one of the essential process for data manipulation. One of the most popular and efficient clustering methods is K-means method. However, the K-means clustering method has some difficulties in the analysis of high dimension data sets with the presence of missing values. Moreover, previous studies showed that high dimensionality of the feature in data set presented poses different problems for K-means clustering. For missing value problem, imputation method is needed to minimise the effect of incomplete high dimensional data sets in K-means clustering process. This research studies the effect of imputation algorithm and dimensionality reduction techniques on the performance of K-means clustering. Three imputation methods are implemented for the missing value estimation which are K-nearest neighbours (KNN), Least Local Square (LLS), and Bayesian Principle Component Analysis (BPCA). Principal Component Analysis (PCA) is a dimension reduction method that has a dimensional reduction capability by removing the unnecessary attribute of high dimensional data sets. Hence, PCA hybrid with K-means (PCA K-means) is proposed to give a better clustering result. The experimental process was performed by using Wisconsin Breast Cancer. By using LLS imputation method, the proposed hybrid PCA K-means outperformed the standard Kmeans clustering based on the results for breast cancer data set; in terms of clustering accuracy (0.29%) and computing time (95.76%)
Dynamic load balancing in parallel KD-tree k-means
One among the most influential and popular data mining methods is the k-Means algorithm for cluster analysis.
Techniques for improving the efficiency of k-Means have been
largely explored in two main directions. The amount of computation can be significantly reduced by adopting geometrical constraints and an efficient data structure, notably a multidimensional binary search tree (KD-Tree). These techniques allow to reduce the number of distance computations the algorithm performs at each iteration. A second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance. This issue has so far limited the adoption of these efficient k-Means variants in parallel computing environments. In this work, we provide a parallel formulation of the KD-Tree based k-Means algorithm for distributed memory systems and address its load balancing
issue. Three solutions have been developed and tested. Two
approaches are based on a static partitioning of the data set and a third solution incorporates a dynamic load balancing policy
Image classification by visual bag-of-words refinement and reduction
This paper presents a new framework for visual bag-of-words (BOW) refinement
and reduction to overcome the drawbacks associated with the visual BOW model
which has been widely used for image classification. Although very influential
in the literature, the traditional visual BOW model has two distinct drawbacks.
Firstly, for efficiency purposes, the visual vocabulary is commonly constructed
by directly clustering the low-level visual feature vectors extracted from
local keypoints, without considering the high-level semantics of images. That
is, the visual BOW model still suffers from the semantic gap, and thus may lead
to significant performance degradation in more challenging tasks (e.g. social
image classification). Secondly, typically thousands of visual words are
generated to obtain better performance on a relatively large image dataset. Due
to such large vocabulary size, the subsequent image classification may take
sheer amount of time. To overcome the first drawback, we develop a graph-based
method for visual BOW refinement by exploiting the tags (easy to access
although noisy) of social images. More notably, for efficient image
classification, we further reduce the refined visual BOW model to a much
smaller size through semantic spectral clustering. Extensive experimental
results show the promising performance of the proposed framework for visual BOW
refinement and reduction
Fast Color Quantization Using Weighted Sort-Means Clustering
Color quantization is an important operation with numerous applications in
graphics and image processing. Most quantization methods are essentially based
on data clustering algorithms. However, despite its popularity as a general
purpose clustering algorithm, k-means has not received much respect in the
color quantization literature because of its high computational requirements
and sensitivity to initialization. In this paper, a fast color quantization
method based on k-means is presented. The method involves several modifications
to the conventional (batch) k-means algorithm including data reduction, sample
weighting, and the use of triangle inequality to speed up the nearest neighbor
search. Experiments on a diverse set of images demonstrate that, with the
proposed modifications, k-means becomes very competitive with state-of-the-art
color quantization methods in terms of both effectiveness and efficiency.Comment: 30 pages, 2 figures, 4 table
- …