24,132 research outputs found
Reducing the Time Requirement of k-Means Algorithm
Traditional k-means and most k-means variants are still computationally expensive for large datasets such as microarray
data, which have a large dimension size d. In k-means clustering, we are given a set of n data points in d-dimensional
space R^d and an integer k. The problem is to determine a set of k points in R^d, called centers, so as to minimize
the mean squared distance from each data point to its nearest center. In this work, we develop a novel k-means algorithm,
which is simple but more efficient than the traditional k-means and the recent enhanced k-means. Our new algorithm is
based on the recently established relationship between principal component analysis and k-means clustering. We
provide a correctness proof for this algorithm. Results obtained from testing the algorithm on three biological datasets and
six non-biological datasets (three real and three simulated) also indicate that our algorithm is
empirically faster than other known k-means algorithms. We assessed the quality of our algorithm's clusters against the
clusters of a known structure using the Hubert-Arabie Adjusted Rand Index (ARI_HA). We found that when k is close to d, the
quality is good (ARI_HA > 0.8), and when k is not close to d, the quality of our new k-means algorithm is excellent (ARI_HA > 0.9).
In this paper, the emphasis is on reducing the time requirement of the k-means algorithm and on its application to
microarray data, motivated by the desire to create a tool for clustering and malaria research. However, the new clustering algorithm
can be used for other clustering needs as long as an appropriate measure of distance between the centroids and the
members is used. This has been demonstrated in this work on six non-biological datasets.
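The objective described in this abstract (mean squared distance from each point to its nearest center) can be sketched as follows. This is a minimal illustration of the standard Lloyd iteration, not the PCA-based algorithm the abstract proposes; the function names `sq_dist`, `mean_squared_distance`, and `lloyd_step` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points in R^d."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_squared_distance(points: List[Point], centers: List[Point]) -> float:
    """k-means objective: mean squared distance to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points) / len(points)


def lloyd_step(points: List[Point], centers: List[Point]) -> List[Tuple[float, ...]]:
    """One standard Lloyd iteration (assumed baseline, not the paper's method)."""
    # Assignment step: attach each point to its nearest center.
    groups: List[List[Point]] = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        groups[nearest].append(p)
    # Update step: move each center to the mean of its assigned points.
    new_centers = []
    for c, g in zip(centers, groups):
        if g:
            new_centers.append(tuple(sum(p[t] for p in g) / len(g) for t in range(len(c))))
        else:
            new_centers.append(tuple(c))  # keep an empty cluster's center unchanged
    return new_centers


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 0.0), (10.0, 0.0)]
before = mean_squared_distance(points, centers)
after = mean_squared_distance(points, lloyd_step(points, centers))
```

Each assignment step costs on the order of n * k * d operations, which is the cost that becomes prohibitive when d is large, as with microarray data.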
Study of document clustering using the k-means algorithm
One of the most commonly used data mining techniques is document clustering, or unsupervised document classification, which deals with the grouping of documents based on some document similarity function. This thesis deals with research issues associated with categorizing documents using the k-means clustering algorithm, which groups objects into K groups based on document representations and similarities. The hypothesis of this thesis is that unsupervised clustering of a set of documents produces results similar to those of their supervised categorization.
A Faster k-means++ Algorithm
k-means++ is an important algorithm for choosing initial cluster centers for the
k-means clustering algorithm. In this work, we present a new algorithm that can
solve the k-means++ problem with near optimal running time. Given data
points in , the current state-of-the-art algorithm runs in
iterations, and each iteration takes
time. The overall running time is thus . We propose a
new algorithm FastKmeans++ that only takes in time, in total
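For orientation, the seeding procedure that this abstract sets out to accelerate can be sketched as below. This is the classic k-means++ seeding (each new center is sampled with probability proportional to its squared distance from the nearest center chosen so far), not the FastKmeans++ algorithm of the abstract; the function name `kmeanspp_seed` and its parameters are hypothetical.

```python
import random
from typing import List, Sequence

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seed(points: List[Point], k: int, rng: random.Random) -> List[Point]:
    """Classic k-means++ seeding (a baseline sketch, not FastKmeans++)."""
    # First center: chosen uniformly at random.
    centers = [points[rng.randrange(len(points))]]
    # d2[i] = squared distance from points[i] to its nearest chosen center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        total = sum(d2)
        if total == 0:
            # Every point coincides with a chosen center: fall back to uniform.
            idx = rng.randrange(len(points))
        else:
            # Sample an index with probability d2[i] / total.
            r = rng.random() * total
            acc = 0.0
            idx = len(points) - 1
            for i, w in enumerate(d2):
                acc += w
                if r < acc:
                    idx = i
                    break
        centers.append(points[idx])
        # Refresh nearest-center distances against the new center only.
        d2 = [min(old, sq_dist(p, points[idx])) for old, p in zip(d2, points)]
    return centers


points = [(0.0, 0.0), (0.0, 0.0), (5.0, 5.0), (5.0, 5.0)]
seeds = kmeanspp_seed(points, 2, random.Random(0))
```

In this naive version, every new center triggers a full pass over all points and all coordinates, which is the per-iteration cost a faster variant has to reduce.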
Color image segmentation using a spatial k-means clustering algorithm
This paper details the implementation of a new adaptive technique for color-texture segmentation that is a generalization of the standard K-Means algorithm. The standard K-Means algorithm produces accurate segmentation results only when applied to images defined by regions that are homogeneous with respect to texture and color, since no local constraints are applied to impose spatial continuity. In addition, the initialization of the K-Means algorithm is problematic, and usually the initial cluster centers are picked at random. In this paper we detail the implementation of a novel technique to select the dominant colors from the input image using the information from the color histograms. The main contribution of this work is the generalization of the K-Means algorithm that includes the primary features that describe the color smoothness and texture complexity in the process of pixel assignment. The resulting color segmentation scheme has been applied to a large number of natural images, and the experimental data indicate the robustness of the newly developed segmentation algorithm.
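One way to picture histogram-based initialization is the sketch below, which picks the k most populated bins of a coarse RGB histogram as initial cluster centers. The abstract does not specify the paper's actual selection procedure, so the binning scheme, the function name `dominant_colors`, and the `bins_per_channel` parameter are all assumptions made for illustration.

```python
from collections import Counter
from typing import List, Tuple

Color = Tuple[int, int, int]


def dominant_colors(pixels: List[Color], k: int, bins_per_channel: int = 4) -> List[Color]:
    """Return the centers of the k most populated RGB histogram bins.

    Assumes 8-bit channels (0..255); the coarse binning is an illustrative choice.
    """
    step = 256 // bins_per_channel
    # Build a 3-D color histogram keyed by (r_bin, g_bin, b_bin).
    hist = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    # Use the midpoint of each of the k fullest bins as an initial center.
    return [
        tuple(i * step + step // 2 for i in bin_index)
        for bin_index, _count in hist.most_common(k)
    ]


pixels = [(250, 10, 10)] * 5 + [(10, 10, 250)] * 3 + [(10, 250, 10)] * 1
initial_centers = dominant_colors(pixels, 2)
```

Seeding from histogram peaks makes the starting centers deterministic for a given image, in contrast to the random initialization the abstract calls problematic.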