Scalability of efficient parallel K-Means

By David Pettinger and Giuseppe Di Fatta

Abstract

Clustering, the grouping of similar items in a set, is an important process in data mining. As the data for various applications continues to grow in both size and dimensionality, efficient clustering methods become necessary. A popular clustering algorithm is K-Means, which adopts a greedy approach to produce a set of K clusters with associated centres of mass, and uses a squared error distortion measure to determine convergence. Methods for improving the efficiency of K-Means have been explored mainly in two directions. First, the amount of computation can be significantly reduced by adopting a more efficient data structure, notably a multi-dimensional binary search tree (KD-Tree) to store either centroids or data points. Second, parallel processing distributes data and computation loads over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance. This issue has so far limited the adoption of these efficient K-Means techniques in parallel computational environments. In this work, we provide a parallel formulation for the KD-Tree based K-Means algorithm and address its load balancing issues.
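To make the baseline concrete, the following is a minimal sketch of plain K-Means with a squared error distortion convergence test, organised so that each data partition produces partial sums that are then combined. It is not the KD-Tree based method of the paper, and it contains no load balancing. The names `partial_stats`, `kmeans`, `n_parts`, `tol` and `init` are hypothetical, and the loop over partitions merely stands in for work that separate processing nodes would do.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def partial_stats(chunk, centroids):
    """Process one data partition: assign each point to its nearest centroid
    and return per-cluster coordinate sums, per-cluster counts, and the
    partition's contribution to the squared error distortion."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    distortion = 0.0
    for p in chunk:
        dists = [squared_distance(p, c) for c in centroids]
        j = min(range(k), key=dists.__getitem__)  # index of nearest centroid
        counts[j] += 1
        distortion += dists[j]
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts, distortion


def kmeans(points, k, n_parts=2, tol=1e-9, max_iter=100, seed=0, init=None):
    """Plain K-Means; stops when the distortion improves by at most `tol`."""
    if init is not None:
        centroids = [list(c) for c in init]
    else:
        centroids = [list(p) for p in random.Random(seed).sample(points, k)]
    dim = len(centroids[0])
    # Assumption: a simple round-robin split stands in for data distribution.
    chunks = [points[i::n_parts] for i in range(n_parts)]
    prev, total = float("inf"), float("inf")
    for _ in range(max_iter):
        # In a parallel setting each call below would run on its own node.
        results = [partial_stats(chunk, centroids) for chunk in chunks]
        # Combine the partial results (the role a global reduction would play).
        sums = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        total = 0.0
        for part_sums, part_counts, part_distortion in results:
            total += part_distortion
            for j in range(k):
                counts[j] += part_counts[j]
                for d in range(dim):
                    sums[j][d] += part_sums[j][d]
        if prev - total <= tol:
            break  # distortion no longer decreasing: converged
        prev = total
        for j in range(k):
            if counts[j] > 0:  # an empty cluster keeps its old centroid
                centroids[j] = [s / counts[j] for s in sums[j]]
    return centroids, total
```

Because every partition here costs roughly the same per point, the work is evenly spread. A KD-Tree based variant prunes different amounts of work in different regions of the data, which is the source of the load imbalance the abstract refers to.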

Year: 2009
OAI identifier: oai:centaur.reading.ac.uk:4499

