
    Linear, Deterministic, and Order-Invariant Initialization Methods for the K-Means Clustering Algorithm

    Over the past five decades, k-means has become the clustering algorithm of choice in many application domains, primarily due to its simplicity, time/space efficiency, and invariance to the ordering of the data points. Unfortunately, the algorithm's sensitivity to the initial selection of the cluster centers remains its most serious drawback. Numerous initialization methods have been proposed to address this drawback. Many of these methods, however, have time complexity superlinear in the number of data points, which makes them impractical for large data sets. On the other hand, linear methods are often random and/or sensitive to the ordering of the data points. These methods are generally unreliable in that the quality of their results is unpredictable. Therefore, it is common practice to perform multiple runs of such methods and take the output of the run that produces the best results. Such a practice, however, greatly increases the computational requirements of the otherwise highly efficient k-means algorithm. In this chapter, we investigate the empirical performance of six linear, deterministic (non-random), and order-invariant k-means initialization methods on a large and diverse collection of data sets from the UCI Machine Learning Repository. The results demonstrate that two relatively unknown hierarchical initialization methods due to Su and Dy outperform the remaining four methods with respect to two objective effectiveness criteria. In addition, a recent method due to Erisoglu et al. performs surprisingly poorly.
    Comment: 21 pages, 2 figures, 5 tables, Partitional Clustering Algorithms (Springer, 2014). arXiv admin note: substantial text overlap with arXiv:1304.7465, arXiv:1209.196
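    The abstract above concerns the class of linear-time, deterministic, order-invariant initializers. As an illustrative sketch of that class (a maximin-style initializer, not necessarily one of the six methods the chapter evaluates): start from the point farthest from the data mean, then repeatedly take the point farthest from its nearest already-chosen center. It runs in O(nk) time and, barring distance ties, does not depend on the order of the input.

    ```python
    def maximin_init(points, k):
        """Deterministic, order-invariant seeding sketch: the first center
        is the point farthest from the data mean; each further center is
        the point farthest from its nearest already-chosen center.
        (With exact distance ties, Python's max() breaks them by input
        order, so strict order-invariance assumes no ties.)"""
        n, d = len(points), len(points[0])
        mean = [sum(p[j] for p in points) / n for j in range(d)]

        def sqdist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))

        # First center: farthest point from the mean.
        centers = [max(points, key=lambda p: sqdist(p, mean))]
        while len(centers) < k:
            # Next center: point whose nearest chosen center is farthest away.
            centers.append(max(points, key=lambda p: min(sqdist(p, c) for c in centers)))
        return centers
    ```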

    A Comparative Study of Efficient Initialization Methods for the K-Means Clustering Algorithm

    K-means is undoubtedly the most widely used partitional clustering algorithm. Unfortunately, due to its gradient descent nature, this algorithm is highly sensitive to the initial placement of the cluster centers. Numerous initialization methods have been proposed to address this problem. In this paper, we first present an overview of these methods with an emphasis on their computational efficiency. We then compare eight commonly used linear time complexity initialization methods on a large and diverse collection of data sets using various performance criteria. Finally, we analyze the experimental results using non-parametric statistical tests and provide recommendations for practitioners. We demonstrate that popular initialization methods often perform poorly and that there are in fact strong alternatives to these methods.
    Comment: 17 pages, 1 figure, 7 tables
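    For reference, the best known of the randomized linear-time seeding methods featured in comparisons like this one is k-means++. A minimal sketch: the first center is drawn uniformly at random, and each further center with probability proportional to its squared distance to the nearest center chosen so far.

    ```python
    import random

    def kmeans_pp_init(points, k, rng=None):
        """k-means++ seeding sketch: D^2 weighting makes points far from
        every chosen center more likely to be picked next."""
        rng = rng or random.Random()

        def sqdist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))

        centers = [rng.choice(points)]
        while len(centers) < k:
            # Squared distance of each point to its nearest chosen center.
            d2 = [min(sqdist(p, c) for c in centers) for p in points]
            total = sum(d2)
            if total == 0:  # all points coincide with chosen centers
                centers.append(rng.choice(points))
                continue
            # Sample one point with probability proportional to d2.
            r = rng.uniform(0, total)
            acc = 0.0
            for p, w in zip(points, d2):
                acc += w
                if acc >= r:
                    centers.append(p)
                    break
        return centers
    ```

    Because the seeding is random, results vary across runs, which is exactly the unreliability such comparative studies quantify.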

    Automatic Clustering with Single Optimal Solution

    Determining the optimal number of clusters in a dataset is a challenging task. Though some methods are available, there is no algorithm that produces a unique clustering solution. This paper proposes Automatic Merging for Single Optimal Solution (AMSOS), which aims to generate unique and nearly optimal clusters for the given datasets automatically. AMSOS iteratively merges the closest clusters, validating each merge with a cluster validity measure, to find a single and nearly optimal clustering for the given data set. Experiments on both synthetic and real data show that the proposed algorithm finds a single and nearly optimal clustering structure in terms of number of clusters, compactness, and separation.
    Comment: 13 pages, 4 tables, 3 figures
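    The merge step at the heart of such an approach can be sketched as follows. This is an illustrative simplification, not the exact AMSOS procedure, which additionally validates each merge against a cluster validity measure before accepting it:

    ```python
    def merge_closest(clusters):
        """Merge the pair of clusters with the closest centroids.
        `clusters` is a list of clusters, each a non-empty list of points;
        returns a new list with one fewer cluster."""
        def centroid(c):
            d = len(c[0])
            return [sum(p[j] for p in c) / len(c) for j in range(d)]

        def sqdist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))

        # Find the pair of clusters whose centroids are closest.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sqdist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        return [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    ```

    Repeating this step while a validity score keeps improving, and stopping when it degrades, yields both the cluster count and the clustering in one pass.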

    Robust seed selection algorithm for k-means type algorithms

    The selection of initial seeds greatly affects the quality of the clusters in k-means type algorithms. Most seed selection methods produce different results in different independent runs. We propose a single, optimal, outlier-insensitive seed selection algorithm for k-means type algorithms as an extension to k-means++. Experimental results on synthetic, real, and microarray data sets demonstrate the effectiveness of the new algorithm in producing the clustering results.
    Comment: 17 pages, 5 tables, 9 figures
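    To give a rough sense of what "outlier-insensitive" seeding can mean in practice, here is a hypothetical sketch (not the paper's algorithm): trim the fraction of points farthest from the data mean before choosing seeds deterministically, so isolated outliers can never become seeds.

    ```python
    def robust_maximin_init(points, k, trim=0.1):
        """Hypothetical outlier-insensitive seeding sketch (NOT the
        paper's algorithm): discard the `trim` fraction of points
        farthest from the data mean, then seed deterministically among
        the remainder."""
        n, d = len(points), len(points[0])
        mean = [sum(p[j] for p in points) / n for j in range(d)]

        def sqdist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))

        # Keep the points closest to the mean (at least k of them).
        kept = sorted(points, key=lambda p: sqdist(p, mean))
        kept = kept[: max(k, int(n * (1 - trim)))]

        # First seed: the kept point closest to the mean; then maximin.
        centers = [kept[0]]
        while len(centers) < k:
            centers.append(max(kept, key=lambda p: min(sqdist(p, c) for c in centers)))
        return centers
    ```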

    Classification of infectious diseases via hybrid k-means clustering technique

    Identifying groups of objects that are similar to each other but different from individuals in other groups can be intellectually satisfying, profitable, or sometimes both. K-means clustering is one of the well-known partitioning algorithms, but the basic K-means method is insufficient to extract meaningful information, and its output is very sensitive to the initial positions of the cluster centers. In this paper, data on infectious diseases were analyzed with a hybrid K-means clustering technique. This method preprocesses the dataset that will be used in the K-means clustering problem: specifically, it performs K-means clustering on the preprocessed dataset instead of the raw dataset, to remove the impact of irrelevant features and to obtain good initial centers. The experimental results revealed that all three water-related diseases are grouped together in one cluster for both the KGHK and FMCK data sets. They also show a high prevalence compared to the airborne-particle-related diseases in the other group. The study concludes that the K-means clustering method provides a suitable tool for assessing the level of infectious diseases.
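    One simple form such preprocessing can take (an illustrative sketch, not the paper's exact procedure) is dropping near-constant features before clustering, so that irrelevant columns cannot distort the distance computations K-means relies on:

    ```python
    def filter_low_variance(rows, min_var=1e-3):
        """Illustrative preprocessing sketch: drop features whose variance
        falls below `min_var`. Returns the filtered rows and the indices
        of the retained columns. `min_var` is an arbitrary threshold
        chosen for illustration."""
        n, d = len(rows), len(rows[0])
        keep = []
        for j in range(d):
            col = [r[j] for r in rows]
            mean = sum(col) / n
            var = sum((x - mean) ** 2 for x in col) / n
            if var >= min_var:
                keep.append(j)
        return [[r[j] for j in keep] for r in rows], keep
    ```

    K-means would then be run on the filtered rows rather than the raw data.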

    3D Partition-Based Clustering for Supply Chain Data Management

    Supply Chain Management (SCM) is the management of the flow of products and goods from their point of origin to the point of consumption. During the process of SCM, the information and datasets gathered for this application are massive and complex, due to its several processes such as procurement, product development and commercialization, physical distribution, outsourcing, and partnerships. For a practical application, SCM datasets need to be managed and maintained to provide better service to their three main categories: distributor, customer, and supplier. To manage these datasets, a data constellation structure is used to accommodate the data in the spatial database. However, this situation creates a few problems in the geospatial database; for example, the performance of the database deteriorates, especially during query operations. We strongly believe that a more practical hierarchical tree structure is required for efficient SCM processing. Besides that, a three-dimensional approach is required for the management of SCM datasets, since they involve multi-level locations such as shop lots and residential apartments. The 3D R-Tree has been increasingly used for 3D geospatial database management due to its simplicity and extendibility. However, it suffers from serious overlaps between nodes. In this paper, we propose partition-based clustering for the construction of a hierarchical tree structure. Several datasets are tested using the proposed method, and the percentage of overlapping nodes and the volume coverage are computed and compared with the original 3D R-Tree and other practical approaches. The experiments demonstrated in this paper substantiate that the hierarchical structure produced by the proposed partition-based clustering is capable of preserving minimal overlap and coverage. The query performance was tested using 300,000 points of an SCM dataset, and the results are presented in this paper. This paper also discusses the outlook of the structure for future reference.
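    The overlap criterion behind this comparison can be illustrated with a short sketch (a simplified illustration, not the paper's construction algorithm): group 3D points by cluster label, compute each group's axis-aligned bounding box (the node geometry an R-tree stores), and measure pairwise box overlap. Lower overlap volume between sibling nodes generally means fewer subtrees visited per query.

    ```python
    def cluster_bounding_boxes(points, assignments, k):
        """Axis-aligned bounding box (lo, hi corner tuples) of each
        cluster of 3D points. Assumes every label 0..k-1 occurs."""
        boxes = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            lo = tuple(min(p[j] for p in members) for j in range(3))
            hi = tuple(max(p[j] for p in members) for j in range(3))
            boxes.append((lo, hi))
        return boxes

    def overlap_volume(box_a, box_b):
        """Volume of the intersection of two axis-aligned 3D boxes
        (0.0 if they are disjoint)."""
        vol = 1.0
        for j in range(3):
            side = min(box_a[1][j], box_b[1][j]) - max(box_a[0][j], box_b[0][j])
            if side <= 0:
                return 0.0
            vol *= side
        return vol
    ```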

    Improving the Performance of the K-Means Algorithm Using Particle Swarm Optimization in Clustering Access Provision Data

    Water plays a very important role in human survival, which is why the Indonesian government runs a community-based water supply and sanitation (PAMSIMAS) program. For the program to run well, a technique for grouping regions by status is needed; in this thesis, this is done with the K-means algorithm. K-means is a partitioning algorithm that aims to divide the data into a specified number of clusters. Its results depend on the selection of the initial cluster centers, and a problem that often occurs when the initial centroids are drawn at random is that the resulting grouping is not quite right. To overcome this problem, the author uses the PSO algorithm to select the initial centroids for the K-means algorithm. This study compares three ways of selecting the initial centroids: first, at random; second, according to government standards for high, medium, and low drinking water quality; and third, with the proposed PSO algorithm. The results were then evaluated with the Davies-Bouldin Index. From the test results, K-means with random initial centroid selection scored 0.208856082, K-means with centroids chosen according to government standards on SAM conditions scored 0.280077, and the best method, K-means PSO, scored 0.08383. Testing the PAMSIMAS data with K-means PSO thus showed that this method is more optimal.
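    The Davies-Bouldin Index used above to compare the three initialization strategies can be computed directly from the standard definition (lower values indicate better-separated, more compact clusters):

    ```python
    import math

    def davies_bouldin(clusters):
        """Davies-Bouldin Index for a list of clusters (each a non-empty
        list of points). For each cluster, take its worst ratio of summed
        within-cluster scatter to between-centroid distance, then average."""
        def centroid(c):
            d = len(c[0])
            return [sum(p[j] for p in c) / len(c) for j in range(d)]

        def dist(a, b):
            return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

        cents = [centroid(c) for c in clusters]
        # Scatter: mean distance of a cluster's points to its centroid.
        scatter = [sum(dist(p, m) for p in c) / len(c)
                   for c, m in zip(clusters, cents)]
        k = len(clusters)
        total = 0.0
        for i in range(k):
            total += max((scatter[i] + scatter[j]) / dist(cents[i], cents[j])
                         for j in range(k) if j != i)
        return total / k
    ```

    Evaluating each of the three centroid-selection strategies with this score is how the study arrives at its ranking.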