
    Linear, Deterministic, and Order-Invariant Initialization Methods for the K-Means Clustering Algorithm

    Over the past five decades, k-means has become the clustering algorithm of choice in many application domains, primarily due to its simplicity, time/space efficiency, and invariance to the ordering of the data points. Unfortunately, the algorithm's sensitivity to the initial selection of the cluster centers remains its most serious drawback. Numerous initialization methods have been proposed to address this drawback. Many of these methods, however, have time complexity superlinear in the number of data points, which makes them impractical for large data sets. On the other hand, linear methods are often random and/or sensitive to the ordering of the data points. These methods are generally unreliable in that the quality of their results is unpredictable. Therefore, it is common practice to perform multiple runs of such methods and take the output of the run that produces the best results. Such a practice, however, greatly increases the computational requirements of the otherwise highly efficient k-means algorithm. In this chapter, we investigate the empirical performance of six linear, deterministic (non-random), and order-invariant k-means initialization methods on a large and diverse collection of data sets from the UCI Machine Learning Repository. The results demonstrate that two relatively unknown hierarchical initialization methods due to Su and Dy outperform the remaining four methods with respect to two objective effectiveness criteria. In addition, a recent method due to Erisoglu et al. performs surprisingly poorly.
    Comment: 21 pages, 2 figures, 5 tables, Partitional Clustering Algorithms (Springer, 2014). arXiv admin note: substantial text overlap with arXiv:1304.7465, arXiv:1209.196
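    The abstract above concerns the class of linear-time, deterministic, order-invariant initializers. As an illustrative sketch of that class (a maximin-style initializer, not necessarily one of the six methods the chapter evaluates): start from the point farthest from the data mean, then repeatedly take the point farthest from its nearest already-chosen center. It runs in O(nk) time and, barring distance ties, does not depend on the order of the input.

    ```python
    def maximin_init(points, k):
        """Deterministic, order-invariant seeding sketch: the first center
        is the point farthest from the data mean; each further center is
        the point farthest from its nearest already-chosen center.
        (With exact distance ties, Python's max() breaks them by input
        order, so strict order-invariance assumes no ties.)"""
        n, d = len(points), len(points[0])
        mean = [sum(p[j] for p in points) / n for j in range(d)]

        def sqdist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))

        # First center: farthest point from the mean.
        centers = [max(points, key=lambda p: sqdist(p, mean))]
        while len(centers) < k:
            # Next center: point whose nearest chosen center is farthest away.
            centers.append(max(points, key=lambda p: min(sqdist(p, c) for c in centers)))
        return centers
    ```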

    A Comparative Study of Efficient Initialization Methods for the K-Means Clustering Algorithm

    K-means is undoubtedly the most widely used partitional clustering algorithm. Unfortunately, due to its gradient descent nature, this algorithm is highly sensitive to the initial placement of the cluster centers. Numerous initialization methods have been proposed to address this problem. In this paper, we first present an overview of these methods with an emphasis on their computational efficiency. We then compare eight commonly used linear time complexity initialization methods on a large and diverse collection of data sets using various performance criteria. Finally, we analyze the experimental results using non-parametric statistical tests and provide recommendations for practitioners. We demonstrate that popular initialization methods often perform poorly and that there are in fact strong alternatives to these methods.
    Comment: 17 pages, 1 figure, 7 tables
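    For reference, the best known of the randomized linear-time seeding methods featured in comparisons like this one is k-means++. A minimal sketch: the first center is drawn uniformly at random, and each further center with probability proportional to its squared distance to the nearest center chosen so far.

    ```python
    import random

    def kmeans_pp_init(points, k, rng=None):
        """k-means++ seeding sketch: D^2 weighting makes points far from
        every chosen center more likely to be picked next."""
        rng = rng or random.Random()

        def sqdist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))

        centers = [rng.choice(points)]
        while len(centers) < k:
            # Squared distance of each point to its nearest chosen center.
            d2 = [min(sqdist(p, c) for c in centers) for p in points]
            total = sum(d2)
            if total == 0:  # all points coincide with chosen centers
                centers.append(rng.choice(points))
                continue
            # Sample one point with probability proportional to d2.
            r = rng.uniform(0, total)
            acc = 0.0
            for p, w in zip(points, d2):
                acc += w
                if acc >= r:
                    centers.append(p)
                    break
        return centers
    ```

    Because the seeding is random, results vary across runs, which is exactly the unreliability such comparative studies quantify.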

    Automatic Clustering with Single Optimal Solution

    Determining the optimal number of clusters in a dataset is a challenging task. Though some methods are available, there is no algorithm that produces a unique clustering solution. This paper proposes Automatic Merging for Single Optimal Solution (AMSOS), which aims to generate unique and nearly optimal clusters for the given datasets automatically. AMSOS iteratively merges the closest clusters, validating each merge with a cluster validity measure, to find a single and nearly optimal clustering for the given data set. Experiments on both synthetic and real data show that the proposed algorithm finds a single and nearly optimal clustering structure in terms of number of clusters, compactness, and separation.
    Comment: 13 pages, 4 tables, 3 figures
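    The merge step at the heart of such an approach can be sketched as follows. This is an illustrative simplification, not the exact AMSOS procedure, which additionally validates each merge against a cluster validity measure before accepting it:

    ```python
    def merge_closest(clusters):
        """Merge the pair of clusters with the closest centroids.
        `clusters` is a list of clusters, each a non-empty list of points;
        returns a new list with one fewer cluster."""
        def centroid(c):
            d = len(c[0])
            return [sum(p[j] for p in c) / len(c) for j in range(d)]

        def sqdist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))

        # Find the pair of clusters whose centroids are closest.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sqdist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        return [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    ```

    Repeating this step while a validity score keeps improving, and stopping when it degrades, yields both the cluster count and the clustering in one pass.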

    Robust seed selection algorithm for k-means type algorithms

    The selection of initial seeds greatly affects the quality of the clusters in k-means type algorithms. Most seed selection methods produce different results in different independent runs. We propose a single, optimal, outlier-insensitive seed selection algorithm for k-means type algorithms as an extension to k-means++. Experimental results on synthetic, real, and microarray data sets demonstrate the effectiveness of the new algorithm in producing the clustering results.
    Comment: 17 pages, 5 tables, 9 figures
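    To give a rough sense of what "outlier-insensitive" seeding can mean in practice, here is a hypothetical sketch (not the paper's algorithm): trim the fraction of points farthest from the data mean before choosing seeds deterministically, so isolated outliers can never become seeds.

    ```python
    def robust_maximin_init(points, k, trim=0.1):
        """Hypothetical outlier-insensitive seeding sketch (NOT the
        paper's algorithm): discard the `trim` fraction of points
        farthest from the data mean, then seed deterministically among
        the remainder."""
        n, d = len(points), len(points[0])
        mean = [sum(p[j] for p in points) / n for j in range(d)]

        def sqdist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))

        # Keep the points closest to the mean (at least k of them).
        kept = sorted(points, key=lambda p: sqdist(p, mean))
        kept = kept[: max(k, int(n * (1 - trim)))]

        # First seed: the kept point closest to the mean; then maximin.
        centers = [kept[0]]
        while len(centers) < k:
            centers.append(max(kept, key=lambda p: min(sqdist(p, c) for c in centers)))
        return centers
    ```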

    Classification of infectious diseases via hybrid k-means clustering technique

    Identifying groups of objects that are similar to each other but different from individuals in other groups can be intellectually satisfying, profitable, or sometimes both. K-means clustering is one of the well-known partitioning algorithms, but the basic K-means method is insufficient to extract meaningful information, and its output is very sensitive to the initial positions of the cluster centers. In this paper, data on infectious diseases were analyzed with a hybrid K-means clustering technique. This method preprocesses the dataset that will be used in the K-means clustering problem: specifically, it performs K-means clustering on the preprocessed dataset instead of the raw dataset, to remove the impact of irrelevant features and to obtain good initial centers. The experimental results revealed that all three water-related diseases are grouped together in one cluster for both the KGHK and FMCK data sets. They also show a high prevalence compared to the airborne-particle-related diseases in the other group. The study concludes that the K-means clustering method provides a suitable tool for assessing the level of infectious diseases.
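    One simple form such preprocessing can take (an illustrative sketch, not the paper's exact procedure) is dropping near-constant features before clustering, so that irrelevant columns cannot distort the distance computations K-means relies on:

    ```python
    def filter_low_variance(rows, min_var=1e-3):
        """Illustrative preprocessing sketch: drop features whose variance
        falls below `min_var`. Returns the filtered rows and the indices
        of the retained columns. `min_var` is an arbitrary threshold
        chosen for illustration."""
        n, d = len(rows), len(rows[0])
        keep = []
        for j in range(d):
            col = [r[j] for r in rows]
            mean = sum(col) / n
            var = sum((x - mean) ** 2 for x in col) / n
            if var >= min_var:
                keep.append(j)
        return [[r[j] for j in keep] for r in rows], keep
    ```

    K-means would then be run on the filtered rows rather than the raw data.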

    3D Partition-Based Clustering for Supply Chain Data Management

    Supply Chain Management (SCM) is the management of the flow of products and goods from their point of origin to the point of consumption. During the process of SCM, the information and datasets gathered for this application are massive and complex, due to its several processes such as procurement, product development and commercialization, physical distribution, outsourcing, and partnerships. For a practical application, SCM datasets need to be managed and maintained to provide better service to their three main categories: distributor, customer, and supplier. To manage these datasets, a data constellation structure is used to accommodate the data in the spatial database. However, this situation creates a few problems in the geospatial database; for example, the performance of the database deteriorates, especially during query operations. We strongly believe that a more practical hierarchical tree structure is required for efficient SCM processing. Besides that, a three-dimensional approach is required for the management of SCM datasets, since they involve multi-level locations such as shop lots and residential apartments. The 3D R-Tree has been increasingly used for 3D geospatial database management due to its simplicity and extendibility. However, it suffers from serious overlaps between nodes. In this paper, we propose partition-based clustering for the construction of a hierarchical tree structure. Several datasets are tested using the proposed method, and the percentage of overlapping nodes and the volume coverage are computed and compared with the original 3D R-Tree and other practical approaches. The experiments demonstrated in this paper substantiate that the hierarchical structure produced by the proposed partition-based clustering is capable of preserving minimal overlap and coverage. The query performance was tested using 300,000 points of an SCM dataset, and the results are presented in this paper. This paper also discusses the outlook of the structure for future reference.
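    The overlap criterion behind this comparison can be illustrated with a short sketch (a simplified illustration, not the paper's construction algorithm): group 3D points by cluster label, compute each group's axis-aligned bounding box (the node geometry an R-tree stores), and measure pairwise box overlap. Lower overlap volume between sibling nodes generally means fewer subtrees visited per query.

    ```python
    def cluster_bounding_boxes(points, assignments, k):
        """Axis-aligned bounding box (lo, hi corner tuples) of each
        cluster of 3D points. Assumes every label 0..k-1 occurs."""
        boxes = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            lo = tuple(min(p[j] for p in members) for j in range(3))
            hi = tuple(max(p[j] for p in members) for j in range(3))
            boxes.append((lo, hi))
        return boxes

    def overlap_volume(box_a, box_b):
        """Volume of the intersection of two axis-aligned 3D boxes
        (0.0 if they are disjoint)."""
        vol = 1.0
        for j in range(3):
            side = min(box_a[1][j], box_b[1][j]) - max(box_a[0][j], box_b[0][j])
            if side <= 0:
                return 0.0
            vol *= side
        return vol
    ```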

    Improving the Performance of the K-Means Algorithm Using Particle Swarm Optimization in Clustering Access Provision Data

    Water plays a very important role in human survival, which is why the Indonesian government runs a community-based water supply and sanitation (PAMSIMAS) program. For the program to run well, a technique for grouping regions by status is needed; in this thesis, this is done with the K-means algorithm. K-means is a partitioning algorithm that aims to divide the data into a specified number of clusters. Its results depend on the selection of the initial cluster centers, and a problem that often occurs when the initial centroids are drawn at random is that the resulting grouping is not quite right. To overcome this problem, the author uses the PSO algorithm to select the initial centroids for the K-means algorithm. This study compares three ways of selecting the initial centroids: first, at random; second, according to government standards for high, medium, and low drinking water quality; and third, with the proposed PSO algorithm. The results were then evaluated with the Davies-Bouldin Index. From the test results, K-means with random initial centroid selection scored 0.208856082, K-means with centroids chosen according to government standards on SAM conditions scored 0.280077, and the best method, K-means PSO, scored 0.08383. Testing the PAMSIMAS data with K-means PSO thus showed that this method is more optimal.
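    The Davies-Bouldin Index used above to compare the three initialization strategies can be computed directly from the standard definition (lower values indicate better-separated, more compact clusters):

    ```python
    import math

    def davies_bouldin(clusters):
        """Davies-Bouldin Index for a list of clusters (each a non-empty
        list of points). For each cluster, take its worst ratio of summed
        within-cluster scatter to between-centroid distance, then average."""
        def centroid(c):
            d = len(c[0])
            return [sum(p[j] for p in c) / len(c) for j in range(d)]

        def dist(a, b):
            return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

        cents = [centroid(c) for c in clusters]
        # Scatter: mean distance of a cluster's points to its centroid.
        scatter = [sum(dist(p, m) for p in c) / len(c)
                   for c, m in zip(clusters, cents)]
        k = len(clusters)
        total = 0.0
        for i in range(k):
            total += max((scatter[i] + scatter[j]) / dist(cents[i], cents[j])
                         for j in range(k) if j != i)
        return total / k
    ```

    Evaluating each of the three centroid-selection strategies with this score is how the study arrives at its ranking.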