23,135 research outputs found

    An iterative initial-points refinement algorithm for categorical data clustering

    The original k-means clustering algorithm is designed to work primarily on numeric data sets, which prevents it from being applied directly to categorical data clustering in many data mining applications. The k-modes algorithm [Z. Huang, Clustering large data sets with mixed numeric and categorical values, in: Proceedings of the First Pacific Asia Knowledge Discovery and Data Mining Conference, World Scientific, Singapore, 1997, pp. 21–34] extended the k-means paradigm to cluster categorical data by using a frequency-based method to update the cluster modes, in contrast to the k-means approach of minimizing a numerically valued cost. However, as with most data clustering algorithms, k-modes requires the initial points (modes) of the clusters to be preset or randomly selected, and differences in the initial points often lead to considerably different clustering results. In this paper we present an experimental study on applying Bradley and Fayyad's iterative initial-point refinement algorithm to k-modes clustering to improve the accuracy and repeatability of the clustering results [cf. P. Bradley, U. Fayyad, Refining initial points for k-means clustering, in: Proceedings of the 15th International Conference on Machine Learning, Morgan Kaufmann, Los Altos, CA, 1998]. Experiments show that the k-modes clustering algorithm using refined initial points leads to higher-precision results much more reliably than random selection without refinement, making the refinement process applicable to many data mining applications with categorical data.
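
    The k-modes loop described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`matching_dissimilarity`, `column_modes`, `k_modes`), the tie-breaking rule, and the handling of empty clusters are all hypothetical choices, and the Bradley and Fayyad refinement step is not included. The initial modes are passed in explicitly, which is where a refinement procedure would plug in.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(records):
    """Most frequent category per attribute (ties go to the first value seen)."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))


def k_modes(records, initial_modes, max_iter=100):
    """Alternate assignment and frequency-based mode updates until stable."""
    modes = [tuple(m) for m in initial_modes]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each record goes to its nearest mode.
        new_labels = [
            min(range(len(modes)),
                key=lambda j: matching_dissimilarity(r, modes[j]))
            for r in records
        ]
        if new_labels == labels:
            break  # no record changed cluster, so the modes are stable too
        labels = new_labels
        # Update step: recompute each mode from its current members.
        for j in range(len(modes)):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old mode
                modes[j] = column_modes(members)
    return labels, modes


records = [("a", "x"), ("a", "y"), ("b", "z"), ("b", "z")]
labels, modes = k_modes(records, initial_modes=[("a", "x"), ("b", "z")])
print(labels, modes)
```

    Because the result depends on `initial_modes`, running the sketch with different starting modes is a simple way to observe the sensitivity to initialization that the abstract describes.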

    A fuzzy k-modes algorithm for clustering categorical data

    This correspondence describes extensions to the fuzzy k-means algorithm for clustering categorical data. By using a simple matching dissimilarity measure for categorical objects and modes instead of means for clusters, a new approach is developed that allows the k-means paradigm to be used to cluster large categorical data sets efficiently. A fuzzy k-modes algorithm is presented, and its effectiveness is demonstrated with experimental results.
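
    As a rough illustration of how fuzzy memberships can be combined with the simple matching dissimilarity, the sketch below applies the familiar fuzzy k-means membership formula to categorical records and picks each mode attribute by weighted frequency. The abstract does not give the update rules, so the formulas, the function names (`memberships`, `fuzzy_mode`), and the fuzziness exponent `alpha` are assumptions made for illustration only.

```python
from collections import defaultdict


def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def memberships(record, modes, alpha=1.5):
    """Fuzzy membership of one record in each cluster (values sum to 1)."""
    d = [matching_dissimilarity(record, m) for m in modes]
    if 0 in d:
        # Assumption: a record identical to a mode belongs to it fully
        # (the first matching mode wins if several are identical).
        hit = d.index(0)
        return [1.0 if j == hit else 0.0 for j in range(len(modes))]
    exponent = 1.0 / (alpha - 1.0)
    return [1.0 / sum((dl / dh) ** exponent for dh in d) for dl in d]


def fuzzy_mode(records, weights, alpha=1.5):
    """Per attribute, pick the category with the largest sum of weight**alpha."""
    mode = []
    for col in range(len(records[0])):
        score = defaultdict(float)
        for r, w in zip(records, weights):
            score[r[col]] += w ** alpha
        mode.append(max(score, key=score.get))
    return tuple(mode)


w = memberships(("a", "y"), [("a", "x"), ("b", "z")], alpha=2.0)
print(w)
```

    With `alpha` close to 1 the memberships approach a hard assignment, while larger values spread each record more evenly across clusters.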

    The Clustering of Households in Madura Based on Factors Affecting Their Ingestion of Clean Water Using Similarity Weight and Filter Method

    Clean Water and Sanitation is one of the SDG indicators and relates to the human demand for clean water. Three of the four regencies on Madura Island have reportedly suffered from drought, which motivates this research on meeting the Madurese people's need for water. Madura Island has 3097 households in need of water; however, not all of them can obtain what they need. This research aims to cluster the households of Madura Island according to the factors that affect their ingestion of clean water. Because the data contain both numerical and categorical variables, this research uses the Similarity Weight and Filter Method (SWFM). SWFM is a clustering method for mixed data in which the numerical variables are clustered hierarchically with Ward's method and the categorical variables are clustered with k-modes. The numerical clustering uses 3 variables and yields two optimum groups with Ward's method, with a pseudo-F of 1001.172. The categorical clustering uses 6 variables with k-modes and yields three groups, and SWFM yields five groups. Five groups are selected because they produce the smallest ratio, 0.006627.

    An Efficient k-modes Algorithm for Clustering Categorical Datasets

    Mining clusters from data is an important endeavor in many applications. The k-means method is a popular, efficient, and distribution-free approach for clustering numerical-valued data, but it does not apply to categorical-valued observations. The k-modes method addresses this lacuna by replacing the Euclidean distance with the Hamming distance and the means with the modes in the k-means objective function. We provide a novel, computationally efficient implementation of k-modes, called OTQT. We prove that OTQT finds updates to improve the objective function that are undetectable to existing k-modes algorithms. Although slightly slower per iteration due to algorithmic complexity, OTQT is always more accurate per iteration and almost always faster (and only barely slower on some datasets) to the final optimum. Thus, we recommend OTQT as the preferred, default algorithm for k-modes optimization. Comment: 16 pages, 10 figures, 5 tables
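
    The substitution described in this abstract (Hamming distance for Euclidean distance, modes for means) can be written out as an objective function. The notation below is assumed for illustration and does not come from the abstract: $x_1,\dots,x_n$ are the observations with $p$ categorical attributes each, $m_1,\dots,m_K$ are the cluster modes, and $c_i \in \{1,\dots,K\}$ is the cluster assigned to observation $i$.

```latex
\[
\min_{c_1,\dots,c_n,\; m_1,\dots,m_K} \; \sum_{i=1}^{n} d_H\!\left(x_i, m_{c_i}\right),
\qquad
d_H(x, m) = \sum_{j=1}^{p} \mathbf{1}\!\left[x_j \neq m_j\right]
\]
```

    For fixed assignments, each attribute of $m_k$ that minimizes this sum is the most frequent category among the observations assigned to cluster $k$, which is why the mode plays the role that the mean plays in k-means.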

    An Enhanced Initialization Method to Find an Initial Center for K-modes Clustering

    Data mining is a technique for extracting information from large amounts of data. Clustering is used to group objects that have similar characteristics. The K-means clustering algorithm is very efficient for large data sets of numerical quantities; however, it does not work well for real-world data sets in which most attributes are categorical. The K-modes algorithm is used in place of K-means for such data. The existing system considers the initialization of K-modes clustering from the viewpoint of outlier detection, which prevents several initial cluster centers from coming from the same cluster. To overcome the above-mentioned limitation, it uses the Initial_Distance and Initial_Entropy algorithms, which apply a new weighting formula to calculate the degree of outlierness of each object, so that K-modes can guarantee that the chosen initial cluster centers are not outliers. To improve performance further, a new modified distance metric, the weighted matching distance, is used to calculate the distance between two objects during initialization. In addition, a data pre-processing method is used to improve the quality of the data. Experiments were carried out on several data sets from the UCI repository, and the results demonstrate the effectiveness of the initialization method in the proposed algorithm.

    Analisis Clustering Menggunakan Algoritma K-Modes

    ABSTRACT: Clustering is the process of grouping data into classes or clusters so that the objects in a cluster are highly similar to the other objects in the same cluster but very dissimilar to objects in other clusters. One commonly used algorithm for clustering data is the k-means algorithm. K-means is very popular because of its efficiency in clustering data. However, the algorithm is limited to grouping numerical data, whereas in the real world many data sets have attributes with categorical values. To handle categorical data, this Final Project discusses an algorithm called k-modes, which is a variant of the k-means algorithm. Like k-means, the k-modes algorithm produces a locally optimal solution, which is related to the initialization process that determines the initial cluster centroids. This Final Project discusses two methods for initializing the k-modes algorithm: random initialization and a frequency-based method. It is shown that selecting the k initial points with the frequency-based method gives better accuracy in grouping data than random initialization. Keywords: Clustering, k-means, k-modes, frequency-based

    Optimal mathematical programming and variable neighborhood search for k-modes categorical data clustering

    The conventional k-modes algorithm and its variants have been used extensively for categorical data clustering. However, these algorithms have some drawbacks: for example, they can become trapped in local optima and are sensitive to the initial clusters/modes. Our numerical experiments even showed that the k-modes algorithm could not identify the optimal clustering results for some special datasets regardless of the selection of the initial centers. In this paper, we developed an integer linear programming (ILP) approach for k-modes clustering, which is independent of the initial solution and can directly obtain the optimal results for small-sized datasets. We also developed a heuristic algorithm that implements iterative partial optimization in the ILP approach based on a framework of variable neighborhood search, known as IPO-ILP-VNS, to search for near-optimal results on medium- and large-sized datasets with controlled computing time. Experiments on 38 datasets, including 27 synthesized small datasets and 11 known benchmark datasets from the UCI site, were carried out to test the proposed ILP approach and the IPO-ILP-VNS algorithm. The experimental results outperformed the conventional and other existing enhanced k-modes algorithms in the literature and updated 9 of the UCI benchmark datasets with new and improved results.