An Enhanced Initialization Method to Find an Initial Center for K-modes Clustering
Data mining is a technique that extracts information from large amounts of data. Clustering is used to group objects that have similar characteristics. The K-means clustering algorithm is very efficient for large data sets of numerical quantities; however, it does not work well for real-world data sets in which most attributes take categorical values. For such data, the K-modes algorithm is used in place of K-means. In the existing system, the initialization of K-modes clustering is considered from the viewpoint of outlier detection, which avoids selecting several initial cluster centers from the same cluster. To overcome this limitation, it uses the Initial_Distance and Initial_Entropy algorithms, which apply a new weightage formula to calculate the degree of outlierness of each object. The K-modes algorithm can guarantee that the chosen initial cluster centers are not outliers. To improve performance further, a new modified distance metric, the weighted matching distance, is used to calculate the distance between two objects during initialization. In addition, a data pre-processing method is used to improve the quality of the data. Experiments were carried out on several data sets from the UCI repository, and the results demonstrated the effectiveness of the initialization method in the proposed algorithm.
Optimal mathematical programming and variable neighborhood search for k-modes categorical data clustering
The conventional k-modes algorithm and its variants have been extensively used for categorical data clustering. However, these algorithms have some drawbacks, e.g., they can be trapped in local optima and are sensitive to the initial clusters/modes. Our numerical experiments even showed that the k-modes algorithm could not identify the optimal clustering results for some special datasets regardless of the selection of the initial centers. In this paper, we developed an integer linear programming (ILP) approach for k-modes clustering, which is independent of the initial solution and can directly obtain the optimal results for small-sized datasets. We also developed a heuristic algorithm, known as IPO-ILP-VNS, that implements iterative partial optimization in the ILP approach based on a framework of variable neighborhood search, to search for near-optimal results on medium and large-sized datasets with controlled computing time. Experiments on 38 datasets, including 27 synthesized small datasets and 11 known benchmark datasets from the UCI site, were carried out to test the proposed ILP approach and the IPO-ILP-VNS algorithm. The proposed methods outperformed the conventional and other existing enhanced k-modes algorithms in the literature and updated 9 of the UCI benchmark datasets with new and improved results.
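To make the notion of an "optimal clustering result" for a small dataset concrete, the sketch below finds the minimum k-modes cost by exhaustive enumeration. This is not the ILP approach or IPO-ILP-VNS from the abstract; it is a brute-force stand-in that is only feasible for a handful of rows. The function names `partition_cost` and `brute_force_optimum` are hypothetical, and the cost is assumed to be the sum of simple-matching (Hamming) distances to each cluster's mode.

```python
from collections import Counter
from itertools import product


def partition_cost(rows, labels, k):
    """Sum of Hamming distances from every row to the mode of its cluster."""
    cost = 0
    for c in range(k):
        members = [r for r, lab in zip(rows, labels) if lab == c]
        if not members:
            continue  # an empty cluster contributes nothing
        for column in zip(*members):
            # Mismatches against the per-attribute mode:
            # cluster size minus the count of the most frequent value.
            cost += len(column) - Counter(column).most_common(1)[0][1]
    return cost


def brute_force_optimum(rows, k):
    """Try all k**n label vectors (assumption: n is tiny) and keep the cheapest."""
    best = min(
        product(range(k), repeat=len(rows)),
        key=lambda labels: partition_cost(rows, labels, k),
    )
    return best, partition_cost(rows, best, k)


if __name__ == "__main__":
    toy = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
    print(brute_force_optimum(toy, 2))
```

Because the search space grows as k to the power n, this only serves as a reference optimum against which a heuristic run on a toy dataset can be compared.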
An Efficient k-modes Algorithm for Clustering Categorical Datasets
Mining clusters from data is an important endeavor in many applications. The
k-means method is a popular, efficient, and distribution-free approach for
clustering numerical-valued data, but does not apply for categorical-valued
observations. The k-modes method addresses this lacuna by replacing the
Euclidean with the Hamming distance and the means with the modes in the
k-means objective function. We provide a novel, computationally efficient
implementation of k-modes, called OTQT. We prove that OTQT finds updates to
improve the objective function that are undetectable to existing k-modes
algorithms. Although slightly slower per iteration due to algorithmic
complexity, OTQT is always more accurate per iteration and almost always faster
(and only barely slower on some datasets) to the final optimum. Thus, we
recommend OTQT as the preferred, default algorithm for k-modes optimization.
Comment: 16 pages, 10 figures, 5 tables
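The substitution described in this abstract (Hamming distance in place of Euclidean distance, modes in place of means) can be sketched as a plain alternating k-modes loop. This is a minimal illustration of the conventional scheme, not OTQT; the names `hamming`, `mode_of`, and `k_modes` are hypothetical, and the initial modes are assumed to be supplied by the caller.

```python
from collections import Counter


def hamming(a, b):
    """Number of attributes on which two categorical rows disagree."""
    return sum(x != y for x, y in zip(a, b))


def mode_of(members):
    """Per-attribute most frequent value among the rows of one cluster."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))


def k_modes(rows, init_modes, max_iter=100):
    """Alternate assignment and mode update until the labels stop changing."""
    modes = list(init_modes)
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest mode under Hamming distance.
        new_labels = [
            min(range(len(modes)), key=lambda c: hamming(r, modes[c]))
            for r in rows
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: recompute each non-empty cluster's mode.
        for c in range(len(modes)):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                modes[c] = mode_of(members)
    cost = sum(hamming(r, modes[lab]) for r, lab in zip(rows, labels))
    return labels, modes, cost


if __name__ == "__main__":
    data = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
    print(k_modes(data, [("a", "x"), ("b", "y")]))
```

The returned cost is the k-modes objective (total Hamming distance to the assigned modes), which makes it easy to compare runs started from different initial modes.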