6 research outputs found

    An Enhanced Initialization Method to Find an Initial Center for K-modes Clustering

    Get PDF
    Data mining is a technique which extracts the information from the large amount of data. To group the objects having similar characteristics, clustering method is used. K-means clustering algorithm is very efficient for large data sets deals with numerical quantities however it not works well for real world data sets which contain categorical values for most of the attributes. K-modes algorithm is used in the place of K-means algorithm. In the existing system, the initialization of K- modes clustering from the view of outlier detection is considered. It avoids that various initial cluster centers come from the same cluster. To overcome the above said limitation, it uses Initial_Distance and Initial_Entropy algorithms which use a new weightage formula to calculate the degree of outlierness of each object. K-modes algorithm can guarantee that the chosen initial cluster centers are not outliers. To improve the performance further, a new modified distance metric -weighted matching distance is used to calculate the distance between two objects during the process of initialization. As well as, one of the data pre-processing methods is used to improve the quality of data. Experiments are carried out on several data sets from UCI repository and the results demonstrated the effectiveness of the initialization method in the proposed algorithm

    Optimal mathematical programming and variable neighborhood search for k-modes categorical data clustering

    Get PDF
    The conventional k-modes algorithm and its variants have been extensively used for categorical data clustering. However, these algorithms have some drawbacks, e.g., they can be trapped into local optima and sensitive to initial clusters/modes. Our numerical experiments even showed that the k-modes algorithm could not identify the optimal clustering results for some special datasets regardless the selection of the initial centers. In this paper, we developed an integer linear programming (ILP) approach for the k-modes clustering, which is independent to the initial solution and can obtain directly the optimal results for small-sized datasets. We also developed a heuristic algorithm that implements iterative partial optimization in the ILP approach based on a framework of variable neighborhood search, known as IPO-ILP-VNS, to search for near-optimal results of medium and large sized datasets with controlled computing time. Experiments on 38 datasets, including 27 synthesized small datasets and 11 known benchmark datasets from the UCI site were carried out to test the proposed ILP approach and the IPO-ILP-VNS algorithm. The experimental results outperformed the conventional and other existing enhanced k-modes algorithms in literature, updated 9 of the UCI benchmark datasets with new and improved results

    An Efficient kk-modes Algorithm for Clustering Categorical Datasets

    Get PDF
    Mining clusters from data is an important endeavor in many applications. The kk-means method is a popular, efficient, and distribution-free approach for clustering numerical-valued data, but does not apply for categorical-valued observations. The kk-modes method addresses this lacuna by replacing the Euclidean with the Hamming distance and the means with the modes in the kk-means objective function. We provide a novel, computationally efficient implementation of kk-modes, called OTQT. We prove that OTQT finds updates to improve the objective function that are undetectable to existing kk-modes algorithms. Although slightly slower per iteration due to algorithmic complexity, OTQT is always more accurate per iteration and almost always faster (and only barely slower on some datasets) to the final optimum. Thus, we recommend OTQT as the preferred, default algorithm for kk-modes optimization.Comment: 16 pages, 10 figures, 5 table
    corecore