140 research outputs found

    A survey on feature weighting based K-Means algorithms

    Get PDF
    This is a pre-copyedited, author-produced PDF of an article accepted for publication in Journal of Classification [de Amorim, R. C., 'A survey on feature weighting based K-Means algorithms', Journal of Classification, Vol. 33(2): 210-242, August 25, 2016]. Subject to embargo. Embargo end date: 25 August 2017. The final publication is available at Springer via http://dx.doi.org/10.1007/s00357-016-9208-4 © Classification Society of North America 2016In a real-world data set there is always the possibility, rather high in our opinion, that different features may have different degrees of relevance. Most machine learning algorithms deal with this fact by either selecting or deselecting features in the data preprocessing phase. However, we maintain that even among relevant features there may be different degrees of relevance, and this should be taken into account during the clustering process. With over 50 years of history, K-Means is arguably the most popular partitional clustering algorithm there is. The first K-Means based clustering algorithm to compute feature weights was designed just over 30 years ago. Various such algorithms have been designed since but there has not been, to our knowledge, a survey integrating empirical evidence of cluster recovery ability, common flaws, and possible directions for future research. This paper elaborates on the concept of feature weighting and addresses these issues by critically analysing some of the most popular, or innovative, feature weighting mechanisms based in K-Means.Peer reviewedFinal Accepted Versio

    Linear, Deterministic, and Order-Invariant Initialization Methods for the K-Means Clustering Algorithm

    Full text link
    Over the past five decades, k-means has become the clustering algorithm of choice in many application domains primarily due to its simplicity, time/space efficiency, and invariance to the ordering of the data points. Unfortunately, the algorithm's sensitivity to the initial selection of the cluster centers remains to be its most serious drawback. Numerous initialization methods have been proposed to address this drawback. Many of these methods, however, have time complexity superlinear in the number of data points, which makes them impractical for large data sets. On the other hand, linear methods are often random and/or sensitive to the ordering of the data points. These methods are generally unreliable in that the quality of their results is unpredictable. Therefore, it is common practice to perform multiple runs of such methods and take the output of the run that produces the best results. Such a practice, however, greatly increases the computational requirements of the otherwise highly efficient k-means algorithm. In this chapter, we investigate the empirical performance of six linear, deterministic (non-random), and order-invariant k-means initialization methods on a large and diverse collection of data sets from the UCI Machine Learning Repository. The results demonstrate that two relatively unknown hierarchical initialization methods due to Su and Dy outperform the remaining four methods with respect to two objective effectiveness criteria. In addition, a recent method due to Erisoglu et al. performs surprisingly poorly.Comment: 21 pages, 2 figures, 5 tables, Partitional Clustering Algorithms (Springer, 2014). arXiv admin note: substantial text overlap with arXiv:1304.7465, arXiv:1209.196

    An exponential-family multidimensional scaling mixture methodology

    No full text
    A multidimensional scaling methodology (STUNMIX) for the analysis of subjects' preference/choice of stimuli that sets out to integrate the previous work in this area into a single framework, as well as to provide a variety of new options and models, is presented. Locations of the stimuli and the ideal points of derived segments of subjects on latent dimensions are estimated simultaneously. The methodology is formulated in the framework of the exponential family of distributions, whereby a wide range of different data types can be analyzed. Possible reparameterizations of stimulus coordinates by stimulus characteristics, as well as of probabilities of segment membership by subject background variables, are permitted. The models are estimated in a maximum likelihood framework. The performance of the models is demonstrated on synthetic data, and robustness is investigated. An empirical application is provided, concerning intentions to buy portable telephones
    corecore