60 research outputs found

    Towards explaining the speed of k-means

    The k-means method is a popular clustering algorithm, known for its speed in practice. This stands in contrast to its exponential worst-case running time. To explain the speed of the k-means method, a smoothed analysis has been conducted. We sketch this smoothed analysis and a generalization to Bregman divergences.
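    The generalization works because, for any Bregman divergence, the arithmetic mean remains the optimal cluster representative, so only the assignment step changes. A minimal sketch of this Lloyd-style iteration with a pluggable divergence (assuming NumPy; squared Euclidean distance, the Bregman divergence generated by φ(x) = ||x||², recovers standard k-means — this illustrates the algorithm, not the smoothed-analysis machinery itself):

```python
import numpy as np

def bregman_kmeans(X, centers, divergence, n_iter=20):
    """Lloyd-style iteration where the assignment step uses an
    arbitrary Bregman divergence; the update step is still the mean,
    which is the Bregman-optimal representative of a cluster."""
    centers = centers.copy()
    for _ in range(n_iter):
        # assignment: each point goes to the center minimizing the divergence
        dists = np.array([[divergence(x, c) for c in centers] for x in X])
        labels = dists.argmin(axis=1)
        # update: the mean minimizes total Bregman divergence within a cluster
        for j in range(len(centers)):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# squared Euclidean distance is the Bregman divergence of phi(x) = ||x||^2
sq_euclidean = lambda x, c: np.sum((x - c) ** 2)
```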

    An improved K-MEANS method for clustering the states of computer equipment

    This work improves the K-MEANS method for clustering the states of computer equipment, with the aim of raising the quality of the partition of the set of such states. The object of the research is the process of clustering computer-equipment states; the subject is methods of cluster analysis for such states. The relevance of the study stems from rapid scientific and technical progress: the amount of computer equipment in use across many fields has grown significantly, and with it the likelihood of equipment-specific situations, given the diversity of functions the equipment performs. Various administrative decisions about the further operation of computer equipment depend on its state, so developing or improving clustering methods that inform those decisions is important. An analysis of clustering approaches applicable to this problem showed that a suitable method must be crisp (non-fuzzy), non-hierarchical, and scalable, which are characteristics of the K-MEANS method.
    Testing known modifications of the method for this task revealed insufficient accuracy in assigning a state to a cluster, caused by the random selection of the initial cluster centers. This shortcoming was corrected by determining the initial cluster centers from potential values, and by separating into their own taxon those states that could be mistakenly assigned to a cluster because of admissible deviations in the values of their parameters and characteristics. These changes improved the quality of the partition of the set of computer-equipment states by 7% on average.
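    The abstract does not give the exact potential formula, but the idea of picking initial centers from potential values can be sketched in the style of subtractive clustering: each point's potential is the sum of Gaussian contributions from all points, and after a center is chosen the potentials of nearby points are suppressed (assuming NumPy; the kernel width `alpha` and the suppression rule are illustrative assumptions, not the authors' exact method):

```python
import numpy as np

def potential_init(X, k, alpha=1.0):
    """Choose k initial centers by potential values: a point's potential
    is the summed Gaussian contribution of all points (a density proxy);
    choosing a center suppresses the potentials of its neighbors, so the
    next center comes from a different dense region."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # pairwise squared distances
    potential = np.exp(-alpha * d2).sum(axis=1)              # density-like potential
    centers = []
    for _ in range(k):
        i = int(potential.argmax())                          # highest-potential point
        centers.append(X[i])
        potential = potential - potential[i] * np.exp(-alpha * d2[i])  # suppress neighbors
    return np.array(centers)
```

    Because the selection is driven by the data's density structure rather than random draws, the seeding is deterministic, which addresses the accuracy problem the abstract attributes to random initial centers.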

    Linear, Deterministic, and Order-Invariant Initialization Methods for the K-Means Clustering Algorithm

    Over the past five decades, k-means has become the clustering algorithm of choice in many application domains, primarily due to its simplicity, time/space efficiency, and invariance to the ordering of the data points. Unfortunately, the algorithm's sensitivity to the initial selection of the cluster centers remains its most serious drawback. Numerous initialization methods have been proposed to address this drawback. Many of these methods, however, have time complexity superlinear in the number of data points, which makes them impractical for large data sets. Linear methods, on the other hand, are often random and/or sensitive to the ordering of the data points, and are generally unreliable in that the quality of their results is unpredictable. It is therefore common practice to perform multiple runs of such methods and keep the output of the best run, a practice that greatly increases the computational requirements of the otherwise highly efficient k-means algorithm. In this chapter, we investigate the empirical performance of six linear, deterministic (non-random), and order-invariant k-means initialization methods on a large and diverse collection of data sets from the UCI Machine Learning Repository. The results demonstrate that two relatively unknown hierarchical initialization methods due to Su and Dy outperform the remaining four methods with respect to two objective effectiveness criteria. In addition, a recent method due to Erisoglu et al. performs surprisingly poorly.
    Comment: 21 pages, 2 figures, 5 tables; Partitional Clustering Algorithms (Springer, 2014). arXiv admin note: substantial text overlap with arXiv:1304.7465, arXiv:1209.196
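    To make the three properties concrete, here is a generic seeding routine that is linear in the number of points, deterministic, and (up to ties) order-invariant: start from the point nearest the data mean, then repeatedly add the point farthest from all chosen centers. This is a deterministic maximin variant shown for illustration only; it is not one of the six methods evaluated in the chapter (assuming NumPy):

```python
import numpy as np

def deterministic_maximin_init(X, k):
    """Linear-time, deterministic, order-invariant seeding sketch:
    the first center is the point nearest the data mean (no random
    draw), and each subsequent center is the point with the largest
    distance to its nearest already-chosen center."""
    first = int(((X - X.mean(axis=0)) ** 2).sum(axis=1).argmin())
    chosen = [first]
    min_d2 = ((X - X[first]) ** 2).sum(axis=1)  # distance to nearest chosen center
    for _ in range(k - 1):
        nxt = int(min_d2.argmax())              # farthest point from all centers
        chosen.append(nxt)
        min_d2 = np.minimum(min_d2, ((X - X[nxt]) ** 2).sum(axis=1))
    return X[np.array(chosen)]
```

    Every step is an argmin/argmax over data-derived quantities, so the result does not depend on row order except when exact ties occur.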

    Fast k-means based on KNN Graph

    In the era of big data, k-means clustering has been widely adopted as a basic processing tool in various contexts. However, its computational cost can become prohibitively high when the data size and the number of clusters are large. It is well known that the processing bottleneck of k-means lies in the operation of seeking the closest centroid in each iteration. In this paper, a novel solution to the scalability issue of k-means is presented. In the proposal, k-means is supported by an approximate k-nearest-neighbor graph: in each k-means iteration, a data sample is compared only to the clusters in which its nearest neighbors reside. Since the number of nearest neighbors considered is much smaller than k, the cost of this step becomes minor and independent of k, and the processing bottleneck is thereby overcome. Most interestingly, the k-nearest-neighbor graph is itself constructed by iteratively calling the fast k-means algorithm. Compared with existing fast k-means variants, the proposed algorithm achieves a speed-up of hundreds to thousands of times while maintaining high clustering quality. When tested on 10 million 512-dimensional data points, it takes only 5.2 hours to produce 1 million clusters; performing clustering at the same scale with traditional k-means would take 3 years.
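    The restricted assignment step described above can be sketched as follows: instead of scanning all k centroids, each sample is compared only to the centroids of the clusters its nearest neighbors currently belong to, plus its own. This is a sketch of the idea under the assumption that a neighbor table `knn` has already been computed; it is not the paper's implementation (assuming NumPy):

```python
import numpy as np

def knn_restricted_assign(X, centers, labels, knn):
    """One assignment step of the KNN-graph acceleration: the candidate
    centroids for sample i are the current clusters of its nearest
    neighbors (knn[i], assumed precomputed) plus its own cluster, so the
    per-sample cost depends on the neighbor count, not on k."""
    new_labels = labels.copy()
    for i, x in enumerate(X):
        candidates = np.unique(np.append(labels[knn[i]], labels[i]))
        d2 = ((centers[candidates] - x) ** 2).sum(axis=1)
        new_labels[i] = candidates[d2.argmin()]
    return new_labels
```

    A mislabeled sample whose neighbors sit in the correct cluster is pulled back in a single step, while no sample ever pays the full k-way comparison.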

    k-means requires exponentially many iterations even in the plane

    The k-means algorithm is a well-known method for partitioning n points that lie in d-dimensional space into k clusters. Its main features are simplicity and speed in practice. Theoretically, however, the best known upper bound on its running time, O(n^{kd}), can be exponential in the number of points. Recently, Arthur and Vassilvitskii [2] showed a superpolynomial worst-case lower bound, improving the best known lower bound from Ω(n) to 2^{Ω(√n)} with a construction in d = Ω(√n) dimensions. In [2] they also conjectured the existence of superpolynomial lower bounds for any d ≥ 2. Our contribution is twofold: we prove this conjecture and we improve the lower bound.
    The k-means method is one of the most widely used algorithms for geometric clustering. It was originally proposed by Forgy in 1965 [7] and MacQueen in 1967 [13], and is often known as Lloyd's algorithm [12]. It is a local search algorithm that partitions n data points into k clusters as follows: seeded with k initial cluster centers, it assigns every data point to its closest center.
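    The local search loop just described, assign every point to its closest center and then move each center to the mean of its assigned points, fits in a few lines (a minimal sketch assuming NumPy and given initial centers):

```python
import numpy as np

def lloyd(X, centers, n_iter=50):
    """Lloyd's algorithm: alternate between assigning each point to its
    nearest center and recomputing each center as its cluster's mean."""
    centers = centers.copy()
    for _ in range(n_iter):
        # squared distance from every point to every center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(len(centers)):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```

    The lower-bound constructions in the abstract show that, on adversarial inputs, this simple loop can be forced to run for exponentially many iterations before the assignment stops changing.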