71 research outputs found

    Quantization/clustering: when and why does k-means work?

    Though mostly used as a clustering algorithm, k-means was originally designed as a quantization algorithm: it aims at compressing a probability distribution with k points. Building upon [21, 33], we investigate how and when these two approaches are compatible. Namely, we show that, provided the sample distribution satisfies a margin-like condition (in the sense of [27] for supervised learning), both the associated empirical risk minimizer and the output of Lloyd's algorithm provide almost optimal classification in certain cases (in the sense of [6]). Besides, we also show that they achieve fast and optimal convergence rates in terms of sample size and compression risk.
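    A minimal sketch of Lloyd's algorithm viewed as a quantizer may make the connection concrete. The sketch below (plain NumPy; the function name, tolerance, and initialization are illustrative, not from the paper) approximates a distribution by samples X and reports the compression risk, i.e. the mean squared distance from a sample to the nearest of the k centers.

        import numpy as np

        def lloyd_quantizer(X, k, n_iter=100, tol=1e-8, seed=0):
            """Quantize the empirical distribution of X with k points.

            Returns the k centers and the compression risk: the mean
            squared distance from each sample to its nearest center.
            """
            X = np.asarray(X, dtype=float)
            rng = np.random.default_rng(seed)
            centers = X[rng.choice(len(X), size=k, replace=False)]
            prev_risk = np.inf
            for _ in range(n_iter):
                # Assign each sample to its nearest center (quantization step).
                d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
                labels = d2.argmin(axis=1)
                risk = d2[np.arange(len(X)), labels].mean()
                if prev_risk - risk < tol:
                    break
                prev_risk = risk
                # Move each nonempty cell's center to its mean; this update
                # never increases the risk.
                for j in range(k):
                    if np.any(labels == j):
                        centers[j] = X[labels == j].mean(axis=0)
            return centers, risk

    Under a margin-like condition as above, the cells of a near-optimal quantizer also behave like well-separated classes, which is why the same centers can double as a classifier.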

    Power to the points: validating data memberships in clusterings

    In this paper, we present a method to attach affinity scores to the implicit labels of individual points in a clustering. The affinity scores capture the confidence level of the cluster that claims to "own" the point. We demonstrate that these scores accurately capture the quality of the label assigned to the point. We also show further applications of these scores: estimating global measures of clustering quality, and accelerating clustering algorithms by orders of magnitude via affinity-based active selection. The method is very general and applies to clusterings derived from any geometric source. It lends itself to easy visualization and can prove useful as part of an interactive visual analytics framework. It is also efficient: assigning an affinity score to a point depends only polynomially on the number of clusters and is independent of both the size and the dimensionality of the data. It is based on techniques from the theory of interpolation, coupled with sampling and estimation algorithms from high-dimensional computational geometry.
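    The paper's scores are built from interpolation theory; as a much simpler stand-in that only illustrates the interface (a per-point confidence in [0, 1] computed from the cluster centers), one could use the ratio of nearest to second-nearest center distance. This toy proxy is an assumption of this sketch, not the paper's construction.

        import numpy as np

        def affinity_scores(X, centers):
            """Toy per-point affinity: confidence that each point belongs
            to its nearest cluster, via the nearest/second-nearest center
            distance ratio (requires at least two centers)."""
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            d.sort(axis=1)  # d[:, 0] = nearest, d[:, 1] = runner-up
            # Ratio near 0: the point sits deep inside its cluster (score
            # near 1); ratio near 1: the point is ambiguous (score near 0).
            return 1.0 - d[:, 0] / (d[:, 1] + 1e-12)

    Scores of this kind can then drive applications like those mentioned above, e.g. active selection: spend further clustering effort only on the low-affinity points.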

    Feature Extractors for Describing Vehicle Routing Problem Instances

    What are the true clusters?

    Constructivist philosophy and Hasok Chang's active scientific realism are used to argue that the idea of "truth" in cluster analysis depends on the context and the clustering aims. Different characteristics of clusterings are required in different situations. Researchers should be explicit about what requirements and what idea of "true clusters" their research is based on, because clustering becomes scientific not through uniqueness but through transparent and open communication. The idea of "natural kinds" is a human construct, but it highlights the human experience that the reality outside the observer's control seems to make certain distinctions between categories inevitable. Various desirable characteristics of clusterings and various approaches to defining a context-dependent truth are listed, and I discuss what impact these ideas can have on the comparison of clustering methods, on the choice of a clustering method, and on related decisions in practice.

    On Sampling Based Algorithms for k-Means

    Theoretical Analysis of Hierarchical Clustering and the Shadow Vertex Algorithm

    Agglomerative clustering (AC) is a very popular greedy method for computing hierarchical clusterings in practice, yet its theoretical properties have been studied relatively little. We consider AC with respect to the most popular objective functions, especially the diameter function, the radius function, and the k-means function. Given a finite set P of points in R^d, AC starts with each point from P in a cluster of its own and then iteratively merges the two clusters from the current clustering that minimize the respective objective function when merged into a single cluster. We study the problem of partitioning P into k clusters such that the largest diameter of the clusters is minimized, and we prove that AC computes an O(1)-approximation for this problem for any metric that is induced by a norm, assuming that the dimension d is a constant. This improves the best previously known bound of O(log k) due to Ackermann et al. Our bound also carries over to the k-center and the continuous k-center problem.

    Moreover, we study the behavior of agglomerative clustering for the hierarchical k-means problem. We show that AC computes a 2-approximation with respect to the k-means objective function if the optimal k-clustering is well separated. If the optimal clustering additionally satisfies a balance condition, then AC fully recovers the optimum solution. These results hold in arbitrary dimension. We accompany our positive results with a lower bound of Ω((3/2)^d) for data sets in R^d that holds if no separation is guaranteed, and with lower bounds when the guaranteed separation is not sufficiently strong. Finally, we show that AC produces an O(1)-approximate clustering for one-dimensional data sets.

    Apart from AC, we provide improved and in some cases new general upper and lower bounds on the existence of hierarchical clusterings. For the discrete radius objective function, we provide a new lower bound of 2 and improve the upper bound of 4. For the k-means objective function, we state a lower bound of 32 on the existence of hierarchical clusterings, improving the best previously known bound of 576.

    The simplex algorithm is probably the most popular algorithm for solving linear programs in practice. It is determined by so-called pivot rules. The shadow vertex simplex algorithm is a popular pivot rule which has gained attention in recent years because it was shown to have polynomial running time in the model of smoothed complexity. In the second part of the dissertation, we show that the shadow vertex simplex algorithm can be used to solve linear programs in strongly polynomial time with respect to the number n of variables, the number m of constraints, and 1/δ, where δ is a parameter that measures the flatness of the vertices of the polyhedron. This extends a previous result that the shadow vertex algorithm finds paths of polynomial length (with respect to n, m, and 1/δ) between two given vertices of a polyhedron. Our result also complements a result due to Eisenbrand and Vempala, who have shown that a certain version of the random edge pivot rule solves linear programs in time strongly polynomial in the number of variables n and 1/δ, but independent of the number m of constraints. Even though the running time of our algorithm depends on m, it is significantly faster for the important special case of totally unimodular linear programs, for which 1/δ is at most n and which have only O(n^2) constraints.
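    To make the merge rule concrete, here is a small sketch of AC under the diameter objective (complete linkage): start with singleton clusters and repeatedly merge the pair of clusters whose union has the smallest diameter until k clusters remain. The naive pairwise distance table and the function name are illustrative, not taken from the dissertation.

        import numpy as np
        from itertools import combinations

        def ac_diameter(X, k):
            """Agglomerative clustering under the diameter objective:
            repeatedly merge the two clusters whose union has the
            smallest largest pairwise distance (complete linkage)."""
            X = np.asarray(X, dtype=float)
            dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
            clusters = [[i] for i in range(len(X))]
            while len(clusters) > k:
                best = None
                for a, b in combinations(range(len(clusters)), 2):
                    merged = clusters[a] + clusters[b]
                    # Diameter of the merged cluster = max pairwise distance.
                    diam = dist[np.ix_(merged, merged)].max()
                    if best is None or diam < best[0]:
                        best = (diam, a, b)
                _, a, b = best
                clusters[a] += clusters[b]
                del clusters[b]  # a < b, so index a is unaffected
            return clusters

    The first result above says that, for norm-induced metrics in constant dimension, the k-clustering this greedy procedure produces has largest diameter within an O(1) factor of the optimum.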