
    A tight lower bound instance for k-means++ in constant dimension

    The k-means++ seeding algorithm is one of the most popular algorithms used for finding the initial $k$ centers for the k-means heuristic. The algorithm is a simple sampling procedure and can be described as follows: pick the first center uniformly at random from the given points. For $i > 1$, pick a point to be the $i^{th}$ center with probability proportional to the square of the Euclidean distance from this point to the closest of the $(i-1)$ previously chosen centers. The k-means++ seeding algorithm is not only simple and fast but also gives an $O(\log k)$ approximation in expectation, as shown by Arthur and Vassilvitskii. There are datasets on which this seeding algorithm gives an approximation factor of $\Omega(\log k)$ in expectation. However, it is not clear from these results whether the algorithm achieves a good approximation factor with reasonably high probability (say $1/\mathrm{poly}(k)$). Brunsch and Röglin gave a dataset on which the k-means++ seeding algorithm achieves an $O(\log k)$ approximation ratio with probability that is exponentially small in $k$. However, this and all other known lower-bound examples are high dimensional. So, an open problem was to understand the behavior of the algorithm on low dimensional datasets. In this work, we give a simple two dimensional dataset on which the seeding algorithm achieves an $O(\log k)$ approximation ratio with probability exponentially small in $k$. This solves open problems posed by Mahajan et al. and by Brunsch and Röglin.
    Comment: To appear in TAMC 2014. arXiv admin note: text overlap with arXiv:1306.420
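
    The seeding procedure described above is simple enough to sketch directly. The following is a minimal illustrative implementation of the D^2 sampling step, assuming NumPy; the function name and interface are hypothetical and not taken from the paper.

```python
# Minimal sketch of k-means++ (D^2) seeding as described in the abstract.
# Illustrative only; the function name and interface are assumptions.
import numpy as np

def kmeanspp_seed(points: np.ndarray, k: int, rng=None) -> np.ndarray:
    """Pick k initial centers from an (n, d) array of points."""
    rng = np.random.default_rng(rng)
    n = points.shape[0]
    centers = [points[rng.integers(n)]]  # first center: uniform at random
    for _ in range(1, k):
        # squared Euclidean distance of each point to its closest chosen center
        d2 = ((points[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1).min(axis=1)
        centers.append(points[rng.choice(n, p=d2 / d2.sum())])  # D^2 sampling
    return np.array(centers)
```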

    Fair redistricting is hard

    Gerrymandering is a long-standing issue within the U.S. political system, and it has received scrutiny recently by the U.S. Supreme Court. In this note, we prove that deciding whether there exists a fair redistricting among legal maps is NP-hard. To make this precise, we use simplified notions of "legal" and "fair" that account for desirable traits such as geographic compactness of districts and sufficient representation of voters. The proof of our result is inspired by the work of Mahajan, Nimbhorkar, and Varadarajan that proves that planar k-means is NP-hard.

    StreamLearner: Distributed Incremental Machine Learning on Event Streams: Grand Challenge

    Today, massive amounts of streaming data from smart devices need to be analyzed automatically to realize the Internet of Things. The Complex Event Processing (CEP) paradigm promises low-latency pattern detection on event streams. However, CEP systems need to be extended with Machine Learning (ML) capabilities such as online training and inference in order to detect fuzzy patterns (e.g., outliers) and to improve pattern recognition accuracy during runtime using incremental model training. In this paper, we propose a distributed CEP system, denoted StreamLearner, for ML-enabled complex event detection. The proposed programming model and data-parallel system architecture enable a wide range of real-world applications and allow for dynamically scaling system resources up and out for low-latency, high-throughput event processing. We show that the DEBS Grand Challenge 2017 case study (i.e., anomaly detection in smart factories) integrates seamlessly into the StreamLearner API. Our experiments verify the scalability and high event throughput of StreamLearner.
    Comment: Christian Mayer, Ruben Mayer, and Majd Abdo. 2017. StreamLearner: Distributed Incremental Machine Learning on Event Streams: Grand Challenge. In Proceedings of the 11th ACM International Conference on Distributed and Event-based Systems (DEBS '17), 298-30
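
    As a rough illustration of the kind of incremental model training the paper targets (online outlier detection on an event stream), the sketch below keeps a running mean and variance and flags events by a z-score threshold. This is a hypothetical single-operator example in plain Python, not the StreamLearner API or its distributed architecture.

```python
# Hypothetical sketch: incremental (online) anomaly detection on an event stream.
# Not the StreamLearner API; a minimal stand-in for illustration.
from dataclasses import dataclass

@dataclass
class OnlineAnomalyDetector:
    """Welford-style running mean/variance with a z-score threshold."""
    threshold: float = 3.0
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations

    def update_and_score(self, x: float) -> bool:
        """Incrementally train on event value x and report whether it looks anomalous."""
        is_outlier = False
        if self.n > 1:
            std = (self.m2 / (self.n - 1)) ** 0.5
            is_outlier = std > 0 and abs(x - self.mean) / std > self.threshold
        # incremental model update (Welford's algorithm)
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return is_outlier

# usage: feed sensor readings one event at a time
detector = OnlineAnomalyDetector()
for reading in [1.0, 1.1, 0.9, 1.05, 5.0]:
    if detector.update_and_score(reading):
        print("anomaly:", reading)  # flags 5.0
```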

    Clustering processes

    The problem of clustering is considered for the case when each data point is a sample generated by a stationary ergodic process. We propose a very natural asymptotic notion of consistency and show that simple consistent algorithms exist under the most general non-parametric assumptions. The notion of consistency is as follows: two samples should be put into the same cluster if and only if they were generated by the same distribution. With this notion of consistency, clustering generalizes such classical statistical problems as homogeneity testing and process classification. We show that, for the case of a known number of clusters, consistency can be achieved under the only assumption that the joint distribution of the data is stationary ergodic (no parametric or Markovian assumptions, and no assumptions of independence, neither between nor within the samples). If the number of clusters is unknown, consistency can be achieved under appropriate assumptions on the mixing rates of the processes (again, no parametric or independence assumptions). In both cases we give examples of simple (at most quadratic in each argument) algorithms which are consistent.
    Comment: in proceedings of ICML 2010. arXiv-admin note: for version 2 of this article please see: arXiv:1005.0826v
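
    To make the consistency notion concrete, the sketch below clusters binary sequences with a truncated empirical distributional distance and a farthest-point assignment for a known number of clusters. The truncation length, the pattern weights, and the helper names are assumptions chosen for illustration; this is not the paper's exact estimator or algorithm.

```python
# Illustrative sketch (known k): cluster binary sequences by an empirical
# distributional distance truncated to short patterns. Assumptions only,
# not the paper's exact construction.
from collections import Counter
from itertools import product

def empirical_dist(seq, max_len=3):
    """Empirical frequencies of all binary patterns up to length max_len."""
    freqs = {}
    for l in range(1, max_len + 1):
        counts = Counter(tuple(seq[i:i + l]) for i in range(len(seq) - l + 1))
        total = max(len(seq) - l + 1, 1)
        for pattern in product((0, 1), repeat=l):
            freqs[pattern] = counts.get(pattern, 0) / total
    return freqs

def d_hat(x, y, max_len=3):
    """Truncated empirical distributional distance between two sequences."""
    fx, fy = empirical_dist(x, max_len), empirical_dist(y, max_len)
    return sum(2.0 ** (-len(p)) * abs(fx[p] - fy[p]) for p in fx)

def cluster_known_k(samples, k):
    """Pick k mutually far samples as centers, then assign every sample to the nearest one."""
    centers = [0]
    while len(centers) < k:
        centers.append(max(range(len(samples)),
                           key=lambda i: min(d_hat(samples[i], samples[c]) for c in centers)))
    return [min(centers, key=lambda c: d_hat(s, samples[c])) for s in samples]
```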