    Improved Algorithms for Time Decay Streams

    In the time-decay model for data streams, elements of an underlying data set arrive sequentially, with recently arrived elements being more important. A common approach for handling large data sets is to maintain a coreset, a succinct summary of the processed data that allows approximate recovery of a predetermined query. We provide a general framework that takes any offline coreset construction and yields a time-decay coreset for polynomial time-decay functions. We also consider the exponential time-decay model for $k$-median clustering, where we provide a constant-factor approximation algorithm that utilizes the online facility location algorithm. Our algorithm stores $O(k \log(h\Delta) + h)$ points, where $h$ is the half-life of the decay function and $\Delta$ is the aspect ratio of the data set. Our techniques extend to $k$-means clustering and M-estimators as well.
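    The "online facility location algorithm" referenced here is the classical Meyerson-style subroutine. The following is a minimal Python sketch of that subroutine under simplifying assumptions (a fixed facility cost f, uniform weights, and no time decay); it illustrates the building block, not the paper's full time-decay algorithm.

    import math
    import random

    def online_facility_location(stream, f):
        # Meyerson-style rule: each arriving point either joins its
        # nearest open facility, or opens a new facility at its own
        # location with probability min(1, distance / f).
        facilities = []
        for p in stream:
            if not facilities:
                facilities.append(p)
                continue
            d = min(math.dist(p, c) for c in facilities)
            if random.random() < min(1.0, d / f):
                facilities.append(p)
        return facilities

    # Example: 1,000 Gaussian points arriving one at a time.
    pts = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(1000)]
    print(len(online_facility_location(pts, f=0.5)))

    The facility cost f trades off the number of opened centers against the assignment cost; in the exponential time-decay model, the algorithm additionally has to account for point weights that decay with age according to the half-life h.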

    On Generalization Bounds for Projective Clustering

    Given a set of points, clustering consists of finding a partition of the points into $k$ clusters such that each point is as close as possible to the center to which it is assigned. Most commonly, centers are points themselves, which leads to the famous $k$-median and $k$-means objectives. One may also choose centers to be $j$-dimensional subspaces, which gives rise to subspace clustering. In this paper, we consider learning bounds for these problems. That is, given a set of $n$ samples $P$ drawn independently from some unknown but fixed distribution $\mathcal{D}$, how quickly does a solution computed on $P$ converge to the optimal clustering of $\mathcal{D}$? We give several near-optimal results. In particular: For center-based objectives, we show a convergence rate of $\tilde{O}(\sqrt{k/n})$. This matches the known optimal bounds of [Fefferman, Mitter, and Narayanan, Journal of the American Mathematical Society 2016] and [Bartlett, Linder, and Lugosi, IEEE Trans. Inf. Theory 1998] for $k$-means and extends them to other important objectives such as $k$-median. For subspace clustering with $j$-dimensional subspaces, we show a convergence rate of $\tilde{O}(\sqrt{kj^2/n})$. These are the first provable bounds for most of these problems. For the specific case of projective clustering, which generalizes $k$-means, we show that a convergence rate of $\Omega(\sqrt{kj/n})$ is necessary, thereby proving that the bounds from [Fefferman, Mitter, and Narayanan, Journal of the American Mathematical Society 2016] are essentially optimal.
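    Read as a uniform convergence statement, the center-based bound says that the empirical cost of every candidate set of $k$ centers is simultaneously close to its population cost. The display below is a paraphrase of that form for $k$-means (bounded instances assumed; constants and logarithmic factors suppressed), not a verbatim theorem from the paper:

    \[
      \sup_{C \subset \mathbb{R}^d,\, |C| = k}
      \left| \frac{1}{n} \sum_{p \in P} \min_{c \in C} \| p - c \|^2
      \;-\; \mathbb{E}_{x \sim \mathcal{D}} \Big[ \min_{c \in C} \| x - c \|^2 \Big] \right|
      \;\le\; \tilde{O}\!\left( \sqrt{\frac{k}{n}} \right)
    \]

    For $k$-median the squared norm is replaced by the norm itself, and for subspace clustering the minimum is taken over distances to the $j$-dimensional subspaces.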

    Coresets and streaming algorithms for the k-means problem and related clustering objectives

    The k-means problem seeks a clustering that minimizes the sum-of-squared-errors cost function: for input points P from the Euclidean space R^d and any solution consisting of k centers from R^d, the cost is the sum of the squared distances of each point to its closest center. This thesis studies concepts used for large input point sets. For inputs with many points, the term coreset refers to a reduced version with fewer but weighted points. For inputs with high-dimensional points, dimensionality reduction is used to reduce the number of dimensions. In both cases, the reduced version has to preserve the cost function up to an epsilon-fraction for all choices of k centers. We study coreset constructions and dimensionality reductions for the k-means problem. Further, we develop coreset constructions in the data stream model. Here, the data is so large that it should only be read once and cannot be stored in main memory. The input might even arrive as a stream of points in arbitrary order. Thus, a data stream algorithm has to continuously process the input while it arrives and can only store small summaries. In the second part of the thesis, the obtained results are extended to related clustering objectives. Projective clustering minimizes the squared distances to k subspaces instead of k points. Kernel k-means is an extension of k-means to scenarios where the target clustering is not linearly separable. In addition to extensions to these objectives, we study coreset constructions for a probabilistic clustering problem where input points are given as distributions over a finite set of locations.
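    The coreset guarantee described in this abstract is a statement about one cost function evaluated on two point sets: for every choice of k centers, the weighted cost of the summary must be within an epsilon-fraction of the cost of the full input. A minimal NumPy sketch of that cost (the helper name kmeans_cost is hypothetical):

    import numpy as np

    def kmeans_cost(points, centers, weights=None):
        # Sum of (weighted) squared distances from each point to its
        # closest center -- the k-means objective defined above.
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        closest = d2.min(axis=1)  # squared distance to nearest center
        if weights is None:
            weights = np.ones(len(points))
        return float((weights * closest).sum())

    # A weighted summary (S, w) is an epsilon-coreset of P if, for every
    # set C of k centers:
    #   |kmeans_cost(S, C, w) - kmeans_cost(P, C)| <= eps * kmeans_cost(P, C)
    rng = np.random.default_rng(0)
    P = rng.normal(size=(1000, 2))
    C = rng.normal(size=(5, 2))
    print(kmeans_cost(P, C))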

    LIPIcs, Volume 244, ESA 2022, Complete Volume

    LIPIcs, Volume 248, ISAAC 2022, Complete Volume

    LIPIcs, Volume 274, ESA 2023, Complete Volume
