43 research outputs found

    Coresets for Clustering in Graphs of Bounded Treewidth

    Full text link
    We initiate the study of coresets for clustering in graph metrics, i.e., the shortest-path metric of edge-weighted graphs. Such clustering problems are essential to data analysis and are used, for example, in road networks and data visualization. A coreset is a compact summary of the data that approximately preserves the clustering objective for every possible center set, and it offers significant efficiency improvements in terms of running time, storage, and communication, including in streaming and distributed settings. Our main result is a near-linear time construction of a coreset for $k$-Median in a general graph $G$, with size $O_{\epsilon, k}(\mathrm{tw}(G))$ where $\mathrm{tw}(G)$ is the treewidth of $G$, and we complement the construction with a nearly-tight size lower bound. The construction is based on the framework of Feldman and Langberg [STOC 2011], and our main technical contribution, as required by this framework, is a uniform bound of $O(\mathrm{tw}(G))$ on the shattering dimension under any point weights. We validate our coreset on real-world road networks, and our scalable algorithm constructs tiny coresets with high accuracy, which translates to a massive speedup of existing approximation algorithms such as local search for graph $k$-Median.
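
    For reference, the guarantee described above can be written as follows (a standard formalization with notation of our choosing, not necessarily the paper's): a weighted subset $S \subseteq P$ with weights $w$ is an $\epsilon$-coreset for $k$-Median if, for every set $C$ of $k$ centers,

    \[
        \Bigl| \sum_{p \in S} w(p)\, d(p, C) \;-\; \sum_{p \in P} d(p, C) \Bigr| \;\le\; \epsilon \sum_{p \in P} d(p, C),
    \]

    where $d(p, C) = \min_{c \in C} d(p, c)$ and $d$ is the shortest-path metric of $G$.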

    Coresets for Clustering in Geometric Intersection Graphs

    Get PDF

    A New Coreset Framework for Clustering

    Full text link
    Given a metric space, the $(k,z)$-clustering problem consists of finding $k$ centers such that the sum of the distances, raised to the power $z$, of every point to its closest center is minimized. This encapsulates the famous $k$-median ($z=1$) and $k$-means ($z=2$) clustering problems. Designing small-space sketches of the data that approximately preserve the cost of the solutions, also known as \emph{coresets}, has been an important research direction over the last 15 years. In this paper, we present a new, simple coreset framework that simultaneously improves upon the best known bounds for a large variety of settings, including Euclidean spaces, doubling metrics, minor-free metrics, and general metrics.
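
    In symbols (a standard way to write the objective just described; the notation is ours):

    \[
        \mathrm{cost}_z(P, C) \;=\; \sum_{p \in P} \min_{c \in C} d(p, c)^z, \qquad |C| = k,
    \]

    so that $z=1$ recovers the $k$-median objective and $z=2$ the $k$-means objective.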

    Towards Optimal Coreset Construction for $(k,z)$-Clustering: Breaking the Quadratic Dependency on $k$

    Full text link
    Constructing small-sized coresets for various clustering problems has attracted significant attention recently. We provide efficient coreset construction algorithms for $(k, z)$-Clustering with improved coreset sizes in several metric spaces. In particular, we provide an $\tilde{O}_z(k^{(2z+2)/(z+2)}\varepsilon^{-2})$-sized coreset for $(k, z)$-Clustering for all $z\geq 1$ in Euclidean space, improving upon the best known $\tilde{O}_z(k^2\varepsilon^{-2})$ size upper bound [Cohen-Addad, Larsen, Saulpic, Schwiegelshohn, STOC'22], breaking the quadratic dependency on $k$ for the first time (when $k\leq \varepsilon^{-1}$). For example, our coreset size for Euclidean $k$-Median is $\tilde{O}(k^{4/3}\varepsilon^{-2})$, improving the best known result $\tilde{O}(\min\{k^2\varepsilon^{-2}, k\varepsilon^{-3}\})$ by a factor $k^{2/3}$ when $k\leq \varepsilon^{-1}$; for Euclidean $k$-Means, our coreset size is $\tilde{O}(k^{3/2}\varepsilon^{-2})$, improving the best known result $\tilde{O}(\min\{k^2\varepsilon^{-2}, k\varepsilon^{-4}\})$ by a factor $k^{1/2}$ when $k\leq \varepsilon^{-2}$. We also obtain optimal or improved coreset sizes for general metric spaces, metric spaces with bounded doubling dimension, and the shortest-path metric when the underlying graph has bounded treewidth, for all $z\geq 1$. Our algorithm largely follows the framework developed by Cohen-Addad et al. with some minor but useful changes. Our technical contribution mainly lies in the analysis. An important improvement in our analysis is a new notion of $\alpha$-covering of distance vectors with a novel error metric, which allows us to provide a tighter variance bound. Another useful technical ingredient is terminal embedding with additive errors, for bounding the covering number in the Euclidean case.

    Coresets for Clustering: Foundations and Challenges

    Get PDF
    Clustering is a fundamental task in machine learning and data analysis. The main challenge for clustering in big data sets is that classical clustering algorithms often do not scale well. Coresets are data reduction techniques that turn big data into a tiny proxy. Prior research has shown that coresets can provide a scalable solution to clustering problems and imply streaming and distributed algorithms. In this work, we aim to solve a fundamental question and two modern challenges in coresets for clustering.

    \textsf{Beyond Euclidean Space}: Coresets for clustering in Euclidean space have been well studied, and coresets of constant size are known to exist, while very few results are known beyond Euclidean space. A fundamental question is which metric spaces admit constant-sized coresets for clustering. We focus on graph metrics, a common ambient space for clustering. We provide positive results that assert constant-sized coresets exist in various families of graph metrics, including graphs of bounded treewidth, planar graphs, and the more general excluded-minor graphs.

    \textsf{Missing Value}: Missing values are a common phenomenon in real data sets, and clustering in their presence is a very challenging task. In this work, we construct the first coresets for clustering with multiple missing values. Previously, such coresets were only known to exist when each data point has at most one missing value \cite{DBLP:conf/nips/MaromF19}. We further design a near-linear time algorithm to construct our coresets. This algorithm implies the first near-linear time approximation scheme for $k$-Means clustering with missing values and improves a recent result by \cite{DBLP:conf/soda/EibenFGLPS21}.

    \textsf{Simultaneous Coresets}: Most classical coresets are limited to a specific clustering objective. When there are multiple potential objectives, a stronger notion of “simultaneous coresets” is needed. Simultaneous coresets provide approximations for a family of objectives and can serve as a more flexible data reduction tool. In this work, we design the first simultaneous coresets for a large clustering family which includes both $k$-Median and $k$-Center.

    A Survey on Approximation in Parameterized Complexity: Hardness and Algorithms

    Get PDF
    Parameterization and approximation are two popular ways of coping with NP-hard problems. More recently, the two have also been combined to derive many interesting results. We survey developments in the area both from the algorithmic and hardness perspectives, with emphasis on new techniques and potential future research directions.

    Applied Randomized Algorithms for Efficient Genomic Analysis

    Get PDF
    The scope and scale of biological data continue to grow at an exponential clip, driven by advances in genetic sequencing, annotation, and the widespread adoption of surveillance efforts. For instance, the Sequence Read Archive (SRA) now contains more than 25 petabases of public data, while RefSeq, a collection of reference genomes, recently surpassed 100,000 complete genomes. In the process, it has outgrown the practical reach of many traditional algorithmic approaches in both time and space. Motivated by this extreme scale, this thesis details efficient methods for clustering and summarizing large collections of sequence data. While our primary area of interest is biological sequences, these approaches largely apply to sequence collections of any type, including natural language, software source code, and graph-structured data. We applied recent advances in randomized algorithms to practical problems. We used MinHash and HyperLogLog, both examples of Locality-Sensitive Hashing, as well as coresets, which are approximate representations for finite sum problems, to build methods capable of scaling to billions of items. Ultimately, these are all derived from variations on sampling. We combined these advances with hardware-based optimizations and incorporated them into free and open-source software libraries (sketch, frp, libsimdsampling) and practical software tools built on these libraries (Dashing, Minicore, Dashing 2), empowering users to interact practically with colossal datasets on commodity hardware.
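
    As a rough illustration of the sketching techniques named above, here is a minimal MinHash example in Python (an illustrative sketch only; the function names and parameters are ours and do not come from the sketch or Dashing libraries): two sets are summarized by per-seed minimum hash values, and the fraction of agreeing positions estimates their Jaccard similarity.

        import hashlib

        def minhash_signature(items, num_hashes=128):
            """For each seeded hash function, keep the minimum hash value
            seen over all items; the resulting vector is the signature."""
            sig = []
            for seed in range(num_hashes):
                salt = seed.to_bytes(8, "little")
                sig.append(min(
                    int.from_bytes(
                        hashlib.blake2b(item.encode(), digest_size=8,
                                        salt=salt).digest(), "little")
                    for item in items))
            return sig

        def estimate_jaccard(sig_a, sig_b):
            """The fraction of positions where two signatures agree is an
            unbiased estimate of the Jaccard similarity of the two sets."""
            return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

        # Toy example on the k-mer sets of two similar sequences.
        def kmers(s, k=4):
            return {s[i:i + k] for i in range(len(s) - k + 1)}

        sig1 = minhash_signature(kmers("ACGTACGTGACC"))
        sig2 = minhash_signature(kmers("ACGTACGTGATT"))
        print(estimate_jaccard(sig1, sig2))

    The signature has fixed size regardless of the input set, which is what lets such sketches scale to billions of items.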

    The Power of Uniform Sampling for Coresets

    Full text link
    Motivated by practical generalizations of the classic $k$-median and $k$-means objectives, such as clustering with size constraints, fair clustering, and Wasserstein barycenter, we introduce a meta-theorem for designing coresets for constrained-clustering problems. The meta-theorem reduces the task of coreset construction to one on a bounded number of ring instances with a much-relaxed additive error. This reduction enables us to construct coresets using uniform sampling, in contrast to the widely-used importance sampling, and consequently we can easily handle constrained objectives. Notably and perhaps surprisingly, this simpler sampling scheme can yield coresets whose size is independent of $n$, the number of input points. Our technique yields smaller coresets, and sometimes the first coresets, for a large number of constrained clustering problems, including capacitated clustering, fair clustering, Euclidean Wasserstein barycenter, clustering in minor-excluded graphs, and polygon clustering under Fréchet and Hausdorff distance. Finally, our technique also yields smaller coresets for $1$-median in low-dimensional Euclidean spaces, specifically of size $\tilde{O}(\varepsilon^{-1.5})$ in $\mathbb{R}^2$ and $\tilde{O}(\varepsilon^{-1.6})$ in $\mathbb{R}^3$.
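
    A minimal Python sketch of the uniform-sampling scheme at the heart of the above (the generic estimator only, under our own naming; the paper's actual contribution is the reduction to ring instances that makes such a sample suffice with relaxed additive error):

        import random

        def uniform_coreset(points, m, rng=random):
            """Sample m points uniformly with replacement and weight each
            by n/m, so the weighted sample cost matches the full cost in
            expectation for any fixed center set."""
            n = len(points)
            sample = [rng.choice(points) for _ in range(m)]
            weights = [n / m] * m
            return sample, weights

        def weighted_cost(sample, weights, centers, dist, z=1):
            # Weighted analogue of sum_p min_{c in centers} dist(p, c)^z.
            return sum(w * min(dist(p, c) for c in centers) ** z
                       for p, w in zip(sample, weights))

        # Example: estimating the 1-median cost of points on a line.
        pts = list(range(1000))
        core, wts = uniform_coreset(pts, 100)
        est = weighted_cost(core, wts, centers=[500],
                            dist=lambda p, c: abs(p - c))
        print(round(est), sum(abs(p - 500) for p in pts))

    For any fixed center set the weighted sample cost is an unbiased estimator of the true cost; the meta-theorem supplies the conditions under which the estimate also holds uniformly over all center sets for constrained objectives.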

    Fair Correlation Clustering in Forests

    Get PDF
    The study of algorithmic fairness has received growing attention recently. This stems from the awareness that bias in the input data for machine learning systems may result in discriminatory outputs. For clustering tasks, one of the most central notions of fairness is the formalization by Chierichetti, Kumar, Lattanzi, and Vassilvitskii [NeurIPS 2017]. A clustering is said to be fair if each cluster has the same distribution of manifestations of a sensitive attribute as the whole input set. This is motivated by various applications where the objects to be clustered have sensitive attributes that should not be over- or underrepresented. Most research on this version of fair clustering has focused on centroid-based objectives. In contrast, we discuss the applicability of this fairness notion to Correlation Clustering. The existing literature on the resulting Fair Correlation Clustering problem either presents approximation algorithms with poor approximation guarantees or severely limits the possible distributions of the sensitive attribute (often only two manifestations with a 1:1 ratio are considered). Our goal is to understand whether there is hope for better results in between these two extremes. To this end, we consider restricted graph classes which allow us to characterize the distributions of sensitive attributes for which this form of fairness is tractable from a complexity point of view. While existing work on Fair Correlation Clustering gives approximation algorithms, we focus on exact solutions and investigate whether there are efficiently solvable instances. The unfair version of Correlation Clustering is trivial on forests, but adding fairness creates a surprisingly rich picture of complexities. We give an overview of the distributions and types of forests where Fair Correlation Clustering turns from tractable to intractable. Our most surprising insight is that the cause of the hardness of Fair Correlation Clustering is not the strictness of the fairness condition; indeed, we lift most of our results to also hold for the relaxed version of the fairness condition. Instead, the source of hardness seems to be the distribution of the sensitive attribute. On the positive side, we identify some reasonable distributions that are indeed tractable. While this tractability is only shown for forests, it may open an avenue to design reasonable approximations for larger graph classes.
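
    To make the fairness condition concrete, here is a small Python check of the exact (strict) version described above (illustrative only; the names and data representation are ours): a clustering is fair when every cluster reproduces the global proportions of the sensitive attribute.

        from collections import Counter
        from fractions import Fraction

        def is_fair(clusters, attribute):
            """Check strict fairness: in every cluster, each manifestation
            of the sensitive attribute appears in exactly the same
            proportion as in the whole input. clusters is a list of lists
            of points; attribute maps each point to its attribute value."""
            all_points = [p for cluster in clusters for p in cluster]
            n = len(all_points)
            global_counts = Counter(attribute[p] for p in all_points)
            for cluster in clusters:
                counts = Counter(attribute[p] for p in cluster)
                for value, total in global_counts.items():
                    if Fraction(counts.get(value, 0), len(cluster)) \
                            != Fraction(total, n):
                        return False
            return True

        # Example with a 1:1 attribute ratio over four points.
        attr = {1: "a", 2: "b", 3: "a", 4: "b"}
        print(is_fair([[1, 2], [3, 4]], attr))  # True: both clusters 1:1
        print(is_fair([[1, 3], [2, 4]], attr))  # False: clusters are pure

    The relaxed version studied in the paper loosens the exact equality of proportions; as the abstract notes, most of the hardness results persist even then.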