23 research outputs found

    Fully dynamic clustering and diversity maximization in doubling metrics

    Full text link
    We present approximation algorithms for some variants of center-based clustering and related problems in the fully dynamic setting, where the pointset evolves through an arbitrary sequence of insertions and deletions. Specifically, we target the following problems: kk-center (with and without outliers), matroid-center, and diversity maximization. All algorithms employ a coreset-based strategy and rely on the use of the cover tree data structure, which we crucially augment to maintain, at any time, some additional information enabling the efficient extraction of the solution for the specific problem. For all of the aforementioned problems our algorithms yield (α+ε)(\alpha+\varepsilon)-approximations, where α\alpha is the best known approximation attainable in polynomial time in the standard off-line setting (except for kk-center with zz outliers where α=2\alpha = 2 but we get a (3+ε)(3+\varepsilon)-approximation) and ε>0\varepsilon>0 is a user-provided accuracy parameter. The analysis of the algorithms is performed in terms of the doubling dimension of the underlying metric. Remarkably, and unlike previous works, the data structure and the running times of the insertion and deletion procedures do not depend in any way on the accuracy parameter ε\varepsilon and, for the two kk-center variants, on the parameter kk. For spaces of bounded doubling dimension, the running times are dramatically smaller than those that would be required to compute solutions on the entire pointset from scratch. To the best of our knowledge, ours are the first solutions for the matroid-center and diversity maximization problems in the fully dynamic setting

    Improved Approximation and Scalability for Fair Max-Min Diversification

    Get PDF
    Given an nn-point metric space (X,d)(\mathcal{X},d) where each point belongs to one of m=O(1)m=O(1) different categories or groups and a set of integers k1,…,kmk_1, \ldots, k_m, the fair Max-Min diversification problem is to select kik_i points belonging to category i∈[m]i\in [m], such that the minimum pairwise distance between selected points is maximized. The problem was introduced by Moumoulidou et al. [ICDT 2021] and is motivated by the need to down-sample large data sets in various applications so that the derived sample achieves a balance over diversity, i.e., the minimum distance between a pair of selected points, and fairness, i.e., ensuring enough points of each category are included. We prove the following results: 1. We first consider general metric spaces. We present a randomized polynomial time algorithm that returns a factor 22-approximation to the diversity but only satisfies the fairness constraints in expectation. Building upon this result, we present a 66-approximation that is guaranteed to satisfy the fairness constraints up to a factor 1−ϵ1-\epsilon for any constant ϵ\epsilon. We also present a linear time algorithm returning an m+1m+1 approximation with exact fairness. The best previous result was a 3m−13m-1 approximation. 2. We then focus on Euclidean metrics. We first show that the problem can be solved exactly in one dimension. For constant dimensions, categories and any constant ϵ>0\epsilon>0, we present a 1+ϵ1+\epsilon approximation algorithm that runs in O(nk)+2O(k)O(nk) + 2^{O(k)} time where k=k1+…+kmk=k_1+\ldots+k_m. We can improve the running time to O(nk)+poly(k)O(nk)+ poly(k) at the expense of only picking (1−ϵ)ki(1-\epsilon) k_i points from category i∈[m]i\in [m]. Finally, we present algorithms suitable to processing massive data sets including single-pass data stream algorithms and composable coresets for the distributed processing.Comment: To appear in ICDT 202

    Streaming Algorithms for Diversity Maximization with Fairness Constraints

    Full text link
    Diversity maximization is a fundamental problem with wide applications in data summarization, web search, and recommender systems. Given a set XX of nn elements, it asks to select a subset SS of k≪nk \ll n elements with maximum \emph{diversity}, as quantified by the dissimilarities among the elements in SS. In this paper, we focus on the diversity maximization problem with fairness constraints in the streaming setting. Specifically, we consider the max-min diversity objective, which selects a subset SS that maximizes the minimum distance (dissimilarity) between any pair of distinct elements within it. Assuming that the set XX is partitioned into mm disjoint groups by some sensitive attribute, e.g., sex or race, ensuring \emph{fairness} requires that the selected subset SS contains kik_i elements from each group i∈[1,m]i \in [1,m]. A streaming algorithm should process XX sequentially in one pass and return a subset with maximum \emph{diversity} while guaranteeing the fairness constraint. Although diversity maximization has been extensively studied, the only known algorithms that can work with the max-min diversity objective and fairness constraints are very inefficient for data streams. Since diversity maximization is NP-hard in general, we propose two approximation algorithms for fair diversity maximization in data streams, the first of which is 1−ε4\frac{1-\varepsilon}{4}-approximate and specific for m=2m=2, where ε∈(0,1)\varepsilon \in (0,1), and the second of which achieves a 1−ε3m+2\frac{1-\varepsilon}{3m+2}-approximation for an arbitrary mm. Experimental results on real-world and synthetic datasets show that both algorithms provide solutions of comparable quality to the state-of-the-art algorithms while running several orders of magnitude faster in the streaming setting.Comment: 13 pages, 11 figures; published in ICDE 202

    Coresets for Clustering with General Assignment Constraints

    Full text link
    Designing small-sized \emph{coresets}, which approximately preserve the costs of the solutions for large datasets, has been an important research direction for the past decade. We consider coreset construction for a variety of general constrained clustering problems. We significantly extend and generalize the results of a very recent paper (Braverman et al., FOCS'22), by demonstrating that the idea of hierarchical uniform sampling (Chen, SICOMP'09; Braverman et al., FOCS'22) can be applied to efficiently construct coresets for a very general class of constrained clustering problems with general assignment constraints, including capacity constraints on cluster centers, and assignment structure constraints for data points (modeled by a convex body B)\mathcal{B}). Our main theorem shows that a small-sized ϵ\epsilon-coreset exists as long as a complexity measure Lip(B)\mathsf{Lip}(\mathcal{B}) of the structure constraint, and the \emph{covering exponent} Λϵ(X)\Lambda_\epsilon(\mathcal{X}) for metric space (X,d)(\mathcal{X},d) are bounded. The complexity measure Lip(B)\mathsf{Lip}(\mathcal{B}) for convex body B\mathcal{B} is the Lipschitz constant of a certain transportation problem constrained in B\mathcal{B}, called \emph{optimal assignment transportation problem}. We prove nontrivial upper bounds of Lip(B)\mathsf{Lip}(\mathcal{B}) for various polytopes, including the general matroid basis polytopes, and laminar matroid polytopes (with better bound). As an application of our general theorem, we construct the first coreset for the fault-tolerant clustering problem (with or without capacity upper/lower bound) for the above metric spaces, in which the fault-tolerance requirement is captured by a uniform matroid basis polytope

    The Power of Randomization: Distributed Submodular Maximization on Massive Datasets

    Full text link
    A wide variety of problems in machine learning, including exemplar clustering, document summarization, and sensor placement, can be cast as constrained submodular maximization problems. Unfortunately, the resulting submodular optimization problems are often too large to be solved on a single machine. We develop a simple distributed algorithm that is embarrassingly parallel and it achieves provable, constant factor, worst-case approximation guarantees. In our experiments, we demonstrate its efficiency in large problems with different kinds of constraints with objective values always close to what is achievable in the centralized setting

    GLISTER: Generalization based Data Subset Selection for Efficient and Robust Learning

    Full text link
    Large scale machine learning and deep models are extremely data-hungry. Unfortunately, obtaining large amounts of labeled data is expensive, and training state-of-the-art models (with hyperparameter tuning) requires significant computing resources and time. Secondly, real-world data is noisy and imbalanced. As a result, several recent papers try to make the training process more efficient and robust. However, most existing work either focuses on robustness or efficiency, but not both. In this work, we introduce Glister, a GeneraLIzation based data Subset selecTion for Efficient and Robust learning framework. We formulate Glister as a mixed discrete-continuous bi-level optimization problem to select a subset of the training data, which maximizes the log-likelihood on a held-out validation set. Next, we propose an iterative online algorithm Glister-Online, which performs data selection iteratively along with the parameter updates and can be applied to any loss-based learning algorithm. We then show that for a rich class of loss functions including cross-entropy, hinge-loss, squared-loss, and logistic-loss, the inner discrete data selection is an instance of (weakly) submodular optimization, and we analyze conditions for which Glister-Online reduces the validation loss and converges. Finally, we propose Glister-Active, an extension to batch active learning, and we empirically demonstrate the performance of Glister on a wide range of tasks including, (a) data selection to reduce training time, (b) robust learning under label noise and imbalance settings, and (c) batch-active learning with several deep and shallow models. We show that our framework improves upon state of the art both in efficiency and accuracy (in cases (a) and (c)) and is more efficient compared to other state-of-the-art robust learning algorithms in case (b)
    corecore