183 research outputs found

    Small Space Stream Summary for Matroid Center

    In the matroid center problem, which generalizes the k-center problem, we need to pick a set of centers that is an independent set of a matroid with rank r. We study this problem in streaming, where elements of the ground set arrive in the stream. We first show that any randomized one-pass streaming algorithm that computes a better than Delta-approximation for partition-matroid center must use Omega(r^2) bits of space, where Delta is the aspect ratio of the metric and can be arbitrarily large. This shows a quadratic separation between matroid center and k-center, for which the Doubling algorithm [Charikar et al., 1997] gives an 8-approximation using O(k) space and one pass. To complement this, we give a one-pass algorithm for matroid center that stores at most O(r^2 log(1/epsilon)/epsilon) points (viz., stream summary) among which a (7+epsilon)-approximate solution exists, which can be found by brute force, or a (17+epsilon)-approximation can be found with an efficient algorithm. If we are allowed a second pass, we can compute a (3+epsilon)-approximation efficiently. We also consider the problem of matroid center with z outliers and give a one-pass algorithm that outputs a set of O((r^2+rz)log(1/epsilon)/epsilon) points that contains a (15+epsilon)-approximate solution. Our techniques extend to knapsack center and knapsack center with z outliers in a straightforward way, and we get algorithms that use space linear in the size of a largest feasible set (as opposed to quadratic space for matroid center).
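To make the problem statement above concrete, here is a minimal Python sketch of the two ingredients it names: the center objective (the largest distance from any point to its nearest center) and an independence test for a partition matroid. It is not the streaming algorithm from the abstract. The function names, the toy points, and the `capacity` mapping are all assumptions made for illustration.

```python
from collections import Counter
from math import dist


def center_cost(points, centers):
    """Largest distance from any point to its nearest center."""
    return max(min(dist(p, c) for c in centers) for p in points)


def is_independent(centers, group_of, capacity):
    """Partition-matroid test: at most capacity[g] centers from each group g."""
    counts = Counter(group_of[c] for c in centers)
    return all(counts[g] <= capacity.get(g, 0) for g in counts)


# Hypothetical toy instance: four points on a line, split into two groups.
points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
group_of = {points[0]: "a", points[1]: "a", points[2]: "b", points[3]: "b"}
capacity = {"a": 1, "b": 1}  # assumed limit of one center per group

centers = [points[0], points[2]]
feasible = is_independent(centers, group_of, capacity)
cost = center_cost(points, centers)
```

Picking both centers from group "a" would violate the assumed capacities, which is the kind of constraint that separates matroid center from plain k-center.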

    Matroid-center clustering in sliding windows

    In this thesis, we study the Matroid Center problem, which, given a set of points from a metric space and an integer smaller than the number of points, asks for a subset of that many centers such that the maximum distance of any point from its nearest center is minimized and the centers form an independent set of a specified matroid. In particular, we consider the partition matroid, which can be used to model fairness constraints, and the more general transversal matroid. For both matroids we devise the first approximation algorithms for the sliding window streaming model, which, at any time, can efficiently compute a solution to Matroid Center for the latest window of points of the stream. The algorithms achieve a (3+ε) approximation ratio, where 3 is the best approximation attainable in sequential polynomial time and ε in (0, 1) is a user-defined accuracy parameter. The analysis of the algorithms is carried out in terms of the dimensionality of the data and shows that, for low-dimensional data, the required working memory and processing time are asymptotically and significantly smaller than the window size.

    Streaming Algorithms for Diversity Maximization with Fairness Constraints

    Diversity maximization is a fundamental problem with wide applications in data summarization, web search, and recommender systems. Given a set $X$ of $n$ elements, it asks to select a subset $S$ of $k \ll n$ elements with maximum diversity, as quantified by the dissimilarities among the elements in $S$. In this paper, we focus on the diversity maximization problem with fairness constraints in the streaming setting. Specifically, we consider the max-min diversity objective, which selects a subset $S$ that maximizes the minimum distance (dissimilarity) between any pair of distinct elements within it. Assuming that the set $X$ is partitioned into $m$ disjoint groups by some sensitive attribute, e.g., sex or race, ensuring fairness requires that the selected subset $S$ contains $k_i$ elements from each group $i \in [1,m]$. A streaming algorithm should process $X$ sequentially in one pass and return a subset with maximum diversity while guaranteeing the fairness constraint. Although diversity maximization has been extensively studied, the only known algorithms that can work with the max-min diversity objective and fairness constraints are very inefficient for data streams. Since diversity maximization is NP-hard in general, we propose two approximation algorithms for fair diversity maximization in data streams, the first of which is $\frac{1-\varepsilon}{4}$-approximate and specific for $m=2$, where $\varepsilon \in (0,1)$, and the second of which achieves a $\frac{1-\varepsilon}{3m+2}$-approximation for an arbitrary $m$. Experimental results on real-world and synthetic datasets show that both algorithms provide solutions of comparable quality to the state-of-the-art algorithms while running several orders of magnitude faster in the streaming setting. Comment: 13 pages, 11 figures; published in ICDE 202
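As an illustration of the objective described above, the following Python sketch evaluates max-min diversity and the per-group fairness constraint, then finds the best fair subset of a tiny instance by exhaustive search. It is an offline brute-force baseline, not either of the streaming algorithms from the abstract. The helper names, the toy points, and the `quota` mapping are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import dist


def min_pairwise_distance(subset):
    """Max-min diversity value: the smallest distance between two chosen elements."""
    return min(dist(p, q) for p, q in combinations(subset, 2))


def is_fair(subset, group_of, quota):
    """True if the subset holds exactly quota[g] elements from every group g."""
    counts = Counter(group_of[p] for p in subset)
    return dict(counts) == {g: k for g, k in quota.items() if k > 0}


# Hypothetical toy instance: four points on a line, two groups, one pick per group.
X = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (6.0, 0.0)]
group_of = {X[0]: "a", X[1]: "a", X[2]: "b", X[3]: "b"}
quota = {"a": 1, "b": 1}

# Exhaustive search over all subsets of the required size (feasible only for tiny inputs).
k = sum(quota.values())
fair_subsets = [s for s in combinations(X, k) if is_fair(s, group_of, quota)]
best = max(fair_subsets, key=min_pairwise_distance)
```

The search space grows combinatorially with the input size, which is why approximation algorithms are of interest for streams.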

    Fully dynamic clustering and diversity maximization in doubling metrics

    We present approximation algorithms for some variants of center-based clustering and related problems in the fully dynamic setting, where the pointset evolves through an arbitrary sequence of insertions and deletions. Specifically, we target the following problems: $k$-center (with and without outliers), matroid-center, and diversity maximization. All algorithms employ a coreset-based strategy and rely on the use of the cover tree data structure, which we crucially augment to maintain, at any time, some additional information enabling the efficient extraction of the solution for the specific problem. For all of the aforementioned problems our algorithms yield $(\alpha+\varepsilon)$-approximations, where $\alpha$ is the best known approximation attainable in polynomial time in the standard off-line setting (except for $k$-center with $z$ outliers where $\alpha = 2$ but we get a $(3+\varepsilon)$-approximation) and $\varepsilon>0$ is a user-provided accuracy parameter. The analysis of the algorithms is performed in terms of the doubling dimension of the underlying metric. Remarkably, and unlike previous works, the data structure and the running times of the insertion and deletion procedures do not depend in any way on the accuracy parameter $\varepsilon$ and, for the two $k$-center variants, on the parameter $k$. For spaces of bounded doubling dimension, the running times are dramatically smaller than those that would be required to compute solutions on the entire pointset from scratch. To the best of our knowledge, ours are the first solutions for the matroid-center and diversity maximization problems in the fully dynamic setting.

    Fair and Representative Subset Selection from Data Streams

    We study the problem of extracting a small subset of representative items from a large data stream. In many data mining and machine learning applications such as social network analysis and recommender systems, this problem can be formulated as maximizing a monotone submodular function subject to a cardinality constraint k. In this work, we consider the setting where data items in the stream belong to one of several disjoint groups and investigate the optimization problem with an additional fairness constraint that limits selection to a given number of items from each group. We then propose efficient algorithms for the fairness-aware variant of the streaming submodular maximization problem. In particular, we first give a (1/2-ε)-approximation algorithm that requires O((1/ε) log(k/ε)) passes over the stream for any constant ε>0. Moreover, we give a single-pass streaming algorithm that has the same approximation ratio of (1/2-ε) when unlimited buffer sizes and post-processing time are permitted, and discuss how to adapt it to more practical settings where the buffer sizes are bounded. Finally, we demonstrate the efficiency and effectiveness of our proposed algorithms on two real-world applications, namely maximum coverage on large graphs and personalized recommendation. Peer reviewed

    MapReduce and Streaming Algorithms for Diversity Maximization in Metric Spaces of Bounded Doubling Dimension

    Given a dataset of points in a metric space and an integer $k$, a diversity maximization problem requires determining a subset of $k$ points maximizing some diversity objective measure, e.g., the minimum or the average distance between two points in the subset. Diversity maximization is computationally hard, hence only approximate solutions can be hoped for. Although its applications are mainly in massive data analysis, most of the past research on diversity maximization focused on the sequential setting. In this work we present space and pass/round-efficient diversity maximization algorithms for the Streaming and MapReduce models and analyze their approximation guarantees for the relevant class of metric spaces of bounded doubling dimension. Like other approaches in the literature, our algorithms rely on the determination of high-quality core-sets, i.e., (much) smaller subsets of the input which contain good approximations to the optimal solution for the whole input. For a variety of diversity objective functions, our algorithms attain an $(\alpha+\epsilon)$-approximation ratio, for any constant $\epsilon>0$, where $\alpha$ is the best approximation ratio achieved by a polynomial-time, linear-space sequential algorithm for the same diversity objective. This improves substantially over the approximation ratios attainable in Streaming and MapReduce by state-of-the-art algorithms for general metric spaces. We provide extensive experimental evidence of the effectiveness of our algorithms on both real world and synthetic datasets, scaling up to over a billion points. Comment: Extended version of http://www.vldb.org/pvldb/vol10/p469-ceccarello.pdf, PVLDB Volume 10, No. 5, January 201

    Fairness in Streaming Submodular Maximization over a Matroid Constraint

    Streaming submodular maximization is a natural model for the task of selecting a representative subset from a large-scale dataset. If datapoints have sensitive attributes such as gender or race, it becomes important to enforce fairness to avoid bias and discrimination. This has spurred significant interest in developing fair machine learning algorithms. Recently, such algorithms have been developed for monotone submodular maximization under a cardinality constraint. In this paper, we study the natural generalization of this problem to a matroid constraint. We give streaming algorithms as well as impossibility results that provide trade-offs between efficiency, quality and fairness. We validate our findings empirically on a range of well-known real-world applications: exemplar-based clustering, movie recommendation, and maximum coverage in social networks. Comment: Accepted to ICML 2