54 research outputs found

    MapReduce and Streaming Algorithms for Diversity Maximization in Metric Spaces of Bounded Doubling Dimension

    Get PDF
    Given a dataset of points in a metric space and an integer kk, a diversity maximization problem requires determining a subset of kk points maximizing some diversity objective measure, e.g., the minimum or the average distance between two points in the subset. Diversity maximization is computationally hard, hence only approximate solutions can be hoped for. Although its applications are mainly in massive data analysis, most of the past research on diversity maximization focused on the sequential setting. In this work we present space and pass/round-efficient diversity maximization algorithms for the Streaming and MapReduce models and analyze their approximation guarantees for the relevant class of metric spaces of bounded doubling dimension. Like other approaches in the literature, our algorithms rely on the determination of high-quality core-sets, i.e., (much) smaller subsets of the input which contain good approximations to the optimal solution for the whole input. For a variety of diversity objective functions, our algorithms attain an (α+ϵ)(\alpha+\epsilon)-approximation ratio, for any constant ϵ>0\epsilon>0, where α\alpha is the best approximation ratio achieved by a polynomial-time, linear-space sequential algorithm for the same diversity objective. This improves substantially over the approximation ratios attainable in Streaming and MapReduce by state-of-the-art algorithms for general metric spaces. We provide extensive experimental evidence of the effectiveness of our algorithms on both real world and synthetic datasets, scaling up to over a billion points.Comment: Extended version of http://www.vldb.org/pvldb/vol10/p469-ceccarello.pdf, PVLDB Volume 10, No. 5, January 201

    Solving k-center Clustering (with Outliers) in MapReduce and Streaming, almost as Accurately as Sequentially.

    Get PDF
    Center-based clustering is a fundamental primitive for data analysis and becomes very challenging for large datasets. In this paper, we focus on the popular k-center variant which, given a set S of points from some metric space and a parameter k0, the algorithms yield solutions whose approximation ratios are a mere additive term \u3f5 away from those achievable by the best known polynomial-time sequential algorithms, a result that substantially improves upon the state of the art. Our algorithms are rather simple and adapt to the intrinsic complexity of the dataset, captured by the doubling dimension D of the metric space. Specifically, our analysis shows that the algorithms become very space-efficient for the important case of small (constant) D. These theoretical results are complemented with a set of experiments on real-world and synthetic datasets of up to over a billion points, which show that our algorithms yield better quality solutions over the state of the art while featuring excellent scalability, and that they also lend themselves to sequential implementations much faster than existing ones

    Clustering-based Algorithms for Big Data Computations

    Get PDF
    In the age of big data, the amount of information that applications need to process often exceeds the computational capabilities of single machines. To cope with this deluge of data, new computational models have been defined. The MapReduce model allows the development of distributed algorithms targeted at large clusters, where each machine can only store a small fraction of the data. In the streaming model a single processor processes on-the-fly an incoming stream of data, using only limited memory. The specific characteristics of these models combined with the necessity of processing very large datasets rule out, in many cases, the adoption of known algorithmic strategies, prompting the development of new ones. In this context, clustering, the process of grouping together elements according to some proximity measure, is a valuable tool, which allows to build succinct summaries of the input data. In this thesis we develop novel algorithms for some fundamental problems, where clustering is a key ingredient to cope with very large instances or is itself the ultimate target. First, we consider the problem of approximating the diameter of an undirected graph, a fundamental metric in graph analytics, for which the known exact algorithms are too costly to use for very large inputs. We develop a MapReduce algorithm for this problem which, for the important class of graphs of bounded doubling dimension, features a polylogarithmic approximation guarantee, uses linear memory and executes in a number of parallel rounds that can be made sublinear in the input graph's diameter. To the best of our knowledge, ours is the first parallel algorithm with these guarantees. Our algorithm leverages a novel clustering primitive to extract a concise summary of the input graph on which to compute the diameter approximation. We complement our theoretical analysis with an extensive experimental evaluation, finding that our algorithm features an approximation quality significantly better than the theoretical upper bound and high scalability. Next, we consider the problem of clustering uncertain graphs, that is, graphs where each edge has a probability of existence, specified as part of the input. These graphs, whose applications range from biology to privacy in social networks, have an exponential number of possible deterministic realizations, which impose a big-data perspective. We develop the first algorithms for clustering uncertain graphs with provable approximation guarantees which aim at maximizing the probability that nodes be connected to the centers of their assigned clusters. A preliminary suite of experiments, provides evidence that the quality of the clusterings returned by our algorithms compare very favorably with respect to previous approaches with no theoretical guarantees. Finally, we deal with the problem of diversity maximization, which is a fundamental primitive in big data analytics: given a set of points in a metric space we are asked to provide a small subset maximizing some notion of diversity. We provide efficient streaming and MapReduce algorithms with approximation guarantees that can be made arbitrarily close to the ones of the best sequential algorithms available. The algorithms crucially rely on the use of a k-center clustering primitive to extract a succinct summary of the data and their analysis is expressed in terms of the doubling dimension of the input point set. Moreover, unlike previously known algorithms, ours feature an interesting tradeoff between approximation quality and memory requirements. Our theoretical findings are supported by the first experimental analysis of diversity maximization algorithms in streaming and MapReduce, which highlights the tradeoffs of our algorithms on both real-world and synthetic datasets. Moreover, our algorithms exhibit good scalability, and a significantly better performance than the approaches proposed in previous works

    Diversity Maximization in Doubling Metrics

    Get PDF
    Diversity maximization is an important geometric optimization problem with many applications in recommender systems, machine learning or search engines among others. A typical diversification problem is as follows: Given a finite metric space (X,d) and a parameter k in N, find a subset of k elements of X that has maximum diversity. There are many functions that measure diversity. One of the most popular measures, called remote-clique, is the sum of the pairwise distances of the chosen elements. In this paper, we present novel results on three widely used diversity measures: Remote-clique, remote-star and remote-bipartition. Our main result are polynomial time approximation schemes for these three diversification problems under the assumption that the metric space is doubling. This setting has been discussed in the recent literature. The existence of such a PTAS however was left open. Our results also hold in the setting where the distances are raised to a fixed power q >= 1, giving rise to more variants of diversity functions, similar in spirit to the variations of clustering problems depending on the power applied to the pairwise distances. Finally, we provide a proof of NP-hardness for remote-clique with squared distances in doubling metric spaces

    Fully dynamic clustering and diversity maximization in doubling metrics

    Full text link
    We present approximation algorithms for some variants of center-based clustering and related problems in the fully dynamic setting, where the pointset evolves through an arbitrary sequence of insertions and deletions. Specifically, we target the following problems: kk-center (with and without outliers), matroid-center, and diversity maximization. All algorithms employ a coreset-based strategy and rely on the use of the cover tree data structure, which we crucially augment to maintain, at any time, some additional information enabling the efficient extraction of the solution for the specific problem. For all of the aforementioned problems our algorithms yield (α+ε)(\alpha+\varepsilon)-approximations, where α\alpha is the best known approximation attainable in polynomial time in the standard off-line setting (except for kk-center with zz outliers where α=2\alpha = 2 but we get a (3+ε)(3+\varepsilon)-approximation) and ε>0\varepsilon>0 is a user-provided accuracy parameter. The analysis of the algorithms is performed in terms of the doubling dimension of the underlying metric. Remarkably, and unlike previous works, the data structure and the running times of the insertion and deletion procedures do not depend in any way on the accuracy parameter ε\varepsilon and, for the two kk-center variants, on the parameter kk. For spaces of bounded doubling dimension, the running times are dramatically smaller than those that would be required to compute solutions on the entire pointset from scratch. To the best of our knowledge, ours are the first solutions for the matroid-center and diversity maximization problems in the fully dynamic setting

    Improved Approximation and Scalability for Fair Max-Min Diversification

    Get PDF
    Given an nn-point metric space (X,d)(\mathcal{X},d) where each point belongs to one of m=O(1)m=O(1) different categories or groups and a set of integers k1,…,kmk_1, \ldots, k_m, the fair Max-Min diversification problem is to select kik_i points belonging to category i∈[m]i\in [m], such that the minimum pairwise distance between selected points is maximized. The problem was introduced by Moumoulidou et al. [ICDT 2021] and is motivated by the need to down-sample large data sets in various applications so that the derived sample achieves a balance over diversity, i.e., the minimum distance between a pair of selected points, and fairness, i.e., ensuring enough points of each category are included. We prove the following results: 1. We first consider general metric spaces. We present a randomized polynomial time algorithm that returns a factor 22-approximation to the diversity but only satisfies the fairness constraints in expectation. Building upon this result, we present a 66-approximation that is guaranteed to satisfy the fairness constraints up to a factor 1−ϵ1-\epsilon for any constant ϵ\epsilon. We also present a linear time algorithm returning an m+1m+1 approximation with exact fairness. The best previous result was a 3m−13m-1 approximation. 2. We then focus on Euclidean metrics. We first show that the problem can be solved exactly in one dimension. For constant dimensions, categories and any constant ϵ>0\epsilon>0, we present a 1+ϵ1+\epsilon approximation algorithm that runs in O(nk)+2O(k)O(nk) + 2^{O(k)} time where k=k1+…+kmk=k_1+\ldots+k_m. We can improve the running time to O(nk)+poly(k)O(nk)+ poly(k) at the expense of only picking (1−ϵ)ki(1-\epsilon) k_i points from category i∈[m]i\in [m]. Finally, we present algorithms suitable to processing massive data sets including single-pass data stream algorithms and composable coresets for the distributed processing.Comment: To appear in ICDT 202
    • …
    corecore