13 research outputs found

    MapReduce and Streaming Algorithms for Diversity Maximization in Metric Spaces of Bounded Doubling Dimension

    Given a dataset of points in a metric space and an integer k, a diversity maximization problem requires determining a subset of k points maximizing some diversity objective measure, e.g., the minimum or the average distance between two points in the subset. Diversity maximization is computationally hard, hence only approximate solutions can be hoped for. Although its applications are mainly in massive data analysis, most of the past research on diversity maximization focused on the sequential setting. In this work we present space and pass/round-efficient diversity maximization algorithms for the Streaming and MapReduce models and analyze their approximation guarantees for the relevant class of metric spaces of bounded doubling dimension. Like other approaches in the literature, our algorithms rely on the determination of high-quality core-sets, i.e., (much) smaller subsets of the input which contain good approximations to the optimal solution for the whole input. For a variety of diversity objective functions, our algorithms attain an (α+ϵ)-approximation ratio, for any constant ϵ>0, where α is the best approximation ratio achieved by a polynomial-time, linear-space sequential algorithm for the same diversity objective. This improves substantially over the approximation ratios attainable in Streaming and MapReduce by state-of-the-art algorithms for general metric spaces. We provide extensive experimental evidence of the effectiveness of our algorithms on both real-world and synthetic datasets, scaling up to over a billion points. Comment: Extended version of http://www.vldb.org/pvldb/vol10/p469-ceccarello.pdf, PVLDB Volume 10, No. 5, January 201
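
To make the minimum-distance objective concrete, here is a minimal sequential sketch of farthest-first traversal, a classical greedy heuristic for picking k mutually distant points. It is an illustration only and is not the Streaming or MapReduce algorithm of the paper; the function names, the Euclidean metric, and the choice of the first point as the starting point are all assumptions made for this example.

```python
import math


def farthest_first(points, k):
    """Greedily pick k indices; each new point is the one farthest from those already chosen.

    Assumption: Euclidean distance and an arbitrary start (index 0).
    """
    chosen = [0]
    # dist[i] = distance from point i to its nearest chosen point
    dist = [math.dist(points[0], p) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=lambda i: dist[i])
        chosen.append(nxt)
        dist = [min(d, math.dist(points[nxt], p)) for d, p in zip(dist, points)]
    return chosen


def min_pairwise_distance(points, idx):
    """Diversity objective: the minimum distance between two selected points."""
    return min(
        math.dist(points[a], points[b])
        for i, a in enumerate(idx)
        for b in idx[i + 1:]
    )


points = [(0, 0), (1, 0), (0, 1), (10, 0), (0, 10)]
selected = farthest_first(points, 3)
```

On this toy input the selection keeps the origin and the two far corners, so the two near-duplicate points close to the origin are skipped.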

    The planar multiple obnoxious facilities location problem: A Voronoi based heuristic

    Consider the situation where a given number of facilities are to be located in a convex polygon with the objective of maximizing the minimum distance between facilities and a given set of communities with the important additional condition that the facilities have to be farther than a certain distance from one another. This continuous multiple obnoxious facility location problem, which has two variants, is very complex to solve using commercial nonlinear optimizers. We propose a mathematical formulation and a heuristic approach based on Voronoi diagrams and an optimally solved binary linear program. As there are no nonlinear optimization solvers that guarantee optimality, we compare our results with a popular multi-start approach using interior point, genetic algorithm (GA), and sparse non-linear optimizer (SNOPT) solvers in Matlab. These are state-of-the-art solvers for dealing with constrained nonlinear problems. Each instance is solved using 100 randomly generated starting solutions and the overall best is then selected. It was found that the proposed heuristic results are much better and were obtained in a fraction of the computer time required by the other methods. The multiple obnoxious location problem is a perfect example where all-purpose non-linear non-convex solvers perform poorly and hence the best way forward is to design and analyze heuristics that have the power and the flexibility to deal with such a high level of complexity.

    The Remote-Clique Problem Revisited

    Given a positive integer k and a complete graph with non-negative edge weights that satisfy the triangle inequality, the remote-clique problem is to find a subset of k vertices having a maximum-weight induced subgraph. A greedy algorithm for the problem has been shown to have an approximation ratio of 4, but this analysis was not shown to be tight. In this thesis, we present an algorithm called d-Greedy Augment that generalizes this greedy algorithm (they are equivalent when d = 1). We use the technique of factor-revealing linear programs to prove that d-Greedy Augment, which has a running time of O(kdn^d), achieves an approximation ratio of (2k - 2)/(k + d - 2). Thus, when d = 1, d-Greedy Augment achieves an approximation ratio of 2 and runs in time O(kn), making it the fastest known 2-approximation for the remote-clique problem. Beyond proving this worst-case result, we also examine the behavior of d-Greedy Augment in practice. First, we provide some theoretical results regarding the expected-case performance of d-Greedy Augment on random graphs, and second, we describe data from some experiments that test the performance of d-Greedy Augment and related heuristics.
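
The ratio stated in the abstract can be checked numerically, and the d = 1 case can be sketched in a few lines. The code below is an assumption-laden illustration, not the thesis's d-Greedy Augment: it assumes that the greedy step adds the vertex with the largest total weight to the vertices already selected, and that ties (including the first pick) go to the lowest index. All function names are hypothetical.

```python
from fractions import Fraction


def approx_ratio(k, d):
    """Ratio (2k - 2)/(k + d - 2) quoted in the abstract; requires k + d > 2."""
    return Fraction(2 * k - 2, k + d - 2)


def greedy_remote_clique(weights, k):
    """Assumed d = 1 greedy: repeatedly add the vertex with the largest
    total weight to the already-selected vertices.

    gain[v] is maintained incrementally, so each of the k rounds costs O(n).
    """
    n = len(weights)
    gain = [0] * n
    in_set = [False] * n
    chosen = []
    for _ in range(k):
        best = max((v for v in range(n) if not in_set[v]), key=lambda v: gain[v])
        chosen.append(best)
        in_set[best] = True
        for v in range(n):
            gain[v] += weights[best][v]
    return chosen


def clique_weight(weights, subset):
    """Total edge weight of the subgraph induced by subset."""
    return sum(
        weights[u][v] for i, u in enumerate(subset) for v in subset[i + 1:]
    )


# Toy metric: four points on a line, weights are absolute differences.
coords = [0, 1, 2, 10]
weights = [[abs(a - b) for b in coords] for a in coords]
picked = greedy_remote_clique(weights, 2)
```

With d = 1 the ratio simplifies to (2k - 2)/(k - 1) = 2 for every k ≥ 2, which matches the 2-approximation claim, and maintaining the running gains keeps the sketch within the O(kn) time quoted for that case.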

    Provable randomized rounding for minimum-similarity diversification

    When searching for information in a data collection, we are often interested not only in finding relevant items, but also in assembling a diverse set, so as to explore different concepts that are present in the data. This problem has been researched extensively. However, finding a set of items with minimal pairwise similarities can be computationally challenging, and most existing works striving for quality guarantees assume that item relatedness is measured by a distance function. Given the widespread use of similarity functions in many domains, we believe this to be an important gap in the literature. In this paper we study the problem of finding a diverse set of items, when item relatedness is measured by a similarity function. We formulate the diversification task using a flexible, broadly applicable minimization objective, consisting of the sum of pairwise similarities of the selected items and a relevance penalty term. To find good solutions we adopt a randomized rounding strategy, which is challenging to analyze because of the cardinality constraint present in our formulation. Even though this obstacle can be overcome using dependent rounding, we show that it is possible to obtain provably good solutions using an independent approach, which is faster, simpler to implement and completely parallelizable. Our analysis relies on a novel bound for the ratio of Poisson-Binomial densities, which is of independent interest and has potential implications for other combinatorial-optimization problems. We leverage this result to design an efficient randomized algorithm that provides a lower-order additive approximation guarantee. We validate our method using several benchmark datasets, and show that it consistently outperforms the greedy approaches that are commonly used in the literature. Peer reviewed

    Diverse sampling of streaming data

    Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2013. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from student-submitted PDF version of thesis. Includes bibliographical references (pages 49-51). This thesis addresses the problem of diverse sampling as a dispersion problem and proposes solutions that are optimized for large streaming data. Finding the optimal solution to the dispersion problem is NP-hard. Therefore, existing and proposed solutions are approximation algorithms. This work evaluates the performance of different algorithms in practice and compares them to the theoretical guarantees. by Aizana Turmukhametova. M. Eng

    Approximation algorithms for geometric dispersion

    The most basic form of the max-sum dispersion problem (MSD) is as follows: given n points in R^q and an integer k, select a set of k points such that the sum of the pairwise distances within the set is maximal. This is a prominent diversity problem, with wide applications in web search and information retrieval, where one needs to find a small and diverse representative subset of a large dataset. The problem has recently received a great deal of attention in the computational geometry and operations research communities; and since it is NP-hard, research has focused on efficient heuristics and approximation algorithms. Several classes of distance functions have been considered in the literature. Many of the most common distances used in applications are induced by a norm in a real vector space. The focus of this thesis is on MSD over these geometric instances. We provide for it simple and fast polynomial-time approximation schemes (PTASs), as well as improved constant-factor approximation algorithms. We pay special attention to the class of negative-type distances, a class that includes Euclidean and Manhattan distances, among many others. In order to exploit the properties of this class, we apply several techniques and results from the theory of isometric embeddings. We explore the following variations of the MSD problem: matroid and matroid-intersection constraints, knapsack constraints, and the mixed-objective problem that maximizes a combination of the sum of pairwise distances with a submodular monotone function. In addition to approximation algorithms, we present a core-set for geometric instances of low dimension, and we discuss the efficient implementation of some of our algorithms for massive datasets, using the streaming and distributed models of computation

    29th International Symposium on Algorithms and Computation: ISAAC 2018, December 16-19, 2018, Jiaoxi, Yilan, Taiwan
