
    Incidence Geometries and the Pass Complexity of Semi-Streaming Set Cover

    Set cover, over a universe of size $n$, may be modelled as a data-streaming problem, where the $m$ sets that comprise the instance are to be read one by one. A semi-streaming algorithm is allowed only $O(n\,\mathrm{poly}\{\log n, \log m\})$ space to process this stream. For each $p \ge 1$, we give a very simple deterministic algorithm that makes $p$ passes over the input stream and returns an appropriately certified $(p+1)n^{1/(p+1)}$-approximation to the optimum set cover. More importantly, we proceed to show that this approximation factor is essentially tight, by showing that a factor better than $0.99\,n^{1/(p+1)}/(p+1)^2$ is unachievable for a $p$-pass semi-streaming algorithm, even allowing randomisation. In particular, this implies that achieving a $\Theta(\log n)$-approximation requires $\Omega(\log n/\log\log n)$ passes, which is tight up to the $\log\log n$ factor. These results extend to a relaxation of the set cover problem where we are allowed to leave an $\varepsilon$ fraction of the universe uncovered: the tight bounds on the best approximation factor achievable in $p$ passes turn out to be $\Theta_p(\min\{n^{1/(p+1)}, \varepsilon^{-1/p}\})$. Our lower bounds are based on a construction of a family of high-rank incidence geometries, which may be thought of as vast generalisations of affine planes. This construction, based on algebraic techniques, appears flexible enough to find other applications and is therefore interesting in its own right.
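    A minimal sketch of how such a $p$-pass threshold greedy can be organised (an illustration only: the thresholds are inferred from the stated $(p+1)n^{1/(p+1)}$ factor, and the paper's certified algorithm is not reproduced here):

```python
def multipass_set_cover(read_stream, universe, p):
    """Sketch of a p-pass threshold greedy for streaming set cover.

    read_stream() re-yields the instance's sets (as Python sets), one
    full pass per call. Pass i keeps any set that covers at least
    n^{(p+1-i)/(p+1)} still-uncovered elements; the final pass also
    remembers one covering set per leftover element as a fallback.
    """
    n = len(universe)
    uncovered = set(universe)
    solution = set()
    fallback = {}  # element -> index of some set containing it
    for i in range(1, p + 1):
        threshold = n ** ((p + 1 - i) / (p + 1))
        for idx, s in enumerate(read_stream()):  # one pass over the stream
            gain = uncovered & s
            if len(gain) >= threshold:
                solution.add(idx)
                uncovered -= gain
            elif i == p:
                for e in gain:
                    fallback.setdefault(e, idx)
    # cover whatever survived all passes, one recorded set per element
    solution.update(fallback[e] for e in uncovered if e in fallback)
    return solution
```

    Only the chosen indices, the uncovered set, and one fallback index per element are stored, which stays within the semi-streaming budget described above.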

    Unifying Sparsest Cut, Cluster Deletion, and Modularity Clustering Objectives with Correlation Clustering

    Graph clustering, or community detection, is the task of identifying groups of closely related objects in a large network. In this paper we introduce a new community-detection framework called LambdaCC that is based on a specially weighted version of correlation clustering. A key component in our methodology is a clustering resolution parameter, $\lambda$, which implicitly controls the size and structure of clusters formed by our framework. We show that, by increasing this parameter, our objective effectively interpolates between two different strategies in graph clustering: finding a sparse cut and forming dense subgraphs. Our methodology unifies and generalizes a number of other important clustering quality functions, including modularity, sparsest cut, and cluster deletion, and places them all within the context of an optimization problem that has been well studied from the perspective of approximation algorithms. Our approach is particularly relevant in the regime of finding dense clusters, as it leads to a 2-approximation for the cluster deletion problem. We use our approach to cluster several graphs, including large collaboration networks and social networks.
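    The interpolation the abstract describes can be made concrete with the objective itself. A minimal sketch, assuming the common $\lambda$-weighted form in which cutting an edge costs $1-\lambda$ and merging a non-edge costs $\lambda$ (the paper defines the exact LambdaCC weights):

```python
import itertools

def lambda_cc_cost(nodes, edges, clusters, lam):
    """Correlation-clustering disagreement cost with resolution lam:
    a cut edge costs (1 - lam); a non-edge kept inside a cluster costs
    lam. Small lam favours large, sparse-cut-like clusters; large lam
    pushes towards dense, cluster-deletion-like clusters."""
    cluster_of = {v: c for c, group in enumerate(clusters) for v in group}
    edge_set = {frozenset(e) for e in edges}
    cost = 0.0
    for u, v in itertools.combinations(list(nodes), 2):
        same = cluster_of[u] == cluster_of[v]
        if frozenset((u, v)) in edge_set:
            cost += 0.0 if same else 1.0 - lam
        else:
            cost += lam if same else 0.0
    return cost
```

    Sweeping lam between 0 and 1 traces out the interpolation between the two clustering strategies described above.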

    Precedence-Constrained Min Sum Set Cover

    We introduce a version of the Min Sum Set Cover (MSSC) problem in which there are "AND" precedence constraints on the $m$ sets. In the Precedence-Constrained Min Sum Set Cover (PCMSSC) problem, the constraints, when interpreted as directed edges, induce an acyclic directed graph. PCMSSC models the aim of scheduling software tests to prioritize the rate of fault detection subject to dependencies between tests. Our greedy scheme for PCMSSC is similar to the approaches of Feige, Lovász, and Tetali for MSSC, and of Chekuri and Motwani for precedence-constrained scheduling to minimize weighted completion time. With a factor-4 increase in the approximation ratio, we reduce PCMSSC to the problem of finding a maximum-density precedence-closed sub-family of sets, where density is the ratio of the sub-family's union size to its cardinality. We provide a greedy factor-$\sqrt{m}$ algorithm for maximizing density; on forests of in-trees, we show this algorithm finds an optimal solution. Harnessing an alternative greedy argument of Chekuri and Kumar for Maximum Coverage with Group Budget Constraints, on forests of out-trees, we design an algorithm with approximation ratio equal to the maximum tree height. Finally, via a reduction from the Planted Dense Subgraph detection problem, we show that its conjectured hardness implies there is no polynomial-time algorithm for PCMSSC with approximation factor in $O(m^{1/12-\varepsilon})$.
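    The two notions the reduction manipulates are easy to state in code. A minimal sketch (the names and representation are ours, not the paper's):

```python
def density(subfamily, sets):
    """Density of a sub-family of set indices: union size / cardinality."""
    if not subfamily:
        return 0.0
    union = set().union(*(sets[i] for i in subfamily))
    return len(union) / len(subfamily)

def is_precedence_closed(subfamily, predecessors):
    """True iff the sub-family contains every AND-predecessor of each
    member, i.e. it is downward closed under the precedence DAG."""
    return all(p in subfamily
               for i in subfamily
               for p in predecessors.get(i, ()))
```

    Roughly, the scheme repeatedly schedules a precedence-closed sub-family of near-maximum density ahead of the remaining sets.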

    On Optimal Arrangements of Binary Sensors

    A large range of monitoring applications can benefit from binary sensor networks. Binary sensors can detect the presence or absence of a particular target in their sensing regions. They can be used to partition a monitored area and provide localization functionality. If many of these sensors are deployed to monitor an area, the area is partitioned into sub-regions: each sub-region is characterized by the set of sensors detecting targets within it. We aim to maximize the number of unique, distinguishable sub-regions, seeking an optimal placement of both omni-directional and directional static binary sensors. We compute an upper bound on the number of unique sub-regions, which grows quadratically with the number of sensors, and propose arrangements of sensors within a monitored area whose number of unique sub-regions is asymptotically equivalent to this upper bound.
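    For intuition on the quadratic growth, consider the special case of omni-directional sensors with circular sensing boundaries in general position (an illustrative arrangement bound, not the paper's exact count): the $s$-th circle crosses the previous $s-1$ circles in at most $2(s-1)$ points, so it adds at most $2(s-1)$ regions, giving

$$R(s) \le R(s-1) + 2(s-1), \quad R(1) = 2 \;\Longrightarrow\; R(s) \le s(s-1) + 2 = O(s^2).$$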

    Maximum Coverage in Sublinear Space, Faster

    Given a collection of $m$ sets from a universe $\mathcal{U}$, the Maximum Set Coverage problem consists of finding $k$ sets whose union has largest cardinality. This problem is NP-Hard, but the solution can be approximated by a polynomial-time algorithm up to a factor $1-1/e$. However, this algorithm does not scale well with the input size. In a streaming context, practical high-quality solutions are found, but with space complexity that scales linearly with respect to the size of the universe $n = |\mathcal{U}|$. However, one randomized streaming algorithm has been shown to produce a $1-1/e-\varepsilon$ approximation of the optimal solution with a space complexity that scales only poly-logarithmically with respect to $m$ and $n$. In order to achieve such a low space complexity, the authors used two techniques in their multi-pass approach:
    - $F_0$-sketching, which determines with great accuracy the number of distinct elements in a set, using less space than the set itself.
    - Subsampling, which solves the problem on a subspace of the universe, implemented using $\Theta(\varepsilon^{-2} k \log m)$-independent hash functions.
    This article focuses on the sublinear-space algorithm and highlights the time cost of these two techniques, especially subsampling. We present optimizations that significantly reduce the time complexity of the algorithm. Firstly, we give some optimizations that do not alter the space complexity, number of passes, or approximation quality of the original algorithm. In particular, we reanalyze the error bounds to show that the original independence factor of $\Theta(\varepsilon^{-2} k \log m)$ can be fine-tuned to $\Theta(k \log m)$; we also show how $F_0$-sketching can be removed. Secondly, we derive a new lower bound for the probability of producing a $1-1/e-\varepsilon$ approximation using only pairwise independence: $1 - 4/(c\,k\log m)$, compared to $1 - 2e/m^{ck/6}$ with $\Theta(k \log m)$-independence. Although the theoretical guarantees are weaker, suggesting the approximation quality could suffer on large streams, our algorithms perform well in practice. Finally, our experimental results show that even a pairwise-independent hash-function sampler produces solutions no worse than the original algorithm's, while running several orders of magnitude faster.
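    The sampler behind the pairwise-independent variant is the textbook two-parameter family. A minimal sketch (the modulus and the subsampling rule are illustrative assumptions, not the article's exact parameters):

```python
import random

def pairwise_independent_hash(p):
    """h(x) = (a*x + b) mod p over a prime field: the standard
    pairwise-independent family."""
    a = random.randrange(1, p)
    b = random.randrange(p)
    return lambda x: (a * x + b) % p

def subsample(elements, keep_bits, p=(1 << 61) - 1):
    """Keep an element iff the low keep_bits bits of its hash vanish,
    i.e. subsample the universe at rate about 2**-keep_bits."""
    h = pairwise_independent_hash(p)
    mask = (1 << keep_bits) - 1
    return [x for x in elements if h(x) & mask == 0]
```

    The experiments in the article compare exactly this kind of lightweight sampler against heavier $\Theta(k \log m)$-independent functions.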

    Maximum Coverage in Random-Arrival Streams


    Tight Data Access Bounds for Private Top-$k$ Selection

    We study the top-$k$ selection problem under the differential privacy model: $m$ items are rated according to votes of a set of clients. We consider a setting in which algorithms can retrieve data via a sequence of accesses, each either a random access or a sorted access; the goal is to minimize the total number of data accesses. Our algorithm requires only $O(\sqrt{mk})$ expected accesses: to our knowledge, this is the first sublinear data-access upper bound for this problem. Our analysis also shows that the well-known exponential mechanism requires only $O(\sqrt{m})$ expected accesses. Accompanying this, we develop the first lower bounds for the problem, in three settings: only random accesses; only sorted accesses; a sequence of accesses of either kind. We show that, to avoid $\Omega(m)$ access cost, supporting *both* kinds of access is necessary, and that in this case our algorithm's access cost is optimal.
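    For reference, the exponential mechanism the analysis revisits can be sketched as follows (the standard form with utility sensitivity 1; the paper's access-efficient implementation, not this linear scan, is what achieves the $O(\sqrt{m})$ bound):

```python
import math
import random

def exponential_mechanism(scores, epsilon, sensitivity=1.0):
    """Sample index i with probability proportional to
    exp(epsilon * scores[i] / (2 * sensitivity)); shifting by the
    maximum score keeps the exponentials numerically stable."""
    top = max(scores)
    weights = [math.exp(epsilon * (s - top) / (2 * sensitivity))
               for s in scores]
    r = random.random() * sum(weights)
    for i, w in enumerate(weights):
        r -= w
        if r <= 0:
            return i
    return len(scores) - 1
```

    A top-$k$ variant can repeat the draw with each winner removed, one standard composition; the paper's contribution is running such selection with sublinearly many sorted and random accesses rather than a full scan.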

    Result-Sensitive Binary Search with Noisy Information

    We describe new algorithms for the predecessor problem in the Noisy Comparison Model. In this problem, given a sorted list $L$ of $n$ distinct elements and a query $q$, we seek the predecessor of $q$ in $L$: the largest element less than or equal to $q$, denoted $u$. In the Noisy Comparison Model, the result of a comparison between two elements is non-deterministic; moreover, repeated comparisons of the same pair of elements might have different results: each is generated independently, and is correct with probability $p > 1/2$. Given an overall error tolerance $Q$, the cost of an algorithm is measured by the total number of noisy comparisons; these must guarantee the predecessor is returned with probability at least $1 - Q$. Feige et al. showed that predecessor queries can be answered by a modified binary search with $\Theta(\log(n/Q))$ noisy comparisons. We design result-sensitive algorithms for answering predecessor queries, whose query cost is related to the index, $k$, of the predecessor $u$ in $L$. Our first algorithm answers predecessor queries with $O(\log((\log^{*(c)} n)/Q) + \log(k/Q))$ noisy comparisons, for an arbitrarily large constant $c$; the function $\log^{*(c)} n$ iterates the iterated-logarithm function, $\log^* n$, $c$ times. Our second algorithm is a genuinely result-sensitive algorithm whose expected query cost is bounded by $O(\log(k/Q))$, and which is guaranteed to terminate after at most $O(\log((\log n)/Q))$ noisy comparisons. Our results strictly improve the state-of-the-art bounds when $k$ is in $\omega(1) \cap o(n^{\varepsilon})$, where $\varepsilon > 0$ is some constant. Moreover, we show that our result-sensitive algorithms immediately improve not only predecessor-query algorithms, but also binary-search-like algorithms for solving key applications.
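    The standard way to tame noisy comparisons is majority vote, and galloping over the list is what ties the cost to the index $k$. A minimal sketch of both ingredients (constants are illustrative, this is not the paper's algorithm, and it assumes $L[0] \le q$ so a predecessor exists):

```python
import math

def reliable_leq(noisy_cmp, q_err):
    """Wrap a noisy comparison that is independently correct with
    probability p > 1/2 so it errs with probability <= q_err, via
    majority vote over O(log(1/q_err)) trials (the constant is
    illustrative and assumes p is bounded away from 1/2)."""
    trials = 4 * math.ceil(math.log2(2.0 / q_err)) + 1
    def leq(x, y):
        return 2 * sum(noisy_cmp(x, y) for _ in range(trials)) > trials
    return leq

def galloping_predecessor(L, q, leq):
    """Doubling search followed by binary search: roughly O(log k)
    reliable comparisons when the predecessor sits at index k."""
    hi = 1
    while hi < len(L) and leq(L[hi], q):
        hi *= 2                         # gallop until we overshoot
    lo, hi = hi // 2, min(hi, len(L)) - 1
    while lo < hi:                      # binary search inside [lo, hi]
        mid = (lo + hi + 1) // 2
        lo, hi = (mid, hi) if leq(L[mid], q) else (lo, mid - 1)
    return lo
```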

    Recency Queries with Succinct Representation

    In the context of the sliding-window set membership problem, and caching policies that require knowledge of item recency, we formalize the problem of Recency on a stream. Informally, the query asks, "when was the last time I saw item $x$?" Existing structures, such as hash tables, can support a recency query by augmenting item occurrences with timestamps. To support recency queries on a window of $W$ items, this might require $\Theta(W \log W)$ bits. We propose a succinct data structure for Recency. By combining sliding-window dictionaries in a hierarchical structure, and careful design of the underlying hash tables, we achieve a data structure that returns a $(1+\varepsilon)$ approximation to the recency of every item in $O(\log(\varepsilon W))$ time, in only $(1+o(1))(1+\varepsilon)(\mathcal{B} + W\log(\varepsilon^{-1}))$ bits. Here, $\mathcal{B}$ is the information-theoretic lower bound on the number of bits for a set of size $W$, in a universe of cardinality $N$.
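    The baseline the abstract contrasts with is easy to picture. A minimal sketch of the timestamp-augmented hash table (exact answers, but $\Theta(W \log W)$ bits):

```python
class ExactRecency:
    """Timestamp-per-item baseline: O(1) updates and exact recency
    queries, at the cost of storing a full timestamp per stored item."""
    def __init__(self):
        self.now = 0
        self.last_seen = {}           # item -> time of last occurrence

    def observe(self, x):
        self.last_seen[x] = self.now
        self.now += 1

    def recency(self, x):
        """How many items ago was x last seen? (0 = the latest item.)"""
        if x not in self.last_seen:
            return None
        return self.now - 1 - self.last_seen[x]
```

    The succinct structure in the abstract instead combines hierarchical sliding-window dictionaries, trading exact timestamps for the stated $(1+\varepsilon)$ guarantee at near-optimal space.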