    Weighted Reservoir Sampling from Distributed Streams

We consider message-efficient continuous random sampling from a distributed stream, where the probability of inclusion of an item in the sample is proportional to a weight associated with the item. The unweighted version, where all weights are equal, is well studied, and admits tight upper and lower bounds on message complexity. For weighted sampling with replacement, there is a simple reduction to unweighted sampling with replacement. However, in many applications the stream has only a few heavy items which may dominate a random sample when chosen with replacement. Weighted sampling without replacement (weighted SWOR) eludes this issue, since such heavy items can be sampled at most once. In this work, we present the first message-optimal algorithm for weighted SWOR from a distributed stream. Our algorithm also has optimal space and time complexity. As an application of our algorithm for weighted SWOR, we derive the first distributed streaming algorithms for tracking heavy hitters with residual error. Here the goal is to identify stream items that contribute significantly to the residual stream, once the heaviest items are removed. Residual heavy hitters generalize the notion of ℓ1 heavy hitters and are important in streams that have a skewed distribution of weights. In addition to the upper bound, we also provide a lower bound on the message complexity that is nearly tight up to a log(1/ε) factor. Finally, we use our weighted sampling algorithm to improve the message complexity of distributed L1 tracking, also known as count tracking, which is a widely studied problem in distributed streaming. We also derive a tight message lower bound, which closes the message complexity of this fundamental problem. Comment: To appear in PODS 201
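The paper's contribution is a message-optimal distributed algorithm; as a point of reference, the classical single-site solution to weighted SWOR, due to Efraimidis and Spirakis, assigns each item of weight w the key u^(1/w) with u uniform in (0, 1) and keeps the s items with the largest keys. A minimal sketch of that baseline (not the distributed algorithm of the paper):

```python
import heapq
import random

def weighted_swor(stream, s):
    """Weighted sampling without replacement (Efraimidis-Spirakis baseline).

    stream: iterable of (item, weight) pairs with weight > 0.
    s:      sample size.
    Keeping the s largest keys u**(1/w) is equivalent to drawing items one
    by one with probability proportional to weight among those remaining,
    so no item can appear twice.
    """
    heap = []  # min-heap of (key, item); the root is the smallest kept key
    for item, w in stream:
        key = random.random() ** (1.0 / w)
        if len(heap) < s:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]

# The heavy item "a" can enter the sample at most once, even though it
# carries almost all of the weight; that is the point of SWOR.
print(weighted_swor([("a", 100.0), ("b", 1.0), ("c", 1.0), ("d", 1.0)], 2))
```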

    Variance-Optimal Offline and Streaming Stratified Random Sampling

Stratified random sampling (SRS) is a fundamental sampling technique that provides accurate estimates for aggregate queries using a small sample, and has been used widely for approximate query processing. A key question in SRS is how to partition a target sample size among the different strata. While Neyman allocation provides a solution that minimizes the variance of an estimate using this sample, it works under the assumption that each stratum is abundant, i.e., has a large number of data points to choose from. This assumption may not hold in general: one or more strata may be bounded, and may not contain a large number of data points, even though the total data size may be large. We first present VOILA, an offline method for allocating sample sizes to strata in a variance-optimal manner, even when one or more strata may be bounded. We next consider SRS on streaming data that are continuously arriving. We show a lower bound: any streaming algorithm for SRS must, in the worst case, have a variance that is a factor of Ω(r) away from optimal, where r is the number of strata. We present S-VOILA, a practical streaming algorithm for SRS that is locally variance-optimal in its allocation of sample sizes to different strata. Our results from experiments on real and synthetic data show that VOILA can have significantly (1.4 to 50.0 times) smaller variance than Neyman allocation. The streaming algorithm S-VOILA results in a variance that is typically close to that of VOILA, which was given the entire input beforehand.
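For reference, Neyman allocation gives stratum h a share of the budget proportional to N_h·σ_h, the product of its size and its internal standard deviation. A small sketch of that textbook formula (not VOILA's bounded-strata method; the numbers are made up):

```python
def neyman_allocation(n, sizes, stds):
    """Classic Neyman allocation: stratum h gets n * N_h*s_h / sum_j N_j*s_j.

    n:     total sample size.
    sizes: N_h, the number of data points in each stratum.
    stds:  sigma_h, the standard deviation within each stratum.
    Returns fractional allocations; round as needed in practice.
    """
    weights = [N * s for N, s in zip(sizes, stds)]
    total = sum(weights)
    return [n * w / total for w in weights]

# Two abundant strata and one small, high-variance stratum.
print(neyman_allocation(1000, sizes=[10000, 10000, 20], stds=[1.0, 1.0, 100.0]))
# -> about [454.5, 454.5, 90.9]
```

Note that the last stratum is allocated roughly 91 samples but holds only 20 points; this is precisely the bounded-stratum infeasibility that VOILA's variance-optimal allocation is designed to handle.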

    Enumerating Maximal Bicliques from a Large Graph using MapReduce

We consider the enumeration of maximal bipartite cliques (bicliques) from a large graph, a task central to many practical data mining problems in social network analysis and bioinformatics. We present novel parallel algorithms for the MapReduce platform, and an experimental evaluation using Hadoop MapReduce. Our algorithm is based on clustering the input graph into smaller sized subgraphs, followed by processing different subgraphs in parallel. Our algorithm uses two ideas that enable it to scale to large graphs: (1) the redundancy in work between different subgraph explorations is minimized through a careful pruning of the search space, and (2) the load on different reducers is balanced through the use of an appropriate total order among the vertices. Our evaluation shows that the algorithm scales to large graphs with millions of edges and tens of millions of maximal bicliques. To our knowledge, this is the first work on maximal biclique enumeration for graphs of this scale. Comment: A preliminary version of the paper was accepted at the Proceedings of the 3rd IEEE International Congress on Big Data 201
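To make the enumerated object concrete: a biclique is a pair (X, Y) of vertex sets with every edge between X and Y present, maximal when neither side can be extended. A brute-force sketch for the bipartite case (illustrative only; it does none of the pruning or reducer load balancing that lets the paper's algorithm scale):

```python
from itertools import chain, combinations

def maximal_bicliques(adj):
    """Brute-force maximal biclique enumeration, bipartite case.

    adj maps each left vertex to a frozenset of right-side neighbors.
    For a seed set X0 of left vertices, Y = common neighborhood of X0 and
    X = all left vertices whose neighborhoods contain Y form a maximal
    biclique (X, Y); ranging over all seeds yields every maximal biclique.
    Exponential time, so usable only on tiny graphs.
    """
    left = list(adj)
    seeds = chain.from_iterable(combinations(left, k) for k in range(1, len(left) + 1))
    found = set()
    for X0 in seeds:
        Y = frozenset.intersection(*(adj[u] for u in X0))
        if not Y:
            continue
        X = frozenset(u for u in left if Y <= adj[u])  # close on the left side
        found.add((X, Y))
    return found

adj = {"u1": frozenset("ab"), "u2": frozenset("abc"), "u3": frozenset("bc")}
for X, Y in maximal_bicliques(adj):
    print(sorted(X), sorted(Y))
```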

    Onion Curve: A Space Filling Curve with Near-Optimal Clustering

Space filling curves (SFCs) are widely used in the design of indexes for spatial and temporal data. Clustering is a key metric for an SFC that measures how well the curve preserves locality in moving from higher dimensions to a single dimension. We present the onion curve, an SFC whose clustering performance is provably close to optimal for cube and near-cube shaped query sets, irrespective of the side length of the query. We show that, in contrast, the clustering performance of the widely used Hilbert curve can be far from optimal, even for cube-shaped queries. Since the clustering performance of an SFC is critical to the efficiency of multi-dimensional indexes based on the SFC, the onion curve can deliver improved performance for data structures involving multi-dimensional data. Comment: The short version is published in ICDE 1
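The clustering number of a query region is the number of maximal runs of consecutive curve positions needed to cover it; fewer runs means fewer random seeks when the region is read from a 1-D index. A small sketch of the metric, using the Z-order (Morton) curve as a stand-in because it is easy to compute (the onion curve's construction is in the paper):

```python
def morton_index(x, y, bits=8):
    """Interleave the bits of x and y: the Z-order (Morton) curve position."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x bits go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y bits go to odd positions
    return z

def clustering_number(curve_index, query_cells):
    """Count the contiguous curve segments (clusters) covering the query."""
    positions = sorted(curve_index(x, y) for x, y in query_cells)
    return 1 + sum(1 for a, b in zip(positions, positions[1:]) if b != a + 1)

aligned = [(x, y) for x in (2, 3) for y in (2, 3)]   # a 2x2 cube query
shifted = [(x, y) for x in (1, 2) for y in (1, 2)]   # same shape, shifted by 1
print(clustering_number(morton_index, aligned))  # 1: one contiguous segment
print(clustering_number(morton_index, shifted))  # 4: four separate segments
```

The contrast between the two queries shows how the same curve can cluster a cube query well or badly depending on its position, which is the behavior the onion curve's near-optimality guarantee rules out.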

    Scalable and Dynamic Regeneration of Big Data Volumes

A core requirement of database engine testing is the ability to create synthetic versions of the customer's data warehouse at the vendor site. A rich body of work exists on synthetic database regeneration, but it suffers from critical limitations with regard to: (a) maintaining statistical fidelity to the client's query processing, and/or (b) scaling to large data volumes. In this paper, we present Hydra, a workload-dependent database regenerator that leverages a declarative approach to data regeneration to assure volumetric similarity, a crucial aspect of statistical fidelity, and materially improves on the prior art by adding scale, dynamism and functionality. First, Hydra uses an optimized linear programming (LP) formulation based on a novel region-partitioning approach. This spatial strategy drastically reduces the LP complexity, enabling it to handle query workloads on which contemporary techniques fail. Second, Hydra incorporates deterministic post-LP processing algorithms that provide high efficiency and improved accuracy. Third, Hydra introduces the concept of dynamic regeneration by constructing a minuscule database summary that can regenerate databases of arbitrary size on the fly during query execution, while obeying volumetric specifications derived from the query workload. A detailed experimental evaluation on standard OLAP benchmarks demonstrates that Hydra can efficiently and dynamically regenerate large warehouses that accurately mimic the desired statistical characteristics.
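At its core, the declarative approach reduces to a feasibility LP: partition the attribute domain into regions induced by the workload's predicates, then solve for per-region tuple counts that reproduce every observed query cardinality. A toy instance with one attribute and hypothetical numbers (Hydra's actual region partitioning and formulation are considerably more involved):

```python
from scipy.optimize import linprog

# Regions of the age domain induced by workload predicates:
#   r0 = [0, 30), r1 = [30, 60), r2 = [60, 100).
# Cardinality constraints observed from the query workload:
#   |age < 30|  = 400   ->  x0            = 400
#   |age >= 60| = 250   ->            x2  = 250
#   total rows  = 1000  ->  x0 + x1 + x2  = 1000
A_eq = [[1, 0, 0],
        [0, 0, 1],
        [1, 1, 1]]
b_eq = [400, 250, 1000]

# Any nonnegative feasible x reproduces the workload's volumetrics;
# the zero objective asks only for feasibility.
res = linprog(c=[0, 0, 0], A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 3)
print(res.x)  # [400. 350. 250.] tuples per region
```

A summary storing just these per-region counts, rather than the tuples themselves, is what makes on-the-fly regeneration of arbitrarily large databases possible.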

    Incremental Maintenance of Maximal Cliques in a Dynamic Graph

We consider the maintenance of the set of all maximal cliques in a dynamic graph that is changing through the addition or deletion of edges. We present nearly tight bounds on the magnitude of change in the set of maximal cliques, as well as the first change-sensitive algorithms for clique maintenance, whose runtime is proportional to the magnitude of the change in the set of maximal cliques. We present experimental results showing that these algorithms are efficient in practice and are faster than prior work by two to three orders of magnitude. Comment: 18 pages, 8 figures
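Change-sensitive updates rest on a simple observation: after inserting an edge (u, v), every newly created maximal clique contains both u and v, and is {u, v} together with a maximal clique of the subgraph induced on their common neighbors. A naive sketch of that observation (not the paper's algorithm, which also detects subsumed cliques and bounds total work by the size of the change):

```python
import networkx as nx

def new_maximal_cliques(G, u, v):
    """Maximal cliques created by inserting edge (u, v). Mutates G.

    Any clique containing (u, v) beyond {u, v} itself lives inside the
    common neighborhood of u and v, so we enumerate maximal cliques there
    and attach {u, v} to each.
    """
    G.add_edge(u, v)
    common = set(G[u]) & set(G[v])
    if not common:
        return [{u, v}]
    sub = G.subgraph(common)
    return [{u, v} | set(c) for c in nx.find_cliques(sub)]

G = nx.Graph([(1, 2), (2, 3), (3, 1), (3, 4), (4, 1)])
print(new_maximal_cliques(G, 2, 4))  # [{1, 2, 3, 4}]: the graph becomes K4
```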

    Concurrent counting is harder than queuing

We compare the complexities of two fundamental distributed coordination problems, distributed counting and distributed queuing, in a concurrent setting. In both distributed counting and queuing, processors in a distributed system issue operations which are organized into a total order. In counting, each participating processor receives the rank of its operation in the total order, whereas in queuing, a processor receives the identity of its predecessor in the total order. Many coordination applications can be solved using either distributed counting or queuing, and it is useful to know which of the two is the easier problem. Our results show that concurrent counting is harder than concurrent queuing on a variety of processor interconnection topologies, including high- and low-diameter graphs. For all these topologies, we show that the concurrent delay complexity of a particular solution to queuing, the arrow protocol, is asymptotically smaller than a lower bound on the complexity of any solution to counting.
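To see what a queuing operation costs, it helps to look at the arrow protocol itself: each node keeps a single pointer along a fixed spanning tree, aimed toward the current tail of the queue, and a new request chases those pointers to the tail, reversing each one it crosses. A sequential sketch of the path-reversal mechanics (the concurrent behavior, which is what the paper analyzes, is elided):

```python
class ArrowQueue:
    """Sequential sketch of the arrow protocol for distributed queuing.

    arrows maps each node to its neighbor on the tree path toward the
    initial tail; the tail's arrow points at itself. queue(v) follows
    arrows to the tail, flips them to point back toward v, makes v the
    new tail, and returns v's predecessor in the total order.
    """
    def __init__(self, arrows, tail):
        self.arrow = dict(arrows)
        self.arrow[tail] = tail

    def queue(self, v):
        path = [v]
        while self.arrow[path[-1]] != path[-1]:   # chase arrows to the tail
            path.append(self.arrow[path[-1]])
        for a, b in zip(path[1:], path):          # flip arrows toward v
            self.arrow[a] = b
        self.arrow[v] = v                         # v is the new tail
        return path[-1]                           # v's predecessor

# Path topology 1 - 2 - 3 - 4, initial tail 4.
q = ArrowQueue({1: 2, 2: 3, 3: 4}, tail=4)
print(q.queue(2))  # 4: node 2 is queued behind node 4
print(q.queue(4))  # 2: node 4 is queued behind node 2
```

Note that each operation only learns its predecessor, not its global rank; producing a rank, as counting requires, demands strictly more coordination, which is the intuition behind the separation the paper proves.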

    EvoMiner: Frequent Subtree Mining in Phylogenetic Databases

The problem of mining collections of trees to identify common patterns, called frequent subtrees (FSTs), arises often when trying to interpret the results of phylogenetic analysis. FST mining generalizes the well-known maximum agreement subtree problem. Here we present EvoMiner, a new algorithm for mining frequent subtrees in collections of phylogenetic trees. EvoMiner is an Apriori-like level-wise method, which uses a novel phylogeny-specific constant-time candidate generation scheme, an efficient fingerprinting-based technique for downward closure, and a lowest-common-ancestor-based support counting step that requires neither costly subtree operations nor database traversal. Our algorithm achieves speed-ups of 100 times or more over Phylominer, the current state-of-the-art algorithm for mining phylogenetic trees. EvoMiner can also work in depth-first enumeration mode, to use less memory at the expense of speed. We demonstrate the utility of FST mining as a way to extract meaningful phylogenetic information from collections of trees, compared to maximum agreement subtrees and majority rule trees, two commonly used approaches in phylogenetic analysis for extracting consensus information from a collection of trees over a common leaf set.
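EvoMiner's control flow is the classic Apriori loop: grow candidates level by level and prune any candidate that has an infrequent sub-pattern (downward closure). A generic skeleton of that loop (the phylogeny-specific parts, namely constant-time candidate generation, fingerprint-based closure checks, and LCA-based support counting, are replaced here by a plug-in support oracle, with leaf subsets standing in for induced subtrees):

```python
from itertools import combinations

def apriori_levelwise(items, support, minsup):
    """Generic Apriori level-wise skeleton of the kind EvoMiner instantiates.

    items:   the universe (for EvoMiner, the shared leaf set).
    support: oracle mapping a candidate frozenset to its frequency in the
             database; EvoMiner would count trees whose induced subtree on
             the leaves matches.
    Returns all candidates with support at least minsup.
    """
    frequent = [frozenset([x]) for x in items if support(frozenset([x])) >= minsup]
    level, all_frequent = frequent, list(frequent)
    while level:
        prev = set(level)
        candidates = set()
        for a, b in combinations(level, 2):
            c = a | b
            if len(c) == len(a) + 1:                 # join: differ in one element
                if all(c - {x} in prev for x in c):  # prune: downward closure
                    candidates.add(c)
        level = [c for c in candidates if support(c) >= minsup]
        all_frequent.extend(level)
    return all_frequent

# Toy demo with sets standing in for induced subtrees.
db = [frozenset("abc"), frozenset("abd"), frozenset("ab")]
freq = apriori_levelwise("abcd", lambda c: sum(c <= t for t in db), minsup=2)
print(sorted(map(sorted, freq)))  # [['a'], ['a', 'b'], ['b']]
```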