Weighted Reservoir Sampling from Distributed Streams
We consider message-efficient continuous random sampling from a distributed
stream, where the probability of inclusion of an item in the sample is
proportional to a weight associated with the item. The unweighted version,
where all weights are equal, is well studied, and admits tight upper and lower
bounds on message complexity. For weighted sampling with replacement, there is
a simple reduction to unweighted sampling with replacement. However, in many
applications the stream has only a few heavy items which may dominate a random
sample when chosen with replacement. Weighted sampling \textit{without
replacement} (weighted SWOR) eludes this issue, since such heavy items can be
sampled at most once.
In this work, we present the first message-optimal algorithm for weighted
SWOR from a distributed stream. Our algorithm also has optimal space and time
complexity. As an application of our algorithm for weighted SWOR, we derive the
first distributed streaming algorithms for tracking \textit{heavy hitters with
residual error}. Here the goal is to identify stream items that contribute
significantly to the residual stream, once the heaviest items are removed.
Residual heavy hitters generalize the notion of heavy hitters and are
important in streams that have a skewed distribution of weights. In addition to
the upper bound, we also provide a lower bound on the message complexity that
is nearly tight up to a factor. Finally, we use our weighted
sampling algorithm to improve the message complexity of distributed
tracking, also known as count tracking, which is a widely studied problem in
distributed streaming. We also derive a tight message lower bound, which closes
the message complexity of this fundamental problem.
Comment: To appear in PODS 201
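To make weighted SWOR concrete, here is a single-stream (non-distributed) baseline using the well-known Efraimidis–Spirakis key method: each item of weight w draws u uniform in (0, 1) and gets key u^(1/w), and the k items with the largest keys form the sample. This is only an illustrative sketch of the sampling primitive; it is not the message-optimal distributed algorithm of the paper, and the function name is our own.

```python
import heapq
import random

def weighted_swor(stream, k):
    """Weighted sampling without replacement (Efraimidis-Spirakis keys):
    each (item, weight) pair gets key u**(1/w), u uniform in (0, 1);
    the k items with the largest keys form the sample."""
    heap = []  # min-heap of (key, item), size at most k
    for item, weight in stream:
        key = random.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]

# One heavy item dominates the total weight, yet it can appear
# at most once in the sample -- the point of sampling *without* replacement.
stream = [("heavy", 1000.0)] + [(f"x{i}", 1.0) for i in range(100)]
sample = weighted_swor(stream, 10)
```

Note how the heavy item, which would typically occupy most slots under sampling with replacement, is limited to a single slot here.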
Variance-Optimal Offline and Streaming Stratified Random Sampling
Stratified random sampling (SRS) is a fundamental sampling technique that
provides accurate estimates for aggregate queries using a small size sample,
and has been used widely for approximate query processing. A key question in
SRS is how to partition a target sample size among different strata. While
Neyman allocation provides a solution that minimizes the variance of an
estimate using this sample, it works under the assumption that each stratum is
abundant, i.e., has a large number of data points to choose from. This
assumption may not hold in general: one or more strata may be bounded, and may
not contain a large number of data points, even though the total data size may
be large.
We first present VOILA, an offline method for allocating sample sizes to
strata in a variance-optimal manner, even for the case when one or more strata
may be bounded. We next consider SRS on streaming data that are continuously
arriving. We show a lower bound: any streaming algorithm for SRS must have
(in the worst case) a variance that is an {\Omega}(r) factor away from the
optimal, where r is the number of strata. We present S-VOILA, a practical
streaming algorithm for SRS that is locally variance-optimal in its allocation
of sample sizes to different strata. Our results from experiments on real and
synthetic data show that VOILA can have significantly (1.4 to 50.0 times)
smaller variance than Neyman allocation. The streaming algorithm S-VOILA
results in a variance that is typically close to VOILA, which was given the
entire input beforehand.
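For context, the sketch below shows textbook Neyman allocation (n_h proportional to N_h·σ_h) together with an iterative capping fix-up for bounded strata: any stratum whose allocation exceeds its population size is capped and the remainder is re-allocated among the rest. This is an illustration of the bounded-strata issue only, not the paper's VOILA algorithm, and the function name is our own.

```python
def neyman_with_caps(sizes, stds, total):
    """Neyman allocation n_h ~ N_h * sigma_h, with bounded strata capped
    at their population size and the leftover budget re-allocated.
    sizes[h] = population size N_h, stds[h] = std dev sigma_h of stratum h."""
    alloc = [0.0] * len(sizes)
    active = set(range(len(sizes)))
    remaining = float(total)
    while active and remaining > 0:
        denom = sum(sizes[h] * stds[h] for h in active)
        if denom == 0:
            break
        tentative = {h: remaining * sizes[h] * stds[h] / denom for h in active}
        over = {h for h in active if tentative[h] >= sizes[h]}
        if not over:  # no stratum exceeds its size: accept the allocation
            for h in active:
                alloc[h] = tentative[h]
            break
        for h in over:  # cap bounded strata, free their budget
            alloc[h] = float(sizes[h])
            remaining -= sizes[h]
        active -= over
    return alloc
```

With a small, high-variance stratum, plain Neyman allocation would assign it more samples than it contains; the capping loop is one simple (not variance-optimal) way to keep the allocation feasible.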
Enumerating Maximal Bicliques from a Large Graph using MapReduce
We consider the enumeration of maximal bipartite cliques (bicliques) from a
large graph, a task central to many practical data mining problems in social
network analysis and bioinformatics. We present novel parallel algorithms for
the MapReduce platform, and an experimental evaluation using Hadoop MapReduce.
Our algorithm is based on clustering the input graph into smaller sized
subgraphs, followed by processing different subgraphs in parallel. Our
algorithm uses two ideas that enable it to scale to large graphs: (1) the
redundancy in work between different subgraph explorations is minimized through
a careful pruning of the search space, and (2) the load on different reducers
is balanced through the use of an appropriate total order among the vertices.
Our evaluation shows that the algorithm scales to large graphs with millions of
edges and tens of millions of maximal bicliques. To our knowledge, this is
the first work on maximal biclique enumeration for graphs of this scale.
Comment: A preliminary version of the paper was accepted at the Proceedings of
the 3rd IEEE International Congress on Big Data 201
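To illustrate what is being enumerated, the following brute-force sketch finds all maximal bicliques of a tiny bipartite graph: for every non-empty subset R of the right side, take L' = the common neighbors of R, then close R to R' = the common neighbors of L'; each closed pair (L', R') is a maximal biclique. This is exponential in the size of the right side and serves only as a baseline definition; it is in no way the scalable MapReduce algorithm of the paper.

```python
from itertools import chain, combinations

def maximal_bicliques(left, right, edges):
    """Brute-force maximal biclique enumeration for tiny bipartite graphs
    via neighborhood closure. Returns a set of (L', R') frozenset pairs."""
    nbr = {v: set() for v in left | right}
    for u, v in edges:
        nbr[u].add(v)
        nbr[v].add(u)
    found = set()
    subsets = chain.from_iterable(combinations(sorted(right), r)
                                  for r in range(1, len(right) + 1))
    for R in subsets:
        L2 = set.intersection(*(nbr[v] for v in R))  # common neighbors of R
        if not L2:
            continue
        R2 = set.intersection(*(nbr[u] for u in L2))  # closure of R
        found.add((frozenset(L2), frozenset(R2)))
    return found
```

Distinct subsets can close to the same biclique, which is why results are deduplicated in a set; avoiding that redundant work across subgraphs is exactly the kind of pruning the parallel algorithm must do carefully.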
Onion Curve: A Space Filling Curve with Near-Optimal Clustering
Space filling curves (SFCs) are widely used in the design of indexes for
spatial and temporal data. Clustering is a key metric for an SFC that measures
how well the curve preserves locality in moving from higher dimensions to a
single dimension. We present the {\em onion curve}, an SFC whose clustering
performance is provably close to optimal for cube and near-cube shaped
query sets, irrespective of the side length of the query. We show that in
contrast, the clustering performance of the widely used Hilbert curve can be
far from optimal, even for cube-shaped queries. Since the clustering
performance of an SFC is critical to the efficiency of multi-dimensional
indexes based on the SFC, the onion curve can deliver improved performance for
data structures involving multi-dimensional data.
Comment: The short version is published in ICDE 1
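The clustering metric can be made concrete with the standard Z-order (Morton) curve, which maps a 2-D point to a 1-D index by bit interleaving; the clustering of a query region is then the number of contiguous 1-D runs it maps to (fewer runs means better locality). This is a baseline SFC for illustration only; the onion curve itself uses a different, onion-peeling construction, and these function names are our own.

```python
def morton_index(x, y, bits=16):
    """Z-order curve: interleave the bits of (x, y) into one index."""
    idx = 0
    for i in range(bits):
        idx |= ((x >> i) & 1) << (2 * i)      # x bits at even positions
        idx |= ((y >> i) & 1) << (2 * i + 1)  # y bits at odd positions
    return idx

def cluster_count(cells):
    """Clustering metric: number of contiguous runs of 1-D indices that
    a set of query cells maps to under the Z-order curve."""
    idxs = sorted(morton_index(x, y) for x, y in cells)
    return 1 + sum(1 for a, b in zip(idxs, idxs[1:]) if b != a + 1)
```

For example, a 2x2 query at the origin maps to one contiguous run (indices 0..3), while the same 2x2 query shifted to (1, 1) shatters into four runs: the clustering of a fixed-shape query depends heavily on its position, which is what the onion curve's near-optimal guarantee controls.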
Incremental Maintenance of Maximal Cliques in a Dynamic Graph
We consider the maintenance of the set of all maximal cliques in a dynamic
graph that is changing through the addition or deletion of edges. We present
nearly tight bounds on the magnitude of change in the set of maximal cliques,
as well as the first change-sensitive algorithms for clique maintenance, whose
runtime is proportional to the magnitude of the change in the set of maximal
cliques. We present experimental results showing these algorithms are efficient
in practice and are faster than prior work by two to three orders of magnitude.
Comment: 18 pages, 8 figures
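As a point of reference for what is being maintained, here is the classic Bron–Kerbosch algorithm (with pivoting) that enumerates all maximal cliques of a static graph from scratch. It is the natural recompute-everything baseline; the change-sensitive algorithms of the paper instead update the clique set in time proportional to how much it changes when an edge is added or deleted.

```python
def bron_kerbosch(adj):
    """Enumerate all maximal cliques of a static undirected graph.
    adj maps each vertex to the set of its neighbors."""
    cliques = []

    def expand(r, p, x):
        # r: current clique; p: candidates; x: already-processed vertices
        if not p and not x:
            cliques.append(frozenset(r))
            return
        # Pivot on the vertex covering the most candidates to prune branches.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p.remove(v)
            x.add(v)

    expand(set(), set(adj), set())
    return cliques
```

On a triangle {1, 2, 3} with a pendant edge (3, 4), this reports exactly two maximal cliques, {1, 2, 3} and {3, 4}; a change-sensitive algorithm would touch only the cliques affected by an edge update rather than re-running this recursion on the whole graph.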