Weighted Reservoir Sampling from Distributed Streams
We consider message-efficient continuous random sampling from a distributed
stream, where the probability of inclusion of an item in the sample is
proportional to a weight associated with the item. The unweighted version,
where all weights are equal, is well studied, and admits tight upper and lower
bounds on message complexity. For weighted sampling with replacement, there is
a simple reduction to unweighted sampling with replacement. However, in many
applications the stream has only a few heavy items which may dominate a random
sample when chosen with replacement. Weighted sampling \textit{without
replacement} (weighted SWOR) eludes this issue, since such heavy items can be
sampled at most once.
In this work, we present the first message-optimal algorithm for weighted
SWOR from a distributed stream. Our algorithm also has optimal space and time
complexity. As an application of our algorithm for weighted SWOR, we derive the
first distributed streaming algorithms for tracking \textit{heavy hitters with
residual error}. Here the goal is to identify stream items that contribute
significantly to the residual stream, once the heaviest items are removed.
Residual heavy hitters generalize the notion of heavy hitters and are
important in streams that have a skewed distribution of weights. In addition to
the upper bound, we also provide a lower bound on the message complexity that
is nearly tight up to a factor. Finally, we use our weighted
sampling algorithm to improve the message complexity of distributed
tracking, also known as count tracking, which is a widely studied problem in
distributed streaming. We also derive a tight message lower bound, which closes
the message complexity of this fundamental problem.
Comment: To appear in PODS 201
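To make weighted SWOR concrete, here is a single-stream (non-distributed) baseline using the well-known Efraimidis–Spirakis key method: each item of weight w draws u uniform in (0, 1) and gets key u^(1/w), and the k items with the largest keys form the sample. This is only an illustrative sketch of the sampling primitive; it is not the message-optimal distributed algorithm of the paper, and the function name is our own.

```python
import heapq
import random

def weighted_swor(stream, k):
    """Weighted sampling without replacement (Efraimidis-Spirakis keys):
    each (item, weight) pair gets key u**(1/w), u uniform in (0, 1);
    the k items with the largest keys form the sample."""
    heap = []  # min-heap of (key, item), size at most k
    for item, weight in stream:
        key = random.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]

# One heavy item dominates the total weight, yet it can appear
# at most once in the sample -- the point of sampling *without* replacement.
stream = [("heavy", 1000.0)] + [(f"x{i}", 1.0) for i in range(100)]
sample = weighted_swor(stream, 10)
```

Note how the heavy item, which would typically occupy most slots under sampling with replacement, is limited to a single slot here.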
Variance-Optimal Offline and Streaming Stratified Random Sampling
Stratified random sampling (SRS) is a fundamental sampling technique that
provides accurate estimates for aggregate queries using a small size sample,
and has been used widely for approximate query processing. A key question in
SRS is how to partition a target sample size among different strata. While
Neyman allocation provides a solution that minimizes the variance of an
estimate using this sample, it works under the assumption that each stratum is
abundant, i.e., has a large number of data points to choose from. This
assumption may not hold in general: one or more strata may be bounded, and may
not contain a large number of data points, even though the total data size may
be large.
We first present VOILA, an offline method for allocating sample sizes to
strata in a variance-optimal manner, even for the case when one or more strata
may be bounded. We next consider SRS on streaming data that are continuously
arriving. We show a lower bound: any streaming algorithm for SRS must have
(in the worst case) a variance that is an {\Omega}(r) factor away from the
optimal, where r is the number of strata. We present S-VOILA, a practical
streaming algorithm for SRS that is locally variance-optimal in its allocation
of sample sizes to different strata. Our results from experiments on real and
synthetic data show that VOILA can have significantly (1.4 to 50.0 times)
smaller variance than Neyman allocation. The streaming algorithm S-VOILA
results in a variance that is typically close to VOILA, which was given the
entire input beforehand.
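For context, the sketch below shows textbook Neyman allocation (n_h proportional to N_h·σ_h) together with an iterative capping fix-up for bounded strata: any stratum whose allocation exceeds its population size is capped and the remainder is re-allocated among the rest. This is an illustration of the bounded-strata issue only, not the paper's VOILA algorithm, and the function name is our own.

```python
def neyman_with_caps(sizes, stds, total):
    """Neyman allocation n_h ~ N_h * sigma_h, with bounded strata capped
    at their population size and the leftover budget re-allocated.
    sizes[h] = population size N_h, stds[h] = std dev sigma_h of stratum h."""
    alloc = [0.0] * len(sizes)
    active = set(range(len(sizes)))
    remaining = float(total)
    while active and remaining > 0:
        denom = sum(sizes[h] * stds[h] for h in active)
        if denom == 0:
            break
        tentative = {h: remaining * sizes[h] * stds[h] / denom for h in active}
        over = {h for h in active if tentative[h] >= sizes[h]}
        if not over:  # no stratum exceeds its size: accept the allocation
            for h in active:
                alloc[h] = tentative[h]
            break
        for h in over:  # cap bounded strata, free their budget
            alloc[h] = float(sizes[h])
            remaining -= sizes[h]
        active -= over
    return alloc
```

With a small, high-variance stratum, plain Neyman allocation would assign it more samples than it contains; the capping loop is one simple (not variance-optimal) way to keep the allocation feasible.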
Enumerating Maximal Bicliques from a Large Graph using MapReduce
We consider the enumeration of maximal bipartite cliques (bicliques) from a
large graph, a task central to many practical data mining problems in social
network analysis and bioinformatics. We present novel parallel algorithms for
the MapReduce platform, and an experimental evaluation using Hadoop MapReduce.
Our algorithm is based on clustering the input graph into smaller sized
subgraphs, followed by processing different subgraphs in parallel. Our
algorithm uses two ideas that enable it to scale to large graphs: (1) the
redundancy in work between different subgraph explorations is minimized through
a careful pruning of the search space, and (2) the load on different reducers
is balanced through the use of an appropriate total order among the vertices.
Our evaluation shows that the algorithm scales to large graphs with millions of
edges and tens of millions of maximal bicliques. To our knowledge, this is
the first work on maximal biclique enumeration for graphs of this scale.
Comment: A preliminary version of the paper was accepted at the Proceedings of
the 3rd IEEE International Congress on Big Data 201
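To illustrate what is being enumerated, the following brute-force sketch finds all maximal bicliques of a tiny bipartite graph: for every non-empty subset R of the right side, take L' = the common neighbors of R, then close R to R' = the common neighbors of L'; each closed pair (L', R') is a maximal biclique. This is exponential in the size of the right side and serves only as a baseline definition; it is in no way the scalable MapReduce algorithm of the paper.

```python
from itertools import chain, combinations

def maximal_bicliques(left, right, edges):
    """Brute-force maximal biclique enumeration for tiny bipartite graphs
    via neighborhood closure. Returns a set of (L', R') frozenset pairs."""
    nbr = {v: set() for v in left | right}
    for u, v in edges:
        nbr[u].add(v)
        nbr[v].add(u)
    found = set()
    subsets = chain.from_iterable(combinations(sorted(right), r)
                                  for r in range(1, len(right) + 1))
    for R in subsets:
        L2 = set.intersection(*(nbr[v] for v in R))  # common neighbors of R
        if not L2:
            continue
        R2 = set.intersection(*(nbr[u] for u in L2))  # closure of R
        found.add((frozenset(L2), frozenset(R2)))
    return found
```

Distinct subsets can close to the same biclique, which is why results are deduplicated in a set; avoiding that redundant work across subgraphs is exactly the kind of pruning the parallel algorithm must do carefully.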
Onion Curve: A Space Filling Curve with Near-Optimal Clustering
Space filling curves (SFCs) are widely used in the design of indexes for
spatial and temporal data. Clustering is a key metric for an SFC that measures
how well the curve preserves locality in moving from higher dimensions to a
single dimension. We present the {\em onion curve}, an SFC whose clustering
performance is provably close to optimal for cube and near-cube shaped
query sets, irrespective of the side length of the query. We show that in
contrast, the clustering performance of the widely used Hilbert curve can be
far from optimal, even for cube-shaped queries. Since the clustering
performance of an SFC is critical to the efficiency of multi-dimensional
indexes based on the SFC, the onion curve can deliver improved performance for
data structures involving multi-dimensional data.
Comment: The short version is published in ICDE 1
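The clustering metric can be made concrete with the standard Z-order (Morton) curve, which maps a 2-D point to a 1-D index by bit interleaving; the clustering of a query region is then the number of contiguous 1-D runs it maps to (fewer runs means better locality). This is a baseline SFC for illustration only; the onion curve itself uses a different, onion-peeling construction, and these function names are our own.

```python
def morton_index(x, y, bits=16):
    """Z-order curve: interleave the bits of (x, y) into one index."""
    idx = 0
    for i in range(bits):
        idx |= ((x >> i) & 1) << (2 * i)      # x bits at even positions
        idx |= ((y >> i) & 1) << (2 * i + 1)  # y bits at odd positions
    return idx

def cluster_count(cells):
    """Clustering metric: number of contiguous runs of 1-D indices that
    a set of query cells maps to under the Z-order curve."""
    idxs = sorted(morton_index(x, y) for x, y in cells)
    return 1 + sum(1 for a, b in zip(idxs, idxs[1:]) if b != a + 1)
```

For example, a 2x2 query at the origin maps to one contiguous run (indices 0..3), while the same 2x2 query shifted to (1, 1) shatters into four runs: the clustering of a fixed-shape query depends heavily on its position, which is what the onion curve's near-optimal guarantee controls.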
Incremental Maintenance of Maximal Cliques in a Dynamic Graph
We consider the maintenance of the set of all maximal cliques in a dynamic
graph that is changing through the addition or deletion of edges. We present
nearly tight bounds on the magnitude of change in the set of maximal cliques,
as well as the first change-sensitive algorithms for clique maintenance, whose
runtime is proportional to the magnitude of the change in the set of maximal
cliques. We present experimental results showing these algorithms are efficient
in practice and are faster than prior work by two to three orders of magnitude.
Comment: 18 pages, 8 figures
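As a point of reference for what is being maintained, here is the classic Bron–Kerbosch algorithm (with pivoting) that enumerates all maximal cliques of a static graph from scratch. It is the natural recompute-everything baseline; the change-sensitive algorithms of the paper instead update the clique set in time proportional to how much it changes when an edge is added or deleted.

```python
def bron_kerbosch(adj):
    """Enumerate all maximal cliques of a static undirected graph.
    adj maps each vertex to the set of its neighbors."""
    cliques = []

    def expand(r, p, x):
        # r: current clique; p: candidates; x: already-processed vertices
        if not p and not x:
            cliques.append(frozenset(r))
            return
        # Pivot on the vertex covering the most candidates to prune branches.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p.remove(v)
            x.add(v)

    expand(set(), set(adj), set())
    return cliques
```

On a triangle {1, 2, 3} with a pendant edge (3, 4), this reports exactly two maximal cliques, {1, 2, 3} and {3, 4}; a change-sensitive algorithm would touch only the cliques affected by an edge update rather than re-running this recursion on the whole graph.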