Weighted Reservoir Sampling from Distributed Streams
We consider message-efficient continuous random sampling from a distributed
stream, where the probability of inclusion of an item in the sample is
proportional to a weight associated with the item. The unweighted version,
where all weights are equal, is well studied, and admits tight upper and lower
bounds on message complexity. For weighted sampling with replacement, there is
a simple reduction to unweighted sampling with replacement. However, in many
applications the stream has only a few heavy items which may dominate a random
sample when chosen with replacement. Weighted sampling \textit{without
replacement} (weighted SWOR) avoids this issue, since such heavy items can be
sampled at most once.
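To make the with/without-replacement distinction concrete, here is a minimal single-machine sketch of weighted SWOR using the classical exponential-key (Efraimidis-Spirakis) scheme. This is standard background, not the message-optimal distributed algorithm of the paper, and the function names are illustrative.

```python
import heapq
import random

def weighted_swor(stream, k):
    """Weighted sampling without replacement (single-machine sketch).

    Each (item, weight) pair gets key u**(1/weight) with u ~ Uniform(0,1);
    the k items with the largest keys form a weighted SWOR sample
    (Efraimidis-Spirakis scheme). A heavy item can appear at most once.
    """
    heap = []  # min-heap of (key, item), size <= k
    for item, weight in stream:
        key = random.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]

# Example: a stream with one dominant weight still samples that item at most once.
stream = [("heavy", 1000.0)] + [(f"x{i}", 1.0) for i in range(100)]
print(weighted_swor(stream, 10))
```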
In this work, we present the first message-optimal algorithm for weighted
SWOR from a distributed stream. Our algorithm also has optimal space and time
complexity. As an application of our algorithm for weighted SWOR, we derive the
first distributed streaming algorithms for tracking \textit{heavy hitters with
residual error}. Here the goal is to identify stream items that contribute
significantly to the residual stream, once the heaviest items are removed.
Residual heavy hitters generalize the notion of heavy hitters and are
important in streams that have a skewed distribution of weights. In addition to
the upper bound, we also provide a lower bound on the message complexity that
is nearly tight up to a factor. Finally, we use our weighted
sampling algorithm to improve the message complexity of distributed
tracking, also known as count tracking, which is a widely studied problem in
distributed streaming. We also derive a tight message lower bound, which closes
the message complexity of this fundamental problem.
Comment: To appear in PODS 2019
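The abstract leaves the residual notion informal; a common offline formulation, given here only for reference (the exact normalization, and the fact that the heaviest items themselves may also be reported, are assumptions), looks like this:

```python
def residual_heavy_hitters(weights, k, eps):
    """Offline reference check for residual heavy hitters.

    An item is reported if its weight is at least eps times the residual
    weight, i.e. the total weight remaining after the k heaviest items are
    removed. (This normalization is an assumed, common formulation.)
    """
    ordered = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    residual = sum(w for _, w in ordered[k:])
    return [item for item, w in weights.items() if w >= eps * residual]

counts = {"a": 900, "b": 850, "c": 40, "d": 5, "e": 1}
# residual after removing the 2 heaviest = 40 + 5 + 1 = 46
print(residual_heavy_hitters(counts, k=2, eps=0.5))  # -> ['a', 'b', 'c']
```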
Communication-Efficient (Weighted) Reservoir Sampling from Fully Distributed Data Streams
We consider communication-efficient weighted and unweighted (uniform) random
sampling from distributed data streams presented as a sequence of mini-batches
of items. This is a natural model for distributed streaming computation, and
our goal is to showcase its usefulness. We present and analyze fully
distributed, communication-efficient algorithms for both versions of the
problem. An experimental evaluation of weighted reservoir sampling on up to 256
nodes (5120 processors) shows good speedups, while theoretical analysis
promises further scaling to much larger machines.
Comment: A previous version of this paper was titled "Communication-Efficient (Weighted) Reservoir Sampling"
Communication-Efficient Weighted Reservoir Sampling from Fully Distributed Data Streams
We consider weighted random sampling from distributed data streams presented as a sequence of mini-batches of items. This is a natural model for distributed streaming computation, and our goal is to showcase its usefulness. We present and analyze a fully distributed, communication-efficient algorithm for weighted reservoir sampling in this model. An experimental evaluation on up to 256 nodes (5120 processors) shows good speedups, while theoretical analysis promises further scaling to much larger machines.
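As a rough illustration of the mini-batch model only (not the algorithm analyzed in these papers), each node can score its own mini-batch with the same exponential keys as in the sketch above and ship just O(k) candidates per round to a coordinator; all names below are illustrative.

```python
import heapq
import random

def local_candidates(minibatch, k):
    """Run on each node: score the local mini-batch, keep at most k candidates."""
    scored = [(random.random() ** (1.0 / w), item) for item, w in minibatch]
    return heapq.nlargest(k, scored)

def merge(global_sample, candidates, k):
    """Run on the coordinator: fold one node's candidates into the global top-k."""
    return heapq.nlargest(k, global_sample + candidates)

# One round: three nodes each process a mini-batch, then send O(k) candidates.
k = 5
nodes = [[(f"n{j}i{i}", random.uniform(1, 10)) for i in range(100)] for j in range(3)]
sample = []
for batch in nodes:
    sample = merge(sample, local_candidates(batch, k), k)
print([item for _, item in sample])
```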
Stream Aggregation Through Order Sampling
This paper introduces a new single-pass reservoir weighted-sampling stream
aggregation algorithm, Priority-Based Aggregation (PBA). While order sampling
is a powerful and efficient method for weighted sampling from a stream of
uniquely keyed items, there is no current algorithm that realizes the benefits
of order sampling in the context of stream aggregation over non-unique keys. A
naive approach that order samples regardless of key and then aggregates the
results is hopelessly inefficient. In contrast, our proposed algorithm uses a single
persistent random variable across the lifetime of each key in the cache, and
maintains unbiased estimates of the key aggregates that can be queried at any
point in the stream. The basic approach can be supplemented with a Sample and
Hold pre-sampling stage with a sampling rate adaptation controlled by PBA. This
approach represents a considerable reduction in computational complexity
compared with the state of the art in adapting Sample and Hold to operate with
a fixed cache size. Concerning statistical properties, we prove that PBA
provides unbiased estimates of the true aggregates. We analyze the
computational complexity of PBA and its variants, and provide a detailed
evaluation of its accuracy on synthetic and trace data. Weighted relative error
is reduced by 40% to 65% at sampling rates of 5% to 17%, relative to Adaptive
Sample and Hold; there is also substantial improvement for rank queries.
Comment: 10 pages
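For background on the order-sampling primitive the abstract builds on, here is a sketch of classical priority sampling over uniquely keyed weighted items. PBA itself (persistent per-key randomness and aggregation over non-unique keys) is more involved, so treat this only as the baseline idea it extends.

```python
import heapq
import random

def priority_sample(stream, k):
    """Classical priority sampling over uniquely keyed weighted items.

    Each item gets priority w / u with u ~ Uniform(0, 1]; the k highest
    priorities are kept, and the (k+1)-th highest becomes the threshold
    tau. Estimating each kept item as max(w, tau) yields unbiased
    subset-sum estimates. (This is the baseline primitive, not PBA.)
    """
    heap = []  # min-heap of (priority, key, weight), capped at k + 1 entries
    for key, w in stream:
        u = 1.0 - random.random()          # u in (0, 1], avoids division by zero
        heapq.heappush(heap, (w / u, key, w))
        if len(heap) > k + 1:
            heapq.heappop(heap)
    tau = heap[0][0] if len(heap) > k else 0.0
    return {key: max(w, tau) for _, key, w in heapq.nlargest(k, heap)}

stream = [(f"flow{i}", random.expovariate(1.0)) for i in range(10000)]
estimates = priority_sample(stream, 100)
print(sum(estimates.values()))  # unbiased estimate of the total stream weight
```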
Adaptive Threshold Sampling and Estimation
Sampling is a fundamental problem in both computer science and statistics. A
number of issues arise when designing a method based on sampling. These include
statistical considerations such as constructing a good sampling design and
ensuring there are good, tractable estimators for the quantities of interest as
well as computational considerations such as designing fast algorithms for
streaming data and ensuring the sample fits within memory constraints.
Unfortunately, existing sampling methods are only able to address all of these
issues in limited scenarios.
We develop a framework that can be used to address these issues in a broad
range of scenarios. In particular, it addresses the problem of drawing and
using samples under some memory budget constraint. This problem can be
challenging since the memory budget forces samples to be drawn
non-independently and consequently, makes computation of resulting estimators
difficult.
At the core of the framework is the notion of a data adaptive thresholding
scheme where the threshold effectively allows one to treat the non-independent
sample as if it were drawn independently. We provide sufficient conditions for
a thresholding scheme to allow this and provide ways to build and compose such
schemes.
Furthermore, we provide fast algorithms to efficiently sample under these
thresholding schemes.
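As a rough sketch of the thresholding mechanic described above (not the paper's framework; the rule for raising the threshold and the Horvitz-Thompson correction used here are assumptions), one might write:

```python
import random

def adaptive_threshold_sample(stream, budget):
    """Sketch of threshold sampling under a fixed memory budget.

    An item with weight w and persistent uniform u is held while
    u < min(1, w / tau); whenever the buffer outgrows the budget, tau is
    raised and the buffer re-pruned under the same u's. Surviving items
    are estimated with the Horvitz-Thompson correction w / min(1, w / tau).
    Whether a given rule for raising tau keeps these estimates unbiased is
    exactly what the paper's sufficient conditions characterize.
    """
    tau = 1e-9                      # start by admitting (almost) everything
    buf = {}                        # key -> (weight, u)
    for key, w in stream:
        u = random.random()
        if u < min(1.0, w / tau):
            buf[key] = (w, u)
        while len(buf) > budget:    # raise the threshold until the sample fits
            tau *= 2.0
            buf = {k: wu for k, wu in buf.items()
                   if wu[1] < min(1.0, wu[0] / tau)}
    return {k: w / min(1.0, w / tau) for k, (w, _u) in buf.items()}

stream = [(f"item{i}", random.paretovariate(1.5)) for i in range(5000)]
est = adaptive_threshold_sample(stream, budget=200)
print(len(est), sum(est.values()))  # at most 200 items; estimates the total weight
```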