Weighted Reservoir Sampling from Distributed Streams
We consider message-efficient continuous random sampling from a distributed
stream, where the probability of inclusion of an item in the sample is
proportional to a weight associated with the item. The unweighted version,
where all weights are equal, is well studied, and admits tight upper and lower
bounds on message complexity. For weighted sampling with replacement, there is
a simple reduction to unweighted sampling with replacement. However, in many
applications the stream has only a few heavy items which may dominate a random
sample when chosen with replacement. Weighted sampling \textit{without
replacement} (weighted SWOR) avoids this issue, since such heavy items can be
sampled at most once.
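To make the with/without-replacement distinction concrete, here is a minimal single-machine sketch of weighted SWOR using the classical exponential-key (Efraimidis-Spirakis) scheme. This is standard background, not the message-optimal distributed algorithm of the paper, and the function names are illustrative.

```python
import heapq
import random

def weighted_swor(stream, k):
    """Weighted sampling without replacement (single-machine sketch).

    Each (item, weight) pair gets key u**(1/weight) with u ~ Uniform(0,1);
    the k items with the largest keys form a weighted SWOR sample
    (Efraimidis-Spirakis scheme). A heavy item can appear at most once.
    """
    heap = []  # min-heap of (key, item), size <= k
    for item, weight in stream:
        key = random.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]

# Example: a stream with one dominant weight still samples that item at most once.
stream = [("heavy", 1000.0)] + [(f"x{i}", 1.0) for i in range(100)]
print(weighted_swor(stream, 10))
```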
In this work, we present the first message-optimal algorithm for weighted
SWOR from a distributed stream. Our algorithm also has optimal space and time
complexity. As an application of our algorithm for weighted SWOR, we derive the
first distributed streaming algorithms for tracking \textit{heavy hitters with
residual error}. Here the goal is to identify stream items that contribute
significantly to the residual stream, once the heaviest items are removed.
Residual heavy hitters generalize the notion of heavy hitters and are
important in streams that have a skewed distribution of weights. In addition to
the upper bound, we also provide a lower bound on the message complexity that
is nearly tight up to a factor. Finally, we use our weighted
sampling algorithm to improve the message complexity of distributed
tracking, also known as count tracking, which is a widely studied problem in
distributed streaming. We also derive a tight message lower bound, which closes
the message complexity of this fundamental problem.
Comment: To appear in PODS 2019
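The abstract leaves the residual notion informal; a common offline formulation, given here only for reference (the exact normalization, and the fact that the heaviest items themselves may also be reported, are assumptions), looks like this:

```python
def residual_heavy_hitters(weights, k, eps):
    """Offline reference check for residual heavy hitters.

    An item is reported if its weight is at least eps times the residual
    weight, i.e. the total weight remaining after the k heaviest items are
    removed. (This normalization is an assumed, common formulation.)
    """
    ordered = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    residual = sum(w for _, w in ordered[k:])
    return [item for item, w in weights.items() if w >= eps * residual]

counts = {"a": 900, "b": 850, "c": 40, "d": 5, "e": 1}
# residual after removing the 2 heaviest = 40 + 5 + 1 = 46
print(residual_heavy_hitters(counts, k=2, eps=0.5))  # -> ['a', 'b', 'c']
```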
Communication-Efficient (Weighted) Reservoir Sampling from Fully Distributed Data Streams
We consider communication-efficient weighted and unweighted (uniform) random
sampling from distributed data streams presented as a sequence of mini-batches
of items. This is a natural model for distributed streaming computation, and
our goal is to showcase its usefulness. We present and analyze fully
distributed, communication-efficient algorithms for both versions of the
problem. An experimental evaluation of weighted reservoir sampling on up to 256
nodes (5120 processors) shows good speedups, while theoretical analysis
promises further scaling to much larger machines.
Comment: A previous version of this paper was titled "Communication-Efficient (Weighted) Reservoir Sampling"
Communication-Efficient Weighted Reservoir Sampling from Fully Distributed Data Streams
We consider weighted random sampling from distributed data streams presented as a sequence of mini-batches of items. This is a natural model for distributed streaming computation, and our goal is to showcase its usefulness. We present and analyze a fully distributed, communication-efficient algorithm for weighted reservoir sampling in this model. An experimental evaluation on up to 256 nodes (5120 processors) shows good speedups, while theoretical analysis promises further scaling to much larger machines.
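As a rough illustration of the mini-batch model only (not the algorithm analyzed in these papers), each node can score its own mini-batch with the same exponential keys as in the sketch above and ship just O(k) candidates per round to a coordinator; all names below are illustrative.

```python
import heapq
import random

def local_candidates(minibatch, k):
    """Run on each node: score the local mini-batch, keep at most k candidates."""
    scored = [(random.random() ** (1.0 / w), item) for item, w in minibatch]
    return heapq.nlargest(k, scored)

def merge(global_sample, candidates, k):
    """Run on the coordinator: fold one node's candidates into the global top-k."""
    return heapq.nlargest(k, global_sample + candidates)

# One round: three nodes each process a mini-batch, then send O(k) candidates.
k = 5
nodes = [[(f"n{j}i{i}", random.uniform(1, 10)) for i in range(100)] for j in range(3)]
sample = []
for batch in nodes:
    sample = merge(sample, local_candidates(batch, k), k)
print([item for _, item in sample])
```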
Stream Aggregation Through Order Sampling
This paper introduces a new single-pass reservoir weighted-sampling stream
aggregation algorithm, Priority-Based Aggregation (PBA). While order sampling
is a powerful and efficient method for weighted sampling from a stream of
uniquely keyed items, there is no current algorithm that realizes the benefits
of order sampling in the context of stream aggregation over non-unique keys. A
naive approach that order samples regardless of key and then aggregates the
results is hopelessly inefficient. In contrast, our proposed algorithm uses a single
persistent random variable across the lifetime of each key in the cache, and
maintains unbiased estimates of the key aggregates that can be queried at any
point in the stream. The basic approach can be supplemented with a Sample and
Hold pre-sampling stage with a sampling rate adaptation controlled by PBA. This
approach represents a considerable reduction in computational complexity
compared with the state of the art in adapting Sample and Hold to operate with
a fixed cache size. Concerning statistical properties, we prove that PBA
provides unbiased estimates of the true aggregates. We analyze the
computational complexity of PBA and its variants, and provide a detailed
evaluation of its accuracy on synthetic and trace data. Weighted relative error
is reduced by 40% to 65% at sampling rates of 5% to 17%, relative to Adaptive
Sample and Hold; there is also substantial improvement for rank queries.
Comment: 10 pages
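For background on the order-sampling primitive the abstract builds on, here is a sketch of classical priority sampling over uniquely keyed weighted items. PBA itself (persistent per-key randomness and aggregation over non-unique keys) is more involved, so treat this only as the baseline idea it extends.

```python
import heapq
import random

def priority_sample(stream, k):
    """Classical priority sampling over uniquely keyed weighted items.

    Each item gets priority w / u with u ~ Uniform(0, 1]; the k highest
    priorities are kept, and the (k+1)-th highest becomes the threshold
    tau. Estimating each kept item as max(w, tau) yields unbiased
    subset-sum estimates. (This is the baseline primitive, not PBA.)
    """
    heap = []  # min-heap of (priority, key, weight), capped at k + 1 entries
    for key, w in stream:
        u = 1.0 - random.random()          # u in (0, 1], avoids division by zero
        heapq.heappush(heap, (w / u, key, w))
        if len(heap) > k + 1:
            heapq.heappop(heap)
    tau = heap[0][0] if len(heap) > k else 0.0
    return {key: max(w, tau) for _, key, w in heapq.nlargest(k, heap)}

stream = [(f"flow{i}", random.expovariate(1.0)) for i in range(10000)]
estimates = priority_sample(stream, 100)
print(sum(estimates.values()))  # unbiased estimate of the total stream weight
```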
Adaptive Threshold Sampling and Estimation
Sampling is a fundamental problem in both computer science and statistics. A
number of issues arise when designing a method based on sampling. These include
statistical considerations such as constructing a good sampling design and
ensuring there are good, tractable estimators for the quantities of interest as
well as computational considerations such as designing fast algorithms for
streaming data and ensuring the sample fits within memory constraints.
Unfortunately, existing sampling methods are only able to address all of these
issues in limited scenarios.
We develop a framework that can be used to address these issues in a broad
range of scenarios. In particular, it addresses the problem of drawing and
using samples under some memory budget constraint. This problem can be
challenging since the memory budget forces samples to be drawn
non-independently and consequently, makes computation of resulting estimators
difficult.
At the core of the framework is the notion of a data adaptive thresholding
scheme where the threshold effectively allows one to treat the non-independent
sample as if it were drawn independently. We provide sufficient conditions for
a thresholding scheme to allow this and provide ways to build and compose such
schemes.
Furthermore, we provide fast algorithms to efficiently sample under these
thresholding schemes.
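As a rough sketch of the thresholding mechanic described above (not the paper's framework; the rule for raising the threshold and the Horvitz-Thompson correction used here are assumptions), one might write:

```python
import random

def adaptive_threshold_sample(stream, budget):
    """Sketch of threshold sampling under a fixed memory budget.

    An item with weight w and persistent uniform u is held while
    u < min(1, w / tau); whenever the buffer outgrows the budget, tau is
    raised and the buffer re-pruned under the same u's. Surviving items
    are estimated with the Horvitz-Thompson correction w / min(1, w / tau).
    Whether a given rule for raising tau keeps these estimates unbiased is
    exactly what the paper's sufficient conditions characterize.
    """
    tau = 1e-9                      # start by admitting (almost) everything
    buf = {}                        # key -> (weight, u)
    for key, w in stream:
        u = random.random()
        if u < min(1.0, w / tau):
            buf[key] = (w, u)
        while len(buf) > budget:    # raise the threshold until the sample fits
            tau *= 2.0
            buf = {k: wu for k, wu in buf.items()
                   if wu[1] < min(1.0, wu[0] / tau)}
    return {k: w / min(1.0, w / tau) for k, (w, _u) in buf.items()}

stream = [(f"item{i}", random.paretovariate(1.5)) for i in range(5000)]
est = adaptive_threshold_sample(stream, budget=200)
print(len(est), sum(est.values()))  # at most 200 items; estimates the total weight
```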