Search CORE

19 research outputs found

Continuous sampling from distributed streams

Author: Babcock B.
Cormode G.
Gibbons P.
Graham Cormode
Huang L.
Ke Yi
Muthukrishnan S.
Qin Zhang
S. Muthukrishnan
Publication venue: Association for Computing Machinery (ACM)

Randomized Algorithms for Tracking Distributed Count, Frequencies, and Ranks

Author: Huang Zengfeng
Yi Ke
Zhang Qin
Publication date: 02/12/2011

We show that randomization can lead to significant improvements for a few fundamental problems in distributed tracking. Our basis is the count-tracking problem, where there are $k$ players, each holding a counter $n_i$ that gets incremented over time, and the goal is to track an $\varepsilon$-approximation of their sum $n=\sum_i n_i$ continuously at all times, using minimum communication. While the deterministic communication complexity of the problem is $\Theta(k/\varepsilon \cdot \log N)$, where $N$ is the final value of $n$ when the tracking finishes, we show that with randomization, the communication cost can be reduced to $\Theta(\sqrt{k}/\varepsilon \cdot \log N)$. Our algorithm is simple and uses only $O(1)$ space at each player, while the lower bound holds even assuming each player has infinite computing power. Then, we extend our techniques to two related distributed tracking problems, frequency-tracking and rank-tracking, and obtain similar improvements over previous deterministic algorithms. Both problems are of central importance in large data monitoring and analysis, and have been extensively studied in the literature. Comment: 19 pages, 1 figur
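To make the count-tracking setup concrete, here is a minimal Python sketch of the simple deterministic baseline, in which each player reports its counter whenever it has grown by more than a factor of (1 + eps) since its last report. This is not the randomized algorithm of the paper; the names `Site` and `simulate` and the random assignment of increments to players are assumptions made for illustration only.

```python
import random


class Site:
    """One player holding a local counter n_i (hypothetical helper)."""

    def __init__(self, eps):
        self.eps = eps
        self.count = 0       # true local counter n_i
        self.last_sent = 0   # value most recently reported to the coordinator

    def increment(self):
        """Increment n_i; return a report only when the counter has grown
        by more than a (1 + eps) factor since the last report."""
        self.count += 1
        if self.count > (1 + self.eps) * self.last_sent:
            self.last_sent = self.count
            return self.count
        return None


def simulate(k, eps, total, seed=0):
    """Feed `total` increments to k sites chosen uniformly at random.

    Returns (messages sent, worst relative error seen at the coordinator).
    """
    rng = random.Random(seed)
    sites = [Site(eps) for _ in range(k)]
    true_n = 0      # n = sum_i n_i
    estimate = 0    # coordinator's sum of the last reported values
    messages = 0
    worst = 0.0
    for _ in range(total):
        site = rng.choice(sites)
        before = site.last_sent
        report = site.increment()
        true_n += 1
        if report is not None:
            estimate += report - before   # replace the old report by the new one
            messages += 1
        worst = max(worst, (true_n - estimate) / true_n)
    return messages, worst


if __name__ == "__main__":
    msgs, err = simulate(k=4, eps=0.1, total=2000)
    print("messages:", msgs, "worst relative error:", err)
```

Each player's unreported growth is at most an eps fraction of its last report, so the coordinator's sum never underestimates $n$ by more than an eps fraction, and each player sends roughly $\log_{1+\varepsilon} N$ reports, which is where the $k/\varepsilon \cdot \log N$ scaling of the deterministic bound comes from.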

arXiv.org e-Print Archive

Hong Kong University of Science and Technology Institutional Repository

Communication-Efficient Weighted Reservoir Sampling from Fully Distributed Data Streams

Author: Hübschle-Schneider Lorenz
Sanders Peter
Publication venue: Association for Computing Machinery
Publication date: 01/01/2020

We consider weighted random sampling from distributed data streams presented as a sequence of mini-batches of items. This is a natural model for distributed streaming computation, and our goal is to showcase its usefulness. We present and analyze a fully distributed, communication-efficient algorithm for weighted reservoir sampling in this model. An experimental evaluation on up to 256 nodes (5120 processors) shows good speedups, while theoretical analysis promises further scaling to much larger machines.
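As background for the term, the following Python sketch shows a sequential weighted reservoir sampler based on the well-known key trick of Efraimidis and Spirakis: each item gets the key u^(1/w) for a uniform random u, and the k items with the largest keys are kept. It is a single-machine illustration only, not the distributed mini-batch algorithm of the paper, and the function name `weighted_reservoir` is hypothetical.

```python
import heapq
import random


def weighted_reservoir(stream, k, rng=None):
    """Weighted sample without replacement of up to k items.

    `stream` yields (item, weight) pairs; items with weight <= 0 are skipped.
    """
    rng = rng or random.Random()
    heap = []  # min-heap of (key, arrival index, item); the root is the weakest entry
    for idx, (item, weight) in enumerate(stream):
        if weight <= 0:
            continue
        key = rng.random() ** (1.0 / weight)   # larger weights push keys toward 1
        entry = (key, idx, item)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif key > heap[0][0]:
            heapq.heapreplace(heap, entry)     # evict the smallest key
    return [item for _, _, item in heap]


if __name__ == "__main__":
    data = [("a", 1.0), ("b", 2.0), ("c", 3.0), ("e", 5.0)]
    print(weighted_reservoir(data, 2, random.Random(1)))
```

The reservoir needs only O(k) memory and one pass over the stream, which is what makes the approach attractive as a building block for streaming settings.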

KITopen

Weighted Reservoir Sampling from Distributed Streams

Author: Jayaram Rajesh
Sharma Gokarna
Tirthapura Srikanta
Woodruff David
Publication date: 01/01/2019

We consider message-efficient continuous random sampling from a distributed stream, where the probability of inclusion of an item in the sample is proportional to a weight associated with the item. The unweighted version, where all weights are equal, is well studied, and admits tight upper and lower bounds on message complexity. For weighted sampling with replacement, there is a simple reduction to unweighted sampling with replacement. However, in many applications the stream has only a few heavy items which may dominate a random sample when chosen with replacement. Weighted sampling without replacement (weighted SWOR) eludes this issue, since such heavy items can be sampled at most once. In this work, we present the first message-optimal algorithm for weighted SWOR from a distributed stream. Our algorithm also has optimal space and time complexity. As an application of our algorithm for weighted SWOR, we derive the first distributed streaming algorithms for tracking heavy hitters with residual error. Here the goal is to identify stream items that contribute significantly to the residual stream, once the heaviest items are removed. Residual heavy hitters generalize the notion of $\ell_1$ heavy hitters and are important in streams that have a skewed distribution of weights. In addition to the upper bound, we also provide a lower bound on the message complexity that is nearly tight up to a $\log(1/\epsilon)$ factor. Finally, we use our weighted sampling algorithm to improve the message complexity of distributed $L_1$ tracking, also known as count tracking, which is a widely studied problem in distributed streaming. We also derive a tight message lower bound, which closes the message complexity of this fundamental problem. Comment: To appear in PODS 201
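The idea of weighted SWOR over several sites can be illustrated with a naive one-shot merge, sketched below in Python. Each item draws an exponential key with rate equal to its weight, each site keeps its s smallest keys, and the coordinator keeps the s smallest keys overall. This is an assumption-laden illustration and not the message-optimal continuous protocol of the paper; `local_sketch` and `merge` are hypothetical names.

```python
import heapq
import random


def local_sketch(items, s, rng):
    """Site side: keep the s smallest exponential keys.

    `items` is an iterable of (item, weight) pairs with weight > 0.
    An item of weight w draws its key from Exp(rate=w), so heavy items
    tend to receive small keys.
    """
    keyed = [(rng.expovariate(w), item) for item, w in items]
    return heapq.nsmallest(s, keyed)


def merge(sketches, s):
    """Coordinator side: the s smallest keys overall form the sample."""
    merged = heapq.nsmallest(s, (entry for sk in sketches for entry in sk))
    return [item for _, item in merged]


if __name__ == "__main__":
    sites = [[("a", 1.0), ("b", 50.0)], [("c", 2.0), ("d", 3.0)], [("e", 4.0)]]
    rng = random.Random(7)
    sketches = [local_sketch(site, 2, rng) for site in sites]
    print(merge(sketches, 2))
```

Because every item carries exactly one key, even a very heavy item such as `"b"` can appear in the sample at most once, which is the property the abstract highlights for skewed streams. The merge is exact because any item among the s smallest keys globally is also among the s smallest keys at its own site.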

arXiv.org e-Print Archive

Digital Repository @ Iowa State University (ISU)

Crossref

Communication-Efficient (Weighted) Reservoir Sampling from Fully Distributed Data Streams

Author: Hübschle-Schneider Lorenz
Sanders Peter
Publication date: 01/01/2020

We consider communication-efficient weighted and unweighted (uniform) random sampling from distributed data streams presented as a sequence of mini-batches of items. This is a natural model for distributed streaming computation, and our goal is to showcase its usefulness. We present and analyze fully distributed, communication-efficient algorithms for both versions of the problem. An experimental evaluation of weighted reservoir sampling on up to 256 nodes (5120 processors) shows good speedups, while theoretical analysis promises further scaling to much larger machines. Comment: A previous version of this paper was titled "Communication-Efficient (Weighted) Reservoir Sampling

arXiv.org e-Print Archive

KITopen

Distinct random sampling from a distributed stream

Author: Chung Yung-Yu
Publication venue: Iowa State University Digital Repository
Publication date: 01/01/2013

We consider continuous maintenance of a random sample of distinct elements from a massive data stream, whose input elements are observed at multiple distributed sites that communicate via a central coordinator. At any point, when a query is received at the coordinator, it responds with a random sample from the set of all distinct elements observed at the different sites so far. We present the first algorithms for distinct random sampling on distributed streams. We also present a lower bound on the expected number of messages that must be transmitted by any distributed algorithm, showing that our algorithm is message optimal to within a factor of four. We present extensions to sliding windows, and detailed experimental results showing the performance of our algorithm on real-world data sets.
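One common way to sample distinct elements, sketched below in Python, is to hash every element and keep the s elements with the smallest hash values, so that repeated occurrences of an element never change the sample. The sketch is an illustration under assumptions and is not claimed to be the algorithm of the thesis: SHA-256 stands in for a random hash function, and the `Coordinator` and `Site` classes, along with the rule that a site forwards an element only if its hash is below the last threshold it heard, are hypothetical.

```python
import hashlib


def h(x):
    """Map an element to a 64-bit integer (stand-in for a random hash)."""
    digest = hashlib.sha256(repr(x).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Coordinator:
    def __init__(self, s):
        self.s = s
        self.sample = {}   # element -> hash value, at most s entries

    def threshold(self):
        """Largest hash in a full sample; infinity while the sample is not full."""
        if len(self.sample) < self.s:
            return float("inf")
        return max(self.sample.values())

    def receive(self, x):
        """Insert x if new, evict the largest hash if over capacity,
        and reply with the current threshold."""
        if x not in self.sample:
            self.sample[x] = h(x)
            if len(self.sample) > self.s:
                largest = max(self.sample, key=self.sample.get)
                del self.sample[largest]
        return self.threshold()


class Site:
    def __init__(self, coordinator):
        self.coordinator = coordinator
        self.threshold = float("inf")   # possibly stale copy of the coordinator's threshold
        self.messages = 0

    def observe(self, x):
        """Forward x only if its hash could still enter the sample."""
        if h(x) < self.threshold:
            self.messages += 1
            self.threshold = self.coordinator.receive(x)


if __name__ == "__main__":
    coord = Coordinator(3)
    sites = [Site(coord), Site(coord)]
    for i, x in enumerate([5, 1, 5, 9, 2, 2, 7, 1, 8, 3, 5, 4, 9, 6]):
        sites[i % 2].observe(x)
    print(sorted(coord.sample), sum(s.messages for s in sites))
```

The coordinator's threshold can only decrease once the sample is full, so a stale threshold at a site causes at most some unnecessary messages and never a missed element.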

Digital Repository @ Iowa State University (ISU)

Frequency Estimation Under Multiparty Differential Privacy: One-shot and Streaming

Author: Cormode Graham
Huang Ziyue
Qiu Yuan
Yi Ke
Publication date: 29/05/2021

We study the fundamental problem of frequency estimation under both privacy and communication constraints, where the data is distributed among $k$ parties. We consider two application scenarios: (1) one-shot, where the data is static and the aggregator conducts a one-time computation; and (2) streaming, where each party receives a stream of items over time and the aggregator continuously monitors the frequencies. We adopt the model of multiparty differential privacy (MDP), which is more general than local differential privacy (LDP) and (centralized) differential privacy. Our protocols achieve optimality (up to logarithmic factors) permissible by the more stringent of the two constraints. In particular, when specialized to the $\varepsilon$-LDP model, our protocol achieves an error of $\sqrt{k}/(e^{\Theta(\varepsilon)}-1)$ using $O(k\max\{ \varepsilon, \frac{1}{\varepsilon} \})$ bits of communication and $O(k \log u)$ bits of public randomness, where $u$ is the size of the domain.
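For intuition about the shape of the LDP error bound, the Python sketch below implements classical binary randomized response, the textbook eps-LDP mechanism for a single bit per party. It is not the protocol of the paper, and the function names `rr_probs`, `randomize`, and `estimate_count` are hypothetical. Each party reports its true bit with probability p = e^eps / (1 + e^eps) and the flipped bit otherwise, and the aggregator debiases the sum of the reports.

```python
import math
import random


def rr_probs(eps):
    """Return (p, q): probability of reporting the true bit and the flipped bit."""
    p = math.exp(eps) / (1.0 + math.exp(eps))
    return p, 1.0 - p


def randomize(bit, eps, rng):
    """Party side: report the true bit with probability p, else flip it."""
    p, _ = rr_probs(eps)
    return bit if rng.random() < p else 1 - bit


def estimate_count(report_sum, k, eps):
    """Aggregator side: unbiased estimate of how many of the k parties hold a 1.

    E[report_sum] = c * p + (k - c) * q, which is solved for c.
    """
    p, q = rr_probs(eps)
    return (report_sum - k * q) / (p - q)


if __name__ == "__main__":
    rng = random.Random(0)
    bits = [1] * 6000 + [0] * 14000
    reports = sum(randomize(b, 1.0, rng) for b in bits)
    print(estimate_count(reports, len(bits), 1.0))
```

Every report is a Bernoulli variable with variance p * q, so the estimate has standard deviation $\sqrt{k\,p\,q}/(p-q) = \sqrt{k}\, e^{\varepsilon/2}/(e^{\varepsilon}-1)$, which has the same $\sqrt{k}/(e^{\Theta(\varepsilon)}-1)$ form as the error quoted in the abstract.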

arXiv.org e-Print Archive

Warwick Research Archives Portal Repository