Communication-Efficient Weighted Reservoir Sampling from Fully Distributed Data Streams
We consider weighted random sampling from distributed data streams presented as a sequence of mini-batches of items. This is a natural model for distributed streaming computation, and our goal is to showcase its usefulness. We present and analyze a fully distributed, communication-efficient algorithm for weighted reservoir sampling in this model. An experimental evaluation on up to 256 nodes (5120 processors) shows good speedups, while theoretical analysis promises further scaling to much larger machines.
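The weighted sampling itself can be sketched with the classic Efraimidis-Spirakis key construction: each item of weight w draws u uniform in (0,1) and gets key u^(1/w), and the k largest keys form a weighted sample without replacement. The snippet below is a minimal single-round illustration of how per-batch samples merge, not the paper's communication-efficient protocol; the names `local_sample` and `merge` are hypothetical.

```python
import heapq
import random

def local_sample(batch, k, rng=random):
    """Top-k items of one mini-batch by key u**(1/w) (Efraimidis-Spirakis)."""
    heap = []  # min-heap of (key, item) holding the current top-k
    for item, w in batch:
        key = rng.random() ** (1.0 / w)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return heap

def merge(samples, k):
    """Coordinator step: keep the k globally largest keys across nodes."""
    return heapq.nlargest(k, (pair for s in samples for pair in s))

# Two "nodes", each holding one mini-batch of (item, weight) pairs.
rng = random.Random(0)
batch1 = [("a", 1.0), ("b", 5.0), ("c", 2.0)]
batch2 = [("d", 10.0), ("e", 0.5)]
s1 = local_sample(batch1, 2, rng)
s2 = local_sample(batch2, 2, rng)
sample = [item for _, item in merge([s1, s2], 2)]
print(sample)  # a weighted sample of 2 of the 5 items
```

Because the keys are comparable across nodes, merging local top-k sets yields exactly the global top-k, which is what makes the scheme friendly to distributed mini-batch processing.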
Algorithms for Provisioning Queries and Analytics
Provisioning is a technique for avoiding repeated expensive computations in
what-if analysis. Given a query, an analyst formulates hypotheticals, each
retaining some of the tuples of a database instance, possibly overlapping, and
she wishes to answer the query under scenarios, where a scenario is defined by
a subset of the hypotheticals that are "turned on". We say that a query admits
compact provisioning if, given any database instance and any n hypotheticals,
one can create a poly-size (in n) sketch that can then be used to answer the
query under any of the 2^n possible scenarios without accessing the
original instance.
In this paper, we focus on provisioning complex queries that combine
relational algebra (the logical component), grouping, and statistics/analytics
(the numerical component). We first show that queries that compute quantiles or
linear regression (as well as simpler queries that compute count and
sum/average of positive values) can be compactly provisioned to provide
(multiplicative) approximate answers to an arbitrary precision. In contrast,
exact provisioning for each of these statistics requires the sketch size to be
exponential in the number of hypotheticals. We then establish that for any complex query whose logical
component is a positive relational algebra query, as long as the numerical
component can be compactly provisioned, the complex query itself can be
compactly provisioned. On the other hand, introducing negation or recursion in
the logical component again requires the sketch size to be exponential in the number of hypotheticals.
While our positive results use algorithms that do not access the original
instance after a scenario is known, we prove our lower bounds even for the case
when, knowing the scenario, limited access to the instance is allowed.
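One way to see why approximate counting can be compactly provisioned: store one small mergeable distinct-count sketch per hypothetical, and at query time union the sketches of the turned-on hypotheticals. The sketch below uses k-minimum-values (KMV) sketches purely as an illustration, under the assumption that a scenario retains the union of its hypotheticals' tuples; it is not necessarily the paper's construction.

```python
import hashlib

def h(x):
    """Deterministic hash into (0, 1], standing in for a random hash."""
    d = hashlib.sha1(repr(x).encode()).digest()
    return (int.from_bytes(d[:8], "big") + 1) / 2.0**64

class KMV:
    """k-minimum-values distinct-count sketch; sketches merge by union."""
    def __init__(self, k, items=()):
        self.k = k
        self.vals = sorted({h(x) for x in items})[:k]
    def union(self, other):
        out = KMV(self.k)
        out.vals = sorted(set(self.vals) | set(other.vals))[:self.k]
        return out
    def estimate(self):
        if len(self.vals) < self.k:
            return float(len(self.vals))     # saw fewer than k distinct values
        return (self.k - 1) / self.vals[-1]  # standard KMV estimator

# Provisioning: build one sketch per hypothetical, once, from the instance.
k = 64
tuples_h1 = range(0, 800)      # tuple ids retained by hypothetical 1
tuples_h2 = range(500, 1200)   # hypothetical 2, overlapping hypothetical 1
sk1, sk2 = KMV(k, tuples_h1), KMV(k, tuples_h2)

# Scenario {h1, h2}: answered from the sketches alone, without the instance.
est = sk1.union(sk2).estimate()
print(round(est))  # approximates |union| = 1200 despite the overlap
```

The sketch size depends only on k and the number of hypotheticals, not on the instance size, and the overlap between hypotheticals is handled automatically because unioning KMV sketches deduplicates by hash value.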
Efficient protocols for distributed classification and optimization
A recent paper [1] proposes a general model for distributed learning that bounds the communication required for learning classifiers with ε error on linearly separable data adversarially distributed across nodes. In this work, we develop key improvements and extensions to this basic model. Our first result is a two-party multiplicative-weight-update based protocol that uses O(d² log 1/ε) words of communication to classify distributed data in arbitrary dimension d, ε-optimally. This extends to classification over k nodes with O(kd² log 1/ε) words of communication. Our proposed protocol is simple to implement and is considerably more efficient than the baselines compared, as demonstrated by our empirical results. In addition, we show how to solve fixed-dimensional and high-dimensional linear programming with small communication in a distributed setting where constraints may be distributed across nodes. Our techniques make use of a novel connection from multipass streaming, as well as adapting the multiplicative-weight-update framework more generally to a distributed setting.
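The multiplicative-weight-update rule at the heart of such protocols can be sketched in its generic "experts" form: maintain a weight per expert and update w_i ← w_i(1 − η·loss_i) each round. This is a minimal illustration of the update rule only, not the paper's distributed classification protocol.

```python
import random

def mwu(losses, eta):
    """Multiplicative weights over n experts.

    losses: a list of rounds, each a list of per-expert losses in [0, 1].
    After each round the update is w_i <- w_i * (1 - eta * loss_i); the
    algorithm's average loss is within roughly eta + ln(n)/(eta*T) of the
    best expert's over T rounds.
    """
    n = len(losses[0])
    w = [1.0] * n
    total = 0.0
    for round_losses in losses:
        s = sum(w)
        p = [wi / s for wi in w]  # play the weight distribution
        total += sum(pi * li for pi, li in zip(p, round_losses))
        w = [wi * (1 - eta * li) for wi, li in zip(w, round_losses)]
    return total, w

# Expert 0 errs on ~10% of rounds, expert 1 on ~60%.
rng = random.Random(1)
T = 500
losses = [[1.0 if rng.random() < 0.1 else 0.0,
           1.0 if rng.random() < 0.6 else 0.0] for _ in range(T)]
loss, w = mwu(losses, eta=0.1)
print(loss / T, w[0] > w[1])  # average loss stays near the best expert's
```

The log(1/ε) rounds in the communication bound correspond to the geometric convergence of updates like this one: each round only a small summary (not the data) needs to cross the wire.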
Distinct random sampling from a distributed stream
We consider continuous maintenance of a random sample of distinct elements from a massive data stream, whose input elements are observed at multiple distributed sites that communicate via a central coordinator. At any point, when a query is received at the coordinator, it responds with a random sample from the set of all distinct elements observed at the different sites so far. We present the first algorithms for distinct random sampling on distributed streams. We also present a lower bound on the expected number of messages that must be transmitted by any distributed algorithm, showing that our algorithm is message-optimal to within a factor of four. We present extensions to sliding windows, and detailed experimental results showing the performance of our algorithm on real-world data sets.
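The core idea behind distinct sampling can be sketched with a hash-based min-wise scheme: a random hash function induces a random order on the distinct elements, so the k smallest hash values form a uniform sample of the distinct set no matter how often elements repeat. This one-shot sketch ignores the continuous-maintenance and message-optimality aspects of the paper.

```python
import hashlib
import heapq

def hval(x):
    """Deterministic 64-bit hash, standing in for a random hash function."""
    d = hashlib.sha1(repr(x).encode()).digest()
    return int.from_bytes(d[:8], "big")

def distinct_sample(streams, k):
    """Sample k elements from the DISTINCT elements of the union of streams.

    Each site keeps the k distinct elements with the smallest hash values
    and ships them to the coordinator, which keeps the k smallest overall.
    """
    merged = {}
    for stream in streams:                    # one stream per site
        local = {x: hval(x) for x in stream}  # duplicates collapse here
        for x, v in heapq.nsmallest(k, local.items(), key=lambda t: t[1]):
            merged[x] = v
    return [x for x, _ in heapq.nsmallest(k, merged.items(), key=lambda t: t[1])]

# Two sites; element "a" is heavily duplicated but counts only once.
s1 = ["a"] * 100 + ["b", "c"]
s2 = ["a", "d", "e", "f"]
sample = distinct_sample([s1, s2], 3)
print(sample)  # 3 distinct elements drawn from {a, b, c, d, e, f}
```

Because every site evaluates the same hash, a duplicated element contributes the same value everywhere and is counted once, which is exactly the property that distinguishes distinct sampling from ordinary reservoir sampling.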
Space-Efficient Estimation of Statistics Over Sub-Sampled Streams
In many stream monitoring situations, the data arrival rate is so high that it is not even possible to observe each element of the stream. The most common solution is to subsample the data stream and use the sample to infer properties and estimate aggregates of the original stream. However, in many cases, the estimation of aggregates on the original stream cannot be accomplished through simply estimating them on the sampled stream, followed by a normalization. We present algorithms for estimating frequency moments, support size, entropy, and heavy hitters of the original stream, through a single pass over the sampled stream.
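A small experiment illustrates why naive normalization fails for some aggregates. Under Bernoulli subsampling with a known rate p (an assumed setup for illustration, not the paper's estimators), the stream length rescales by 1/p, but the support size does not:

```python
import random

def bernoulli_sample(stream, p, rng):
    """Keep each stream element independently with probability p."""
    return [x for x in stream if rng.random() < p]

rng = random.Random(2)
# Original stream: 10000 elements over 100 distinct items (each appears 100x).
stream = [i % 100 for i in range(10000)]
p = 0.1
sampled = bernoulli_sample(stream, p, rng)

# The stream length (F1) rescales: E[len(sampled)] = p * len(stream).
print(len(sampled) / p)        # close to 10000

# The support size (F0) does NOT rescale: nearly every distinct item
# survives sampling here, so dividing by p wildly over-counts.
print(len(set(sampled)) / p)   # near 1000, but the true support size is 100
```

Linear aggregates like sums behave well under this normalization, while aggregates that depend on the frequency distribution (support size, entropy, higher moments) require the dedicated single-pass estimators the abstract refers to.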