Communication-Efficient Weighted Reservoir Sampling from Fully Distributed Data Streams
We consider weighted random sampling from distributed data streams presented as a sequence of mini-batches of items. This is a natural model for distributed streaming computation, and our goal is to showcase its usefulness. We present and analyze a fully distributed, communication-efficient algorithm for weighted reservoir sampling in this model. An experimental evaluation on up to 256 nodes (5120 processors) shows good speedups, while theoretical analysis promises further scaling to much larger machines.
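The weighted sampling itself can be sketched with the classic Efraimidis-Spirakis key construction: each item of weight w draws u uniform in (0,1) and gets key u^(1/w), and the k largest keys form a weighted sample without replacement. The snippet below is a minimal single-round illustration of how per-batch samples merge, not the paper's communication-efficient protocol; the names `local_sample` and `merge` are hypothetical.

```python
import heapq
import random

def local_sample(batch, k, rng=random):
    """Top-k items of one mini-batch by key u**(1/w) (Efraimidis-Spirakis)."""
    heap = []  # min-heap of (key, item) holding the current top-k
    for item, w in batch:
        key = rng.random() ** (1.0 / w)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return heap

def merge(samples, k):
    """Coordinator step: keep the k globally largest keys across nodes."""
    return heapq.nlargest(k, (pair for s in samples for pair in s))

# Two "nodes", each holding one mini-batch of (item, weight) pairs.
rng = random.Random(0)
batch1 = [("a", 1.0), ("b", 5.0), ("c", 2.0)]
batch2 = [("d", 10.0), ("e", 0.5)]
s1 = local_sample(batch1, 2, rng)
s2 = local_sample(batch2, 2, rng)
sample = [item for _, item in merge([s1, s2], 2)]
print(sample)  # a weighted sample of 2 of the 5 items
```

Because the keys are comparable across nodes, merging local top-k sets yields exactly the global top-k, which is what makes the scheme friendly to distributed mini-batch processing.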
Algorithms for Provisioning Queries and Analytics
Provisioning is a technique for avoiding repeated expensive computations in
what-if analysis. Given a query, an analyst formulates hypotheticals, each
retaining some of the tuples of a database instance, possibly overlapping, and
she wishes to answer the query under scenarios, where a scenario is defined by
a subset of the hypotheticals that are "turned on". We say that a query admits
compact provisioning if, given any database instance and any n hypotheticals,
one can create a poly-size (in n) sketch that can then be used to answer the
query under any of the 2^n possible scenarios without accessing the
original instance.
In this paper, we focus on provisioning complex queries that combine
relational algebra (the logical component), grouping, and statistics/analytics
(the numerical component). We first show that queries that compute quantiles or
linear regression (as well as simpler queries that compute count and
sum/average of positive values) can be compactly provisioned to provide
(multiplicative) approximate answers to an arbitrary precision. In contrast,
exact provisioning for each of these statistics requires the sketch size to be
exponential in the number of hypotheticals. We then establish that for any complex query whose logical
component is a positive relational algebra query, as long as the numerical
component can be compactly provisioned, the complex query itself can be
compactly provisioned. On the other hand, introducing negation or recursion in
the logical component again requires the sketch size to be exponential in the number of hypotheticals.
While our positive results use algorithms that do not access the original
instance after a scenario is known, we prove our lower bounds even for the case
when, knowing the scenario, limited access to the instance is allowed.
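One way to see why approximate counting can be compactly provisioned: store one small mergeable distinct-count sketch per hypothetical, and at query time union the sketches of the turned-on hypotheticals. The sketch below uses k-minimum-values (KMV) sketches purely as an illustration, under the assumption that a scenario retains the union of its hypotheticals' tuples; it is not necessarily the paper's construction.

```python
import hashlib

def h(x):
    """Deterministic hash into (0, 1], standing in for a random hash."""
    d = hashlib.sha1(repr(x).encode()).digest()
    return (int.from_bytes(d[:8], "big") + 1) / 2.0**64

class KMV:
    """k-minimum-values distinct-count sketch; sketches merge by union."""
    def __init__(self, k, items=()):
        self.k = k
        self.vals = sorted({h(x) for x in items})[:k]
    def union(self, other):
        out = KMV(self.k)
        out.vals = sorted(set(self.vals) | set(other.vals))[:self.k]
        return out
    def estimate(self):
        if len(self.vals) < self.k:
            return float(len(self.vals))     # saw fewer than k distinct values
        return (self.k - 1) / self.vals[-1]  # standard KMV estimator

# Provisioning: build one sketch per hypothetical, once, from the instance.
k = 64
tuples_h1 = range(0, 800)      # tuple ids retained by hypothetical 1
tuples_h2 = range(500, 1200)   # hypothetical 2, overlapping hypothetical 1
sk1, sk2 = KMV(k, tuples_h1), KMV(k, tuples_h2)

# Scenario {h1, h2}: answered from the sketches alone, without the instance.
est = sk1.union(sk2).estimate()
print(round(est))  # approximates |union| = 1200 despite the overlap
```

The sketch size depends only on k and the number of hypotheticals, not on the instance size, and the overlap between hypotheticals is handled automatically because unioning KMV sketches deduplicates by hash value.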
Efficient protocols for distributed classification and optimization
A recent paper [1] proposes a general model for distributed learning that bounds the communication required for learning classifiers with ε error on linearly separable data adversarially distributed across nodes. In this work, we develop key improvements and extensions to this basic model. Our first result is a two-party multiplicative-weight-update based protocol that uses O(d² log 1/ε) words of communication to classify distributed data in arbitrary dimension d, ε-optimally. This extends to classification over k nodes with O(kd² log 1/ε) words of communication. Our proposed protocol is simple to implement and is considerably more efficient than the baselines compared, as demonstrated by our empirical results. In addition, we show how to solve fixed-dimensional and high-dimensional linear programming with small communication in a distributed setting where constraints may be distributed across nodes. Our techniques make use of a novel connection from multipass streaming, as well as adapting the multiplicative-weight-update framework more generally to a distributed setting.
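The multiplicative-weight-update rule at the heart of such protocols can be sketched in its generic "experts" form: maintain a weight per expert and update w_i ← w_i(1 − η·loss_i) each round. This is a minimal illustration of the update rule only, not the paper's distributed classification protocol.

```python
import random

def mwu(losses, eta):
    """Multiplicative weights over n experts.

    losses: a list of rounds, each a list of per-expert losses in [0, 1].
    After each round the update is w_i <- w_i * (1 - eta * loss_i); the
    algorithm's average loss is within roughly eta + ln(n)/(eta*T) of the
    best expert's over T rounds.
    """
    n = len(losses[0])
    w = [1.0] * n
    total = 0.0
    for round_losses in losses:
        s = sum(w)
        p = [wi / s for wi in w]  # play the weight distribution
        total += sum(pi * li for pi, li in zip(p, round_losses))
        w = [wi * (1 - eta * li) for wi, li in zip(w, round_losses)]
    return total, w

# Expert 0 errs on ~10% of rounds, expert 1 on ~60%.
rng = random.Random(1)
T = 500
losses = [[1.0 if rng.random() < 0.1 else 0.0,
           1.0 if rng.random() < 0.6 else 0.0] for _ in range(T)]
loss, w = mwu(losses, eta=0.1)
print(loss / T, w[0] > w[1])  # average loss stays near the best expert's
```

The log(1/ε) rounds in the communication bound correspond to the geometric convergence of updates like this one: each round only a small summary (not the data) needs to cross the wire.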
Distinct random sampling from a distributed stream
We consider continuous maintenance of a random sample of distinct elements from a massive data stream, whose input elements are observed at multiple distributed sites that communicate via a central coordinator. At any point, when a query is received at the coordinator, it responds with a random sample from the set of all distinct elements observed at the different sites so far. We present the first algorithms for distinct random sampling on distributed streams. We also present a lower bound on the expected number of messages that must be transmitted by any distributed algorithm, showing that our algorithm is message-optimal to within a factor of four. We present extensions to sliding windows, and detailed experimental results showing the performance of our algorithm on real-world data sets.
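The core idea behind distinct sampling can be sketched with a hash-based min-wise scheme: a random hash function induces a random order on the distinct elements, so the k smallest hash values form a uniform sample of the distinct set no matter how often elements repeat. This one-shot sketch ignores the continuous-maintenance and message-optimality aspects of the paper.

```python
import hashlib
import heapq

def hval(x):
    """Deterministic 64-bit hash, standing in for a random hash function."""
    d = hashlib.sha1(repr(x).encode()).digest()
    return int.from_bytes(d[:8], "big")

def distinct_sample(streams, k):
    """Sample k elements from the DISTINCT elements of the union of streams.

    Each site keeps the k distinct elements with the smallest hash values
    and ships them to the coordinator, which keeps the k smallest overall.
    """
    merged = {}
    for stream in streams:                    # one stream per site
        local = {x: hval(x) for x in stream}  # duplicates collapse here
        for x, v in heapq.nsmallest(k, local.items(), key=lambda t: t[1]):
            merged[x] = v
    return [x for x, _ in heapq.nsmallest(k, merged.items(), key=lambda t: t[1])]

# Two sites; element "a" is heavily duplicated but counts only once.
s1 = ["a"] * 100 + ["b", "c"]
s2 = ["a", "d", "e", "f"]
sample = distinct_sample([s1, s2], 3)
print(sample)  # 3 distinct elements drawn from {a, b, c, d, e, f}
```

Because every site evaluates the same hash, a duplicated element contributes the same value everywhere and is counted once, which is exactly the property that distinguishes distinct sampling from ordinary reservoir sampling.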
Space-Efficient Estimation of Statistics Over Sub-Sampled Streams
In many stream monitoring situations, the data arrival rate is so high that it is not even possible to observe each element of the stream. The most common solution is to subsample the data stream and use the sample to infer properties and estimate aggregates of the original stream. However, in many cases, the estimation of aggregates on the original stream cannot be accomplished through simply estimating them on the sampled stream, followed by a normalization. We present algorithms for estimating frequency moments, support size, entropy, and heavy hitters of the original stream, through a single pass over the sampled stream.
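A small experiment illustrates why naive normalization fails for some aggregates. Under Bernoulli subsampling with a known rate p (an assumed setup for illustration, not the paper's estimators), the stream length rescales by 1/p, but the support size does not:

```python
import random

def bernoulli_sample(stream, p, rng):
    """Keep each stream element independently with probability p."""
    return [x for x in stream if rng.random() < p]

rng = random.Random(2)
# Original stream: 10000 elements over 100 distinct items (each appears 100x).
stream = [i % 100 for i in range(10000)]
p = 0.1
sampled = bernoulli_sample(stream, p, rng)

# The stream length (F1) rescales: E[len(sampled)] = p * len(stream).
print(len(sampled) / p)        # close to 10000

# The support size (F0) does NOT rescale: nearly every distinct item
# survives sampling here, so dividing by p wildly over-counts.
print(len(set(sampled)) / p)   # near 1000, but the true support size is 100
```

Linear aggregates like sums behave well under this normalization, while aggregates that depend on the frequency distribution (support size, entropy, higher moments) require the dedicated single-pass estimators the abstract refers to.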