Continuous Monitoring of Distributed Data Streams over a Time-based Sliding Window
The past decade has witnessed many interesting algorithms for maintaining
statistics over a data stream. This paper initiates a theoretical study of
algorithms for monitoring distributed data streams over a time-based sliding
window (which contains a variable number of items and possibly out-of-order
items). The concern is how to minimize the communication between individual
streams and the root, while allowing the root, at any time, to be able to
report the global statistics of all streams within a given error bound. This
paper presents communication-efficient algorithms for three classical
statistics, namely, basic counting, frequent items and quantiles. The
worst-case communication cost over a window is bounded in bits for basic counting and in words for the remaining two statistics; the bounds are stated in terms of the number of distributed
data streams, the total number of items in the streams that arrive or
expire in the window, and the desired error bound. Matching
and nearly matching lower bounds are also obtained. Comment: 12 pages, to appear in the 27th International Symposium on
Theoretical Aspects of Computer Science (STACS), 201
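To make the notion of a time-based sliding window concrete, here is a minimal Python sketch of exact basic counting at a single site. It is not the paper's algorithm (which is approximate and minimizes communication with a root); the class name `TimeWindowCounter` and its methods are hypothetical, and the sketch assumes query times never decrease and no item carries a timestamp later than the query time.

```python
import heapq


class TimeWindowCounter:
    """Exact count of items whose timestamp lies in (now - width, now]."""

    def __init__(self, width):
        self.width = width
        # A min-heap keeps the oldest timestamp on top, so items may be
        # added out of order and still expire correctly.
        self._stamps = []

    def add(self, timestamp):
        heapq.heappush(self._stamps, timestamp)

    def count(self, now):
        # Assumption: `now` is non-decreasing across calls.
        # Drop every item that has fallen out of the window.
        while self._stamps and self._stamps[0] <= now - self.width:
            heapq.heappop(self._stamps)
        return len(self._stamps)
```

The window holds a variable number of items because membership depends only on timestamps, not on how many items have arrived.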
Distributed Query Monitoring through Convex Analysis: Towards Composable Safe Zones
Continuous tracking of complex data analytics queries over high-speed distributed streams is becoming increasingly important. Query tracking can be reduced to continuous monitoring of a condition over the global stream. Communication-efficient monitoring relies on locally processing stream data at the sites where it is generated, by deriving site-local conditions which collectively guarantee the global condition. Recently proposed geometric techniques offer a generic approach for splitting an arbitrary global condition into local geometric monitoring constraints (known as "Safe Zones"); still, their application to various problem domains has so far been based on heuristics and lacking a principled, compositional methodology. In this paper, we present the first known formal results on the difficult problem of effective Safe Zone (SZ) design for complex query monitoring over distributed streams. Exploiting tools from convex analysis, our approach relies on an algebraic representation of SZs which allows us to: (1) Formally define the notion of a "good" SZ for distributed monitoring problems; and, most importantly, (2) Tackle and solve the important problem of systematically composing SZs for monitored conditions expressed as Boolean formulas over simpler conditions (for which SZs are known); furthermore, we prove that, under broad assumptions, the composed SZ is good if the component SZs are good. Our results are, therefore, a first step towards a principled compositional solution to SZ design for distributed query monitoring. Finally, we discuss a number of important applications for our SZ design algorithms, also demonstrating how earlier geometric techniques can be seen as special cases of our framework
Optimal Tracking of Distributed Heavy Hitters and Quantiles
We consider the problem of tracking heavy hitters and quantiles in the
distributed streaming model. The heavy hitters and quantiles are two important
statistics for characterizing a data distribution. Let $A$ be a multiset of
elements, drawn from the universe $U$. For a given $0 \le \phi \le 1$, the $\phi$-heavy hitters are those elements of $A$ whose frequency in
$A$ is at least $\phi |A|$; the $\phi$-quantile of $A$ is an element $x$ of $U$
such that at most $\phi |A|$ elements of $A$ are smaller than $x$ and at most
$(1-\phi) |A|$ elements of $A$ are greater than $x$. Suppose the elements of $A$
are received at $k$ remote sites over time, and each of the sites has a
two-way communication channel to a designated coordinator, whose goal is
to track the set of $\phi$-heavy hitters and the $\phi$-quantile of $A$
approximately at all times with minimum communication. We give tracking
algorithms with worst-case communication cost $O(k/\epsilon \cdot \log n)$ for both
problems, where $n$ is the total number of items in $A$, and $\epsilon$ is the
approximation error. This substantially improves upon the previous known
algorithms. We also give matching lower bounds on the communication costs for
both problems, showing that our algorithms are optimal. We also consider a more
general version of the problem where we simultaneously track the
$\phi$-quantiles for all $0 \le \phi \le 1$. Comment: 10 pages, 1 figur
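The two statistics defined in this abstract can be illustrated with a short Python sketch. It computes them exactly on a multiset held in one place, so it shows only the definitions, not the distributed tracking algorithms of the paper. The function names `heavy_hitters` and `quantile` are hypothetical, and `phi` stands for the fraction parameter in the definitions.

```python
from collections import Counter


def heavy_hitters(items, phi):
    """Elements whose frequency is at least phi * len(items)."""
    counts = Counter(items)
    threshold = phi * len(items)
    return {x for x, c in counts.items() if c >= threshold}


def quantile(items, phi):
    """An element x such that at most phi*n items are smaller than x
    and at most (1 - phi)*n items are greater than x."""
    if not items:
        raise ValueError("quantile of an empty multiset is undefined")
    ordered = sorted(items)
    n = len(ordered)
    # Position floor(phi * n), clamped so that phi = 1 picks the maximum.
    idx = min(int(phi * n), n - 1)
    return ordered[idx]
```

For example, with the multiset `[1, 1, 1, 2, 2, 3, 4, 5]` and `phi = 0.25`, the threshold is 2 occurrences, so the heavy hitters are 1 and 2.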