
    Continuous Monitoring of Distributed Data Streams over a Time-based Sliding Window

    The past decade has witnessed many interesting algorithms for maintaining statistics over a data stream. This paper initiates a theoretical study of algorithms for monitoring distributed data streams over a time-based sliding window (which contains a variable number of items and possibly out-of-order items). The concern is how to minimize the communication between individual streams and the root, while allowing the root, at any time, to report the global statistics of all streams within a given error bound. This paper presents communication-efficient algorithms for three classical statistics, namely, basic counting, frequent items, and quantiles. The worst-case communication cost over a window is $O(\frac{k}{\epsilon} \log \frac{\epsilon N}{k})$ bits for basic counting and $O(\frac{k}{\epsilon} \log \frac{N}{k})$ words for the other two, where $k$ is the number of distributed data streams, $N$ is the total number of items in the streams that arrive or expire in the window, and $\epsilon < 1$ is the desired error bound. Matching and nearly matching lower bounds are also obtained. Comment: 12 pages, to appear in the 27th International Symposium on Theoretical Aspects of Computer Science (STACS), 201

    Distributed Query Monitoring through Convex Analysis: Towards Composable Safe Zones

    Continuous tracking of complex data analytics queries over high-speed distributed streams is becoming increasingly important. Query tracking can be reduced to continuous monitoring of a condition over the global stream. Communication-efficient monitoring relies on locally processing stream data at the sites where it is generated, by deriving site-local conditions which collectively guarantee the global condition. Recently proposed geometric techniques offer a generic approach for splitting an arbitrary global condition into local geometric monitoring constraints (known as "Safe Zones"); still, their application to various problem domains has so far been based on heuristics and has lacked a principled, compositional methodology. In this paper, we present the first known formal results on the difficult problem of effective Safe Zone (SZ) design for complex query monitoring over distributed streams. Exploiting tools from convex analysis, our approach relies on an algebraic representation of SZs which allows us to: (1) formally define the notion of a "good" SZ for distributed monitoring problems; and, most importantly, (2) tackle and solve the important problem of systematically composing SZs for monitored conditions expressed as Boolean formulas over simpler conditions (for which SZs are known). Furthermore, we prove that, under broad assumptions, the composed SZ is good if the component SZs are good. Our results are, therefore, a first step towards a principled compositional solution to SZ design for distributed query monitoring. Finally, we discuss a number of important applications for our SZ design algorithms, also demonstrating how earlier geometric techniques can be seen as special cases of our framework.

    Optimal Tracking of Distributed Heavy Hitters and Quantiles

    We consider the problem of tracking heavy hitters and quantiles in the distributed streaming model. The heavy hitters and quantiles are two important statistics for characterizing a data distribution. Let $A$ be a multiset of elements, drawn from the universe $U=\{1,...,u\}$. For a given $0 \le \phi \le 1$, the $\phi$-heavy hitters are those elements of $A$ whose frequency in $A$ is at least $\phi|A|$; the $\phi$-quantile of $A$ is an element $x$ of $U$ such that at most $\phi|A|$ elements of $A$ are smaller than $x$ and at most $(1-\phi)|A|$ elements of $A$ are greater than $x$. Suppose the elements of $A$ are received at $k$ remote sites over time, and each of the sites has a two-way communication channel to a designated coordinator, whose goal is to track the set of $\phi$-heavy hitters and the $\phi$-quantile of $A$ approximately at all times with minimum communication. We give tracking algorithms with worst-case communication cost $O(k/\epsilon \cdot \log n)$ for both problems, where $n$ is the total number of items in $A$, and $\epsilon$ is the approximation error. This substantially improves upon the previously known algorithms. We also give matching lower bounds on the communication costs for both problems, showing that our algorithms are optimal. We also consider a more general version of the problem where we simultaneously track the $\phi$-quantiles for all $0 \le \phi \le 1$. Comment: 10 pages, 1 figure
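
    The definitions of heavy hitters and quantiles in the abstract above can be made concrete with a small sketch. It computes both statistics exactly on a single in-memory multiset. It is not the paper's distributed tracking algorithm, and the function names (heavy_hitters, is_phi_quantile, phi_quantile) are hypothetical, chosen only for illustration.

```python
from collections import Counter


def heavy_hitters(A, phi):
    """Elements of A whose frequency is at least phi * |A| (exact, centralized)."""
    threshold = phi * len(A)
    return {x for x, count in Counter(A).items() if count >= threshold}


def is_phi_quantile(A, x, phi):
    """Check the abstract's definition: at most phi*|A| elements are smaller
    than x and at most (1 - phi)*|A| elements are greater than x."""
    smaller = sum(1 for a in A if a < x)
    greater = sum(1 for a in A if a > x)
    return smaller <= phi * len(A) and greater <= (1 - phi) * len(A)


def phi_quantile(A, phi):
    """Smallest element of A satisfying the phi-quantile definition.
    Assumes A is non-empty; several elements may qualify, this returns one."""
    for x in sorted(set(A)):
        if is_phi_quantile(A, x, phi):
            return x
    return None


# Illustrative data (made up for this sketch, not from the paper).
A = [1, 1, 1, 1, 2, 2, 3, 4]
hh_quarter = heavy_hitters(A, 0.25)   # elements occurring at least 2 times
median = phi_quantile(A, 0.5)
```

    The quantile is not unique in general: on the sample data both 1 and 2 satisfy the definition for $\phi = 0.5$, which is why the sketch separates the membership check from the choice of a representative.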