    Continuous Monitoring of Distributed Data Streams over a Time-based Sliding Window

    The past decade has witnessed many interesting algorithms for maintaining statistics over a data stream. This paper initiates a theoretical study of algorithms for monitoring distributed data streams over a time-based sliding window (which contains a variable number of items, possibly arriving out of order). The concern is how to minimize the communication between the individual streams and the root, while allowing the root, at any time, to report the global statistics of all streams within a given error bound. This paper presents communication-efficient algorithms for three classical statistics, namely basic counting, frequent items and quantiles. The worst-case communication cost over a window is $O(\frac{k}{\epsilon} \log \frac{\epsilon N}{k})$ bits for basic counting and $O(\frac{k}{\epsilon} \log \frac{N}{k})$ words for the remaining two, where $k$ is the number of distributed data streams, $N$ is the total number of items in the streams that arrive or expire in the window, and $\epsilon < 1$ is the desired error bound. Matching and nearly matching lower bounds are also obtained.
    Comment: 12 pages, to appear in the 27th International Symposium on Theoretical Aspects of Computer Science (STACS), 2010
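
    The flavor of such protocols can be illustrated with a minimal, hypothetical sketch: each site keeps an exact count over its own window and pushes an update to the root only when that count drifts by a relative $\epsilon$ from the last value it reported, so the root's sum is always within the error bound. All names below are illustrative; this "report on significant change" idea is far simpler than the paper's actual algorithms, which also account for expirations and achieve the stated communication bounds.

```python
from collections import deque

class Site:
    """One stream site: exact sliding-window count of 1-items,
    reporting to the root only on significant relative change."""
    def __init__(self, root, site_id, window, eps):
        self.root, self.site_id = root, site_id
        self.window, self.eps = window, eps
        self.arrivals = deque()   # timestamps of 1-items in the window
        self.last_reported = 0

    def observe(self, bit, now):
        # Expire items that fell out of the time-based window.
        while self.arrivals and self.arrivals[0] <= now - self.window:
            self.arrivals.popleft()
        if bit:
            self.arrivals.append(now)
        c = len(self.arrivals)
        # Communicate only when the count drifts by a relative eps.
        # (A real protocol must also report expirations that happen
        # while no new items arrive.)
        if abs(c - self.last_reported) > self.eps * max(self.last_reported, 1):
            self.root.update(self.site_id, c)
            self.last_reported = c

class Root:
    """Holds the last reported count per site; the global estimate
    is their sum, correct up to the per-site relative error."""
    def __init__(self):
        self.counts = {}
    def update(self, site_id, count):
        self.counts[site_id] = count
    def estimate(self):
        return sum(self.counts.values())
```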

    Boosting the Basic Counting on Distributed Streams

    We revisit the classic basic counting problem in the distributed streaming model that was studied by Gibbons and Tirthapura (GT). In the solution for maintaining an $(\epsilon,\delta)$-estimate, as GT's method does, we make the following new contributions: (1) For a bit stream of size $n$, where each bit has probability at least $\gamma$ of being 1, we exponentially reduce the average total processing time from GT's $\Theta(n \log(1/\delta))$ to $O((1/(\gamma\epsilon^2))(\log^2 n) \log(1/\delta))$, thus providing the first sublinear-time streaming algorithm for this problem. (2) In addition to an overall much faster processing speed, our method provides a new tradeoff: a lower accuracy demand (a larger value of $\epsilon$) yields a faster processing speed, whereas GT's processing speed is $\Theta(n \log(1/\delta))$ in any case and for any $\epsilon$. (3) The worst-case total time cost of our method matches GT's $\Theta(n \log(1/\delta))$, which is necessary but rarely occurs in our method. (4) The space usage overhead of our method is a lower-order term compared with GT's space usage; it arises only $O(\log n)$ times during the stream processing and is in practice too small to be detected by the operating system. We further validate these theoretical results with experiments on both real-world and synthetic data, showing that our method is faster than GT's by a factor of several to several thousand, depending on the stream size and accuracy demands, without any detectable space usage overhead. Our method is based on a faster sampling technique that we design for boosting GT's method, and we believe this technique may be of independent interest.
    Comment: 32 pages
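
    A plausible way to see where such a speedup can come from is geometric skip sampling: rather than flipping a coin for every 1-bit, pre-draw how many 1-bits to pass over before the next sampled one, so the work is proportional to the number of samples rather than the stream length. The sketch below is a generic illustration of that idea under assumed names; the paper's actual sampling technique and its analysis are more involved.

```python
import math
import random

class SkipSampler:
    """Samples each 1-bit with probability p, but replaces the
    per-bit coin flip with a pre-drawn geometric 'skip': the number
    of 1-bits to ignore before the next sampled one."""
    def __init__(self, p):
        assert 0.0 < p < 1.0
        self.p = p
        self.sample_count = 0
        self.skip = self._draw_skip()

    def _draw_skip(self):
        # Geometric(p): number of failures before the first success.
        u = 1.0 - random.random()          # u in (0, 1]
        return math.floor(math.log(u) / math.log(1.0 - self.p))

    def offer_run(self, k):
        """Consume a run of k one-bits in O(samples taken) time."""
        while k > self.skip:
            k -= self.skip + 1             # skip some bits, sample one
            self.sample_count += 1
            self.skip = self._draw_skip()
        self.skip -= k
```

    Because offer_run executes its loop body once per sample, input whose 1-bits are presented as run lengths is processed in time proportional to the number of samples taken rather than to the stream size.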

    Time-decaying Sketches for Robust Aggregation of Sensor Data

    We present a new sketch for summarizing network data. The sketch has the following properties, which make it useful for communication-efficient aggregation in distributed streaming scenarios such as sensor networks: the sketch is duplicate insensitive, i.e., reinsertions of the same data will not affect the sketch, and hence the estimates of aggregates. Unlike previous duplicate-insensitive sketches for sensor data aggregation [S. Nath et al., Synopsis diffusion for robust aggregation in sensor networks, in Proceedings of the 2nd International Conference on Embedded Networked Sensor Systems, 2004, pp. 250–262], [J. Considine et al., Approximate aggregation techniques for sensor databases, in Proceedings of the 20th International Conference on Data Engineering (ICDE), 2004, pp. 449–460], it is also time decaying, so that the weight of a data item in the sketch can decrease with time according to a user-specified decay function. The sketch can give provably approximate guarantees for various aggregates of data, including the sum, median, quantiles, and frequent elements. The size of the sketch and the time taken to update it are both polylogarithmic in the size of the relevant data. Further, multiple sketches computed over distributed data can be combined without loss of accuracy. To our knowledge, this is the first sketch that combines all of the above properties.
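
    Two of those properties, duplicate insensitivity and lossless merging, are easy to demonstrate with a standard k-minimum-values (KMV) distinct-count sketch, shown below as a point of reference; note that plain KMV is not time-decaying, which is precisely the gap this paper fills.

```python
import hashlib

class KMVSketch:
    """k-minimum-values sketch: the hash of an item id is
    deterministic, so reinserting the same item never changes the
    sketch (duplicate insensitivity), and two sketches merge by
    keeping the k smallest values of their union (no accuracy loss)."""
    def __init__(self, k):
        self.k = k
        self.mins = set()

    @staticmethod
    def _hash(item):
        digest = hashlib.sha1(str(item).encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0,1)

    def insert(self, item):
        self.mins.add(self._hash(item))
        if len(self.mins) > self.k:
            self.mins.remove(max(self.mins))

    def merge(self, other):
        self.mins |= other.mins
        while len(self.mins) > self.k:
            self.mins.remove(max(self.mins))

    def estimate(self):
        if len(self.mins) < self.k:
            return float(len(self.mins))
        return (self.k - 1) / max(self.mins)
```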

    Identifying Correlated Heavy-Hitters in a Two-Dimensional Data Stream

    We consider online mining of correlated heavy-hitters from a data stream. Given a stream of two-dimensional data, a correlated aggregate query first extracts a substream by applying a predicate along a primary dimension, and then computes an aggregate along a secondary dimension. Prior work on identifying heavy-hitters in streams has almost exclusively focused on a single-dimensional stream, and yields little insight into the properties of heavy-hitters along other dimensions. In typical applications, however, an analyst is interested not only in identifying heavy-hitters, but also in understanding further properties, such as: which other items appear frequently along with a heavy-hitter, or what is the frequency distribution of items that appear along with the heavy-hitters? We consider queries of the following form: in a stream S of (x, y) tuples, on the substream H of all x values that are heavy-hitters, maintain those y values that occur frequently with the x values in H. We call this problem Correlated Heavy-Hitters (CHH). We give an approximate formulation of CHH identification and present an algorithm for tracking CHHs on a data stream. The algorithm is easy to implement and uses a workspace orders of magnitude smaller than the stream itself. We present provable guarantees on the maximum error, as well as detailed experimental results that demonstrate the space-accuracy trade-off.
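
    The two-level structure suggested by the problem statement can be sketched with a nested Misra-Gries summary: an outer summary tracks candidate heavy x values, and each tracked x owns an inner summary over the y values arriving with it. This is a minimal sketch consistent with the abstract, with assumed parameter names; the paper's algorithm and its error guarantees may differ in the details.

```python
class MisraGries:
    """Classic deterministic frequent-items summary with k-1 counters."""
    def __init__(self, k):
        self.k = k
        self.counters = {}

    def update(self, x):
        if x in self.counters:
            self.counters[x] += 1
        elif len(self.counters) < self.k - 1:
            self.counters[x] = 1
        else:
            # Decrement every counter; drop those that reach zero.
            for key in list(self.counters):
                self.counters[key] -= 1
                if self.counters[key] == 0:
                    del self.counters[key]

class NestedCHH:
    """Outer summary over x; one inner summary per tracked x over y."""
    def __init__(self, k_outer, k_inner):
        self.k_inner = k_inner
        self.outer = MisraGries(k_outer)
        self.inner = {}                    # x -> MisraGries over y

    def update(self, x, y):
        self.outer.update(x)
        if x in self.outer.counters:
            self.inner.setdefault(x, MisraGries(self.k_inner)).update(y)
        # Discard inner summaries whose x was evicted from the outer level.
        for x_old in list(self.inner):
            if x_old not in self.outer.counters:
                del self.inner[x_old]
```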

    An evaluation of streaming algorithms for distinct counting over a sliding window

    Counting the number of distinct elements in a data stream (distinct counting) is a fundamental aggregation task in database query processing, query optimization, and network monitoring. On a stream of elements, it is often necessary to compute an aggregate over only the most recent elements, leading to the problem of distinct counting over a “sliding window” of the stream. We present a detailed experimental study of the performance of different algorithms for distinct counting over a sliding window. We observe that the performance of an algorithm depends on the basic method used, as well as on aspects such as the hash function, the mix of queries and updates, and the method used to boost accuracy. We compare the performance of prominent algorithms and evaluate the influence of these factors, leading to practical recommendations for implementation. To the best of our knowledge, this is the first detailed experimental study of distinct counting over a sliding window.
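
    As one concrete example of the kind of algorithm such a study covers, a Flajolet-Martin bitmap can be adapted to a sliding window by timestamping each level: remember the most recent time an item's hash produced exactly j trailing zero bits, and at query time count only the levels still inside the window. The sketch below is an illustrative single-sketch variant with assumed names; a single instance has high variance, which is why implementations combine many sketches, the accuracy-boosting aspect evaluated in the paper.

```python
import hashlib

class SlidingWindowFM:
    """Timestamped Flajolet-Martin bitmap for distinct counting over
    a time-based sliding window (illustrative, single sketch).
    Timestamps passed to update() are assumed non-decreasing."""
    PHI = 0.77351   # Flajolet-Martin correction constant

    def __init__(self, bits=32):
        self.bits = bits
        self.last_seen = [None] * bits   # level j -> latest timestamp

    def _level(self, item):
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        h |= 1 << self.bits              # guarantees the loop terminates
        j = 0
        while h & 1 == 0:                # count trailing zero bits
            h >>= 1
            j += 1
        return min(j, self.bits - 1)

    def update(self, item, now):
        self.last_seen[self._level(item)] = now

    def estimate(self, now, window):
        # R = lowest level not refreshed within the window.
        R = 0
        while (R < self.bits and self.last_seen[R] is not None
               and self.last_seen[R] > now - window):
            R += 1
        return (2 ** R) / self.PHI
```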