
    An evaluation of streaming algorithms for distinct counting over a sliding window

    Counting the number of distinct elements in a data stream (distinct counting) is a fundamental aggregation task in database query processing, query optimization, and network monitoring. On a stream of elements, it is often necessary to compute an aggregate over only the most recent elements, leading to the problem of distinct counting over a “sliding window” of the stream. We present a detailed experimental study of the performance of different algorithms for distinct counting over a sliding window. We observe that the performance of an algorithm depends on the basic method used, as well as on aspects such as the hash function, the mix of queries and updates, and the method used to boost accuracy. We compare the performance of prominent algorithms and evaluate the influence of these factors, leading to practical recommendations for implementation. To the best of our knowledge, this is the first detailed experimental study of distinct counting over a sliding window.
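
    For reference, the quantity these algorithms approximate can be computed exactly by remembering the last position at which each element was seen; a query then counts the elements whose last occurrence falls inside the window. The minimal Python sketch below (class and parameter names are illustrative, not from the paper) uses O(#distinct) space, which is precisely the cost the sketch-based algorithms in this study are designed to avoid.

    ```python
    class ExactSlidingDistinct:
        """Exact distinct count over the last `window` stream positions.
        Space is O(#distinct elements); sketches trade this for approximation."""

        def __init__(self, window):
            self.window = window
            self.last_seen = {}  # element -> most recent position
            self.t = 0

        def update(self, x):
            self.t += 1
            self.last_seen[x] = self.t

        def query(self):
            lo = self.t - self.window + 1  # window covers positions [lo, t]
            # Lazily drop expired entries, then count what remains.
            self.last_seen = {x: p for x, p in self.last_seen.items() if p >= lo}
            return len(self.last_seen)
    ```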

    Differentially Private Continual Releases of Streaming Frequency Moment Estimations

    The streaming model of computation is a popular approach for working with large-scale data. In this setting, there is a stream of items and the goal is to compute the desired quantities (usually data statistics) while making a single pass through the stream and using as little space as possible. Motivated by the importance of data privacy, we develop differentially private streaming algorithms under the continual release setting, where the union of the outputs of the algorithm at every timestamp must be differentially private. Specifically, we study the fundamental ℓ_p (p ∈ [0, +∞)) frequency moment estimation problem under this setting, and give an ε-DP algorithm that achieves a (1+η)-relative approximation (for all η ∈ (0,1)) with polylog(Tn) additive error and uses polylog(Tn) · max(1, n^{1-2/p}) space, where T is the length of the stream and n is the size of the universe of elements. Our space is near-optimal up to polylogarithmic factors, even in the non-private setting. To obtain our results, we first reduce several primitives under the differentially private continual release model, such as counting distinct elements, heavy hitters, and counting low-frequency elements, to simpler counting/summing problems in the same setting. Based on these primitives, we develop a differentially private continual release level set estimation approach to address the ℓ_p frequency moment estimation problem. We also provide a simple extension of our results to the harder sliding window model, where the statistics must be maintained over the past W data items.
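
    The counting primitive that such reductions target can be made concrete with the classic binary tree mechanism for ε-DP continual counting (Dwork et al.; Chan, Shi, and Song). The Python sketch below is that standard mechanism, not the paper's algorithm, and the names and parameters are illustrative: it maintains noisy partial sums (p-sums) over dyadic intervals so that each stream item affects only O(log T) of them.

    ```python
    import math, random

    def laplace(scale):
        """Sample Laplace(0, scale) noise via the inverse CDF."""
        u = random.random() - 0.5
        return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

    class BinaryMechanism:
        """eps-DP continual counter for a bit stream of known length T
        (classic binary tree mechanism; illustrative, not the paper's algorithm)."""

        def __init__(self, T, eps):
            self.L = max(1, math.ceil(math.log2(T + 1)))  # tree levels
            self.scale = (self.L + 1) / eps  # each item touches <= L+1 p-sums
            self.t = 0
            self.alpha = [0.0] * (self.L + 1)      # exact p-sums per level
            self.alpha_hat = [0.0] * (self.L + 1)  # noisy p-sums per level

        def update(self, bit):
            """Ingest x_t in {0, 1}; return the private running count."""
            self.t += 1
            i = (self.t & -self.t).bit_length() - 1  # lowest set bit of t
            self.alpha[i] = sum(self.alpha[j] for j in range(i)) + bit
            for j in range(i):  # the p-sums below level i close and reset
                self.alpha[j] = 0.0
                self.alpha_hat[j] = 0.0
            self.alpha_hat[i] = self.alpha[i] + laplace(self.scale)
            # Running count = sum of noisy p-sums for the set bits of t.
            return sum(self.alpha_hat[j]
                       for j in range(self.L + 1) if (self.t >> j) & 1)
    ```

    Each released prefix count then aggregates only O(log T) noise terms, which is the source of polylogarithmic additive error in continual release results of this kind.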

    Distinct counting with a self-learning bitmap

    Counting the number of distinct elements (cardinality) in a dataset is a fundamental problem in database management. In recent years, due to many of its modern applications, there has been significant interest in addressing the distinct counting problem in a data stream setting, where each incoming data item can be seen only once and cannot be stored for long periods of time. Many probabilistic approaches based on either sampling or sketching have been proposed in the computer science literature that require only limited computing and memory resources. However, the performance of these methods is not scale-invariant, in the sense that their relative root mean square estimation error (RRMSE) depends on the unknown cardinality. This is not desirable in many applications where cardinalities can be very dynamic or inhomogeneous and many cardinalities need to be estimated. In this paper, we develop a novel approach, called the self-learning bitmap (S-bitmap), that is scale-invariant for cardinalities in a specified range. The S-bitmap uses a binary vector whose entries are updated from 0 to 1 by an adaptive sampling process for inferring the unknown cardinality, where the sampling rates are reduced sequentially as more and more entries change from 0 to 1. We prove rigorously that the S-bitmap estimate is not only unbiased but also scale-invariant. We demonstrate that to achieve a small RRMSE value of ε or less, our approach requires significantly less memory and performs a similar or smaller number of operations than state-of-the-art methods for many common cardinality scales. Both simulation and experimental studies are reported. (Accepted to the Journal of the American Statistical Association.)
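
    The update rule can be sketched as follows. This is a simplified illustration of the adaptive-sampling idea only: the geometric rate decay and the martingale-style estimator increment below are assumptions chosen for brevity, not the paper's calibrated schedule that yields the proven unbiasedness and scale-invariance.

    ```python
    import hashlib

    class AdaptiveSamplingBitmap:
        """Adaptive-sampling bitmap in the spirit of the S-bitmap (simplified)."""

        def __init__(self, m=1024, r=0.99):
            self.m = m                 # bitmap size
            self.bits = [False] * m
            self.filled = 0            # number of 1-entries so far
            self.estimate = 0.0
            self.r = r                 # illustrative geometric rate decay

        def _hash(self, x):
            # Deterministic hash so duplicates behave identically.
            h = hashlib.sha1(str(x).encode()).digest()
            bucket = int.from_bytes(h[:4], "big") % self.m
            u = int.from_bytes(h[4:8], "big") / 2.0**32  # uniform in [0, 1)
            return bucket, u

        def add(self, x):
            bucket, u = self._hash(x)
            p = self.r ** self.filled  # sampling rate shrinks as bits fill
            if not self.bits[bucket] and u < p:
                # Probability a fresh distinct item fills a new bit right now:
                q = p * (self.m - self.filled) / self.m
                self.bits[bucket] = True
                self.filled += 1
                self.estimate += 1.0 / q  # martingale-style increment

        def count(self):
            return self.estimate
    ```

    Note that a duplicate hashes to the same (bucket, u) pair and the rate only ever decreases, so a repeated element can never flip a new bit: the bitmap state depends only on the set of distinct elements.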

    FingerPrint Based Duplicate Detection in Streamed Data

    In computing, duplicate data detection refers to identifying duplicate copies of repeating data. Identifying duplicate items in streamed data and eliminating them before storage is a complex task. This paper proposes a novel data structure for duplicate detection using a variant of the stable Bloom filter, named the FingerPrint Stable Bloom Filter (FP-SBF). The proposed approach uses a counting Bloom filter with fingerprint bits, along with an optimization mechanism for duplicate detection. FP-SBF uses d-left hashing, which reduces computation time and decreases both false positives and false negatives. FP-SBF can process unbounded data in a single pass, using k hash functions, and successfully differentiates between duplicate and distinct elements in O(k+1) time, independent of the size of the incoming data. The performance of FP-SBF has been compared with various Bloom filters used for stream data duplicate detection, and it has been shown, both theoretically and experimentally, that the proposed approach efficiently detects duplicates in streaming data with lower memory requirements.
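
    For context, the stable Bloom filter that FP-SBF builds on works as sketched below (after Deng and Rafiei): bounded counters serve as cells, every arriving item refreshes its k cells to a maximum value, and a few random cells are decremented per item so that stale entries decay. The parameters and hash construction here are illustrative; the fingerprint bits and d-left hashing that distinguish FP-SBF are not shown.

    ```python
    import hashlib, random

    class StableBloomFilter:
        """Approximate duplicate detection in a stream (Deng & Rafiei's SBF).
        FP-SBF extends this idea with fingerprint bits and d-left hashing."""

        def __init__(self, m=10000, k=3, max_val=3, p=10):
            self.cells = [0] * m  # bounded counters instead of plain bits
            self.m, self.k, self.max_val, self.p = m, k, max_val, p

        def _positions(self, item):
            # k salted hashes; illustrative construction.
            return [int.from_bytes(
                        hashlib.sha1(f"{i}:{item}".encode()).digest()[:8],
                        "big") % self.m
                    for i in range(self.k)]

        def seen(self, item):
            """Return True if item looks like a duplicate, then update state."""
            pos = self._positions(item)
            duplicate = all(self.cells[j] > 0 for j in pos)
            # Decay: decrement p random cells so old items eventually expire.
            for _ in range(self.p):
                j = random.randrange(self.m)
                if self.cells[j] > 0:
                    self.cells[j] -= 1
            for j in pos:  # refresh this item's cells to the maximum
                self.cells[j] = self.max_val
            return duplicate
    ```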

    Boosting the Basic Counting on Distributed Streams

    We revisit the classic basic counting problem in the distributed streaming model that was studied by Gibbons and Tirthapura (GT). For maintaining an (ε, δ)-estimate, as GT's method does, we make the following new contributions: (1) For a bit stream of size n, where each bit has probability at least γ of being 1, we exponentially reduce the average total processing time from GT's Θ(n log(1/δ)) to O((1/(γε²)) (log² n) log(1/δ)), thus providing the first sublinear-time streaming algorithm for this problem. (2) In addition to a much faster overall processing speed, our method provides a new tradeoff: a lower accuracy demand (a larger value of ε) yields a faster processing speed, whereas GT's processing time is Θ(n log(1/δ)) in every case and for every ε. (3) The worst-case total time cost of our method matches GT's Θ(n log(1/δ)), which is necessary but rarely occurs with our method. (4) The space usage overhead of our method is a lower-order term compared with GT's space usage; it occurs only O(log n) times during stream processing and is too small to be detected by the operating system in practice. We further validate these theoretical results with experiments on both real-world and synthetic data, showing that our method is faster than GT's by a factor of several to several thousand, depending on the stream size and accuracy demands, without any detectable space usage overhead. Our method is based on a faster sampling technique that we design to boost GT's method, and we believe this technique may be of independent interest.
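
    To see where the log(1/δ) factor in both bounds comes from, consider a generic sampling-based estimator boosted with the standard median trick: run Θ(log(1/δ)) independent sampled counters in one pass and report the median. The sketch below is that textbook scheme with illustrative parameters, not GT's coordinated sampling nor the authors' faster variant; note that every 1-bit pays for a coin flip in each of the O(log(1/δ)) copies, mirroring the Θ(n log(1/δ)) per-stream processing cost discussed above.

    ```python
    import random, statistics

    def basic_count(stream, p, copies):
        """Estimate the number of 1s in a bit stream in one pass.

        Each of `copies` independent counters samples every 1-bit with
        probability p; the median of the scaled counts boosts the success
        probability to 1 - delta when copies = O(log(1/delta)).
        """
        hits = [0] * copies
        for b in stream:
            if b:  # each 1-bit does O(copies) work: the log(1/delta) factor
                for i in range(copies):
                    if random.random() < p:
                        hits[i] += 1
        return statistics.median(h / p for h in hits)

    # Illustrative use: roughly 1,000 ones among 10,000 bits.
    bits = [1 if random.random() < 0.1 else 0 for _ in range(10000)]
    print(basic_count(bits, p=0.2, copies=9))
    ```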