97,373 research outputs found
An evaluation of streaming algorithms for distinct counting over a sliding window
Counting the number of distinct elements in a data stream (distinct counting) is a fundamental aggregation task in database query processing, query optimization, and network monitoring. On a stream of elements, it is commonly needed to compute an aggregate over only the most recent elements, leading to the problem of distinct counting over a “sliding window” of the stream. We present a detailed experimental study of the performance of different algorithms for distinct counting over a sliding window. We observe that the performance of an algorithm depends on the basic method used, as well as aspects such as the hash function, the mix of query and updates, and the method used to boost accuracy. We compare the performance of prominent algorithms and evaluate the influence of these factors, leading to practical recommendations for implementation. To the best of our knowledge, this is the first detailed experimental study of distinct counting over a sliding window
Differentially Private Continual Releases of Streaming Frequency Moment Estimations
The streaming model of computation is a popular approach for working with large-scale data. In this setting, there is a stream of items and the goal is to compute the desired quantities (usually data statistics) while making a single pass through the stream and using as little space as possible.
Motivated by the importance of data privacy, we develop differentially private streaming algorithms under the continual release setting, where the union of outputs of the algorithm at every timestamp must be differentially private. Specifically, we study the fundamental ?_p (p ? [0,+?)) frequency moment estimation problem under this setting, and give an ?-DP algorithm that achieves (1+?)-relative approximation (? ? ? (0,1)) with polylog(Tn) additive error and uses polylog(Tn)? max(1, n^{1-2/p}) space, where T is the length of the stream and n is the size of the universe of elements. Our space is near optimal up to poly-logarithmic factors even in the non-private setting.
To obtain our results, we first reduce several primitives under the differentially private continual release model, such as counting distinct elements, heavy hitters and counting low frequency elements, to the simpler, counting/summing problems in the same setting. Based on these primitives, we develop a differentially private continual release level set estimation approach to address the ?_p frequency moment estimation problem.
We also provide a simple extension of our results to the harder sliding window model, where the statistics must be maintained over the past W data items
Distinct counting with a self-learning bitmap
Counting the number of distinct elements (cardinality) in a dataset is a
fundamental problem in database management. In recent years, due to many of its
modern applications, there has been significant interest to address the
distinct counting problem in a data stream setting, where each incoming data
can be seen only once and cannot be stored for long periods of time. Many
probabilistic approaches based on either sampling or sketching have been
proposed in the computer science literature, that only require limited
computing and memory resources. However, the performances of these methods are
not scale-invariant, in the sense that their relative root mean square
estimation errors (RRMSE) depend on the unknown cardinalities. This is not
desirable in many applications where cardinalities can be very dynamic or
inhomogeneous and many cardinalities need to be estimated. In this paper, we
develop a novel approach, called self-learning bitmap (S-bitmap) that is
scale-invariant for cardinalities in a specified range. S-bitmap uses a binary
vector whose entries are updated from 0 to 1 by an adaptive sampling process
for inferring the unknown cardinality, where the sampling rates are reduced
sequentially as more and more entries change from 0 to 1. We prove rigorously
that the S-bitmap estimate is not only unbiased but scale-invariant. We
demonstrate that to achieve a small RRMSE value of or less, our
approach requires significantly less memory and consumes similar or less
operations than state-of-the-art methods for many common practice cardinality
scales. Both simulation and experimental studies are reported.Comment: Journal of the American Statistical Association (accepted
FingerPrint Based Duplicate Detection in Streamed Data
In computing, duplicate data detection refers to identifying duplicate copies of repeating data. Identifying duplicate data items in streamed data and eliminating them before storing, is a complex job. This paper proposes a novel data structure for duplicate detection using a variant of stable Bloom filter named as FingerPrint Stable Bloom Filter (FP-SBF). The proposed approach uses counting Bloom filter with fingerprint bits along with an optimization mechanism for duplicate detection. FP-SBF uses d-left hashing which reduces the computational time and decreases the false positives as well as false negatives. FP-SBF can process unbounded data in single pass, using k hash functions, and successfully differentiate between duplicate and distinct elements in O(k+1) time, independent of the size of incoming data. The performance of FP-SBF has been compared with various Bloom Filters used for stream data duplication detection and it has been theoretically and experimentally proved that the proposed approach efficiently detects the duplicates in streaming data with less memory requirements
Boosting the Basic Counting on Distributed Streams
We revisit the classic basic counting problem in the distributed streaming
model that was studied by Gibbons and Tirthapura (GT). In the solution for
maintaining an -estimate, as what GT's method does, we make
the following new contributions: (1) For a bit stream of size , where each
bit has a probability at least to be 1, we exponentially reduced the
average total processing time from GT's to
, thus providing the first
sublinear-time streaming algorithm for this problem. (2) In addition to an
overall much faster processing speed, our method provides a new tradeoff that a
lower accuracy demand (a larger value for ) promises a faster
processing speed, whereas GT's processing speed is
in any case and for any . (3) The worst-case total time cost of our
method matches GT's , which is necessary but rarely
occurs in our method. (4) The space usage overhead in our method is a lower
order term compared with GT's space usage and occurs only times
during the stream processing and is too negligible to be detected by the
operating system in practice. We further validate these solid theoretical
results with experiments on both real-world and synthetic data, showing that
our method is faster than GT's by a factor of several to several thousands
depending on the stream size and accuracy demands, without any detectable space
usage overhead. Our method is based on a faster sampling technique that we
design for boosting GT's method and we believe this technique can be of other
interest.Comment: 32 page
- …