Search CORE

97,373 research outputs found

An evaluation of streaming algorithms for distinct counting over a sliding window

Author: Singh Sneha Aman
Tirthapura Srikanta
Publication venue: Iowa State University Digital Repository
Publication date: 01/01/2015
Field of study

Counting the number of distinct elements in a data stream (distinct counting) is a fundamental aggregation task in database query processing, query optimization, and network monitoring. On a stream of elements, it is commonly needed to compute an aggregate over only the most recent elements, leading to the problem of distinct counting over a “sliding window” of the stream. We present a detailed experimental study of the performance of different algorithms for distinct counting over a sliding window. We observe that the performance of an algorithm depends on the basic method used, as well as aspects such as the hash function, the mix of query and updates, and the method used to boost accuracy. We compare the performance of prominent algorithms and evaluate the influence of these factors, leading to practical recommendations for implementation. To the best of our knowledge, this is the first detailed experimental study of distinct counting over a sliding window

Directory of Open Access Journals

Frontiers - Publisher Connector

Differentially Private Continual Releases of Streaming Frequency Moment Estimations

Author: Epasto Alessandro
Mao Jieming
Medina Andres Munoz
Mirrokni Vahab
Vassilvitskii Sergei
Zhong Peilin
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 14th Innovations in Theoretical Computer Science Conference (ITCS 2023)
Publication date: 01/01/2023
Field of study

The streaming model of computation is a popular approach for working with large-scale data. In this setting, there is a stream of items and the goal is to compute the desired quantities (usually data statistics) while making a single pass through the stream and using as little space as possible. Motivated by the importance of data privacy, we develop differentially private streaming algorithms under the continual release setting, where the union of outputs of the algorithm at every timestamp must be differentially private. Specifically, we study the fundamental ?_p (p ? [0,+?)) frequency moment estimation problem under this setting, and give an ?-DP algorithm that achieves (1+?)-relative approximation (? ? ? (0,1)) with polylog(Tn) additive error and uses polylog(Tn)? max(1, n^{1-2/p}) space, where T is the length of the stream and n is the size of the universe of elements. Our space is near optimal up to poly-logarithmic factors even in the non-private setting. To obtain our results, we first reduce several primitives under the differentially private continual release model, such as counting distinct elements, heavy hitters and counting low frequency elements, to the simpler, counting/summing problems in the same setting. Based on these primitives, we develop a differentially private continual release level set estimation approach to address the ?_p frequency moment estimation problem. We also provide a simple extension of our results to the harder sliding window model, where the statistics must be maintained over the past W data items

Dagstuhl Research Online Publication Server

Distinct counting with a self-learning bitmap

Author: Cao Jin
Chen Aiyou
Nguyen Tuan
Shepp Larry
Publication venue
Publication date: 08/07/2011
Field of study

Counting the number of distinct elements (cardinality) in a dataset is a fundamental problem in database management. In recent years, due to many of its modern applications, there has been significant interest to address the distinct counting problem in a data stream setting, where each incoming data can be seen only once and cannot be stored for long periods of time. Many probabilistic approaches based on either sampling or sketching have been proposed in the computer science literature, that only require limited computing and memory resources. However, the performances of these methods are not scale-invariant, in the sense that their relative root mean square estimation errors (RRMSE) depend on the unknown cardinalities. This is not desirable in many applications where cardinalities can be very dynamic or inhomogeneous and many cardinalities need to be estimated. In this paper, we develop a novel approach, called self-learning bitmap (S-bitmap) that is scale-invariant for cardinalities in a specified range. S-bitmap uses a binary vector whose entries are updated from 0 to 1 by an adaptive sampling process for inferring the unknown cardinality, where the sampling rates are reduced sequentially as more and more entries change from 0 to 1. We prove rigorously that the S-bitmap estimate is not only unbiased but scale-invariant. We demonstrate that to achieve a small RRMSE value of

\epsilon

or less, our approach requires significantly less memory and consumes similar or less operations than state-of-the-art methods for many common practice cardinality scales. Both simulation and experimental studies are reported.Comment: Journal of the American Statistical Association (accepted

arXiv.org e-Print Archive

FingerPrint Based Duplicate Detection in Streamed Data

Author: Batra Shalini
Singh Amritpal
Publication venue: Institute of Informatics, Slovak Academy of Sciences
Publication date: 04/02/2019
Field of study

In computing, duplicate data detection refers to identifying duplicate copies of repeating data. Identifying duplicate data items in streamed data and eliminating them before storing, is a complex job. This paper proposes a novel data structure for duplicate detection using a variant of stable Bloom filter named as FingerPrint Stable Bloom Filter (FP-SBF). The proposed approach uses counting Bloom filter with fingerprint bits along with an optimization mechanism for duplicate detection. FP-SBF uses d-left hashing which reduces the computational time and decreases the false positives as well as false negatives. FP-SBF can process unbounded data in single pass, using k hash functions, and successfully differentiate between duplicate and distinct elements in O(k+1) time, independent of the size of incoming data. The performance of FP-SBF has been compared with various Bloom Filters used for stream data duplication detection and it has been theoretically and experimentally proved that the proposed approach efficiently detects the duplicates in streaming data with less memory requirements

Boosting the Basic Counting on Distributed Streams

Author: Xu Bojian
Publication venue
Publication date: 29/11/2013
Field of study

We revisit the classic basic counting problem in the distributed streaming model that was studied by Gibbons and Tirthapura (GT). In the solution for maintaining an

(\epsilon,\delta)

-estimate, as what GT's method does, we make the following new contributions: (1) For a bit stream of size

n

, where each bit has a probability at least

\gamma

to be 1, we exponentially reduced the average total processing time from GT's

\Theta(n \log(1/\delta))

O((1/(\gamma\epsilon^2))(\log^2 n) \log(1/\delta))

, thus providing the first sublinear-time streaming algorithm for this problem. (2) In addition to an overall much faster processing speed, our method provides a new tradeoff that a lower accuracy demand (a larger value for

\epsilon

) promises a faster processing speed, whereas GT's processing speed is

\Theta(n \log(1/\delta))

in any case and for any

\epsilon

. (3) The worst-case total time cost of our method matches GT's

\Theta(n\log(1/\delta))

, which is necessary but rarely occurs in our method. (4) The space usage overhead in our method is a lower order term compared with GT's space usage and occurs only

O(\log n)

times during the stream processing and is too negligible to be detected by the operating system in practice. We further validate these solid theoretical results with experiments on both real-world and synthetic data, showing that our method is faster than GT's by a factor of several to several thousands depending on the stream size and accuracy demands, without any detectable space usage overhead. Our method is based on a faster sampling technique that we design for boosting GT's method and we believe this technique can be of other interest.Comment: 32 page

arXiv.org e-Print Archive

CiteSeerX