    Estimating Entropy of Data Streams Using Compressed Counting

    The Shannon entropy is a widely used summary statistic in, for example, network traffic measurement, anomaly detection, and the analysis of neural computation and spike trains. This study focuses on estimating the Shannon entropy of data streams. It is known that Shannon entropy can be approximated by Rényi entropy or Tsallis entropy, both of which are functions of the p-th frequency moments and approach Shannon entropy as p -> 1. Compressed Counting (CC) is a new method for approximating the p-th frequency moments of data streams. Our contributions include: 1) we prove that Rényi entropy is (much) better than Tsallis entropy for approximating Shannon entropy; 2) we propose the optimal quantile estimator for CC, which considerably improves on the previous estimators; 3) our experiments demonstrate that CC is indeed highly effective at approximating the moments and entropies. We also demonstrate the crucial importance of exploiting the variance-bias trade-off.
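
    A minimal numeric check (not the paper's code; plain NumPy, with a random distribution standing in for the stream's normalized frequencies) illustrates both entropies converging to the Shannon entropy as p -> 1, with the Rényi approximation noticeably closer:

```python
import numpy as np

def shannon(q):
    return -np.sum(q * np.log(q))

def renyi(q, p):
    return np.log(np.sum(q ** p)) / (1.0 - p)

def tsallis(q, p):
    return (1.0 - np.sum(q ** p)) / (p - 1.0)

rng = np.random.default_rng(0)
q = rng.dirichlet(np.ones(50))   # a random 50-symbol distribution
H = shannon(q)

for p in [1.2, 1.05, 1.01, 1.001]:
    print(f"p = {p:6.3f}   |Renyi - H| = {abs(renyi(q, p) - H):.2e}   "
          f"|Tsallis - H| = {abs(tsallis(q, p) - H):.2e}")
```

    To first order in p - 1, the Rényi gap is proportional to the variance of -log q_i, while the Tsallis gap additionally carries an H^2 term, consistent with contribution 1) above.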

    On Practical Algorithms for Entropy Estimation and the Improved Sample Complexity of Compressed Counting

    Estimating the p-th frequency moment of a data stream is a very heavily studied problem. The problem is actually trivial when p = 1, assuming the strict Turnstile model. The sample complexity of our proposed algorithm is essentially O(1) near p = 1, a very large improvement over the previously believed O(1/ε^2) bound. The proposed algorithm makes the long-standing problem of entropy estimation an easy task, as verified by the experiments included in the appendix.
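
    To see why p = 1 is trivial under the strict Turnstile model, note that the first moment F_1 = sum_i x_i is simply the running sum of all update increments, so a single exact counter suffices. A tiny illustration (not the paper's algorithm, which targets p near 1 rather than p = 1):

```python
from collections import Counter

# Strict turnstile stream: updates (i, delta), with the promise that
# every coordinate of the frequency vector stays non-negative.
updates = [(3, 5), (7, 2), (3, -4), (1, 6), (7, -1)]

f1 = 0                    # a single exact counter suffices for p = 1
for _, delta in updates:
    f1 += delta           # F_1 = sum_i x_i is the sum of all deltas

# Sanity check against the materialized frequency vector.
x = Counter()
for i, delta in updates:
    x[i] += delta
assert f1 == sum(x.values())
print("F_1 =", f1)        # -> 8
```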

    Continuous Monitoring of l_p Norms in Data Streams

    In insertion-only streaming, one sees a sequence of indices a_1, a_2, ..., a_m in [n]. The stream defines a sequence of m frequency vectors x(1), ..., x(m), each in R^n, where x(t) is the frequency vector of items after seeing the first t indices in the stream. Much work in the streaming literature focuses on estimating some function f(x(m)). Many applications, though, require an estimate of f(x(t)) at every time t in [m]. Naively, this guarantee is obtained by devising an algorithm with failure probability less than 1/m and then taking a union bound over all stream updates, so that all m estimates are simultaneously accurate with good probability. When f(x) is some l_p norm of x, recent works have shown that this union bound is wasteful and better space complexity is possible for the continuous monitoring problem, with the strongest known results being for p = 2. In this work, we improve the state of the art for all 0 < p < 2, which we obtain via a novel analysis of Indyk's p-stable sketch.
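
    For background, here is a minimal version of Indyk's p-stable sketch for p = 1, the Cauchy case (the median of |Cauchy| is 1, so the median of the sketch's absolute entries estimates the l_1 norm). This is the classic construction queried naively at each step, not the paper's improved monitoring analysis; a real implementation would also generate the Cauchy entries pseudorandomly on the fly rather than storing the matrix A:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 1000, 200                      # universe size, sketch size
A = rng.standard_cauchy((k, n))       # 1-stable (Cauchy) projection

y = np.zeros(k)                       # sketch of the current frequency vector
x = np.zeros(n)                       # ground truth, kept only for checking

stream = rng.integers(0, n, size=5000)
for t, i in enumerate(stream, 1):
    y += A[:, i]                      # O(k) work per stream update
    x[i] += 1
    if t % 1000 == 0:
        est = np.median(np.abs(y))    # estimates ||x(t)||_1
        print(f"t = {t}: estimate {est:.0f}, true l_1 norm {x.sum():.0f}")
```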

    Frequency Estimation in Data Streams: Learning the Optimal Hashing Scheme

    We present a novel approach to the problem of frequency estimation in data streams that is based on optimization and machine learning. Contrary to state-of-the-art streaming frequency estimation algorithms, which rely heavily on random hashing to maintain the frequency distribution of the data stream using limited storage, the proposed approach exploits an observed stream prefix to near-optimally hash elements and compress the target frequency distribution. We develop an exact mixed-integer linear optimization formulation, which enables us to compute optimal or near-optimal hashing schemes for elements seen in the observed stream prefix; then, we use machine learning to hash unseen elements. Further, we develop an efficient block coordinate descent algorithm which, as we empirically show, produces high-quality solutions, and, in a special case, we are able to solve the proposed formulation exactly in linear time using dynamic programming. We empirically evaluate the proposed approach both on synthetic datasets and on real-world search query data. We show that the proposed approach outperforms existing approaches by one to two orders of magnitude in terms of its average (per element) estimation error and by 45-90% in terms of its expected magnitude of estimation error.
    Comment: Submitted to IEEE Transactions on Knowledge and Data Engineering on 07/2020. Revised on 05/202
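
    A toy sketch of the core idea (a hypothetical simplification, not the paper's mixed-integer formulation or its learned oracle): elements that are heavy in the observed prefix receive dedicated counters, while everything else is randomly hashed into shared buckets, pushing collisions, and hence overestimation error, onto the light tail:

```python
import random
from collections import Counter

N_DEDICATED, N_SHARED = 50, 200

def shared_hash(e):
    # Simple multiplicative hash for elements without a dedicated bucket.
    return (e * 2654435761) % N_SHARED

# Zipf-like stream: element k appears with weight ~ 1/k.
random.seed(0)
universe = list(range(1, 2001))
stream = random.choices(universe, weights=[1.0 / k for k in universe], k=20000)
prefix = stream[:2000]

# "Learn" the hashing scheme from the prefix: the heaviest elements
# get their own counters; the rest share hashed buckets.
heavy = [e for e, _ in Counter(prefix).most_common(N_DEDICATED)]
slot = {e: j for j, e in enumerate(heavy)}

def bucket(e):
    return slot[e] if e in slot else N_DEDICATED + shared_hash(e)

counts = [0] * (N_DEDICATED + N_SHARED)
for e in stream:
    counts[bucket(e)] += 1

# Prefix-heavy elements are estimated exactly; collisions only hurt the tail.
true = Counter(stream)
total_err = sum(counts[bucket(e)] - c for e, c in true.items())
print("total overestimation error:", total_err)
```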