    Estimating Entropy of Data Streams Using Compressed Counting

    The Shannon entropy is a widely used summary statistic in, for example, network traffic measurement, anomaly detection, and the analysis of neural computation and spike trains. This study focuses on estimating the Shannon entropy of data streams. It is known that Shannon entropy can be approximated by Rényi entropy or Tsallis entropy, both of which are functions of the p-th frequency moments and approach Shannon entropy as p -> 1. Compressed Counting (CC) is a new method for approximating the p-th frequency moments of data streams. Our contributions include: 1) we prove that Rényi entropy is (much) better than Tsallis entropy for approximating Shannon entropy; 2) we propose the optimal quantile estimator for CC, which considerably improves on the previous estimators; 3) our experiments demonstrate that CC is indeed highly effective at approximating the moments and entropies. We also demonstrate the crucial importance of exploiting the variance-bias trade-off.
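
    A minimal numeric check (not the paper's code; plain NumPy, with a random distribution standing in for the stream's normalized frequencies) illustrates both entropies converging to the Shannon entropy as p -> 1, with the Rényi approximation noticeably closer:

```python
import numpy as np

def shannon(q):
    return -np.sum(q * np.log(q))

def renyi(q, p):
    return np.log(np.sum(q ** p)) / (1.0 - p)

def tsallis(q, p):
    return (1.0 - np.sum(q ** p)) / (p - 1.0)

rng = np.random.default_rng(0)
q = rng.dirichlet(np.ones(50))   # a random 50-symbol distribution
H = shannon(q)

for p in [1.2, 1.05, 1.01, 1.001]:
    print(f"p = {p:6.3f}   |Renyi - H| = {abs(renyi(q, p) - H):.2e}   "
          f"|Tsallis - H| = {abs(tsallis(q, p) - H):.2e}")
```

    To first order in p - 1, the Rényi gap is proportional to the variance of -log q_i, while the Tsallis gap additionally carries an H^2 term, consistent with contribution 1) above.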

    On Practical Algorithms for Entropy Estimation and the Improved Sample Complexity of Compressed Counting

    Estimating the p-th frequency moment of a data stream is a very heavily studied problem. The problem is actually trivial when p = 1, assuming the strict Turnstile model. The sample complexity of our proposed algorithm is essentially O(1) near p = 1, a very large improvement over the previously believed O(1/ε^2) bound. The proposed algorithm makes the long-standing problem of entropy estimation an easy task, as verified by the experiments included in the appendix.
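
    To see why p = 1 is trivial under the strict Turnstile model, note that the first moment F_1 = sum_i x_i is simply the running sum of all update increments, so a single exact counter suffices. A tiny illustration (not the paper's algorithm, which targets p near 1 rather than p = 1):

```python
from collections import Counter

# Strict turnstile stream: updates (i, delta), with the promise that
# every coordinate of the frequency vector stays non-negative.
updates = [(3, 5), (7, 2), (3, -4), (1, 6), (7, -1)]

f1 = 0                    # a single exact counter suffices for p = 1
for _, delta in updates:
    f1 += delta           # F_1 = sum_i x_i is the sum of all deltas

# Sanity check against the materialized frequency vector.
x = Counter()
for i, delta in updates:
    x[i] += delta
assert f1 == sum(x.values())
print("F_1 =", f1)        # -> 8
```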

    Continuous Monitoring of l_p Norms in Data Streams

    In insertion-only streaming, one sees a sequence of indices a_1, a_2, ..., a_m in [n]. The stream defines a sequence of m frequency vectors x(1), ..., x(m), each in R^n, where x(t) is the frequency vector of items after seeing the first t indices in the stream. Much work in the streaming literature focuses on estimating some function f(x(m)). Many applications, though, require an estimate of f(x(t)) at every time t in [m]. Naively, this guarantee is obtained by devising an algorithm with failure probability less than 1/m and then taking a union bound over all stream updates, so that all m estimates are simultaneously accurate with good probability. When f(x) is some l_p norm of x, recent works have shown that this union bound is wasteful and better space complexity is possible for the continuous monitoring problem, with the strongest known results being for p = 2. In this work, we improve the state of the art for all 0 < p < 2, which we obtain via a novel analysis of Indyk's p-stable sketch.
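
    For background, here is a minimal version of Indyk's p-stable sketch for p = 1, the Cauchy case (the median of |Cauchy| is 1, so the median of the sketch's absolute entries estimates the l_1 norm). This is the classic construction queried naively at each step, not the paper's improved monitoring analysis; a real implementation would also generate the Cauchy entries pseudorandomly on the fly rather than storing the matrix A:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 1000, 200                      # universe size, sketch size
A = rng.standard_cauchy((k, n))       # 1-stable (Cauchy) projection

y = np.zeros(k)                       # sketch of the current frequency vector
x = np.zeros(n)                       # ground truth, kept only for checking

stream = rng.integers(0, n, size=5000)
for t, i in enumerate(stream, 1):
    y += A[:, i]                      # O(k) work per stream update
    x[i] += 1
    if t % 1000 == 0:
        est = np.median(np.abs(y))    # estimates ||x(t)||_1
        print(f"t = {t}: estimate {est:.0f}, true l_1 norm {x.sum():.0f}")
```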

    Frequency Estimation in Data Streams: Learning the Optimal Hashing Scheme

    We present a novel approach to the problem of frequency estimation in data streams that is based on optimization and machine learning. Contrary to state-of-the-art streaming frequency estimation algorithms, which rely heavily on random hashing to maintain the frequency distribution of the data stream using limited storage, the proposed approach exploits an observed stream prefix to near-optimally hash elements and compress the target frequency distribution. We develop an exact mixed-integer linear optimization formulation, which enables us to compute optimal or near-optimal hashing schemes for elements seen in the observed stream prefix; then, we use machine learning to hash unseen elements. Further, we develop an efficient block coordinate descent algorithm which, as we empirically show, produces high-quality solutions, and, in a special case, we are able to solve the proposed formulation exactly in linear time using dynamic programming. We empirically evaluate the proposed approach both on synthetic datasets and on real-world search query data. We show that the proposed approach outperforms existing approaches by one to two orders of magnitude in terms of its average (per element) estimation error and by 45-90% in terms of its expected magnitude of estimation error.
    Comment: Submitted to IEEE Transactions on Knowledge and Data Engineering on 07/2020. Revised on 05/202
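
    A toy sketch of the core idea (a hypothetical simplification, not the paper's mixed-integer formulation or its learned oracle): elements that are heavy in the observed prefix receive dedicated counters, while everything else is randomly hashed into shared buckets, pushing collisions, and hence overestimation error, onto the light tail:

```python
import random
from collections import Counter

N_DEDICATED, N_SHARED = 50, 200

def shared_hash(e):
    # Simple multiplicative hash for elements without a dedicated bucket.
    return (e * 2654435761) % N_SHARED

# Zipf-like stream: element k appears with weight ~ 1/k.
random.seed(0)
universe = list(range(1, 2001))
stream = random.choices(universe, weights=[1.0 / k for k in universe], k=20000)
prefix = stream[:2000]

# "Learn" the hashing scheme from the prefix: the heaviest elements
# get their own counters; the rest share hashed buckets.
heavy = [e for e, _ in Counter(prefix).most_common(N_DEDICATED)]
slot = {e: j for j, e in enumerate(heavy)}

def bucket(e):
    return slot[e] if e in slot else N_DEDICATED + shared_hash(e)

counts = [0] * (N_DEDICATED + N_SHARED)
for e in stream:
    counts[bucket(e)] += 1

# Prefix-heavy elements are estimated exactly; collisions only hurt the tail.
true = Counter(stream)
total_err = sum(counts[bucket(e)] - c for e, c in true.items())
print("total overestimation error:", total_err)
```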