213 research outputs found

    On Practical Algorithms for Entropy Estimation and the Improved Sample Complexity of Compressed Counting

    Estimating the p-th frequency moment of a data stream is a very heavily studied problem. The problem is actually trivial when p = 1, assuming the strict turnstile model. The sample complexity of our proposed algorithm is essentially O(1) near p = 1. This is a very large improvement over the previously believed O(1/eps^2) bound. The proposed algorithm makes the long-standing problem of entropy estimation an easy task, as verified by the experiments included in the appendix.

    Estimating Entropy of Data Streams Using Compressed Counting

    The Shannon entropy is a widely used summary statistic, for example in network traffic measurement, anomaly detection, neural computations, and spike trains. This study focuses on estimating the Shannon entropy of data streams. It is known that Shannon entropy can be approximated by Rényi entropy or Tsallis entropy, which are both functions of the p-th frequency moments and approach Shannon entropy as p → 1. Compressed Counting (CC) is a new method for approximating the p-th frequency moments of data streams. Our contributions include: 1) We prove that Rényi entropy is (much) better than Tsallis entropy for approximating Shannon entropy. 2) We propose the optimal quantile estimator for CC, which considerably improves on the previous estimators. 3) Our experiments demonstrate that CC is indeed highly effective at approximating the moments and entropies. We also demonstrate the crucial importance of utilizing the variance-bias trade-off.
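    The relationship described in this abstract can be illustrated with a minimal sketch. It computes exact (not sketched) frequency moments from a small list of item counts and evaluates the standard Rényi and Tsallis formulas near p = 1. The function names and the sample counts are illustrative assumptions; this is not the Compressed Counting estimator itself.

```python
import math

def frequency_moment(counts, p):
    """Exact p-th frequency moment F_p = sum_i f_i^p (no sketching)."""
    return sum(c ** p for c in counts if c > 0)

def shannon_entropy(counts):
    """Shannon entropy (in nats) of the empirical distribution f_i / F_1."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def renyi_entropy(counts, p):
    """Renyi entropy of order p != 1, written via F_p and F_1."""
    f1 = frequency_moment(counts, 1)
    fp = frequency_moment(counts, p)
    return math.log(fp / f1 ** p) / (1 - p)

def tsallis_entropy(counts, p):
    """Tsallis entropy of order p != 1, written via F_p and F_1."""
    f1 = frequency_moment(counts, 1)
    fp = frequency_moment(counts, p)
    return (1 - fp / f1 ** p) / (p - 1)

# Hypothetical item counts, chosen only for illustration.
counts = [5, 3, 2]
p = 1.001  # order close to 1

print("Shannon:", shannon_entropy(counts))
print("Renyi  :", renyi_entropy(counts, p))
print("Tsallis:", tsallis_entropy(counts, p))
```

    Both approximations depend on the stream only through F_p and F_1, which is why a frequency-moment estimator can be reused for entropy estimation.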

    Algorithmic Techniques for Processing Data Streams

    We give a survey of some algorithmic techniques for processing data streams. After covering the basic methods of sampling and sketching, we present more evolved procedures that build on those basic ones. In particular, we examine algorithmic schemes for similarity mining, the concept of group testing, and techniques for clustering and summarizing data streams.

    A Simple Proof of a New Set Disjointness with Applications to Data Streams

    Differentially Private Fractional Frequency Moments Estimation with Polylogarithmic Space

    We prove that the Fp sketch, a celebrated streaming algorithm for frequency moments estimation, is differentially private as-is when p ∈ (0, 1]. The Fp sketch uses only polylogarithmic space, exponentially better than existing DP baselines and only worse than the optimal non-private baseline by a logarithmic factor. The evaluation shows that the Fp sketch can achieve reasonable accuracy with a differential privacy guarantee. The evaluation code is included in the supplementary material.

    The Sketching Complexity of Graph and Hypergraph Counting

    Subgraph counting is a fundamental primitive in graph processing, with applications in social network analysis (e.g., estimating the clustering coefficient of a graph), database processing and other areas. The space complexity of subgraph counting has been studied extensively in the literature, but many natural settings are still not well understood. In this paper we revisit the subgraph (and hypergraph) counting problem in the sketching model, where the algorithm's state as it processes a stream of updates to the graph is a linear function of the stream. This model has recently received a lot of attention in the literature, and has become a standard model for solving dynamic graph streaming problems. In this paper we give a tight bound on the sketching complexity of counting the number of occurrences of a small subgraph H in a bounded degree graph G presented as a stream of edge updates. Specifically, we show that the space complexity of the problem is governed by the fractional vertex cover number of the graph H. Our subgraph counting algorithm implements a natural vertex sampling approach, with sampling probabilities governed by the vertex cover of H. Our main technical contribution lies in a new set of Fourier analytic tools that we develop to analyze multiplayer communication protocols in the simultaneous communication model, allowing us to prove a tight lower bound. We believe that our techniques are likely to find applications in other settings. Besides giving tight bounds for all graphs H, both our algorithm and lower bounds extend to the hypergraph setting, albeit with some loss in space complexity.

    Adversarially Robust Coloring for Graph Streams