
    On Estimating the First Frequency Moment of Data Streams

    Estimating the first moment of a data stream, defined as $F_1 = \sum_{i \in \{1, 2, \ldots, n\}} |f_i|$, to within $1 \pm \epsilon$ relative error with high probability is a basic and influential problem in data stream processing. A tight space bound of $O(\epsilon^{-2} \log(mM))$ is known from the work of [Kane-Nelson-Woodruff-SODA10]. However, all known algorithms for this problem require per-update stream processing time of $\Omega(\epsilon^{-2})$, with the only exception being the algorithm of [Ganguly-Cormode-RANDOM07], which requires per-update processing time of $O(\log^2(mM)(\log n))$ albeit with sub-optimal space $O(\epsilon^{-3}\log^2(mM))$. In this paper, we present an algorithm for estimating $F_1$ that achieves near-optimality in both space and update processing time. The space requirement is $O(\epsilon^{-2}(\log n + (\log \epsilon^{-1})\log(mM)))$ and the per-update processing time is $O((\log n)\log(\epsilon^{-1}))$. Comment: 12 pages
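    To make the turnstile setting concrete, here is a minimal exact-computation sketch (not the paper's small-space algorithm; function and variable names are illustrative). It shows the quantity $F_1$ being estimated: per-item frequencies accumulate signed updates, and $F_1$ sums their absolute values, which is why an exact computation needs memory linear in the number of distinct items.

```python
from collections import defaultdict

def exact_f1(updates):
    """Exact first frequency moment F_1 = sum_i |f_i| over a turnstile
    stream of (item, delta) updates. The paper's algorithm approximates
    this in small space; this baseline stores every distinct item."""
    freq = defaultdict(int)
    for item, delta in updates:  # delta may be negative (turnstile model)
        freq[item] += delta
    return sum(abs(f) for f in freq.values())

# Final frequencies: f = {a: 2, b: -1, c: 3}, so F_1 = 2 + 1 + 3 = 6.
stream = [("a", 1), ("b", -1), ("c", 3), ("a", 1)]
print(exact_f1(stream))  # 6
```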

    On Practical Algorithms for Entropy Estimation and the Improved Sample Complexity of Compressed Counting

    Estimating the $p$-th frequency moment of a data stream is a heavily studied problem. The problem is actually trivial when $p = 1$, assuming the strict Turnstile model. The sample complexity of our proposed algorithm is essentially $O(1)$ near $p = 1$. This is a very large improvement over the previously believed $O(1/\epsilon^2)$ bound. The proposed algorithm makes the long-standing problem of entropy estimation an easy task, as verified by the experiments included in the appendix.
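    The $p = 1$ triviality mentioned above has a one-line explanation: in the strict Turnstile model every frequency stays non-negative, so $F_1 = \sum_i |f_i| = \sum_i f_i$, which is simply the running sum of all updates. A minimal sketch of this observation (illustrative names, assuming the strict Turnstile guarantee holds):

```python
def f1_strict_turnstile(updates):
    """In the strict Turnstile model f_i >= 0 at all times, so
    F_1 = sum_i |f_i| = sum_i f_i: a single O(1)-space counter of
    all deltas suffices, making p = 1 trivial."""
    total = 0
    for _item, delta in updates:  # item identity is irrelevant for F_1 here
        total += delta
    return total

stream = [("a", 2), ("b", 1), ("a", -1), ("c", 3)]  # frequencies stay >= 0
print(f1_strict_turnstile(stream))  # 5
```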

    Estimating Entropy of Data Streams Using Compressed Counting

    The Shannon entropy is a widely used summary statistic in, for example, network traffic measurement, anomaly detection, neural computation, spike-train analysis, etc. This study focuses on estimating the Shannon entropy of data streams. It is known that Shannon entropy can be approximated by Rényi entropy or Tsallis entropy, which are both functions of the $p$-th frequency moments and approach Shannon entropy as $p \to 1$. Compressed Counting (CC) is a new method for approximating the $p$-th frequency moments of data streams. Our contributions include: 1) We prove that Rényi entropy is (much) better than Tsallis entropy for approximating Shannon entropy. 2) We propose the optimal quantile estimator for CC, which considerably improves the previous estimators. 3) Our experiments demonstrate that CC is indeed highly effective in approximating the moments and entropies. We also demonstrate the crucial importance of utilizing the variance-bias trade-off.
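    As a quick numerical illustration of the $p \to 1$ limit the approach relies on (this is a check on exact, normalized frequencies, not the CC streaming estimator itself; names are illustrative):

```python
import math

def shannon(q):
    # Shannon entropy H = -sum_i q_i log q_i
    return -sum(x * math.log(x) for x in q if x > 0)

def renyi(q, p):
    # Renyi entropy H_p = log(sum_i q_i^p) / (1 - p), for p != 1
    return math.log(sum(x ** p for x in q)) / (1.0 - p)

def tsallis(q, p):
    # Tsallis entropy T_p = (1 - sum_i q_i^p) / (p - 1), for p != 1
    return (1.0 - sum(x ** p for x in q)) / (p - 1.0)

q = [0.5, 0.25, 0.125, 0.125]  # normalized frequencies f_i / F_1
print(shannon(q))              # 1.2130...
for p in (1.1, 1.01, 1.001):
    # Both approximations approach the Shannon value as p -> 1.
    print(p, renyi(q, p), tsallis(q, p))
```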

    Recursive Sketching For Frequency Moments

    In a ground-breaking paper, Indyk and Woodruff (STOC 05) showed how to compute $F_k$ (for $k > 2$) in space complexity $O(\mathrm{poly\text{-}log}(n,m) \cdot n^{1-\frac{2}{k}})$, which is optimal up to (large) poly-logarithmic factors in $n$ and $m$, where $m$ is the length of the stream and $n$ is the upper bound on the number of distinct elements in a stream. The best known lower bound for large moments is $\Omega(\log(n) \, n^{1-\frac{2}{k}})$. A follow-up work of Bhuvanagiri, Ganguly, Kesh and Saha (SODA 2006) reduced the poly-logarithmic factors of Indyk and Woodruff to $O(\log^2(m) \cdot (\log n + \log m) \cdot n^{1-\frac{2}{k}})$. Further reduction of poly-log factors has been an elusive goal since 2006, when the Indyk-Woodruff method seemed to hit a natural "barrier." Using our simple recursive sketch, we provide a different yet simple approach to obtain an $O(\log(m)\log(nm) \cdot (\log\log n)^4 \cdot n^{1-\frac{2}{k}})$ algorithm for constant $\epsilon$ (our bound is, in fact, somewhat stronger, where the $(\log\log n)$ term can be replaced by any constant number of $\log$ iterations instead of just two or three, thus approaching $\log^* n$). Our bound also works for non-constant $\epsilon$ (for details see the body of the paper). Further, our algorithm requires only 4-wise independence, in contrast to existing methods that use pseudo-random generators for computing large frequency moments.
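    For reference, the quantity being sketched is the $k$-th frequency moment $F_k = \sum_i f_i^k$. A naive exact computation (linear space in the number of distinct elements, in contrast to the roughly $n^{1-2/k}$-space sketches discussed above; names are illustrative) looks like:

```python
from collections import Counter

def exact_fk(stream, k):
    """Exact k-th frequency moment F_k = sum_i f_i^k. Streaming
    sketches approximate this in roughly n^{1-2/k} space; this
    baseline stores every distinct element exactly."""
    freq = Counter(stream)
    return sum(f ** k for f in freq.values())

print(exact_fk(["a", "b", "a", "c", "a", "b"], 3))  # 3**3 + 2**3 + 1**3 = 36
```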

    Evaluation of the utility of sediment data in NASQAN (National Stream Quality Accounting Network)

    Monthly suspended sediment discharge measurements, made by the USGS as part of the National Stream Quality Accounting Network (NASQAN), are analysed to assess their adequacy in terms of spatial coverage, temporal sampling frequency, and accuracy of measurement, as well as their utility in determining the sediment yield of the nation's rivers. It is concluded that the spatial distribution of NASQAN stations is reasonable but necessarily judgemental. The temporal variations of the sediment data contain much higher frequencies than monthly sampling can resolve. Sampling error is found to be minor when compared with other causes of data scatter, which can be substantial. The usefulness of the monthly measurements of sediment transport is enhanced when combined with the daily measurements of water discharge. Increasing the sampling frequency moderately would not materially improve the accuracy of sediment yield determinations.