On Estimating the First Frequency Moment of Data Streams

Authors: Ganguly Sumit; Kar Purushottam
Publication date: 01/01/2010

Estimating the first moment of a data stream, defined as $F_1 = \sum_{i \in \{1, 2, \ldots, n\}} |f_i|$, to within $1 \pm \epsilon$-relative error with high probability is a basic and influential problem in data stream processing. A tight space bound of $O(\epsilon^{-2} \log (mM))$ is known from the work of [Kane-Nelson-Woodruff-SODA10]. However, all known algorithms for this problem require per-update stream processing time of $\Omega(\epsilon^{-2})$, with the only exception being the algorithm of [Ganguly-Cormode-RANDOM07] that requires per-update processing time of $O(\log^2(mM)(\log n))$, albeit with sub-optimal space $O(\epsilon^{-3}\log^2(mM))$. In this paper, we present an algorithm for estimating $F_1$ that achieves near-optimality in both space and update processing time. The space requirement is $O(\epsilon^{-2}(\log n + (\log \epsilon^{-1})\log(mM)))$ and the per-update processing time is $O((\log n)\log(\epsilon^{-1}))$.

Comment: 12 pages

arXiv.org e-Print Archive

CiteSeerX

On Practical Algorithms for Entropy Estimation and the Improved Sample Complexity of Compressed Counting

Author: Li Ping
Publication date: 01/01/2010

Estimating the p-th frequency moment of a data stream is a very heavily studied problem. The problem is actually trivial when p = 1, assuming the strict Turnstile model. The sample complexity of our proposed algorithm is essentially O(1) near p = 1. This is a very large improvement over the previously believed O(1/\epsilon^2) bound. The proposed algorithm makes the long-standing problem of entropy estimation an easy task, as verified by the experiments included in the appendix.
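To illustrate why the p = 1 case is trivial in the strict Turnstile model, here is a minimal Python sketch. It assumes updates arrive as (item, delta) pairs and that no frequency ever becomes negative, so the first moment equals the running sum of all deltas and a single counter suffices. The function names and the sample stream are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def f1_single_counter(updates):
    """Return F_1 using one counter (valid only under the strict Turnstile assumption)."""
    total = 0
    for _item, delta in updates:
        total += delta  # frequencies stay non-negative, so sum of deltas == sum of |f_i|
    return total


def f1_exact(updates):
    """Reference computation: store every frequency and sum the absolute values."""
    freq = defaultdict(int)
    for item, delta in updates:
        freq[item] += delta
    return sum(abs(f) for f in freq.values())


# Hypothetical stream: final frequencies are a=2, b=0, c=5.
stream = [("a", 3), ("b", 2), ("a", -1), ("c", 5), ("b", -2)]
print(f1_single_counter(stream), f1_exact(stream))
```

The two functions agree only while the strict Turnstile assumption holds. If frequencies may go negative (the general Turnstile model), the single counter no longer equals the sum of absolute values.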

arXiv.org e-Print Archive

CiteSeerX

Estimating Entropy of Data Streams Using Compressed Counting

Author: Li Ping
Publication date: 01/01/2009

The Shannon entropy is a widely used summary statistic, for example in network traffic measurement, anomaly detection, neural computations, spike trains, etc. This study focuses on estimating the Shannon entropy of data streams. It is known that Shannon entropy can be approximated by Rényi entropy or Tsallis entropy, which are both functions of the p-th frequency moments and approach Shannon entropy as p -> 1. Compressed Counting (CC) is a new method for approximating the p-th frequency moments of data streams. Our contributions include: 1) We prove that Rényi entropy is (much) better than Tsallis entropy for approximating Shannon entropy. 2) We propose the optimal quantile estimator for CC, which considerably improves the previous estimators. 3) Our experiments demonstrate that CC is indeed highly effective in approximating the moments and entropies. We also demonstrate the crucial importance of utilizing the variance-bias trade-off.
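The relationship between the three entropies can be checked numerically. The Python sketch below uses the standard textbook definitions (natural logarithm) and exact frequencies rather than Compressed Counting. The function names and the sample frequency vector are assumptions for illustration. It shows only that both quantities approach the Shannon entropy as p -> 1 and does not reproduce the paper's comparison of the two.

```python
import math


def shannon(probs):
    """Shannon entropy H = -sum p_i ln p_i."""
    return -sum(x * math.log(x) for x in probs if x > 0)


def renyi(probs, p):
    """Renyi entropy of order p != 1: ln(sum p_i^p) / (1 - p)."""
    return math.log(sum(x ** p for x in probs if x > 0)) / (1 - p)


def tsallis(probs, p):
    """Tsallis entropy of order p != 1: (1 - sum p_i^p) / (p - 1)."""
    return (1 - sum(x ** p for x in probs if x > 0)) / (p - 1)


# Hypothetical item frequencies; dividing by F_1 gives the distribution.
freqs = [4, 2, 1, 1]
f1 = sum(freqs)
probs = [f / f1 for f in freqs]

for p in (1.5, 1.1, 1.01, 1.001):
    print(p, renyi(probs, p), tsallis(probs, p), shannon(probs))
```

Both entropies depend on the data only through the sum of p_i^p, which is the p-th frequency moment divided by F_1^p. This is why a moment estimator near p = 1 can be turned into an entropy estimator.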

arXiv.org e-Print Archive

CiteSeerX

Recursive Sketching For Frequency Moments

Authors: Braverman Vladimir; Ostrovsky Rafail
Publication date: 11/11/2010

In a ground-breaking paper, Indyk and Woodruff (STOC 05) showed how to compute $F_k$ (for $k>2$) in space complexity $O(\mathrm{polylog}(n,m)\cdot n^{1-\frac{2}{k}})$, which is optimal up to (large) poly-logarithmic factors in $n$ and $m$, where $m$ is the length of the stream and $n$ is the upper bound on the number of distinct elements in a stream. The best known lower bound for large moments is $\Omega(\log(n)\, n^{1-\frac{2}{k}})$. A follow-up work of Bhuvanagiri, Ganguly, Kesh and Saha (SODA 2006) reduced the poly-logarithmic factors of Indyk and Woodruff to $O(\log^2(m)\cdot (\log n+ \log m)\cdot n^{1-\frac{2}{k}})$. Further reduction of poly-log factors has been an elusive goal since 2006, when the Indyk and Woodruff method seemed to hit a natural "barrier." Using our simple recursive sketch, we provide a different yet simple approach to obtain an $O(\log(m)\log(nm)\cdot (\log\log n)^4\cdot n^{1-\frac{2}{k}})$ algorithm for constant $\epsilon$ (our bound is, in fact, somewhat stronger, where the $(\log\log n)$ term can be replaced by any constant number of $\log$ iterations instead of just two or three, thus approaching $\log^* n$). Our bound also works for non-constant $\epsilon$ (for details see the body of the paper). Further, our algorithm requires only $4$-wise independence, in contrast to existing methods that use pseudo-random generators for computing large frequency moments.

arXiv.org e-Print Archive

CiteSeerX

Evaluation of the utility of sediment data in NASQAN (National Stream Quality Accounting Network)

Authors: Brooks Norman H.; Koh Robert C. Y.; Taylor Brent D.; Vanoni Vito A.
Publication venue: 'California Institute of Technology Library'
Publication date: 01/06/1983

Monthly suspended sediment discharge measurements, made by the USGS as part of the National Stream Quality Accounting Network (NASQAN), are analysed to assess their adequacy in terms of spatial coverage, temporal sampling frequency, and accuracy of measurements, as well as in determining the sediment yield in the nation's rivers. It is concluded that the spatial distribution of NASQAN stations is reasonable but necessarily judgemental. The temporal variations of sediment data contain much higher frequencies than monthly. Sampling error is found to be minor when compared with other causes of data scatter, which can be substantial. The usefulness of the monthly measurements of sediment transport is enhanced when combined with the daily measurements of water discharge. Increasing the sampling frequency moderately would not materially improve the accuracy of sediment yield determinations.

Caltech Authors