On Estimating the First Frequency Moment of Data Streams
Estimating the first moment of a data stream, defined as
$F_1 = \sum_{i \in \{1, 2, \ldots, n\}} |f_i|$, to within a given relative error with
high probability is a basic and influential problem in data stream processing.
A tight space bound of is known from the work of
[Kane-Nelson-Woodruff-SODA10]. However, all known algorithms for this problem
require per-update stream processing time of , with the
only exception being the algorithm of [Ganguly-Cormode-RANDOM07] that requires
per-update processing time of albeit with sub-optimal
space . In this paper, we present an algorithm for
estimating $F_1$ that achieves near-optimality in both space and update
processing time. The space requirement is and the per-update processing time is .
Comment: 12 pages
On Practical Algorithms for Entropy Estimation and the Improved Sample Complexity of Compressed Counting
Estimating the p-th frequency moment of a data stream is a very heavily studied
problem. The problem is actually trivial when p = 1, assuming the strict
Turnstile model. The sample complexity of our proposed algorithm is essentially
O(1) near p=1. This is a very large improvement over the previously believed
O(1/eps^2) bound. The proposed algorithm makes the long-standing problem of
entropy estimation an easy task, as verified by the experiments included in the
appendix.
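The remark that the problem is trivial at p = 1 in the strict Turnstile model can be illustrated with a minimal sketch. It assumes the stream is a list of `(item, delta)` updates and that no item's final frequency is negative, so that F_1 equals the plain sum of all deltas. The function names and the sample stream are hypothetical, and this is not the Compressed Counting estimator itself.

```python
from collections import defaultdict


def f1_exact(updates):
    """Exact F_1 = sum_i |f_i|, keeping one counter per distinct item."""
    freq = defaultdict(int)
    for item, delta in updates:
        freq[item] += delta
    return sum(abs(f) for f in freq.values())


def f1_single_counter(updates):
    """Running sum of all deltas; equals F_1 only when every final f_i >= 0."""
    total = 0
    for _item, delta in updates:
        total += delta
    return total


# Hypothetical strict-Turnstile stream: deletions never exceed prior insertions.
strict_stream = [("a", 3), ("b", 2), ("a", -1), ("c", 5), ("b", -2)]

# Hypothetical general-Turnstile stream: item "b" ends with a negative frequency.
general_stream = [("a", 2), ("b", -2)]

print(f1_exact(strict_stream), f1_single_counter(strict_stream))
print(f1_exact(general_stream), f1_single_counter(general_stream))
```

On the second stream the single counter no longer matches the exact value, because positive and negative frequencies cancel in the sum; this is why the one-counter shortcut depends on the strict Turnstile assumption.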
Estimating Entropy of Data Streams Using Compressed Counting
The Shannon entropy is a widely used summary statistic, with applications in network
traffic measurement, anomaly detection, neural computations, spike trains, etc.
This study focuses on estimating the Shannon entropy of data streams. It is known
that Shannon entropy can be approximated by Rényi entropy or Tsallis entropy,
which are both functions of the p-th frequency moments and approach Shannon
entropy as p -> 1.
Compressed Counting (CC) is a new method for approximating the p-th frequency
moments of data streams. Our contributions include:
1) We prove that Rényi entropy is (much) better than Tsallis entropy for
approximating Shannon entropy.
2) We propose the optimal quantile estimator for CC, which considerably
improves the previous estimators.
3) Our experiments demonstrate that CC is indeed highly effective in
approximating the moments and entropies. We also demonstrate the crucial
importance of utilizing the variance-bias trade-off.
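The relationship between the p-th frequency moments and the three entropies can be checked numerically with a small sketch. It uses the standard definitions (natural logarithm) and computes the moments exactly from a known frequency vector, so it is not Compressed Counting and says nothing about estimator variance. The function names and the sample frequencies are hypothetical.

```python
import math


def shannon_entropy(freqs):
    """H = -sum_i q_i log q_i, with q_i = f_i / F_1."""
    f1 = sum(freqs)
    return -sum((f / f1) * math.log(f / f1) for f in freqs if f > 0)


def renyi_entropy(freqs, p):
    """H_p = log(F_p / F_1^p) / (1 - p), for p != 1."""
    f1 = sum(freqs)
    fp = sum(f ** p for f in freqs if f > 0)
    return math.log(fp / f1 ** p) / (1.0 - p)


def tsallis_entropy(freqs, p):
    """T_p = (1 - F_p / F_1^p) / (p - 1), for p != 1."""
    f1 = sum(freqs)
    fp = sum(f ** p for f in freqs if f > 0)
    return (1.0 - fp / f1 ** p) / (p - 1.0)


# Hypothetical non-negative item frequencies.
freqs = [5, 3, 2]
h = shannon_entropy(freqs)

# Both approximations should move toward the Shannon entropy as p approaches 1.
for p in (1.5, 1.1, 1.01, 1.001):
    print(p, renyi_entropy(freqs, p) - h, tsallis_entropy(freqs, p) - h)
```

In both formulas the moment ratio is divided by a quantity that vanishes as p approaches 1, so a small error in an estimated F_p is amplified; this is one way to see why estimator accuracy near p = 1 matters for entropy estimation.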
Recursive Sketching For Frequency Moments
In a ground-breaking paper, Indyk and Woodruff (STOC 05) showed how to
compute $F_k$ (for $k > 2$) in space complexity $O(\mathrm{polylog}(n,m) \cdot
n^{1-2/k})$, which is optimal up to (large) poly-logarithmic factors in
$n$ and $m$, where $m$ is the length of the stream and $n$ is the upper bound on
the number of distinct elements in a stream. The best known lower bound for
large moments is . A follow-up work of
Bhuvanagiri, Ganguly, Kesh and Saha (SODA 2006) reduced the poly-logarithmic
factors of Indyk and Woodruff to . Further reduction of poly-log factors has been an elusive
goal since 2006, when the Indyk and Woodruff method seemed to hit a natural
"barrier." Using our simple recursive sketch, we provide a different yet simple
approach to obtain a algorithm for constant (our bound is, in fact, somewhat
stronger, where the term can be replaced by any constant number
of iterations instead of just two or three, thus approaching .
Our bound also works for non-constant (for details see the body of
the paper). Further, our algorithm requires only -wise independence, in
contrast to existing methods that use pseudo-random generators for computing
large frequency moments
Evaluation of the utility of sediment data in NASQAN (National Stream Quality Accounting Network)
Monthly suspended sediment discharge measurements, made by the USGS as part of the National Stream Quality Accounting Network (NASQAN), are analysed to assess their adequacy in terms of spatial coverage, temporal sampling frequency, and accuracy of measurements, as well as their usefulness in determining the sediment yield in the nation's rivers.
It is concluded that the spatial distribution of NASQAN stations is reasonable but necessarily judgemental. The temporal variations of sediment data contain much higher frequencies than monthly. Sampling error is found to be minor when compared with other causes of data scatter, which can be substantial. The usefulness of the monthly measurements of sediment transport is enhanced when combined with the daily measurements of water discharge. Increasing the sampling frequency moderately would not materially improve the accuracy of sediment yield determinations.