A Very Efficient Scheme for Estimating Entropy of Data Streams Using Compressed Counting
{\em Compressed Counting (CC)} was recently proposed for approximating the
$\alpha$th frequency moments of data streams, for $0 < \alpha \leq 2$. Under the
relaxed strict-Turnstile model, CC dramatically improves the standard algorithm
based on {\em symmetric stable random projections}, especially as $\alpha \to 1$. A
direct application of CC is to estimate the entropy, which is an important
summary statistic in Web/network measurement and often serves as a crucial
"feature" for data mining. The R\'enyi entropy and the Tsallis entropy are
functions of the $\alpha$th frequency moments; both approach the Shannon
entropy as $\alpha \to 1$. A recent theoretical work suggested using the
$\alpha$th frequency moment to approximate the Shannon entropy, with
$\alpha = 1 + \Delta$ and very small $\Delta$.
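The convergence of the R\'enyi and Tsallis entropies to the Shannon entropy as $\alpha \to 1$ can be checked numerically. The sketch below uses a small hypothetical distribution (not the paper's Web crawl data) together with the standard definitions $H_\alpha = \frac{1}{1-\alpha}\log\sum_i p_i^\alpha$ (R\'enyi) and $T_\alpha = \frac{1}{\alpha-1}\left(1-\sum_i p_i^\alpha\right)$ (Tsallis):

```python
import math

# Hypothetical example distribution (illustration only, not the paper's data).
p = [0.5, 0.25, 0.125, 0.125]

def shannon(p):
    # Shannon entropy in nats: H = -sum_i p_i log p_i
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def renyi(p, alpha):
    # Renyi entropy: H_alpha = log(sum_i p_i^alpha) / (1 - alpha)
    return math.log(sum(pi ** alpha for pi in p)) / (1.0 - alpha)

def tsallis(p, alpha):
    # Tsallis entropy: T_alpha = (1 - sum_i p_i^alpha) / (alpha - 1)
    return (1.0 - sum(pi ** alpha for pi in p)) / (alpha - 1.0)

# Both entropies approach the Shannon entropy as alpha -> 1.
for alpha in (1.5, 1.1, 1.01, 1.001):
    print(f"alpha={alpha}: Renyi={renyi(p, alpha):.6f}, "
          f"Tsallis={tsallis(p, alpha):.6f}, Shannon={shannon(p):.6f}")
```

At $\alpha = 1.001$ both estimates agree with the Shannon entropy to roughly three decimal places, which is exactly why the approximation schemes discussed above need $\alpha$ very close to 1.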
In this study, we experiment using CC to estimate frequency moments, R\'enyi
entropy, Tsallis entropy, and Shannon entropy, on real Web crawl data. We
demonstrate the variance-bias trade-off in estimating Shannon entropy and
provide practical recommendations. In particular, our experiments enable us to
draw some important conclusions:
(1) As $\alpha \to 1$, CC dramatically improves {\em symmetric stable random
projections} in estimating frequency moments, R\'enyi entropy, Tsallis entropy,
and Shannon entropy. The improvements appear to approach "infinity."
(2) Using {\em symmetric stable random projections} with $\alpha = 1 + \Delta$
and very small $\Delta$ does not provide a practical algorithm, because the
required sample size is enormous.