2 research outputs found

    On Approximating Frequency Moments of Data Streams with Skewed Projections

    We propose skewed stable random projections for approximating the pth frequency moments of dynamic data streams (0 < p <= 2), a problem that has been studied frequently in the theoretical computer science and database communities. Our method significantly (or even infinitely when p -> 1) improves on previous methods based on (symmetric) stable random projections. Our proposed method is applicable to data streams that are (a) insertion only (the cash-register model); (b) always non-negative (the strict Turnstile model); or (c) eventually non-negative at check points. This is only a minor restriction for practical applications. Our method works particularly well when p = 1 +/- \Delta and \Delta is small, which is a practically important scenario. For example, \Delta may be the decay rate or interest rate, which are usually small. Of course, when \Delta = 0, one can compute the first frequency moment (i.e., the sum) essentially error-free using a simple counter. Our method may be viewed as a "generalized counter" in that it can count the total value in the future, taking into account the effect of decay or interest accrual. In summary, our contributions are two-fold. (A) This is the first proposal of skewed stable random projections. (B) Based on first principles, we develop various statistical estimators for skewed stable distributions, including their variances and error (tail) probability bounds, and consequently the sample complexity bounds.
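The remark about the simple counter can be illustrated with a minimal Python sketch. It is not the skewed-projection method from the abstract: it only shows why, under a strict Turnstile stream, a single running sum already equals the first frequency moment, and how the exact pth moment for p near 1 would be computed by brute force for comparison. The function names and the example stream are hypothetical.

```python
from collections import defaultdict


def process_stream(updates):
    """Apply turnstile updates (item, delta) and keep a running counter.

    The full count vector is stored here only so the counter can be
    checked against the exact moment; a real streaming algorithm would
    not keep it.
    """
    counts = defaultdict(float)
    counter = 0.0  # the "simple counter": a single number of state
    for item, delta in updates:
        counts[item] += delta
        counter += delta
    return counts, counter


def frequency_moment(counts, p):
    """Exact p-th frequency moment F_p = sum_i |A[i]|^p, by brute force."""
    return sum(abs(v) ** p for v in counts.values() if v != 0)


# Hypothetical stream: insertions plus one deletion; every entry of the
# count vector stays non-negative (the strict Turnstile assumption).
updates = [("a", 3), ("b", 5), ("a", -1), ("c", 2)]
counts, counter = process_stream(updates)

f1 = frequency_moment(counts, 1.0)        # equals the counter when p = 1
f_near_1 = frequency_moment(counts, 1.05)  # p = 1 + Delta needs more than a counter

print(counter, f1, f_near_1)
```

For p = 1 the counter and the exact moment agree, which is the \Delta = 0 case in the abstract. For p = 1 + \Delta the moment depends on the individual entries, so one number of state is no longer enough and some form of sketching is needed.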

    A Very Efficient Scheme for Estimating Entropy of Data Streams Using Compressed Counting

    Compressed Counting (CC) was recently proposed for approximating the αth frequency moments of data streams, for 0 < α ≤ 2. Under the relaxed strict-Turnstile model, CC dramatically improves on the standard algorithm based on symmetric stable random projections, especially as α → 1. A direct application of CC is to estimate the entropy, which is an important summary statistic in Web/network measurement and often serves as a crucial "feature" for data mining. The Rényi entropy and the Tsallis entropy are functions of the αth frequency moments, and both approach the Shannon entropy as α → 1. A recent theoretical work suggested using the αth frequency moment to approximate the Shannon entropy with α = 1 + δ and very small |δ| (e.g., < 10^{-4}). In this study, we experiment with using CC to estimate frequency moments, Rényi entropy, Tsallis entropy, and Shannon entropy on real Web crawl data. We demonstrate the variance-bias trade-off in estimating Shannon entropy and provide practical recommendations. In particular, our experiments enable us to draw some important conclusions: (1) As α → 1, CC dramatically improves on symmetric stable random projections in estimating frequency moments, Rényi entropy, Tsallis entropy, and Shannon entropy. The improvements appear to approach "infinity." (2) Using symmetric stable random projections and α = 1 + δ with very small |δ| does not provide a practical algorithm because the required sample size is enormous.
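The relationship between the αth frequency moment and the three entropies can be sketched as follows. This is a minimal illustration using the standard textbook definitions of Rényi and Tsallis entropy and exact (not sketched) frequency moments, so it shows only the bias side of the variance-bias trade-off; it does not implement CC. The function names and the count vector are hypothetical.

```python
import math


def frequency_moment(counts, alpha):
    """Exact alpha-th frequency moment F_alpha = sum_i A[i]^alpha."""
    return sum(c ** alpha for c in counts if c > 0)


def shannon_entropy(counts):
    """Shannon entropy (in nats) of the normalized count vector."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def renyi_entropy(counts, alpha):
    """Renyi entropy: log(F_alpha / F_1^alpha) / (1 - alpha), alpha != 1."""
    f1 = frequency_moment(counts, 1.0)
    fa = frequency_moment(counts, alpha)
    return math.log(fa / f1 ** alpha) / (1.0 - alpha)


def tsallis_entropy(counts, alpha):
    """Tsallis entropy: (1 - F_alpha / F_1^alpha) / (alpha - 1), alpha != 1."""
    f1 = frequency_moment(counts, 1.0)
    fa = frequency_moment(counts, alpha)
    return (1.0 - fa / f1 ** alpha) / (alpha - 1.0)


# Hypothetical count vector standing in for a data stream's final state.
counts = [50, 30, 15, 5]
h = shannon_entropy(counts)

for delta in (0.1, 0.01, 0.001):
    alpha = 1.0 + delta
    print(delta,
          abs(renyi_entropy(counts, alpha) - h),
          abs(tsallis_entropy(counts, alpha) - h))
```

With exact moments the gap to the Shannon entropy shrinks as δ shrinks. In a streaming setting the moments are themselves estimates, and dividing by a small |δ| amplifies their error, which is the trade-off the abstract refers to.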