
    The Adaptive Sampling Revisited

    The problem of estimating the number $n$ of distinct keys in a large collection of $N$ data items is well known in computer science. A classical algorithm is adaptive sampling (AS): $n$ can be estimated by $R \cdot 2^D$, where $R$ is the final bucket (cache) size and $D$ is the final depth at the end of the process. Several new interesting questions can be asked about AS (some of them suggested by P. Flajolet and popularized by J. Lumbroso). The distribution of $W = \log(R\,2^D/n)$ is known; we rederive this distribution in a simpler way. We provide new results on the moments of $D$ and $W$, and we also analyze the distribution of the final cache size $R$. We then consider colored keys: assume that among the $n$ distinct keys, $n_C$ have color $C$. We show how to estimate $p = \frac{n_C}{n}$. We also study colored keys whose multiplicity is given by some distribution function, and we estimate the mean and variance of this distribution. Finally, we consider the case where neither colors nor multiplicities are known, and we estimate the related parameters. An appendix is devoted to the case where the hashing function provides bits with probability different from $1/2$.
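
    The estimator above is concrete enough to sketch in code. The following is a minimal Python illustration of adaptive sampling as described: keys are kept in a bounded cache while the first $D$ bits of their hash are zero, and the distinct count is estimated by $R \cdot 2^D$. The function names, the SHA-256-based 64-bit hash, and the default cache capacity are our own illustrative choices, not details from the paper.

        import hashlib

        def adaptive_sampling_estimate(keys, cache_capacity=64):
            """Estimate the number of distinct keys by adaptive sampling (AS)."""
            def h64(key):
                # 64-bit hash, assumed to behave like uniform random bits.
                return int.from_bytes(
                    hashlib.sha256(repr(key).encode()).digest()[:8], "big")

            def qualifies(h, depth):
                # A key stays sampled while the top `depth` bits of its hash are 0.
                return h >> (64 - depth) == 0

            depth, cache = 0, set()
            for key in keys:
                h = h64(key)
                if qualifies(h, depth):
                    cache.add(h)  # hashes are cached, so repeated keys count once
                    while len(cache) > cache_capacity:
                        # Overflow: increase the depth, evict disqualified keys.
                        depth += 1
                        cache = {x for x in cache if qualifies(x, depth)}
            # Final cache size R times 2^D estimates n.
            return len(cache) * 2 ** depth

    Since the final cache is a uniform random sample of the distinct keys, the colored-key proportion $p = \frac{n_C}{n}$ can be estimated from the same structure as the fraction of cached keys with color $C$.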

    Fully decentralized computation of aggregates over data streams

    In several emerging applications, data is collected in massive streams at several distributed points of observation. A basic and challenging task is to allow every node to monitor a neighbourhood of interest by issuing continuous aggregate queries on the streams observed in its vicinity. This class of algorithms is fully decentralized and diffusive in nature: collecting all data at a few central nodes of the network is unfeasible in networks of low-capability devices or in the presence of massive data sets. The main difficulty in designing diffusive algorithms is coping with duplicate detections, which arise both from the observation of the same event at several nodes of the network and from receipt of the same aggregated information along multiple paths of diffusion. In this paper, we consider fully decentralized algorithms that locally answer continuous aggregate queries on the number of distinct events, the total number of events, and the second frequency moment in the scenario outlined above. The proposed algorithms use sublinear space at every node, either in the worst case or on realistic distributions. We also propose strategies that minimize the communication needed to update the aggregates when new events are observed. We experimentally evaluate the efficiency and accuracy of our algorithms on realistic simulated scenarios.
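
    The abstract does not name the concrete sketches used, but a standard duplicate-insensitive building block for distinct-event counting in exactly this diffusive setting is a Flajolet-Martin style bitmap, whose merge is an idempotent bitwise OR. The class below is our own minimal illustration of that property, not the paper's algorithm.

        import hashlib

        class FMSketch:
            """Flajolet-Martin style distinct-count sketch.

            merge() is a bitwise OR, so absorbing the same event at several
            nodes, or the same sketch along several diffusion paths, never
            changes the state: duplicates are harmless by construction.
            """
            def __init__(self):
                self.bits = 0

            def add(self, event):
                h = int.from_bytes(
                    hashlib.sha256(repr(event).encode()).digest()[:8], "big")
                # rho = number of trailing zero bits of the hash.
                rho = (h & -h).bit_length() - 1 if h else 63
                self.bits |= 1 << rho

            def merge(self, other):
                self.bits |= other.bits  # idempotent and commutative

            def estimate(self):
                # r = position of the lowest unset bit; E[2^r] is about 0.7735 n.
                r = 0
                while (self.bits >> r) & 1:
                    r += 1
                return 2 ** r / 0.77351

    A single sketch has constant relative variance; averaging several independent sketches per node sharpens the estimate while keeping space sublinear in the number of events.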

    Boosting the Accuracy of Differentially-Private Histograms Through Consistency

    We show that it is possible to significantly improve the accuracy of a general class of histogram queries while satisfying differential privacy. Our approach carefully chooses a set of queries to evaluate, and then exploits consistency constraints that should hold over the noisy output. In a post-processing phase, we compute the consistent input most likely to have produced the noisy output. The final output is differentially private and consistent, and in addition it is often much more accurate. We show, both theoretically and experimentally, that these techniques can be used to estimate the degree sequence of a graph very precisely, and to compute a histogram that can support arbitrary range queries accurately. Comment: 15 pages, 7 figures, minor revisions to previous version.
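
    As a small worked example of the noise-then-project pattern described above, consider the degree-sequence case: a sorted degree sequence is non-increasing, so after adding Laplace noise one can post-process with isotonic regression (pool adjacent violators), which returns the consistent sequence closest in L2 to the noisy answers. The Python sketch below is illustrative only; in particular, the noise scale assumes unit sensitivity, whereas the correct scale depends on the sensitivity of the actual query.

        import numpy as np

        def dp_sorted_counts(counts, epsilon=1.0, rng=None):
            """Laplace noise plus consistency post-processing.

            `counts` is assumed sorted non-increasing (e.g. a sorted degree
            sequence). Returns the non-increasing sequence closest in L2 to
            the noisy answers (pool-adjacent-violators projection).
            """
            rng = rng or np.random.default_rng()
            noisy = np.asarray(counts, float) + rng.laplace(
                0.0, 1.0 / epsilon, size=len(counts))

            blocks = []  # (mean, size) of maximal pooled runs
            for x in noisy:
                blocks.append((x, 1))
                # Pool while the non-increasing constraint is violated.
                while len(blocks) > 1 and blocks[-2][0] < blocks[-1][0]:
                    (m1, s1), (m2, s2) = blocks[-2], blocks[-1]
                    blocks[-2:] = [((m1 * s1 + m2 * s2) / (s1 + s2), s1 + s2)]
            return np.concatenate([np.full(s, m) for m, s in blocks])

    Because the set of non-increasing sequences is convex and contains the true answer, this L2 projection can only move the noisy output closer to the truth, which is where the accuracy gain comes from.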

    Estimation for Monotone Sampling: Competitiveness and Customization

    Random samples are lossy summaries which allow queries posed over the data to be approximated by applying an appropriate estimator to the sample. The effectiveness of sampling, however, hinges on estimator selection. The choice of estimators is subject to global requirements, such as unbiasedness and range restrictions on the estimate value; ideally, we seek estimators that are both efficient to derive and apply and admissible (not dominated, in terms of variance, by other estimators). Nevertheless, for a given data domain, sampling scheme, and query, there are many admissible estimators. We study the choice of admissible nonnegative and unbiased estimators for monotone sampling schemes, which are implicit in many applications of massive data set analysis. Our main contribution is a general derivation of admissible estimators with desirable properties. We present a construction of order-optimal estimators, which minimize variance according to any specified priorities over the data domain. Order-optimality allows us to customize the derivation to common patterns that we can learn or observe in the data. When we prioritize lower values (e.g., more similar data sets when estimating difference), we obtain the L^* estimator, which is the unique monotone admissible estimator. We show that the L^* estimator is 4-competitive and dominates the classic Horvitz-Thompson estimator. These properties make the L^* estimator a natural default choice. We also present the U^* estimator, which prioritizes large values (e.g., less similar data sets). Our estimator constructions are both easy to apply and possess desirable properties, allowing us to make the most of our summarized data. Comment: 28 pages; improved write-up, with the presentation in the context of the more general monotone sampling formulation (instead of coordinated sampling); bounds on the universal ratio were removed to keep the paper focused, since they are mainly of theoretical interest.
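
    For reference, the classic Horvitz-Thompson estimator that L^* is shown to dominate simply inverse-weights each sampled item by its inclusion probability. Here is a minimal sketch under independent (Poisson) sampling; the names and the sampling scheme are illustrative and not the paper's monotone sampling formulation.

        import random

        def horvitz_thompson_sum(items, value, inclusion_prob, seed=None):
            """Unbiased Horvitz-Thompson estimate of sum(value(i)).

            Each item enters the sample independently with probability
            inclusion_prob(i); sampled items contribute value/probability.
            This keeps the estimate unbiased but can inflate the variance
            when low-probability items happen to be drawn.
            """
            rng = random.Random(seed)
            total = 0.0
            for i in items:
                p = inclusion_prob(i)
                if rng.random() < p:
                    total += value(i) / p
            return total

        # Example: estimate a population total of 10,000 (value 1 per item)
        # from a 10% sample.
        est = horvitz_thompson_sum(range(10_000), lambda i: 1.0, lambda i: 0.1)

    The paper's point is that, for monotone sampling schemes, one can do strictly better: the L^* estimator is admissible and never has higher variance than this classic baseline.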