26,620 research outputs found
The Adaptive Sampling Revisited
The problem of estimating the number $n$ of distinct keys of a large
collection of data is well known in computer science. A classical algorithm
is the adaptive sampling (AS). $n$ can be estimated by $R\,2^{D}$, where $R$ is
the final bucket (cache) size and $D$ is the final depth at the end of the
process. Several new interesting questions can be asked about AS (some of them
were suggested by P. Flajolet and popularized by J. Lumbroso). The distribution
of $W=\log(R\,2^{D}/n)$ is known; we rederive this distribution in a simpler way.
We provide new results on the moments of $D$ and $W$. We also analyze the final
cache size $R$ distribution. We consider colored keys: assume that among the $n$
distinct keys, $n_C$ have color $C$. We show how to estimate
$p=n_C/n$. We also study colored keys with some multiplicity given by
some distribution function. We want to estimate the mean and variance of this
distribution. Finally, we consider the case where neither colors nor
multiplicities are known. There we want to estimate the related parameters. An
appendix is devoted to the case where the hashing function provides bits with
probability different from $1/2$.
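To make the $R\,2^{D}$ estimator concrete, here is a minimal sketch of adaptive sampling in Python. It assumes a 64-bit hash derived from SHA-256; the function and parameter names (adaptive_sampling_estimate, cache_capacity) are illustrative, not from the paper.

    import hashlib

    def adaptive_sampling_estimate(stream, cache_capacity=64):
        # Adaptive sampling: keep a key while the first `depth` bits of its
        # hash are all zero; on cache overflow, increase the depth and filter,
        # so about a 2^-depth fraction of distinct keys survives.
        # Final estimate of the number of distinct keys: R * 2^D.
        depth = 0
        cache = {}  # sampled distinct keys -> 64-bit hash

        def h64(key):
            # 64-bit hash; any well-mixing hash function would do
            return int.from_bytes(
                hashlib.sha256(str(key).encode()).digest()[:8], "big")

        def in_sample(hv, d):
            # True iff the top d bits of the 64-bit hash are all zero
            return d == 0 or (hv >> (64 - d)) == 0

        for key in stream:
            hv = h64(key)
            if in_sample(hv, depth):
                cache[key] = hv
                while len(cache) > cache_capacity:
                    depth += 1
                    cache = {k: v for k, v in cache.items()
                             if in_sample(v, depth)}
        return len(cache) * 2 ** depth  # R * 2^D

    # Duplicates don't matter: the dict keeps one entry per distinct key.
    print(adaptive_sampling_estimate(i % 5000 for i in range(100000)))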
Fully decentralized computation of aggregates over data streams
In several emerging applications, data is collected in massive streams at several distributed points of observation. A basic and challenging task is to allow every node to monitor a neighbourhood of interest by issuing continuous aggregate queries on the streams observed in its vicinity. This class of algorithms is fully decentralized and diffusive in nature: collecting all data at a few central nodes of the network is unfeasible in networks of low-capability devices or in the presence of massive data sets. The main difficulty in designing diffusive algorithms is to cope with duplicate detections. These arise both from the observation of the same event at several nodes of the network and from receipt of the same aggregated information along multiple paths of diffusion. In this paper, we consider fully decentralized algorithms that answer locally continuous aggregate queries on the number of distinct events, the total number of events, and the second frequency moment in the scenario outlined above. The proposed algorithms use, in the worst case or on realistic distributions, sublinear space at every node. We also propose strategies that minimize the communication needed to update the aggregates when new events are observed. We experimentally evaluate the efficiency and accuracy of our algorithms on realistic simulated scenarios.
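The standard way to cope with such duplicate detections is to diffuse duplicate-insensitive sketches whose merge operation is idempotent. The sketch below is a minimal Python illustration for the distinct-events query, using a generic Flajolet-Martin-style construction; it is not necessarily the one used in the paper, and all names are illustrative.

    import hashlib

    class FMSketch:
        # Duplicate-insensitive distinct-count sketch (Flajolet-Martin style).
        # Merging is a bitwise OR: idempotent and commutative, so an event
        # observed at several nodes, or a sketch received along multiple
        # diffusion paths, is never counted twice.

        PHI = 0.77351  # Flajolet-Martin correction constant

        def __init__(self, num_maps=32):
            self.bitmaps = [0] * num_maps

        def _hash(self, item, seed):
            data = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            return int.from_bytes(data[:8], "big") | (1 << 63)  # never zero

        def add(self, item):
            for i in range(len(self.bitmaps)):
                hv = self._hash(item, i)
                r = (hv & -hv).bit_length() - 1  # index of lowest set bit
                self.bitmaps[i] |= 1 << r

        def merge(self, other):
            for i in range(len(self.bitmaps)):
                self.bitmaps[i] |= other.bitmaps[i]

        def estimate(self):
            # average position of the lowest unset bit across bitmaps
            total = 0
            for bm in self.bitmaps:
                r = 0
                while bm & (1 << r):
                    r += 1
                total += r
            return int(2 ** (total / len(self.bitmaps)) / self.PHI)

    a, b = FMSketch(), FMSketch()
    for e in range(6000):
        a.add(e)
    for e in range(4000, 10000):
        b.add(e)
    a.merge(b)
    a.merge(b)  # re-receiving the same sketch along another path is harmless
    print(a.estimate())  # roughly 10000 despite the 2000 duplicated events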
Boosting the Accuracy of Differentially-Private Histograms Through Consistency
We show that it is possible to significantly improve the accuracy of a
general class of histogram queries while satisfying differential privacy. Our
approach carefully chooses a set of queries to evaluate, and then exploits
consistency constraints that should hold over the noisy output. In a
post-processing phase, we compute the consistent input most likely to have
produced the noisy output. The final output is differentially-private and
consistent, but in addition, it is often much more accurate. We show, both
theoretically and experimentally, that these techniques can be used for
estimating the degree sequence of a graph very precisely, and for computing a
histogram that can support arbitrary range queries accurately. Comment: 15 pages, 7 figures, minor revisions to previous version
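As a concrete illustration of the noise-then-postprocess pattern in the degree-sequence case: a sorted degree sequence must be monotone, so the consistent answer closest to the noisy one in least squares can be found with pool adjacent violators. The Python sketch below is a simplified rendition of the idea, not the paper's exact algorithm; the sensitivity constant is an assumption.

    import numpy as np

    def private_degree_sequence(degrees, epsilon, sensitivity=2.0):
        # Step 1: Laplace noise on the sorted (non-increasing) degree
        # sequence. The sensitivity value here is illustrative; the right
        # scale depends on the exact privacy model.
        d = np.sort(np.asarray(degrees, dtype=float))[::-1]
        noisy = d + np.random.laplace(scale=sensitivity / epsilon, size=d.size)

        # Step 2: consistency post-processing. A valid answer must itself be
        # non-increasing, so project the noisy sequence onto the closest
        # non-increasing one in least squares (pool adjacent violators).
        # Post-processing never weakens differential privacy, but it can
        # make the answer much more accurate.
        blocks = []  # list of [block mean, block size]
        for v in noisy:
            blocks.append([v, 1])
            while len(blocks) > 1 and blocks[-2][0] < blocks[-1][0]:
                m2, s2 = blocks.pop()
                m1, s1 = blocks.pop()
                blocks.append([(m1 * s1 + m2 * s2) / (s1 + s2), s1 + s2])
        return np.concatenate([np.full(int(s), m) for m, s in blocks])

    degrees = [5, 4, 4, 3, 2, 2, 2, 1, 1, 1]
    print(private_degree_sequence(degrees, epsilon=1.0))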
Estimation for Monotone Sampling: Competitiveness and Customization
Random samples are lossy summaries which allow queries posed over the data to
be approximated by applying an appropriate estimator to the sample. The
effectiveness of sampling, however, hinges on estimator selection. The choice
of estimators is subjected to global requirements, such as unbiasedness and
range restrictions on the estimate value, and ideally, we seek estimators that
are both efficient to derive and apply and {\em admissible} (not dominated, in
terms of variance, by other estimators). Nevertheless, for a given data domain,
sampling scheme, and query, there are many admissible estimators. We study the
choice of admissible nonnegative and unbiased estimators for monotone sampling
schemes. Monotone sampling schemes are implicit in many applications of massive
data set analysis. Our main contribution is general derivations of admissible
estimators with desirable properties. We present a construction of {\em
order-optimal} estimators, which minimize variance according to {\em any}
specified priorities over the data domain. Order-optimality allows us to
customize the derivation to common patterns that we can learn or observe in the
data. When we prioritize lower values (e.g., more similar data sets when
estimating difference), we obtain the L* estimator, which is the unique
monotone admissible estimator. We show that the L* estimator is
4-competitive and dominates the classic Horvitz-Thompson estimator. These
properties make the L* estimator a natural default choice. We also present
the U* estimator, which prioritizes large values (e.g., less similar data
sets). Our estimator constructions are both easy to apply and possess desirable
properties, allowing us to make the most from our summarized data. Comment: 28 pages; improved write-up, presentation in the context of the more
general monotone sampling formulation (instead of coordinated sampling).
Bounds on the universal ratio removed to make the paper more focused, since it is
mainly of theoretical interest
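For reference, the classic Horvitz-Thompson estimator that the L* estimator is said to dominate works by inverse-probability weighting. Below is a minimal Python sketch with illustrative names; independent inclusion is assumed for simplicity.

    import random

    def horvitz_thompson_sum(values, inclusion_probs, seed=1):
        # Each item i enters the sample independently with probability p_i;
        # a sampled value is weighted by 1/p_i, which makes the estimate
        # unbiased: E[ sum_{sampled i} v_i / p_i ] = sum_i v_i.
        rng = random.Random(seed)
        estimate = 0.0
        for v, p in zip(values, inclusion_probs):
            if rng.random() < p:
                estimate += v / p  # inverse-probability weighting
        return estimate

    values = [10.0, 3.0, 7.0, 1.0, 25.0]
    probs = [min(1.0, v / 10.0) for v in values]  # PPS-style probabilities
    print(horvitz_thompson_sum(values, probs), "vs true", sum(values))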
- …