What you can do with Coordinated Samples
Sample coordination, where similar instances have similar samples, was
proposed by statisticians four decades ago as a way to maximize overlap in
repeated surveys. Coordinated sampling has since been used for summarizing
massive data sets.
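One standard way to coordinate samples (a minimal sketch, not the paper's specific construction) is to derive each key's randomness from a shared hash, so the same key always receives the same rank in every data set; a bottom-k sample then automatically overlaps heavily across similar instances:

```python
import hashlib

def rank(key):
    # Shared pseudo-random rank in [0, 1): the same key always
    # receives the same rank, in every data set.
    h = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(h[:8], "big") / 2**64

def bottom_k_sample(keys, k):
    # Coordinated bottom-k sample: keep the k keys of smallest rank.
    return set(sorted(keys, key=rank)[:k])

a = bottom_k_sample(range(100), 10)      # one instance
b = bottom_k_sample(range(5, 105), 10)   # a similar instance
overlap = len(a & b)
# Because ranks are shared, similar key sets yield largely
# overlapping samples, unlike independent sampling.
```

Here the key set and `rank` function are illustrative assumptions; any fixed hash shared across instances gives the same coordination effect.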
The usefulness of a sampling scheme hinges on the scope and accuracy within
which queries posed over the original data can be answered from the sample. We
aim here to gain a fundamental understanding of the limits and potential of
coordination. Our main result is a precise characterization, in terms of simple
properties of the estimated function, of queries for which estimators with
desirable properties exist. We consider unbiasedness, nonnegativity, finite
variance, and bounded estimates.
Since a single estimator generally cannot be optimal (minimize variance
simultaneously) for all data, we propose {\em variance competitiveness}, which
means that the expected square of the estimate on any data is not too far from
the minimum possible for that data. Perhaps surprisingly, we show how to
construct, for any function for which an unbiased nonnegative estimator exists,
a variance-competitive estimator.
Comment: 4 figures, 21 pages. Extended Abstract appeared in RANDOM 201
Adaptive Threshold Sampling and Estimation
Sampling is a fundamental problem in both computer science and statistics. A
number of issues arise when designing a method based on sampling. These include
statistical considerations such as constructing a good sampling design and
ensuring there are good, tractable estimators for the quantities of interest as
well as computational considerations such as designing fast algorithms for
streaming data and ensuring the sample fits within memory constraints.
Unfortunately, existing sampling methods are only able to address all of these
issues in limited scenarios.
We develop a framework that can be used to address these issues in a broad
range of scenarios. In particular, it addresses the problem of drawing and
using samples under some memory budget constraint. This problem can be
challenging since the memory budget forces samples to be drawn
non-independently and consequently, makes computation of resulting estimators
difficult.
At the core of the framework is the notion of a data adaptive thresholding
scheme where the threshold effectively allows one to treat the non-independent
sample as if it were drawn independently. We provide sufficient conditions for
a thresholding scheme to allow this and provide ways to build and compose such
schemes.
Furthermore, we provide fast algorithms to sample efficiently under these
thresholding schemes.