Estimation of Distribution Overlap of Urn Models
A classical problem in statistics is estimating the expected coverage of a
sample, which has had applications in gene expression, microbial ecology,
optimization, and even numismatics. Here we consider a related extension of
this problem to random samples of two discrete distributions. Specifically, we
estimate what we call the dissimilarity probability of a sample, i.e., the
probability of a draw from one distribution not being observed in k draws from
another distribution. We show our estimator of dissimilarity to be a
U-statistic and a uniformly minimum variance unbiased estimator of
dissimilarity over the largest appropriate range of k. Furthermore, despite the
non-Markovian nature of our estimator when applied sequentially over k, we show
it converges uniformly in probability to the dissimilarity parameter, and we
present criteria when it is approximately normally distributed and admits a
consistent jackknife estimator of its variance. As proof of concept, we analyze
V35 16S rRNA data to discern between various microbial environments. Other
potential applications concern any situation where dissimilarity of two
discrete distributions may be of interest. For instance, in SELEX experiments,
each urn could represent a random RNA pool and each draw a possible solution to
a particular binding site problem over that pool. The dissimilarity of these
pools is then related to the probability of finding binding site solutions in
one pool that are absent in the other.
Comment: 27 pages, 4 figures
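To make the target quantity concrete: assuming the dissimilarity parameter for distributions P and Q is sum_x p_x (1 - q_x)^k (the chance that a draw from P is unseen in k draws from Q), the following is a minimal plug-in sketch using empirical frequencies. It is illustrative only; the paper's estimator is an unbiased U-statistic, which this naive plug-in is not, and all names here are hypothetical.

```python
import numpy as np
from collections import Counter

def dissimilarity_plugin(sample_p, sample_q, k):
    """Naive plug-in estimate of sum_x p_x * (1 - q_x)**k, i.e., the
    probability that one draw from P is not observed in k draws from Q,
    with p and q replaced by empirical frequencies.
    Illustrative only; this is NOT the paper's unbiased U-statistic."""
    n_p, n_q = len(sample_p), len(sample_q)
    freq_q = Counter(sample_q)
    total = 0.0
    for x, cnt in Counter(sample_p).items():
        q_x = freq_q.get(x, 0) / n_q
        total += (cnt / n_p) * (1.0 - q_x) ** k
    return total

# Two urns with partially overlapping support.
rng = np.random.default_rng(0)
urn_a = rng.choice(list("abcde"), size=500, p=[0.4, 0.3, 0.2, 0.05, 0.05])
urn_b = rng.choice(list("cdefg"), size=500, p=[0.4, 0.3, 0.2, 0.05, 0.05])
print(dissimilarity_plugin(urn_a, urn_b, k=10))
```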
Data Sketches for Disaggregated Subset Sum and Frequent Item Estimation
We introduce and study a new data sketch for processing massive datasets. It
addresses two common problems: 1) computing a sum given arbitrary filter
conditions and 2) identifying the frequent items or heavy hitters in a data
set. For the former, the sketch provides unbiased estimates with
state-of-the-art accuracy. It handles the challenging scenario when the data is
disaggregated so that computing the per unit metric of interest requires an
expensive aggregation. For example, the metric of interest may be total clicks
per user while the raw data is a click stream with multiple rows per user. Thus
the sketch is suitable for use in a wide range of applications including
computing historical click-through rates for ad prediction, reporting user
metrics from event streams, and measuring network traffic for IP flows.
We prove and empirically show the sketch has good properties for both the
disaggregated subset sum estimation and frequent item problems. On i.i.d. data,
it not only picks out the frequent items but gives strongly consistent
estimates for the proportion of each frequent item. The resulting sketch
asymptotically draws a probability proportional to size sample that is optimal
for estimating sums over the data. For non-i.i.d. data, we show that it
typically does much better than random sampling for the frequent item problem
and never does worse. For subset sum estimation, we show that even for
pathological sequences, the variance is close to that of an optimal sampling
design. Empirically, despite the disadvantage of operating on disaggregated
data, our method matches or bests priority sampling, a state-of-the-art method
for pre-aggregated data, and performs orders of magnitude better than uniform
sampling on skewed data. We propose extensions to the sketch that allow it
to be used in combining multiple data sets, in distributed systems, and for
time-decayed aggregation.
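The abstract benchmarks against priority sampling for pre-aggregated data. As a point of reference only (this is the baseline being compared against, not the paper's disaggregated sketch), a minimal priority-sampling sketch for unbiased subset sum estimation might look like:

```python
import random

def priority_sample(weights, k):
    """Priority sampling (Duffield-Lund-Thorup) over pre-aggregated weights.
    Each item gets priority w / u with u ~ Uniform(0, 1]; keep the k largest
    priorities and record tau, the (k+1)-th largest priority.
    Baseline for comparison; not the disaggregated sketch from the abstract."""
    priorities = sorted(
        ((w / (1.0 - random.random()), i, w) for i, w in enumerate(weights)),
        reverse=True,
    )
    tau = priorities[k][0] if len(priorities) > k else 0.0
    return [(i, w) for _, i, w in priorities[:k]], tau

def estimate_subset_sum(sample, tau, subset):
    """Unbiased estimate of the sum of weights over `subset` (a set of indices)."""
    return sum(max(w, tau) for i, w in sample if i in subset)

# Skewed weights; estimate the total over even-indexed items from a size-100 sample.
weights = [1.0] * 990 + [1000.0] * 10
sample, tau = priority_sample(weights, k=100)
evens = set(range(0, len(weights), 2))
print(estimate_subset_sum(sample, tau, evens),
      sum(w for i, w in enumerate(weights) if i in evens))
```

The paper's contribution is handling the case where each key's weight must itself be aggregated from many raw rows; this baseline assumes that aggregation has already been done.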
Shrinkage Estimation in Multilevel Normal Models
This review traces the evolution of theory that started when Charles Stein in
1955 [In Proc. 3rd Berkeley Sympos. Math. Statist. Probab. I (1956) 197--206,
Univ. California Press] showed that using each separate sample mean from
$k \ge 3$ Normal populations to estimate its own population mean $\mu_i$ can be
improved upon uniformly for every possible $\mu = (\mu_1, \ldots, \mu_k)'$. The
dominating estimators, referred to here as being "Model-I minimax," can be
found by shrinking the sample means toward any constant vector. Admissible
minimax shrinkage estimators were derived by Stein and others as posterior
means based on a random effects model, "Model-II" here, wherein the
$\mu_i$ values have their own distributions. Section 2 centers on Figure 2, which
organizes a wide class of priors on the unknown Level-II hyperparameters that
have been proved to yield admissible Model-I minimax shrinkage estimators in
the "equal variance case." Putting a flat prior on the Level-II variance is
unique in this class for its scale-invariance and for its conjugacy, and it
induces Stein's harmonic prior (SHP) on $\mu$.
Comment: Published in Statistical Science (http://www.imstat.org/sts/) by the
Institute of Mathematical Statistics (http://www.imstat.org); DOI:
http://dx.doi.org/10.1214/11-STS363
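For readers who want the classical starting point, a minimal sketch of equal-variance James-Stein shrinkage toward a constant vector, i.e., the kind of Model-I minimax dominating estimator the review discusses (this standard positive-part variant is not the SHP posterior mean itself):

```python
import numpy as np

def james_stein_shrink(x, sigma2, target=0.0):
    """Positive-part James-Stein estimator in the equal-variance case:
    shrink the k sample means x toward a constant target. For k >= 3 the
    (non-truncated) rule dominates the raw means under squared error loss.
    Classical illustration; not the SHP posterior mean from the review."""
    x = np.asarray(x, dtype=float)
    k = x.size
    resid = x - target
    factor = max(0.0, 1.0 - (k - 2) * sigma2 / np.sum(resid ** 2))
    return target + factor * resid

# Ten unit-variance sample means shrunk toward zero.
rng = np.random.default_rng(1)
mu = rng.normal(0.0, 1.0, size=10)
x = rng.normal(mu, 1.0)
print(james_stein_shrink(x, sigma2=1.0))
```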
Small Area Shrinkage Estimation
The need for small area estimates is increasingly felt in both the public and
private sectors in order to formulate their strategic plans. It is now widely
recognized that direct small area survey estimates are highly unreliable owing
to large standard errors and coefficients of variation. The reason behind this
is that a survey is usually designed to achieve a specified level of accuracy
at a higher level of geography than that of small areas. Lack of additional
resources makes it almost imperative to use the same data to produce small area
estimates. For example, if a survey is designed to estimate per capita income
for a state, the same survey data need to be used to produce similar estimates
for counties, subcounties and census divisions within that state. Thus, by
necessity, small area estimation needs explicit, or at least implicit, use of
models to link these areas. Improved small area estimates are found by
"borrowing strength" from similar neighboring areas.Comment: Published in at http://dx.doi.org/10.1214/11-STS374 the Statistical
Science (http://www.imstat.org/sts/) by the Institute of Mathematical
Statistics (http://www.imstat.org
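The abstract does not commit to a particular linking model, but a common way to "borrow strength" at the area level is a Fay-Herriot style composite estimator. A minimal sketch under the assumption of known sampling variances D_i and a known between-area variance (in practice both the variance components and the regression are estimated):

```python
import numpy as np

def area_level_shrinkage(direct, X, D, sigma2_v):
    """Fay-Herriot style composite estimate for small areas: blend each
    direct survey estimate with a regression-synthetic estimate x_i' beta,
    with weight gamma_i = sigma2_v / (sigma2_v + D_i) on the direct estimate.
    Assumed model for illustration; the abstract does not specify one.
    direct: direct estimates, X: area covariates (n x p),
    D: known sampling variances, sigma2_v: between-area model variance."""
    direct, D, X = np.asarray(direct, float), np.asarray(D, float), np.asarray(X, float)
    w = 1.0 / (sigma2_v + D)                      # precision weights
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * direct))
    synthetic = X @ beta
    gamma = sigma2_v / (sigma2_v + D)
    return gamma * direct + (1.0 - gamma) * synthetic

# Six areas, intercept plus one covariate, unequal sampling variances.
X = np.column_stack([np.ones(6), np.arange(6.0)])
direct = np.array([10.2, 11.9, 13.5, 16.1, 17.8, 20.4])
D = np.array([4.0, 1.0, 2.0, 0.5, 3.0, 1.5])
print(area_level_shrinkage(direct, X, D, sigma2_v=2.0))
```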
Identification of and correction for publication bias
Some empirical results are more likely to be published than others. Such
selective publication leads to biased estimates and distorted inference. This
paper proposes two approaches for identifying the conditional probability of
publication as a function of a study's results, the first based on systematic
replication studies and the second based on meta-studies. For known conditional
publication probabilities, we propose median-unbiased estimators and associated
confidence sets that correct for selective publication. We apply our methods to
recent large-scale replication studies in experimental economics and
psychology, and to meta-studies of the effects of minimum wages and de-worming
programs.
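As an illustration of why identifying the conditional publication probability matters (a toy simulation, not the paper's median-unbiased procedure): when the probability of publication given a study's result is known, published estimates can be reweighted by its inverse to undo the selection.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy model: study estimates around a true effect; "significant" results
# (|z| > 1.96) are always published, insignificant ones with probability 0.1.
# Illustration only; not the estimator proposed in the paper.
true_effect, se, n_studies = 0.2, 0.5, 20000
estimates = rng.normal(true_effect, se, size=n_studies)
pub_prob = np.where(np.abs(estimates / se) > 1.96, 1.0, 0.1)
published = estimates[rng.random(n_studies) < pub_prob]

# The naive mean of published estimates is biased by selective publication.
naive = published.mean()

# Knowing p(publish | result), weight each published estimate by 1/p to
# recover the mean of the underlying (pre-selection) estimates.
w = 1.0 / np.where(np.abs(published / se) > 1.96, 1.0, 0.1)
corrected = np.sum(w * published) / np.sum(w)
print(f"naive={naive:.3f}  corrected={corrected:.3f}  truth={true_effect}")
```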
Nonparametric survival analysis of epidemic data
This paper develops nonparametric methods for the survival analysis of
epidemic data based on contact intervals. The contact interval from person i to
person j is the time between the onset of infectiousness in i and infectious
contact from i to j, where we define infectious contact as a contact sufficient
to infect a susceptible individual. We show that the Nelson-Aalen estimator
produces an unbiased estimate of the contact interval cumulative hazard
function when who-infects-whom is observed. When who-infects-whom is not
observed, we average the Nelson-Aalen estimates from all transmission networks
consistent with the observed data using an EM algorithm. This converges to a
nonparametric MLE of the contact interval cumulative hazard function that we
call the marginal Nelson-Aalen estimate. We study the behavior of these methods
in simulations and use them to analyze household surveillance data from the
2009 influenza A(H1N1) pandemic. In an appendix, we show that these methods
extend chain-binomial models to continuous time.
Comment: 30 pages, 6 figures
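When who-infects-whom is observed, the estimator named in the abstract is the standard Nelson-Aalen estimator applied to contact intervals, with pairs in which no infectious contact occurred treated as right-censored. A minimal sketch of that observed-transmission case (the marginal, EM-averaged version for unobserved transmission trees is not shown):

```python
import numpy as np

def nelson_aalen(times, observed):
    """Nelson-Aalen estimate of the cumulative hazard: H(t) is the sum,
    over distinct event times t_j <= t, of d_j / n_j, where d_j is the
    number of events at t_j and n_j the number at risk just before t_j.
    times: follow-up times; observed: 1 = infectious contact, 0 = censored."""
    order = np.argsort(times)
    times = np.asarray(times, float)[order]
    observed = np.asarray(observed, int)[order]
    event_times, hazard = [], []
    n_at_risk, cum, i = len(times), 0.0, 0
    while i < len(times):
        t, d, tied = times[i], 0, 0
        while i < len(times) and times[i] == t:
            d += observed[i]
            tied += 1
            i += 1
        if d > 0:
            cum += d / n_at_risk
            event_times.append(t)
            hazard.append(cum)
        n_at_risk -= tied
    return np.array(event_times), np.array(hazard)

# Contact intervals in days; zeros mark pairs with no infectious contact (censored).
t = [2.0, 3.0, 3.0, 5.0, 6.0, 6.0, 8.0]
d = [1,   1,   0,   1,   0,   1,   0]
print(nelson_aalen(t, d))
```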