
    Estimation of Distribution Overlap of Urn Models

    A classical problem in statistics is estimating the expected coverage of a sample, a problem with applications in gene expression, microbial ecology, optimization, and even numismatics. Here we consider a related extension of this problem to random samples from two discrete distributions. Specifically, we estimate what we call the dissimilarity probability of a sample, i.e., the probability that a draw from one distribution is not observed in k draws from another distribution. We show that our estimator of dissimilarity is a U-statistic and a uniformly minimum variance unbiased estimator of dissimilarity over the largest appropriate range of k. Furthermore, despite the non-Markovian nature of our estimator when applied sequentially over k, we show that it converges uniformly in probability to the dissimilarity parameter, and we present criteria under which it is approximately normally distributed and admits a consistent jackknife estimator of its variance. As proof of concept, we analyze V35 16S rRNA data to discern between various microbial environments. Other potential applications concern any situation where the dissimilarity of two discrete distributions may be of interest. For instance, in SELEX experiments, each urn could represent a random RNA pool and each draw a possible solution to a particular binding-site problem over that pool. The dissimilarity of these pools is then related to the probability of finding binding-site solutions in one pool that are absent in the other. (27 pages, 4 figures)
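    When both distributions are known, the dissimilarity parameter itself has a simple closed form: if p and q are the category probabilities of the two urns, the probability that one draw from p goes unobserved in k draws from q is $\sum_i p_i (1 - q_i)^k$. The sketch below illustrates the parameter and checks it by simulation; it is not the paper's U-statistic estimator (which works from samples rather than known distributions), and the function names are ours.

        import numpy as np

        def dissimilarity(p, q, k):
            """Probability that a single draw from p is not observed in
            k independent draws from q (categories indexed identically)."""
            p, q = np.asarray(p, float), np.asarray(q, float)
            return float(np.sum(p * (1.0 - q) ** k))

        def dissimilarity_mc(p, q, k, n_sim=100_000, seed=None):
            """Monte Carlo check of the closed-form expression."""
            rng = np.random.default_rng(seed)
            cats = np.arange(len(p))
            x = rng.choice(cats, size=n_sim, p=p)        # one draw from p per replicate
            y = rng.choice(cats, size=(n_sim, k), p=q)   # k draws from q per replicate
            missed = np.all(y != x[:, None], axis=1)     # x never appears among the k draws
            return missed.mean()

        p = [0.5, 0.3, 0.2]
        q = [0.2, 0.2, 0.6]
        print(dissimilarity(p, q, 5))      # exact value of the parameter
        print(dissimilarity_mc(p, q, 5))   # agrees up to Monte Carlo error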

    Data Sketches for Disaggregated Subset Sum and Frequent Item Estimation

    We introduce and study a new data sketch for processing massive datasets. It addresses two common problems: (1) computing a sum given arbitrary filter conditions and (2) identifying the frequent items or heavy hitters in a dataset. For the former, the sketch provides unbiased estimates with state-of-the-art accuracy. It handles the challenging scenario in which the data are disaggregated, so that computing the per-unit metric of interest requires an expensive aggregation. For example, the metric of interest may be total clicks per user while the raw data is a click stream with multiple rows per user. The sketch is thus suitable for a wide range of applications, including computing historical click-through rates for ad prediction, reporting user metrics from event streams, and measuring network traffic for IP flows. We prove and empirically show that the sketch has good properties for both the disaggregated subset sum estimation and frequent item problems. On i.i.d. data, it not only picks out the frequent items but gives strongly consistent estimates for the proportion of each frequent item. The resulting sketch asymptotically draws a probability-proportional-to-size sample that is optimal for estimating sums over the data. For non-i.i.d. data, we show that it typically does much better than random sampling for the frequent item problem and never does worse. For subset sum estimation, we show that even for pathological sequences, the variance is close to that of an optimal sampling design. Empirically, despite the disadvantage of operating on disaggregated data, our method matches or bests priority sampling, a state-of-the-art method for pre-aggregated data, and performs orders of magnitude better than uniform sampling on skewed data. We propose extensions to the sketch that allow it to be used for combining multiple datasets, in distributed systems, and for time-decayed aggregation.
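    Priority sampling, the pre-aggregated baseline mentioned above, is compact enough to sketch: each item with weight w_i receives priority w_i/u_i with u_i uniform on (0,1], the k highest-priority items are kept, and a subset sum is estimated by summing max(w_i, tau) over sampled items in the subset, where tau is the (k+1)-st largest priority. The toy implementation below is our own illustration of that baseline (names and the Pareto example are ours), not the disaggregated sketch proposed in the paper.

        import numpy as np

        def priority_sample(weights, k, seed=None):
            """Priority sampling (Duffield, Lund & Thorup): return the indices of
            the k sampled items and the threshold tau used for unbiased estimates."""
            rng = np.random.default_rng(seed)
            w = np.asarray(weights, float)
            u = rng.uniform(low=1e-12, high=1.0, size=w.size)  # avoid division by zero
            order = np.argsort(w / u)[::-1]                    # descending by priority
            tau = (w / u)[order[k]]                            # (k+1)-st largest priority
            return order[:k], tau

        def subset_sum_estimate(weights, sample, tau, in_subset):
            """Unbiased estimate of the subset sum: each sampled item in the
            subset contributes max(w_i, tau)."""
            w = np.asarray(weights, float)
            return sum(max(w[i], tau) for i in sample if in_subset[i])

        rng = np.random.default_rng(0)
        w = rng.pareto(1.5, size=10_000) + 1.0       # skewed pre-aggregated weights
        subset = rng.uniform(size=w.size) < 0.1      # an arbitrary filter condition
        sample, tau = priority_sample(w, k=500, seed=1)
        print("true    :", w[subset].sum())
        print("estimate:", subset_sum_estimate(w, sample, tau, subset))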

    Shrinkage Estimation in Multilevel Normal Models

    This review traces the evolution of theory that started when Charles Stein in 1955 [In Proc. 3rd Berkeley Sympos. Math. Statist. Probab. I (1956) 197--206, Univ. California Press] showed that using each separate sample mean from $k \ge 3$ Normal populations to estimate its own population mean $\mu_i$ can be improved upon uniformly for every possible $\mu = (\mu_1, \ldots, \mu_k)'$. The dominating estimators, referred to here as being "Model-I minimax," can be found by shrinking the sample means toward any constant vector. Admissible minimax shrinkage estimators were derived by Stein and others as posterior means based on a random effects model, "Model-II" here, wherein the $\mu_i$ values have their own distributions. Section 2 centers on Figure 2, which organizes a wide class of priors on the unknown Level-II hyperparameters that have been proved to yield admissible Model-I minimax shrinkage estimators in the "equal variance case." Putting a flat prior on the Level-II variance is unique in this class for its scale invariance and for its conjugacy, and it induces Stein's harmonic prior (SHP) on $\mu_i$. Published in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org); DOI: http://dx.doi.org/10.1214/11-STS363.
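    The opening result is easy to state concretely: with $k \ge 3$ independent sample means, each $N(\mu_i, \sigma^2)$ with $\sigma^2$ known, shrinking toward any fixed vector $c$ via $c + (1 - (k-2)\sigma^2/\lVert x - c\rVert^2)(x - c)$ has uniformly smaller total squared-error risk than the raw means. Below is a minimal sketch of this classical James-Stein estimator; the known, equal variance and the function name are our simplifying assumptions, and this is only the opening example of the review, not its Model-II machinery.

        import numpy as np

        def james_stein(x, sigma2=1.0, target=None):
            """Classical (non-positive-part) James-Stein estimator: shrink k >= 3
            sample means toward a constant vector `target` (origin by default),
            assuming each mean has known sampling variance sigma2."""
            x = np.asarray(x, float)
            k = x.size
            if k < 3:
                raise ValueError("shrinkage dominance requires k >= 3")
            c = np.zeros(k) if target is None else np.asarray(target, float)
            d = x - c
            return c + (1.0 - (k - 2) * sigma2 / np.dot(d, d)) * d

        rng = np.random.default_rng(0)
        mu = rng.normal(0.0, 1.0, size=20)    # unknown population means
        x = rng.normal(mu, 1.0)               # one observed sample mean per population
        print("raw means, squared error  :", np.sum((x - mu) ** 2))
        print("James-Stein, squared error:", np.sum((james_stein(x) - mu) ** 2))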

    Small Area Shrinkage Estimation

    The need for small area estimates is increasingly felt in both the public and private sectors, which rely on them to formulate strategic plans. It is now widely recognized that direct small area survey estimates are highly unreliable owing to large standard errors and coefficients of variation. The reason is that a survey is usually designed to achieve a specified level of accuracy at a higher level of geography than that of small areas. Lack of additional resources makes it almost imperative to use the same data to produce small area estimates. For example, if a survey is designed to estimate per capita income for a state, the same survey data need to be used to produce similar estimates for counties, subcounties, and census divisions within that state. Thus, by necessity, small area estimation needs explicit, or at least implicit, use of models to link these areas. Improved small area estimates are found by "borrowing strength" from similar neighboring areas. Published in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org); DOI: http://dx.doi.org/10.1214/11-STS374.
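    A standard way of "borrowing strength" at the area level is a Fay-Herriot-type composite estimator that shrinks each unreliable direct estimate y_i toward a regression synthetic estimate x_i'beta, with weight gamma_i = A/(A + D_i) determined by the known sampling variance D_i and the between-area variance A. The sketch below is our own illustration of that idea, with A treated as known for simplicity rather than estimated from the data; names and the simulated example are ours.

        import numpy as np

        def fay_herriot_estimates(y, X, D, A):
            """Fay-Herriot-style composite estimates: shrink each direct survey
            estimate y_i toward a synthetic regression prediction x_i' beta with
            weight gamma_i = A / (A + D_i). D_i are the known sampling variances
            of the direct estimates; A is the between-area variance, taken as
            known here for simplicity."""
            y, D, X = np.asarray(y, float), np.asarray(D, float), np.asarray(X, float)
            w = 1.0 / (A + D)                                        # GLS weights
            beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
            gamma = A / (A + D)
            return gamma * y + (1.0 - gamma) * (X @ beta)

        rng = np.random.default_rng(0)
        m = 30                                                       # number of small areas
        X = np.column_stack([np.ones(m), rng.normal(size=m)])        # intercept + one covariate
        theta = X @ np.array([2.0, 1.0]) + rng.normal(0.0, 0.5, m)   # true small area means
        D = rng.uniform(0.5, 2.0, size=m)                            # unequal sampling variances
        y = rng.normal(theta, np.sqrt(D))                            # noisy direct estimates
        est = fay_herriot_estimates(y, X, D, A=0.25)
        print("direct MSE  :", np.mean((y - theta) ** 2))
        print("shrunken MSE:", np.mean((est - theta) ** 2))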

    Identification of and correction for publication bias

    Some empirical results are more likely to be published than others. Such selective publication leads to biased estimates and distorted inference. This paper proposes two approaches for identifying the conditional probability of publication as a function of a study's results, the first based on systematic replication studies and the second based on meta-studies. For known conditional publication probabilities, we propose median-unbiased estimators and associated confidence sets that correct for selective publication. We apply our methods to recent large-scale replication studies in experimental economics and psychology, and to meta-studies of the effects of minimum wages and de-worming programs.
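    For the known-publication-probability case, the idea can be sketched with a toy selection rule: suppose a z-statistic Z ~ N(theta, 1) is always published when |Z| > 1.96 and only with probability 0.1 otherwise. A median-unbiased correction then reports the theta at which the observed z sits at the median of the published (selected) distribution. The rule, the 0.1 rate, and the function names below are illustrative assumptions, not the paper's estimates.

        import numpy as np
        from scipy import optimize, stats

        PI, CRIT = 0.1, 1.96   # assumed rule: P(publish) = 1 if |z| > CRIT, else PI

        def selected_cdf(z_obs, theta):
            """P(Z <= z_obs | published) when Z ~ N(theta, 1) before selection;
            the piecewise-constant rule gives a closed form via the normal CDF."""
            F = lambda z: stats.norm.cdf(z - theta)
            def raw(z):  # integral of pdf(z) * P(publish | z) from -inf to z
                lo = F(min(z, -CRIT))                                # left tail, weight 1
                mid = PI * (F(np.clip(z, -CRIT, CRIT)) - F(-CRIT))   # insignificant region, weight PI
                hi = max(F(z) - F(CRIT), 0.0)                        # right tail, weight 1
                return lo + mid + hi
            return raw(z_obs) / raw(np.inf)

        def median_unbiased(z_obs):
            """Corrected estimate: theta such that the observed z is the median
            of the selected (published) distribution."""
            return optimize.brentq(lambda t: selected_cdf(z_obs, t) - 0.5,
                                   z_obs - 10.0, z_obs + 10.0)

        print(median_unbiased(2.1))   # pulled toward zero relative to the naive estimate 2.1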

    Nonparametric survival analysis of epidemic data

    This paper develops nonparametric methods for the survival analysis of epidemic data based on contact intervals. The contact interval from person i to person j is the time between the onset of infectiousness in i and infectious contact from i to j, where we define infectious contact as a contact sufficient to infect a susceptible individual. We show that the Nelson-Aalen estimator produces an unbiased estimate of the contact interval cumulative hazard function when who-infects-whom is observed. When who-infects-whom is not observed, we average the Nelson-Aalen estimates from all transmission networks consistent with the observed data using an EM algorithm. This converges to a nonparametric MLE of the contact interval cumulative hazard function that we call the marginal Nelson-Aalen estimate. We study the behavior of these methods in simulations and use them to analyze household surveillance data from the 2009 influenza A(H1N1) pandemic. In an appendix, we show that these methods extend chain-binomial models to continuous time. (30 pages, 6 figures)
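    The Nelson-Aalen estimator underlying these methods is itself very simple: at each observed event time it adds d_i/n_i, the number of events divided by the number still at risk, to the cumulative hazard. The sketch below applies it to generic right-censored data with continuous (untied) times; it does not include the paper's EM averaging over transmission networks, and the simulated data are our own illustration.

        import numpy as np

        def nelson_aalen(times, events):
            """Nelson-Aalen estimate of the cumulative hazard from right-censored
            data with continuous (untied) times. `events` is 1 for an observed
            event (here, an infectious contact) and 0 for censoring."""
            times = np.asarray(times, float)
            events = np.asarray(events, int)
            order = np.argsort(times)
            times, events = times[order], events[order]
            n_at_risk = np.arange(times.size, 0, -1)   # risk-set size just before each time
            return times, np.cumsum(events / n_at_risk)

        rng = np.random.default_rng(0)
        contact = rng.exponential(scale=2.0, size=200)   # latent contact intervals
        censor = rng.exponential(scale=3.0, size=200)    # censoring, e.g. end of follow-up
        t, d = np.minimum(contact, censor), (contact <= censor).astype(int)
        grid, H = nelson_aalen(t, d)
        i = np.searchsorted(grid, 2.0)
        print(H[i], 2.0 / 2.0)   # H(2) should be near 1 for an exponential(mean 2) hazard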