3,874 research outputs found

    An Exact Algorithm for the Stratification Problem with Proportional Allocation

    We report a new optimal solution method for the statistical stratification problem under proportional sampling allocation among strata. Consider a finite population of N units, a random sample of n units selected from this population, and a number L of strata. We have to define which units belong to each stratum so as to minimize the variance of a total estimator for one variable of interest in each stratum, and consequently reduce the overall variance of that estimator. To solve this problem, an exact algorithm based on the concept of a minimal path in a graph is proposed and assessed. Computational results using real data from IBGE (Brazilian Central Statistical Office) are provided.
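    A minimal sketch of the kind of problem described above, not the paper's algorithm. Under stratified simple random sampling with proportional allocation, the variance of the estimated total is ((N - n)/n) * sum_h N_h * S_h^2, so the stratification only has to minimize sum_h N_h * S_h^2. The sketch assumes strata are contiguous runs of the sorted values and finds the cheapest path with exactly L arcs by dynamic programming; the function names and the zero cost for single-unit strata are illustrative assumptions.

```python
from itertools import accumulate


def optimal_strata(values, num_strata):
    """Split sorted values into contiguous strata minimizing sum_h N_h * S_h^2.

    Assumption (not from the abstract): strata are intervals of the sorted
    values, and a single-unit stratum contributes zero cost.
    """
    x = sorted(values)
    n = len(x)
    if not 1 <= num_strata <= n:
        raise ValueError("num_strata must be between 1 and len(values)")
    # Prefix sums give each arc cost in O(1).
    s1 = [0.0] + list(accumulate(x))
    s2 = [0.0] + list(accumulate(v * v for v in x))

    def cost(i, j):
        # N_h * S_h^2 for the stratum x[i:j], with S_h^2 using divisor N_h - 1.
        m = j - i
        if m < 2:
            return 0.0
        total = s1[j] - s1[i]
        sum_sq_dev = (s2[j] - s2[i]) - total * total / m
        return m * max(sum_sq_dev, 0.0) / (m - 1)

    inf = float("inf")
    # best[k][j]: minimal cost of splitting x[:j] into exactly k strata,
    # i.e. the shortest path from node 0 to node j using k arcs.
    best = [[inf] * (n + 1) for _ in range(num_strata + 1)]
    prev = [[0] * (n + 1) for _ in range(num_strata + 1)]
    best[0][0] = 0.0
    for k in range(1, num_strata + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                candidate = best[k - 1][i] + cost(i, j)
                if candidate < best[k][j]:
                    best[k][j] = candidate
                    prev[k][j] = i
    # Walk the predecessor table backwards to recover the stratum boundaries.
    cuts = []
    j = n
    for k in range(num_strata, 0, -1):
        i = prev[k][j]
        cuts.append((i, j))
        j = i
    cuts.reverse()
    return [x[i:j] for i, j in cuts], best[num_strata][n]


def variance_of_total(objective, population_size, sample_size):
    """Variance of the estimated total under proportional allocation."""
    return (population_size - sample_size) / sample_size * objective


strata, objective = optimal_strata([1, 2, 3, 10, 11, 12], 2)
```

    Here `strata` holds the two groups of values and `variance_of_total(objective, 6, 2)` converts the minimized objective into the variance for a sample of 2 units out of 6. The triple loop costs O(L * N^2), which is fine for a sketch but not for large populations.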

    Classifier Risk Estimation under Limited Labeling Resources

    In this paper we propose strategies for estimating the performance of a classifier when labels cannot be obtained for the whole test set. The number of test instances that can be labeled is very small compared to the size of the whole test set. The goal is then to obtain a precise estimate of classifier performance using as little labeling resource as possible. Specifically, we ask how to select a subset of the large test set for labeling such that the performance of a classifier estimated on this subset is as close as possible to its performance on the whole test set. We propose strategies based on stratified sampling for selecting this subset. We show that these strategies can reduce the variance in the estimate of classifier accuracy by a significant amount compared to simple random sampling (over 65% in several cases). Hence, our proposed methods are much more precise than random sampling for accuracy estimation under restricted labeling resources. The reduction in the number of samples required (compared to random sampling) to estimate the classifier accuracy with only 1% error is as high as 60% in some cases. Comment: PAKDD 201
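    A minimal sketch of stratified accuracy estimation under a labeling budget, written under stated assumptions rather than as the paper's procedure: strata are equal-count bins of the classifier's confidence score, the budget is split proportionally, and `is_correct` is a hypothetical labeling oracle that reports whether the prediction for instance `i` is right.

```python
import random


def stratified_accuracy(scores, is_correct, num_strata, budget, seed=0):
    """Estimate accuracy by labeling about `budget` instances.

    Assumptions (illustrative): strata are equal-count bins of the sorted
    confidence scores; allocation is proportional with at least one label
    per stratum, so the labels actually used can differ slightly from budget.
    """
    rng = random.Random(seed)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    estimate = 0.0
    labeled = 0
    for h in range(num_strata):
        stratum = order[h * n // num_strata:(h + 1) * n // num_strata]
        if not stratum:
            continue
        n_h = min(len(stratum), max(1, round(budget * len(stratum) / n)))
        chosen = rng.sample(stratum, n_h)
        accuracy_h = sum(1 for i in chosen if is_correct(i)) / n_h
        # Weight each stratum mean by its share of the test set.
        estimate += len(stratum) / n * accuracy_h
        labeled += n_h
    return estimate, labeled


scores = [i / 100 for i in range(100)]
estimate, labeled = stratified_accuracy(
    scores, lambda i: scores[i] >= 0.5, num_strata=2, budget=10
)
```

    In this toy case correctness is constant inside each stratum, so the within-stratum variance is zero and the estimate is exact. The gain over simple random sampling in practice depends on how well the stratification variable tracks correctness.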

    On adaptive stratification

    This paper investigates the use of stratified sampling as a variance reduction technique for approximating integrals over high-dimensional spaces. The accuracy of this method depends critically on the choice of the space partition (the strata), which should ideally be fitted to the subsets where the function to integrate is nearly constant, and on the allocation of the number of samples within each stratum. When the dimension is large and the function to integrate is complex, finding such partitions and allocating the samples is a highly non-trivial problem. In this work, we investigate a novel method to improve the efficiency of the estimator "on the fly", by jointly sampling and adapting the strata and the allocation within the strata. The accuracy of estimators when this method is used is examined in detail in the so-called asymptotic regime (i.e. when both the number of samples and the number of strata are large). We illustrate the use of the method for the computation of the price of path-dependent options in models with both constant and stochastic volatility. The use of this adaptive technique yields variance reduction by factors sometimes larger than 1000 compared to classical Monte Carlo estimators.
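    To illustrate why the choice of strata matters, here is a minimal sketch of non-adaptive stratified Monte Carlo on the unit interval, compared with plain Monte Carlo. It does not implement the adaptive method of the paper; the equal-width strata, the fixed number of samples per stratum, and the function names are assumptions made for the example.

```python
import random


def stratified_integral(f, num_strata, samples_per_stratum, seed=0):
    """Estimate the integral of f over [0, 1] with equal-width strata."""
    rng = random.Random(seed)
    width = 1.0 / num_strata
    total = 0.0
    for h in range(num_strata):
        left = h * width
        # Draw uniformly inside the stratum [left, left + width).
        stratum_mean = sum(
            f(left + width * rng.random()) for _ in range(samples_per_stratum)
        ) / samples_per_stratum
        # Each stratum contributes its probability mass times its mean.
        total += width * stratum_mean
    return total


def plain_monte_carlo(f, num_samples, seed=0):
    """Classical Monte Carlo estimate of the same integral."""
    rng = random.Random(seed)
    return sum(f(rng.random()) for _ in range(num_samples)) / num_samples


stratified = stratified_integral(lambda x: x * x, num_strata=100, samples_per_stratum=10)
classical = plain_monte_carlo(lambda x: x * x, num_samples=1000)
```

    Both estimators use 1000 evaluations of the integrand. The stratified one only sees the variation of the function inside each narrow stratum, which is the effect an adaptive scheme tries to exploit when good strata are not known in advance.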

    Estimating Node Influenceability in Social Networks

    Influence analysis is a fundamental problem in social network analysis and mining. Important applications of influence analysis in social networks include influence maximization for viral marketing, finding the most influential nodes, online advertising, etc. For many of these applications, it is crucial to evaluate the influenceability of a node. In this paper, we study the problem of evaluating the influenceability of nodes in a social network based on a widely used influence spread model, namely the independent cascade model. Since this problem is #P-complete, most existing work is based on Naive Monte-Carlo (NMC) sampling. However, the NMC estimator typically has a large variance, which significantly reduces its effectiveness. To overcome this problem, we propose two families of new estimators based on the idea of stratified sampling. We first present two basic stratified sampling (BSS) estimators, namely the BSS-I estimator and the BSS-II estimator, which partition the entire population into 2^r and r+1 strata, respectively, by choosing r edges. Second, to further reduce the variance, we find that both the BSS-I and BSS-II estimators can be applied recursively on each stratum, and we therefore propose two recursive stratified sampling (RSS) estimators, namely the RSS-I estimator and the RSS-II estimator. Theoretically, all of our estimators are shown to be unbiased, and their variances are significantly smaller than the variance of the NMC estimator. Finally, our extensive experimental results on both synthetic and real datasets demonstrate the efficiency and accuracy of our new estimators.
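    For orientation, a minimal sketch of the naive Monte Carlo baseline under the independent cascade model; it does not implement the BSS or RSS estimators. The graph representation (a dict mapping each node to a list of `(neighbor, probability)` pairs), the function names, and the convention that the seed node counts toward its own spread are assumptions for the example.

```python
import random
from collections import deque


def simulate_cascade(graph, seed_node, rng):
    """Run one independent cascade and return the number of activated nodes."""
    active = {seed_node}
    queue = deque([seed_node])
    while queue:
        u = queue.popleft()
        for v, prob in graph.get(u, []):
            # Each edge gets exactly one activation attempt, made when its
            # source node becomes active.
            if v not in active and rng.random() < prob:
                active.add(v)
                queue.append(v)
    return len(active)


def naive_monte_carlo_spread(graph, seed_node, num_samples, seed=0):
    """Average the cascade size over independent simulations."""
    rng = random.Random(seed)
    total = sum(simulate_cascade(graph, seed_node, rng) for _ in range(num_samples))
    return total / num_samples


graph = {"a": [("b", 0.5)], "b": [("c", 0.5)]}
estimate = naive_monte_carlo_spread(graph, "a", num_samples=20000)
```

    For this three-node chain the exact expected spread is 1 + 0.5 + 0.25 = 1.75, so the estimate should land close to that value. The sample-to-sample spread of the cascade size is what the stratified estimators in the abstract aim to reduce.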

    Minimum variance stratification of a finite population

    This paper considers the combined problem of allocation and stratification in order to minimise the variance of the expansion estimator of a total, taking into account that the population is finite. The proof of necessary minimum variance conditions utilises the Kuhn-Tucker Theorem. Stratified simple random sampling with non-negligible sampling fractions is an important design in sample surveys. We go beyond limiting assumptions that have often been used in the past, such as that the stratification equals the study variable or that the sampling fractions are small. We discuss what difference the sampling fractions make for stratification. In particular, in many surveys the sampling fraction equals one for some strata. The main theorem of this paper is applied to two populations with different characteristics, one being a business population and the other a small population of 284 Swedish municipalities. We study empirically the sensitivity of deviations from the optimal solution.

    Matching on-the-fly in Sequential Experiments for Higher Power and Efficiency

    We propose a dynamic allocation procedure that increases power and efficiency when measuring an average treatment effect in sequential randomized trials. Subjects arrive one at a time and are either randomized or paired, via a matching criterion, to a previously randomized subject and administered the alternate treatment. We develop estimators for the average treatment effect that combine information from both the matched pairs and the unmatched subjects, as well as an exact test. Simulations illustrate the method's higher efficiency and power over competing allocation procedures in both controlled scenarios and historical experimental data. Comment: 20 pages, 1 algorithm, 2 figures, 8 tables
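    The allocation rule described above can be sketched roughly as follows. This is a simplified illustration, not the paper's procedure: it assumes a single scalar covariate per subject, absolute difference as the matching distance, and a fixed hypothetical `threshold`, none of which is specified in the abstract.

```python
import random


def assign_sequentially(covariates, threshold, seed=0):
    """Assign treatments (0 or 1) to subjects arriving one at a time.

    Assumptions (illustrative): scalar covariates, absolute-difference
    distance, and a fixed matching threshold.
    """
    rng = random.Random(seed)
    reservoir = []    # randomized subjects not yet matched: (index, x, treatment)
    assignments = []  # treatment given to each subject, in arrival order
    pairs = []        # (earlier_index, later_index) for each matched pair
    for idx, x in enumerate(covariates):
        nearest = min(reservoir, key=lambda r: abs(r[1] - x), default=None)
        if nearest is not None and abs(nearest[1] - x) <= threshold:
            # Close enough: pair with the earlier subject and give the
            # alternate treatment.
            reservoir.remove(nearest)
            treatment = 1 - nearest[2]
            pairs.append((nearest[0], idx))
        else:
            # No acceptable match: randomize and keep the subject available.
            treatment = rng.randint(0, 1)
            reservoir.append((idx, x, treatment))
        assignments.append(treatment)
    return assignments, pairs


assignments, pairs = assign_sequentially([0.0, 10.0, 0.1, 10.1], threshold=0.5)
```

    With these four arrivals the third subject is paired with the first and the fourth with the second, each receiving the opposite treatment of its match. Subjects left in the reservoir at the end are the unmatched ones that the estimators in the abstract combine with the matched pairs.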

    Adaptive stratified sampling for non-smooth problems

    Science and engineering problems subject to uncertainty are frequently both computationally expensive and nonsmooth in their parameter dependence, which makes standard Monte Carlo too slow and rules out the efficient use of accelerated uncertainty quantification methods that rely on strict smoothness assumptions. To address these challenges, we propose an adaptive stratification method suitable for nonsmooth problems, with significantly reduced variance compared to Monte Carlo sampling. The stratification is iteratively refined and samples are added sequentially to satisfy an allocation criterion combining the benefits of proportional and optimal sampling. Theoretical estimates are provided for the expected performance and for the probability of failing to correctly estimate essential statistics. We devise a practical adaptive stratification method with strata of the same kind of geometrical shape and cost-effective refinement satisfying a greedy variance reduction criterion. Numerical experiments corroborate the theoretical findings and exhibit speedups of up to three orders of magnitude compared to standard Monte Carlo sampling. Comment: 37 pages, 12 figures

    Partitioned Sampling of Public Opinions Based on Their Social Dynamics

    Public opinion polling is usually done by random sampling from the entire population, treating individual opinions as independent. In the real world, individuals' opinions are often correlated, e.g., among friends in a social network. In this paper, we explore the idea of partitioned sampling, which partitions individuals with high opinion similarities into groups and then samples every group separately to obtain an accurate estimate of the population opinion. We rigorously formulate this idea as an optimization problem. We then show that the simple partitions, which contain only one sample in each group, are always better, and reduce finding the optimal simple partition to a well-studied Min-r-Partition problem. We adapt an approximation algorithm and a heuristic algorithm to solve the optimization problem. Moreover, to obtain opinion similarity efficiently, we adapt a well-known opinion evolution model to characterize social interactions, and provide an exact computation of opinion similarities based on the model. We use both synthetic and real-world datasets to demonstrate that the partitioned sampling method significantly improves sampling quality and is robust when some opinion similarities are inaccurate or even missing.

    Learning to Sample: Counting with Complex Queries

    We study the problem of efficiently estimating counts for queries involving complex filters, such as user-defined functions, or predicates involving self-joins and correlated subqueries. For such queries, traditional sampling techniques may not be applicable because the complexity of the filter prevents sampling over joins, and sampling after the join may not be feasible because of the cost of computing the full join. The other natural approach, training an inexpensive classifier and using it to estimate the count instead of the expensive predicate, suffers from the difficulty of training a good classifier and of giving meaningful confidence intervals. In this paper we propose a new method of learning to sample, in which we combine the best of both worlds by using sampling in two phases. First, we use samples to learn a probabilistic classifier, and then we use the classifier to design a stratified sampling method to obtain the final estimates. We theoretically analyze algorithms for obtaining an optimal stratification, and compare our approach with a suite of natural alternatives such as quantification learning, weighted and stratified sampling, and other techniques from the literature. We also provide extensive experiments in diverse use cases using multiple real and synthetic datasets to evaluate the quality, efficiency, and robustness of our approach.

    Conditional inference with a complex sampling: exact computations and Monte Carlo estimations

    In survey statistics, the usual technique for estimating a population total consists of summing appropriately weighted variable values for the units in the sample. Different weighting systems exist: sampling weights, GREG weights or calibration weights, for example. In this article, we propose to use the inverse of conditional inclusion probabilities as the weighting system. We study examples where auxiliary information makes it possible to perform an a posteriori stratification of the population. We show that, in these cases, exact computations of the conditional weights are possible. When the auxiliary information consists of the knowledge of a quantitative variable for all the units of the population, we show that the conditional weights can be estimated via Monte Carlo simulations. This method is applied to outlier and strata-jumper adjustments.
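    As a small illustration of a posteriori stratification, the sketch below computes the classical post-stratified estimator of a total. It assumes simple random sampling without replacement and known stratum sizes, in which case conditioning on the realized stratum sample counts gives each sampled unit the weight N_h / n_h. It is not the paper's general conditional-weight computation, and the function name and data layout are assumptions for the example.

```python
from collections import defaultdict


def post_stratified_total(sample, stratum_sizes):
    """Estimate a population total as sum_h N_h * (sample mean in stratum h).

    sample: list of (stratum_label, value) pairs observed in the sample.
    stratum_sizes: dict mapping each stratum label to its known size N_h.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for label, value in sample:
        sums[label] += value
        counts[label] += 1
    empty = [h for h in stratum_sizes if counts[h] == 0]
    if empty:
        # With no sampled unit in a stratum, the weight N_h / n_h is undefined.
        raise ValueError(f"no sampled units in strata: {empty}")
    return sum(stratum_sizes[h] * sums[h] / counts[h] for h in stratum_sizes)


sample = [("a", 1.0), ("a", 3.0), ("b", 10.0)]
total = post_stratified_total(sample, {"a": 4, "b": 6})
```

    Here the unconditional expansion estimator would give 10 * (14 / 3), about 46.7, because stratum "b" happens to be under-represented in the sample, while the post-stratified weights correct for the realized counts and give 4 * 2 + 6 * 10 = 68.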