3,874 research outputs found
An Exact Algorithm for the Stratification Problem with Proportional Allocation
We report a new optimal resolution for the statistical stratification problem
under proportional sampling allocation among strata. Consider a finite
population of N units, a random sample of n units selected from this population
and a number L of strata. Thus, we have to define which units belong to each
stratum so as to minimize the variance of a total estimator for one desired
variable of interest in each stratum, and consequently reduce the overall
variance for such quantity. In order to solve this problem, an exact algorithm
based on the concept of minimal path in a graph is proposed and assessed.
Computational results using real data from IBGE (Brazilian Central Statistical
Office) are provided.
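To make the objective in this abstract concrete, here is a minimal Python sketch of the quantity being minimized. It assumes stratified simple random sampling with exact proportional allocation n_h = n * N_h / N (no rounding), for which the variance of the estimated total is ((N - n) / n) * sum_h N_h * S_h^2. The brute-force search over cut points is only an illustration for tiny populations; it is not the minimal-path algorithm the paper proposes, and the function names are hypothetical.

```python
from itertools import combinations
from statistics import variance


def proportional_variance(strata, n):
    """Variance of the stratified estimator of a total under proportional allocation.

    Assumes n_h = n * N_h / N exactly, so V = ((N - n) / n) * sum_h N_h * S_h^2.
    """
    N = sum(len(s) for s in strata)
    return (N - n) / n * sum(len(s) * variance(s) for s in strata)


def best_strata_brute_force(values, L, n):
    """Try every way to cut the sorted values into L contiguous strata."""
    xs = sorted(values)
    best = None
    for cuts in combinations(range(2, len(xs) - 1), L - 1):
        bounds = (0, *cuts, len(xs))
        strata = [xs[a:b] for a, b in zip(bounds, bounds[1:])]
        # Each stratum needs at least 2 units for S_h^2 to be defined.
        if any(len(s) < 2 for s in strata):
            continue
        v = proportional_variance(strata, n)
        if best is None or v < best[0]:
            best = (v, strata)
    return best
```

For the six values 1, 2, 3, 10, 11, 12 with L = 2 and n = 2, the search separates the three low values from the three high ones.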
Classifier Risk Estimation under Limited Labeling Resources
In this paper we propose strategies for estimating performance of a
classifier when labels cannot be obtained for the whole test set. The number of
test instances which can be labeled is very small compared to the whole test
data size. The goal then is to obtain a precise estimate of classifier
performance using as little labeling resource as possible. Specifically, we ask
how to select a subset of the large test set for labeling such that
the performance of a classifier estimated on this subset is as close as
possible to the one on the whole test set. We propose strategies based on
stratified sampling for selecting this subset. We show that these strategies
can reduce the variance in estimation of classifier accuracy by a significant
amount compared to simple random sampling (over 65% in several cases). Hence,
our proposed methods are much more precise compared to random sampling for
accuracy estimation under restricted labeling resources. The reduction in
number of samples required (compared to random sampling) to estimate the
classifier accuracy with only 1% error is as high as 60% in some cases. Comment: PAKDD 201
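The general idea of stratified accuracy estimation can be sketched as follows. This is an illustration under stated assumptions, not the paper's specific strategies: strata are formed as equal-size bins of the classifier's confidence scores, labels are allocated proportionally, and `is_correct` is a hypothetical labeling oracle standing in for a human annotator.

```python
import random


def stratified_accuracy(scores, is_correct, n_strata, budget, seed=0):
    """Estimate accuracy by labelling only about `budget` test instances.

    scores[i]     : classifier confidence for instance i (used only to form strata)
    is_correct(i) : hypothetical labelling oracle; True if the prediction on i is right
    """
    rng = random.Random(seed)
    N = len(scores)
    order = sorted(range(N), key=lambda i: scores[i])
    # Equal-size strata of instances with similar confidence.
    bounds = [round(k * N / n_strata) for k in range(n_strata + 1)]
    estimate = 0.0
    for a, b in zip(bounds, bounds[1:]):
        stratum = order[a:b]
        if not stratum:
            continue
        # Proportional allocation, at least one label per stratum
        # (so the total can slightly exceed `budget`).
        n_h = min(len(stratum), max(1, round(budget * len(stratum) / N)))
        labelled = rng.sample(stratum, n_h)
        acc_h = sum(is_correct(i) for i in labelled) / n_h
        # Weight each stratum's accuracy by its share of the test set.
        estimate += len(stratum) / N * acc_h
    return estimate
```

When the budget covers the whole test set, the estimate coincides with the true accuracy; with a smaller budget it is an unbiased estimate whose variance depends on how homogeneous the strata are.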
On adaptive stratification
This paper investigates the use of stratified sampling as a variance
reduction technique for approximating integrals over large dimensional spaces.
The accuracy of this method critically depends on the choice of the space
partition, the strata, which should be ideally fitted to the subsets where the
function to integrate is nearly constant, and on the allocation of the number
of samples within each stratum. When the dimension is large and the function to
integrate is complex, finding such partitions and allocating the sample is a
highly non-trivial problem. In this work, we investigate a novel method to
improve the efficiency of the estimator "on the fly", by jointly sampling and
adapting the strata and the allocation within the strata. The accuracy of
estimators when this method is used is examined in detail, in the so-called
asymptotic regime (i.e. when both the number of samples and the number of
strata are large). We illustrate the use of the method for the computation of
the price of path-dependent options in models with both constant and stochastic
volatility. The use of this adaptive technique yields variance reduction by
factors sometimes larger than 1000 compared to classical Monte Carlo
estimators.
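As a baseline for what the adaptive method improves on, here is a minimal sketch of plain (non-adaptive) stratified Monte Carlo integration on [0, 1]. It assumes fixed equal-width strata and the same number of samples in each; the paper's contribution, adapting the strata and the allocation during sampling, is not implemented here, and the function name is hypothetical.

```python
import random


def stratified_mc(f, n_strata, samples_per_stratum, seed=0):
    """Estimate the integral of f over [0, 1] with equal-width strata."""
    rng = random.Random(seed)
    width = 1.0 / n_strata
    total = 0.0
    for k in range(n_strata):
        lo = k * width
        # Average of f over uniform draws restricted to the k-th stratum.
        mean_k = sum(
            f(lo + width * rng.random()) for _ in range(samples_per_stratum)
        ) / samples_per_stratum
        # Each stratum contributes its mean times its width (its probability mass).
        total += width * mean_k
    return total
```

For a smooth integrand such as f(x) = x^2, the within-stratum variance shrinks as the strata get narrower, which is the effect the adaptive scheme tries to exploit in high dimension.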
Estimating Node Influenceability in Social Networks
Influence analysis is a fundamental problem in social network analysis and
mining. Important applications of influence analysis in social networks
include influence maximization for viral marketing, finding the most
influential nodes, online advertising, etc. For many of these applications, it
is crucial to evaluate the influenceability of a node. In this paper, we study
the problem of evaluating the influenceability of nodes in social networks based on
the widely used influence spread model, namely, the independent cascade model.
Since this problem is #P-complete, most existing work is based on Naive
Monte-Carlo (NMC) sampling. However, the NMC estimator typically results in a
large variance, which significantly reduces its effectiveness. To overcome this
problem, we propose two families of new estimators based on the idea of
stratified sampling. We first present two basic stratified sampling (BSS)
estimators, namely the BSS-I estimator and the BSS-II estimator, which partition the
entire population into and strata by choosing edges
respectively. Second, to further reduce the variance, we find that both the BSS-I
and BSS-II estimators can be applied recursively on each stratum, and thus we
propose two recursive stratified sampling (RSS) estimators, namely the RSS-I
estimator and the RSS-II estimator. Theoretically, all of our estimators are shown
to be unbiased and their variances are significantly smaller than the variance
of the NMC estimator. Finally, our extensive experimental results on both
synthetic and real datasets demonstrate the efficiency and accuracy of our new
estimators.
Minimum variance stratification of a finite population
This paper considers the combined problem of allocation and stratification in order to minimise the variance of the expansion estimator of a total, taking into account that the population is finite. The proof of necessary minimum variance conditions utilises the Kuhn-Tucker Theorem. Stratified simple random sampling with non-negligible sampling fractions is an important design in sample surveys. We go beyond limiting assumptions that have often been used in the past, such as that the stratification equals the study variable or that the sampling fractions are small. We discuss what difference the sampling fractions will make for stratification. In particular, in many surveys the sampling fraction equals one for some strata. The main theorem of this paper is applied to two populations with different characteristics, one of them being a business population and the other one a small population of 284 Swedish municipalities. We study empirically the sensitivity of deviations from the optimal solution.
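The role of the sampling fractions can be seen in the standard variance of the expansion estimator under stratified simple random sampling, V = sum_h N_h^2 * (1 - n_h / N_h) * S_h^2 / n_h. A minimal Python sketch follows; the function name and the example numbers are illustrative assumptions, not taken from the paper.

```python
def expansion_variance(N_h, S2_h, n_h):
    """Variance of the expansion estimator of a total under stratified SRS.

    N_h  : stratum population sizes
    S2_h : stratum variances of the study variable (N_h - 1 denominator)
    n_h  : stratum sample sizes, with 1 <= n_h <= N_h
    """
    return sum(
        N ** 2 * (1 - n / N) * S2 / n  # (1 - n / N) is the finite population correction
        for N, S2, n in zip(N_h, S2_h, n_h)
    )
```

With N_h = [100, 5], S2_h = [4.0, 900.0] and n_h = [10, 5], the second stratum is a take-all stratum (sampling fraction one) and contributes nothing, however large its variance, so the total is 3600.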
Matching on-the-fly in Sequential Experiments for Higher Power and Efficiency
We propose a dynamic allocation procedure that increases power and efficiency
when measuring an average treatment effect in sequential randomized trials.
Subjects arrive iteratively and are either randomized or paired via a matching
criterion to a previously randomized subject and administered the alternate
treatment. We develop estimators for the average treatment effect that combine
information from both the matched pairs and unmatched subjects as well as an
exact test. Simulations illustrate the method's higher efficiency and power
over competing allocation procedures in both controlled scenarios and
historical experimental data. Comment: 20 pages, 1 algorithm, 2 figures, 8 tables
Adaptive stratified sampling for non-smooth problems
Science and engineering problems subject to uncertainty are frequently both
computationally expensive and feature nonsmooth parameter dependence, making
standard Monte Carlo too slow, and excluding efficient use of accelerated
uncertainty quantification methods relying on strict smoothness assumptions. To
remedy these challenges, we propose an adaptive stratification method suitable
for nonsmooth problems and with significantly reduced variance compared to
Monte Carlo sampling. The stratification is iteratively refined and samples are
added sequentially to satisfy an allocation criterion combining the benefits of
proportional and optimal sampling. Theoretical estimates are provided for the
expected performance and probability of failure to correctly estimate essential
statistics. We devise a practical adaptive stratification method with strata of
the same kind of geometrical shape and a cost-effective refinement that satisfies a
greedy variance reduction criterion. Numerical experiments corroborate the
theoretical findings and exhibit speedups of up to three orders of magnitude
compared to standard Monte Carlo sampling. Comment: 37 pages, 12 figures
Partitioned Sampling of Public Opinions Based on Their Social Dynamics
Public opinion polling is usually done by random sampling from the entire
population, treating individual opinions as independent. In the real world,
individuals' opinions are often correlated, e.g., among friends in a social
network. In this paper, we explore the idea of partitioned sampling, which
partitions individuals with high opinion similarities into groups and then
samples every group separately to obtain an accurate estimate of the population
opinion. We rigorously formulate the above idea as an optimization problem. We
then show that the simple partitions which contain only one sample in each
group are always better, and reduce finding the optimal simple partition to a
well-studied Min-r-Partition problem. We adapt an approximation algorithm and a
heuristic algorithm to solve the optimization problem. Moreover, to obtain
opinion similarity efficiently, we adapt a well-known opinion evolution model
to characterize social interactions, and provide an exact computation of
opinion similarities based on the model. We use both synthetic and real-world
datasets to demonstrate that the partitioned sampling method results in
significant improvement in sampling quality and it is robust when some opinion
similarities are inaccurate or even missing.
Learning to Sample: Counting with Complex Queries
We study the problem of efficiently estimating counts for queries involving
complex filters, such as user-defined functions, or predicates involving
self-joins and correlated subqueries. For such queries, traditional sampling
techniques may not be applicable due to the complexity of the filter preventing
sampling over joins, and sampling after the join may not be feasible due to the
cost of computing the full join. The other natural approach of training and
using an inexpensive classifier to estimate the count instead of the expensive
predicate suffers from the difficulties in training a good classifier and
giving meaningful confidence intervals. In this paper we propose a new method
of learning to sample where we combine the best of both worlds by using
sampling in two phases. First, we use samples to learn a probabilistic
classifier, and then use the classifier to design a stratified sampling method
to obtain the final estimates. We theoretically analyze algorithms for
obtaining an optimal stratification, and compare our approach with a suite of
natural alternatives like quantification learning, weighted and stratified
sampling, and other techniques from the literature. We also provide extensive
experiments in diverse use cases using multiple real and synthetic datasets to
evaluate the quality, efficiency, and robustness of our approach.
Conditional inference with a complex sampling: exact computations and Monte Carlo estimations
In survey statistics, the usual technique for estimating a population total
consists in summing appropriately weighted variable values for the units in the
sample. Different weighting systems exist: sampling weights, GREG weights or
calibration weights, for example. In this article, we propose to use the inverse
of conditional inclusion probabilities as a weighting system. We study examples
where auxiliary information makes it possible to perform an a posteriori
stratification of the population. We show that, in these cases, exact
computations of the conditional weights are possible. When the auxiliary
information consists of the knowledge of a quantitative variable for all the
units of the population, we show that the conditional weights can be
estimated via Monte Carlo simulations. This method is applied to outlier and
strata-jumper adjustments.
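For the simplest case, simple random sampling followed by a posteriori stratification, conditioning on the realized stratum sample counts n_h gives each sampled unit in stratum h the conditional inclusion probability n_h / N_h, hence the weight N_h / n_h. The sketch below assumes exactly this setting; the function name and data layout are hypothetical, and the paper's treatment of more complex designs is not reproduced.

```python
from collections import defaultdict


def poststratified_total(sample, stratum_sizes):
    """Estimate a population total with conditional weights N_h / n_h.

    sample        : list of (stratum_label, y) pairs from a simple random sample
    stratum_sizes : dict mapping stratum_label -> population size N_h
    Strata with no sampled unit are not represented in the estimate.
    """
    values = defaultdict(list)
    for label, y in sample:
        values[label].append(y)
    # Weight every sampled value in stratum h by N_h / n_h.
    return sum(stratum_sizes[h] / len(ys) * sum(ys) for h, ys in values.items())
```

For example, the sample [("a", 2.0), ("a", 4.0), ("b", 10.0)] with N_a = 10 and N_b = 5 gives 10/2 * 6 + 5/1 * 10 = 80.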