One-step Estimation of Networked Population Size: Respondent-Driven Capture-Recapture with Anonymity
Population size estimates for hidden and hard-to-reach populations are
particularly important when members are known to suffer from disproportionate
health issues or to pose health risks to the larger ambient population in which
they are embedded. Efforts to derive size estimates are often frustrated by a
range of factors that preclude conventional survey strategies, including social
stigma associated with group membership or members' involvement in illegal
activities.
This paper extends prior research on the problem of network population size
estimation, building on established survey/sampling methodologies commonly used
with hard-to-reach groups. Three novel one-step, network-based population size
estimators are presented, to be used in the context of uniform random sampling,
respondent-driven sampling, and when networks exhibit significant clustering
effects. Provably sufficient conditions for the consistency of these estimators
(in large configuration networks) are given. Simulation experiments across a
wide range of synthetic network topologies validate the performance of the
estimators, which are seen to perform well on a real-world location-based
social networking data set with significant clustering. Finally, the proposed
schemes are extended to allow them to be used in settings where participant
anonymity is required. Systematic experiments show favorable tradeoffs between
anonymity guarantees and estimator performance.
Taken together, we demonstrate that reasonable population estimates can be
derived from anonymous respondent-driven samples of 250-750 individuals, within
ambient populations of 5,000-40,000. The method thus represents a novel and
cost-effective means for health planners and those agencies concerned with
health and disease surveillance to estimate the size of hidden populations.
Limitations and future work are discussed in the concluding section.
Sampling and Inference for Beta Neutral-to-the-Left Models of Sparse Networks
Empirical evidence suggests that heavy-tailed degree distributions occurring
in many real networks are well-approximated by power laws with exponents
that may take values either less than or greater than two. Models based on
various forms of exchangeability are able to capture power laws with exponents
less than two, and admit tractable inference algorithms; we draw on previous
results to show that exponents greater than two cannot be generated by the
forms of exchangeability used
in existing random graph models. Preferential attachment models generate power
law exponents greater than two, but have been of limited use as statistical
models due to the inherent difficulty of performing inference in
non-exchangeable models. Motivated by this gap, we design and implement
inference algorithms for a recently proposed class of models that generates
exponents of all possible values. We show that although they are not exchangeable,
these models have probabilistic structure amenable to inference. Our methods
make a large class of previously intractable models useful for statistical
inference.
Comment: Accepted for publication in the proceedings of Conference on
Uncertainty in Artificial Intelligence (UAI) 201
Latent demographic profile estimation in hard-to-reach groups
The sampling frame in most social science surveys excludes members of certain
groups, known as hard-to-reach groups. These groups, or subpopulations, may be
difficult to access (the homeless, e.g.), camouflaged by stigma (individuals
with HIV/AIDS), or both (commercial sex workers). Even basic demographic
information about these groups is typically unknown, especially in many
developing nations. We present statistical models which leverage social network
structure to estimate demographic characteristics of these subpopulations using
aggregated relational data (ARD), or questions of the form "How many X's do you
know?" Unlike other network-based techniques for reaching these groups, ARD
require no special sampling strategy and are easily incorporated into standard
surveys. ARD also do not require respondents to reveal their own group
membership. We propose a Bayesian hierarchical model for estimating the
demographic characteristics of hard-to-reach groups, or latent demographic
profiles, using ARD. We propose two estimation techniques. First, we propose a
Markov-chain Monte Carlo algorithm for existing data or cases where the full
posterior distribution is of interest. For cases when new data can be
collected, we propose guidelines and, based on these guidelines, propose a
simple estimate motivated by a missing data approach. Using data from McCarty
et al. [Human Organization 60 (2001) 28-39], we estimate the age and gender
profiles of six hard-to-reach groups, such as individuals who have HIV, women
who were raped, and homeless persons. We also evaluate our simple estimates
using simulation studies.
Comment: Published at http://dx.doi.org/10.1214/12-AOAS569 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
Generalized least squares can overcome the critical threshold in respondent-driven sampling
In order to sample marginalized and/or hard-to-reach populations,
respondent-driven sampling (RDS) and similar techniques reach their
participants via peer referral. Under a Markov model for RDS, previous research
has shown that if the typical participant refers too many contacts, then the
variance of common estimators does not decay like $O(n^{-1})$, where $n$ is the
sample size. This implies that confidence intervals will be far wider than
under a typical sampling design. Here we show that generalized least squares
(GLS) can effectively reduce the variance of RDS estimates. In particular, a
theoretical analysis indicates that the variance of the GLS estimator is
$O(n^{-1})$. We then derive two classes of feasible GLS estimators. The first
class is based upon a Degree Corrected Stochastic Blockmodel for the underlying
social network. The second class is based upon a rank-two model. It might be of
independent interest that in both model classes, the theoretical results show
that it is possible to estimate the spectral properties of the population
network from the sampled observations. Simulations on empirical social networks
show that the feasible GLS (fGLS) estimators can have drastically smaller error
and rarely increase the error. A diagnostic plot helps to identify where fGLS
will aid estimation. The fGLS estimators continue to outperform standard
estimators even when they are built from a misspecified model and when there is
preferential recruitment.
Comment: Submitte
Outward Influence and Cascade Size Estimation in Billion-scale Networks
Estimating cascade size and nodes' influence is a fundamental task in social,
technological, and biological networks. Yet this task is extremely challenging
due to the sheer size and the structural heterogeneity of networks. We
investigate a new influence measure, termed outward influence (OI), defined as
the (expected) number of nodes that a subset of nodes S will activate,
excluding the nodes in S. Thus, OI equals the de facto standard measure, the
influence spread of S, minus |S|. OI is not only more informative for nodes with
small influence, but also critical in designing new effective sampling and
statistical estimation methods.
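The relationship between outward influence and influence spread can be illustrated with a naive Monte Carlo sketch. This is not the SIEA/SOIEA method described in this abstract. It assumes an independent cascade diffusion model with a single uniform edge probability `p`, which the abstract does not specify, and the function names and adjacency-dict graph format are hypothetical.

```python
import random


def simulate_cascade(graph, seeds, p, rng):
    """Run one independent-cascade simulation (assumed model).

    graph maps each node to an iterable of its out-neighbors.
    Returns the set of all activated nodes, seeds included.
    """
    active = set(seeds)
    frontier = list(active)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in graph.get(u, ()):
                # Each newly active node gets one chance to activate v.
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active


def estimate_outward_influence(graph, seeds, p=0.1, runs=1000, seed=0):
    """Average, over runs, of (activated nodes) minus |S|.

    This is the influence spread of S minus |S|, i.e. the expected
    number of activated nodes excluding the seeds themselves.
    """
    rng = random.Random(seed)
    seeds = set(seeds)
    total = 0
    for _ in range(runs):
        total += len(simulate_cascade(graph, seeds, p, rng)) - len(seeds)
    return total / runs


if __name__ == "__main__":
    toy_graph = {0: [1, 2], 1: [3], 2: [], 3: []}
    print(estimate_outward_influence(toy_graph, [0], p=0.5, runs=2000))
```

For a seed set with small influence, most of the influence spread is the constant |S|, so subtracting it leaves only the part that varies between runs.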
Based on OI, we propose SIEA/SOIEA, novel methods to estimate influence
spread/outward influence at scale and with rigorous theoretical guarantees. The
proposed methods are built on two novel components: 1) IICP, an importance
sampling method for outward influence, and 2) RSA, a robust mean estimation
method that minimizes the number of samples by analyzing the variance and range
of random variables. Compared to the state-of-the-art for influence estimation,
SIEA is times faster in theory and up to several orders of
magnitude faster in practice. For the first time, influence of nodes in the
networks of billions of edges can be estimated with high accuracy within a few
minutes. Our comprehensive experiments on real-world networks also give
evidence against the popular practice of using a fixed number, e.g. 10K or 20K,
of samples to compute the "ground truth" for influence spread.
Comment: 16 pages, SIGMETRICS 201