    Network Structure and Biased Variance Estimation in Respondent Driven Sampling

    This paper explores bias in the estimation of sampling variance in Respondent Driven Sampling (RDS). Prior methodological work on RDS has focused on its problematic assumptions and the biases and inefficiencies of its estimators of the population mean. Nonetheless, researchers have given only slight attention to the topic of estimating sampling variance in RDS, despite the importance of variance estimation for the construction of confidence intervals and hypothesis tests. In this paper, we show that the estimators of RDS sampling variance rely on a critical assumption that the network is First Order Markov (FOM) with respect to the dependent variable of interest. We demonstrate, through intuitive examples, mathematical generalizations, and computational experiments, that current RDS variance estimators will always underestimate the population sampling variance of RDS in empirical networks that do not conform to the FOM assumption. Analysis of 215 observed university and school networks from Facebook and Add Health indicates that the FOM assumption is violated in every empirical network we analyze, and that these violations lead to substantially biased RDS estimators of sampling variance. We propose and test two alternative variance estimators that show some promise for reducing biases, but which also illustrate the limits of estimating sampling variance with only partial information on the underlying population social network.

    Generalized least squares can overcome the critical threshold in respondent-driven sampling

    In order to sample marginalized and/or hard-to-reach populations, respondent-driven sampling (RDS) and similar techniques reach their participants via peer referral. Under a Markov model for RDS, previous research has shown that if the typical participant refers too many contacts, then the variance of common estimators does not decay like O(n^{-1}), where n is the sample size. This implies that confidence intervals will be far wider than under a typical sampling design. Here we show that generalized least squares (GLS) can effectively reduce the variance of RDS estimates. In particular, a theoretical analysis indicates that the variance of the GLS estimator is O(n^{-1}). We then derive two classes of feasible GLS estimators. The first class is based upon a Degree Corrected Stochastic Blockmodel for the underlying social network. The second class is based upon a rank-two model. It might be of independent interest that in both model classes, the theoretical results show that it is possible to estimate the spectral properties of the population network from the sampled observations. Simulations on empirical social networks show that the feasible GLS (fGLS) estimators can have drastically smaller error and rarely increase the error. A diagnostic plot helps to identify where fGLS will aid estimation. The fGLS estimators continue to outperform standard estimators even when they are built from a misspecified model and when there is preferential recruitment.
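
The GLS idea in the abstract above can be illustrated with a minimal sketch. It assumes the covariance matrix of the observations is known, and it uses an AR(1)-style covariance along a single referral chain purely as a stand-in; it is not the Degree Corrected Stochastic Blockmodel or the rank-two model the paper uses to build its feasible estimators. The names `gls_mean` and `ar1_cov` are hypothetical.

```python
import numpy as np

def gls_mean(y, sigma):
    """GLS estimate of a common mean: (1' S^-1 y) / (1' S^-1 1).

    `sigma` is the (assumed known) covariance matrix of the observations.
    """
    y = np.asarray(y, dtype=float)
    ones = np.ones_like(y)
    # w = S^-1 1; because S is symmetric, 1' S^-1 y equals w' y.
    w = np.linalg.solve(sigma, ones)
    return float(w @ y / (w @ ones))

def ar1_cov(n, rho):
    """Illustrative covariance: correlation rho^|i-j| between samples i and j."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

# With uncorrelated observations GLS reduces to the ordinary sample mean.
print(gls_mean([1.0, 2.0, 3.0, 6.0], np.eye(4)))
# With correlated observations the weights are no longer equal.
print(gls_mean([1.0, 0.0, 0.0], ar1_cov(3, 0.5)))
```

Under this toy covariance the end points of the chain receive more weight than the interior point, which is the mechanism by which GLS can reduce variance when observations are dependent.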

    J Surv Stat Methodol

    One-step Estimation of Networked Population Size: Respondent-Driven Capture-Recapture with Anonymity

    Population size estimates for hidden and hard-to-reach populations are particularly important when members are known to suffer from disproportionate health issues or to pose health risks to the larger ambient population in which they are embedded. Efforts to derive size estimates are often frustrated by a range of factors that preclude conventional survey strategies, including social stigma associated with group membership or members' involvement in illegal activities. This paper extends prior research on the problem of network population size estimation, building on established survey/sampling methodologies commonly used with hard-to-reach groups. Three novel one-step, network-based population size estimators are presented, to be used in the context of uniform random sampling, respondent-driven sampling, and when networks exhibit significant clustering effects. Provably sufficient conditions for the consistency of these estimators (in large configuration networks) are given. Simulation experiments across a wide range of synthetic network topologies validate the performance of the estimators, which are seen to perform well on a real-world location-based social networking data set with significant clustering. Finally, the proposed schemes are extended to allow them to be used in settings where participant anonymity is required. Systematic experiments show favorable tradeoffs between anonymity guarantees and estimator performance. Taken together, we demonstrate that reasonable population estimates can be derived from anonymous respondent driven samples of 250-750 individuals, within ambient populations of 5,000-40,000. The method thus represents a novel and cost-effective means for health planners and those agencies concerned with health and disease surveillance to estimate the size of hidden populations. Limitations and future work are discussed in the concluding section.

    Social network clustering and the spread of HIV/AIDS among persons who inject drugs in 2 cities in the Philippines

    Introduction: The Philippines has seen rapid increases in HIV prevalence among people who inject drugs. We study 2 neighboring cities where a linked HIV epidemic differed in timing of onset and levels of prevalence. In Cebu, prevalence rose rapidly from below 1% to 54% between 2009 and 2011 and remained high through 2013. In nearby Mandaue, HIV remained below 4% through 2011 then rose rapidly to 38% by 2013. Objectives: We hypothesize that infection prevalence differences in these cities may owe to aspects of social network structure, specifically levels of network clustering. Building on previous research, we hypothesize that higher levels of network clustering are associated with greater epidemic potential. Methods: Data were collected with respondent-driven sampling among men who inject drugs in Cebu and Mandaue in 2013. We first examine sample composition using estimators for population means. We then apply new estimators of network clustering in respondent-driven sampling data to examine associations with HIV prevalence. Results: Samples in both cities were comparable in composition by age, education, and injection locations. Dyadic needle-sharing levels were also similar between the 2 cities, but network clustering in the needle-sharing network differed dramatically. We found higher clustering in Cebu than Mandaue, consistent with expectations that higher clustering is associated with faster epidemic spread. Conclusions: This article is the first to apply estimators of network clustering to empirical respondent-driven samples, and it offers suggestive evidence that researchers should pay greater attention to network structure's role in HIV transmission dynamics.

    Estimating hidden population sizes with venue-based sampling: Extensions of the generalized network scale-up estimator

    Background: Researchers use a variety of population size estimation methods to determine the sizes of key populations at elevated risk of human immunodeficiency virus (HIV)/acquired immune deficiency syndrome (AIDS), an important step in quantifying epidemic impact, advocating for high-risk groups, and planning, implementing, and monitoring prevention, care, and treatment programs. Conventional procedures often use information about sample respondents' social network contacts to estimate the sizes of key populations of interest. A recent study proposes a generalized network scale-up method that combines two samples - a traditional sample of the general population and a link-tracing sample of the hidden population - and produces more accurate results with fewer assumptions than conventional approaches. Methods: We extended the generalized network scale-up method from link-tracing samples to samples collected with venue-based sampling designs popular in sampling key populations at risk of HIV. Our method obviates the need for a traditional sample of the general population, as long as the size of the venue-attending population is approximately known. We tested the venue-based generalized network scale-up method in a comprehensive simulation evaluation framework. Results: The venue-based generalized network scale-up method provided accurate and efficient estimates of key population sizes, even when few members of the key population were sampled, yielding average biases below ±6% except when false-positive reporting error is high. It relies on limited assumptions and, in our tests, was robust to numerous threats to inference. Conclusions: Key population size estimation is vital to the successful implementation of efforts to combat HIV/AIDS. Venue-based network scale-up approaches offer another tool that researchers and policymakers can apply to these problems.
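
For orientation, the conventional (basic) network scale-up idea that the abstract above refers to can be sketched as a simple ratio estimator: the share of respondents' contacts who belong to the hidden population, scaled up to the total population. This is the classic basic estimator only, not the generalized or venue-based method the abstract describes, and the function and argument names are hypothetical.

```python
def basic_scale_up(known_hidden, degrees, population_size):
    """Basic network scale-up estimate of a hidden population's size.

    known_hidden[i]: number of hidden-population members respondent i reports knowing
    degrees[i]:      respondent i's personal network size
    population_size: size of the total population (assumed known)
    """
    if len(known_hidden) != len(degrees):
        raise ValueError("one report and one degree are needed per respondent")
    total_degree = sum(degrees)
    if total_degree <= 0:
        raise ValueError("total personal network size must be positive")
    return population_size * sum(known_hidden) / total_degree

# Three respondents report 3 hidden-population contacts out of 300 contacts
# in total, within a population of 10,000.
print(basic_scale_up([1, 0, 2], [100, 50, 150], 10_000))
```

The basic estimator treats respondents' reports as accurate and their contacts as representative of the population. These are among the assumptions the generalized approach is described as relaxing.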

    Sampling migrants from their social networks: The demography and social organization of Chinese migrants in Dar es Salaam, Tanzania

    The streams of Chinese migration to Africa are growing in tandem with rising Chinese investments and trade flows in and to the African continent. In spite of the high profile of this phenomenon in the media, there are few rich and broad descriptions of Chinese communities in Africa. Reasons for this include the rarity of official statistics on foreign-born populations in African censuses, the absence of predefined sampling frames required to draw representative samples with conventional survey methods, and the difficulty of reaching certain segments of this population. Here, we use a novel network-based approach, Network Sampling with Memory, which overcomes the challenges of sampling ‘hidden’ populations in the absence of a sampling frame, to recruit a sample of recent Chinese immigrants in Dar es Salaam, Tanzania and collect information on the demographic characteristics, migration histories and social ties of members of this sample. These data reveal a heterogeneous Chinese community composed of “state-led” migrants who come to Africa to work on projects undertaken by large Chinese state-owned enterprises and “independent” migrants who come of their own accord to engage in various types of business ventures. They offer a rich description of the demographic profile and social organization of this community, highlight key differences between the two categories of migrants and map the structure of the social ties linking them. We highlight needs for future research on inter-group differences in individual motivations for migration, economic activities, migration outcomes, expectations about future residence in Africa, social integration and relations with local communities.

    Sociol Methodol

    Respondent-driven sampling (RDS) is a popular method for sampling hard-to-survey populations that leverages social network connections through peer recruitment. While RDS is most frequently applied to estimate the prevalence of infections and risk behaviors of interest to public health, such as HIV/AIDS or condom use, it is rarely used to draw inferences about the structural properties of social networks among such populations because it does not typically collect the necessary data. Drawing on recent advances in computer science, we introduce a set of data collection instruments and RDS estimators for network clustering, an important topological property that has been linked to a network's potential for diffusion of information, disease, and health behaviors. We use simulations to explore how these estimators, originally developed for random walk samples of computer networks, perform when applied to RDS samples with characteristics encountered in realistic field settings that depart from random walks. In particular, we explore the effects of multiple seeds, sampling without replacement versus with replacement, branching chains, imperfect response rates, preferential recruitment, and misreporting of ties. We find that clustering coefficient estimators retain desirable properties in RDS samples. This paper takes an important step toward calculating network characteristics using nontraditional sampling methods, and it expands the potential of RDS to tell researchers more about hidden populations and the social factors driving disease prevalence.
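
As a rough illustration of the kind of random-walk clustering estimator the abstract above refers to, the sketch below estimates the average local clustering coefficient from a single with-replacement random walk. At each interior step it checks whether the previous and next nodes are tied, weights that indicator by 1/(degree - 1), and divides by the walk average of 1/degree. It is an assumption that this matches the estimators the paper studies. The sketch also assumes that each sampled node's degree is known and that the tie between a node's predecessor and successor can be observed. All names are hypothetical.

```python
import random

def random_walk(adj, start, steps, rng):
    """Simple with-replacement random walk visiting `steps` nodes."""
    walk = [start]
    for _ in range(steps - 1):
        walk.append(rng.choice(adj[walk[-1]]))
    return walk

def avg_clustering_from_walk(adj, walk):
    """Estimate the mean local clustering coefficient from a random walk."""
    r = len(walk)
    if r < 3:
        raise ValueError("the walk must visit at least three nodes")
    phi_sum = 0.0
    for k in range(1, r - 1):
        d = len(adj[walk[k]])
        if d < 2:
            continue  # degree-1 nodes cannot close a triangle
        # Indicator that the previous and next nodes are tied to each other.
        if walk[k + 1] in adj[walk[k - 1]]:
            phi_sum += 1.0 / (d - 1)
    phi = phi_sum / (r - 2)
    # Walk average of 1/degree corrects for degree-proportional sampling.
    psi = sum(1.0 / len(adj[v]) for v in walk) / r
    return phi / psi

# Toy graphs: a 5-cycle has no triangles; a complete graph has clustering 1.
cycle5 = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}
k5 = {i: [j for j in range(5) if j != i] for i in range(5)}
rng = random.Random(0)
print(avg_clustering_from_walk(cycle5, random_walk(cycle5, 0, 1000, rng)))
print(avg_clustering_from_walk(k5, random_walk(k5, 0, 20000, rng)))
```

The 5-cycle contains no triangles, so the estimate is exactly zero, while the estimate for the complete graph should approach one as the walk lengthens. Field RDS departs from this idealized walk in the ways the abstract lists, such as multiple seeds, sampling without replacement, and branching chains.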