14 research outputs found

    Estimation of global network statistics from incomplete data

    Complex networks underlie an enormous variety of social, biological, physical, and virtual systems. A profound complication for the science of complex networks is that in most cases, observing all nodes and all network interactions is impossible. Previous work addressing the impacts of partial network data is surprisingly limited, focuses primarily on missing nodes, and suggests that network statistics derived from subsampled data are not suitable estimators for the same network statistics describing the overall network topology. We generate scaling methods to predict true network statistics, including the degree distribution, from only partial knowledge of nodes, links, or weights. Our methods are transparent and do not assume a known generating process for the network, thus enabling prediction of network statistics for a wide variety of applications. We validate analytical results on four simulated network classes and empirical data sets of various sizes. We perform subsampling experiments by varying proportions of sampled data and demonstrate that our scaling methods can provide very good estimates of true network statistics while acknowledging limits. Lastly, we apply our techniques to a set of rich and evolving large-scale social networks, Twitter reply networks. Based on 100 million tweets, we use our scaling techniques to propose a statistical characterization of the Twitter Interactome from September 2008 to November 2008. Our treatment allows us to find support for Dunbar's hypothesis in detecting an upper threshold for the number of active social contacts that individuals maintain over the course of one week.

    Impact of spatially constrained sampling of temporal contact networks on the evaluation of the epidemic risk

    The ability to directly record human face-to-face interactions increasingly enables the development of detailed data-driven models for the spread of directly transmitted infectious diseases at the scale of individuals. Complete coverage of the contacts occurring in a population is however generally unattainable, due for instance to limited participation rates or experimental constraints in spatial coverage. Here, we study the impact of spatially constrained sampling on our ability to estimate the epidemic risk in a population using such detailed data-driven models. The epidemic risk is quantified by the epidemic threshold of the susceptible-infectious-recovered-susceptible model for the propagation of communicable diseases, i.e. the critical value of disease transmissibility above which the disease turns endemic. We verify for both synthetic and empirical data of human interactions that the use of incomplete data sets due to spatial sampling leads to the underestimation of the epidemic risk. The bias is however smaller than the one obtained by uniformly sampling the same fraction of contacts: it depends nonlinearly on the fraction of contacts that are recorded and becomes negligible if this fraction is large enough. Moreover, it depends on the interplay between the timescales of population and spreading dynamics. Comment: 21 pages, 7 figures.

    Estimating the epidemic risk using non-uniformly sampled contact data

    Many datasets describing contacts in a population suffer from incompleteness due to population sampling and underreporting of contacts. Data-driven simulations of spreading processes using such incomplete data lead to an underestimation of the epidemic risk, and it is therefore important to devise methods to correct this bias. We focus here on a non-uniform sampling of the contacts between individuals, aimed at mimicking the results of diaries or surveys, and consider as case studies two datasets collected in different contexts. We show that using surrogate data built using a method developed in the case of uniform population sampling yields an improvement with respect to the use of the sampled data but is strongly limited by the underestimation of the link density in the sampled network. We put forward a second method to build surrogate data that assumes knowledge of the density of links within one of the groups forming the population. We show that it gives very good results when the population is strongly structured, and discuss its limitations in the case of a population with a weaker group structure. These limitations highlight the interest of measurements using wearable sensors able to yield accurate information on the structure and durations of contacts.

    Epidemic risk from friendship network data: an equivalence with a non-uniform sampling of contact networks

    Contacts between individuals play an important role in determining how infectious diseases spread. Various methods to gather data on such contacts co-exist, from surveys to wearable sensors. Comparisons of data obtained by different methods in the same context are however scarce, in particular with respect to their use in data-driven models of spreading processes. Here, we use a combined data set describing contacts registered by sensors and friendship relations in the same population to address this issue in a case study. We investigate if the use of the friendship network is equivalent to a sampling procedure performed on the sensor contact network with respect to the outcome of simulations of spreading processes: such an equivalence might indeed give hints on ways to compensate for the incompleteness of contact data deduced from surveys. We show that this is indeed the case for these data, for a specifically designed sampling procedure, in which respondents report their neighbors with a probability depending on their contact time. We study the impact of this specific sampling procedure on several data sets, discuss limitations of our approach and its possible applications in the use of data sets of various origins in data-driven simulations of epidemic processes.

    Compensating for population sampling in simulations of epidemic spread on temporal contact networks

    Data describing human interactions often suffer from incomplete sampling of the underlying population. As a consequence, the study of contagion processes using data-driven models can lead to a severe underestimation of the epidemic risk. Here we present a systematic method to alleviate this issue and obtain a better estimation of the risk in the context of epidemic models informed by high-resolution time-resolved contact data. We consider several such data sets collected in various contexts and perform controlled resampling experiments. We show how the statistical information contained in the resampled data can be used to build a series of surrogate versions of the unknown contacts. We simulate epidemic processes on the resulting reconstructed data sets and show that it is possible to obtain good estimates of the outcome of simulations performed using the complete data set. We discuss limitations and potential improvements of our method.

    Livestock network analysis for rhodesiense human African trypanosomiasis control in Uganda

    Background: Infected cattle sourced from districts with established foci for Trypanosoma brucei rhodesiense human African trypanosomiasis (rHAT) migrating to previously unaffected districts have resulted in a significant expansion of the disease in Uganda. This study explores livestock movement data to describe cattle trade network topology and assess the effects of disease control interventions on the transmission of rHAT infectiousness. Methods: Network analysis was used to generate a cattle trade network with livestock data which was collected from cattle traders (n = 197) and validated using random graph methods. Additionally, the cattle trade network was combined with a susceptible, infected, recovered (SIR) compartmental model to simulate spread of rHAT (R0 1.287), hence regarded as a “slow” pathogen, and evaluate the effects of disease interventions. Results: The cattle trade network exhibited a low clustering coefficient (0.5) with most cattle markets being weakly connected and a few being highly connected. Also, analysis of the cattle movement data revealed a core group comprising cattle markets from both eastern (rHAT endemic) and northwest regions (rHAT unaffected area). Presence of a core group may result in rHAT spread to unaffected districts and occurrence of super spreader cattle market or markets in case of an outbreak. The key cattle markets that may be targeted for routine rHAT surveillance and control included Namutumba, Soroti, and Molo, all of which were in southeast Uganda. Using effective trypanosomiasis control such as integrated cattle injection with trypanocides and spraying can sufficiently slow the spread of rHAT in the network. Conclusion: Cattle trade network analysis indicated a pathway along which T. b. rhodesiense could spread northward from eastern Uganda. Targeted T. b. rhodesiense surveillance and control in eastern Uganda, through enhanced public–private partnerships, would serve to limit its spread.

    Connecting every bit of knowledge: The structure of Wikipedia's First Link Network

    Apples, porcupines, and the most obscure Bob Dylan song: is every topic a few clicks from Philosophy? Within Wikipedia, the surprising answer is yes: nearly all paths lead to Philosophy. Wikipedia is the largest, most meticulously indexed collection of human knowledge ever amassed. More than information about a topic, Wikipedia is a web of naturally emerging relationships. By following the first link in each article, we algorithmically construct a directed network of all 4.7 million articles: Wikipedia's First Link Network. Here, we study the English edition of Wikipedia's First Link Network for insight into how the many articles on inventions, places, people, objects, and events are related and organized. By traversing every path, we measure the accumulation of first links, path lengths, groups of path-connected articles, and cycles. We also develop a new method, traversal funnels, to measure the influence each article exerts in shaping the network. Traversal funnels provide a new measure of influence for directed networks without spill-over into cycles, in contrast to traditional network centrality measures. Within Wikipedia's First Link Network, we find scale-free distributions describe path length, accumulation, and influence. Far from dispersed, first links disproportionately accumulate at a few articles, flowing from specific to general and culminating around fundamental notions such as Community, State, and Science. Philosophy directs more paths than any other article by two orders of magnitude. We also observe a gravitation toward topical articles such as Health Care and Fossil Fuel. These findings enrich our view of the connections and structure of Wikipedia's ever growing store of knowledge.

    Joint Inference of Structure and Diffusion in Partially Observed Social Networks

    Access to complete data in large-scale networks is often infeasible. Therefore, the problem of missing data is a crucial and unavoidable issue in the analysis and modeling of real-world social networks. However, most of the research on different aspects of social networks does not consider this limitation. One effective way to solve this problem is to recover the missing data as a pre-processing step. The present paper tries to infer the unobserved data from both the diffusion network and the network structure by learning a model from the partially observed data. We develop a probabilistic generative model called "DiffStru" to jointly discover the hidden links of the network structure and the omitted diffusion activities. The interrelations among links of nodes and cascade processes are utilized in the proposed method via learning coupled low-dimensional latent factors. In addition to inferring the unseen data, the learned latent factors may also help network classification problems such as community detection. Simulation results on synthetic and real-world datasets show the excellent performance of the proposed method in terms of link prediction and discovering the identity and infection time of invisible social behaviors.