Estimation of global network statistics from incomplete data
Complex networks underlie an enormous variety of social, biological, physical, and virtual systems. A profound complication for the science of complex networks is that in most cases, observing all nodes and all network interactions is impossible. Previous work addressing the impacts of partial network data is surprisingly limited, focuses primarily on missing nodes, and suggests that network statistics derived from subsampled data are not suitable estimators for the same network statistics describing the overall network topology. We generate scaling methods to predict true network statistics, including the degree distribution, from only partial knowledge of nodes, links, or weights. Our methods are transparent and do not assume a known generating process for the network, thus enabling prediction of network statistics for a wide variety of applications. We validate analytical results on four simulated network classes and empirical data sets of various sizes. We perform subsampling experiments by varying proportions of sampled data and demonstrate that our scaling methods can provide very good estimates of true network statistics while acknowledging limits. Lastly, we apply our techniques to a set of rich and evolving large-scale social networks, Twitter reply networks. Based on 100 million tweets, we use our scaling techniques to propose a statistical characterization of the Twitter Interactome from September 2008 to November 2008. Our treatment allows us to find support for Dunbar's hypothesis in detecting an upper threshold for the number of active social contacts that individuals maintain over the course of one week.
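As a rough illustration of what scaling a subsampled statistic can look like, the sketch below uses the standard expectation argument for uniform node sampling: if each node is kept independently with probability q, an edge survives only when both endpoints are kept, so the expected observed edge count is q^2 times the true count. This is a minimal sketch under that assumption and is not necessarily the estimator derived in the paper; the function names `sample_induced_subgraph` and `scale_up` are hypothetical.

```python
import random


def sample_induced_subgraph(edges, nodes, q, rng):
    """Keep each node independently with probability q (assumed uniform
    node sampling) and return the kept nodes plus the induced edges."""
    kept = {n for n in nodes if rng.random() < q}
    sub_edges = [(u, v) for u, v in edges if u in kept and v in kept]
    return kept, sub_edges


def scale_up(n_obs, m_obs, q):
    """Rescale observed node and edge counts to estimates for the full network.

    Nodes survive with probability q, edges with probability q**2,
    so the observed counts are divided by those factors.
    """
    n_est = n_obs / q
    m_est = m_obs / q ** 2
    # Average degree of an undirected network is 2M / N.
    k_est = 2 * m_est / n_est if n_est else 0.0
    return n_est, m_est, k_est
```

For example, observing 50 nodes and 100 edges at q = 0.5 rescales to an estimated 100 nodes, 400 edges, and an average degree of 8. The estimate is only unbiased in expectation under the uniform-sampling assumption; other sampling schemes (by links or by weights) would need different factors.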
Impact of spatially constrained sampling of temporal contact networks on the evaluation of the epidemic risk
The ability to directly record human face-to-face interactions increasingly
enables the development of detailed data-driven models for the spread of
directly transmitted infectious diseases at the scale of individuals. Complete
coverage of the contacts occurring in a population is however generally
unattainable, due for instance to limited participation rates or experimental
constraints in spatial coverage. Here, we study the impact of spatially
constrained sampling on our ability to estimate the epidemic risk in a
population using such detailed data-driven models. The epidemic risk is
quantified by the epidemic threshold of the
susceptible-infectious-recovered-susceptible model for the propagation of
communicable diseases, i.e. the critical value of disease transmissibility
above which the disease turns endemic. We verify for both synthetic and
empirical data of human interactions that the use of incomplete data sets due
to spatial sampling leads to the underestimation of the epidemic risk. The bias
is however smaller than the one obtained by uniformly sampling the same
fraction of contacts: it depends nonlinearly on the fraction of contacts that
are recorded and becomes negligible if this fraction is large enough. Moreover,
it depends on the interplay between the timescales of population and spreading
dynamics.
Comment: 21 pages, 7 figures
Estimating the epidemic risk using non-uniformly sampled contact data
Many datasets describing contacts in a population suffer from incompleteness
due to population sampling and underreporting of contacts. Data-driven
simulations of spreading processes using such incomplete data lead to an
underestimation of the epidemic risk, and it is therefore important to devise
methods to correct this bias. We focus here on a non-uniform sampling of the
contacts between individuals, aimed at mimicking the results of diaries or
surveys, and consider as case studies two datasets collected in different
contexts. We show that using surrogate data built using a method developed in
the case of uniform population sampling yields an improvement with respect to
the use of the sampled data but is strongly limited by the underestimation of
the link density in the sampled network. We put forward a second method to
build surrogate data that assumes knowledge of the density of links within one
of the groups forming the population. We show that it gives very good results
when the population is strongly structured, and discuss its limitations in the
case of a population with a weaker group structure. These limitations highlight
the interest of measurements using wearable sensors able to yield accurate
information on the structure and durations of contacts.
Epidemic risk from friendship network data: an equivalence with a non-uniform sampling of contact networks
Contacts between individuals play an important role in determining how
infectious diseases spread. Various methods to gather data on such contacts
co-exist, from surveys to wearable sensors. Comparisons of data obtained by
different methods in the same context are however scarce, in particular with
respect to their use in data-driven models of spreading processes. Here, we use
a combined data set describing contacts registered by sensors and friendship
relations in the same population to address this issue in a case study. We
investigate if the use of the friendship network is equivalent to a sampling
procedure performed on the sensor contact network with respect to the outcome
of simulations of spreading processes: such an equivalence might indeed give
hints on ways to compensate for the incompleteness of contact data deduced from
surveys. We show that this is indeed the case for these data, for a
specifically designed sampling procedure, in which respondents report their
neighbors with a probability depending on their contact time. We study the
impact of this specific sampling procedure on several data sets, discuss
limitations of our approach and its possible applications in the use of data
sets of various origins in data-driven simulations of epidemic processes.
Compensating for population sampling in simulations of epidemic spread on temporal contact networks
Data describing human interactions often suffer from incomplete sampling of
the underlying population. As a consequence, the study of contagion processes
using data-driven models can lead to a severe underestimation of the epidemic
risk. Here we present a systematic method to alleviate this issue and obtain a
better estimation of the risk in the context of epidemic models informed by
high-resolution time-resolved contact data. We consider several such data sets
collected in various contexts and perform controlled resampling experiments. We
show how the statistical information contained in the resampled data can be
used to build a series of surrogate versions of the unknown contacts. We
simulate epidemic processes on the resulting reconstructed data sets and show
that it is possible to obtain good estimates of the outcome of simulations
performed using the complete data set. We discuss limitations and potential
improvements of our method.
Livestock network analysis for rhodesiense human African trypanosomiasis control in Uganda
Background: Infected cattle sourced from districts with established foci for Trypanosoma brucei rhodesiense human African trypanosomiasis (rHAT) and migrating to previously unaffected districts have resulted in a significant expansion of the disease in Uganda. This study explores livestock movement data to describe cattle trade network topology and assess the effects of disease control interventions on the transmission of rHAT. Methods: Network analysis was used to generate a cattle trade network from livestock data that were collected from cattle traders (n = 197) and validated using random graph methods. Additionally, the cattle trade network was combined with a susceptible, infected, recovered (SIR) compartmental model to simulate the spread of rHAT (R0 = 1.287), hence regarded as a “slow” pathogen, and to evaluate the effects of disease interventions. Results: The cattle trade network exhibited a low clustering coefficient (0.5), with most cattle markets being weakly connected and a few being highly connected. Also, analysis of the cattle movement data revealed a core group comprising cattle markets from both the eastern (rHAT endemic) and northwest (rHAT unaffected) regions. The presence of a core group may result in rHAT spreading to unaffected districts and in the occurrence of a super spreader cattle market or markets in case of an outbreak. The key cattle markets that may be targeted for routine rHAT surveillance and control included Namutumba, Soroti, and Molo, all of which were in southeast Uganda. Using effective trypanosomiasis control measures, such as integrated cattle injection with trypanocides and spraying, can sufficiently slow the spread of rHAT in the network. Conclusion: Cattle trade network analysis indicated a pathway along which T. b. rhodesiense could spread northward from eastern Uganda. Targeted T. b. rhodesiense surveillance and control in eastern Uganda, through enhanced public–private partnerships, would serve to limit its spread.
Connecting every bit of knowledge: The structure of Wikipedia's First Link Network
Apples, porcupines, and the most obscure Bob Dylan song: is every topic a few clicks from Philosophy? Within Wikipedia, the surprising answer is yes: nearly all paths lead to Philosophy. Wikipedia is the largest, most meticulously indexed collection of human knowledge ever amassed. More than information about a topic, Wikipedia is a web of naturally emerging relationships. By following the first link in each article, we algorithmically construct a directed network of all 4.7 million articles: Wikipedia's First Link Network. Here, we study the English edition of Wikipedia's First Link Network for insight into how the many articles on inventions, places, people, objects, and events are related and organized. By traversing every path, we measure the accumulation of first links, path lengths, groups of path-connected articles, and cycles. We also develop a new method, traversal funnels, to measure the influence each article exerts in shaping the network. Traversal funnels provide a new measure of influence for directed networks without spill-over into cycles, in contrast to traditional network centrality measures. Within Wikipedia's First Link Network, we find scale-free distributions describe path length, accumulation, and influence. Far from dispersed, first links disproportionately accumulate at a few articles, flowing from specific to general and culminating around fundamental notions such as Community, State, and Science. Philosophy directs more paths than any other article by two orders of magnitude. We also observe a gravitation toward topical articles such as Health Care and Fossil Fuel. These findings enrich our view of the connections and structure of Wikipedia's ever growing store of knowledge.
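Because every article contributes at most one outgoing first link, a path in such a network can be followed deterministically until it either stops or re-enters an article already visited. The sketch below illustrates that traversal and the cycle it ends in. It is an assumption-laden toy, not the authors' pipeline or their traversal-funnel measure: the function name `follow_first_links` is hypothetical, and the article links in the usage note are invented for illustration rather than taken from Wikipedia.

```python
def follow_first_links(first_link, start):
    """Follow first links from `start` until a dead end or a revisit.

    `first_link` maps an article title to the title of its first link
    (a missing key means the article has no outgoing first link).
    Returns (path, cycle): the articles visited in order, and the cycle
    the path ends in (an empty list if the path hits a dead end).
    """
    path = []
    position = {}  # article -> index in path, used to detect revisits
    node = start
    while node is not None and node not in position:
        position[node] = len(path)
        path.append(node)
        node = first_link.get(node)
    if node is None:
        return path, []
    # The path re-entered `node`; everything from its first visit is the cycle.
    return path, path[position[node]:]
```

With a made-up mapping such as `{"Apple": "Fruit", "Fruit": "Botany", "Botany": "Science", "Science": "Knowledge", "Knowledge": "Philosophy", "Philosophy": "Knowledge"}`, starting from "Apple" yields a six-article path that terminates in the two-article cycle between "Knowledge" and "Philosophy". Path length and cycle membership computed this way are the kind of per-path quantities the abstract describes measuring.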
Joint Inference of Structure and Diffusion in Partially Observed Social Networks
Access to complete data in large scale networks is often infeasible.
Therefore, the problem of missing data is a crucial and unavoidable issue in
analysis and modeling of real-world social networks. However, most of the
research on different aspects of social networks does not consider this
limitation. One effective way to solve this problem is to recover the missing
data as a pre-processing step. The present paper tries to infer the unobserved
data from both diffusion network and network structure by learning a model from
the partially observed data. We develop a probabilistic generative model called
"DiffStru" to jointly discover the hidden links of network structure and the
omitted diffusion activities. The interrelations among links of nodes and
cascade processes are utilized in the proposed method via learning coupled low
dimensional latent factors. In addition to inferring the unseen data, the
learned latent factors may also help network classification problems such as
community detection. Simulation results on synthetic and real-world datasets
show the excellent performance of the proposed method in terms of link
prediction and discovering the identity and infection time of invisible social
behaviors.