5 research outputs found
On the Universality of Jordan Centers for Estimating Infection Sources in Tree Networks
Finding the infection sources in a network when we only know the network
topology and infected nodes, but not the rates of infection, is a challenging
combinatorial problem, and it is even more difficult in practice where the
underlying infection spreading model is usually unknown a priori. In this
paper, we are interested in finding a source estimator that is applicable to
various spreading models, including the Susceptible-Infected (SI),
Susceptible-Infected-Recovered (SIR), Susceptible-Infected-Recovered-Infected
(SIRI), and Susceptible-Infected-Susceptible (SIS) models. We show that under
the SI, SIR and SIRI spreading models and with mild technical assumptions, the
Jordan center is the infection source associated with the most likely infection
path in a tree network with a single infection source. This conclusion holds
for a wide range of spreading parameters under these three models; under the
SIS model, it holds for regular trees with homogeneous infection and recovery rates. Since the
Jordan center does not depend on the infection, recovery and reinfection rates,
it can be regarded as a universal source estimator. We also consider the case
where there are k>1 infection sources, generalize the Jordan center definition
to a k-Jordan center set, and show that this is an optimal infection source set
estimator in a tree network for the SI model. Simulation results on various
general synthetic networks and real world networks suggest that Jordan
center-based estimators consistently outperform the betweenness, closeness,
distance, degree, eigenvector, and pagerank centrality based heuristics, even
if the network is not a tree.
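As an illustration of the estimator described in this abstract, the following is a minimal sketch that finds Jordan centers by brute force. It assumes the usual definition (a node whose maximum hop distance to any observed infected node is smallest), an unweighted connected graph given as an adjacency dictionary, and hypothetical function names `bfs_distances` and `jordan_centers`; it is not taken from the paper.

```python
from collections import deque


def bfs_distances(adj, start):
    """Hop distances from `start` to every reachable node (unweighted graph)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def jordan_centers(adj, infected):
    """Return (centers, radius): the nodes minimizing the maximum distance
    to the infected set, and that minimum value.

    Assumes the graph is connected, so every infected node is reachable.
    """
    infected = set(infected)
    best, centers = None, []
    for v in adj:
        dist = bfs_distances(adj, v)
        ecc = max(dist[u] for u in infected)  # infection eccentricity of v
        if best is None or ecc < best:
            best, centers = ecc, [v]
        elif ecc == best:
            centers.append(v)
    return centers, best


# Illustrative path graph 0-1-2-3-4 with every node infected.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(jordan_centers(path, [0, 1, 2, 3, 4]))  # ([2], 2)
```

The estimator uses only the topology and the infected set, which matches the abstract's point that it does not depend on infection, recovery, or reinfection rates. One BFS per node costs O(n(n+m)) in total, which is enough for a sketch but not tuned for large networks.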
On the Properties of Gromov Matrices and their Applications in Network Inference
The spanning tree heuristic is a commonly adopted procedure in network
inference and estimation. It allows one to generalize an inference method
developed for trees, which is usually based on a statistically rigorous
approach, to a heuristic procedure for general graphs by (usually randomly)
choosing a spanning tree in the graph to apply the approach developed for
trees. However, the number of spanning trees in a dense graph is intractably
large. In this paper, we represent a weighted tree with a matrix, which we call
a Gromov matrix. We propose a method that constructs a family of Gromov
matrices using convex combinations, which can be used for inference and
estimation instead of a randomly selected spanning tree. This procedure
increases the size of the candidate set and hence enhances the performance of
the classical spanning tree heuristic. On the other hand, our new scheme is
based on simple algebraic constructions using matrices, and hence is still
computationally tractable. We discuss some applications on network inference
and estimation to demonstrate the usefulness of the proposed method.
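The matrix representation described in this abstract can be sketched as follows. The sketch assumes that entry (i, j) is the Gromov product of nodes i and j with respect to a chosen base node, (d(base, i) + d(base, j) - d(i, j)) / 2, which in a tree equals the length of the path the two nodes share from the base. That definition, the function names, and the entrywise convex combination are assumptions made for illustration; the paper's exact construction may differ.

```python
def tree_distances(adj, start):
    """Weighted path lengths from `start` in a tree; adj[u] = {v: weight}."""
    dist = {start: 0.0}
    stack = [start]
    while stack:
        u = stack.pop()
        for v, w in adj[u].items():
            if v not in dist:
                dist[v] = dist[u] + w
                stack.append(v)
    return dist


def gromov_matrix(adj, base, nodes):
    """Matrix of Gromov products (i|j)_base over `nodes` (assumed definition)."""
    d = {u: tree_distances(adj, u) for u in [base] + list(nodes)}
    return [
        [(d[base][i] + d[base][j] - d[i][j]) / 2 for j in nodes]
        for i in nodes
    ]


def convex_combination(m1, m2, theta):
    """Entrywise theta * m1 + (1 - theta) * m2, with 0 <= theta <= 1."""
    return [
        [theta * a + (1 - theta) * b for a, b in zip(row1, row2)]
        for row1, row2 in zip(m1, m2)
    ]


# Illustrative weighted tree: r -1- a, a -2- b, a -3- c.
tree = {
    "r": {"a": 1.0},
    "a": {"r": 1.0, "b": 2.0, "c": 3.0},
    "b": {"a": 2.0},
    "c": {"a": 3.0},
}
print(gromov_matrix(tree, "r", ["b", "c"]))  # [[3.0, 1.0], [1.0, 4.0]]
```

Combining matrices rather than enumerating trees keeps the work to simple matrix arithmetic, which is the tractability argument the abstract makes. The sketch does not check whether a given combination still corresponds to a tree.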
Infection Spreading and Source Identification: A Hide and Seek Game
The goal of an infection source node (e.g., a rumor or computer virus source)
in a network is to spread its infection to as many nodes as possible, while
remaining hidden from the network administrator. On the other hand, the network
administrator aims to identify the source node based on knowledge of which
nodes have been infected. We model the infection spreading and source
identification problem as a strategic game, where the infection source and the
network administrator are the two players. As the Jordan center estimator is a
minimax source estimator that has been shown to be robust in recent works, we
assume that the network administrator utilizes a source estimation strategy
that can probe any nodes within a given radius of the Jordan center. Given any
estimation strategy, we design a best-response infection strategy for the
source. Given any infection strategy, we design a best-response estimation
strategy for the network administrator. We derive conditions under which a Nash
equilibrium of the strategic game exists. Simulations in both synthetic and
real-world networks demonstrate that our proposed infection strategy infects
more nodes while maintaining the same safety margin between the true source
node and the Jordan center source estimator.
Statistical methods for certain large, complex data challenges
Big data concerns large-volume, complex, growing data sets, and it provides us opportunities as well as challenges. This thesis focuses on statistical methods for several specific large, complex data challenges - each involving representation of data with complex format, utilization of complicated information, and/or intensive computational cost.
The first problem we work on is hypothesis testing for multilayer network data, motivated by an example in computational biology. We show how to represent the complex structure of a multilayer network as a single data point within the space of supra-Laplacians and then develop a central limit theorem and hypothesis testing theories for multilayer networks in that space. We develop both global and local testing strategies for mean comparison and investigate sample size requirements. The methods were applied to the motivating computational biology example and compared with the classic Gene Set Enrichment Analysis (GSEA). The comparison yields additional biological insights.
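To make the supra-Laplacian representation concrete, here is a minimal sketch of one common construction. It assumes every layer shares the same node set, that each node is coupled to its own copies in all other layers with a uniform weight `omega`, and that the supra-Laplacian is the degree matrix minus the supra-adjacency matrix. The function name and the coupling scheme are assumptions; the thesis may use a different convention.

```python
def supra_laplacian(layers, omega=1.0):
    """Supra-Laplacian of a multilayer network (assumed construction).

    `layers` is a list of n x n adjacency matrices (lists of lists), one per
    layer. Intra-layer edges fill the diagonal blocks; copies of the same
    node in different layers are linked with weight `omega`.
    """
    num_layers = len(layers)
    n = len(layers[0])
    size = num_layers * n
    adj = [[0.0] * size for _ in range(size)]

    # Diagonal blocks: the adjacency matrix of each layer.
    for k, layer in enumerate(layers):
        for i in range(n):
            for j in range(n):
                adj[k * n + i][k * n + j] = float(layer[i][j])

    # Off-diagonal blocks: couple node i in layer k to node i in layer m.
    for k in range(num_layers):
        for m in range(num_layers):
            if k != m:
                for i in range(n):
                    adj[k * n + i][m * n + i] = omega

    # Laplacian = degree matrix - adjacency matrix.
    return [
        [(sum(adj[r]) if r == c else 0.0) - adj[r][c] for c in range(size)]
        for r in range(size)
    ]


# Illustrative two-layer network on two nodes: one edge in layer 1, none in layer 2.
lap = supra_laplacian([[[0, 1], [1, 0]], [[0, 0], [0, 0]]])
```

Each multilayer network then becomes a single matrix of fixed size, so a sample of networks can be treated as a sample of points in one space, which is the setting the testing theory above relies on.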
The second problem is the source detection problem in epidemiology, which is one of the most important issues for control of epidemics. Ideally, we want to locate the sources based on all history data. However, this is often infeasible, because the history data is complex, high-dimensional and cannot be fully observed. Epidemiologists have recognized the crucial role of human mobility as an important proxy to a complete history, but little in the literature to date uses this information for source detection. We recast the source detection problem as identifying a relevant mixture component in a multivariate Gaussian mixture model. Human mobility within a stochastic PDE model is used to calibrate the parameters. The capability of our method is demonstrated in the context of the 2000-2002 cholera outbreak in the KwaZulu-Natal province.
The third problem is about multivariate time series imputation, which is a classic problem in statistics. To address the common problem of low signal-to-noise ratio in high-dimensional multivariate time series, we propose models based on state-space models which provide more precise inference of missing values by clustering multivariate time series components in a nonparametric way. The models are suitable for large-scale time series due to their efficient parameter estimation.
2019-05-15T00:00:00