Bias-Variance Tradeoffs in Joint Spectral Embeddings
Joint spectral embeddings facilitate analysis of multiple network data by
simultaneously mapping vertices in each network to points in Euclidean space
where statistical inference is then performed. In this work, we consider one
such joint embedding technique, the omnibus embedding of arXiv:1705.09355,
which has been successfully used for community detection, anomaly detection,
and hypothesis testing tasks. To date, the theoretical properties of this method
have only been established under the strong assumption that the networks are
conditionally i.i.d. random dot product graphs. In practice we anticipate
multiple networks will possess different structures, necessitating further
analysis. Herein, we take a first step in characterizing the theoretical
properties of the omnibus embedding in the presence of heterogeneous network
data. Under a simple latent position model, we uncover a bias-variance tradeoff
for latent position estimation. We establish an explicit bias expression,
derive a uniform concentration bound on the residual, and prove a central limit
theorem characterizing the distributional properties of these estimates. These
explicit bias and variance expressions enable us to state sufficient conditions
for exact recovery in community detection tasks and develop a pivotal test
statistic to determine whether two graphs share the same set of latent
positions, demonstrating that accurate inference is achievable despite the
estimator's inconsistency. These results are demonstrated in several
experimental settings where statistical procedures utilizing the omnibus
embedding are competitive, and oftentimes preferable, to comparable embedding
techniques. These observations accentuate the viability of the omnibus
embedding for multiple graph inference beyond the homogeneous network setting.
Comment: 45 pages, 7 figures
A central limit theorem for an omnibus embedding of multiple random graphs and implications for multiscale network inference
Performing statistical analyses on collections of graphs is of import to many
disciplines, but principled, scalable methods for multi-sample graph inference
are few. Here we describe an "omnibus" embedding in which multiple graphs on
the same vertex set are jointly embedded into a single space with a distinct
representation for each graph. We prove a central limit theorem for this
embedding and demonstrate how it streamlines graph comparison, obviating the
need for pairwise subspace alignments. The omnibus embedding achieves
near-optimal inference accuracy when graphs arise from a common distribution
and yet retains discriminatory power as a test procedure for the comparison of
different graphs. Moreover, this joint embedding and the accompanying central
limit theorem are important for answering multiscale graph inference questions,
such as the identification of specific subgraphs or vertices responsible for
similarity or difference across networks. We illustrate this with a pair of
analyses of connectome data derived from dMRI and fMRI scans of human subjects.
In particular, we show that this embedding allows the identification of
specific brain regions associated with population-level differences. Finally,
we sketch how the omnibus embedding can be used to address pressing open
problems, both theoretical and practical, in multisample graph inference.
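The omnibus construction described above can be sketched concretely. The following is a minimal numpy sketch, not the authors' reference implementation: for m graphs on a shared vertex set, form the mn x mn omnibus matrix whose (i, j) block is (A_i + A_j)/2, then take the d scaled top eigenvectors, giving each graph its own n x d representation. The helper name `omnibus_embedding` is illustrative.

```python
import numpy as np

def omnibus_embedding(adjs, d):
    """Omnibus embedding sketch: jointly embed m graphs on n shared vertices.

    Builds the mn x mn omnibus matrix with (i, j) block (A_i + A_j) / 2,
    then returns the d scaled top eigenvectors reshaped so that each
    graph gets its own n x d representation.
    """
    m = len(adjs)
    n = adjs[0].shape[0]
    M = np.block([[(adjs[i] + adjs[j]) / 2 for j in range(m)]
                  for i in range(m)])
    vals, vecs = np.linalg.eigh(M)        # M is symmetric
    top = np.argsort(vals)[::-1][:d]      # d largest eigenvalues
    X = vecs[:, top] * np.sqrt(np.abs(vals[top]))
    return X.reshape(m, n, d)             # one n x d embedding per graph
```

Because every graph is embedded into the same space, representations of different graphs can be compared directly, which is what obviates the pairwise subspace alignments mentioned above.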
Statistical inference on random dot product graphs: a survey
The random dot product graph (RDPG) is an independent-edge random graph that
is analytically tractable and, simultaneously, either encompasses or can
successfully approximate a wide range of random graphs, from relatively simple
stochastic block models to complex latent position graphs. In this survey
paper, we describe a comprehensive paradigm for statistical inference on random
dot product graphs, a paradigm centered on spectral embeddings of adjacency and
Laplacian matrices. We examine the analogues, in graph inference, of several
canonical tenets of classical Euclidean inference: in particular, we summarize
a body of existing results on the consistency and asymptotic normality of the
adjacency and Laplacian spectral embeddings, and the role these spectral
embeddings can play in the construction of single- and multi-sample hypothesis
tests for graph data. We investigate several real-world applications, including
community detection and classification in large social networks and the
determination of functional and biologically relevant network properties from
an exploratory data analysis of the Drosophila connectome. We outline requisite
background and current open problems in spectral graph inference.
Comment: An expository survey paper on a comprehensive paradigm for inference
for random dot product graphs, centered on graph adjacency and Laplacian
spectral embeddings. Paper outlines requisite background; summarizes theory,
methodology, and applications from previous and ongoing work; and closes with
a discussion of several open problems.
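The adjacency spectral embedding at the center of this paradigm admits a short sketch. Assuming the standard estimator (scaled top eigenvectors of the adjacency matrix), a minimal numpy version is:

```python
import numpy as np

def adjacency_spectral_embedding(A, d):
    """Adjacency spectral embedding (ASE): scaled top-d eigenvectors.

    For a symmetric adjacency matrix A, returns X = U_d |S_d|^{1/2},
    the standard spectral estimator of RDPG latent positions (up to an
    orthogonal rotation, which is unidentifiable in the RDPG model).
    """
    vals, vecs = np.linalg.eigh(A)
    top = np.argsort(np.abs(vals))[::-1][:d]  # d largest in magnitude
    return vecs[:, top] * np.sqrt(np.abs(vals[top]))
```

In the RDPG model the edge-probability matrix is P = XX^T for latent positions X, and applying ASE to P itself recovers X exactly up to rotation; applied to a sampled adjacency matrix, it is consistent and asymptotically normal, as the survey summarizes.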
The Importance of Being Correlated: Implications of Dependence in Joint Spectral Inference across Multiple Networks
Spectral inference on multiple networks is a rapidly developing subfield of
graph statistics. Recent work has demonstrated that joint, or simultaneous,
spectral embedding of multiple independent network realizations can deliver
more accurate estimation than individual spectral decompositions of those same
networks. Little attention has been paid, however, to the network correlation
that such joint embedding procedures necessarily induce. In this paper, we
present a detailed analysis of induced correlation in a {\em generalized
omnibus} embedding for multiple networks. We show that our embedding procedure
is flexible and robust, and, moreover, we prove a central limit theorem for
this embedding and explicitly compute the limiting covariance. We examine how
this covariance can impact inference in a network time series, and we construct
an appropriately calibrated omnibus embedding that can detect changes in real
biological networks that previous embedding procedures could not discern. Our
analysis confirms that the effect of induced correlation can be both subtle and
transformative, with import in theory and practice.
GraSPy: Graph Statistics in Python
We introduce GraSPy, a Python library devoted to statistical inference,
machine learning, and visualization of random graphs and graph populations.
This package provides flexible and easy-to-use algorithms for analyzing and
understanding graphs with a scikit-learn compliant API. GraSPy can be
downloaded from the Python Package Index (PyPI), and is released under the Apache
2.0 open-source license. The documentation and all releases are available at
https://neurodata.io/graspy
The two-to-infinity norm and singular subspace geometry with applications to high-dimensional statistics
The singular value matrix decomposition plays a ubiquitous role throughout
statistics and related fields. Myriad applications including clustering,
classification, and dimensionality reduction involve studying and exploiting
the geometric structure of singular values and singular vectors.
This paper provides a novel collection of technical and theoretical tools for
studying the geometry of singular subspaces using the two-to-infinity norm.
Motivated by preliminary deterministic Procrustes analysis, we consider a
general matrix perturbation setting in which we derive a new Procrustean matrix
decomposition. Together with flexible machinery developed for the
two-to-infinity norm, this allows us to conduct a refined analysis of the
induced perturbation geometry with respect to the underlying singular vectors
even in the presence of singular value multiplicity. Our analysis yields
singular vector entrywise perturbation bounds for a range of popular matrix
noise models, each of which has a meaningful associated statistical inference
task. In addition, we demonstrate how the two-to-infinity norm is the preferred
norm in certain statistical settings. Specific applications discussed in this
paper include covariance estimation, singular subspace recovery, and multiple
graph inference.
Both our Procrustean matrix decomposition and the technical machinery
developed for the two-to-infinity norm may be of independent interest.
Comment: 36 pages
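The norm itself is simple to state: for a matrix M, the two-to-infinity norm is the maximum Euclidean norm of its rows, which is why it controls entrywise (per-vertex) perturbations more tightly than the spectral norm for tall, thin matrices of singular vectors. A one-line numpy sketch:

```python
import numpy as np

def two_to_infinity_norm(M):
    """Two-to-infinity norm: the maximum Euclidean row norm of M.

    ||M||_{2->inf} = max_i ||m_i||_2, where m_i is the i-th row. For an
    n x d matrix of estimated singular vectors this bounds the worst-case
    per-row (per-vertex) estimation error.
    """
    return float(np.max(np.linalg.norm(M, axis=1)))
```

For an n x d matrix with d fixed and n large, this norm can be much smaller than the spectral norm, which is the source of the refined entrywise bounds discussed above.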
Information Recovery in Shuffled Graphs via Graph Matching
While many multiple graph inference methodologies operate under the implicit
assumption that an explicit vertex correspondence is known across the vertex
sets of the graphs, in practice these correspondences may only be partially or
errorfully known. Herein, we provide an information theoretic foundation for
understanding the practical impact that errorfully observed vertex
correspondences can have on subsequent inference, and the capacity of graph
matching methods to recover the lost vertex alignment and inferential
performance. Working in the correlated stochastic blockmodel setting, we
establish a duality between the loss of mutual information due to an errorfully
observed vertex correspondence and the ability of graph matching algorithms to
recover the true correspondence across graphs. In the process, we establish a
phase transition for graph matchability in terms of the correlation across
graphs, and we conjecture the analogous phase transition for the relative
information loss due to shuffling vertex labels. We demonstrate the practical
effect that graph shuffling---and matching---can have on subsequent inference,
with examples from two sample graph hypothesis testing and joint spectral graph
clustering.
Comment: 55 pages, 6 figures
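The objective underlying graph matching can be made concrete with a toy sketch. The abstract's matching task seeks the vertex permutation minimizing edge disagreements ||A - PBP^T||_F^2; practical algorithms approximate this for large graphs, but for tiny graphs exhaustive search suffices. The helper name below is illustrative, not from the paper.

```python
import itertools
import numpy as np

def match_graphs_bruteforce(A, B):
    """Graph matching by exhaustive search (toy sketch; small n only).

    Finds the vertex permutation of B minimizing the number of edge
    disagreements with A, i.e. the Frobenius objective ||A - PBP^T||_F^2
    that scalable graph matching algorithms approximate.
    """
    n = A.shape[0]
    best_perm, best_cost = None, np.inf
    for perm in itertools.permutations(range(n)):
        Bp = B[np.ix_(perm, perm)]          # B with vertices relabeled
        cost = float(np.sum((A - Bp) ** 2))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost
```

When the graphs are highly correlated the minimizer recovers the true correspondence; as correlation drops below the phase-transition threshold established in the paper, the minimizer becomes uninformative.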
Inference for multiple heterogeneous networks with a common invariant subspace
The development of models for multiple heterogeneous network data is of
critical importance both in statistical network theory and across multiple
application domains. Although single-graph inference is well-studied, multiple
graph inference is largely unexplored, in part because of the challenges
inherent in appropriately modeling graph differences and yet retaining
sufficient model simplicity to render estimation feasible. This paper addresses
exactly this gap, by introducing a new model, the common subspace
independent-edge (COSIE) multiple random graph model, which describes a
heterogeneous collection of networks with a shared latent structure on the
vertices but potentially different connectivity patterns for each graph. The
COSIE model encompasses many popular network representations, including the
stochastic blockmodel. The model is both flexible enough to meaningfully
account for important graph differences and tractable enough to allow for
accurate inference in multiple networks. In particular, a joint spectral
embedding of adjacency matrices - the multiple adjacency spectral embedding
(MASE) - leads, in a COSIE model, to simultaneous consistent estimation of
underlying parameters for each graph. Under mild additional assumptions, MASE
estimates satisfy asymptotic normality and yield improvements for graph
eigenvalue estimation and hypothesis testing. In both simulated and real data,
the COSIE model and the MASE embedding can be deployed for a number of
subsequent network inference tasks, including dimensionality reduction,
classification, hypothesis testing and community detection. Specifically, when
MASE is applied to a dataset of connectomes constructed through diffusion
magnetic resonance imaging, the result is an accurate classification of brain
scans by patient and a meaningful determination of heterogeneity across scans
of different subjects.
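The MASE procedure described above can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: take the top-d eigenvectors of each graph, stack them side by side, estimate the common subspace from the top-d left singular vectors of the stack, and then give each graph its own d x d score matrix.

```python
import numpy as np

def mase(adjs, d):
    """Multiple adjacency spectral embedding (MASE) sketch for COSIE.

    (1) take the top-d eigenvectors of each adjacency matrix,
    (2) stack them column-wise, (3) take the top-d left singular vectors
    of the stack as the estimated common invariant subspace V.
    Each graph then gets its own score matrix R_i = V^T A_i V.
    """
    Us = []
    for A in adjs:
        vals, vecs = np.linalg.eigh(A)
        top = np.argsort(np.abs(vals))[::-1][:d]
        Us.append(vecs[:, top])
    U, _, _ = np.linalg.svd(np.hstack(Us))
    V = U[:, :d]                        # estimated common subspace
    Rs = [V.T @ A @ V for A in adjs]    # per-graph connectivity scores
    return V, Rs
```

The shared subspace V captures the common vertex structure, while the score matrices R_i carry the per-graph connectivity differences, which is what makes downstream tasks such as classification and hypothesis testing tractable.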
Out-of-sample extension of graph adjacency spectral embedding
Many popular dimensionality reduction procedures have out-of-sample
extensions, which allow a practitioner to apply a learned embedding to
observations not seen in the initial training sample. In this work, we consider
the problem of obtaining an out-of-sample extension for the adjacency spectral
embedding, a procedure for embedding the vertices of a graph into Euclidean
space. We present two different approaches to this problem, one based on a
least-squares objective and the other based on a maximum-likelihood
formulation. We show that if the graph of interest is drawn according to a
certain latent position model called a random dot product graph, then both of
these out-of-sample extensions estimate the true latent position of the
out-of-sample vertex with the same error rate. Further, we prove a central
limit theorem for the least-squares-based extension, showing that the estimate
is asymptotically normal about the truth in the large-graph limit.
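The least-squares extension is particularly simple to sketch: given the in-sample embedding X (an n x d matrix) and the adjacency vector a of the new vertex to the n in-sample vertices, the estimated latent position is the ordinary least-squares solution minimizing ||a - Xw||_2. The function name is illustrative.

```python
import numpy as np

def out_of_sample_ls(X_hat, a):
    """Least-squares out-of-sample extension for spectral embeddings.

    X_hat : (n, d) in-sample embedding of the training graph.
    a     : length-n adjacency vector of the held-out vertex.
    Returns the minimizer of ||a - X_hat @ w||_2, the least-squares
    estimate of the new vertex's latent position.
    """
    w, *_ = np.linalg.lstsq(X_hat, a, rcond=None)
    return w
```

Because this avoids re-embedding the whole augmented graph, it costs only a d x d solve per new vertex rather than a fresh eigendecomposition.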
Link prediction in dynamic networks using random dot product graphs
The problem of predicting links in large networks is an important task in a
variety of practical applications, including social sciences, biology and
computer security. In this paper, statistical techniques for link prediction
based on the popular random dot product graph model are carefully presented,
analysed and extended to dynamic settings. Motivated by a practical application
in cyber-security, this paper demonstrates that random dot product graphs not
only represent a powerful tool for inferring differences between multiple
networks, but are also efficient for prediction purposes and for understanding
the temporal evolution of the network. The probabilities of links are obtained
by fusing information at two stages: spectral methods provide estimates of
latent positions for each node, and time series models are used to capture
temporal dynamics. In this way, traditional link prediction methods, usually
based on decompositions of the entire network adjacency matrix, are extended
using temporal information. The methods presented in this article are applied
to a number of simulated and real-world computer network graphs, showing
promising results.
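The two-stage fusion described above can be sketched end to end. The sketch below uses an exponentially weighted average of aligned per-snapshot embeddings as a simple stand-in for the time-series models in the article; the function names and the decay parameter are illustrative assumptions, not the paper's method.

```python
import numpy as np

def predict_links(snapshots, d, decay=0.5):
    """RDPG-style link prediction for a dynamic network (hedged sketch).

    Embeds each adjacency snapshot separately (scaled top-d eigenvectors),
    aligns successive embeddings by orthogonal Procrustes to remove the
    per-snapshot rotational ambiguity, forecasts latent positions with an
    exponentially weighted average, and returns next-step link
    probabilities as dot products of the forecast positions.
    """
    def ase(A):
        vals, vecs = np.linalg.eigh(A)
        top = np.argsort(np.abs(vals))[::-1][:d]
        return vecs[:, top] * np.sqrt(np.abs(vals[top]))

    Xs = [ase(A) for A in snapshots]
    for t in range(1, len(Xs)):
        # rotate X_t to best match the previous aligned embedding
        U, _, Vt = np.linalg.svd(Xs[t].T @ Xs[t - 1])
        Xs[t] = Xs[t] @ (U @ Vt)
    weights = np.array([decay ** (len(Xs) - 1 - t) for t in range(len(Xs))])
    weights /= weights.sum()                 # most recent snapshot weighted most
    X_forecast = sum(w * X for w, X in zip(weights, Xs))
    P = np.clip(X_forecast @ X_forecast.T, 0.0, 1.0)
    np.fill_diagonal(P, 0.0)                 # no self-loops
    return P
```

Scoring candidate edges by their entries in P is then a standard ranking task; the article's contribution is in replacing the naive average with proper time-series models for the latent positions.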