9 research outputs found
Network Modularity in the Presence of Covariates
We characterize the large-sample properties of network modularity in the presence of covariates, under a natural and flexible null model. This provides, for the first time, an objective measure of whether or not a particular value of modularity is meaningful. In particular, our results quantify the strength of the relation between observed community structure and the interactions in a network. Our technical contribution is to provide limit theorems for modularity when a community assignment is given by nodal features or covariates. These theorems hold for a broad class of network models over a range of sparsity regimes, as well as for weighted, multiedge, and power-law networks. This allows us to assign p-values to observed community structure, which we validate using several benchmark examples from the literature. We conclude by applying this methodology to investigate a multiedge network of corporate email interactions.
Modularity of regular and treelike graphs
Clustering algorithms for large networks typically use modularity values to
test which partitions of the vertex set better represent structure in the data.
The modularity of a graph is the maximum modularity of a partition. We consider
the modularity of two kinds of graphs.
For $r$-regular graphs with a given number of vertices, we investigate the
minimum possible modularity, the typical modularity, and the maximum possible
modularity. In particular, we see that for random cubic graphs the modularity
is usually in the interval $(0.666, 0.804)$, and for random $r$-regular graphs
with large $r$ it usually is of order $1/\sqrt{r}$. These results help to
establish baselines for statistical tests on regular graphs.
The modularity of cycles and low degree trees is known to be close to 1: we
extend these results to 'treelike' graphs, where the product of treewidth and
maximum degree is much less than the number of edges. This yields for example
the (deterministic) lower bound mentioned above on the modularity of
random cubic graphs.
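As an illustration of the statement that cycles have modularity close to 1, the following is a minimal sketch of the standard Newman-Girvan modularity of a single partition, evaluated on a cycle cut into contiguous arcs. The function name `partition_modularity`, the graph size, and the number of arcs are assumptions chosen for the example and do not come from the paper. The score of one partition is only a lower bound on the modularity of the graph, which is the maximum over all partitions.

```python
from collections import defaultdict


def partition_modularity(edges, part_of):
    """Newman-Girvan modularity of one vertex partition of a simple undirected graph.

    edges: iterable of (u, v) pairs; part_of: dict mapping vertex -> part label.
    """
    edges = list(edges)
    m = len(edges)
    internal = defaultdict(int)  # number of edges with both endpoints in the part
    volume = defaultdict(int)    # sum of vertex degrees in the part
    for u, v in edges:
        volume[part_of[u]] += 1
        volume[part_of[v]] += 1
        if part_of[u] == part_of[v]:
            internal[part_of[u]] += 1
    # Sum over parts of (edge share inside the part) - (degree share of the part)^2.
    return sum(internal[p] / m - (volume[p] / (2 * m)) ** 2 for p in volume)


# Hypothetical example: a cycle on n vertices split into k contiguous arcs.
n, k = 100, 10
cycle_edges = [(i, (i + 1) % n) for i in range(n)]
arcs = {i: i // (n // k) for i in range(n)}
q = partition_modularity(cycle_edges, arcs)
print(round(q, 6))
```

For this partition the score works out to 1 - k/n - 1/k, so taking k near the square root of n pushes the value toward 1 as n grows.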
Maximum Likelihood Estimation and Graph Matching in Errorfully Observed Networks
Given a pair of graphs with the same number of vertices, the inexact graph
matching problem consists in finding a correspondence between the vertices of
these graphs that minimizes the total number of induced edge disagreements. We
study this problem from a statistical framework in which one of the graphs is
an errorfully observed copy of the other. We introduce a corrupting channel
model, and show that in this model framework, the solution to the graph
matching problem is a maximum likelihood estimator. Necessary and sufficient
conditions for consistency of this MLE are presented, as well as a relaxed
notion of consistency in which a negligible fraction of the vertices need not
be matched correctly. The results are used to study matchability in several
families of random graphs, including edge independent models, random regular
graphs and small-world networks. We also use these results to introduce
measures of matching feasibility, and experimentally validate the results on
simulated and real-world networks.
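The objective described above, the number of edge disagreements induced by a vertex correspondence, can be sketched as follows. This is a brute-force illustration for very small graphs only, not the estimator or the algorithm studied in the paper; the function names and the toy graphs are assumptions made for the example.

```python
from itertools import permutations


def edge_disagreements(edges_a, edges_b, mapping):
    """Count vertex pairs that form an edge in exactly one graph
    once the vertices of graph A are relabeled by `mapping`."""
    mapped = {frozenset((mapping[u], mapping[v])) for u, v in edges_a}
    target = {frozenset(e) for e in edges_b}
    return len(mapped ^ target)  # symmetric difference of the two edge sets


def best_matching(vertices, edges_a, edges_b):
    """Exhaustively search all correspondences; feasible only for tiny graphs."""
    best = None
    for perm in permutations(vertices):
        mapping = dict(zip(vertices, perm))
        d = edge_disagreements(edges_a, edges_b, mapping)
        if best is None or d < best[0]:
            best = (d, mapping)
    return best


# Hypothetical example: A is a path; B is a relabeled copy with one spurious edge.
vertices = [0, 1, 2, 3]
edges_a = [(0, 1), (1, 2), (2, 3)]
edges_b = [(2, 0), (0, 3), (3, 1), (1, 2)]
disagreements, mapping = best_matching(vertices, edges_a, edges_b)
print(disagreements, mapping)
```

The exhaustive search is factorial in the number of vertices, so it serves only to make the objective concrete.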
Understanding Community Structure for Large Networks
The general theme of this thesis is to improve our understanding of community structure for large networks. A scientific challenge across fields (e.g., neuroscience, genetics, and social science) is to understand what drives the interactions between nodes in a network. One of the fundamental concepts in this context is community structure: the tendency of nodes to connect based on similar characteristics. Network models where a single parameter per node governs the propensity of connection are popular in practice. They frequently arise as null models that indicate a lack of community structure, since they cannot readily describe networks whose aggregate links behave in a block-like manner. We generalize one such model, the degree-based model, to a flexible, nonparametric class of network models, covering weighted, multi-edge, and power-law networks, and provide limit theorems that describe their asymptotic properties. We establish a theoretical foundation for modularity, a well-known measure of the strength of community structure, and derive its asymptotic properties under the assumption of a lack of community structure (formalized by the class of degree-based models described above). This enables us to assess how informative covariates are for the network interactions. Modularity is intuitive and practically effective but until now has lacked a sound theoretical basis. We derive modularity from first principles, and give it a formal statistical interpretation. Moreover, by acknowledging that different community assignments may explain different aspects of a network’s observed structure, we extend the applicability of modularity beyond its typical use to find a single “best” community assignment. We develop from our theoretical results a methodology to quantify network community structure. After validating it using several benchmark examples, we investigate a multi-edge network of corporate email interactions. Here, we demonstrate that our method can identify the covariates that are informative and therefore improve our understanding of the network.
Graph Inference with Applications to Low-Resource Audio Search and Indexing
The task of query-by-example search is to retrieve, from among a collection of data, the observations most similar to a given query. A common approach to this problem is based on viewing the data as vertices in a graph in which edge weights reflect similarities between observations. Errors arise in this graph-based framework both from errors in measuring these similarities and from approximations required for fast retrieval. In this thesis, we use tools from graph inference to analyze and control the sources of these errors. We establish novel theoretical results related to representation learning and to vertex nomination, and use these results to control the effects of model misspecification, noisy similarity measurement and approximation error on search accuracy. We present a state-of-the-art system for query-by-example audio search in the context of low-resource speech recognition, which also serves as an illustrative example and testbed for applying our theoretical results.