Network inference and community detection, based on covariance matrices, correlations and test statistics from arbitrary distributions
In this paper we propose methodology for inference of binary-valued adjacency
matrices from various measures of the strength of association between pairs of
network nodes, or more generally pairs of variables. This strength of
association can be quantified by sample covariance and correlation matrices,
and more generally by test-statistics and hypothesis test p-values from
arbitrary distributions. Community detection methods such as block modelling
typically require binary-valued adjacency matrices as a starting point. Hence,
a main motivation for the methodology we propose is to obtain binary-valued
adjacency matrices from such pairwise measures of strength of association
between variables. The proposed methodology is applicable to large
high-dimensional data-sets and is based on computationally efficient
algorithms. We illustrate its utility in a range of contexts and data-sets.
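
As a rough illustration of the kind of task this abstract describes, the sketch below turns a symmetric matrix of pairwise p-values into a binary adjacency matrix. It is not the paper's method: it assumes a Benjamini-Hochberg false discovery rate cutoff as a stand-in thresholding rule, and the function name `adjacency_from_pvalues` and the parameter `alpha` are hypothetical.

```python
def adjacency_from_pvalues(pvals, alpha=0.05):
    """Binarize a symmetric p-value matrix (assumed input) into a 0/1 adjacency matrix.

    Assumption: edges are selected with the Benjamini-Hochberg procedure at
    level alpha. This is a stand-in rule, not the method proposed in the paper.
    """
    n = len(pvals)
    # Collect each unordered pair once (upper triangle); the diagonal is ignored.
    pairs = [(pvals[i][j], i, j) for i in range(n) for j in range(i + 1, n)]
    pairs.sort()
    m = len(pairs)

    # Largest rank k such that the k-th smallest p-value is <= alpha * k / m.
    cutoff = 0
    for rank, (p, _, _) in enumerate(pairs, start=1):
        if p <= alpha * rank / m:
            cutoff = rank

    # Every pair up to that rank becomes an undirected edge.
    adj = [[0] * n for _ in range(n)]
    for _, i, j in pairs[:cutoff]:
        adj[i][j] = adj[j][i] = 1
    return adj
```

The resulting 0/1 matrix has the form that block-modelling methods take as input. Other thresholding rules could be substituted without changing the surrounding structure.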
Evaluating Overfit and Underfit in Models of Network Community Structure
A common data mining task on networks is community detection, which seeks an
unsupervised decomposition of a network into structural groups based on
statistical regularities in the network's connectivity. Although many methods
exist, the No Free Lunch theorem for community detection implies that each
makes some kind of tradeoff, and no algorithm can be optimal on all inputs.
Thus, different algorithms will over- or underfit on different inputs, finding
more, fewer, or just different communities than is optimal, and evaluation
methods that use a metadata partition as a ground truth will produce misleading
conclusions about general accuracy. Here, we present a broad evaluation of
overfitting and underfitting in community detection, comparing the behavior of 16
state-of-the-art community detection algorithms on a novel and structurally
diverse corpus of 406 real-world networks. We find that (i) algorithms vary
widely both in the number of communities they find and in their corresponding
composition, given the same input, (ii) algorithms can be clustered into
distinct high-level groups based on similarities of their outputs on real-world
networks, and (iii) these differences induce wide variation in accuracy on link
prediction and link description tasks. We introduce a new diagnostic for
evaluating overfitting and underfitting in practice, and use it to roughly
divide community detection methods into general and specialized learning
algorithms. Across methods and inputs, Bayesian techniques based on the
stochastic block model and a minimum description length approach to
regularization represent the best general learning approach, but can be
outperformed under specific circumstances. These results introduce both a
theoretically principled approach to evaluating overfitting and underfitting in models
of network community structure and a realistic benchmark by which new methods
may be evaluated and compared.
Comment: 22 pages, 13 figures, 3 tables
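
To make the link prediction comparison in this abstract more concrete, here is a minimal sketch of how a community partition could be scored on that task. It is an assumption-laden simplification, not the paper's diagnostic: a node pair scores 1 when both endpoints share a community and 0 otherwise, and accuracy is the AUC of that score for separating held-out edges from non-edges. The name `same_community_auc` and its arguments are hypothetical.

```python
def same_community_auc(labels, held_out_edges, non_edges):
    """AUC of a same-community score for telling held-out edges from non-edges.

    Assumptions (not from the paper): `labels` maps node -> community label,
    and a pair scores 1.0 if its endpoints share a label, else 0.0.
    """
    def score(pair):
        u, v = pair
        return 1.0 if labels[u] == labels[v] else 0.0

    # AUC as the probability that a held-out edge outscores a non-edge,
    # counting ties as one half.
    wins = 0.0
    for edge in held_out_edges:
        for non_edge in non_edges:
            s_edge, s_non = score(edge), score(non_edge)
            if s_edge > s_non:
                wins += 1.0
            elif s_edge == s_non:
                wins += 0.5
    return wins / (len(held_out_edges) * len(non_edges))
```

Running the same function on partitions produced by different algorithms for one network gives a crude way to see how their outputs differ in predictive accuracy.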