19,663 research outputs found
Community detection and stochastic block models: recent developments
The stochastic block model (SBM) is a random graph model with planted
clusters. It is widely employed as a canonical model to study clustering and
community detection, and provides generally a fertile ground to study the
statistical and computational tradeoffs that arise in network and data
sciences.
This note surveys the recent developments that establish the fundamental
limits for community detection in the SBM, both with respect to
information-theoretic and computational thresholds, and for various recovery
requirements such as exact, partial and weak recovery (a.k.a., detection). The
main results discussed are the phase transitions for exact recovery at the
Chernoff-Hellinger threshold, the phase transition for weak recovery at the
Kesten-Stigum threshold, the optimal distortion-SNR tradeoff for partial
recovery, the learning of the SBM parameters and the gap between
information-theoretic and computational thresholds.
The note also covers some of the algorithms developed in the quest of
achieving the limits, in particular two-round algorithms via graph-splitting,
semi-definite programming, linearized belief propagation, classical and
nonbacktracking spectral methods. A few open problems are also discussed
Stochastic blockmodels and community structure in networks
Stochastic blockmodels have been proposed as a tool for detecting community
structure in networks as well as for generating synthetic networks for use as
benchmarks. Most blockmodels, however, ignore variation in vertex degree,
making them unsuitable for applications to real-world networks, which typically
display broad degree distributions that can significantly distort the results.
Here we demonstrate how the generalization of blockmodels to incorporate this
missing element leads to an improved objective function for community detection
in complex networks. We also propose a heuristic algorithm for community
detection using this objective function or its non-degree-corrected counterpart
and show that the degree-corrected version dramatically outperforms the
uncorrected one in both real-world and synthetic networks.Comment: 11 pages, 3 figure
Evaluating Overfit and Underfit in Models of Network Community Structure
A common data mining task on networks is community detection, which seeks an
unsupervised decomposition of a network into structural groups based on
statistical regularities in the network's connectivity. Although many methods
exist, the No Free Lunch theorem for community detection implies that each
makes some kind of tradeoff, and no algorithm can be optimal on all inputs.
Thus, different algorithms will over or underfit on different inputs, finding
more, fewer, or just different communities than is optimal, and evaluation
methods that use a metadata partition as a ground truth will produce misleading
conclusions about general accuracy. Here, we present a broad evaluation of over
and underfitting in community detection, comparing the behavior of 16
state-of-the-art community detection algorithms on a novel and structurally
diverse corpus of 406 real-world networks. We find that (i) algorithms vary
widely both in the number of communities they find and in their corresponding
composition, given the same input, (ii) algorithms can be clustered into
distinct high-level groups based on similarities of their outputs on real-world
networks, and (iii) these differences induce wide variation in accuracy on link
prediction and link description tasks. We introduce a new diagnostic for
evaluating overfitting and underfitting in practice, and use it to roughly
divide community detection methods into general and specialized learning
algorithms. Across methods and inputs, Bayesian techniques based on the
stochastic block model and a minimum description length approach to
regularization represent the best general learning approach, but can be
outperformed under specific circumstances. These results introduce both a
theoretically principled approach to evaluate over and underfitting in models
of network community structure and a realistic benchmark by which new methods
may be evaluated and compared.Comment: 22 pages, 13 figures, 3 table
A semidefinite program for unbalanced multisection in the stochastic block model
We propose a semidefinite programming (SDP) algorithm for community detection
in the stochastic block model, a popular model for networks with latent
community structure. We prove that our algorithm achieves exact recovery of the
latent communities, up to the information-theoretic limits determined by Abbe
and Sandon (2015). Our result extends prior SDP approaches by allowing for many
communities of different sizes. By virtue of a semidefinite approach, our
algorithms succeed against a semirandom variant of the stochastic block model,
guaranteeing a form of robustness and generalization. We further explore how
semirandom models can lend insight into both the strengths and limitations of
SDPs in this setting.Comment: 29 page
Inference of hidden structures in complex physical systems by multi-scale clustering
We survey the application of a relatively new branch of statistical
physics--"community detection"-- to data mining. In particular, we focus on the
diagnosis of materials and automated image segmentation. Community detection
describes the quest of partitioning a complex system involving many elements
into optimally decoupled subsets or communities of such elements. We review a
multiresolution variant which is used to ascertain structures at different
spatial and temporal scales. Significant patterns are obtained by examining the
correlations between different independent solvers. Similar to other
combinatorial optimization problems in the NP complexity class, community
detection exhibits several phases. Typically, illuminating orders are revealed
by choosing parameters that lead to extremal information theory correlations.Comment: 25 pages, 16 Figures; a review of earlier work
- …