
    Model Selection for Stochastic Block Models

    As a flexible representation for complex systems, networks (graphs) model entities and their interactions as nodes and edges. In many real-world networks, nodes divide naturally into functional communities, where nodes in the same group connect to the rest of the network in similar ways. Discovering such communities is an important part of modeling networks, as community structure offers clues to the processes that generated the graph. The stochastic block model is a popular network model based on community structure. It splits nodes into blocks, within which all nodes are stochastically equivalent in terms of how they connect to the rest of the network. As a generative model, it has a well-defined likelihood function with consistent parameter estimates. It is also highly flexible, capable of modeling a wide variety of community structures, including degree-specific and overlapping communities. The performance of different block models varies across scenarios, so picking the right model is crucial for successful network modeling. A good model choice should balance the trade-off between complexity and fit. The task of model selection is to automatically choose such a model given the data and the inference task. Because the problem is of wide interest, numerous statistical model selection techniques have been developed for classic independent data. Unfortunately, it has been a common mistake to use these techniques in block models without rigorous examination of their derivations, ignoring the fact that some of their fundamental assumptions are violated in the domain of relational data sets such as networks. In this dissertation, I thoroughly examine the literature on statistical model selection techniques, including both frequentist and Bayesian approaches. My goal is to develop principled statistical model selection criteria for block models by adapting classic methods to network data. I do this by running bootstrapping simulations with an efficient algorithm and correcting classic model selection theories for block models based on the simulation data. The new model selection methods are verified on both synthetic and real-world data sets.
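To illustrate the complexity-versus-fit trade-off the abstract describes, here is a minimal sketch of the maximized log-likelihood of a simple Bernoulli block model for a fixed partition. It is not the dissertation's method: it assumes an undirected graph without self-loops, and the function name `sbm_log_likelihood` and the toy graph are hypothetical choices made for this example.

```python
import math
from itertools import combinations

def sbm_log_likelihood(edges, labels):
    """Maximized log-likelihood of an undirected Bernoulli block model.

    edges  : iterable of (i, j) node pairs (assumed simple graph, no self-loops)
    labels : labels[i] is the block assigned to node i
    Each block pair gets its own edge probability, set to its maximum
    likelihood estimate (observed edges / possible node pairs).
    """
    edge_set = {frozenset(e) for e in edges}
    pair_count = {}  # possible node pairs per block pair
    edge_count = {}  # observed edges per block pair
    for i, j in combinations(range(len(labels)), 2):
        key = tuple(sorted((labels[i], labels[j])))
        pair_count[key] = pair_count.get(key, 0) + 1
        if frozenset((i, j)) in edge_set:
            edge_count[key] = edge_count.get(key, 0) + 1

    total = 0.0
    for key, pairs in pair_count.items():
        m = edge_count.get(key, 0)
        # Block pairs that are empty or complete contribute log(1) = 0.
        if 0 < m < pairs:
            p = m / pairs
            total += m * math.log(p) + (pairs - m) * math.log(1 - p)
    return total

# Toy graph: two disjoint triangles.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]
one_block = sbm_log_likelihood(edges, [0, 0, 0, 0, 0, 0])
two_blocks = sbm_log_likelihood(edges, [0, 0, 0, 1, 1, 1])
print(one_block, two_blocks)
```

On this toy graph the two-block partition fits perfectly, so its log-likelihood is 0, while the one-block partition scores lower. Splitting blocks further can never lower this maximized likelihood, so raw fit alone cannot choose the number of blocks. A model selection criterion has to add some penalty for complexity.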

    Model Selection in Overlapping Stochastic Block Models

    Networks are a commonly used mathematical model for describing the rich set of interactions between objects of interest. Many clustering methods have been developed to partition such structures, several of which rely on underlying probabilistic models, typically mixture models. In several applications, however, the relevant hidden structure may contain overlapping groups. The Overlapping Stochastic Block Model (2011) was developed to take this phenomenon into account. Nevertheless, the problem of choosing the number of classes in the inference step is still open. To tackle this issue, we consider the proposed model in a Bayesian framework and develop a new criterion based on a non-asymptotic approximation of the marginal log-likelihood. We describe how the criterion can be computed through a variational Bayes EM algorithm, and demonstrate its efficiency by running it on both simulated and real data.

    Evaluating Overfit and Underfit in Models of Network Community Structure

    A common data mining task on networks is community detection, which seeks an unsupervised decomposition of a network into structural groups based on statistical regularities in the network's connectivity. Although many methods exist, the No Free Lunch theorem for community detection implies that each makes some kind of tradeoff, and no algorithm can be optimal on all inputs. Thus, different algorithms will over- or underfit on different inputs, finding more, fewer, or just different communities than is optimal, and evaluation methods that use a metadata partition as a ground truth will produce misleading conclusions about general accuracy. Here, we present a broad evaluation of over- and underfitting in community detection, comparing the behavior of 16 state-of-the-art community detection algorithms on a novel and structurally diverse corpus of 406 real-world networks. We find that (i) algorithms vary widely both in the number of communities they find and in their corresponding composition, given the same input, (ii) algorithms can be clustered into distinct high-level groups based on similarities of their outputs on real-world networks, and (iii) these differences induce wide variation in accuracy on link prediction and link description tasks. We introduce a new diagnostic for evaluating overfitting and underfitting in practice, and use it to roughly divide community detection methods into general and specialized learning algorithms. Across methods and inputs, Bayesian techniques based on the stochastic block model and a minimum description length approach to regularization represent the best general learning approach, but can be outperformed under specific circumstances. These results introduce both a theoretically principled approach to evaluating over- and underfitting in models of network community structure and a realistic benchmark by which new methods may be evaluated and compared. Comment: 22 pages, 13 figures, 3 tables