    The minimum bisection in the planted bisection model

    In the planted bisection model a random graph $G(n,p_+,p_-)$ with $n$ vertices is created by partitioning the vertices randomly into two classes of equal size (up to $\pm 1$). Any two vertices that belong to the same class are linked by an edge with probability $p_+$ and any two that belong to different classes with probability $p_- < p_+$, independently. The planted bisection model has been used extensively to benchmark graph partitioning algorithms. If $p_{\pm} = 2d_{\pm}/n$ for numbers $0 \leq d_- < d_+$ that remain fixed as $n \to \infty$, then w.h.p. the "planted" bisection (the one used to construct the graph) will not be a minimum bisection. In this paper we derive an asymptotic formula for the minimum bisection width under the assumption that $d_+ - d_- > c\sqrt{d_+ \ln d_+}$ for a certain constant $c > 0$.
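
    As an illustration of the model just described, the following is a minimal sketch of a sampler for $G(n,p_+,p_-)$ with $p_{\pm} = 2d_{\pm}/n$. The function names `sample_planted_bisection` and `cut_width` are hypothetical and not taken from the paper; the quadratic loop over vertex pairs is chosen for clarity, not efficiency.

```python
import random


def sample_planted_bisection(n, d_plus, d_minus, seed=None):
    """Sample a graph from G(n, p_+, p_-) with p_± = 2 * d_± / n.

    Returns (labels, edges): labels[v] is the class (0 or 1) of vertex v,
    and edges is a list of pairs (u, v) with u < v.
    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    p_plus = 2 * d_plus / n
    p_minus = 2 * d_minus / n

    # Random partition into two classes of equal size (up to ±1).
    order = list(range(n))
    rng.shuffle(order)
    labels = [0] * n
    for v in order[: n // 2]:
        labels[v] = 1

    # Each pair is linked independently; the probability depends only on
    # whether the two endpoints share a class.
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            p = p_plus if labels[u] == labels[v] else p_minus
            if rng.random() < p:
                edges.append((u, v))
    return labels, edges


def cut_width(labels, edges):
    """Number of edges whose endpoints lie in different classes."""
    return sum(1 for u, v in edges if labels[u] != labels[v])
```

    Calling `cut_width(labels, edges)` on a sample gives the width of the planted bisection, which by the statement above is w.h.p. not the minimum bisection width when $d_-$ and $d_+$ are fixed.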

    The Geometric Block Model

    To capture the inherent geometric features of many community detection problems, we propose to use a new random graph model of communities that we call a Geometric Block Model. The geometric block model generalizes the random geometric graphs in the same way that the well-studied stochastic block model generalizes the Erdős-Rényi random graphs. It is also a natural extension of random community models inspired by the recent theoretical and practical advancement in community detection. While the model is of fundamental theoretical interest, our main contribution is to show that many practical community structures are better explained by the geometric block model. We also show that a simple triangle-counting algorithm to detect communities in the geometric block model is near-optimal. Indeed, even in the regime where the average degree of the graph grows only logarithmically with the number of vertices (sparse graph), we show that this algorithm performs extremely well, both theoretically and practically. In contrast, the triangle-counting algorithm is far from optimal for the stochastic block model. We run simulations on both real and synthetic datasets to show the superior performance of both the new model and our algorithm. Comment: A shorter version of this paper has appeared in the 32nd AAAI Conference on Artificial Intelligence. The AAAI proceedings version as well as the previous version on arXiv contained some errors that have been corrected in this version.

    Consistency Thresholds for the Planted Bisection Model

    The planted bisection model is a random graph model in which the nodes are divided into two equal-sized communities and then edges are added randomly in a way that depends on the community membership. We establish necessary and sufficient conditions for the asymptotic recoverability of the planted bisection in this model. When the bisection is asymptotically recoverable, we give an efficient algorithm that successfully recovers it. We also show that the planted bisection is recoverable asymptotically if and only if with high probability every node belongs to the same community as the majority of its neighbors. Our algorithm for finding the planted bisection runs in time almost linear in the number of edges. It has three stages: spectral clustering to compute an initial guess, a "replica" stage to get almost every vertex correct, and then some simple local moves to finish the job. An independent work by Abbe, Bandeira, and Hall establishes similar (slightly weaker) results but only in the case of logarithmic average degree. Comment: latest version contains an erratum, addressing an error pointed out by Jan van Waai
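
    The majority-of-neighbors condition in the abstract above can be checked directly on a labeled graph. The sketch below is only an illustration of that condition, not the paper's recovery algorithm. The function name `nodes_against_majority` is hypothetical, and treating ties and isolated nodes as violations is an assumption made here.

```python
from collections import defaultdict


def nodes_against_majority(labels, edges):
    """Return the nodes that do not share a community with a strict
    majority of their neighbors.

    labels[v] is the community of node v; edges is an iterable of pairs.
    Assumption: ties (and isolated nodes) count as violations.
    """
    same = defaultdict(int)  # neighbors in the node's own community
    diff = defaultdict(int)  # neighbors in the other community
    for u, v in edges:
        if labels[u] == labels[v]:
            same[u] += 1
            same[v] += 1
        else:
            diff[u] += 1
            diff[v] += 1
    return [v for v in range(len(labels)) if same[v] <= diff[v]]
```

    An empty return value means every node agrees with the majority of its neighbors, which is the event that the abstract ties to recoverability.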

    Stochastic Block Model and Community Detection in the Sparse Graphs: A spectral algorithm with optimal rate of recovery

    In this paper, we present and analyze a simple and robust spectral algorithm for the stochastic block model with $k$ blocks, for any fixed $k$. Our algorithm works with graphs having constant edge density, under an optimal condition on the gap between the density inside a block and the density between the blocks. As a by-product, we settle an open question posed by Abbe et al. concerning censor block models.

    Recovery, detection and confidence sets of communities in a sparse stochastic block model

    Posterior distributions for community assignment in the planted bisection model are shown to achieve frequentist exact recovery and detection under sharp lower bounds on sparsity. Assuming posterior recovery (or detection), one may interpret credible sets (or enlarged credible sets) as consistent confidence sets. If credible levels grow to one quickly enough, credible sets can be interpreted as frequentist confidence sets without conditions on the parameters. In the regime where within-class and between-class edge probabilities are very close, credible sets may be enlarged to achieve frequentist asymptotic coverage. The diameters of credible sets are controlled and match rates of posterior convergence. Comment: 22 pp., 2 fi

    Global and Local Information in Clustering Labeled Block Models

    The stochastic block model is a classical cluster-exhibiting random graph model that has been widely studied in statistics, physics and computer science. In its simplest form, the model is a random graph with two equal-sized clusters, with intra-cluster edge probability p and inter-cluster edge probability q. We focus on the sparse case, i.e., p, q = O(1/n), which is practically more relevant and also mathematically more challenging. A conjecture of Decelle, Krzakala, Moore and Zdeborová, based on ideas from statistical physics, predicted a specific threshold for clustering. The negative direction of the conjecture was proved by Mossel, Neeman and Sly (2012), and more recently the positive direction was proved independently by Massoulié and by Mossel, Neeman and Sly. In many real network clustering problems, nodes contain information as well. We study the interplay between node and network information in clustering by studying a labeled block model, where in addition to the edge information, the true cluster labels of a small fraction of the nodes are revealed. In the case of two clusters, we show that below the threshold, a small amount of node information does not affect recovery. On the other hand, we show that for any small amount of information, efficient local clustering is achievable as long as the number of clusters is sufficiently large (as a function of the amount of revealed information). Comment: 24 pages, 2 figures. A short abstract describing these results will appear in proceedings of RANDOM 201
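
    The abstract above does not write out the threshold it refers to. In the usual parametrization of the sparse two-cluster model, $p = a/n$ and $q = b/n$ with constants $a > b > 0$ (the symbols $a$ and $b$ are an assumption here, not notation from the abstract), the conjectured and subsequently proved condition for clustering better than chance is commonly stated as:

```latex
% Two equal-sized clusters, p = a/n, q = b/n, with a > b > 0 fixed as n -> infinity.
% Clustering that beats random guessing is possible if and only if
\[
  (a - b)^2 > 2\,(a + b).
\]
```

    For example, $a = 5$ and $b = 1$ give $(a-b)^2 = 16 > 12 = 2(a+b)$, which is above the threshold, while $a = 3$ and $b = 1$ give $4 < 8$, which is below it.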