
    Community detection and stochastic block models: recent developments

    The stochastic block model (SBM) is a random graph model with planted clusters. It is widely employed as a canonical model for studying clustering and community detection, and it generally provides fertile ground for studying the statistical and computational tradeoffs that arise in network and data sciences. This note surveys recent developments that establish the fundamental limits of community detection in the SBM, with respect to both information-theoretic and computational thresholds, and for various recovery requirements such as exact, partial, and weak recovery (a.k.a. detection). The main results discussed are the phase transition for exact recovery at the Chernoff-Hellinger threshold, the phase transition for weak recovery at the Kesten-Stigum threshold, the optimal distortion-SNR tradeoff for partial recovery, the learning of the SBM parameters, and the gap between information-theoretic and computational thresholds. The note also covers some of the algorithms developed in the quest to achieve these limits, in particular two-round algorithms via graph-splitting, semidefinite programming, linearized belief propagation, and classical and nonbacktracking spectral methods. A few open problems are also discussed.
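To make the abstract's terms concrete, here is a minimal Python sketch of the simplest special case: two balanced communities. It is not taken from the surveyed note. The function names and the parameterization are illustrative assumptions: `a` and `b` are scaled within- and between-community connection parameters, with edge probabilities a/n and b/n in the sparse regime, and a·log(n)/n and b·log(n)/n in the logarithmic regime. In this symmetric case the Kesten-Stigum condition reads (a − b)² > 2(a + b), and the exact-recovery condition reduces to |√a − √b| > √2. Behavior exactly at either threshold is deliberately not modeled.

```python
import math
import random


def sample_sbm(n, p_in, p_out, seed=0):
    """Sample a two-community SBM with a balanced planted partition.

    Returns (labels, edges). Node i is assigned to community i % 2;
    this assignment rule is an illustrative choice, not part of the model.
    """
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n)]
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            # Same community -> p_in, different communities -> p_out.
            p = p_in if labels[i] == labels[j] else p_out
            if rng.random() < p:
                edges.append((i, j))
    return labels, edges


def above_kesten_stigum(a, b):
    """Weak-recovery check for two balanced communities.

    Assumes edge probabilities a/n (within) and b/n (between).
    """
    return (a - b) ** 2 > 2 * (a + b)


def above_exact_recovery_threshold(a, b):
    """Exact-recovery check for two balanced communities.

    Assumes edge probabilities a*log(n)/n (within) and b*log(n)/n (between).
    """
    return abs(math.sqrt(a) - math.sqrt(b)) > math.sqrt(2)
```

For example, `above_kesten_stigum(5, 1)` holds because 16 > 12, while `above_kesten_stigum(3, 1)` does not because 4 < 8. The sampler uses a quadratic loop over node pairs, which is fine for small illustrations but not for large graphs.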

    Clustering of Diverse Multiplex Networks

    This dissertation introduces the DIverse MultiPLEx Generalized Dot Product Graph (DIMPLE-GDPG) network model, in which all layers of the network have the same collection of nodes and follow the Generalized Dot Product Graph (GDPG) model. In addition, all layers can be partitioned into groups such that the layers in the same group are embedded in the same ambient subspace, but otherwise all matrices of connection probabilities can be different. In the common particular cases where the layers of the network follow the Stochastic Block Model (SBM) or the Degree Corrected Block Model (DCBM), this setting implies that the groups of layers have common community structures but all matrices of block connection probabilities can be different. For the DCBM, each group can also be equipped with node-specific weights. We refer to these two versions as the DIMPLE model and the DIMPLE-DECOR model. While the DIMPLE-GDPG model generalizes the COmmon Subspace Independent Edge (COSIE) random graph model, the DIMPLE model generalizes a multitude of papers that study multilayer networks with the same community structures in all layers (which include the tensor block model, the checker-board model, and the Mixture Multilayer Stochastic Block Model (MMLSBM) as particular cases). This dissertation introduces novel algorithms for the recovery of similar groups of layers, for the estimation of the ambient subspaces in the groups of layers in the DIMPLE-GDPG setting, and for the within-layer clustering in the case of the DIMPLE model. We also consider applications of the DIMPLE models to real-life data and compare them with the MMLSBM. The DIMPLE model, with its SBM-imposed structures, provided better descriptions of the organization of layers than those obtained on the basis of the MMLSBM setting.

    COMMUNITY DETECTION IN GRAPHS

    Thesis (Ph.D.) - Indiana University, Luddy School of Informatics, Computing, and Engineering/University Graduate School, 2020. Community detection has always been one of the fundamental research topics in graph mining. As a type of unsupervised or semi-supervised approach, community detection aims to explore high-order closeness among nodes by leveraging the topological structure of the graph. By grouping similar nodes or edges into the same community while separating dissimilar ones into different communities, graph structure can be revealed at a coarser resolution. This can benefit numerous applications, such as shopping recommendation and advertising in e-commerce, protein-protein interaction prediction in bioinformatics, and literature recommendation or scholar collaboration in citation analysis. However, identifying communities is an ill-defined problem. Due to the No Free Lunch theorem [1], there is neither a gold standard that represents a perfect community partition nor a universal method that can detect satisfactory communities for all tasks on all types of graphs. To give a global view of this research topic, I summarize state-of-the-art community detection methods by categorizing them according to graph types, research tasks, and methodological frameworks. Because academic work on community detection has grown rapidly in recent years, I focus particularly on state-of-the-art works published in the last decade, which may leave out some classic models published decades ago. Three subtle community detection tasks are also proposed and assessed in this dissertation. First, unlike general models that consider only graph structure, personalized community detection uses user need as auxiliary information to guide community detection. The result is fine-grained communities for the nodes that best match user needs and coarser-resolution communities for the remaining, less relevant nodes. Second, graphs often suffer from sparse connectivity. Applying conventional models directly to such graphs may severely distort the quality of the generated communities. To tackle this problem, cross-graph techniques are used to propagate information from external graphs in support of community detection on the target graph. Third, graph community structure supports a natural language processing (NLP) task that depicts the intrinsic characteristics of nodes by generating node summaries with a text generative model. The contribution of this dissertation is threefold. First, a substantial body of research is reviewed and summarized under a well-defined taxonomy. Existing works on methods, evaluation, and applications are all addressed in the literature review. Second, three novel community detection tasks are presented, and the associated models are proposed and evaluated against state-of-the-art baselines on various datasets. Third, the limitations of current works are pointed out and promising directions for future research are discussed.