2,265 research outputs found

    Post-processing partitions to identify domains of modularity optimization

    Full text link
    We introduce the Convex Hull of Admissible Modularity Partitions (CHAMP) algorithm to prune and prioritize different network community structures identified across multiple runs of possibly various computational heuristics. Given a set of partitions, CHAMP identifies the domain of modularity optimization for each partition ---i.e., the parameter-space domain where it has the largest modularity relative to the input set---discarding partitions with empty domains to obtain the subset of partitions that are "admissible" candidate community structures that remain potentially optimal over indicated parameter domains. Importantly, CHAMP can be used for multi-dimensional parameter spaces, such as those for multilayer networks where one includes a resolution parameter and interlayer coupling. Using the results from CHAMP, a user can more appropriately select robust community structures by observing the sizes of domains of optimization and the pairwise comparisons between partitions in the admissible subset. We demonstrate the utility of CHAMP with several example networks. In these examples, CHAMP focuses attention onto pruned subsets of admissible partitions that are 20-to-1785 times smaller than the sets of unique partitions obtained by community detection heuristics that were input into CHAMP.Comment: http://www.mdpi.com/1999-4893/10/3/9

    Evaluating Overfit and Underfit in Models of Network Community Structure

    Full text link
    A common data mining task on networks is community detection, which seeks an unsupervised decomposition of a network into structural groups based on statistical regularities in the network's connectivity. Although many methods exist, the No Free Lunch theorem for community detection implies that each makes some kind of tradeoff, and no algorithm can be optimal on all inputs. Thus, different algorithms will over or underfit on different inputs, finding more, fewer, or just different communities than is optimal, and evaluation methods that use a metadata partition as a ground truth will produce misleading conclusions about general accuracy. Here, we present a broad evaluation of over and underfitting in community detection, comparing the behavior of 16 state-of-the-art community detection algorithms on a novel and structurally diverse corpus of 406 real-world networks. We find that (i) algorithms vary widely both in the number of communities they find and in their corresponding composition, given the same input, (ii) algorithms can be clustered into distinct high-level groups based on similarities of their outputs on real-world networks, and (iii) these differences induce wide variation in accuracy on link prediction and link description tasks. We introduce a new diagnostic for evaluating overfitting and underfitting in practice, and use it to roughly divide community detection methods into general and specialized learning algorithms. Across methods and inputs, Bayesian techniques based on the stochastic block model and a minimum description length approach to regularization represent the best general learning approach, but can be outperformed under specific circumstances. These results introduce both a theoretically principled approach to evaluate over and underfitting in models of network community structure and a realistic benchmark by which new methods may be evaluated and compared.Comment: 22 pages, 13 figures, 3 table

    Community Structure in Industrial SAT Instances

    Get PDF
    Modern SAT solvers have experienced a remarkable progress on solving industrial instances. Most of the techniques have been developed after an intensive experimental process. It is believed that these techniques exploit the underlying structure of industrial instances. However, there are few works trying to exactly characterize the main features of this structure. The research community on complex networks has developed techniques of analysis and algorithms to study real-world graphs that can be used by the SAT community. Recently, there have been some attempts to analyze the structure of industrial SAT instances in terms of complex networks, with the aim of explaining the success of SAT solving techniques, and possibly improving them. In this paper, inspired by the results on complex networks, we study the community structure, or modularity, of industrial SAT instances. In a graph with clear community structure, or high modularity, we can find a partition of its nodes into communities such that most edges connect variables of the same community. In our analysis, we represent SAT instances as graphs, and we show that most application benchmarks are characterized by a high modularity. On the contrary, random SAT instances are closer to the classical Erd\"os-R\'enyi random graph model, where no structure can be observed. We also analyze how this structure evolves by the effects of the execution of a CDCL SAT solver. In particular, we use the community structure to detect that new clauses learned by the solver during the search contribute to destroy the original structure of the formula. This is, learned clauses tend to contain variables of distinct communities

    Community structure in industrial SAT instances

    Get PDF
    Modern SAT solvers have experienced a remarkable progress on solving industrial instances. It is believed that most of these successful techniques exploit the underlying structure of industrial instances. Recently, there have been some attempts to analyze the structure of industrial SAT instances in terms of complex networks, with the aim of explaining the success of SAT solving techniques, and possibly improving them. In this paper, we study the community structure, or modularity, of industrial SAT instances. In a graph with clear community structure, or high modularity, we can find a partition of its nodes into communities such that most edges connect variables of the same community. Representing SAT instances as graphs, we show that most application benchmarks are characterized by a high modularity. On the contrary, random SAT instances are closer to the classical Erdös-Rényi random graph model, where no structure can be observed. We also analyze how this structure evolves by the effects of the execution of a CDCL SAT solver, and observe that new clauses learned by the solver during the search contribute to destroy the original structure of the formula. Motivated by this observation, we finally present an application that exploits the community structure to detect relevant learned clauses, and we show that detecting these clauses results in an improvement on the performance of the SAT solver. Empirically, we observe that this improves the performance of several SAT solvers on industrial SAT formulas, especially on satisfiable instances.Peer ReviewedPostprint (published version

    Pruning Sets of Modularity Partitions in Single and Multi-layer Networks Via an Equivalence to Stochastic Block Model Inference

    Get PDF
    Partitioning a network into communitites of densely connected nodes is an important problem across a wide range of disciplines. We demonstrate a method for pruning sets of such partitions to identify small subsets that are statistically significant from the perspective of stochastic block model inference. Crucially, our method works in both single and multi-layer domains and allows for restricting to a fixed number of communities when desired. On the examples considered here, our procedure identifies fewer than 10 ``important'' partitions, even when the number of unique input partitions exceeds 100,000. Additionally, we derive upper bounds on the resolution parameter for which modularity maximization can be equivalent to optimizing the likelihood fit to a degree-corrected stochastic block model. We demonstrate that these bounds hold in practice, providing a priori regions where modularity maximization heuristics ``should'' be run if a certain number of communities is desired.Bachelor of Scienc

    One-class classifiers based on entropic spanning graphs

    Get PDF
    One-class classifiers offer valuable tools to assess the presence of outliers in data. In this paper, we propose a design methodology for one-class classifiers based on entropic spanning graphs. Our approach takes into account the possibility to process also non-numeric data by means of an embedding procedure. The spanning graph is learned on the embedded input data and the outcoming partition of vertices defines the classifier. The final partition is derived by exploiting a criterion based on mutual information minimization. Here, we compute the mutual information by using a convenient formulation provided in terms of the α\alpha-Jensen difference. Once training is completed, in order to associate a confidence level with the classifier decision, a graph-based fuzzy model is constructed. The fuzzification process is based only on topological information of the vertices of the entropic spanning graph. As such, the proposed one-class classifier is suitable also for data characterized by complex geometric structures. We provide experiments on well-known benchmarks containing both feature vectors and labeled graphs. In addition, we apply the method to the protein solubility recognition problem by considering several representations for the input samples. Experimental results demonstrate the effectiveness and versatility of the proposed method with respect to other state-of-the-art approaches.Comment: Extended and revised version of the paper "One-Class Classification Through Mutual Information Minimization" presented at the 2016 IEEE IJCNN, Vancouver, Canad
    • …
    corecore