2,265 research outputs found
Post-processing partitions to identify domains of modularity optimization
We introduce the Convex Hull of Admissible Modularity Partitions (CHAMP)
algorithm to prune and prioritize different network community structures
identified across multiple runs of possibly various computational heuristics.
Given a set of partitions, CHAMP identifies the domain of modularity
optimization for each partition ---i.e., the parameter-space domain where it
has the largest modularity relative to the input set---discarding partitions
with empty domains to obtain the subset of partitions that are "admissible"
candidate community structures that remain potentially optimal over indicated
parameter domains. Importantly, CHAMP can be used for multi-dimensional
parameter spaces, such as those for multilayer networks where one includes a
resolution parameter and interlayer coupling. Using the results from CHAMP, a
user can more appropriately select robust community structures by observing the
sizes of domains of optimization and the pairwise comparisons between
partitions in the admissible subset. We demonstrate the utility of CHAMP with
several example networks. In these examples, CHAMP focuses attention onto
pruned subsets of admissible partitions that are 20-to-1785 times smaller than
the sets of unique partitions obtained by community detection heuristics that
were input into CHAMP.Comment: http://www.mdpi.com/1999-4893/10/3/9
Evaluating Overfit and Underfit in Models of Network Community Structure
A common data mining task on networks is community detection, which seeks an
unsupervised decomposition of a network into structural groups based on
statistical regularities in the network's connectivity. Although many methods
exist, the No Free Lunch theorem for community detection implies that each
makes some kind of tradeoff, and no algorithm can be optimal on all inputs.
Thus, different algorithms will over or underfit on different inputs, finding
more, fewer, or just different communities than is optimal, and evaluation
methods that use a metadata partition as a ground truth will produce misleading
conclusions about general accuracy. Here, we present a broad evaluation of over
and underfitting in community detection, comparing the behavior of 16
state-of-the-art community detection algorithms on a novel and structurally
diverse corpus of 406 real-world networks. We find that (i) algorithms vary
widely both in the number of communities they find and in their corresponding
composition, given the same input, (ii) algorithms can be clustered into
distinct high-level groups based on similarities of their outputs on real-world
networks, and (iii) these differences induce wide variation in accuracy on link
prediction and link description tasks. We introduce a new diagnostic for
evaluating overfitting and underfitting in practice, and use it to roughly
divide community detection methods into general and specialized learning
algorithms. Across methods and inputs, Bayesian techniques based on the
stochastic block model and a minimum description length approach to
regularization represent the best general learning approach, but can be
outperformed under specific circumstances. These results introduce both a
theoretically principled approach to evaluate over and underfitting in models
of network community structure and a realistic benchmark by which new methods
may be evaluated and compared.Comment: 22 pages, 13 figures, 3 table
Community Structure in Industrial SAT Instances
Modern SAT solvers have experienced a remarkable progress on solving
industrial instances. Most of the techniques have been developed after an
intensive experimental process. It is believed that these techniques exploit
the underlying structure of industrial instances. However, there are few works
trying to exactly characterize the main features of this structure.
The research community on complex networks has developed techniques of
analysis and algorithms to study real-world graphs that can be used by the SAT
community. Recently, there have been some attempts to analyze the structure of
industrial SAT instances in terms of complex networks, with the aim of
explaining the success of SAT solving techniques, and possibly improving them.
In this paper, inspired by the results on complex networks, we study the
community structure, or modularity, of industrial SAT instances. In a graph
with clear community structure, or high modularity, we can find a partition of
its nodes into communities such that most edges connect variables of the same
community. In our analysis, we represent SAT instances as graphs, and we show
that most application benchmarks are characterized by a high modularity. On the
contrary, random SAT instances are closer to the classical Erd\"os-R\'enyi
random graph model, where no structure can be observed. We also analyze how
this structure evolves by the effects of the execution of a CDCL SAT solver. In
particular, we use the community structure to detect that new clauses learned
by the solver during the search contribute to destroy the original structure of
the formula. This is, learned clauses tend to contain variables of distinct
communities
Community structure in industrial SAT instances
Modern SAT solvers have experienced a remarkable progress on solving industrial instances. It is believed that most of these successful techniques exploit the underlying structure of industrial instances. Recently, there have been some attempts to analyze the structure of industrial SAT instances in terms of complex networks, with the aim of explaining the success of SAT solving techniques, and possibly improving them.
In this paper, we study the community structure, or modularity, of industrial SAT instances. In a graph with clear community structure, or high modularity, we can find a partition of its nodes into communities such that most edges connect variables of the same community. Representing SAT instances as graphs, we show that most application benchmarks are characterized by a high modularity. On the contrary, random SAT instances are closer to the classical Erdös-Rényi random graph model, where no structure can be observed. We also analyze how this structure evolves by the effects of the execution of a CDCL SAT solver, and observe that new clauses learned by the solver during the search contribute to destroy the original structure of the formula. Motivated by this observation, we finally present an application that exploits the community structure to detect relevant learned clauses, and we show that detecting these clauses results in an improvement on the performance of the SAT solver. Empirically, we observe that this improves the performance of several SAT solvers on industrial SAT formulas, especially on satisfiable instances.Peer ReviewedPostprint (published version
Pruning Sets of Modularity Partitions in Single and Multi-layer Networks Via an Equivalence to Stochastic Block Model Inference
Partitioning a network into communitites of densely connected nodes is an important problem across a wide range of disciplines. We demonstrate a method for pruning sets of such partitions to identify small subsets that are statistically significant from the perspective of stochastic block model inference. Crucially, our method works in both single and multi-layer domains and allows for restricting to a fixed number of communities when desired. On the examples considered here, our procedure identifies fewer than 10 ``important'' partitions, even when the number of unique input partitions exceeds 100,000. Additionally, we derive upper bounds on the resolution parameter for which modularity maximization can be equivalent to optimizing the likelihood fit to a degree-corrected stochastic block model. We demonstrate that these bounds hold in practice, providing a priori regions where modularity maximization heuristics ``should'' be run if a certain number of communities is desired.Bachelor of Scienc
One-class classifiers based on entropic spanning graphs
One-class classifiers offer valuable tools to assess the presence of outliers
in data. In this paper, we propose a design methodology for one-class
classifiers based on entropic spanning graphs. Our approach takes into account
the possibility to process also non-numeric data by means of an embedding
procedure. The spanning graph is learned on the embedded input data and the
outcoming partition of vertices defines the classifier. The final partition is
derived by exploiting a criterion based on mutual information minimization.
Here, we compute the mutual information by using a convenient formulation
provided in terms of the -Jensen difference. Once training is
completed, in order to associate a confidence level with the classifier
decision, a graph-based fuzzy model is constructed. The fuzzification process
is based only on topological information of the vertices of the entropic
spanning graph. As such, the proposed one-class classifier is suitable also for
data characterized by complex geometric structures. We provide experiments on
well-known benchmarks containing both feature vectors and labeled graphs. In
addition, we apply the method to the protein solubility recognition problem by
considering several representations for the input samples. Experimental results
demonstrate the effectiveness and versatility of the proposed method with
respect to other state-of-the-art approaches.Comment: Extended and revised version of the paper "One-Class Classification
Through Mutual Information Minimization" presented at the 2016 IEEE IJCNN,
Vancouver, Canad
- …