Information Theoretic Criteria for Community Detection
Many algorithms for finding community structure in graphs search for a partition that maximizes modularity. However, recent work has identified two important limitations of modularity as a community quality criterion: a resolution limit, and a bias towards finding equal-sized communities. Information-theoretic approaches that search for partitions that minimize description length are a recent alternative to modularity. This paper shows that two information-theoretic algorithms are themselves subject to a resolution limit, identifies the component of each approach that is responsible for the resolution limit, proposes a variant, SGE (Sparse Graph Encoding), that addresses this limitation, and demonstrates on three artificial data sets that (1) SGE does not exhibit a resolution limit on sparse graphs in which other approaches do, and that (2) modularity and the compression-based algorithms, including SGE, behave similarly on graphs not subject to the resolution limit.
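As a small illustration of the modularity criterion this abstract refers to (not the paper's SGE method), the sketch below computes the standard Newman-Girvan modularity of a partition of an undirected, unweighted graph. The function name `modularity` and the toy graph of two triangles joined by one edge are assumptions made for the example.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Newman-Girvan modularity: sum over communities c of
    (intra-community edges / m) - (total degree of c / 2m)^2."""
    m = len(edges)
    intra = defaultdict(int)       # edges with both endpoints in community c
    degree_sum = defaultdict(int)  # sum of node degrees in community c
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            intra[cu] += 1
    return sum(intra[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)

# Hypothetical toy graph: two triangles connected by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {node: "all" for node in range(6)}

print(modularity(edges, split))   # one community per triangle
print(modularity(edges, merged))  # everything in one community
```

On this toy graph the two-triangle partition scores higher than the single-community partition, which is the behaviour a modularity-maximizing search relies on.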
Entrograms and coarse graining of dynamics on complex networks
Using an information theoretic point of view, we investigate how a dynamics
acting on a network can be coarse grained through the use of graph partitions.
Specifically, we are interested in how aggregating the state space of a Markov
process according to a partition impacts on the thus obtained lower-dimensional
dynamics. We highlight that for a dynamics on a particular graph there may be
multiple coarse grained descriptions that capture different, incomparable
features of the original process. For instance, a coarse graining induced by
one partition may be commensurate with a time-scale separation in the dynamics,
while another coarse graining may correspond to a different lower-dimensional
dynamics that preserves the Markov property of the original process. Taking
inspiration from the literature of Computational Mechanics, we find that a
convenient tool to summarise and visualise such dynamical properties of a
coarse grained model (partition) is the entrogram. The entrogram gathers
certain information-theoretic measures, which quantify how information flows
across time steps. These information theoretic quantities include the entropy
rate, as well as a measure for the memory contained in the process, i.e., how
well the dynamics can be approximated by a first order Markov process. We use
the entrogram to investigate how specific macro-scale connection patterns in
the state-space transition graph of the original dynamics result in desirable
properties of coarse grained descriptions. We thereby provide a fresh
perspective on the interplay between structure and dynamics in networks, and
the process of partitioning from an information theoretic perspective. We focus
on networks that may be approximated by either a core-periphery or a clustered
organization, and highlight that each of these coarse grained descriptions can
capture different aspects of a Markov process acting on the network.
Comment: 17 pages, 6 figures
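One of the quantities this abstract mentions, the entropy rate of a Markov process, can be sketched in a few lines. The code below is only an illustration under stated assumptions, not the paper's entrogram: it assumes an ergodic chain given as a row-stochastic matrix, approximates the stationary distribution by power iteration, and the function names are hypothetical.

```python
import math

def stationary_distribution(P, iterations=1000):
    """Approximate the stationary distribution of a row-stochastic
    matrix P by repeated left multiplication (assumes an ergodic chain)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def entropy_rate(P):
    """Entropy rate in bits per step: -sum_i pi_i sum_j P_ij log2 P_ij."""
    pi = stationary_distribution(P)
    return -sum(
        pi[i] * p * math.log2(p)
        for i, row in enumerate(P)
        for p in row
        if p > 0  # terms with zero probability contribute nothing
    )

# Hypothetical two-state chain that tends to stay in state 0.
P = [[0.9, 0.1],
     [0.5, 0.5]]
print(entropy_rate(P))
```

A chain whose next state is a fair coin flip regardless of the current state has an entropy rate of one bit per step; more predictable dynamics, such as the example chain, give lower values.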
Clustering and Community Detection in Directed Networks: A Survey
Networks (or graphs) appear as dominant structures in diverse domains,
including sociology, biology, neuroscience and computer science. In most of the
aforementioned cases graphs are directed - in the sense that there is
directionality on the edges, making the semantics of the edges non symmetric.
An interesting feature that real networks present is the clustering or
community structure property, under which the graph topology is organized into
modules commonly called communities or clusters. The essence here is that nodes
of the same community are highly similar while on the contrary, nodes across
communities present low similarity. Revealing the underlying community
structure of directed complex networks has become a crucial and
interdisciplinary topic with a plethora of applications. Therefore, naturally
there is a recent wealth of research production in the area of mining directed
graphs - with clustering being the primary method and tool for community
detection and evaluation. The goal of this paper is to offer an in-depth review
of the methods presented so far for clustering directed networks along with the
relevant necessary methodological background and also related applications. The
survey commences by offering a concise review of the fundamental concepts and
methodological base on which graph clustering algorithms capitalize. Then we
present the relevant work along two orthogonal classifications. The first one
is mostly concerned with the methodological principles of the clustering
algorithms, while the second one approaches the methods from the viewpoint of
the properties of a good cluster in a directed network. Further, we
present methods and metrics for evaluating graph clustering results,
demonstrate interesting application domains and provide promising future
research directions.
Comment: 86 pages, 17 figures. Physics Reports Journal (To Appear)
Evaluating Overfit and Underfit in Models of Network Community Structure
A common data mining task on networks is community detection, which seeks an
unsupervised decomposition of a network into structural groups based on
statistical regularities in the network's connectivity. Although many methods
exist, the No Free Lunch theorem for community detection implies that each
makes some kind of tradeoff, and no algorithm can be optimal on all inputs.
Thus, different algorithms will over or underfit on different inputs, finding
more, fewer, or just different communities than is optimal, and evaluation
methods that use a metadata partition as a ground truth will produce misleading
conclusions about general accuracy. Here, we present a broad evaluation of over
and underfitting in community detection, comparing the behavior of 16
state-of-the-art community detection algorithms on a novel and structurally
diverse corpus of 406 real-world networks. We find that (i) algorithms vary
widely both in the number of communities they find and in their corresponding
composition, given the same input, (ii) algorithms can be clustered into
distinct high-level groups based on similarities of their outputs on real-world
networks, and (iii) these differences induce wide variation in accuracy on link
prediction and link description tasks. We introduce a new diagnostic for
evaluating overfitting and underfitting in practice, and use it to roughly
divide community detection methods into general and specialized learning
algorithms. Across methods and inputs, Bayesian techniques based on the
stochastic block model and a minimum description length approach to
regularization represent the best general learning approach, but can be
outperformed under specific circumstances. These results introduce both a
theoretically principled approach to evaluate over and underfitting in models
of network community structure and a realistic benchmark by which new methods
may be evaluated and compared.
Comment: 22 pages, 13 figures, 3 tables
On the Permanence of Vertices in Network Communities
Despite the prevalence of community detection algorithms, relatively little
work has been done on understanding whether a network is indeed modular and how
resilient the community structure is under perturbations. To address this
issue, we propose a new vertex-based metric called "permanence", that can
quantitatively give an estimate of the community-like structure of the network.
The central idea of permanence is based on the observation that the strength
of membership of a vertex to a community depends upon the following two
factors: (i) the distribution of external connectivity of the vertex to
individual communities and not the total external connectivity, and (ii) the
strength of its internal connectivity and not just the total internal edges.
In this paper, we demonstrate that compared to other metrics, permanence
(i) provides a more accurate estimate of how well a derived community structure
matches the ground-truth communities and (ii) is more sensitive to perturbations in the
network. As a by-product of this study, we have also developed a community
detection algorithm based on maximizing permanence. For a modular network
structure, the results of our algorithm match well with ground-truth
communities.
Comment: 10 pages, 5 figures, 8 tables, Accepted in 20th ACM SIGKDD Conference
on Knowledge Discovery and Data Mining
Community detection and stochastic block models: recent developments
The stochastic block model (SBM) is a random graph model with planted
clusters. It is widely employed as a canonical model to study clustering and
community detection, and generally provides fertile ground to study the
statistical and computational tradeoffs that arise in network and data
sciences.
This note surveys the recent developments that establish the fundamental
limits for community detection in the SBM, both with respect to
information-theoretic and computational thresholds, and for various recovery
requirements such as exact, partial and weak recovery (a.k.a., detection). The
main results discussed are the phase transitions for exact recovery at the
Chernoff-Hellinger threshold, the phase transition for weak recovery at the
Kesten-Stigum threshold, the optimal distortion-SNR tradeoff for partial
recovery, the learning of the SBM parameters and the gap between
information-theoretic and computational thresholds.
The note also covers some of the algorithms developed in the quest of
achieving the limits, in particular two-round algorithms via graph-splitting,
semi-definite programming, linearized belief propagation, classical and
nonbacktracking spectral methods. A few open problems are also discussed.
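To make the model in this abstract concrete, the following sketch samples a graph from a stochastic block model and checks the Kesten-Stigum condition in the simplest setting. It assumes the symmetric two-community SBM with within-community edge probability a/n and across-community probability b/n, for which the weak-recovery threshold is known to be (a - b)^2 > 2(a + b). The function names and parameters are hypothetical, and the code is not one of the surveyed algorithms.

```python
import random

def sample_sbm(labels, p_in, p_out, seed=0):
    """Sample an undirected SBM graph: each pair of nodes is linked
    independently with probability p_in if the two labels match,
    and p_out otherwise."""
    rng = random.Random(seed)
    n = len(labels)
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if labels[u] == labels[v] else p_out
            if rng.random() < p:
                edges.append((u, v))
    return edges

def above_kesten_stigum(a, b):
    """Kesten-Stigum condition for the symmetric two-community SBM
    with edge probabilities a/n (within) and b/n (across)."""
    return (a - b) ** 2 > 2 * (a + b)

# Hypothetical sparse example: n = 200 nodes in two equal planted communities.
n, a, b = 200, 5.0, 1.0
labels = [0] * (n // 2) + [1] * (n // 2)
edges = sample_sbm(labels, a / n, b / n)
print(len(edges), above_kesten_stigum(a, b))
```

Fixing the seed keeps the sample reproducible, and moving `a` and `b` closer together takes the parameters below the threshold, where weak recovery of the planted communities is no longer possible in this two-community setting.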
- …