Community detection algorithms: a comparative analysis
Uncovering the community structure exhibited by real networks is a crucial
step towards an understanding of complex systems that goes beyond the local
organization of their constituents. Many algorithms have been proposed so far,
but none of them has been subjected to strict tests to evaluate their
performance. Most of the sporadic tests performed so far involved small
networks with known community structure and/or artificial graphs with a
simplified structure, which is very uncommon in real systems. Here we test
several methods against a recently introduced class of benchmark graphs, with
heterogeneous distributions of degree and community size. The methods are also
tested against the benchmark by Girvan and Newman and on random graphs. As a
result of our analysis, three recent algorithms introduced by Rosvall and
Bergstrom, Blondel et al. and Ronhovde and Nussinov, respectively, perform
excellently, with the additional advantage of low computational
complexity, which enables one to analyze large systems.
Comment: 12 pages, 8 figures. The software to compute the values of our
general normalized mutual information is available at
http://santo.fortunato.googlepages.com/inthepress
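Comparisons of this kind typically score a detected partition against the planted one with normalized mutual information. The following is a minimal sketch of the standard NMI for non-overlapping partitions, assuming each partition is given as a list of community labels, one per node. It is not the generalized version for covers that the authors' software computes, and the function name is hypothetical.

```python
from collections import Counter
from math import log


def normalized_mutual_information(labels_a, labels_b):
    """NMI between two hard partitions given as equal-length label lists.

    Returns 1.0 for identical partitions (up to relabeling) and 0.0 for
    statistically independent ones.
    """
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("partitions must be non-empty and of equal length")

    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))

    # Shannon entropies of the two label distributions.
    h_a = -sum(c / n * log(c / n) for c in count_a.values())
    h_b = -sum(c / n * log(c / n) for c in count_b.values())

    # Mutual information from the joint label counts.
    mutual = sum(
        c / n * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in count_ab.items()
    )

    if h_a + h_b == 0:
        # Both partitions put every node in a single community.
        return 1.0
    return 2 * mutual / (h_a + h_b)
```

For example, `normalized_mutual_information([0, 0, 1, 1], ["a", "a", "b", "b"])` treats the two labelings as the same partition, since only the grouping matters and not the label names.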
Robustness of journal rankings by network flows with different amounts of memory
As the number of scientific journals has multiplied, journal rankings have
become increasingly important for scientific decisions. From submissions and
subscriptions to grants and hirings, researchers, policy makers, and funding
agencies make important decisions with influence from journal rankings such as
the ISI journal impact factor. Typically, the rankings are derived from the
citation network between a selection of journals and unavoidably depend on this
selection. However, little is known about how robust rankings are to the
selection of included journals. Here we compare the robustness of three journal
rankings based on network flows induced on citation networks. They model
pathways of researchers navigating scholarly literature, stepping between
journals and remembering their previous steps to different degrees: zero-step
memory as impact factor, one-step memory as Eigenfactor, and two-step memory,
corresponding to zero-, first-, and second-order Markov models of citation flow
between journals. We conclude that higher-order Markov models perform better
and are more robust to the selection of journals. Whereas our analysis
indicates that higher-order models perform better, the performance gain for the
second-order Markov model comes at the cost of requiring more citation data
over a longer time period.
Comment: 9 pages, 5 figures
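The difference between zero-step and one-step memory can be illustrated on a toy citation matrix. This is a rough sketch under stated assumptions, not the paper's procedure: `citations[i][j]` is a hypothetical count of citations from journal i to journal j, the zero-order score is simply the share of citations received, and the first-order score is the stationary distribution of a random walk with uniform teleportation (damping 0.85 is an assumed value). The published impact factor and Eigenfactor computations include normalizations that this sketch omits.

```python
def zero_order_scores(citations):
    """Share of all citations received by each journal (no memory)."""
    n = len(citations)
    total = sum(sum(row) for row in citations)
    return [sum(citations[i][j] for i in range(n)) / total for j in range(n)]


def first_order_scores(citations, damping=0.85, iterations=200):
    """Stationary distribution of a first-order Markov walk on citations.

    The walker follows a citation with probability `damping` and otherwise
    jumps to a journal chosen uniformly at random (an assumption made here
    to guarantee convergence).
    """
    n = len(citations)
    out_totals = [sum(row) for row in citations]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        updated = [(1.0 - damping) / n] * n
        for i in range(n):
            for j in range(n):
                # Journals that cite nothing spread their flow uniformly.
                if out_totals[i] > 0:
                    step = citations[i][j] / out_totals[i]
                else:
                    step = 1.0 / n
                updated[j] += damping * scores[i] * step
        scores = updated
    return scores
```

A second-order model would condition each step on the previously visited journal as well, which requires counts over pairs of consecutive citations rather than the single matrix used here.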
Identifying modular flows on multilayer networks reveals highly overlapping organization in social systems
Unveiling the community structure of networks is a powerful methodology to
comprehend interconnected systems across the social and natural sciences. To
identify different types of functional modules in interaction data aggregated
in a single network layer, researchers have developed many powerful methods.
For example, flow-based methods have proven useful for identifying modular
dynamics in weighted and directed networks that capture constraints on flow in
the systems they represent. However, many networked systems consist of agents
or components that exhibit multiple layers of interactions. Inevitably,
representing this intricate network of networks as a single aggregated network
leads to information loss and may obscure the actual organization. Here we
propose a method based on compression of network flows that can identify
modular flows in non-aggregated multilayer networks. Our numerical experiments
on synthetic networks show that the method can accurately identify modules that
cannot be identified in aggregated networks or by analyzing the layers
separately. We capitalize on our findings and reveal the community structure of
two multilayer collaboration networks: scientists affiliated to the Pierre
Auger Observatory and scientists publishing works on networks on the arXiv.
Compared to conventional aggregated methods, the multilayer method reveals
smaller modules with more overlap that better capture the actual organization.
Mapping bilateral information interests using the activity of Wikipedia editors
We live in a global village where electronic communication has eliminated the
geographical barriers of information exchange. The road is now open to
worldwide convergence of information interests, shared values, and
understanding. Nevertheless, interests still vary between countries around the
world. This raises important questions about what today's world map of
information interests actually looks like and what factors cause the barriers of
information exchange between countries. To quantitatively construct a world map
of information interests, we devise a scalable statistical model that
identifies countries with similar information interests and measures the
countries' bilateral similarities. From the similarities we connect countries
in a global network and find that countries can be mapped into 18 clusters with
similar information interests. Through regression we find that language and
religion best explain the strength of the bilateral ties and formation of
clusters. Our findings provide a quantitative basis for further studies to
better understand the complex interplay between shared interests and conflict
on a global scale. The methodology can also be extended to track changes over
time and capture important trends in global information exchange.
Comment: 11 pages, 3 figures in Palgrave Communications 1 (2015
Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities
Many complex networks display a mesoscopic structure with groups of nodes
sharing many links with the other nodes in their group and comparatively few
with nodes of different groups. This feature is known as community structure
and encodes precious information about the organization and the function of the
nodes. Many algorithms have been proposed but it is not yet clear how they
should be tested. Recently we have proposed a general class of undirected and
unweighted benchmark graphs, with heterogeneous distributions of node degree and
community size. Increasing attention has recently been devoted to developing
algorithms able to consider the direction and the weight of the links, which
require suitable benchmark graphs for testing. In this paper we extend the
basic ideas behind our previous benchmark to generate directed and weighted
networks with built-in community structure. We also consider the possibility
that nodes belong to more than one community, a feature occurring in real
systems such as social networks. As a practical application, we show how
modularity optimization performs on our new benchmark.
Comment: 9 pages, 13 figures. Final version published in Physical Review E.
The code to create the benchmark graphs can be freely downloaded from
http://santo.fortunato.googlepages.com/inthepress
A high-reproducibility and high-accuracy method for automated topic classification
Much of human knowledge sits in large databases of unstructured text.
Leveraging this knowledge requires algorithms that extract and record metadata
on unstructured text documents. Assigning topics to documents will enable
intelligent search, statistical characterization, and meaningful
classification. Latent Dirichlet allocation (LDA) is the state of the art in
topic classification. Here, we perform a systematic theoretical and numerical
analysis that demonstrates that current optimization techniques for LDA often
yield results which are not accurate in inferring the most suitable model
parameters. Adapting approaches for community detection in networks, we propose
a new algorithm which displays high reproducibility and high accuracy, and also
has high computational efficiency. We apply it to a large set of documents in
the English Wikipedia and reveal its hierarchical structure. Our algorithm
promises to make "big data" text analysis systems more reliable.
Comment: 23 pages, 24 figures
Finding Statistically Significant Communities in Networks
Community structure is one of the main structural features of networks, revealing
both their internal organization and the similarity of their elementary units.
Despite the large variety of methods proposed to detect communities in graphs,
there is a strong need for multi-purpose techniques, able to handle different types
of datasets and the subtleties of community structure. In this paper we present
OSLOM (Order Statistics Local Optimization Method), the first method capable of
detecting clusters in networks while accounting for edge directions, edge weights,
overlapping communities, hierarchies and community dynamics. It is based on the
local optimization of a fitness function expressing the statistical significance
of clusters with respect to random fluctuations, which is estimated with tools
of Extreme and Order Statistics. OSLOM can be used alone or as a refinement
procedure of partitions/covers delivered by other techniques. We have also
implemented sequential algorithms combining OSLOM with other fast techniques, so
that the community structure of very large networks can be uncovered. Our method
performs comparably to the best existing algorithms on artificial
benchmark graphs. Several applications on real networks are shown as well. OSLOM
is implemented in freely available software (http://www.oslom.org), and we
believe it will be a valuable tool in the analysis of networks.
Combinatorial approach to Modularity
Communities are clusters of nodes with a higher than average density of
internal connections. Their detection is of great relevance to better
understand the structure and hierarchies present in a network. Modularity has
become a standard tool in the area of community detection, providing at the
same time a way to evaluate partitions and, by maximizing it, a method to find
communities. In this work, we study the modularity from a combinatorial point
of view. Our analysis (like the definition of modularity itself) relies on the
configuration model, a technique that, given a graph, produces a series of
randomized copies keeping the degree sequence invariant. We develop an approach
that enumerates the null model partitions and can be used to calculate the
probability distribution function of the modularity. Our theory allows for a
deep inquiry of several interesting features characterizing modularity such as
its resolution limit and the statistics of the partitions that maximize it.
Additionally, the study of the probability of extremes of the modularity in the
random graph partitions opens the way for a definition of the statistical
significance of network partitions.
Comment: 8 pages, 4 figures
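For reference, the quantity studied here can be computed directly. The sketch below assumes an undirected, unweighted graph given as an edge list and uses the standard Newman-Girvan form Q = sum over communities of l_c/m - (d_c/2m)^2, where l_c is the number of internal edges, d_c the total degree of the community, and m the number of edges. The squared term is the expected internal edge fraction under degree-preserving randomization. The function and argument names are hypothetical, and the combinatorial enumeration described in the abstract is not reproduced.

```python
def modularity(edges, community_of):
    """Newman-Girvan modularity of a partition of an undirected graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community_of: dict mapping every node to its community label.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("graph must contain at least one edge")

    internal_edges = {}  # l_c: edges with both endpoints in community c
    degree_sum = {}      # d_c: sum of node degrees in community c

    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal_edges[cu] = internal_edges.get(cu, 0) + 1

    # Observed internal fraction minus the null-model expectation.
    return sum(
        internal_edges.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )
```

On two triangles joined by a single edge, splitting the graph into the two triangles gives Q = 5/14, while placing all nodes in one community gives Q = 0.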