Nodes in real-world networks organize into densely linked communities where
edges appear with high concentration among the members of the community.
Identifying such communities of nodes has proven to be a challenging task
mainly due to a plethora of definitions of a community, intractability of
algorithms, issues with evaluation and the lack of a reliable gold-standard
ground-truth.
In this paper we study a set of 230 large real-world social, collaboration
and information networks where nodes explicitly state their group memberships.
For example, in social networks nodes explicitly join various interest-based
social groups. We use such groups to define a reliable and robust notion of
ground-truth communities. We then propose a methodology which allows us to
compare and quantitatively evaluate how different structural definitions of
network communities correspond to ground-truth communities. We choose 13
commonly used structural definitions of network communities and examine their
sensitivity, robustness and performance in identifying the ground-truth. We
show that the 13 structural definitions are heavily correlated and naturally
group into four classes. We find that two of these definitions, Conductance and
Triad-participation-ratio, consistently give the best performance in
identifying ground-truth communities. We also investigate the task of detecting
communities given a single seed node. We extend the local spectral clustering
algorithm into a heuristic parameter-free community detection method that
easily scales to networks with more than a hundred million nodes. The proposed
method achieves a 30% relative improvement over current local clustering methods.

Comment: Proceedings of 2012 IEEE International Conference on Data Mining
(ICDM), 201
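
To make the two best-performing definitions named in the abstract concrete, the following is a minimal sketch of how Conductance and Triad-participation-ratio could be computed for a candidate node set S. It assumes the usual formulations: conductance as c_S / (2 m_S + c_S), where m_S is the number of edges inside S and c_S the number of edges crossing its boundary, and triad participation ratio as the fraction of nodes in S that belong to at least one triangle lying entirely within S. The adjacency representation (a dict of neighbor sets for an undirected graph), the function names, and the toy graph are illustrative assumptions, not taken from the paper.

```python
from itertools import combinations


def conductance(adj, S):
    """Fraction of edge volume of S that points outside S: c_S / (2*m_S + c_S)."""
    S = set(S)
    cut = 0             # c_S: edges with exactly one endpoint in S
    internal_twice = 0  # 2*m_S: each internal edge is seen from both endpoints
    for u in S:
        for v in adj[u]:
            if v in S:
                internal_twice += 1
            else:
                cut += 1
    volume = internal_twice + cut
    return cut / volume if volume else 0.0


def triad_participation_ratio(adj, S):
    """Fraction of nodes in S that sit in at least one triangle inside S."""
    S = set(S)
    if not S:
        return 0.0
    in_triad = 0
    for u in S:
        # Neighbors of u that are also members of S.
        nbrs = [v for v in adj[u] if v in S]
        # u closes a triangle if any two of those neighbors are adjacent.
        if any(b in adj[a] for a, b in combinations(nbrs, 2)):
            in_triad += 1
    return in_triad / len(S)


# Hypothetical toy graph: a triangle 0-1-2 with a path 2-3-4 attached.
adj = {
    0: {1, 2},
    1: {0, 2},
    2: {0, 1, 3},
    3: {2, 4},
    4: {3},
}

tight = {0, 1, 2}   # the triangle
loose = {2, 3, 4}   # the path

print(conductance(adj, tight), triad_participation_ratio(adj, tight))
print(conductance(adj, loose), triad_participation_ratio(adj, loose))
```

In this toy graph the triangle scores lower conductance and higher triad participation than the path, which is the direction in which both measures indicate a more community-like set.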