Comparative Study for Inference of Hidden Classes in Stochastic Block Models
Inference of hidden classes in the stochastic block model is a classical problem
with important applications. The most commonly used methods for this problem
involve naïve mean field approaches or heuristic spectral methods.
Recently, belief propagation was proposed for this problem. In this
contribution we perform a comparative study between the three methods on
synthetically created networks. We show that belief propagation performs much
better than the naïve mean field and spectral approaches. This holds for
accuracy, computational efficiency, and the tendency to overfit the data.
Comment: 8 pages, 5 figures AIGM1
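A minimal sketch of the kind of synthetic benchmark this abstract describes, assuming a two-group stochastic block model with within-group edge probability `p_in` and between-group probability `p_out`. The function names and parameters are illustrative and are not taken from the paper; the agreement score is one simple way to compare inferred classes with planted ones when group labels are only defined up to a swap.

```python
import random
from itertools import combinations


def sample_sbm(labels, p_in, p_out, seed=0):
    """Sample an undirected graph from a stochastic block model.

    labels[i] is the hidden class of node i. Each pair of nodes is linked
    independently with probability p_in (same class) or p_out (different).
    """
    rng = random.Random(seed)
    edges = []
    for u, v in combinations(range(len(labels)), 2):
        p = p_in if labels[u] == labels[v] else p_out
        if rng.random() < p:
            edges.append((u, v))
    return edges


def two_group_agreement(true_labels, inferred_labels):
    """Fraction of nodes labelled correctly, maximised over swapping 0 and 1.

    Only valid for two groups encoded as 0 and 1 (an assumption of this sketch).
    """
    n = len(true_labels)
    same = sum(t == g for t, g in zip(true_labels, inferred_labels))
    return max(same, n - same) / n
```

Any inference method (belief propagation, mean field, spectral) could be scored by passing its output to `two_group_agreement` alongside the planted labels.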
Inference of hidden structures in complex physical systems by multi-scale clustering
We survey the application of a relatively new branch of statistical
physics, "community detection", to data mining. In particular, we focus on the
diagnosis of materials and automated image segmentation. Community detection
describes the quest of partitioning a complex system involving many elements
into optimally decoupled subsets or communities of such elements. We review a
multiresolution variant which is used to ascertain structures at different
spatial and temporal scales. Significant patterns are obtained by examining the
correlations between different independent solvers. Similar to other
combinatorial optimization problems in the NP complexity class, community
detection exhibits several phases. Typically, illuminating orders are revealed
by choosing parameters that lead to extremal information-theoretic correlations.
Comment: 25 pages, 16 Figures; a review of earlier work
Evaluating Overfit and Underfit in Models of Network Community Structure
A common data mining task on networks is community detection, which seeks an
unsupervised decomposition of a network into structural groups based on
statistical regularities in the network's connectivity. Although many methods
exist, the No Free Lunch theorem for community detection implies that each
makes some kind of tradeoff, and no algorithm can be optimal on all inputs.
Thus, different algorithms will over- or underfit on different inputs, finding
more, fewer, or just different communities than is optimal, and evaluation
methods that use a metadata partition as a ground truth will produce misleading
conclusions about general accuracy. Here, we present a broad evaluation of over-
and underfitting in community detection, comparing the behavior of 16
state-of-the-art community detection algorithms on a novel and structurally
diverse corpus of 406 real-world networks. We find that (i) algorithms vary
widely both in the number of communities they find and in their corresponding
composition, given the same input, (ii) algorithms can be clustered into
distinct high-level groups based on similarities of their outputs on real-world
networks, and (iii) these differences induce wide variation in accuracy on link
prediction and link description tasks. We introduce a new diagnostic for
evaluating overfitting and underfitting in practice, and use it to roughly
divide community detection methods into general and specialized learning
algorithms. Across methods and inputs, Bayesian techniques based on the
stochastic block model and a minimum description length approach to
regularization represent the best general learning approach, but can be
outperformed under specific circumstances. These results introduce both a
theoretically principled approach to evaluate over- and underfitting in models
of network community structure and a realistic benchmark by which new methods
may be evaluated and compared.
Comment: 22 pages, 13 figures, 3 tables
Spectral redemption: clustering sparse networks
Spectral algorithms are classic approaches to clustering and community
detection in networks. However, for sparse networks the standard versions of
these algorithms are suboptimal, in some cases completely failing to detect
communities even when other algorithms such as belief propagation can do so.
Here we introduce a new class of spectral algorithms based on a
non-backtracking walk on the directed edges of the graph. The spectrum of this
operator is much better-behaved than that of the adjacency matrix or other
commonly used matrices, maintaining a strong separation between the bulk
eigenvalues and the eigenvalues relevant to community structure even in the
sparse case. We show that our algorithm is optimal for graphs generated by the
stochastic block model, detecting communities all the way down to the
theoretical limit. We also show the spectrum of the non-backtracking operator
for some real-world networks, illustrating its advantages over traditional
spectral clustering.
Comment: 11 pages, 6 figures. Clarified to what extent our claims are
rigorous, and to what extent they are conjectures; also added an
interpretation of the eigenvectors of the 2n-dimensional version of the
non-backtracking matrix
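A small sketch of the operator this abstract refers to, assuming the common definition in which a walk step from directed edge (u, v) to directed edge (v, x) is allowed only if x is not u. The helper names are hypothetical, NumPy is assumed to be available, and the sketch only builds the matrix and inspects its spectrum; it does not implement the authors' clustering procedure.

```python
import numpy as np


def non_backtracking_matrix(edges):
    """Build the non-backtracking matrix of an undirected simple graph.

    Rows and columns are indexed by directed edges. Entry [(u, v), (w, x)]
    is 1 when v == w and x != u, i.e. the walk continues without reversing.
    The double loop is quadratic in the number of edges: fine for a sketch.
    """
    directed = [(u, v) for u, v in edges] + [(v, u) for u, v in edges]
    index = {e: i for i, e in enumerate(directed)}
    B = np.zeros((len(directed), len(directed)))
    for (u, v), i in index.items():
        for (w, x), j in index.items():
            if v == w and x != u:
                B[i, j] = 1.0
    return B, directed


def sorted_spectrum(B):
    """Eigenvalues of B (generally complex), sorted by decreasing modulus."""
    vals = np.linalg.eigvals(B)
    return vals[np.argsort(-np.abs(vals))]
```

For a d-regular graph every directed edge has d - 1 allowed continuations, so the leading eigenvalue is d - 1; this gives a quick sanity check on small inputs such as the complete graph on four nodes.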
Latent tree models
Latent tree models are graphical models defined on trees, in which only a
subset of variables is observed. They were first discussed by Judea Pearl as
tree-decomposable distributions to generalise star-decomposable distributions
such as the latent class model. Latent tree models, or their submodels, are
widely used in phylogenetic analysis, network tomography, computer vision,
causal modeling, and data clustering. They also contain other well-known
classes of models like hidden Markov models, the Brownian motion tree model, the
Ising model on a tree, and many popular models used in phylogenetics. This
article offers a concise introduction to the theory of latent tree models. We
emphasise the role of tree metrics in the structural description of this model
class, in designing learning algorithms, and in understanding fundamental
limits of what can be learned and when.
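As an illustration of the tree metrics mentioned in this abstract, the sketch below checks the classical four-point condition, assuming "tree metric" is meant in the usual sense: for every four points, the two largest of the three pairwise-distance sums must be equal. The function name and tolerance are illustrative and not taken from the article.

```python
from itertools import combinations


def satisfies_four_point(d, tol=1e-9):
    """Return True if the symmetric distance matrix d passes the four-point test.

    For each quadruple (i, j, k, l) the three sums
    d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]
    are sorted, and the two largest must agree up to tol.
    """
    n = len(d)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted([
            d[i][j] + d[k][l],
            d[i][k] + d[j][l],
            d[i][l] + d[j][k],
        ])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True
```

Distances between the leaves of a weighted tree pass this check, whereas shortest-path distances on a four-node cycle do not.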
A survey of statistical network models
Networks are ubiquitous in science and have become a focal point for
discussion in everyday life. Formal statistical models for the analysis of
network data have emerged as a major topic of interest in diverse areas of
study, and most of these involve a form of graphical representation.
Probability models on graphs date back to 1959. Along with empirical studies in
social psychology and sociology from the 1960s, these early works generated an
active network community and a substantial literature in the 1970s. This effort
moved into the statistical literature in the late 1970s and 1980s, and the past
decade has seen a burgeoning network literature in statistical physics and
computer science. The growth of the World Wide Web and the emergence of online
networking communities such as Facebook, MySpace, and LinkedIn, and a host of
more specialized professional network communities have intensified interest in
the study of networks and network data. Our goal in this review is to provide
the reader with an entry point to this burgeoning literature. We begin with an
overview of the historical development of statistical network modeling and then
we introduce a number of examples that have been studied in the network
literature. Our subsequent discussion focuses on a number of prominent static
and dynamic network models and their interconnections. We emphasize formal
model descriptions, and pay special attention to the interpretation of
parameters and their estimation. We end with a description of some open
problems and challenges for machine learning and statistics.
Comment: 96 pages, 14 figures, 333 references