1,127 research outputs found
A network approach to topic models
One of the main computational and scientific challenges in the modern age is
to extract useful information from unstructured texts. Topic models are one
popular machine-learning approach which infers the latent topical structure of
a collection of documents. Despite their success, in particular that of the most
widely used variant, Latent Dirichlet Allocation (LDA), and their numerous
applications in sociology, history, and linguistics, topic models are known to
suffer from severe conceptual and practical problems, e.g. a lack of
justification for the Bayesian priors, discrepancies with statistical
properties of real texts, and the inability to properly choose the number of
topics. Here we obtain a fresh view on the problem of identifying topical
structures by relating it to the problem of finding communities in complex
networks. This is achieved by representing text corpora as bipartite networks
of documents and words. By adapting existing community-detection methods --
using a stochastic block model (SBM) with non-parametric priors -- we obtain a
more versatile and principled framework for topic modeling (e.g., it
automatically detects the number of topics and hierarchically clusters both the
words and documents). The analysis of artificial and real corpora demonstrates
that our SBM approach leads to better topic models than LDA in terms of
statistical model selection. More importantly, our work shows how to formally
relate methods from community detection and topic modeling, opening the
possibility of cross-fertilization between these two fields.
Comment: 22 pages, 10 figures, code available at https://topsbm.github.io
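The representation step described in this abstract, a corpus viewed as a bipartite network of documents and words, can be illustrated with a minimal sketch. This is not the authors' code (which the comment above places at https://topsbm.github.io); the toy corpus, the whitespace tokenization, and the function name `bipartite_edges` are assumptions made only for illustration, and no block model is fitted here.

```python
from collections import Counter

def bipartite_edges(corpus):
    """Return a Counter mapping (doc_id, word) -> number of occurrences.

    Each key is an edge between a document node and a word node;
    the count is the edge multiplicity (hypothetical helper).
    """
    edges = Counter()
    for doc_id, text in corpus.items():
        # Naive tokenization (an assumption): lowercase and split on whitespace.
        for word in text.lower().split():
            edges[(doc_id, word)] += 1
    return edges

# Toy corpus, invented for this example.
corpus = {
    "d1": "network community network",
    "d2": "topic model topic",
    "d3": "network topic",
}

edges = bipartite_edges(corpus)

# Node degrees on each side of the bipartite network.
doc_degree = Counter()
word_degree = Counter()
for (doc, word), count in edges.items():
    doc_degree[doc] += count    # document length in tokens
    word_degree[word] += count  # total frequency of the word

print(sorted(edges.items()))
```

In this view a document's degree is its length and a word's degree is its corpus frequency, so a community-detection method applied to the edge list groups documents and words at the same time.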
Identifying overlapping terrorist cells from the Noordin Top actor-event network
Actor-event data are common in sociological settings, whereby one registers
the pattern of attendance of a group of social actors to a number of events. We
focus on 79 members of the Noordin Top terrorist network, who were monitored
attending 45 events. The attendance or non-attendance of the terrorists at
events defines the social fabric, such as group coherence and social
communities. The aim of the analysis of such data is to learn about the
affiliation structure. Actor-event data are often transformed to actor-actor
data in order to be further analysed by network models, such as stochastic
block models. This transformation and such analyses lead to a natural loss of
information, particularly when one is interested in identifying, possibly
overlapping, subgroups or communities of actors on the basis of their
attendance at events. In this paper we propose an actor-event model for
overlapping communities of terrorists, which simplifies interpretation of the
network. We propose a mixture model with overlapping clusters for the analysis
of the binary actor-event network data, called manet, and develop a
Bayesian procedure for inference. After a simulation study, we show how this
analysis of the terrorist network has clear interpretative advantages over the
more traditional approaches of affiliation network analysis.
Comment: 24 pages, 5 figures; related R package (manet) available on CRA
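The loss of information mentioned in this abstract, when actor-event data are transformed to actor-actor data, can be shown with a small sketch. The attendance sets and the function name `project` are invented for illustration; this is not the manet model, only a co-attendance count over pairs of actors.

```python
from collections import Counter
from itertools import combinations

def project(events):
    """Count, for each unordered pair of actors, the events both attended.

    `events` is a list of sets of actor names (hypothetical toy input).
    """
    pairs = Counter()
    for attendees in events:
        for a, b in combinations(sorted(attendees), 2):
            pairs[(a, b)] += 1
    return pairs

# Scenario 1: a single event attended by all three actors.
one_meeting = [{"A", "B", "C"}]

# Scenario 2: three separate events, each attended by one pair.
three_meetings = [{"A", "B"}, {"A", "C"}, {"B", "C"}]

# Both scenarios give every pair exactly one shared event,
# so the actor-actor data cannot tell them apart.
same_projection = project(one_meeting) == project(three_meetings)
print(same_projection)
```

The two attendance patterns describe different group structures (one joint meeting versus three pairwise meetings), yet they collapse to the same actor-actor counts, which is the kind of ambiguity a model working directly on actor-event data avoids.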
Evaluating Overfit and Underfit in Models of Network Community Structure
A common data mining task on networks is community detection, which seeks an
unsupervised decomposition of a network into structural groups based on
statistical regularities in the network's connectivity. Although many methods
exist, the No Free Lunch theorem for community detection implies that each
makes some kind of tradeoff, and no algorithm can be optimal on all inputs.
Thus, different algorithms will overfit or underfit on different inputs, finding
more, fewer, or just different communities than is optimal, and evaluation
methods that use a metadata partition as a ground truth will produce misleading
conclusions about general accuracy. Here, we present a broad evaluation of
overfitting and underfitting in community detection, comparing the behavior of 16
state-of-the-art community detection algorithms on a novel and structurally
diverse corpus of 406 real-world networks. We find that (i) algorithms vary
widely both in the number of communities they find and in their corresponding
composition, given the same input, (ii) algorithms can be clustered into
distinct high-level groups based on similarities of their outputs on real-world
networks, and (iii) these differences induce wide variation in accuracy on link
prediction and link description tasks. We introduce a new diagnostic for
evaluating overfitting and underfitting in practice, and use it to roughly
divide community detection methods into general and specialized learning
algorithms. Across methods and inputs, Bayesian techniques based on the
stochastic block model and a minimum description length approach to
regularization represent the best general learning approach, but can be
outperformed under specific circumstances. These results introduce both a
theoretically principled approach to evaluate overfitting and underfitting in models
of network community structure and a realistic benchmark by which new methods
may be evaluated and compared.
Comment: 22 pages, 13 figures, 3 table
Topological Feature Based Classification
There has been a lot of interest in developing algorithms to extract clusters
or communities from networks. This work proposes a method, based on
blockmodelling, for leveraging communities and other topological features for
use in a predictive classification task. Motivated by the issues faced by the
field of community detection and inspired by recent advances in Bayesian topic
modelling, the presented model automatically discovers topological features
relevant to a given classification task. In this way, rather than attempting to
identify some universal best set of clusters for an undefined goal, the aim is
to find the best set of clusters for a particular purpose.
Using this method, topological features can be validated and assessed within
a given context by their predictive performance.
The proposed model differs from other relational and semi-supervised learning
models as it identifies topological features to explain the classification
decision. In a demonstration on a number of real networks the predictive
capability of the topological features is shown to rival the performance of
content-based relational learners. Additionally, the model is shown to
outperform graph-based semi-supervised methods on directed and approximately
bipartite networks.
Comment: Awarded 3rd Best Student Paper at 14th International Conference on
Information Fusion 201