13 research outputs found
Graphs in machine learning: an introduction
Graphs are commonly used to characterise interactions between objects of
interest. Because they are based on a straightforward formalism, they are used
in many scientific fields from computer science to historical sciences. In this
paper, we give an introduction to some methods relying on graphs for learning.
This includes both unsupervised and supervised methods. Unsupervised learning
algorithms usually aim at visualising graphs in latent spaces and/or clustering
the nodes. Both focus on extracting knowledge from graph topologies. While most
existing techniques are only applicable to static graphs, where edges do not
evolve through time, recent developments have shown that they could be extended
to deal with evolving networks. In a supervised context, one generally aims at
inferring labels or numerical values attached to nodes using both the graph
and, when they are available, node characteristics. Balancing the two sources
of information can be challenging, especially as they can disagree locally or
globally. In both contexts, supervised and unsupervised, data can be
relational (augmented with one or several global graphs) as described above, or
graph valued. In this latter case, each object of interest is given as a full
graph (possibly completed by other characteristics). In this context, natural
tasks include graph clustering (as in producing clusters of graphs rather than
clusters of nodes in a single graph), graph classification, etc.
1 Real networks
One of the first practical studies on graphs can be dated back to the
original work of Moreno [51] in the 1930s. Since then, there has been a growing
interest in graph analysis associated with strong developments in the modelling
and the processing of these data. Graphs are now used in many scientific
fields. In Biology [54, 2, 7], for instance, metabolic networks can describe
pathways of biochemical reactions [41], while in social sciences networks are
used to represent relation ties between actors [66, 56, 36, 34]. Other examples
include power grids [71] and the web [75]. Recently, networks have also been
considered in other areas such as geography [22] and history [59, 39]. In
machine learning, networks are seen as powerful tools to model problems in
order to extract information from data and for prediction purposes. This is the
object of this paper. For more complete surveys, we refer to [28, 62, 49, 45].
In this section, we introduce notations and highlight properties shared by most
real networks. In Section 2, we then consider methods aiming at extracting
information from a unique network. We will particularly focus on clustering
methods where the goal is to find clusters of vertices. Finally, in Section 3,
techniques that take a series of networks into account, where each network i
The random subgraph model for the analysis of an ecclesiastical network in Merovingian Gaul
In the last two decades many random graph models have been proposed to
extract knowledge from networks. Most of them look for communities or, more
generally, clusters of vertices with homogeneous connection profiles. While the
first models focused on networks with binary edges only, extensions now handle
valued networks. Recently, new models were also introduced in
order to characterize connection patterns in networks through mixed
memberships. This work was motivated by the need to analyze a historical
network where a partition of the vertices is given and where edges are typed. A
known partition is seen as a decomposition of a network into subgraphs that we
propose to model using a stochastic model with unknown latent clusters. Each
subgraph has its own mixing vector and sees its vertices associated with the
clusters. The vertices then connect with a probability depending on the
subgraphs only, while the types of edges are assumed to be sampled from the
latent clusters. A variational Bayes expectation-maximization algorithm is
proposed for inference as well as a model selection criterion for the
estimation of the cluster number. Experiments are carried out on simulated data
to assess the approach. The proposed methodology is then applied to an
ecclesiastical network in Merovingian Gaul. An R code, called Rambo,
implementing the inference algorithm is available from the authors upon
request.
Comment: Published at http://dx.doi.org/10.1214/13-AOAS691 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
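The generative story in this abstract can be made concrete with a small sampling sketch. It is only an illustration under stated assumptions, not the authors' Rambo code: edges are taken to be directed, and the names `sample_rsm`, `subgraph_of`, `alpha`, `gamma` and `pi` are hypothetical. Here `alpha[s]` stands for the mixing vector of subgraph `s`, `gamma[s][t]` for the probability that a vertex of subgraph `s` connects to a vertex of subgraph `t`, and `pi[k][l]` for the distribution over edge types between latent clusters `k` and `l`.

```python
import random


def sample_rsm(subgraph_of, alpha, gamma, pi, seed=0):
    """Draw latent clusters and typed edges for one network.

    subgraph_of[i] is the known subgraph of vertex i (the given partition).
    All parameter names are assumptions made for this sketch.
    """
    rng = random.Random(seed)
    n = len(subgraph_of)

    # Each vertex draws its latent cluster from the mixing vector
    # of the subgraph it belongs to.
    z = [
        rng.choices(range(len(alpha[s])), weights=alpha[s])[0]
        for s in subgraph_of
    ]

    edges = {}
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # no self-loops in this sketch
            # Edge presence depends on the subgraphs only.
            if rng.random() < gamma[subgraph_of[i]][subgraph_of[j]]:
                # The edge type is sampled from the latent clusters.
                weights = pi[z[i]][z[j]]
                edges[(i, j)] = rng.choices(range(len(weights)), weights=weights)[0]
    return z, edges
```

Inference runs in the opposite direction: given the observed typed edges and the known partition, the variational Bayes EM algorithm mentioned above estimates the latent clusters and the parameters.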
Efficient method for estimating the number of communities in a network
While there exists a wide range of effective methods for community detection
in networks, most of them require one to know in advance how many communities
one is looking for. Here we present a method for estimating the number of
communities in a network using a combination of Bayesian inference with a novel
prior and an efficient Monte Carlo sampling scheme. We test the method
extensively on both real and computer-generated networks, showing that it
performs accurately and consistently, even in cases where groups are widely
varying in size or structure.
Comment: 13 pages, 4 figures
Strategies for online inference of model-based clustering in large and growing networks
In this paper we adapt online estimation strategies to perform model-based
clustering on large networks. Our work focuses on two algorithms, the first
based on the SAEM algorithm, and the second on variational methods. These two
strategies are compared with existing approaches on simulated and real data. We
use the method to decipher the connexion structure of the political websphere
during the US political campaign in 2008. We show that our online EM-based
algorithms offer a good trade-off between precision and speed, when estimating
parameters for mixture distributions in the context of random graphs.
Comment: Published at http://dx.doi.org/10.1214/10-AOAS359 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
Overlapping stochastic block models with application to the French political blogosphere
Complex systems in nature and in society are often represented as networks,
describing the rich set of interactions between objects of interest. Many
deterministic and probabilistic clustering methods have been developed to
analyze such structures. Given a network, almost all of them partition the
vertices into disjoint clusters, according to their connection profile.
However, recent studies have shown that these techniques were too restrictive
and that most of the existing networks contained overlapping clusters. To
tackle this issue, we present in this paper the Overlapping Stochastic Block
Model. Our approach allows the vertices to belong to multiple clusters, and, to
some extent, generalizes the well-known Stochastic Block Model [Nowicki and
Snijders (2001)]. We show that the model is generically identifiable within
classes of equivalence and we propose an approximate inference procedure, based
on global and local variational techniques. Using toy data sets as well as the
French Political Blogosphere network and the transcriptional network of
Saccharomyces cerevisiae, we compare our work with other approaches.
Comment: Published at http://dx.doi.org/10.1214/10-AOAS382 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
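To illustrate what "belonging to multiple clusters" means in practice, the sketch below samples a network in which each vertex carries a binary membership vector rather than a single label. The parametrization is an assumption chosen for illustration and is not taken from the abstract: memberships are independent Bernoulli draws with probabilities `alpha`, and the probability of an edge from i to j is a logistic function of `bias` plus the bilinear form of the two membership vectors through a hypothetical interaction matrix `W`.

```python
import math
import random


def sample_overlapping(n, alpha, W, bias, seed=0):
    """Sample binary membership vectors and directed edges.

    alpha, W and bias are hypothetical parameters for this sketch.
    """
    rng = random.Random(seed)
    q = len(alpha)

    # A vertex may belong to zero, one or several clusters.
    z = [[1 if rng.random() < alpha[k] else 0 for k in range(q)] for _ in range(n)]

    def edge_prob(zi, zj):
        # Logistic link applied to bias + zi^T W zj (assumed form).
        score = bias + sum(
            zi[k] * W[k][l] * zj[l] for k in range(q) for l in range(q)
        )
        return 1.0 / (1.0 + math.exp(-score))

    edges = set()
    for i in range(n):
        for j in range(n):
            if i != j and rng.random() < edge_prob(z[i], z[j]):
                edges.add((i, j))
    return z, edges
```

When every membership vector contains exactly one 1, the edge probability depends only on the pair of clusters involved, which is the disjoint-partition setting of the standard Stochastic Block Model.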
Enhanced MCL Clustering
The goal of graph clustering is to partition vertices in a large graph into different clusters
based on various criteria such as vertex connectivity or neighborhood similarity. Graph
clustering techniques are very useful for detecting densely connected groups in a large graph. In
this research, we introduce a clustering algorithm for graphs; this algorithm is based on Markov
clustering (MCL), which is a clustering method that uses a simulation of stochastic flow. We
tune the algorithm by setting suitable inflation, matrix and threshold factors. Theoretical analysis is
provided to show that the enhanced EMCL-Cluster converges. Then the proposed method is
compared with other clustering methods
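For readers unfamiliar with MCL, the following is a minimal sketch of the plain algorithm on a dense adjacency matrix, not of the enhanced EMCL variant this abstract proposes. It alternates expansion (squaring the column-stochastic matrix, which spreads flow along paths) and inflation (raising entries to a power and renormalizing, which strengthens strong flows). The function names and default values are assumptions for this sketch; `inflation` and `threshold` play the role of the tuning factors mentioned above.

```python
def normalize_columns(m):
    """Rescale each column so it sums to 1 (all-zero columns stay zero)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [
        [m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def mcl(adjacency, inflation=2.0, threshold=1e-6, max_iter=100):
    """Cluster the vertices of a graph given as a square adjacency matrix."""
    n = len(adjacency)
    # Add self-loops, then turn the matrix into a column-stochastic one.
    m = [
        [float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    m = normalize_columns(m)

    for _ in range(max_iter):
        # Expansion: one step of flow simulation (matrix squaring).
        expanded = [
            [sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
        # Inflation with pruning: drop tiny entries, raise the rest
        # to the inflation power, then renormalize the columns.
        inflated = [
            [v ** inflation if v > threshold else 0.0 for v in row]
            for row in expanded
        ]
        new_m = normalize_columns(inflated)
        delta = max(abs(new_m[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = new_m
        if delta < 1e-9:
            break

    # Each nonzero row marks an attractor; the columns with mass in
    # that row form its cluster.
    clusters = {
        frozenset(j for j in range(n) if m[i][j] > threshold) for i in range(n)
    }
    clusters.discard(frozenset())
    return sorted(sorted(c) for c in clusters)
```

For example, calling `mcl` on the adjacency matrix of two disconnected triangles returns the two triangles as separate clusters. A higher `inflation` value generally yields finer clusters.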