16 research outputs found
Covariance and Correlation Kernels on a Graph in the Generalized Bag-of-Paths Formalism
This work derives closed-form expressions for the expectation of the
co-presence, and of the number of co-occurrences, of nodes on paths sampled
from a network according to general path weights (a bag of paths). The
underlying idea is that two nodes are considered similar when they often
appear together on (preferably short) paths of the network. The different
expressions are obtained for both regular and hitting paths and serve as a
basis for computing new covariance and correlation measures between nodes,
which are valid positive semi-definite kernels on a graph. Experiments on
semi-supervised classification problems show that the introduced similarity
measures provide competitive results compared to other state-of-the-art
distance and similarity measures between nodes.
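As a rough illustration only, the quantities behind such bag-of-paths measures are built from a fundamental matrix over weighted paths. The sketch below assumes the conventional choices of edge costs c_ij = 1/a_ij and an inverse-temperature parameter theta; it computes that matrix, not the paper's exact covariance and correlation closed forms, which are derived on top of it.

```python
import numpy as np

def bag_of_paths_Z(A, theta=1.0):
    """Fundamental matrix Z = (I - W)^{-1} with W = P_ref * exp(-theta * C).

    A: nonnegative adjacency matrix; edge costs are taken as c_ij = 1/a_ij
    (an illustrative convention). Entry z_ij aggregates the weights of all
    paths from i to j, favoring short, low-cost paths as theta grows.
    """
    A = np.asarray(A, dtype=float)
    P_ref = A / A.sum(axis=1, keepdims=True)          # natural random walk
    C = np.where(A > 0, 1.0 / np.where(A > 0, A, 1.0), 0.0)
    W = P_ref * np.exp(-theta * C)                    # path-weight matrix
    return np.linalg.inv(np.eye(len(A)) - W)          # sums W^t over all t
```

Because z_ij sums over all paths from i to j, node pairs connected by many short paths receive larger values, which is exactly the co-occurrence intuition stated above.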
Sparse Randomized Shortest Paths Routing with Tsallis Divergence Regularization
This work elaborates on the important problem of (1) designing optimal
randomized routing policies for reaching a target node t from a source node s
on a weighted directed graph G and (2) defining distance measures between nodes
interpolating between the least cost (based on optimal movements) and the
commute-cost (based on a random walk on G), depending on a temperature
parameter T. To this end, the randomized shortest path formalism (RSP,
[2,99,124]) is rephrased in terms of Tsallis divergence regularization, instead
of Kullback-Leibler divergence. The main consequence of this change is that the
resulting routing policy (local transition probabilities) becomes sparser when
T decreases, therefore inducing a sparse random walk on G converging to the
least-cost directed acyclic graph when T tends to 0. Experimental comparisons
on node clustering and semi-supervised classification tasks show that the
derived dissimilarity measures based on expected routing costs provide
state-of-the-art results. The sparse RSP is therefore a promising model of
movements on a graph, balancing sparse exploitation and exploration in an
optimal way.
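One way to see why Tsallis regularization sparsifies the policy: with the quadratic (q = 2) Tsallis divergence, minimizing a linear cost plus the regularizer over the probability simplex yields the sparsemax projection (Martins & Astudillo, 2016) rather than the dense softmax obtained under Kullback-Leibler regularization. A minimal sketch of that projection — illustrative only, since the paper's full RSP recursion is more involved:

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex.

    Unlike softmax, the result can put exactly zero mass on low-scoring
    entries, which is what makes the Tsallis-regularized routing policy
    sparse at low temperatures.
    """
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]                       # descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = z_sorted + 1.0 / k > cumsum / k         # entries kept nonzero
    k_max = k[support][-1]
    tau = (cumsum[k_max - 1] - 1.0) / k_max           # shared threshold
    return np.maximum(z - tau, 0.0)
```

For well-separated scores the projection collapses onto the best option (a shortest-path-like policy); for close scores it spreads mass over several options, mirroring the interpolation described above.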
Randomized Shortest Paths with Net Flows and Capacity Constraints
This work extends the randomized shortest paths (RSP) model by investigating
the net flow RSP and adding capacity constraints on edge flows. The standard
RSP is a model of movement, or spread, through a network interpolating between
a random-walk and a shortest-path behavior [30, 42, 49]. The framework assumes
a unit flow injected into a source node and collected from a target node with
flows minimizing the expected transportation cost, together with a relative
entropy regularization term. In this context, the present work first develops
the net flow RSP model considering that edge flows in opposite directions
neutralize each other (as in electric networks), and proposes an algorithm for
computing the expected routing costs between all pairs of nodes. This quantity
is called the net flow RSP dissimilarity measure between nodes. Experimental
comparisons on node clustering tasks indicate that the net flow RSP
dissimilarity is competitive with other state-of-the-art dissimilarities. In
the second part of the paper, it is shown how to introduce capacity constraints
on edge flows, and a procedure is developed to solve this constrained problem
by exploiting Lagrangian duality. These two extensions should significantly
broaden the scope of applications of the RSP framework.
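The net flow idea can be stated compactly: if f_ij denotes the expected flow on edge (i, j), flows in opposite directions cancel and only the surplus is charged. A minimal sketch of this cancellation and of the resulting expected net cost (matrix names are illustrative):

```python
import numpy as np

def net_flows(F):
    """Cancel flows in opposite directions (as in electric networks):
    the net flow on edge (i, j) is max(f_ij - f_ji, 0)."""
    F = np.asarray(F, dtype=float)
    return np.maximum(F - F.T, 0.0)

def expected_net_cost(F, C):
    """Expected routing cost charged on net flows only:
    sum over (i, j) of c_ij * f_net_ij."""
    return float((net_flows(F) * np.asarray(C, dtype=float)).sum())
```

Back-and-forth movements along an edge thus contribute nothing to the cost, which is what distinguishes the net flow RSP dissimilarity from the standard RSP expected cost.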
Essays on network data analysis through the bag-of-paths framework
Since the rapid growth of the Internet and the advent of social networks in the 2000s, the amount of available network data has been increasing quickly, leading to the development of new network analysis methods. Nowadays, these network analysis methods have spread to various fields, including, among others, marketing, supply chain, finance, and biology, as essential analysis and prediction tools. This thesis focuses on the development of one of these methods, called the bag-of-paths framework. The specificity of this framework is to define a family of dissimilarity measures between nodes of the network that interpolate between an optimal exploitation of the graph structure (optimal behavior - shortest-path distance) and a random exploration of the graph (random behavior - commute-time distance) via a parameter that controls the desired degree of randomness/exploration. Throughout this thesis, we propose several theoretical and practical extensions of the bag-of-paths framework. Regarding the theoretical contributions, we incorporate into the bag-of-paths framework capacity constraints on edges, marginal constraints on input and output flows, and a Poisson distribution weighting and constraining path lengths. Furthermore, we demonstrate the applicability of this framework through graph-based semi-supervised classification tasks and a real-life fraud detection case. (ECGE - Sciences économiques et de gestion) -- UCL, 202
Community detection in networks by soft modularity maximization: A new approach and empirical comparisons
Community detection in networks is one of the major fundamentals of the science of networks. This emerging discipline, part of the computing sciences, studies network data and, especially, analyzes the links and interconnections within these networks. Nevertheless, it did not attract significant interest until the rapid growth of the Internet in the early 2000s, as it became more and more popular and extended to diverse scientific areas such as physics, biology, ecology, marketing, etc. In general terms, a graph is a mathematical object composed of elements called "nodes", which can be connected two by two by an edge if there is any relation between them. As the science of networks spreads to more and more sectors, we can find networks in a growing number of contexts. Among these networks, one of the best known is the World Wide Web, within which web pages are interconnected by hyperlinks. Another, more recent example of a network is Facebook, the well-known social medium through which people connect with each other on the basis of friendships or any other characteristics they are likely to share. The aim of this thesis is to examine a characteristic feature of any network: community structure, and in particular the detection of these communities using clustering methods. Clustering groups nodes into communities according to their similarities or differences, without knowing beforehand the class labels underlying the graph. Thus, clustering algorithms generally produce a partition assigning every node to one of the various communities. In this classic vision of clustering, every node is thus assigned to a single community. Nevertheless, this view was recently somewhat contradicted by the appearance of the concept of fuzzy communities, in which a node may belong to more than one community at a time. In this concept, the communities may overlap and the community structure of the graph becomes more complex to analyze.
That is why we introduce in this thesis two new clustering algorithms allowing us to find a fuzzy partition of communities in a network. These new algorithms are based on a measure of closeness called modularity, introduced by the physicist M. E. J. Newman, which we modified to obtain a fuzzy version that allows us to meet new expectations in terms of community detection. The purpose of this thesis is to study the performance of our two new algorithms regarding community detection by comparing them with other clustering methods which are already well established in the science of networks. To direct our study, we posed two research questions:
• Are the entropy-based soft modularity and the deterministic annealing entropy-based soft modularity algorithms competitive compared to the kernel k-means algorithms whenever we use the natural numbers of clusters?
• Are the entropy-based soft modularity and the deterministic annealing entropy-based soft modularity algorithms competitive compared to the kernel k-means algorithms and the Louvain method whenever the number of clusters has not been determined in advance?
To answer these questions, we conduct two different experiments. In the first one, we compare our algorithms with four kernel k-means methods: the Sigmoid Commute Time, the Sigmoid Corrected Commute Time, the Log Forest and the Free Energy kernels, using the natural number of clusters for each dataset. In the second experiment, we once again compare our algorithms to the four kernel k-means methods, but also to the Louvain method; in this case, we do not determine the number of clusters beforehand and thus have to define it empirically for each dataset and each algorithm, except for the Louvain method which, by itself, returns a certain number of clusters.
Master [120] en Ingénieur de gestion, Université catholique de Louvain, 201
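For reference, the hard-assignment modularity that the soft variants relax is Newman's Q = (1/2m) * sum over (i, j) of (a_ij - k_i k_j / 2m) * delta(c_i, c_j); the fuzzy versions replace the indicator delta with products of membership degrees. A minimal sketch of the hard version:

```python
import numpy as np

def modularity(A, labels):
    """Newman's modularity: fraction of edges falling within communities,
    minus the fraction expected under a random degree-preserving rewiring."""
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    k = A.sum(axis=1)                                 # node degrees
    two_m = k.sum()                                   # 2m = total degree
    same = labels[:, None] == labels[None, :]         # delta(c_i, c_j)
    return float(((A - np.outer(k, k) / two_m) * same).sum() / two_m)
```

On two disconnected triangles with the natural two-community labeling this gives Q = 0.5, the maximum achievable for that graph, which is the kind of reference value the comparisons above are measured against.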
A simple extension of the bag-of-paths model weighting path lengths by a Poisson distribution
This work extends the bag-of-paths model by introducing a weighting of the length of the paths in the network, provided by a Poisson probability distribution. The main advantage of this approach is that it allows the mean path length parameter to be tuned to the value most relevant for the application at hand. Various quantities of interest, such as the probability of drawing a path from the bag of paths, or the joint probability of sampling any path connecting two nodes of interest, can easily be computed in closed form from this model. In this context, a new distance measure between nodes of a network, considering a weighting factor on the length of the paths, is defined. Experiments on semi-supervised classification tasks show that the introduced distance measure provides competitive results compared to other state-of-the-art methods. Moreover, a new interpretation of the logarithmic communicability similarity measure is proposed in terms of the new model.
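A small illustration of the closed forms such a Poisson weighting admits: weighting the walk counts of the adjacency matrix A by Poisson(t; lambda) gives sum over t of (e^-lambda * lambda^t / t!) * A^t = e^-lambda * exp(lambda * A), a rescaled matrix exponential, which is what connects this kind of model to (logarithmic) communicability. The sketch below uses raw walk counts rather than the paper's path probabilities, so it is an analogy, not the paper's measure:

```python
import numpy as np

def poisson_walk_kernel(A, lam=1.0, t_max=60):
    """Poisson-weighted sum of walk counts, computed by a truncated series.

    K = sum_t Poisson(t; lam) * A^t; in closed form this equals
    exp(-lam) * expm(lam * A). The parameter lam sets the mean walk
    length that the kernel emphasizes.
    """
    A = np.asarray(A, dtype=float)
    term = np.exp(-lam) * np.eye(len(A))   # t = 0 term: Poisson(0; lam) * I
    K = term.copy()
    for t in range(1, t_max):
        term = term @ A * (lam / t)        # next term via lam * A / t
        K += term
    return K
```

As lam tends to 0 only the length-zero term survives (K approaches the identity), while larger lam shifts weight toward longer walks.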
A Simple Extension of the Bag-of-Paths Model Weighting Path Lengths by a Poisson Distribution
This work extends the bag-of-paths model by introducing a weighting of the length of the paths in the network, provided by a Poisson probability distribution. The main advantage of this approach is that it allows the mean path length parameter to be tuned to the value most relevant for the application at hand. Various quantities of interest, such as the probability of drawing a path from the bag of paths, or the joint probability of sampling any path connecting two nodes of interest, can easily be computed in closed form from this model. In this context, a new distance measure between nodes of a network, considering a weighting factor on the length of the paths, is defined. Experiments on semi-supervised classification tasks show that the introduced distance measure provides competitive results compared to other state-of-the-art methods. Moreover, a new interpretation of the logarithmic communicability similarity measure is proposed in terms of the new model.
Covariance and correlation measures on a graph in a generalized bag-of-paths formalism
This work derives closed-form expressions for the expectation of the co-presence, and of the number of co-occurrences, of nodes on paths sampled from a network according to general path weights (a bag of paths). The underlying idea is that two nodes are considered similar when they often appear together on (preferably short) paths of the network. The different expressions are obtained for both regular and hitting paths and serve as a basis for computing new covariance and correlation measures between nodes, which are valid positive semi-definite kernels on a graph. Experiments on semi-supervised classification problems show that the introduced similarity measures provide competitive results compared to other state-of-the-art distance and similarity measures between nodes.
Design of Biased Random Walks on a Graph with Application to Collaborative Recommendation
This work investigates a paths-based statistical physics formalism for the design of random walks on a graph in which the transition probabilities (the policy) are optimally biased in favor of some node features. More precisely, given a weighted directed graph and a nonnegative cost assigned to each edge, the biased random walk is defined as the policy minimizing the expected cost rate along the walks while maintaining a constant relative entropy rate. The model is formulated by assigning a Gibbs-Boltzmann distribution to the set of infinite walks and makes it possible to recover some known results from the literature, derived from a different perspective. Examples of quantities of interest are the partition function of the system, the optimal transition probabilities, the cost rate, etc. In addition, the same formalism allows the introduction of capacity constraints on the expected visit rates to the nodes, and an algorithm for computing the optimal policy subject to capacity constraints is developed. Simulation results indicate that the proposed procedure can effectively be used to define a Markov chain driving the walk towards nodes having some specific properties, like seniority, education level or low node degree (hub-avoiding walk). An application relying on this last property is proposed as a tool for improving serendipity in collaborative recommendation, and is tested on the MovieLens data.
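The basic biasing mechanism can be sketched as a locally normalized Gibbs-Boltzmann rule, p_ij proportional to a_ij * exp(-theta * c_ij). This local softmax is a simplification — the paper derives the globally optimal policy from a partition function over infinite walks — but it shows the effect of the inverse temperature theta: theta = 0 recovers the unbiased walk, and larger theta concentrates transitions on low-cost edges. Choosing c_ij = log(degree of j), for instance, yields a hub-avoiding bias of the kind used above for serendipity.

```python
import numpy as np

def biased_policy(A, C, theta=1.0):
    """Locally normalized biased transition probabilities:
    p_ij proportional to a_ij * exp(-theta * c_ij).

    A: nonnegative adjacency matrix (every node must have an out-edge);
    C: edge cost matrix encoding the feature to bias against;
    theta: bias strength (theta = 0 gives the natural random walk).
    """
    A = np.asarray(A, dtype=float)
    W = A * np.exp(-theta * np.asarray(C, dtype=float))
    return W / W.sum(axis=1, keepdims=True)           # renormalize each row
```

A hub-avoiding walk then follows from costs that grow with the target node's degree, steering the chain toward low-degree (less popular) items.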
Randomized Shortest Paths with Net Flows and Capacity Constraints
This work extends the randomized shortest paths (RSP) model by investigating the net flow RSP and adding capacity constraints on edge flows. The standard RSP is a model of movement, or spread, through a network interpolating between a random-walk and a shortest-path behavior. This framework assumes a unit flow injected into a source node and collected from a target node, with flows minimizing the expected transportation cost together with a relative entropy regularization term. In this context, the present work first develops the net flow RSP model, considering that edge flows in opposite directions neutralize each other (as in electrical networks), and proposes an algorithm for computing the expected routing costs between all pairs of nodes. This quantity is called the net flow RSP dissimilarity measure between nodes. Experimental comparisons on node clustering tasks show that the net flow RSP dissimilarity is competitive with other state-of-the-art techniques. In the second part of the paper, it is shown how to introduce capacity constraints on edge flows, and a procedure solving this constrained problem using Lagrangian duality is developed. These two extensions significantly broaden the scope of applications of the RSP framework.