Synthetic Test Data Generation for Hierarchical Graph Clustering Methods
Recent achievements in graph-based clustering algorithms
revealed the need for large-scale test data sets. This paper introduces a
procedure that can provide synthetic but realistic test data to the hierarchical
Markov clustering algorithm. Being created according to the
structure and properties of the SCOP95 protein sequence data set, the
synthetic data act as a collection of proteins organized in a four-level
hierarchy and a similarity matrix containing pairwise similarity values
of the proteins. An ultimate high-speed TRIBE-MCL algorithm was employed
to validate the synthetic data. Generated data sets have a healthy
amount of variability due to the randomness in the processing, and are
suitable for testing graph-based clustering algorithms on large-scale data.
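
To make the shape of such test data concrete, here is a minimal sketch of a labeled four-level hierarchy with a noisy pairwise similarity matrix. It is not the procedure from the paper: the branching factors, the number of members per leaf group, the "shared depth divided by four" similarity rule, and the uniform noise are all assumptions chosen for illustration, and the names `make_items`, `shared_depth`, and `similarity_matrix` are hypothetical.

```python
import random


def make_items(branching=(2, 2, 2, 2), members=3):
    """Return one 4-level label tuple per synthetic item (assumed layout)."""
    paths = [()]
    for width in branching:
        paths = [p + (c,) for p in paths for c in range(width)]
    # Each leaf group receives `members` items that share the full path.
    return [p for p in paths for _ in range(members)]


def shared_depth(a, b):
    """Number of leading hierarchy levels two label tuples have in common."""
    depth = 0
    for x, y in zip(a, b):
        if x != y:
            break
        depth += 1
    return depth


def similarity_matrix(items, rng, noise=0.05):
    """Symmetric matrix: shared depth / levels, plus noise, clamped to [0, 1]."""
    n = len(items)
    levels = len(items[0])
    sim = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            base = shared_depth(items[i], items[j]) / levels
            value = min(1.0, max(0.0, base + rng.uniform(-noise, noise)))
            sim[i][j] = sim[j][i] = value
    return sim


rng = random.Random(0)  # fixed seed so the sketch is reproducible
items = make_items()
sim = similarity_matrix(items, rng)
```

With the default arguments this yields 48 items in 16 leaf groups. Items in the same leaf group get similarities near 1, and items that differ at the top level get similarities near 0.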
Analysis of Network Clustering Algorithms and Cluster Quality Metrics at Scale
Notions of community quality underlie network clustering. While studies
surrounding network clustering are increasingly common, the relationship
between different cluster quality metrics is not precisely understood. In
this paper, we examine the relationship between stand-alone cluster quality
metrics and information recovery metrics through a rigorous analysis of four
widely-used network clustering algorithms -- Louvain, Infomap, label
propagation, and smart local moving. We consider the stand-alone quality
metrics of modularity, conductance, and coverage, and we consider the
information recovery metrics of adjusted Rand score, normalized mutual
information, and a variant of normalized mutual information used in previous
work. Our study includes both synthetic graphs and empirical data sets of sizes
varying from 1,000 to 1,000,000 nodes.
We find significant differences among the results of the different cluster
quality metrics. For example, clustering algorithms can return a value of 0.4
out of 1 on modularity but score 0 out of 1 on information recovery. We find
conductance, though imperfect, to be the stand-alone quality metric that best
indicates performance on information recovery metrics. Our study shows that the
variant of normalized mutual information used in previous work cannot be
assumed to differ only slightly from traditional normalized mutual information.
Smart local moving is the best performing algorithm in our study, but
discrepancies between cluster evaluation metrics prevent us from declaring it
absolutely superior. Louvain performed better than Infomap in nearly all the
tests in our study, contradicting the results of previous work in which Infomap
was superior to Louvain. We find that although label propagation performs
poorly when clusters are less clearly defined, it scales efficiently and
accurately to large graphs with well-defined clusters.
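
For readers unfamiliar with the three stand-alone metrics named in this abstract, the following minimal sketch computes them for a partition of a small undirected, unweighted graph. It uses the standard textbook definitions (Newman modularity, coverage as the fraction of intra-cluster edges, and per-cluster conductance as cut size over the smaller side's volume). The study may use different variants or aggregate conductance differently, so the function `partition_metrics` and the toy graph are assumptions for illustration only.

```python
from collections import defaultdict


def partition_metrics(edges, labels):
    """Return (modularity, coverage, per-cluster conductance) for a partition.

    edges  : list of (u, v) pairs, undirected, no duplicates or self-loops
    labels : dict mapping every node to its cluster id
    """
    m = len(edges)
    degree = defaultdict(int)
    intra = defaultdict(int)  # edges with both endpoints in the cluster
    cut = defaultdict(int)    # edges leaving the cluster
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if labels[u] == labels[v]:
            intra[labels[u]] += 1
        else:
            cut[labels[u]] += 1
            cut[labels[v]] += 1

    volume = defaultdict(int)  # sum of degrees inside each cluster
    for node, d in degree.items():
        volume[labels[node]] += d

    clusters = set(labels.values())
    coverage = sum(intra.values()) / m
    modularity = sum(
        intra[c] / m - (volume[c] / (2 * m)) ** 2 for c in clusters
    )
    conductance = {}
    for c in clusters:
        denom = min(volume[c], 2 * m - volume[c])
        conductance[c] = cut[c] / denom if denom else 0.0
    return modularity, coverage, conductance


# Toy graph: two triangles joined by a single bridge edge (2, 3).
EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
LABELS = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
modularity, coverage, conductance = partition_metrics(EDGES, LABELS)
```

On this toy graph the partition into the two triangles gives modularity 5/14, coverage 6/7, and conductance 1/7 for each cluster. Lower conductance indicates a better-separated cluster, whereas higher modularity and coverage indicate a better partition.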
HIERARCHICAL CLUSTERING USING LEVEL SETS
Over the past several decades, clustering algorithms have earned their place as a go-to solution for database mining. This paper introduces a new concept which is used to develop a new recursive version of DBSCAN that can successfully perform hierarchical clustering, called Level-Set Clustering (LSC). A level-set is a subset of points of a data-set whose densities are greater than some threshold, ‘t’. By graphing the size of each level-set against its respective ‘t,’ indents are produced in the line graph which correspond to clusters in the data-set, as the points in a cluster have very similar densities. This new algorithm is able to produce the clustering result with the same O(n log n) time complexity as DBSCAN and OPTICS, while catching clusters the others missed.
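
The level-set idea described in this abstract can be sketched as follows. This is not the LSC algorithm itself: it only computes the size of each level-set for integer thresholds, using a naive O(n²) neighbor count within a radius `eps` as the density estimate. The density definition, the `eps` value, and the function names `densities` and `level_set_sizes` are assumptions made for illustration.

```python
import math


def densities(points, eps):
    """Density of each point = number of other points within distance eps."""
    return [
        sum(1 for q in points if math.dist(p, q) <= eps) - 1  # exclude self
        for p in points
    ]


def level_set_sizes(dens):
    """For each integer threshold t, count points whose density exceeds t."""
    return [(t, sum(1 for d in dens if d > t)) for t in range(max(dens) + 1)]


# Toy data: a dense group of four points, a sparse pair, and one outlier.
POINTS = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 10.0), (10.0, 11.0),
    (20.0, 20.0),
]
dens = densities(POINTS, eps=1.5)
curve = level_set_sizes(dens)
```

Here `dens` is `[3, 3, 3, 3, 1, 1, 0]` and `curve` is `[(0, 6), (1, 4), (2, 4), (3, 0)]`. The size stays at 4 for t = 1 and t = 2 because the four points of the dense group share the same density, which is the kind of flat stretch in the size-versus-threshold graph that the abstract associates with a cluster.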
Generating realistic scaled complex networks
Research on generative models is a central project in the emerging field of
network science, and it studies how statistical patterns found in real networks
could be generated by formal rules. Output from these generative models is then
the basis for designing and evaluating computational methods on networks, and
for verification and simulation studies. During the last two decades, a variety
of models has been proposed with an ultimate goal of achieving comprehensive
realism for the generated networks. In this study, we (a) introduce a new
generator, termed ReCoN; (b) explore how ReCoN and some existing models can be
fitted to an original network to produce a structurally similar replica; (c)
use ReCoN to produce networks much larger than the original exemplar; and
finally (d) discuss open problems and promising research directions. In a
comparative experimental study, we find that ReCoN is often superior to many
other state-of-the-art network generation methods. We argue that ReCoN is a
scalable and effective tool for modeling a given network while preserving
important properties at both micro- and macroscopic scales, and for scaling the
exemplar data by orders of magnitude in size.
Comment: 26 pages, 13 figures, extended version, a preliminary version of the
paper was presented at the 5th International Workshop on Complex Networks and
their Application
Guided Machine Learning for power grid segmentation
The segmentation of large scale power grids into zones is crucial for control
room operators managing grid complexity in near real time. In this paper
we propose a new two-step method that performs this segmentation
automatically while taking the real-time context into account, in order to help
operators handle shifting dynamics. Our method relies on a "guided" machine learning
approach. As a first step, we define and compute a task-specific "Influence
Graph" in a guided manner: on a given grid state, we simulate chosen
interventions representative of our task of interest (managing active power
flows in our case). For visualization and interpretation, we then build a
higher-level representation of the grid relevant to this task by applying the graph
community detection algorithm Infomap to this Influence Graph. To
illustrate our method and demonstrate its practical interest, we apply it to
commonly used systems, the IEEE-14 and IEEE-118. We show promising, original,
and interpretable results, especially on the RTS-96 system, which has been well
studied for grid segmentation. Finally, we share an initial investigation and
results on a large-scale system, the French power grid, whose segmentation
showed a surprising resemblance to RTE's historical partitioning.