Median Graph Shift: A New Clustering Algorithm for Graph Domain
ISSN: 1051-4651. Print ISBN: 978-1-4244-7542-1. In the context of unsupervised clustering, a new algorithm for the domain of graphs is introduced. The key idea of this paper is to adapt mean-shift clustering and its variants, originally proposed for the domain of feature vectors, to graph clustering. These algorithms have been applied successfully in image analysis and computer vision. The proposed algorithm works iteratively by shifting each graph towards the median graph in a neighborhood. Both the set median graph and the generalized median graph are tested for the shifting procedure. In the experimental part, a set of cluster validation indices is used to evaluate the clustering algorithm, and a comparison with the well-known k-means algorithm is provided.
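The shifting idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a precomputed symmetric distance matrix between graphs (for example a graph edit distance), uses only the set median (the member of a neighborhood with the smallest summed distance to the other members), and the names `set_median`, `median_shift` and `radius` are hypothetical.

```python
def set_median(indices, dist):
    """Return the index in `indices` with the smallest summed distance to the others."""
    return min(indices, key=lambda i: sum(dist[i][j] for j in indices))


def median_shift(dist, radius, max_iter=100):
    """Shift every graph towards the set median of its neighborhood.

    dist   -- assumed precomputed n x n matrix of graph distances
    radius -- hypothetical neighborhood size (plays the role of a bandwidth)
    Returns, for each graph, the index of the graph it converged to.
    """
    n = len(dist)
    modes = list(range(n))  # each graph starts as its own representative
    for _ in range(max_iter):
        new_modes = []
        for m in modes:
            # neighborhood of the current representative
            neighborhood = [j for j in range(n) if dist[m][j] <= radius]
            new_modes.append(set_median(neighborhood, dist))
        if new_modes == modes:  # no representative moved: stop
            break
        modes = new_modes
    return modes
```

Graphs that end on the same representative index can be read as one cluster. A generalized median graph would require constructing a new graph rather than selecting an existing one, which this sketch does not attempt.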
Comparison and validation of community structures in complex networks
The issue of partitioning a network into communities has attracted a great
deal of attention recently. Most authors seem to equate this issue with the one
of finding the maximum value of the modularity, as defined by Newman. Since the
problem formulated this way is NP-hard, most effort has gone into the
construction of search algorithms, and less to the question of other measures
of community structures, similarities between various partitionings and the
validation with respect to external information. Here we concentrate on a class
of computer-generated networks and on three well-studied real networks which
constitute a benchmark for network studies: the karate club, the US college
football teams and a gene network of yeast. We utilize some standard ways of
clustering data (originally not designed for finding community structures in
networks) and show that these classical methods sometimes outperform the newer
ones. We discuss various measures of the strength of the modular structure, and
show by examples features and drawbacks. Further, we compare different
partitions by applying some graph-theoretic concepts of distance, which
indicate that one of the quality measures of the degree of modularity
corresponds quite well with the distance from the true partition. Finally, we
introduce a way to validate the partitionings with respect to external data
when the nodes are classified but the network structure is unknown. This is
possible here since we know everything about the computer-generated networks, as
well as the historical answer to how the karate club and the football teams are
partitioned in reality. The partitioning of the gene network is validated by
use of the Gene Ontology database, where we show that a community in general
corresponds to a biological process. Comment: To appear in Physica A; 25 pages
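For reference, the modularity that this abstract attributes to Newman can be computed for an undirected, unweighted graph as the sum over communities of (fraction of edges inside the community) minus (fraction of edge endpoints in the community) squared. The sketch below assumes that standard form; the function name and the input layout (an edge list plus a node-to-community mapping) are illustrative choices, not taken from the paper.

```python
def modularity(edges, community):
    """Newman modularity of a partition of an undirected, unweighted graph.

    edges     -- list of (u, v) pairs, each undirected edge listed once
    community -- dict mapping each node to its community label
    """
    m = len(edges)
    intra = {}   # number of edges with both endpoints in a community
    degree = {}  # summed node degree per community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    return sum(intra.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())
```

For two triangles joined by a single edge, splitting the nodes by triangle gives a modularity of 5/14, while putting all nodes in one community gives 0.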
Exploiting citation networks for large-scale author name disambiguation
We present a novel algorithm and validation method for disambiguating author
names in very large bibliographic data sets and apply it to the full Web of
Science (WoS) citation index. Our algorithm relies only upon the author and
citation graphs available for the whole period covered by the WoS. A pair-wise
publication similarity metric, which is based on common co-authors,
self-citations, shared references and citations, is established to perform a
two-step agglomerative clustering that first connects individual papers and
then merges similar clusters. This parameterized model is optimized using an
h-index based recall measure, favoring the correct assignment of well-cited
publications, and a name-initials-based precision using WoS metadata and
cross-referenced Google Scholar profiles. Despite the use of limited metadata,
we reach a recall of 87% and a precision of 88% with a preference for
researchers with high h-index values. 47 million articles of WoS can be
disambiguated on a single machine in less than a day. We develop an h-index
distribution model, confirming that the prediction is in excellent agreement
with the empirical data, and yielding insight into the utility of the h-index
in real academic ranking scenarios. Comment: 14 pages, 5 figures
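The first of the two clustering steps, connecting individual papers whose pair-wise similarity is high enough, can be sketched as follows. This is only an illustration under strong assumptions: the score below simply counts shared co-authors and shared references, whereas the paper's metric also uses self-citations and citations and is parameterized; `pair_score`, `link_papers`, the dictionary keys and `threshold` are all hypothetical.

```python
def pair_score(a, b):
    """Hypothetical similarity: shared co-authors plus shared references."""
    return len(a["coauthors"] & b["coauthors"]) + len(a["refs"] & b["refs"])


def link_papers(papers, threshold):
    """Group papers into connected components of the 'similar enough' relation.

    Returns one cluster label per paper (the index of a representative paper).
    """
    parent = list(range(len(papers)))

    def find(i):
        # union-find lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(papers)):
        for j in range(i + 1, len(papers)):
            if pair_score(papers[i], papers[j]) >= threshold:
                parent[find(j)] = find(i)  # merge the two components
    return [find(i) for i in range(len(papers))]
```

The all-pairs loop is quadratic, so at the scale described in the abstract one would only compare papers sharing the same author name; the second step, merging similar clusters, is not shown.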
Systematic Analysis of Cluster Similarity Indices: How to Validate Validation Measures
Many cluster similarity indices are used to evaluate clustering algorithms,
and choosing the best one for a particular task remains an open problem. We
demonstrate that this problem is crucial: there are many disagreements among
the indices, these disagreements do affect which algorithms are preferred in
applications, and this can lead to degraded performance in real-world systems.
We propose a theoretical framework to tackle this problem: we develop a list of
desirable properties and conduct an extensive theoretical analysis to verify
which indices satisfy them. This allows for making an informed choice: given a
particular application, one can first select properties that are desirable for
the task and then identify indices satisfying these. Our work unifies and
considerably extends existing attempts at analyzing cluster similarity indices:
we introduce new properties, formalize existing ones, and mathematically prove
or disprove each property for an extensive list of validation indices. This
broader and more rigorous approach leads to recommendations that considerably
differ from how validation indices are currently being chosen by practitioners.
Some of the most popular indices are even shown to be dominated by previously
overlooked ones.
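To make the notion of a cluster similarity index concrete, the sketch below computes two classical pair-counting indices, the Rand index and the Jaccard index, for two labelings of the same items. These are chosen here only as familiar examples; the abstract does not say which indices the paper recommends, and the function names are illustrative.

```python
from itertools import combinations


def pair_counts(a, b):
    """Count item pairs by whether they share a cluster in labelings a and b."""
    n11 = n10 = n01 = n00 = 0
    for i, j in combinations(range(len(a)), 2):
        same_a, same_b = a[i] == a[j], b[i] == b[j]
        if same_a and same_b:
            n11 += 1
        elif same_a:
            n10 += 1
        elif same_b:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00


def rand_index(a, b):
    """Fraction of pairs on which the two labelings agree."""
    n11, n10, n01, n00 = pair_counts(a, b)
    return (n11 + n00) / (n11 + n10 + n01 + n00)


def jaccard_index(a, b):
    """Agreeing co-clustered pairs over pairs co-clustered in either labeling."""
    n11, n10, n01, _ = pair_counts(a, b)
    return n11 / (n11 + n10 + n01)
```

On the labelings `[0, 0, 1, 1]` and `[0, 0, 1, 2]` the Rand index is 5/6 and the Jaccard index is 1/2, a small example of two indices scoring the same pair of partitions differently.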