Exploiting citation networks for large-scale author name disambiguation
We present a novel algorithm and validation method for disambiguating author
names in very large bibliographic data sets and apply it to the full Web of
Science (WoS) citation index. Our algorithm relies only upon the author and
citation graphs available for the whole period covered by the WoS. A pair-wise
publication similarity metric, which is based on common co-authors,
self-citations, shared references and citations, is established to perform a
two-step agglomerative clustering that first connects individual papers and
then merges similar clusters. This parameterized model is optimized using an
h-index based recall measure, favoring the correct assignment of well-cited
publications, and a name-initials-based precision using WoS metadata and
cross-referenced Google Scholar profiles. Despite the use of limited metadata,
we reach a recall of 87% and a precision of 88% with a preference for
researchers with high h-index values. 47 million articles of WoS can be
disambiguated on a single machine in less than a day. We develop an h-index
distribution model, confirming that the prediction is in excellent agreement
with the empirical data, and yielding insight into the utility of the h-index
in real academic ranking scenarios.
Comment: 14 pages, 5 figures
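The first clustering step described in this abstract (connecting individual papers whose pairwise similarity is high enough) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the record layout, the function names `similarity` and `link_papers`, the weights and the threshold are all assumptions made for the example, and the second step (merging similar clusters) is omitted.

```python
from itertools import combinations


def similarity(a, b, w_coauthor=1.0, w_selfcite=1.0, w_refs=0.5, w_cites=0.5):
    """Weighted pairwise similarity between two papers sharing an author name.

    The four signals follow the abstract; the weights are hypothetical.
    """
    shared_coauthors = len(a["coauthors"] & b["coauthors"])
    # Both papers carry the same ambiguous name, so a citation from one
    # to the other is treated as a candidate self-citation.
    self_citation = int(a["id"] in b["references"] or b["id"] in a["references"])
    shared_refs = len(a["references"] & b["references"])
    shared_cites = len(a["cited_by"] & b["cited_by"])
    return (w_coauthor * shared_coauthors + w_selfcite * self_citation
            + w_refs * shared_refs + w_cites * shared_cites)


def link_papers(papers, threshold=1.0):
    """Connect papers whose similarity reaches the (assumed) threshold."""
    parent = {p["id"]: p["id"] for p in papers}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(papers, 2):
        if similarity(a, b) >= threshold:
            parent[find(a["id"])] = find(b["id"])

    clusters = {}
    for p in papers:
        clusters.setdefault(find(p["id"]), set()).add(p["id"])
    return sorted(clusters.values(), key=min)


papers = [
    {"id": "p1", "coauthors": {"A. Lee"}, "references": {"r1", "r2"}, "cited_by": set()},
    {"id": "p2", "coauthors": {"A. Lee"}, "references": {"r3"}, "cited_by": set()},
    {"id": "p3", "coauthors": {"B. Kim"}, "references": {"r9"}, "cited_by": set()},
]
clusters = link_papers(papers)
```

Here `p1` and `p2` share a co-author and end up in one cluster, while `p3` stays on its own. Comparing every pair is quadratic in the number of papers per name, so a data set of the size reported in the abstract would need a more careful candidate selection than this sketch provides.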
Finding groups in data: Cluster analysis with ants
We present in this paper a modification of Lumer and Faieta's algorithm for data clustering. This approach
mimics the clustering behavior observed in real ant colonies. The algorithm automatically discovers
clusters in numerical data without prior knowledge of the possible number of clusters. In this paper we focus
on ant-based clustering algorithms, a particular kind of swarm intelligence system, and on how the final
clustering is affected by the dissimilarity metric used during classification: Euclidean, Cosine,
and Gower measures. Clustering with swarm-based algorithms is emerging as an alternative to more
conventional clustering methods, such as k-means. Among the many bio-inspired techniques, ant
clustering algorithms have received special attention, especially because they still require much
investigation to improve performance, stability and other key features that would make such algorithms
mature tools for data mining.
As a case study, this paper focuses on the behavior of clustering procedures in those new approaches.
The proposed algorithm and its modifications are evaluated on a number of well-known benchmark
datasets. Empirical results clearly show that ant-based clustering algorithms perform well when
compared to other techniques
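The three dissimilarity measures named in this abstract can be sketched for purely numerical feature vectors as follows. This is a minimal illustration under stated assumptions, not the paper's code: the function names are hypothetical, and the Gower measure is shown only in its numerical form (mean of range-normalized absolute differences), with per-feature ranges supplied by the caller.

```python
import math


def euclidean(x, y):
    """Straight-line distance between two numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_dissimilarity(x, y):
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def gower(x, y, ranges):
    """Gower dissimilarity for numeric features.

    `ranges` holds the observed range (max - min) of each feature over
    the data set; features with zero range contribute no difference.
    """
    total = 0.0
    for a, b, r in zip(x, y, ranges):
        if r > 0:
            total += abs(a - b) / r
    return total / len(ranges)


d_euclid = euclidean([0.0, 0.0], [3.0, 4.0])
d_cosine = cosine_dissimilarity([1.0, 0.0], [0.0, 1.0])
d_gower = gower([0.0, 5.0], [10.0, 5.0], ranges=[10.0, 10.0])
```

Euclidean distance depends on feature scale, cosine dissimilarity depends only on direction, and the Gower measure normalizes each feature by its range, which is one reason the choice of metric can change the final clustering.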