285 research outputs found
Scale‐free collaboration networks: An author name disambiguation perspective
Peer Reviewed https://deepblue.lib.umich.edu/bitstream/2027.42/149559/1/asi24158.pdf https://deepblue.lib.umich.edu/bitstream/2027.42/149559/2/asi24158_am.pd
Improving co-authorship network structures by combining multiple data sources: evidence from Italian academic statisticians
The aim of the present contribution is to merge bibliographic data for members of a bounded scientific community in order to derive a complete unified archive, covering both top international and nationally oriented production, as a new basis for network analysis on a unified co-authorship network. A two-step procedure is used to deal with the identification of duplicate records and with author name disambiguation. Specifically, the second step draws heavily on a well-established unsupervised disambiguation method proposed in the literature, which follows a network-based approach and requires a restricted set of record attributes. Evidence from Italian academic statisticians is provided by merging data from three bibliographic archives. Non-negligible differences were observed in network results when comparing disambiguated and non-disambiguated data sets, especially in network measures at the individual level.
Author identification in bibliographic data using deep neural networks
Author name disambiguation (AND) is a challenging task for scholars who mine bibliographic information for scientific knowledge. A constructive approach for resolving name ambiguity is to use computer algorithms to identify author names. Some algorithm-based disambiguation methods have been developed by computer and data scientists. Among them, supervised machine learning has been reported to produce decent to very accurate disambiguation results. This paper presents a combination of principal component analysis (PCA) for feature reduction and deep neural networks (DNNs) as a supervised algorithm for classifying AND problems. The raw data is grouped into four classes: synonyms, homonyms, homonyms-synonyms, and non-homonyms-synonyms. We tuned several hyperparameters, such as learning rate, batch size, and the number of neurons and hidden units, and analyzed their impact on the accuracy of the results. To the best of our knowledge, there are no previous studies with such a scheme. The proposed DNNs are validated against other ML techniques such as Naïve Bayes, random forest (RF), and support vector machine (SVM) to produce a good classifier. Across all data, our proposed DNN classifier outperformed the other ML techniques, with accuracy, precision, recall, and F1-score of 99.98%, 97.98%, 97.86%, and 99.99%, respectively. In the future, this approach can be easily extended to any dataset and any bibliographic records provider.
Harnessing Historical Corrections to build Test Collections for Named Entity Disambiguation
Matching mentions of persons to the actual persons (the name disambiguation
problem) is central for several digital library applications. Scientists have
been working on algorithms to create this matching for decades without finding
a universal solution. One problem is that test collections for this problem are
often small and specific to a certain collection. In this work, we present an
approach that can create large test collections from historical metadata with
minimal extra cost. We apply this approach to the DBLP collection to generate
two freely available test collections. One collection focuses on the properties
of defects and one on the evaluation of disambiguation algorithms. Comment: Preprint of a paper accepted at TPDL 201
Exploiting citation networks for large-scale author name disambiguation
We present a novel algorithm and validation method for disambiguating author
names in very large bibliographic data sets and apply it to the full Web of
Science (WoS) citation index. Our algorithm relies only upon the author and
citation graphs available for the whole period covered by the WoS. A pair-wise
publication similarity metric, which is based on common co-authors,
self-citations, shared references and citations, is established to perform a
two-step agglomerative clustering that first connects individual papers and
then merges similar clusters. This parameterized model is optimized using an
h-index based recall measure, favoring the correct assignment of well-cited
publications, and a name-initials-based precision using WoS metadata and
cross-referenced Google Scholar profiles. Despite the use of limited metadata,
we reach a recall of 87% and a precision of 88% with a preference for
researchers with high h-index values. 47 million articles of WoS can be
disambiguated on a single machine in less than a day. We develop an h-index
distribution model, confirming that the prediction is in excellent agreement
with the empirical data, and yielding insight into the utility of the h-index
in real academic ranking scenarios. Comment: 14 pages, 5 figures
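The abstract above describes a pair-wise publication similarity followed by a two-step agglomerative clustering. A minimal sketch of that idea follows. It is not the authors' implementation: the record fields, the equal weights, the thresholds `t1` and `t2`, and the average-linkage merge rule are all assumptions chosen for illustration, and the toy records are invented.

```python
from itertools import combinations

def similarity(a, b, w=(1.0, 1.0, 1.0, 1.0)):
    """Weighted count of shared evidence between two papers (weights are assumed)."""
    shared_coauthors = len(a["coauthors"] & b["coauthors"])
    self_citation = int(a["id"] in b["refs"] or b["id"] in a["refs"])
    shared_refs = len(a["refs"] & b["refs"])
    shared_citing = len(a["cited_by"] & b["cited_by"])
    return (w[0] * shared_coauthors + w[1] * self_citation
            + w[2] * shared_refs + w[3] * shared_citing)

def link_papers(papers, t1):
    """Step 1: connect individual papers whose similarity reaches t1 (union-find)."""
    parent = {p["id"]: p["id"] for p in papers}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(papers, 2):
        if similarity(a, b) >= t1:
            parent[find(a["id"])] = find(b["id"])
    clusters = {}
    for p in papers:
        clusters.setdefault(find(p["id"]), []).append(p)
    return list(clusters.values())

def merge_clusters(clusters, t2):
    """Step 2: merge clusters whose average cross-cluster similarity reaches t2."""
    clusters = [list(c) for c in clusters]
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(clusters)), 2):
            total = sum(similarity(a, b) for a in clusters[i] for b in clusters[j])
            if total / (len(clusters[i]) * len(clusters[j])) >= t2:
                clusters[i].extend(clusters[j])
                del clusters[j]
                merged = True
                break  # restart the scan because indices changed
    return clusters

def disambiguate(papers, t1=2.0, t2=0.5):
    return merge_clusters(link_papers(papers, t1), t2)

# Invented toy block of papers that all carry the same ambiguous author name.
papers = [
    {"id": "p1", "coauthors": {"X", "Y"}, "refs": {"r1", "r2"}, "cited_by": set()},
    {"id": "p2", "coauthors": {"X"}, "refs": {"r1", "p1"}, "cited_by": set()},
    {"id": "p3", "coauthors": {"Y"}, "refs": {"r9"}, "cited_by": set()},
    {"id": "p4", "coauthors": {"Z"}, "refs": {"r7"}, "cited_by": set()},
]
result = [sorted(p["id"] for p in c) for c in disambiguate(papers)]
```

In this toy block, `p1` and `p2` are linked in the first step (shared co-author, self-citation, shared reference), `p3` joins them only in the second step through the weaker average similarity, and `p4` stays on its own. In the paper the parameters are optimized against recall and precision measures; here they are fixed by hand.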
Identification of Indonesian Authors Using Deep Neural Networks
Author Name Disambiguation (AND) is a problem that occurs when a set of publications contains ambiguous author names, i.e., the same author may appear under different names (synonyms) in other published papers, or different authors may share the same name (homonyms). In this final project, we design a model with a Deep Neural Network (DNN) classifier. The dataset consists of primary data sourced from the Scopus website. This research focuses on integrating data from Indonesian authors. Accuracy, sensitivity, and precision are the standard benchmarks used to assess the performance of methods for solving AND problems. The best DNN classification model achieves 99.9936% accuracy, 93.1433% sensitivity, and 94.3733% precision. The highest performance is measured for the Non Synonym-Homonym (SH) case, with 99.9967% accuracy, 96.7388% sensitivity, and 97.5102% precision.
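The three measures named in the abstract above have standard definitions in terms of confusion-matrix counts. A minimal sketch follows; the function name and the counts are invented for illustration and are not taken from the paper.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity (recall), and precision from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
    }

# Invented, heavily imbalanced counts: few positives, many true negatives.
scores = binary_metrics(tp=90, fp=10, tn=9890, fn=10)
```

With these counts, accuracy is 0.998 while sensitivity and precision are both 0.9. When negatives dominate, accuracy can sit well above the other two measures, which is one possible reading of the gap between the reported accuracy and the reported sensitivity and precision.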
Author Name Disambiguation via Heterogeneous Network Embedding from Structural and Semantic Perspectives
Name ambiguity is common in academic digital libraries, such as multiple
authors having the same name. This creates challenges for academic data
management and analysis, thus name disambiguation becomes necessary. The
procedure of name disambiguation is to divide publications with the same name
into different groups, each group belonging to a unique author. A large amount
of attribute information in publications bogs traditional methods down in
feature selection. These methods typically select attributes manually and
weight them equally, which usually hurts accuracy. The proposed method is
based mainly on representation learning for heterogeneous networks and on
clustering, and it exploits self-attention to solve the problem. The
representation of each publication is a synthesis of structural and semantic
representations. The structural
representation is obtained by meta-path-based sampling and a skip-gram-based
embedding method, and meta-path level attention is introduced to automatically
learn the weight of each feature. The semantic representation is generated
using NLP tools. Our proposal achieves higher name disambiguation accuracy
than the baselines, and the ablation experiments demonstrate the improvement
contributed by feature selection and meta-path level attention. The
experimental results show that the new method captures more attributes from
publications and reduces the impact of redundant information.
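Meta-path-based sampling, as mentioned in the abstract above, can be illustrated with a type-constrained random walk over a heterogeneous graph. The sketch below covers only the sampling step. The node types (P for paper, A for author, V for venue), the `PAP` meta-path, and the toy graph are assumptions, and the skip-gram embedding and meta-path level attention from the abstract are not implemented here.

```python
import random

def metapath_walk(graph, node_type, start, metapath, length, rng):
    """Random walk that only follows edges matching a cyclic meta-path.

    graph: dict mapping node -> list of neighbours
    node_type: dict mapping node -> one-letter type
    metapath: e.g. "PAP"; first and last type must match so it can repeat
    """
    pattern = metapath[:-1]  # "PAP" repeats as P, A, P, A, ...
    walk = [start]
    while len(walk) < length:
        wanted = pattern[len(walk) % len(pattern)]
        candidates = [n for n in graph[walk[-1]] if node_type[n] == wanted]
        if not candidates:
            break  # dead end: no neighbour of the required type
        walk.append(rng.choice(candidates))
    return walk

# Invented toy network: three papers, two authors, one venue.
edges = [("p1", "a1"), ("p2", "a1"), ("p2", "a2"), ("p3", "a2"),
         ("p1", "v1"), ("p2", "v1")]
graph = {}
for u, v in edges:
    graph.setdefault(u, []).append(v)
    graph.setdefault(v, []).append(u)
node_type = {n: n[0].upper() for n in graph}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
walks = [metapath_walk(graph, node_type, p, "PAP", 5, rng)
         for p in ("p1", "p2", "p3")]
```

Each walk is a sequence of node identifiers that alternates between papers and authors and never steps onto the venue node. Sequences like these would then serve as the "sentences" for a skip-gram model, with one set of walks per meta-path.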