Scaling Author Name Disambiguation with CNF Blocking
An author name disambiguation (AND) algorithm identifies a unique author
entity record from all publication records bearing the same or similar author
names in scholarly or similar databases. Typically, a clustering method is used that requires
calculation of similarities between each possible record pair. However, the
total number of pairs grows quadratically with the size of the author database,
making such clustering difficult for millions of records. One remedy for this
is a blocking function that reduces the number of pairwise similarity
calculations. Here, we introduce a new way of learning blocking schemes by
using a conjunctive normal form (CNF) in contrast to the disjunctive normal
form (DNF). We demonstrate on PubMed author records that CNF blocking reduces
more pairs while preserving high pairs completeness compared to previous
methods that use a DNF, with significantly reduced computation time. Thus,
blocking schemes for scholarly data can be better represented with CNFs. Moreover,
we also show how to ensure that the method produces disjoint blocks so that the
rest of the AND algorithm can be easily parallelized. Our CNF blocking, tested on
the entire PubMed database of 80 million author mentions, efficiently removes
82.17% of all author record pairs in 10 minutes.
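
To make the blocking idea concrete, below is a minimal sketch of CNF blocking in Python. The record fields, predicates, and the toy scheme are illustrative assumptions, not the schemes learned in the paper. A CNF scheme is a conjunction of clauses, each clause a disjunction of blocking predicates; two author mentions are compared only if, for every clause, they agree on at least one predicate in that clause.

from itertools import combinations

# Toy author-mention records (hypothetical fields).
records = [
    {"id": 1, "last": "kim",   "first_init": "j", "affil_word": "michigan"},
    {"id": 2, "last": "kim",   "first_init": "j", "affil_word": "seoul"},
    {"id": 3, "last": "kim",   "first_init": "k", "affil_word": "michigan"},
    {"id": 4, "last": "smith", "first_init": "j", "affil_word": "michigan"},
]

# CNF scheme: (same last name) AND (same first initial OR same affiliation word).
# Each predicate maps a record to a key; two records "agree" on a predicate
# when their keys are equal.
cnf_scheme = [
    [lambda r: r["last"]],                                   # clause 1
    [lambda r: r["first_init"], lambda r: r["affil_word"]],  # clause 2
]

def same_block(a, b, scheme):
    """True iff every clause has some predicate on which a and b agree."""
    return all(any(p(a) == p(b) for p in clause) for clause in scheme)

all_pairs = list(combinations(records, 2))
kept = [(a["id"], b["id"]) for a, b in all_pairs if same_block(a, b, cnf_scheme)]
print(f"kept {len(kept)} of {len(all_pairs)} candidate pairs: {kept}")
# -> kept 2 of 6 candidate pairs: [(1, 2), (1, 3)]

For clarity this sketch tests the scheme pairwise, which is itself quadratic; a practical implementation instead hashes mentions into blocks by key. Because a disjunctive clause can place a mention into several blocks, the paper's construction of disjoint blocks is what lets the downstream AND steps run in parallel.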
The impact of imbalanced training data on machine learning for author name disambiguation
In supervised machine learning for author name disambiguation, negative
training data often far outnumber positive training data. This
paper examines how the ratio of negative to positive training data can affect
the performance of machine learning algorithms that disambiguate author names in
bibliographic records. On multiple labeled datasets, three classifiers -
Logistic Regression, Naïve Bayes, and Random Forest - are trained on
representative features, such as coauthor names and title words, extracted from
the same training data but with various positive-to-negative training data ratios.
Results show that increasing negative training data can improve disambiguation
performance, but only by a few percentage points, and can sometimes degrade
it. Logistic Regression and Naïve Bayes learn optimal disambiguation models
even with a base ratio (1:1) of positive and negative training data. Also, the
performance improvement for Random Forest tends to saturate quickly, roughly
beyond ratios of 1:10 to 1:15. These findings imply that, contrary to the common practice
of using all available training data, name disambiguation algorithms can be trained
on only part of the negative training data, increasing computational efficiency
without much loss in disambiguation performance. This study calls for
more attention from author name disambiguation scholars to methods for machine
learning from imbalanced data.
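
As a rough illustration of this takeaway, the sketch below down-samples negative pairs to a fixed negative-to-positive ratio before training. The synthetic similarity features, the 1:10 cap, and the use of scikit-learn's Logistic Regression are assumptions standing in for the paper's datasets, tuned ratios, and implementations.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def downsample_negatives(X, y, neg_pos_ratio):
    """Keep all positive pairs and at most neg_pos_ratio negatives per positive."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_neg = min(len(neg), int(neg_pos_ratio * len(pos)))
    keep = np.concatenate([pos, rng.choice(neg, size=n_neg, replace=False)])
    return X[keep], y[keep]

# Synthetic pairwise features (e.g., coauthor/title-word overlap scores):
# 1,000 same-author (positive) pairs vs. 50,000 different-author (negative) pairs.
X = np.vstack([rng.normal(1.0, 1.0, size=(1_000, 4)),
               rng.normal(-1.0, 1.0, size=(50_000, 4))])
y = np.concatenate([np.ones(1_000), np.zeros(50_000)])

# Train on a 1:10 positive-to-negative subsample instead of all 51,000 pairs.
X_sub, y_sub = downsample_negatives(X, y, neg_pos_ratio=10)
clf = LogisticRegression().fit(X_sub, y_sub)
print(f"trained on {len(y_sub):,} of {len(y):,} pairs; "
      f"mean accuracy on all pairs: {clf.score(X, y):.3f}")

Since the paper finds that Logistic Regression and Naïve Bayes reach optimal models even at a 1:1 ratio, the cap here could be lowered further for those classifiers.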