In this paper, we present a method to automatically build large labeled
datasets for the author ambiguity problem in the academic world by leveraging
two authoritative academic resources, ORCID and DOI. Using the method, we built
LAGOS-AND, two large, gold-standard datasets for author name disambiguation
(AND), of which LAGOS-AND-BLOCK is created for clustering-based AND research
and LAGOS-AND-PAIRWISE is created for classification-based AND research. Our
LAGOS-AND datasets are substantially different from the existing ones. The
initial versions of the datasets (v1.0, released in February 2021) include 7.5M
citations authored by 798K unique authors (LAGOS-AND-BLOCK) and close to 1M
instances (LAGOS-AND-PAIRWISE). Both datasets closely resemble
the whole Microsoft Academic Graph (MAG) in validations across six facets. In
building the datasets, we reveal the degree of last-name variation in three
literature databases, PubMed, MAG, and Semantic Scholar, by comparing the author
names they host with the authors' official last names shown on their ORCID pages.
Furthermore, we evaluate several baseline disambiguation methods, as well as
MAG's author ID system, on our datasets, and the evaluation yields
several interesting findings. We hope the datasets and findings will bring new
insights for future studies. The code and datasets are publicly available.

Comment: 33 pages, 7 tables, 7 figures