Identifying Geographic Clusters: A Network Analytic Approach
In recent years there has been a growing interest in the role of networks and
clusters in the global economy. Despite being a popular research topic in
economics, sociology and urban studies, geographical clustering of human
activity has often been studied by means of predetermined geographical units
such as administrative divisions and metropolitan areas. This approach is
intrinsically time invariant and it does not allow one to differentiate between
different activities. Our goal in this paper is to present a new methodology
for identifying clusters, that can be applied to different empirical settings.
We use a graph approach based on k-shell decomposition to analyze world
biomedical research clusters based on PubMed scientific publications. We
identify research institutions and locate their activities in geographical
clusters. Leading areas of scientific production and their top performing
research institutions are consistently identified at different geographic
scales.
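The abstract names k-shell decomposition without describing it. As a rough illustration only, the standard peeling procedure can be sketched as below: nodes of the lowest remaining degree are removed repeatedly, and each node is assigned the shell index at which it is removed. The function name `k_shell` and the toy graph are hypothetical, and nothing here is taken from the authors' pipeline or data.

```python
def k_shell(adj):
    """Return {node: shell index} for an undirected graph.

    adj maps each node to the set of its neighbours (assumed symmetric).
    """
    degree = {n: len(nb) for n, nb in adj.items()}
    remaining = set(adj)
    shell = {}
    k = 0
    while remaining:
        # Nodes whose remaining degree is at most k belong to shell k.
        peel = [n for n in remaining if degree[n] <= k]
        if not peel:
            k += 1  # nothing left to peel at this level, move to the next shell
            continue
        while peel:
            n = peel.pop()
            if n not in remaining:
                continue  # already removed via a duplicate queue entry
            remaining.discard(n)
            shell[n] = k
            # Removing n lowers the degree of its surviving neighbours,
            # which may push them into the current shell as well.
            for m in adj[n]:
                if m in remaining:
                    degree[m] -= 1
                    if degree[m] <= k:
                        peel.append(m)
    return shell


# Hypothetical toy graph: a triangle (a, b, c) with one pendant node d.
toy_graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
shells = k_shell(toy_graph)
print(shells)
```

In this toy graph the pendant node falls in shell 1 and the triangle forms shell 2, which is the sense in which higher shells mark more densely connected cores.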
Author Name Disambiguation Using Co-training
In the bibliometrics community, author name ambiguity means that an author's name is not a reliable identifier for associating academic papers with their authors. Author name ambiguity has been a problem for bibliometrics and for service providers like Google Scholar, giving rise to a domain of study called Author Name Disambiguation (AND). Author name ambiguity is often tackled using classification techniques, where labeled papers are provided and papers are assigned to the correct authors according to the paper text and paper citations. When applying classification methods to author name disambiguation, two issues stand out. One is that a paper has multiple views (paper text and citation network). The other is the lack of training data: few papers are labeled. To cope with these two issues, we propose to use the co-training algorithm in AND. The co-training algorithm uses two views to classify papers iteratively and adds the top selected papers to the training pool. We demonstrate that the co-training algorithm outperforms the baseline multi-view classification algorithm. We also experiment with hyper-parameters in the co-training algorithm. The experiment is done on the PubMed dataset, where authors are labeled with ORCID. Papers are represented by two embeddings that are learnt separately from paper content and from the paper citation network. Baseline classifiers for comparison are logistic regression and SVM.
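The co-training loop described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. A nearest-centroid classifier stands in for the logistic regression and SVM classifiers the abstract mentions, so the example needs only the standard library. The names `co_train`, `centroid_fit`, `centroid_predict`, the `rounds` and `top_k` parameters, and the two-dimensional toy embeddings are all hypothetical.

```python
def centroid_fit(X, y):
    """Compute one mean vector per label (a stand-in classifier)."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def centroid_predict(centroids, x):
    """Return (label, margin); margin is the gap to the second-nearest centroid."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, c)), label)
        for label, c in centroids.items()
    )
    best_dist, best_label = dists[0]
    margin = dists[1][0] - best_dist if len(dists) > 1 else 0.0
    return best_label, margin


def co_train(view_a, view_b, labels, rounds=5, top_k=1):
    """labels maps paper index -> author label for the labeled papers only."""
    pool = dict(labels)
    for _ in range(rounds):
        unlabeled = [i for i in range(len(view_a)) if i not in pool]
        if not unlabeled:
            break
        additions = {}
        # Each view trains on the shared pool and nominates its most
        # confident predictions for the next round's training pool.
        for view in (view_a, view_b):
            idx = list(pool)
            model = centroid_fit([view[i] for i in idx], [pool[i] for i in idx])
            scored = []
            for i in unlabeled:
                label, margin = centroid_predict(model, view[i])
                scored.append((margin, i, label))
            scored.sort(reverse=True)
            for _margin, i, label in scored[:top_k]:
                additions.setdefault(i, label)  # first view to claim a paper wins
        pool.update(additions)
    return pool


# Hypothetical toy data: six papers, two authors, two views per paper.
text_view = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
cite_view = [(1.0, 0.0), (0.9, 0.1), (1.1, -0.1), (-1.0, 0.0), (-0.9, 0.2), (-1.1, 0.1)]
result = co_train(text_view, cite_view, {0: "A", 3: "B"})
print(result)
```

Starting from one labeled paper per author, each round adds at most `top_k` papers per view, so the training pool grows gradually from the most confident predictions.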
LAGOS-AND: A Large Gold Standard Dataset for Scholarly Author Name Disambiguation
In this paper, we present a method to automatically build large labeled
datasets for the author ambiguity problem in the academic world by leveraging
the authoritative academic resources, ORCID and DOI. Using the method, we built
LAGOS-AND, two large, gold-standard datasets for author name disambiguation
(AND), of which LAGOS-AND-BLOCK is created for clustering-based AND research
and LAGOS-AND-PAIRWISE is created for classification-based AND research. Our
LAGOS-AND datasets are substantially different from the existing ones. The
initial versions of the datasets (v1.0, released in February 2021) include 7.5M
citations authored by 798K unique authors (LAGOS-AND-BLOCK) and close to 1M
instances (LAGOS-AND-PAIRWISE). And both datasets show close similarities to
the whole Microsoft Academic Graph (MAG) across validations of six facets. In
building the datasets, we reveal the variation degrees of last names in three
literature databases, PubMed, MAG, and Semantic Scholar, by comparing author
names hosted to the authors' official last names shown on the ORCID pages.
Furthermore, we evaluate several baseline disambiguation methods as well as the
MAG's author IDs system on our datasets, and the evaluation helps identify
several interesting findings. We hope the datasets and findings will bring new
insights for future studies. The code and datasets are publicly available.
Comment: 33 pages, 7 tables, 7 figures