    Identifying Geographic Clusters: A Network Analytic Approach

    In recent years there has been growing interest in the role of networks and clusters in the global economy. Despite being a popular research topic in economics, sociology, and urban studies, the geographical clustering of human activity has often been studied by means of predetermined geographical units such as administrative divisions and metropolitan areas. This approach is intrinsically time-invariant and does not allow one to differentiate between activities. Our goal in this paper is to present a new methodology for identifying clusters that can be applied to different empirical settings. We use a graph approach based on k-shell decomposition to analyze world biomedical research clusters derived from PubMed scientific publications. We identify research institutions and locate their activities in geographical clusters. Leading areas of scientific production and their top-performing research institutions are consistently identified at different geographic scales.
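    The k-shell decomposition the abstract refers to is a standard graph technique; below is a minimal sketch of the idea using networkx on a toy co-publication graph. The institution names and edges are invented for illustration and are not the paper's actual PubMed data.

        import networkx as nx

        # Hypothetical graph: nodes stand for research institutions, edges for
        # co-publication links between them.
        G = nx.Graph()
        G.add_edges_from([
            ("Inst_A", "Inst_B"), ("Inst_A", "Inst_C"), ("Inst_B", "Inst_C"),
            ("Inst_C", "Inst_D"), ("Inst_D", "Inst_E"),
        ])

        # core_number() assigns each node its shell index: the largest k such
        # that the node survives in the k-core (the maximal subgraph in which
        # every node has degree >= k).
        shell_index = nx.core_number(G)

        # The innermost shell is the densest part of the network and serves as
        # a candidate core cluster.
        k_max = max(shell_index.values())
        core_cluster = [n for n, k in shell_index.items() if k == k_max]
        print(core_cluster)  # ['Inst_A', 'Inst_B', 'Inst_C'] on this toy graph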

    Author Name Disambiguation Using Co-training

    In the bibliometrics community, author name ambiguity means that an author's name is not a reliable identifier for associating academic papers with their authors. Author name ambiguity has long been a problem for bibliometrics and for service providers such as Google Scholar, generating a domain of study called Author Name Disambiguation (AND). Author name ambiguity is often tackled using classification techniques, where labeled papers are provided and papers are assigned to the correct authors according to the paper text and paper citations. When applying classification methods to author name disambiguation, two issues stand out: first, a paper has multiple views (paper text and citation network); second, training data is scarce, since few papers are labeled. To cope with these two issues, we propose to use the co-training algorithm in AND. The co-training algorithm uses the two views to classify papers iteratively, adding the top-ranked papers to the training pool at each step. We demonstrate that the co-training algorithm outperforms a baseline multi-view classification algorithm, and we also experiment with the co-training algorithm's hyper-parameters. The experiments are done on a PubMed dataset where authors are labeled with ORCID. Papers are represented by two embeddings learnt separately from the paper content and the paper citation network. The baseline classifiers for comparison are logistic regression and SVM.
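    As a concrete illustration of the two-view setup described above, here is a minimal co-training sketch with scikit-learn: each view trains its own classifier, and at every round each classifier pseudo-labels its most confident unlabeled papers and moves them into the shared training pool. All data below is synthetic, and the round count and top-5 confidence cut-off are illustrative choices, not the paper's settings.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        n_lab, n_unlab, dim = 50, 200, 16
        X_text = rng.normal(size=(n_lab + n_unlab, dim))  # view 1: text embedding
        X_cite = rng.normal(size=(n_lab + n_unlab, dim))  # view 2: citation embedding
        y = rng.integers(0, 2, size=n_lab + n_unlab)      # toy binary author labels

        labeled = list(range(n_lab))
        unlabeled = list(range(n_lab, n_lab + n_unlab))

        for _ in range(5):  # a few co-training rounds
            clf_text = LogisticRegression().fit(X_text[labeled], y[labeled])
            clf_cite = LogisticRegression().fit(X_cite[labeled], y[labeled])
            # Each view pseudo-labels the papers it is most confident about and
            # moves them into the shared training pool.
            for clf, X in ((clf_text, X_text), (clf_cite, X_cite)):
                proba = clf.predict_proba(X[unlabeled])
                top = np.argsort(proba.max(axis=1))[-5:]   # 5 most confident papers
                picked = [unlabeled[i] for i in top]
                y[picked] = proba[top].argmax(axis=1)      # assign pseudo-labels
                labeled.extend(picked)
                unlabeled = [i for i in unlabeled if i not in picked]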

    LAGOS-AND: A Large Gold Standard Dataset for Scholarly Author Name Disambiguation

    In this paper, we present a method for automatically building large labeled datasets for the author name ambiguity problem in academia by leveraging two authoritative resources, ORCID and DOI. Using the method, we built LAGOS-AND, two large gold-standard datasets for author name disambiguation (AND): LAGOS-AND-BLOCK, created for clustering-based AND research, and LAGOS-AND-PAIRWISE, created for classification-based AND research. Our LAGOS-AND datasets are substantially different from the existing ones. The initial versions of the datasets (v1.0, released in February 2021) include 7.5M citations authored by 798K unique authors (LAGOS-AND-BLOCK) and close to 1M instances (LAGOS-AND-PAIRWISE), and both datasets show close similarity to the whole Microsoft Academic Graph (MAG) across validations of six facets. In building the datasets, we quantify the degree of last-name variation in three literature databases, PubMed, MAG, and Semantic Scholar, by comparing the author names they host against the authors' official last names shown on their ORCID pages. Furthermore, we evaluate several baseline disambiguation methods, as well as MAG's author ID system, on our datasets; the evaluation surfaces several interesting findings. We hope the datasets and findings will bring new insights for future studies. The code and datasets are publicly available. (Comment: 33 pages, 7 tables, 7 figures)
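    The ORCID grounding the abstract relies on reduces, at its core, to treating shared ORCID iDs as identity labels when generating pairwise training examples. The sketch below shows that pairing step on invented records; the field layout is hypothetical and is not the LAGOS-AND schema.

        from itertools import combinations

        # Toy records: (paper_id, last-name block, ORCID iD)
        records = [
            ("p1", "j_smith", "0000-0001-0000-0001"),
            ("p2", "j_smith", "0000-0001-0000-0001"),
            ("p3", "j_smith", "0000-0001-0000-0002"),
            ("p4", "a_chen",  "0000-0001-0000-0003"),
        ]

        # Positive pairs share an ORCID iD; negatives sit in the same name
        # block but belong to different iDs. Papers from different blocks are
        # never compared.
        pairs = []
        for (pa, ba, oa), (pb, bb, ob) in combinations(records, 2):
            if ba == bb:
                pairs.append(((pa, pb), 1 if oa == ob else 0))

        print(pairs)  # [(('p1','p2'),1), (('p1','p3'),0), (('p2','p3'),0)]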