958 research outputs found
A knowledge graph embeddings based approach for author name disambiguation using literals
Scholarly data is growing continuously and contains information about articles from a plethora of venues, including conferences and journals. Many initiatives have been taken to make scholarly data available in the form of Knowledge Graphs (KGs). These efforts to standardize the data and make it accessible have also led to many challenges, such as the exploration of scholarly articles and ambiguous authors. This study specifically targets the problem of Author Name Disambiguation (AND) on scholarly KGs and presents a novel framework, Literally Author Name Disambiguation (LAND), which utilizes Knowledge Graph Embeddings (KGEs) built from multimodal literal information generated from these KGs. The framework is based on three components: (1) multimodal KGEs, (2) a blocking procedure, and (3) hierarchical agglomerative clustering. Extensive experiments have been conducted on two newly created KGs: (i) a KG containing information from the journal Scientometrics from 1978 onwards (OC-782K), and (ii) a KG extracted from a well-known benchmark for AND provided by AMiner (AMiner-534K). The results show that the proposed architecture outperforms the baselines by 8–14% in terms of F1 score and performs competitively on a challenging benchmark such as AMiner. The code and the datasets are publicly available through GitHub (https://github.com/sntcristian/and-kge) and Zenodo (https://doi.org/10.5281/zenodo.6309855), respectively.
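The clustering component named in this abstract can be illustrated with a minimal sketch. It is not the LAND implementation: the toy embeddings, the cosine distance, the average linkage, and the `threshold` value are all assumptions made for illustration, and the abstract does not state which linkage or cut-off the authors use.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def agglomerative_clusters(vectors, threshold):
    """Average-linkage agglomerative clustering with a distance cut-off.

    Starts with one cluster per vector and repeatedly merges the two
    closest clusters until the closest pair is farther than `threshold`.
    Returns a list of clusters, each a list of vector indices.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        total = sum(
            cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2
        )
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break  # no remaining pair is close enough to merge
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical 2-d embeddings for four publications in one name block.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clusters = agglomerative_clusters(embeddings, threshold=0.2)
```

Each resulting cluster would be read as one candidate author. In practice the clustering would run separately inside each block, so the quadratic pair search stays small.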
Author Matching Classification with Anomaly Detection Approach for Bibliomethric Repository Data
Author name disambiguation (AND) is a complex problem in the process of identifying an author in a digital library (DL). The outcome of AND classification depends heavily on the grouping process and on the data processing techniques applied before the data enter the classifier. In general, the pre-processing techniques used for author matching are pairwise comparison and similarity. For a sufficiently large dataset, the pairwise technique used in this study combines each attribute in the AND dataset and defines a binary class for each author-matching combination, where non-matching authors are given a value of 0 and matching authors are given a value of 1. The technique produces highly imbalanced data, in which class 0 accounts for 98.9% of the data and class 1 for 1.1%. This suggests that class 1 can be treated and processed as an anomaly within the whole dataset. Therefore, anomaly detection is the method chosen in this study, using the Isolation Forest algorithm as its classifier. The results obtained are very satisfactory in terms of accuracy, which can reach 99.5%.
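The pairwise labeling step described in this abstract, and the class imbalance it produces, can be sketched as follows. The records and author identifiers are hypothetical, and the sketch covers only pair construction and labeling, not the attribute similarity features or the Isolation Forest classifier.

```python
from itertools import combinations

# Hypothetical records: (record_id, true_author_id). Not data from the study.
records = [("r1", "A"), ("r2", "A"), ("r3", "B"), ("r4", "C"), ("r5", "D")]


def build_pairs(records):
    """Label every unordered record pair: 1 if same author, else 0."""
    pairs = []
    for (id1, author1), (id2, author2) in combinations(records, 2):
        label = 1 if author1 == author2 else 0
        pairs.append((id1, id2, label))
    return pairs


pairs = build_pairs(records)
positives = sum(label for _, _, label in pairs)
positive_share = positives / len(pairs)  # fraction of matching pairs
```

Because the number of pairs grows quadratically while matching pairs stay rare, the positive class shrinks quickly as the dataset grows. This is the imbalance that motivates treating matches as anomalies.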
Author Name Disambiguation via Heterogeneous Network Embedding from Structural and Semantic Perspectives
Name ambiguity is common in academic digital libraries, such as multiple
authors having the same name. This creates challenges for academic data
management and analysis, thus name disambiguation becomes necessary. The
procedure of name disambiguation is to divide publications with the same name
into different groups, each group belonging to a unique author. A large amount
of attribute information in publications makes traditional methods fall into
the quagmire of feature selection. These methods always select attributes
artificially and equally, which usually causes a negative impact on accuracy.
The proposed method is mainly based on representation learning for
heterogeneous networks and clustering and exploits the self-attention
technology to solve the problem. The presentation of publications is a
synthesis of structural and semantic representations. The structural
representation is obtained by meta-path-based sampling and a skip-gram-based
embedding method, and meta-path level attention is introduced to automatically
learn the weight of each feature. The semantic representation is generated
using NLP tools. Our proposal performs better in terms of name disambiguation
accuracy compared with baselines and the ablation experiments demonstrate the
improvement by feature selection and the meta-path level attention in our
method. The experimental results show the superiority of our new method for
capturing the most attributes from publications and reducing the impact of
redundant information.
OAG-BERT: Pre-train Heterogeneous Entity-augmented Academic Language Models
To enrich language models with domain knowledge is crucial but difficult.
Based on the world's largest public academic graph Open Academic Graph (OAG),
we pre-train an academic language model, namely OAG-BERT, which integrates
massive heterogeneous entities including paper, author, concept, venue, and
affiliation. To better endow OAG-BERT with the ability to capture entity
information, we develop novel pre-training strategies including heterogeneous
entity type embedding, entity-aware 2D positional encoding, and span-aware
entity masking. For zero-shot inference, we design a special decoding strategy
to allow OAG-BERT to generate entity names from scratch. We evaluate the
OAG-BERT on various downstream academic tasks, including NLP benchmarks,
zero-shot entity inference, heterogeneous graph link prediction, and author
name disambiguation. Results demonstrate the effectiveness of the proposed
pre-training approach to both comprehending academic texts and modeling
knowledge from heterogeneous entities. OAG-BERT has been deployed to multiple
real-world applications, such as reviewer recommendations and paper tagging in
the AMiner system. It is also available to the public through the CogDL
package.
Whois? Deep Author Name Disambiguation using Bibliographic Data
As the number of authors increases exponentially over the years, the number
of authors sharing the same name increases proportionally. This makes it
challenging to assign newly published papers to the correct authors.
Therefore, Author Name Ambiguity (ANA) is considered a critical open problem in
digital libraries. This paper proposes an Author Name Disambiguation (AND)
approach that links author names to their real-world entities by leveraging
their co-authors and domain of research. To this end, we use a collection from
the DBLP repository that contains more than 5 million bibliographic records
authored by around 2.6 million co-authors. Our approach first groups authors
who share the same last names and same first name initials. The author within
each group is identified by capturing the relation with his/her co-authors and
area of research, which is represented by the titles of the validated
publications of the corresponding author. To this end, we train a neural
network model that learns from the representations of the co-authors and
titles. We validated the effectiveness of our approach by conducting extensive
experiments on a large dataset.
Comment: Accepted for publication @ TPDL202
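The grouping step in this abstract (same last name and same first-name initial) can be sketched as a simple blocking function. The name format ("First ... Last"), the helper names, and the sample names are assumptions for illustration. The paper's actual name parsing is not described in the abstract.

```python
from collections import defaultdict


def blocking_key(full_name):
    """Build a key from the last name and the first-name initial.

    Assumes a "First ... Last" name order, which is a simplification.
    """
    parts = full_name.strip().lower().split()
    last_name = parts[-1]
    first_initial = parts[0][0]
    return f"{last_name}_{first_initial}"


def build_blocks(names):
    """Group author name strings that share the same blocking key."""
    blocks = defaultdict(list)
    for name in names:
        blocks[blocking_key(name)].append(name)
    return dict(blocks)


# Hypothetical author name strings.
names = ["Wei Wang", "W. Wang", "Wen Wang", "Li Wang", "Anna Smith"]
blocks = build_blocks(names)
```

Only names inside the same block are then compared using co-author and title information, which keeps the candidate set for each author small.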
LAGOS-AND: A Large Gold Standard Dataset for Scholarly Author Name Disambiguation
In this paper, we present a method to automatically build large labeled
datasets for the author ambiguity problem in the academic world by leveraging
the authoritative academic resources, ORCID and DOI. Using the method, we built
LAGOS-AND, two large, gold-standard datasets for author name disambiguation
(AND), of which LAGOS-AND-BLOCK is created for clustering-based AND research
and LAGOS-AND-PAIRWISE is created for classification-based AND research. Our
LAGOS-AND datasets are substantially different from the existing ones. The
initial versions of the datasets (v1.0, released in February 2021) include 7.5M
citations authored by 798K unique authors (LAGOS-AND-BLOCK) and close to 1M
instances (LAGOS-AND-PAIRWISE). Both datasets show close similarities to
the whole Microsoft Academic Graph (MAG) across validations of six facets. In
building the datasets, we reveal the degree of variation of last names in three
literature databases, PubMed, MAG, and Semantic Scholar, by comparing the author
names they host with the authors' official last names shown on their ORCID pages.
Furthermore, we evaluate several baseline disambiguation methods as well as
MAG's author ID system on our datasets, and the evaluation yields
several interesting findings. We hope the datasets and findings will bring new
insights for future studies. The code and datasets are publicly available.
Comment: 33 pages, 7 tables, 7 figures