Name Disambiguation from link data in a collaboration graph using temporal and topological features
In a social community, multiple persons may share the same name, phone number
or some other identifying attributes. This, along with other phenomena, such as
name abbreviation, name misspelling, and human error leads to erroneous
aggregation of records of multiple persons under a single reference. Such
mistakes affect the performance of document retrieval, web search, and
database integration and, more importantly, cause improper attribution of
credit (or blame).
The task of entity disambiguation partitions the records shared among
multiple persons so that each resulting partition contains the records of a
single person. Existing solutions to this task use either
biographical attributes, or auxiliary features that are collected from external
sources, such as Wikipedia. However, for many scenarios, such auxiliary
features are not available or are costly to obtain. Moreover, collecting
biographical or external data carries a risk of privacy violation. In this
work, we propose a method that solves the entity disambiguation task using
only link information from a collaboration network. Our method is
privacy-preserving, as it uses only the time-stamped graph topology of an
anonymized network. Experimental results on two real-life academic
collaboration networks show that the proposed method has satisfactory
performance.
Comment: The short version of this paper has been accepted to ASONAM 201
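The temporal and topological signals this abstract describes can be illustrated with a small sketch. All identifiers below (the record tuples, `pair_features`) are hypothetical, not the paper's actual feature set: for each pair of records attributed to an ambiguous name, we derive a co-author-overlap feature from the anonymized graph topology and a time-gap feature from the timestamps.

```python
from itertools import combinations

# Hypothetical time-stamped collaboration records for one ambiguous
# name: (paper_id, frozenset of anonymized co-author ids, year).
records = [
    ("p1", frozenset({"a2", "a3"}), 2005),
    ("p2", frozenset({"a3", "a4"}), 2006),
    ("p3", frozenset({"a7", "a8"}), 2012),
]

def pair_features(r1, r2):
    """Topological + temporal features for a pair of records
    attributed to the same ambiguous name."""
    shared = len(r1[1] & r2[1])   # co-author overlap (topological)
    gap = abs(r1[2] - r2[2])      # years between the papers (temporal)
    return shared, gap

# Pairs with overlapping collaborators and small time gaps are more
# likely to belong to the same real person.
features = {(a[0], b[0]): pair_features(a, b)
            for a, b in combinations(records, 2)}
```

Note that no names, affiliations, or paper texts are consulted, which is what makes the approach non-intrusive of privacy.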
Bayesian Non-Exhaustive Classification - A Case Study: Online Name Disambiguation using Temporal Record Streams
The name entity disambiguation task aims to partition the records of multiple
real-life persons so that each partition contains records pertaining to a
unique person. Most of the existing solutions for this task operate in a batch
mode, where all records to be disambiguated are initially available to the
algorithm. However, more realistic settings require that name
disambiguation be performed in an online fashion, while also being able to
identify records of new ambiguous entities that have no preexisting
records. In this work, we propose a Bayesian non-exhaustive classification
framework for solving the online name disambiguation task. Our proposed
method uses a Dirichlet process prior with a Normal × Normal × Inverse
Wishart data model, which enables identification of new ambiguous entities
that have no records in the training data. For online classification, we
use a one-sweep Gibbs sampler, which is very efficient and effective. As a
case study we consider
bibliographic data in a temporal stream format and disambiguate authors by
partitioning their papers into homogeneous groups. Our experimental results
demonstrate that the proposed method is better than existing methods for
performing the online name disambiguation task.
Comment: To appear in CIKM 201
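The non-exhaustive aspect, deciding between assigning a record to a known author or opening a cluster for a previously unseen one, can be sketched with a Dirichlet-process-style assignment rule. This is a simplified, deterministic caricature (argmax instead of sampling, scalar likelihoods instead of the paper's Normal × Normal × Inverse Wishart model); `crp_assign` and its arguments are illustrative names.

```python
def crp_assign(cluster_sizes, alpha, likelihoods, new_likelihood):
    """One assignment step under a Dirichlet process prior: score each
    existing cluster by (size * likelihood of the record under that
    cluster), and a brand-new cluster by (alpha * prior predictive
    likelihood). Returns the winning cluster index, where
    len(cluster_sizes) means "emerging entity"."""
    scores = [n * lik for n, lik in zip(cluster_sizes, likelihoods)]
    scores.append(alpha * new_likelihood)
    return max(range(len(scores)), key=scores.__getitem__)

# A record that fits no existing author's profile well opens a new
# cluster (index 2 here), even though the prior favors big clusters.
best = crp_assign([10, 3], alpha=1.0,
                  likelihoods=[0.01, 0.02], new_likelihood=0.5)
```

The concentration parameter `alpha` controls how readily new authors are hypothesized; the one-sweep Gibbs sampler in the paper repeats a sampled version of this step once per incoming record, which is what makes online operation cheap.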
Integrating Weakly Supervised Word Sense Disambiguation into Neural Machine Translation
This paper demonstrates that word sense disambiguation (WSD) can improve
neural machine translation (NMT) by widening the source context considered when
modeling the senses of potentially ambiguous words. We first introduce three
adaptive clustering algorithms for WSD, based on k-means, Chinese restaurant
processes, and random walks, which are then applied to large word contexts
represented in a low-rank space and evaluated on SemEval shared-task data. We
then learn word vectors jointly with sense vectors defined by our best WSD
method, within a state-of-the-art NMT system. We show that the concatenation of
these vectors, and the use of a sense selection mechanism based on the weighted
average of sense vectors, outperforms several baselines including sense-aware
ones. This is demonstrated by translation on five language pairs. The
improvements are above one BLEU point over strong NMT baselines, +4% accuracy
over all ambiguous nouns and verbs, or +20% when scored manually over several
challenging words.
Comment: To appear in TAC
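The sense-selection mechanism described above, a weighted average of sense vectors, can be sketched as a softmax over context-sense similarities. This is a minimal stdlib illustration of the weighting idea, not the paper's actual NMT integration; `weighted_sense_vector` and the toy vectors are assumptions.

```python
import math

def weighted_sense_vector(context_vec, sense_vecs):
    """Softmax-weighted average of a word's sense vectors, with
    weights given by dot-product similarity between each sense vector
    and the current source context."""
    sims = [sum(c * s for c, s in zip(context_vec, sv))
            for sv in sense_vecs]
    m = max(sims)                              # stabilize the softmax
    exps = [math.exp(s - m) for s in sims]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(context_vec)
    return [sum(w * sv[i] for w, sv in zip(weights, sense_vecs))
            for i in range(dim)]

# A context strongly aligned with sense 0 yields a vector dominated
# by that sense's embedding.
v = weighted_sense_vector([5.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Soft weighting lets the translation model hedge between senses when the context is genuinely ambiguous, instead of committing to a hard WSD decision.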
Towards Name Disambiguation: Relational, Streaming, and Privacy-Preserving Text Data
In the real world, our DNA is unique, but many people share names. This phenomenon often causes erroneous aggregation of documents of multiple persons who are namesakes of one another. Such mistakes deteriorate the performance of document retrieval and web search and, more seriously, cause improper attribution of credit or blame in digital forensics. To resolve this issue, the name disambiguation task is designed to partition the documents associated with a name reference such that each partition contains documents pertaining to a unique real-life person. Existing algorithms for this task mainly suffer from the following drawbacks. First, the majority of existing solutions rely substantially on feature engineering, such as biographical feature extraction or the construction of auxiliary features from Wikipedia. However, for many scenarios, such features may be costly to obtain or unavailable in privacy-sensitive domains. Instead, we solve the name disambiguation task in a restricted setting by leveraging only the relational data in the form of anonymized graphs. Second, most of the existing works for this task operate in a batch mode, where all records to be disambiguated are initially available to the algorithm. However, more realistic settings require that the name disambiguation task be performed in an online streaming fashion in order to identify records of new ambiguous entities having no preexisting records. Finally, we investigate the potential disclosure risk of textual features used in name disambiguation and propose several algorithms to tackle the task in a privacy-aware scenario. In summary, in this dissertation, we present a number of novel approaches that address name disambiguation from the above three aspects independently, namely relational, streaming, and privacy-preserving textual data.
LAGOS-AND: A Large Gold Standard Dataset for Scholarly Author Name Disambiguation
In this paper, we present a method to automatically build large labeled
datasets for the author ambiguity problem in the academic world by
leveraging two authoritative academic resources, ORCID and DOI. Using the
method, we built
LAGOS-AND, two large, gold-standard datasets for author name disambiguation
(AND), of which LAGOS-AND-BLOCK is created for clustering-based AND research
and LAGOS-AND-PAIRWISE is created for classification-based AND research. Our
LAGOS-AND datasets are substantially different from the existing ones. The
initial versions of the datasets (v1.0, released in February 2021) include 7.5M
citations authored by 798K unique authors (LAGOS-AND-BLOCK) and close to 1M
instances (LAGOS-AND-PAIRWISE). Both datasets show close similarities to
the whole Microsoft Academic Graph (MAG) across validations of six facets. In
building the datasets, we reveal the degree of last-name variation in
three literature databases, PubMed, MAG, and Semantic Scholar, by comparing
the author names they host with the authors' official last names shown on
their ORCID pages.
Furthermore, we evaluate several baseline disambiguation methods as well as
MAG's author ID system on our datasets, and the evaluation reveals several
interesting findings. We hope the datasets and findings will bring new
insights for future studies. The code and datasets are publicly available.
Comment: 33 pages, 7 tables, 7 figure
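The construction idea, using ORCID as the authoritative identity signal to label pairs of DOI-identified papers within a name block, can be sketched as follows. The record layout and function name are illustrative, not the LAGOS-AND pipeline's actual schema.

```python
from itertools import combinations

# Hypothetical records: (DOI, last-name block, ORCID iD). ORCID acts
# as the gold-standard author identity; values here are placeholders.
records = [
    ("10.1/aa", "smith", "0000-0001"),
    ("10.1/bb", "smith", "0000-0001"),
    ("10.1/cc", "smith", "0000-0002"),
]

def pairwise_labels(recs):
    """Within each last-name block, label a paper pair positive iff
    both papers resolve to the same ORCID iD, yielding
    classification-style (PAIRWISE) AND training data."""
    pairs = []
    for r1, r2 in combinations(recs, 2):
        if r1[1] == r2[1]:                       # same block only
            pairs.append(((r1[0], r2[0]), r1[2] == r2[2]))
    return pairs

labels = dict(pairwise_labels(records))
```

Grouping the same records by ORCID within a block would instead give the clustering-style (BLOCK) variant.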
COAD: Contrastive Pre-training with Adversarial Fine-tuning for Zero-shot Expert Linking
Expert finding, a popular service provided by many online websites such as
Expertise Finder, LinkedIn, and AMiner, helps users find consultants,
collaborators, and qualified candidates. However, its quality suffers from
relying on a single source of supporting information about each expert.
This paper employs AMiner, a free online academic search and mining system
that has collected more than 100 million researcher profiles together with
200 million papers from multiple publication databases, as the basis for
investigating the problem of expert linking, which aims at linking external
information about persons to experts in AMiner. A critical challenge is how
to perform zero-shot expert
linking without any labeled linkages from the external information to AMiner
experts, as it is infeasible to acquire sufficient labels for arbitrary
external sources. Inspired by the success of self supervised learning in
computer vision and natural language processing, we propose to train a self
supervised expert linking model, which is first pretrained by contrastive
learning on AMiner data to capture the common representation and matching
patterns of experts across AMiner and external sources, and is then fine-tuned
by adversarial learning on AMiner and the unlabeled external sources to improve
the model transferability. Experimental results demonstrate that COAD
significantly outperforms various baselines without contrastive learning of
experts on two widely studied downstream tasks: author identification
(improving up to 32.1% in HitRatio@1) and paper clustering (improving up to
14.8% in Pairwise-F1). Expert linking on two genres of external sources also
indicates the superiority of the proposed adversarial fine-tuning method
compared with other domain adaptation methods (improving up to 2.3% in
HitRatio@1).
Comment: TKDE under revie
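The contrastive pretraining objective described above is typically an InfoNCE-style loss: pull an expert's representation toward its matching external record and away from in-batch negatives. The sketch below shows the loss for a single anchor given precomputed similarities; the function name and scalar-similarity setup are simplifications, not COAD's exact formulation.

```python
import math

def info_nce(sim_pos, sims_all, temperature=0.1):
    """InfoNCE contrastive loss for one anchor. sim_pos is the
    anchor-positive similarity (same expert across sources);
    sims_all contains the positive plus all negative similarities.
    Loss = -log( exp(s+/T) / sum_i exp(s_i/T) )."""
    logits = [s / temperature for s in sims_all]
    m = max(logits)                              # log-sum-exp trick
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(sim_pos / temperature - log_z)
```

Minimizing this loss is what lets the pretrained model capture matching patterns common to AMiner and external sources before the adversarial fine-tuning stage adapts it to each unlabeled target source.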