Heterogeneous Entity Matching with Complex Attribute Associations using BERT and Neural Networks
Across various domains, data from different sources such as Baidu Baike and
Wikipedia often manifest in distinct forms. Current entity matching
methodologies predominantly focus on homogeneous data, characterized by
attributes that share the same structure and concise attribute values. However,
this orientation poses challenges in handling data with diverse formats.
Moreover, prevailing approaches aggregate the similarity of attribute values
between corresponding attributes to ascertain entity similarity. Yet, they
often overlook the intricate interrelationships between attributes, where one
attribute may have multiple associations. The simplistic approach of pairwise
attribute comparison fails to harness the wealth of information encapsulated
within entities. To address these challenges, we introduce a novel entity
matching model, dubbed the Entity Matching Model for Capturing Complex
Attribute Relationships (EMM-CCAR), built upon pre-trained models. Specifically, this model
transforms the matching task into a sequence matching problem to mitigate the
impact of varying data formats. Moreover, by introducing attention mechanisms,
it identifies complex relationships between attributes, emphasizing the degree
of matching among multiple attributes rather than one-to-one correspondences.
Through the integration of the EMM-CCAR model, we adeptly surmount the
challenges posed by data heterogeneity and intricate attribute
interdependencies. In comparison with the prevalent DER-SSM and Ditto
approaches, our model achieves improvements of approximately 4% and 1% in F1
scores, respectively. This furnishes a robust solution for addressing the
intricacies of attribute complexity in entity matching.
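The abstract's reformulation of entity matching as sequence matching can be illustrated with a small sketch. This is not the paper's implementation: it follows the Ditto-style `[COL]`/`[VAL]` serialization convention, and the attribute names and values below are invented for illustration.

```python
# Hypothetical sketch: flattening a heterogeneous entity pair into one token
# sequence, so a pre-trained encoder (e.g. BERT) can score the pair as a
# sequence-matching problem regardless of each source's attribute schema.

def serialize_entity(attrs: dict) -> str:
    """Flatten an entity's attribute/value pairs into a single tagged string."""
    return " ".join(f"[COL] {k} [VAL] {v}" for k, v in attrs.items())

def serialize_pair(left: dict, right: dict) -> str:
    """Join two serialized entities with a separator token; the resulting
    string is what would be fed to the pre-trained model."""
    return f"{serialize_entity(left)} [SEP] {serialize_entity(right)}"

# Two sources describing the same city with differently named attributes:
pair = serialize_pair(
    {"name": "Beijing", "population": "21.54 million"},
    {"title": "Beijing", "inhabitants": "21,540,000"},
)
```

Because both entities end up in one sequence, the encoder's self-attention can relate any attribute on the left to any attribute on the right, rather than being restricted to pairwise comparisons of same-named columns.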
Entity Synonym Discovery via Multipiece Bilateral Context Matching
Being able to automatically discover synonymous entities in an open-world
setting benefits various tasks such as entity disambiguation or knowledge graph
canonicalization. Existing works either only utilize entity features, or rely
on structured annotations from a single piece of context where the entity is
mentioned. To leverage diverse contexts where entities are mentioned, in this
paper, we generalize the distributional hypothesis to a multi-context setting
and propose a synonym discovery framework that detects entity synonyms from
free-text corpora with considerations on effectiveness and robustness. As one
of the key components in synonym discovery, we introduce a neural network model
SYNONYMNET to determine whether or not two given entities are synonyms of
each other. Instead of using entity features, SYNONYMNET makes use of multiple
pieces of contexts in which the entity is mentioned, and compares the
context-level similarity via a bilateral matching schema. Experimental results
demonstrate that the proposed model is able to detect synonym sets that are not
observed during training on both generic and domain-specific datasets:
Wiki+Freebase, PubMed+UMLS, and MedBook+MKG, with up to 4.16% improvement in
terms of Area Under the Curve and 3.19% in terms of Mean Average Precision
compared to the best baseline method. Comment: In IJCAI 2020 as a long paper. Code and data are available at
https://github.com/czhang99/SynonymNe
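The bilateral matching idea can be sketched in miniature. SYNONYMNET's actual schema uses learned scoring and confidence ("leaky") units; the version below is a simplified stand-in that max-pools cosine similarity in both directions over assumed pre-computed context embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def bilateral_match_score(contexts_a, contexts_b):
    """For each context of one entity, keep only its best-matching context
    on the other side, then average the best-match scores in both
    directions. Matching bilaterally keeps the score symmetric and robust
    to a few unrelated contexts on either side."""
    a_to_b = [max(cosine(a, b) for b in contexts_b) for a in contexts_a]
    b_to_a = [max(cosine(b, a) for a in contexts_a) for b in contexts_b]
    return 0.5 * (sum(a_to_b) / len(a_to_b) + sum(b_to_a) / len(b_to_a))
```

With identical context sets the score is 1.0; as the contexts in which the two entities appear diverge, the score drops, giving a decision signal for synonym detection.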
A Critical Re-evaluation of Benchmark Datasets for (Deep) Learning-Based Matching Algorithms
Entity resolution (ER) is the process of identifying records that refer to
the same entities within one or across multiple databases. Numerous techniques
have been developed to tackle ER challenges over the years, with recent
emphasis placed on machine and deep learning methods for the matching phase.
However, the quality of the benchmark datasets typically used in the
experimental evaluations of learning-based matching algorithms has not been
examined in the literature. To cover this gap, we propose four different
approaches to assessing the difficulty and appropriateness of 13 established
datasets: two theoretical approaches, based on new measures of linearity
and existing measures of complexity, and two practical ones: the
difference between the best non-linear and linear matchers, as well as the
difference between the best learning-based matcher and the perfect oracle. Our
analysis demonstrates that most of the popular datasets pose rather easy
classification tasks. As a result, they are not suitable for properly
evaluating learning-based matching algorithms. To address this issue, we
propose a new methodology for yielding benchmark datasets. We put it into
practice by creating four new matching tasks, and we verify that these new
benchmarks are more challenging and therefore more suitable for further
advancements in the field.
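One of the practical difficulty measures described above, the gap between the best non-linear and linear matchers, can be sketched as follows. This is an illustrative reduction, assuming the matchers' binary predictions on a labeled test set are already available; the paper's full methodology involves more than this single score.

```python
def f1(gold, pred):
    """F1 score for binary match/non-match labels (1 = match)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def difficulty_gap(gold, linear_pred, nonlinear_pred):
    """Difference in F1 between the best non-linear and best linear matcher.
    A gap near zero suggests the dataset is (almost) linearly separable and
    thus too easy to discriminate between learning-based matchers."""
    return f1(gold, nonlinear_pred) - f1(gold, linear_pred)
```

On a benchmark where a linear model already matches the non-linear one, `difficulty_gap` is near zero, which is exactly the symptom the abstract reports for most of the 13 established datasets.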