A Data-driven Approach to Large Knowledge Graph Matching

Abstract

In the last decade, a remarkable number of open Knowledge Graphs (KGs) were developed, such as DBpedia, NELL, and YAGO. While some of such KGs are curated via crowdsourcing platforms, others are semi-automatically constructed. This has resulted in a significant degree of semantic heterogeneity and overlapping facts. KGs are highly complementary; thus, mapping them can benefit intelligent applications that require integrating different KGs such as recommendation systems, query answering, and semantic web navigation. Although the problem of ontology matching has been investigated and a significant number of systems have been developed, the challenges of mapping large-scale KGs remain significant. KG matching has been a topic of interest in the Semantic Web community since it has been introduced to the Ontology Alignment Evaluation Initiative (OAEI) in 2018. Nonetheless, a major limitation of the current benchmarks is their lack of representation of real-world KGs. This work also highlights a number of limitations with current matching methods, such as: (i) they are highly dependent on string-based similarity measures, and (ii) they are primarily built to handle well-formed ontologies. These features make them unsuitable for large, (semi/fully) automatically constructed KGs with hundreds of classes and millions of instances. Another limitation of current work is the lack of benchmark datasets that represent the challenging task of matching real-world KGs. This work addresses the limitation of the current datasets by first introducing two gold standard datasets for matching the schema of large, automatically constructed, less-well-structured KGs based on common KGs such as NELL, DBpedia, and Wikidata. We believe that the datasets which we make public in this work make the largest domain-independent benchmarks for matching KG classes. As many state-of-the-art methods are not suitable for matching large-scale and cross-domain KGs that often suffer from highly imbalanced class distribution, recent studies have revisited instance-based matching techniques in addressing this task. This is because such large KGs often lack a well-defined structure and descriptive metadata about their classes, but contain numerous class instances. Therefore, inspired by the role of instances in KGs, we propose a hybrid matching approach. Our method composes an instance-based matcher that casts the schema-matching process as a text classification task by exploiting instances of KG classes, and a string-based matcher. Our method is domain-independent and is able to handle KG classes with imbalanced populations. Further, we show that incorporating an instance-based approach with the appropriate data balancing strategy results in significant results in matching large and common KG classes

    Similar works