Contextualized Structural Self-supervised Learning for Ontology Matching
Ontology matching (OM) entails the identification of semantic relationships
between concepts within two or more knowledge graphs (KGs) and serves as a
critical step in integrating KGs from various sources. Recent advancements in
deep OM models have harnessed the power of transformer-based language models
and the advantages of knowledge graph embedding. Nevertheless, these OM models
still face persistent challenges, such as a lack of reference alignments,
runtime latency, and graph structures that remain unexplored within an end-to-end
framework. In this study, we introduce LaKERMap, a novel self-supervised OM
framework that learns from the input ontologies. This framework capitalizes on
the contextual and structural information of concepts by integrating implicit
knowledge into transformers. Specifically, we aim to capture multiple
structural contexts, encompassing both local and global interactions, by
employing distinct training objectives. To assess our methods, we utilize the
Bio-ML datasets and tasks. The results show
that LaKERMap surpasses state-of-the-art systems in terms of alignment quality
and inference time. Our models and code are available here:
https://github.com/ellenzhuwang/lakermap
Results of the Ontology Alignment Evaluation Initiative 2023
The Ontology Alignment Evaluation Initiative (OAEI) aims at comparing ontology matching systems on precisely defined test cases. These test cases can be based on ontologies of different levels of complexity and use different evaluation modalities. The OAEI 2023 campaign offered 15 tracks and was attended by 16 participants. This paper is an overall presentation of that campaign.
A Data-driven Approach to Large Knowledge Graph Matching
In the last decade, a remarkable number of open Knowledge Graphs (KGs) were developed, such as DBpedia, NELL, and YAGO. While some of such KGs are curated via crowdsourcing platforms, others are semi-automatically constructed. This has resulted in a significant degree of semantic heterogeneity and overlapping facts. KGs are highly complementary; thus, mapping them can benefit intelligent applications that require integrating different KGs such as recommendation systems, query answering, and semantic web navigation.
Although the problem of ontology matching has been investigated and a significant number of systems have been developed, the challenges of mapping large-scale KGs remain significant. KG matching has been a topic of interest in the Semantic Web community since it was introduced to the Ontology Alignment Evaluation Initiative (OAEI) in 2018. Nonetheless, a major limitation of the current benchmarks is their lack of representation of real-world KGs. This work also highlights a number of limitations with current matching methods, such as: (i) they are highly dependent on string-based similarity measures, and (ii) they are primarily built to handle well-formed ontologies. These features make them unsuitable for large, (semi/fully) automatically constructed KGs with hundreds of classes and millions of instances. Another limitation of current work is the lack of benchmark datasets that represent the challenging task of matching real-world KGs.
This work addresses the limitation of the current datasets by first introducing two gold standard datasets for matching the schema of large, automatically constructed, less-well-structured KGs based on common KGs such as NELL, DBpedia, and Wikidata. We believe that the datasets which we make public in this work are the largest domain-independent benchmarks for matching KG classes. As many state-of-the-art methods are not suitable for matching large-scale and cross-domain KGs that often suffer from highly imbalanced class distribution, recent studies have revisited instance-based matching techniques in addressing this task. This is because such large KGs often lack a well-defined structure and descriptive metadata about their classes, but contain numerous class instances. Therefore, inspired by the role of instances in KGs, we propose a hybrid matching approach. Our method combines a string-based matcher with an instance-based matcher that casts the schema-matching process as a text classification task by exploiting instances of KG classes. Our method is domain-independent and is able to handle KG classes with imbalanced populations. Further, we show that combining an instance-based approach with an appropriate data balancing strategy yields significant results in matching large and common KG classes.
Machine learning for managing structured and semi-structured data
As the digitalization of private, commercial, and public sectors advances rapidly, an increasing amount of data is becoming available. In order to gain insights or knowledge from these enormous amounts of raw data, a deep analysis is essential. The immense volume requires highly automated processes with minimal manual interaction. In recent years, machine learning methods have taken on a central role in this task. In addition to the individual data points, their interrelationships often play a decisive role, e.g. whether two patients are related to each other or whether they are treated by the same physician. Hence, relational learning is an important branch of research, which studies how to harness this explicitly available structural information between different data points. Recently, graph neural networks have gained importance. These can be considered an extension of convolutional neural networks from regular grids to general (irregular) graphs.
Knowledge graphs play an essential role in representing facts about entities in a machine-readable way. While great efforts are made to store as many facts as possible in these graphs, they often remain incomplete, i.e., true facts are missing. Manual verification and expansion of the graphs is becoming increasingly difficult due to the large volume of data and must therefore be assisted or substituted by automated procedures which predict missing facts. The field of knowledge graph completion can be roughly divided into two categories: Link Prediction and Entity Alignment. In Link Prediction, machine learning models are trained to predict unknown facts between entities based on the known facts. Entity Alignment aims at identifying shared entities between graphs in order to link several such knowledge graphs based on some provided seed alignment pairs.
In this thesis, we present important advances in the field of knowledge graph completion. For Entity Alignment, we show how to reduce the number of required seed alignments while maintaining performance by novel active learning techniques. We also discuss the power of textual features and show that graph-neural-network-based methods have difficulties with noisy alignment data. For Link Prediction, we demonstrate how to improve the prediction for entities that are unknown at training time by exploiting additional metadata on individual statements, often available in modern graphs. Supported by results from a large-scale experimental study, we present an analysis of the effect of individual components of machine learning models, e.g., the interaction function or loss criterion, on the task of link prediction. We also introduce a software library that simplifies the implementation and study of such components and makes them accessible to a wide research community, ranging from relational learning researchers to applied fields, such as life sciences. Finally, we propose a novel metric for evaluating ranking results, as used for both completion tasks. It allows for easier interpretation and comparison, especially in cases with different numbers of ranking candidates, as encountered in the de-facto standard evaluation protocols for both tasks.
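The abstract above does not define the proposed ranking metric. To illustrate why the number of ranking candidates matters when comparing results, here is a minimal sketch, assuming one simple chance-level adjustment: the mean rank is divided by the rank expected under a uniformly random ordering, (n + 1) / 2 for n candidates. The function names `mean_rank` and `adjusted_mean_rank` are hypothetical and are not taken from the thesis or its software library.

```python
def mean_rank(ranks):
    """Arithmetic mean of 1-based ranks of the true answers."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(ranks) / len(ranks)


def adjusted_mean_rank(ranks, num_candidates):
    """Mean rank divided by its expectation under random ordering.

    Assumption (not from the source): with n candidates, a uniformly
    random ranking places the true answer at rank (n + 1) / 2 on average.
    A value of 1.0 is chance level; values near 0 are better.
    """
    if len(ranks) != len(num_candidates):
        raise ValueError("one candidate count is needed per rank")
    expected = sum((n + 1) / 2 for n in num_candidates) / len(num_candidates)
    return mean_rank(ranks) / expected


# The same raw rank means different things for different candidate counts.
few = adjusted_mean_rank([5], [9])    # rank 5 of 9 candidates
many = adjusted_mean_rank([5], [99])  # rank 5 of 99 candidates
print(few, many)
```

Under this assumption, rank 5 among 9 candidates is exactly chance level, while rank 5 among 99 candidates is far better than chance, which the raw mean rank alone cannot show.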
A hybrid approach for large knowledge graphs matching
Matching large and heterogeneous Knowledge Graphs (KGs) has been a challenge in the Semantic Web research community. This work highlights a number of limitations with current matching methods, such as: (1) they are highly dependent on string-based similarity measures, and (2) they are primarily built to handle well-formed ontologies. These features make them unsuitable for large, (semi-) automatically constructed KGs with hundreds of classes and millions of instances. Such KGs share a remarkable number of complementary facts, often described using different vocabulary. Inspired by the role of instances in large-scale KGs, we propose a hybrid matching approach. Our method combines a string-based matcher with an instance-based matcher that casts the schema matching process as a two-way text classification task by exploiting instances of KG classes. Our method is domain-independent and is able to handle KG classes with unbalanced populations. Our evaluation on a real-world KG dataset shows that our method obtains the highest recall and F1 over all OAEI 2020 participants.
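The hybrid idea in the abstract above can be sketched roughly as follows. This is a toy illustration and not the authors' system: the instance-based matcher described in the abstract is a text classifier, which is swapped here for a plain Jaccard overlap of instance names, and the string-based matcher is approximated with `difflib.SequenceMatcher` from the Python standard library. The function names, the equal weighting, and the threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def string_score(a, b):
    """Case-insensitive label similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def instance_score(xs, ys):
    """Jaccard overlap of instance names (stand-in for a trained classifier)."""
    if not xs or not ys:
        return 0.0
    return len(xs & ys) / len(xs | ys)


def hybrid_match(source, target, weight=0.5, threshold=0.5):
    """Map each source class to its best-scoring target class.

    source and target are dicts: class name -> set of instance names.
    weight and threshold are illustrative assumptions.
    """
    alignment = {}
    for s_name, s_inst in source.items():
        best, best_score = None, threshold
        for t_name, t_inst in target.items():
            score = (weight * string_score(s_name, t_name)
                     + (1 - weight) * instance_score(s_inst, t_inst))
            if score >= best_score:
                best, best_score = t_name, score
        if best is not None:
            alignment[s_name] = best
    return alignment


def precision_recall_f1(predicted, reference):
    """Score a predicted alignment against a reference alignment."""
    pred, ref = set(predicted.items()), set(reference.items())
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


source = {"Athlete": {"Messi", "Bolt"}}
target = {"SportsPerson": {"Messi", "Bolt"}, "Planet": {"Mars"}}
alignment = hybrid_match(source, target)
print(alignment)
print(precision_recall_f1(alignment, {"Athlete": "SportsPerson", "City": "Town"}))
```

In this toy example the two class labels differ as strings, so the shared instances are what lifts the pair over the threshold, which is the situation the abstract describes for KGs that use different vocabulary.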
Prediction of Adverse Biological Effects of Chemicals Using Knowledge Graph Embeddings
We have created a knowledge graph based on major data sources used in ecotoxicological risk assessment. We have applied this knowledge graph to an important task in risk assessment, namely chemical effect prediction. We have evaluated nine knowledge graph embedding models from a selection of geometric, decomposition, and convolutional models on this prediction task. We show that using knowledge graph embeddings can increase the accuracy of effect prediction with neural networks. Furthermore, we have implemented a fine-tuning architecture which adapts the knowledge graph embeddings to the effect prediction task and leads to a better performance. Finally, we evaluate certain characteristics of the knowledge graph embedding models to shed light on the individual model performance.
Alinhamento de vocabulário de domínio utilizando os sistemas AML e LogMap (Domain vocabulary alignment using the AML and LogMap systems)
Introduction: In the context of the Semantic Web, interoperability among
heterogeneous ontologies is a challenge due to several factors, among which semantic ambiguity and redundancy stand out. To overcome these challenges, systems and algorithms are adopted to align different ontologies. In this study, it is understood that controlled vocabularies are a particular form of ontology.
Objective: to obtain a vocabulary resulting from the alignment and merging of three vocabularies: the Scientific Domains and Scientific Areas of the Foundation for Science and Technology (FCT), the European Science Vocabulary (EuroSciVoc), and the United Nations Educational, Scientific and Cultural Organization (UNESCO) nomenclature for fields of Science and Technology, in the Computing Sciences domain, to be used
in the IViSSEM project. Methodology: literature review on systems/algorithms for
ontology alignment, using the Preferred Reporting Items for Systematic Reviews
and Meta-Analyses - PRISMA methodology; alignment of the three vocabularies;
and validation of the resulting vocabulary by means of a Delphi study. Results: we
analyzed the 25 ontology alignment systems and variants that
participated in at least one track of the Ontology Alignment Evaluation Initiative
competition between 2018 and 2019. From these systems, Agreement Maker Light
and LogMap were selected to perform the alignment of the three vocabularies,
restricted to the area of Computer Science. Conclusion: The vocabulary was
obtained from Agreement Maker Light because it performed better.
In the end, a vocabulary with 98 terms was obtained in the Computer Science
domain to be adopted by the IViSSEM project. The alignment drew on the
vocabulary used by FCT (Portugal), the one adopted by the European Union
(EuroSciVoc), and another from the domain of Science & Technology
(UNESCO). This result is beneficial to other universities and projects, as well as to
FCT itself.
Exploiting general-purpose background knowledge for automated schema matching
The schema matching task is an integral part of the data integration process. It is usually the first step in integrating data. Schema matching is typically very complex and time-consuming. It is, therefore, for the most part carried out by humans. One reason for the low degree of automation is the fact that schemas are often defined with deep background knowledge that is not itself present within the schemas. Overcoming the problem of missing background knowledge is a core challenge in automating the data integration process.
In this dissertation, the task of matching semantic models, so-called ontologies, with the help of external background knowledge is investigated in-depth in Part I. Throughout this thesis, the focus lies on large, general-purpose resources since domain-specific resources are rarely available for most domains. Besides new knowledge resources, this thesis also explores new strategies to exploit such resources.
A technical base for the development and comparison of matching systems is presented in Part II. The framework introduced here allows for simple and modularized matcher development (with background knowledge sources) and for extensive evaluations of matching systems.
One of the largest structured sources of general-purpose background knowledge is knowledge graphs, which have grown significantly in size in recent years. However, exploiting such graphs is not trivial. In Part III, knowledge graph embeddings are explored, analyzed, and compared. Multiple improvements to existing approaches are presented.
In Part IV, numerous concrete matching systems which exploit general-purpose background knowledge are presented. Furthermore, exploitation strategies and resources are analyzed and compared. This dissertation closes with a perspective on real-world applications.
Results of the Ontology Alignment Evaluation Initiative 2022
The Ontology Alignment Evaluation Initiative (OAEI) aims at comparing ontology matching systems on precisely defined test cases. These test cases can be based on ontologies of different levels of complexity and use different evaluation modalities. The OAEI 2022 campaign offered 14 tracks and was attended by 18 participants. This paper is an overall presentation of that campaign.
Results of the Ontology Alignment Evaluation Initiative 2021
The Ontology Alignment Evaluation Initiative (OAEI) aims at comparing ontology matching systems on precisely defined test cases. These test cases can be based on ontologies of different levels of complexity and use different evaluation modalities (e.g., blind evaluation, open evaluation, or consensus). The OAEI 2021 campaign offered 13 tracks and was attended by 21 participants. This paper is an overall presentation of that campaign.