8 research outputs found

    SDM-RDFizer: An RML Interpreter for the Efficient Creation of RDF Knowledge Graphs

    In recent years, the amount of data has increased exponentially, and knowledge graphs have gained attention as data structures to integrate data and knowledge harvested from myriad data sources. However, these data sources are usually characterized by data complexity issues such as large volume, high duplicate rates, and heterogeneity, so data management tools are required that can address the negative impact of these issues on the knowledge graph creation process. In this paper, we propose the SDM-RDFizer, an interpreter of the RDF Mapping Language (RML), to transform raw data in various formats into an RDF knowledge graph. SDM-RDFizer implements novel algorithms to execute the logical operators between mappings in RML, allowing it to scale up to complex scenarios where data is not only broad but also has a high duplication rate. We empirically evaluate the performance of SDM-RDFizer against diverse testbeds with varying configurations of data volume, duplicates, and heterogeneity. The observed results indicate that SDM-RDFizer is two orders of magnitude faster than the state of the art, meaning that SDM-RDFizer is an interoperable and scalable solution for knowledge graph creation. SDM-RDFizer is publicly available as a resource through a GitHub repository and a DOI.

    Normalization Techniques For Improving The Performance Of Knowledge Graph Creation Pipelines

    With the rapid growth of data on the web, the demand for discovering information within data and subsequently exploiting knowledge graphs is rising faster than commonly assumed. Data integration systems can help meet this demand because they transform data from various sources and of different volumes. To this end, a data integration system uses mapping rules, specified in a language like RML, to integrate data collected from various data sources into a knowledge graph. However, large data sources may suffer from various data quality issues, redundancy being one of them. In this regard, the Semantic Web community contributes to Knowledge Engineering with techniques to create knowledge graphs efficiently. The thesis reported in this document tackles the creation of knowledge graphs in the presence of data sources with redundant data, and proposes a novel normalization theory to solve this problem. This theory covers not only the characteristics of the data sources but also the mapping rules used to integrate the data sources into a knowledge graph. Based on this, three normal forms are proposed, together with an algorithm for transforming mapping rules and data sources into these normal forms. The performance of the proposed approach is evaluated in different testbeds composed of real-world and synthetic data. The observed results suggest that the proposed techniques can dramatically reduce the execution time of knowledge graph creation. Therefore, the normalization theory of this thesis contributes to the repertoire of tools that facilitate the creation of knowledge graphs at scale.

    Semantic data integration and knowledge graph creation at scale

    Contrary to data, knowledge is often abstract. Concrete knowledge can be achieved through the inclusion of semantics in the data models, highlighting the role of data integration. The massive growth of data in recent years has increased the demand for scaling up data management techniques; materializing data integration, a.k.a. knowledge graph creation, falls into that category. In this thesis, we investigate efficient methods and techniques for materializing data integration. We formalize the process of materializing data integration. We formally define the characteristics of a materialized data integration system that merges the data operators and sources. Owing to this formalism, both layers of data integration, namely data-level and schema-level integration, are formalized in the context of mapping assertions. We explore optimization opportunities for improving the materialization of data integration systems. We identify three angles, including intra- and inter-mapping assertions, from which the materialization can be improved. Accordingly, we propose source-based, mapping-based, and inter-mapping assertion groups of optimization techniques. We apply our proposed techniques in three real-world projects and illustrate how these optimization techniques contribute to meeting the objectives of those projects. Furthermore, we study the parameters that impact the performance of materializing data integration. Relying on parameters reported in the literature and on parameters presumed to have an impact, we build four groups of testbeds. We empirically study the performance of these testbeds in the presence and absence of our proposed techniques, in terms of execution time. We observe that the savings can be up to 75%. Lastly, we contribute to facilitating the declarative definition of data integration systems. We propose two sets of data operation function signatures in the Function Ontology (FnO). The first set of functions is designed to perform the task of entity alignment by resorting to an entity and relation linking tool. The second library consists of domain-specific functions to align genomic entities by harmonizing their representations. Finally, we introduce a tool equipped with a user interface to facilitate the process of defining declarative mapping rules by allowing users to explore the data sources and unified schema while defining their correspondences.

    Dragoman: Efficiently Evaluating Declarative Mapping Languages over Frameworks for Knowledge Graph Creation

    In recent years, there have been valuable efforts and contributions to make the process of RDF knowledge graph creation traceable and transparent; extending and applying declarative mapping languages is an example. One challenging step is the traceability of procedures that aim to overcome interoperability issues, a.k.a. data-level integration. In most pipelines, data integration is performed by ad-hoc programs, preventing traceability and reusability. However, formal frameworks provided by function-based declarative mapping languages such as FunUL and RML+FnO empower expressiveness. Data-level integration can be defined as functions and integrated as part of the mappings performing schema-level integration. However, combining functions with the mappings introduces a new source of complexity that can considerably impact the amount of resources required and the execution time. We tackle the problem of efficiently executing mappings with functions and formalize their transformation into function-free mappings. These transformations are the basis of an optimization process that aims to perform an eager evaluation of function-based mapping rules. These techniques are implemented in a framework named Dragoman. We demonstrate the correctness of the transformations while ensuring that the function-free data integration processes are equivalent to the original ones. The effectiveness of Dragoman is empirically evaluated in 230 testbeds composed of various types of functions integrated with mapping rules of different complexity. The outcomes suggest that evaluating function-free mapping rules reduces execution time in complex knowledge graph creation pipelines composed of large data sources and multiple types of mapping rules. The savings can be up to 75%, suggesting that eagerly executing functions in mapping rules makes these pipelines applicable and scalable in real-world settings.

    The RML Ontology: A Community-Driven Modular Redesign After a Decade of Experience in Mapping Heterogeneous Data to RDF

    Peer reviewed. The Relational to RDF Mapping Language (R2RML) became a W3C Recommendation a decade ago. Despite its wide adoption, its potential applicability beyond relational databases was swiftly explored. As a result, several extensions and new mapping languages were proposed to tackle the limitations that surfaced as R2RML was applied in real-world use cases. Over the years, one of these languages, the RDF Mapping Language (RML), has gathered a large community of contributors, users, and compliant tools. So far, there has been no well-defined set of features for the mapping language, nor was there a consensus-marking ontology. Consequently, it has become challenging for non-experts to fully comprehend and utilize the full range of the language’s capabilities. After three years of work, the W3C Community Group on Knowledge Graph Construction proposes a new specification for RML. This paper presents the new modular RML ontology and the accompanying SHACL shapes that complement the specification. We discuss the motivations and challenges that emerged when extending R2RML, the methodology we followed to design the new ontology while ensuring its backward compatibility with R2RML, and the novel features which increase its expressiveness. The new ontology consolidates the potential of RML, empowers practitioners to define mapping rules for constructing RDF graphs that were previously unattainable, and allows developers to implement systems in adherence with [R2]RML. Resource type: Ontology. License: CC BY 4.0 International. DOI: 10.5281/zenodo.7918478. URL: http://w3id.org/rml/portal/

    Semantic Data Management in Data Lakes

    In recent years, data lakes have emerged as a way to manage large amounts of heterogeneous data for modern data analytics. One way to prevent data lakes from turning into inoperable data swamps is semantic data management. Some approaches propose linking metadata to knowledge graphs based on the Linked Data principles to provide more meaning and semantics to the data in the lake. Such a semantic layer may be utilized not only for data management but also to tackle the problem of data integration from heterogeneous sources, in order to make data access more expressive and interoperable. In this survey, we review recent approaches with a specific focus on their application within data lake systems and their scalability to Big Data. We classify the approaches into (i) basic semantic data management, (ii) semantic modeling approaches for enriching metadata in data lakes, and (iii) methods for ontology-based data access. In each category, we cover the main techniques and their background, and compare the latest research. Finally, we point out challenges for future work in this research area, which needs a closer integration of Big Data and Semantic Web technologies.