15 research outputs found

    Characteristic sets profile features: Estimation and application to SPARQL query planning

    Get PDF
    RDF dataset profiling is the task of extracting a formal representation of a dataset’s features. Such features may cover various aspects of the RDF dataset ranging from information on licensing and provenance to statistical descriptors of the data distribution and its semantics. In this work, we focus on the characteristics sets profile features that capture both structural and semantic information of an RDF dataset, making them a valuable resource for different downstream applications. While previous research demonstrated the benefits of characteristic sets in centralized and federated query processing, access to these fine-grained statistics is taken for granted. However, especially in federated query processing, computing this profile feature is challenging as it can be difficult and/or costly to access and process the entire data from all federation members. We address this shortcoming by introducing the concept of a profile feature estimation and propose a sampling-based approach to generate estimations for the characteristic sets profile feature. In addition, we showcase the applicability of these feature estimations in federated querying by proposing a query planning approach that is specifically designed to leverage these feature estimations. In our first experimental study, we intrinsically evaluate our approach on the representativeness of the feature estimation. The results show that even small samples of just 0.5% of the original graph’s entities allow for estimating both structural and statistical properties of the characteristic sets profile features. Our second experimental study extrinsically evaluates the estimations by investigating their applicability in our query planner using the well-known FedBench benchmark. The results of the experiments show that the estimated profile features allow for obtaining efficient query plans

    Robust query processing for linked data fragments

    Get PDF
    Linked Data Fragments (LDFs) refer to interfaces that allow for publishing and querying Knowledge Graphs on the Web. These interfaces primarily differ in their expressivity and allow for exploring different trade-offs when balancing the workload between clients and servers in decentralized SPARQL query processing. To devise efficient query plans, clients typically rely on heuristics that leverage the metadata provided by the LDF interface, since obtaining fine-grained statistics from remote sources is a challenging task. However, these heuristics are prone to potential estimation errors based on the metadata which can lead to inefficient query executions with a high number of requests, large amounts of data transferred, and, consequently, excessive execution times. In this work, we investigate robust query processing techniques for Linked Data Fragment clients to address these challenges. We first focus on robust plan selection by proposing CROP, a query plan optimizer that explores the cost and robustness of alternative query plans. Then, we address robust query execution by proposing a new class of adaptive operators: Polymorphic Join Operators. These operators adapt their join strategy in response to possible cardinality estimation errors. The results of our first experimental study show that CROP outperforms state-of-the-art clients by exploring alternative plans based on their cost and robustness. In our second experimental study, we investigate how different planning approaches can benefit from polymorphic join operators and find that they enable more efficient query execution in the majority of cases

    Community detection applied on big linked data

    Get PDF
    The Linked Open Data (LOD) Cloud has more than tripled its sources in just six years (from 295 sources in 2011 to 1163 datasets in 2017). The actual Web of Data contains more then 150 Billions of triples. We are assisting at a staggering growth in the production and consumption of LOD and the generation of increasingly large datasets. In this scenario, providing researchers, domain experts, but also businessmen and citizens with visual representations and intuitive interactions can significantly aid the exploration and understanding of the domains and knowledge represented by Linked Data. Various tools and web applications have been developed to enable the navigation, and browsing of the Web of Data. However, these tools lack in producing high level representations for large datasets, and in supporting users in the exploration and querying of these big sources. Following this trend, we devised a new method and a tool called H-BOLD (High level visualizations on Big Open Linked Data). H-BOLD enables the exploratory search and multilevel analysis of Linked Open Data. It offers different levels of abstraction on Big Linked Data. Through the user interaction and the dynamic adaptation of the graph representing the dataset, it will be possible to perform an effective exploration of the dataset, starting from a set of few classes and adding new ones. Performance and portability of H-BOLD have been evaluated on the SPARQL endpoint listed on SPARQL ENDPOINT STATUS. The effectiveness of H-BOLD as a visualization tool is described through a user study

    Semantic Analysis of R2RML Mappings for Ontology-Based Data Access

    Get PDF
    Ontology-based data access (OBDA) deals with the problem of accessing autonomous data sources through a shared, virtual ontology, and declarative mappings connecting the data sources to the ontology. The W3C standard R2RML allows for mapping relational data sources to RDFS/OWL ontologies. In this paper, we present algorithms for the semantic analysis of R2RML mappings in the OBDA setting, when the ontology is expressed in OWL 2 QL. The focus of such algorithms is to identify the main semantical anomalies (inconsistency and redundancy) of a mapping specification with respect to the ontology and/or the data sources. Such algorithms have been implemented in the mapping analysis tool developed within the Optique European project. We also report on the experiments conducted within the Optique project use cases

    Ontology Pattern-Based Data Integration

    Get PDF
    Data integration is concerned with providing a unified access to data residing at multiple sources. Such a unified access is realized by having a global schema and a set of mappings between the global schema and the local schemas of each data source, which specify how user queries at the global schema can be translated into queries at the local schemas. Data sources are typically developed and maintained independently, and thus, highly heterogeneous. This causes difficulties in integration because of the lack of interoperability in the aspect of architecture, data format, as well as syntax and semantics of the data. This dissertation represents a study on how small, self-contained ontologies, called ontology design patterns, can be employed to provide semantic interoperability in a cross-repository data integration system. The idea of this so-called ontology pattern- based data integration is that a collection of ontology design patterns can act as the global schema that still contains sufficient semantics, but is also flexible and simple enough to be used by linked data providers. On the one side, this differs from existing ontology-based solutions, which are based on large, monolithic ontologies that provide very rich semantics, but enforce too restrictive ontological choices, hence are shunned by many data providers. On the other side, this also differs from the purely linked data based solutions, which do offer simplicity and flexibility in data publishing, but too little in terms of semantic interoperability. We demonstrate the feasibility of this idea through the actual development of a large scale data integration project involving seven ocean science data repositories from five institutions in the U.S. In addition, we make two contributions as part of this dissertation work, which also play crucial roles in the aforementioned data integration project. First, we develop a collection of more than a dozen ontology design patterns that capture the key notions in the ocean science occurring in the participating data repositories. These patterns contain axiomatization of the key notions and were developed with an intensive involvement from the domain experts. Modeling of the patterns was done in a systematic workflow to ensure modularity, reusability, and flexibility of the whole pattern collection. Second, we propose the so-called pattern views that allow data providers to publish their data in very simple intermediate schema and show that they can greatly assist data providers to publish their data without requiring a thorough understanding of the axiomatization of the patterns

    Tracking Federated Queries in the Linked Data

    Get PDF
    Federated query engines allow data consumers to execute queries over the federation of Linked Data (LD). However, as federated queries are decomposed into potentially thousands of subqueries distributed among SPARQL endpoints, data providers do not know federated queries, they only know subqueries they process. Consequently, unlike warehousing approaches, LD data providers have no access to secondary data. In this paper, we propose FETA (FEderated query TrAcking), a query tracking algorithm that infers Basic Graph Patterns (BGPs) processed by a federation from a shared log maintained by data providers. Concurrent execution of thousand subqueries generated by multiple federated query engines makes the query tracking process challenging and uncertain. Experiments with Anapsid show that FETA is able to extract BGPs which, even in a worst case scenario, contain BGPs of original queries

    Distributed Join Approaches for W3C-Conform SPARQL Endpoints

    Get PDF
    Currently many SPARQL endpoints are freely available and accessible without any costs to users: Everyone can submit SPARQL queries to SPARQL endpoints via a standardized protocol, where the queries are processed on the datasets of the SPARQL endpoints and the query results are sent back to the user in a standardized format. As these distributed execution environments for semantic big data (as intersection of semantic data and big data) are freely accessible, the Semantic Web is an ideal playground for big data research. However, when utilizing these distributed execution environments, questions about the performance arise. Especially when several datasets (locally and those residing in SPARQL endpoints) need to be combined, distributed joins need to be computed. In this work we give an overview of the various possibilities of distributed join processing in SPARQL endpoints, which follow the SPARQL specification and hence are "W3C conform". We also introduce new distributed join approaches as variants of the Bitvector-Join and combination of the Semi- and Bitvector-Join. Finally we compare all the existing and newly proposed distributed join approaches for W3C conform SPARQL endpoints in an extensive experimental evaluation

    Methods for Matching of Linked Open Social Science Data

    Get PDF
    In recent years, the concept of Linked Open Data (LOD), has gained popularity and acceptance across various communities and domains. Science politics and organizations claim that the potential of semantic technologies and data exposed in this manner may support and enhance research processes and infrastructures providing research information and services. In this thesis, we investigate whether these expectations can be met in the domain of the social sciences. In particular, we analyse and develop methods for matching social scientific data that is published as Linked Data, which we introduce as Linked Open Social Science Data. Based on expert interviews and a prototype application, we investigate the current consumption of LOD in the social sciences and its requirements. Following these insights, we first focus on the complete publication of Linked Open Social Science Data by extending and developing domain-specific ontologies for representing research communities, research data and thesauri. In the second part, methods for matching Linked Open Social Science Data are developed that address particular patterns and characteristics of the data typically used in social research. The results of this work contribute towards enabling a meaningful application of Linked Data in a scientific domain
    corecore