6 research outputs found

    RDF graph summarization: principles, techniques and applications (tutorial)

    The explosion in the amount of RDF data on the Web has led to the need to explore, query and understand such data sources. The task is challenging due to the complex and heterogeneous structure of RDF graphs which, unlike relational databases, do not come with a structure-dictating schema. Summarization has been applied to RDF data to facilitate these tasks. Its purpose is to extract concise and meaningful information from RDF knowledge bases, representing their content as faithfully as possible. There is no single concept of an RDF summary, and there are many approaches to building such summaries; the summarization goal and the main computational tools employed for summarizing graphs are the main factors behind this diversity. This tutorial presents a structured analysis and comparison of existing works in the area of RDF summarization; it is based upon a recent survey which we co-authored with colleagues [3]. We present the concepts at the core of each approach and outline their main technical aspects and implementation. We conclude by identifying the most pertinent summarization method for different usage scenarios, and by discussing areas where future effort is needed.

    Statistically-driven generation of multidimensional analytical schemas from linked data

    The ever-growing Linked Data (LD) initiative has given rise to large amounts of open, rich, semi-structured data published on the Web. However, effective analytical tools that aid users in their analysis and go beyond browsing and querying are still lacking. To address this issue, we propose the automatic generation of multidimensional analytical stars (MDAS). The success of the multidimensional (MD) model for data analysis has been due in great part to its simplicity. Therefore, in this paper we aim at automatically discovering MD conceptual patterns that summarize LD. These patterns resemble the MD star schema typical of relational data warehousing. The underlying foundation of our method is a statistical framework that takes into account both concept and instance data. We present an implementation that makes use of the statistical framework to generate the MDAS. We have performed several experiments that assess and validate the statistical approach with two well-known and large LD sets. This research has been partially funded by the “Ministerio de Economía y Competitividad” with contract number TIN2014-55335-R. Victoria Nebot was supported by the UJI Postdoctoral Fellowship program with reference PI14490.

    Instance-Based Lossless Summarization of Knowledge Graph With Optimized Triples and Corrections (IBA-OTC)

    Knowledge graph (KG) summarization facilitates efficient information retrieval for exploring complex structural data. Fast information retrieval requires processing redundant data, yet the summary graph must still retain complete information. A good summary also saves computation time during data retrieval and storage space, supports in-memory visualization, and preserves structure after summarization. State-of-the-art approaches summarize a given KG by preserving its structure at the cost of information loss. Approaches that do not preserve the underlying structure, in turn, compromise the summarization ratio by focusing only on the compression of specific regions. As a result, these approaches either fail to preserve the original facts or wrongly predict inferred information. To solve these problems, we present a novel framework for generating a lossless summary that preserves the structure through super signatures and their corresponding corrections. The proposed approach summarizes only the naturally overlapped instances while maintaining their information and preserving the underlying Resource Description Framework (RDF) graph. The resultant summary is composed of triples with positive, negative, and star corrections that are optimized by selectively calling two novel functions, merge and disperse. To evaluate the effectiveness of our proposed approach, we perform experiments on nine publicly available real-world knowledge graphs and obtain a better summarization ratio than state-of-the-art approaches by a margin of 10% to 30%, while achieving completeness, correctness, and compactness. In this way, the retrieval of common events and groups by queries is accelerated in the resultant graph.
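
The idea of a shared signature plus per-instance corrections can be illustrated with a minimal sketch. This is not the IBA-OTC algorithm: the grouping of instances is assumed to be given, the majority rule for building the signature is an assumption made for illustration, star corrections and the merge and disperse functions are not modelled, and the names `summarize` and `reconstruct` are hypothetical. The sketch only shows why positive and negative corrections make a summary lossless.

```python
from collections import defaultdict


def summarize(triples, groups):
    """Build one shared signature per group, plus per-instance corrections.

    `groups` maps a (hypothetical) supernode name to its member instances
    and is assumed to cover every subject that occurs in `triples`.
    """
    props = defaultdict(set)
    for s, p, o in triples:
        props[s].add((p, o))

    summary = {}
    for name, members in groups.items():
        # Count how many members carry each (predicate, object) pair.
        counts = defaultdict(int)
        for m in members:
            for po in props[m]:
                counts[po] += 1
        # Assumed rule: a pair joins the signature if most members have it.
        signature = {po for po, c in counts.items() if 2 * c > len(members)}
        # Positive corrections: facts an instance has beyond the signature.
        plus = {m: props[m] - signature for m in members}
        # Negative corrections: signature facts an instance lacks.
        minus = {m: signature - props[m] for m in members}
        summary[name] = (signature, plus, minus)
    return summary


def reconstruct(summary):
    """Recover the original triples from signatures and corrections."""
    triples = set()
    for signature, plus, minus in summary.values():
        for m in plus:
            for p, o in (signature - minus[m]) | plus[m]:
                triples.add((m, p, o))
    return triples


triples = [
    ("a", "type", "Person"), ("a", "livesIn", "Paris"),
    ("b", "type", "Person"), ("b", "livesIn", "Paris"),
    ("c", "type", "Person"), ("c", "livesIn", "Rome"),
]
groups = {"S1": ["a", "b", "c"]}
summary = summarize(triples, groups)
restored = reconstruct(summary)
```

Here the signature stores the two common facts once, and only instance `c` needs one positive and one negative correction, yet `restored` contains exactly the original triples.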

    Flexible query processing of SPARQL queries

    SPARQL is the predominant language for querying RDF data, which is the standard model for representing web data and more specifically Linked Open Data (a collection of heterogeneous connected data). Datasets in RDF form can be hard for a user to query if she does not have full knowledge of the structure of the dataset. Moreover, many Linked Data datasets are extracted from actual web page content, which might lead to incomplete or inaccurate data. We extend SPARQL 1.1 with two operators, APPROX and RELAX, previously introduced in the context of regular path queries. Using these operators we are able to support flexible querying over the property path queries of SPARQL 1.1. We call this new language SPARQLAR. Using SPARQLAR, users are able to query RDF data without fully knowing the structure of a dataset. APPROX and RELAX encapsulate different aspects of query flexibility: finding different answers and finding more answers, respectively. This means that users can access complex and heterogeneous datasets without the need to know precisely how the data is structured. One of the open problems we address is how to combine the APPROX and RELAX operators with a pragmatic language such as SPARQL. We also devise an implementation of a system that evaluates SPARQLAR queries in order to study the performance of the new language. We begin by defining the semantics of SPARQLAR and the complexity of query evaluation. We then present a query processing technique for evaluating SPARQLAR queries based on a rewriting algorithm and prove its soundness and completeness. During the evaluation of a SPARQLAR query we generate multiple SPARQL 1.1 queries that are evaluated against the dataset. Each such query will generate answers with a cost that indicates their distance with respect to the exact form of the original SPARQLAR query.
Our prototype implementation incorporates three optimisation techniques that aim to enhance query execution performance. The first optimisation is a pre-computation technique that caches the answers of parts of the queries generated by the rewriting algorithm. These answers are then reused to avoid the re-execution of those sub-queries. The second optimisation utilises a summary of the dataset to discard queries that are known not to return any answer. The third optimisation technique uses the query containment concept to discard queries whose answers would be returned by another query at the same or lower cost. We conclude by conducting a performance study of the system on three different RDF datasets: LUBM (Lehigh University Benchmark), YAGO and DBpedia.
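
The cost-ordered evaluation with caching and summary-based pruning can be sketched roughly as follows. This is an illustration under several assumptions rather than the system described above: the rewritten queries and their costs are supplied by hand instead of being produced by the rewriting algorithm, queries are plain conjunctions of triple patterns evaluated in memory rather than SPARQL 1.1 queries sent to a store, the "summary" is simply the set of predicates in the dataset, and the containment-based optimisation is left out. All function names are hypothetical.

```python
def match_pattern(pattern, dataset, cache):
    """Return variable bindings for one triple pattern, caching the result."""
    if pattern not in cache:
        rows = []
        for triple in dataset:
            binding = {}
            for term, value in zip(pattern, triple):
                if term.startswith("?"):
                    # A repeated variable must bind to the same value.
                    if binding.setdefault(term, value) != value:
                        break
                elif term != value:
                    break
            else:
                rows.append(binding)
        cache[pattern] = rows
    return cache[pattern]


def join(left, right):
    """Combine two lists of bindings that agree on shared variables."""
    return [
        {**a, **b}
        for a in left
        for b in right
        if all(a[k] == b[k] for k in a.keys() & b.keys())
    ]


def evaluate_flexibly(rewritings, dataset, max_cost):
    """Map each answer to the lowest cost at which some rewriting returns it."""
    summary = {p for _, p, _ in dataset}  # crude summary: known predicates
    cache = {}  # shared across rewritings so repeated patterns run once
    best = {}
    for cost, query in sorted(rewritings):
        if cost > max_cost:
            break
        # Summary pruning: an unknown predicate cannot yield any answer.
        if any(not p.startswith("?") and p not in summary for _, p, _ in query):
            continue
        rows = [{}]
        for pattern in query:
            rows = join(rows, match_pattern(pattern, dataset, cache))
        for row in rows:
            # Queries are visited by ascending cost, so the first cost wins.
            best.setdefault(tuple(sorted(row.items())), cost)
    return best


dataset = {("alice", "worksAt", "UCL"), ("bob", "studiesAt", "UCL")}
rewritings = [
    (0, (("?x", "worksAt", "UCL"),)),     # the exact query
    (1, (("?x", "studiesAt", "UCL"),)),   # a hand-written rewriting
    (1, (("?x", "employedBy", "UCL"),)),  # pruned: predicate not in summary
]
answers = evaluate_flexibly(rewritings, dataset, max_cost=1)
```

With `max_cost=0` only the exact answer is returned; raising the bound admits answers from the costlier rewritings, each tagged with its distance from the original query.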
