279 research outputs found

    Ontology Based Data Access in Statoil

    Get PDF
    Ontology Based Data Access (OBDA) is a prominent approach to query databases which uses an ontology to expose data in a conceptually clear manner by abstracting away from the technical schema-level details of the underlying data. The ontology is ‘connected’ to the data via mappings that allow to automatically translate queries posed over the ontology into data-level queries that can be executed by the underlying database management system. Despite a lot of attention from the research community, there are still few instances of real world industrial use of OBDA systems. In this work we present data access challenges in the data-intensive petroleum company Statoil and our experience in addressing these challenges with OBDA technology. In particular, we have developed a deployment module to create ontologies and mappings from relational databases in a semi-automatic fashion; a query processing module to perform and optimise the process of translating ontological queries into data queries and their execution over either a single DB of federated DBs; and a query formulation module to support query construction for engineers with a limited IT background. Our modules have been integrated in one OBDA system, deployed at Statoil, integrated with Statoil’s infrastructure, and evaluated with Statoil’s engineers and data

    Peer-based query rewriting in SPARQL for semantic integration of linked data

    Get PDF
    In this proposal we address the problem of ontology-based SPARQL query answering over distributed Linked Data sources, where the ontology is given by conjunctive mappings between the source schemas in a peer-to-peer fashion and by equality constraints between constants. In our setting, the data is not materialised in a single datastore: it is accessed in a distributed environment through SPARQL endpoints. We aim to achieve query answering by generating the perfect rewriting of the original query and then processing the rewritten query over distributed SPARQL endpoints. We identify a subset of ontology constraints that enjoy the first-order rewritability property and we perform preliminary empirical evaluation taking into account such restricted constraints only. For future work, we aim to tackle the query answering problem in the general case

    Querying the web of data with low latency: high performance distributed SPARQL processing and benchmarking

    No full text
    The Web of Data extends the World Wide Web (WWW) in a way that applications can understand information and cooperate with humans on complex tasks. The basis of performing complex tasks is low latency queries over the Web of Data. The large scale and distributed nature of the Web of Data have negative impacts on several critical factors for efficient query processing, including fast data transmission between datasets, predictable data distribution and statistics that summarise and describe certain patterns in the data. Moreover, it is common on the Web of Data that the same resource is identified by multiple URIs. This phenomenon, named co-reference, potentially increases the complexity of query processing, and makes it even harder to obtain accurate statistics. With the aforementioned challenges, it is not clear whether it is possible to achieve efficient queries on the Web of Data on a large scale.In this thesis, we explore techniques to improve the efficiency of querying the Web of Data on a large scale. More specifically, we investigate two typical scenarios on the Web of Data, which are: 1) the scenario in which all datasets provide detailed statistics that are possibly available on a large scale, and 2) the scenario in which co-reference is taken into account, and datasets’ statistics are not reliable. For each scenario we explore existing and novel optimisation techniques that are tailored for querying the Web of Data, as well as well developed techniques with careful adjustments.For the scenario with detailed statistics we provide a scheme that implements a statistics query optimisation approach that requires detailed statistics, and intensively exploits parallelism. We propose an efficient algorithm called Parallel Sub-query Identification () to increase the degree of parallelism. () breaks a SPARQL query into sub-queries that can be processed in parallel while not increasing network traffic. We combine with dynamic programming to produce query plans with both minimum costs and a fair degree of parallelism. Furthermore, we develop a mechanism that maximally exploits bandwidth and computing power of datasets. For the scenario having co-reference and without reliable statistics we provide a scheme that implements a dynamic query optimisation approach that takes co-reference into account, and utilises runtime statistics to elevate query efficiency even further. We propose a model called Virtual Graph to transform a query and all its co-referent siblings into a single query with pre-defined bindings. Virtual Graph reduces the large number of outgoing and incoming requests that is required to process co-referent queries individually. Moreover, Virtual Graph enables query optimisers to find the optimal plan with respect to all co-referent queries as a whole. () is used in this scheme as well but provides a higher degree of parallelism with the help of runtime statistics. A Minimum-Spanning-Tree-based algorithm is used in this scheme as a result of using runtime statistics. The same parallel execution mechanism used in the previous scenario is adopted here as well.In order to examine the effectiveness of our schemes in practice, we deploy the above approaches in two distributed SPARQL engines, LHD-s and LHD-d respectively. Both engines are implemented using a popular Java-based platform for building Semantic Web applications. They can be used as either standalone applications or integrated into existing systems that require quick response of Linked Data queries.We also propose a scalable and flexible benchmark, called Distributed SPARQL Evaluation Framework (DSEF), for evaluating optimisation approaches in the Web of Data. DSEF adopts a expandable virtual-machine-based structure and provides a set of efficient tools to help easily set up RDF networks of arbitrary sizes. We further investigate the proportion and distribution of co-reference in the real world, based on which DESF is able to simulate co-reference for given RDF datasets. DSEF bases its soundness in the usage of widely accepted assessment data and queries.By comparing both LHD-s and LHD-d with existing approaches using DSEF, we provide evidence that neither existing statistics provided by datasets nor cost estimation methods, are sufficiently accurate. On the other hand, dynamic optimisation using runtime statistics together with carefully tuned parallelism are promising for significantly reducing the latency of large scale queries on the Web of Data. We also demonstrate that () and Virtual Graph algorithms significantly increase query efficiency for queries with or without co-reference.In summary, the contributions of this these include: 1) proposing two schemes for improving query efficiency in two typical scenarios in the Web of Data; 2) providing implementations, named LHD-s and LHD-d, for the two schemes respectively; 3) proposing a scalable and flexible evaluation framework for distributed SPARQL engines called DSEF; and 4) showing evidence that runtime-statistics-based dynamic optimisation with parallelism are promising to reduce latency of Linked Data queries on a large scale

    How Many and What Types of SPARQL Queries can be Answered through Zero-Knowledge Link Traversal?

    Full text link
    The current de-facto way to query the Web of Data is through the SPARQL protocol, where a client sends queries to a server through a SPARQL endpoint. Contrary to an HTTP server, providing and maintaining a robust and reliable endpoint requires a significant effort that not all publishers are willing or able to make. An alternative query evaluation method is through link traversal, where a query is answered by dereferencing online web resources (URIs) at real time. While several approaches for such a lookup-based query evaluation method have been proposed, there exists no analysis of the types (patterns) of queries that can be directly answered on the live Web, without accessing local or remote endpoints and without a-priori knowledge of available data sources. In this paper, we first provide a method for checking if a SPARQL query (to be evaluated on a SPARQL endpoint) can be answered through zero-knowledge link traversal (without accessing the endpoint), and analyse a large corpus of real SPARQL query logs for finding the frequency and distribution of answerable and non-answerable query patterns. Subsequently, we provide an algorithm for transforming answerable queries to SPARQL-LD queries that bypass the endpoints. We report experimental results about the efficiency of the transformed queries and discuss the benefits and the limitations of this query evaluation method.Comment: Preprint of paper accepted for publication in the 34th ACM/SIGAPP Symposium On Applied Computing (SAC 2019

    Federating Queries to RDF repositories

    Get PDF
    Currently large amounts of RDF data are being published in the Web. These data is commonly accessed by means of SPARQL endpoints. However to query a set of SPARQL endpoints new mechanisms are needed due to neither the SPARQL protocol nor the language provide any norms or guidelines about how to proceed. In this paper we present an approach for federating queries to a set of SPARQL endpoints, using relational database distributed query processing techniques and part of the WS-DAI specification for web-service based access to relational and XML databases

    An integrated approach to deliver OLAP for multidimensional Semantic Web Databases

    Get PDF
    Semantic Webs (SW) and web data have become increasingly important sources to support Business Intelligence (BI), but they are difficult to manage due to the exponential increase in their volumes, inconsistency in semantics and complexity in representations. On-Line Analytical Processing (OLAP) is an important tool in analysing large and complex BI data, but it lacks the capability of processing disperse SW data due to the nature of its design. A new concept with a richer vocabulary than the existing ones for OLAP is needed to model distributed multidimensional semantic web databases. A new OLAP framework is developed, with multiple layers including additional vocabulary, extended OLAP operators, and usage of SPARQL to model heterogeneous semantic web data, unify multidimensional structures, and provide new enabling functions for interoperability. The framework is presented with examples to demonstrate its capability to unify existing vocabularies with additional vocabulary elements to handle both informational and topological data in Graph OLAP. The vocabularies used in this work are: the RDF Cube Vocabulary (QB) – proposed by the W3C to allow multi-dimensional, mostly statistical, data to be published in RDF; and the QB4OLAP – a QB extension introducing standard OLAP operators. The framework enables the composition of multiple databases (e.g. energy consumptions and property market values etc.) to generate observations through semantic pipe-like operators. This approach is demonstrated through Use Cases containing highly valuable data collected from a real-life environment. Its usability is proved through the development and usage of semantic pipe-like operators able to deliver OLAP specific functionalities. To the best of my knowledge there is no available data modelling approach handling both informational and topological Semantic Web data, which is designed either to provide OLAP capabilities over Semantic Web databases or to provide a means to connect such databases for further OLAP analysis. The thesis proposes that the presented work provides a wider understanding of: ways to access Semantic Web data; ways to build specialised Semantic Web databases, and, how to enrich them with powerful capabilities for further Business Intelligence

    Federated Query Processing for the Semantic Web

    Get PDF
    The recent years have witnessed a constant growth in the amount of RDF data available on the Web. This growth is largely based on the increasing rate of data publication on the Web by different actors such governments, life science researchers or geographical institutes. RDF data generation is mainly done by converting already existing legacy data resources into RDF (e.g. converting data stored in relational databases into RDF), but also by creating that RDF data directly (e.g. sensors). These RDF data are normally exposed by means of Linked Data-enabled URIs and SPARQL endpoints. Given the sustained growth that we are experiencing in the number of SPARQL endpoints available, the need to be able to send federated SPARQL queries across them has also grown. Tools for accessing sets of RDF data repositories are starting to appear, differing between them on the way in which they allow users to access these data (allowing users to specify directly what RDF data set they want to query, or making this process transparent to them). To overcome this heterogeneity in federated query processing solutions, the W3C SPARQL working group is deïŹning a federation extension for SPARQL 1.1, which allows combining in a single query, graph patterns that can be evaluated in several endpoints. In this PhD thesis, we describe the syntax of that SPARQL extension for providing access to distributed RDF data sets and formalise its semantics. We adapt existing techniques for distributed data access in relational databases in order to deal with SPARQL endpoints, which we have implemented in our federation query evaluation system (SPARQL-DQP). We describe the static optimisation techniques that we implemented in our system and we carry out a series of experiments that show that our optimisations signiïŹcantly speed up the query evaluation process in presence of large query results and optional operator
    • 

    corecore