4 research outputs found

    Towards efficient query processing over heterogeneous RDF interfaces

    Learning from the History of Distributed Query Processing: A Heretic View on Linked Data Management

    Abstract. The vision of the Semantic Web has triggered the development of various new applications and opened up new directions in research. Recently, much effort has been put into the development of techniques for query processing over Linked Data. Because these techniques build on ones originally developed for distributed and federated databases, some of them inherit the same or similar problems. The goal of this paper is therefore to point out pitfalls that the previous generation of researchers has already encountered and to introduce Linked Data as a Service, an idea that has the potential to solve the problem in some scenarios. To this end, the paper discusses nine theses about Linked Data processing and sketches a research agenda for future work in this area.

    A unified data repository for rich communication services

    Rich Communication Services (RCS) is a framework that defines a set of IP-based services for the delivery of multimedia communications to mobile network subscribers. The framework unifies a set of pre-existing communication services under a single name, and permits network operators to re-use investments in existing network infrastructure, especially the IP Multimedia Subsystem (IMS), which is a core part of a mobile network and also acts as a docking station for RCS services. RCS generates and utilises disparate subscriber data sets during execution; however, it lacks a harmonised repository for managing such data sets, which makes it difficult to obtain a unified view of heterogeneous subscriber data. This thesis proposes the creation of a unified data repository for RCS based on the User Data Convergence (UDC) standard. The standard was proposed by the 3rd Generation Partnership Project (3GPP), a major telecommunications standardisation group. UDC provides an approach for consolidating subscriber data into a single logical repository without adversely affecting existing network infrastructure, such as the IMS. Accordingly, this thesis details the design and development of a prototypical implementation of a unified repository, named the Converged Subscriber Data Repository (CSDR). It adopts a polyglot persistence model for the underlying data store and exposes heterogeneous data through the Open Data Protocol (OData), which is a candidate implementation of the Ud interface defined in the UDC architecture. With the introduction of polyglot persistence, multiple data stores can be used within the CSDR, and disparate network data sources can access heterogeneous data sets using OData as a standard communications protocol. As the CSDR persistence model becomes more complex due to the inclusion of more storage technologies, polyglot persistence ensures a consistent conceptual view of these data sets through OData.
Importantly, the CSDR prototype was integrated into a popular open-source implementation of the core part of an IMS network known as the Open IMS Core. The successful integration of the prototype demonstrates its ability to manage and expose a consolidated view of heterogeneous subscriber data, which are generated and used by different RCS services deployed within the IMS.

    Efficient Source Selection For SPARQL Endpoint Query Federation

    The Web of Data has grown enormously over recent years. Currently, it comprises a large compendium of linked and distributed datasets from multiple domains. Due to the decentralised architecture of the Web of Data, several of these datasets contain complementary data. Running complex queries on this compendium thus often requires accessing data from different data sources within one query. The abundance of datasets and the need to run complex queries have thus motivated a considerable body of work on SPARQL query federation systems, the dedicated means to access data distributed over the Web of Data. This thesis addresses two key areas of federated SPARQL query processing: (1) efficient source selection, and (2) comprehensive SPARQL benchmarks to test and rank federated SPARQL engines as well as triple stores. Efficient Source Selection: Efficient source selection is one of the most important optimization steps in federated SPARQL query processing. Overestimating the set of query-relevant data sources increases network traffic, produces irrelevant intermediate results, and can significantly affect the overall query processing time. Previous works have focused on generating optimized query execution plans for fast result retrieval. However, devising source selection approaches beyond triple pattern-wise source selection has not received much attention. Similarly, little attention has been paid to the effect of duplicated data on federated querying. This thesis presents HiBISCuS and TBSS, novel hypergraph-based source selection approaches, and DAW, a duplicate-aware source selection approach to federated querying over the Web of Data. Each of these approaches can be combined directly with existing SPARQL query federation engines to achieve the same recall while querying fewer data sources. We combined the three source selection approaches (HiBISCuS, DAW, and TBSS) with query rewriting to form a complete SPARQL query federation engine named Quetsal.
Furthermore, we present TopFed, a federated query processing engine tailored to the Cancer Genome Atlas (TCGA) that exploits the data distribution to perform intelligent source selection while querying over large TCGA SPARQL endpoints. Finally, we address the issue of rights management and privacy while accessing sensitive resources. To this end, we present SAFE: a global source selection approach that enables decentralised, policy-aware access to sensitive clinical information represented as distributed RDF Data Cubes. Comprehensive SPARQL Benchmarks: Benchmarking is indispensable when aiming to assess technologies with respect to their suitability for given tasks. While several benchmarks and benchmark generation frameworks have been developed to evaluate federated SPARQL engines and triple stores, they mostly provide a one-size-fits-all solution to the benchmarking problem. This approach to benchmarking is, however, unsuitable for evaluating the performance of a triple store for a given application with particular requirements. The fitness of current SPARQL query federation approaches for real applications is difficult to evaluate with current benchmarks, as these are either synthetic or too small in size and complexity. Furthermore, state-of-the-art federated SPARQL benchmarks mostly focus on a single performance criterion, i.e., the overall query runtime. Thus, they cannot provide a fine-grained evaluation of the systems. We address these drawbacks by presenting FEASIBLE, an automatic approach for the generation of benchmarks out of the query history of applications, i.e., query logs, and LargeRDFBench, a billion-triple benchmark for SPARQL query federation which encompasses real data as well as real queries pertaining to real bio-medical use cases. Our evaluation results show that HiBISCuS, TBSS, TopFed, DAW, and SAFE can all significantly reduce the total number of sources selected and thus improve the overall query performance.
In particular, TBSS is the first source selection approach to keep the overall overestimation of relevant sources below 5%. Quetsal reduces the number of sources selected (without losing recall), the source selection time, and the overall query runtime compared with state-of-the-art federation engines. The LargeRDFBench evaluation results suggest that the performance of current SPARQL query federation systems on simple queries does not reflect the systems' performance on more complex queries. Moreover, current federation systems seem unable to deal with many of the challenges that await them in the age of Big Data. Finally, the evaluation results for FEASIBLE show that it generates better sample queries than the state-of-the-art. In addition, the better query selection and the larger set of query types used lead to triple store rankings which partly differ from the rankings generated by previous works.
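
The triple pattern-wise source selection that the abstract above treats as the baseline can be illustrated with a minimal sketch. This is not HiBISCuS, TBSS, DAW, or any other approach from the thesis. It assumes each source is a small in-memory list of triples, and the hypothetical helper `ask` stands in for a SPARQL ASK request to a real endpoint. All source names, prefixes, and data are invented for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]
# A triple pattern; None marks a variable position (assumed encoding).
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]


def matches(pattern: Pattern, triple: Triple) -> bool:
    """True if every bound position of the pattern equals the triple's term."""
    return all(p is None or p == t for p, t in zip(pattern, triple))


def ask(source: List[Triple], pattern: Pattern) -> bool:
    """Hypothetical stand-in for a SPARQL ASK request against one endpoint."""
    return any(matches(pattern, triple) for triple in source)


def select_sources(
    sources: Dict[str, List[Triple]], patterns: List[Pattern]
) -> Dict[Pattern, Set[str]]:
    """Triple pattern-wise selection: each pattern is checked in isolation."""
    return {
        pattern: {name for name, data in sources.items() if ask(data, pattern)}
        for pattern in patterns
    }


# Toy federation of three invented sources.
sources = {
    "drugs": [
        ("ex:aspirin", "ex:treats", "ex:headache"),
        ("ex:aspirin", "rdfs:label", "Aspirin"),
    ],
    "diseases": [("ex:headache", "rdfs:label", "Headache")],
    "genes": [("ex:tp53", "ex:locatedOn", "ex:chr17")],
}

# Two patterns of one query: (?x ex:treats ?d) and (?d rdfs:label ?name).
patterns = [(None, "ex:treats", None), (None, "rdfs:label", None)]

selected = select_sources(sources, patterns)
total_selected = sum(len(names) for names in selected.values())

for pattern, names in selected.items():
    print(pattern, sorted(names))
print("total sources selected:", total_selected)
```

In this toy data the label pattern selects both `drugs` and `diseases`. If the query joins the object of the first pattern with the subject of the second, only `diseases` can contribute to the result, so `drugs` is an overestimated source for that pattern. Selection that considers patterns one at a time cannot detect this, which is the gap the abstract says join-aware approaches aim to close.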