Tracking Federated Queries in the Linked Data
Federated query engines allow data consumers to execute queries over the
federation of Linked Data (LD). However, as federated queries are decomposed
into potentially thousands of subqueries distributed among SPARQL endpoints,
data providers do not see the original federated queries; they only see the
subqueries they process. Consequently, unlike warehousing approaches, LD data providers have no
access to secondary data. In this paper, we propose FETA (FEderated query
TrAcking), a query tracking algorithm that infers Basic Graph Patterns (BGPs)
processed by a federation from a shared log maintained by data providers.
The concurrent execution of thousands of subqueries generated by multiple
federated query engines makes the query tracking process challenging and
uncertain. Experiments with Anapsid show that FETA is able to extract BGPs
which, even in a worst-case scenario, contain the BGPs of the original queries.
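The core intuition behind reconstructing BGPs from a shared log can be sketched as follows. This is a minimal, hypothetical illustration (the log format, function names, and the greedy clustering heuristic are invented, not FETA's actual algorithm): subqueries that share a join variable are merged into one candidate BGP.

```python
# Hypothetical sketch of log-based BGP reconstruction in the spirit of FETA:
# logged triple-pattern subqueries that share a join variable are greedily
# merged into one candidate Basic Graph Pattern. Names and the log format
# are invented for illustration.

def shares_variable(tp1, tp2):
    """True if two triple patterns share at least one variable (?-terms)."""
    vars1 = {t for t in tp1 if t.startswith("?")}
    vars2 = {t for t in tp2 if t.startswith("?")}
    return bool(vars1 & vars2)

def reconstruct_bgps(log):
    """Greedily cluster logged triple patterns into candidate BGPs."""
    bgps = []
    for tp in log:
        for bgp in bgps:
            if any(shares_variable(tp, other) for other in bgp):
                bgp.append(tp)
                break
        else:
            bgps.append([tp])
    return bgps

log = [
    ("?drug", "ex:name", "?n"),
    ("?drug", "ex:interactsWith", "?d2"),
    ("?city", "ex:population", "?p"),
]
candidates = reconstruct_bgps(log)
# the two drug patterns join on ?drug; the city pattern stands alone
```

The uncertainty the abstract mentions shows up here directly: two unrelated queries that happen to reuse the same variable names, executed concurrently, would be merged into one spurious BGP by such a heuristic.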
How Many and What Types of SPARQL Queries can be Answered through Zero-Knowledge Link Traversal?
The current de-facto way to query the Web of Data is through the SPARQL
protocol, where a client sends queries to a server through a SPARQL endpoint.
Contrary to an HTTP server, providing and maintaining a robust and reliable
endpoint requires a significant effort that not all publishers are willing or
able to make. An alternative query evaluation method is through link traversal,
where a query is answered by dereferencing online web resources (URIs) in real
time. While several approaches for such a lookup-based query evaluation method
have been proposed, there exists no analysis of the types (patterns) of queries
that can be directly answered on the live Web, without accessing local or
remote endpoints and without a-priori knowledge of available data sources. In
this paper, we first provide a method for checking if a SPARQL query (to be
evaluated on a SPARQL endpoint) can be answered through zero-knowledge link
traversal (without accessing the endpoint), and analyse a large corpus of real
SPARQL query logs for finding the frequency and distribution of answerable and
non-answerable query patterns. Subsequently, we provide an algorithm for
transforming answerable queries to SPARQL-LD queries that bypass the endpoints.
We report experimental results about the efficiency of the transformed queries
and discuss the benefits and the limitations of this query evaluation method.
Comment: Preprint of paper accepted for publication in the 34th ACM/SIGAPP Symposium On Applied Computing (SAC 2019).
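The idea of an answerability check can be illustrated with a deliberately simplified sketch. The actual criteria in the paper are richer; here a BGP counts as answerable by zero-knowledge traversal only if every triple pattern's subject is either a constant URI (dereferenceable) or a variable bound by an earlier pattern, so lookups can be chained:

```python
# Minimal, hypothetical answerability check inspired by the idea above:
# traversal must be able to start from some constant URI and chain lookups
# through variables bound along the way. This is an illustration only, not
# the paper's exact decision procedure.

def is_uri(term):
    return term.startswith("http://") or term.startswith("https://")

def answerable_by_traversal(bgp):
    bound = set()
    for s, p, o in bgp:
        # the subject must be dereferenceable now, or already bound
        if not (is_uri(s) or s in bound):
            return False
        # dereferencing the subject yields bindings for the pattern's variables
        for term in (p, o):
            if term.startswith("?"):
                bound.add(term)
    return True

q1 = [("http://example.org/Tim", "knows", "?x"),
      ("?x", "name", "?n")]      # chainable from a seed URI
q2 = [("?x", "name", "?n")]      # no starting URI: not answerable
```

A pattern-order-sensitive check like this also hints at why the transformation to SPARQL-LD queries matters: reordering patterns so that URI-rooted ones come first can turn a seemingly non-answerable query into an answerable one.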
The Odyssey Approach for Optimizing Federated SPARQL Queries
Answering queries over a federation of SPARQL endpoints requires combining
data from more than one data source. Optimizing queries in such scenarios is
particularly challenging not only because of (i) the large variety of possible
query execution plans that correctly answer the query but also because (ii)
there is only limited access to statistics about schema and instance data of
remote sources. To overcome these challenges, most federated query engines rely
on heuristics to reduce the space of possible query execution plans or on
dynamic programming strategies to produce optimal plans. Nevertheless, these
plans may still exhibit a high number of intermediate results or high execution
times because of heuristics and inaccurate cost estimations. In this paper, we
present Odyssey, an approach that uses statistics that allow for a more
accurate cost estimation for federated queries and therefore enables Odyssey to
produce better query execution plans. Our experimental results show that
Odyssey produces query execution plans that are better in terms of data
transfer and execution time than state-of-the-art optimizers. Our experiments
using the FedBench benchmark show execution time gains of at least 25 times on
average.
Comment: 16 pages, 10 figures
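Why cardinality estimates drive plan quality can be sketched with a toy optimizer. Odyssey itself uses richer statistics and dynamic programming; the greedy scheme, the statistics, and all names below are invented solely to show how inaccurate estimates inflate intermediate results:

```python
# Toy cost-based join ordering (invented statistics, not Odyssey's algorithm):
# given per-pattern cardinality estimates and a join-selectivity estimate,
# greedily pick the next join that minimizes estimated intermediate results.

def greedy_join_order(cards, selectivity):
    """cards: {pattern: estimated rows}; selectivity(a, b) in (0, 1]."""
    remaining = dict(cards)
    current = min(remaining, key=remaining.get)   # start from smallest relation
    rows = remaining.pop(current)
    order = [current]
    while remaining:
        # estimated cost of joining the current result with each candidate
        nxt = min(remaining,
                  key=lambda r: rows * remaining[r] * selectivity(current, r))
        rows = rows * remaining.pop(nxt) * selectivity(current, nxt)
        order.append(nxt)
        current = nxt
    return order, rows

cards = {"tp1": 1000, "tp2": 10, "tp3": 100}
order, est = greedy_join_order(cards, lambda a, b: 0.01)
```

If the estimate for `tp2` were off by two orders of magnitude, the optimizer would start from the wrong pattern and every downstream intermediate result would grow accordingly, which is exactly the failure mode better statistics are meant to avoid.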
SECF: Improving SPARQL Querying Performance with Proactive Fetching and Caching
Querying SPARQL endpoints can be unsatisfactory due to the high latency of connections to the endpoints. Caching is an important way to accelerate query response times. In this paper, we propose the SPARQL Endpoint Caching Framework (SECF), a client-side caching framework for this purpose.
In particular, we prefetch and cache the results of queries similar to recently cached queries, aiming to improve overall querying performance. The similarity between queries is calculated via an improved Graph Edit Distance (GED) function. We also adapt a smoothing method to implement cache replacement. Empirical evaluations on real-world queries show that our approach has great potential to enhance the cache hit rate and accelerate querying speed on SPARQL endpoints.
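The client-side structure of such a cache can be sketched as follows. Everything here is an invented simplification: real SECF computes similarity with a learned Graph Edit Distance, while this sketch uses plain triple-pattern overlap (Jaccard), and the class and method names are illustrative only.

```python
# Toy client-side query cache with similarity-driven prefetching (invented
# names; similarity here is triple-pattern Jaccard overlap, not SECF's GED).

def similarity(q1, q2):
    s1, s2 = set(q1), set(q2)
    return len(s1 & s2) / len(s1 | s2)

class QueryCache:
    def __init__(self, execute, threshold=0.5):
        self.execute = execute        # query -> results (hits the endpoint)
        self.threshold = threshold
        self.store = {}               # frozenset(patterns) -> cached results
        self.known = []               # previously seen queries (prefetch pool)

    def get(self, query):
        key = frozenset(query)
        if key in self.store:
            return self.store[key]    # cache hit: no network round trip
        results = self.store[key] = self.execute(query)
        # proactively re-fetch similar previously seen queries not in cache
        for past in self.known:
            pk = frozenset(past)
            if pk not in self.store and similarity(query, past) >= self.threshold:
                self.store[pk] = self.execute(past)
        self.known.append(query)
        return results

fetch_log = []
def fetch(query):
    fetch_log.append(tuple(query))
    return f"results({len(query)} patterns)"

cache = QueryCache(fetch)
q = [("?x", "name", "?n")]
first = cache.get(q)    # miss: hits the endpoint
second = cache.get(q)   # hit: served locally
```

The design point this illustrates is that prefetching trades extra endpoint load for lower latency on future hits, which is why a good similarity function and a sensible cache-replacement policy matter.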
A Distributed SPARQL Query Processing Engine Optimized for Vertical and Horizontal Data Partitions
An increasing number of linked knowledge bases are openly accessible over the Internet. Distributed Query Processing (DQP) techniques enable querying multiple knowledge bases coherently. However, the precise DQP semantics is often overlooked, and query performance issues arise. In this paper, we propose a DQP engine for distributed RDF graphs that adopts a SPARQL-compliant DQP semantics. We improve performance through heuristics that generate Basic Graph Pattern-based sub-queries designed to maximize the parts of the query processed by the remote endpoints. We evaluate our DQP engine on a query set representative of the most common SPARQL clauses and on different data distribution schemes. Results show a significant reduction in the number of remote queries executed and in query execution time while preserving completeness.
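The heuristic of maximizing the work pushed to remote endpoints can be sketched in a few lines. The interface below (`locate`, the endpoint names) is invented for illustration: the point is that patterns answerable by the same endpoint are grouped into one BGP sub-query, so the remote server performs the joins and far fewer remote calls are issued.

```python
# Illustrative sketch (invented interface) of BGP-based sub-query grouping:
# instead of sending triple patterns one by one, ship each endpoint the
# largest BGP it can answer, letting the remote server do the joins.

def group_by_endpoint(bgp, locate):
    """locate(tp) -> endpoint able to answer tp; returns {endpoint: sub-BGP}."""
    subqueries = {}
    for tp in bgp:
        subqueries.setdefault(locate(tp), []).append(tp)
    return subqueries

bgp = [("?x", "name", "?n"),
       ("?x", "knows", "?y"),
       ("?y", "population", "?p")]
locate = lambda tp: "A" if tp[1] in {"name", "knows"} else "B"
subs = group_by_endpoint(bgp, locate)
# endpoint A receives one 2-pattern BGP instead of two separate requests
```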
SMART-KG: Hybrid Shipping for SPARQL Querying on the Web
While Linked Data (LD) provides standards for publishing (RDF) and querying (SPARQL) Knowledge Graphs (KGs) on the Web, serving, accessing and processing such open, decentralized KGs is often practically impossible, as query timeouts on publicly available SPARQL endpoints show. Alternative solutions such as Triple Pattern Fragments (TPF) attempt to tackle the problem of availability by pushing query processing workload to the client side, but suffer from unnecessary transfer of irrelevant data on complex queries with large intermediate results. In this paper we present smart-KG, a novel approach to share the load between servers and clients, while significantly reducing data transfer volume, by combining TPF with shipping compressed KG partitions. Our evaluations show that smart-KG outperforms state-of-the-art client-side solutions and increases server-side availability, enabling more cost-effective and balanced hosting of open and decentralized KGs.
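The hybrid-shipping decision at the heart of this approach can be sketched per triple pattern. The partitioning scheme and names below are invented for illustration (smart-KG's real partitions are predicate-family based and compressed): patterns whose constant predicate belongs to an already-shipped partition are evaluated locally, and the rest fall back to paginated TPF requests.

```python
# Hypothetical sketch of a hybrid-shipping decision (invented partition
# scheme): evaluate a triple pattern against a locally shipped partition
# when one covers its predicate, otherwise issue TPF requests to the server.

def plan_shipping(bgp, shipped_partitions):
    """Assign each triple pattern to 'local' (shipped partition) or 'tpf'."""
    plan = {}
    for tp in bgp:
        s, p, o = tp
        if not p.startswith("?") and p in shipped_partitions:
            plan[tp] = "local"   # join locally against the shipped partition
        else:
            plan[tp] = "tpf"     # e.g. variable predicate: server must answer
    return plan

bgp = [("?x", "type", "?c"), ("?x", "?p", "?o")]
plan = plan_shipping(bgp, shipped_partitions={"type"})
```

Shipping a partition costs one bulk transfer but makes every subsequent join over it free of network round trips, which is where the data-transfer savings on queries with large intermediate results come from.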
A New Approach for Fast Processing of SPARQL Queries on RDF Quadruples
Thesis (Ph.D.), School of Computing and Engineering, University of Missouri--Kansas City, 2015. Dissertation advisor: Praveen R. Rao. Includes bibliographic references (pages 87-92).
The Resource Description Framework (RDF) is a standard model for representing
data on the Web. It enables the interchange and machine processing of
data by considering its semantics. While RDF was first proposed with the vision of
enabling the Semantic Web, it has now become popular in domain-specific applications
and the Web. Through advanced RDF technologies, one can perform semantic
reasoning over data and extract knowledge in domains such as healthcare, biopharmaceuticals,
defense, and intelligence. Popular approaches like RDF-3X perform poorly
on RDF datasets containing billions of triples when the queries are large and complex.
This is because of the large number of join operations that must be performed
during query processing. Moreover, most of the scalable approaches were designed
to operate on RDF triples instead of quads. To address these issues, we propose to
develop a new approach for fast and cost-effective processing of SPARQL queries on
large RDF datasets containing RDF quadruples (or quads). Our approach employs
a decrease-and-conquer strategy: Rather than indexing the entire RDF dataset, it
identifies groups of similar RDF graphs and indexes each group separately. During
query processing, it uses a novel filtering index to first identify candidate groups that
may contain matches for the query. On these candidates, it executes queries using a
conventional SPARQL processor to produce the final results. A query optimization
strategy using the candidate groups to further improve the query processing performance
is also used.
Introduction -- Background and motivations -- The design of RIQ -- Implementation of RIQ -- Evaluation -- Conclusion and future work -- Appendix A. Queries -- Appendix B. SPARQL grammar
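The decrease-and-conquer filtering step can be illustrated with a simplified index. RIQ's real filtering index uses pattern vectors and Bloom filters; the sketch below (invented names and data) reduces each group's signature to its set of predicates, and prunes any group that lacks a constant predicate the query mentions:

```python
# Sketch of signature-based group filtering (simplified stand-in for RIQ's
# index): a group of similar RDF graphs is a candidate for a query only if
# it contains every constant predicate the query uses.

def build_filter_index(groups):
    """groups: {group_id: list of (s, p, o) triples}; signature = predicates."""
    return {gid: {p for _, p, _ in triples} for gid, triples in groups.items()}

def candidate_groups(index, query_predicates):
    """Groups whose signature covers all constant predicates of the query."""
    return [gid for gid, sig in index.items() if query_predicates <= sig]

groups = {
    "g1": [("a", "type", "Drug"), ("a", "interactsWith", "b")],
    "g2": [("c", "type", "City"), ("c", "population", "42")],
}
index = build_filter_index(groups)
cands = candidate_groups(index, {"type", "interactsWith"})
```

Only the surviving candidate groups are handed to the conventional SPARQL processor, which is how the approach avoids running large, join-heavy queries over the entire dataset.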
Federated Query Processing over Heterogeneous Data Sources in a Semantic Data Lake
Data provides the basis for emerging scientific and interdisciplinary data-centric applications with the potential of improving the quality of life for citizens. Big Data plays an important role in promoting both manufacturing and scientific development through industrial digitization and emerging interdisciplinary research. Open data initiatives have encouraged the publication of Big Data by exploiting the decentralized nature of the Web, allowing for the availability of heterogeneous data generated and maintained by autonomous data providers. Consequently, the growing volume of data consumed by different applications raises the need for effective data integration approaches able to process a large volume of data represented in different formats, schemas and models, which may also include sensitive data, e.g., financial transactions, medical procedures, or personal data. Data Lakes are composed of heterogeneous data sources in their original formats, which reduces the overhead of materialized data integration. Query processing over Data Lakes requires the semantic description of data collected from heterogeneous data sources. A Data Lake with such semantic annotations is referred to as a Semantic Data Lake. Transforming Big Data into actionable knowledge demands novel and scalable techniques for enabling not only Big Data ingestion and curation into the Semantic Data Lake, but also efficient large-scale semantic data integration, exploration, and discovery. Federated query processing techniques utilize source descriptions to find relevant data sources and efficient execution plans that minimize the total execution time and maximize the completeness of answers. Existing federated query processing engines employ a coarse-grained description model where the semantics encoded in data sources are ignored.
Such descriptions may lead to the erroneous selection of data sources for a query and to unnecessary retrieval of data, thus affecting the performance of the query processing engine. In this thesis, we address the problem of federated query processing against heterogeneous data sources in a Semantic Data Lake. First, we tackle the challenge of knowledge representation and propose a novel source description model, RDF Molecule Templates, that describes knowledge available in a Semantic Data Lake. RDF Molecule Templates (RDF-MTs) describe data sources in terms of an abstract description of entities belonging to the same semantic concept. Then, we propose a technique for data source selection and query decomposition, the MULDER approach, and query planning and optimization techniques, Ontario, that exploit the characteristics of heterogeneous data sources described using RDF-MTs and provide uniform access to heterogeneous data sources. We then address the challenge of enforcing privacy and access control requirements imposed by data providers. We introduce a privacy-aware federated query technique, BOUNCER, able to enforce privacy and access control regulations during query processing over data sources in a Semantic Data Lake. In particular, BOUNCER exploits RDF-MT-based source descriptions in order to express privacy and access control policies as well as their automatic enforcement during source selection, query decomposition, and planning. Furthermore, BOUNCER implements query decomposition and optimization techniques able to identify query plans over data sources that not only contain the relevant entities to answer a query, but are also regulated by policies that allow for accessing these relevant entities. Finally, we tackle the problem of interest-based update propagation and co-evolution of data sources.
We present a novel approach for interest-based RDF update propagation that consistently maintains a full or partial replication of large datasets and deals with co-evolution.
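The role of fine-grained source descriptions in source selection can be sketched with a toy model. The description format below is invented purely for illustration (real RDF-MTs are richer, linking classes, their predicates, and inter-source connections): each source is described by molecule templates, and a triple pattern is routed only to the sources whose templates mention its predicate.

```python
# Hypothetical sketch of RDF-MT-style source selection (invented description
# format): route each triple pattern to the sources whose molecule templates
# cover its predicate, instead of asking every source about every pattern.

def select_sources(bgp, templates):
    """templates: {source: {class_name: set_of_predicates}}."""
    selection = {}
    for tp in bgp:
        _, p, _ = tp
        selection[tp] = {
            src for src, mts in templates.items()
            if p.startswith("?") or any(p in preds for preds in mts.values())
        }
    return selection

templates = {
    "endpointA": {"Drug": {"name", "interactsWith"}},
    "endpointB": {"City": {"name", "population"}},
}
sel = select_sources([("?d", "interactsWith", "?e"),
                      ("?d", "name", "?n")], templates)
```

A coarse-grained description (e.g. only "this source speaks RDF") would select both endpoints for every pattern; the per-concept predicate sets are what let the selector skip `endpointB` for the `interactsWith` pattern and avoid the unnecessary retrieval the thesis describes.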