The Odyssey Approach for Optimizing Federated SPARQL Queries
Answering queries over a federation of SPARQL endpoints requires combining
data from more than one data source. Optimizing queries in such scenarios is
particularly challenging not only because of (i) the large variety of possible
query execution plans that correctly answer the query but also because (ii)
there is only limited access to statistics about schema and instance data of
remote sources. To overcome these challenges, most federated query engines rely
on heuristics to reduce the space of possible query execution plans or on
dynamic programming strategies to produce optimal plans. Nevertheless, these
plans may still exhibit a high number of intermediate results or high execution
times because of heuristics and inaccurate cost estimations. In this paper, we
present Odyssey, an approach that uses statistics that allow for a more
accurate cost estimation for federated queries and therefore enables Odyssey to
produce better query execution plans. Our experimental results show that
Odyssey produces query execution plans that are better in terms of data
transfer and execution time than state-of-the-art optimizers. Our experiments
using the FedBench benchmark show execution time gains of at least 25 times on
average. Comment: 16 pages, 10 figures
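The role that cost estimates play in choosing a plan can be illustrated with a minimal sketch of dynamic-programming join ordering. This is not Odyssey's algorithm or cost model: the function name `best_join_order`, the per-pattern cardinalities, the pairwise selectivities, and the cost measure (sum of estimated intermediate result sizes) are all assumptions made for illustration.

```python
from itertools import combinations


def best_join_order(card, sel):
    """Pick a left-deep join order minimising estimated intermediate results.

    card: dict mapping a triple-pattern name to its estimated cardinality.
    sel:  dict mapping frozenset({a, b}) to an estimated join selectivity;
          a missing pair is treated as a cross product (selectivity 1.0).
    Both inputs are hypothetical statistics, not values from the paper.
    Returns (cost, estimated_result_size, order).
    """
    names = sorted(card)
    # best[subset] = (cost so far, estimated size of the subset's result, order)
    best = {frozenset([n]): (0.0, float(card[n]), (n,)) for n in names}
    for k in range(2, len(names) + 1):
        for combo in combinations(names, k):
            subset = frozenset(combo)
            for last in combo:
                rest = subset - {last}
                cost, size, order = best[rest]
                s = 1.0
                for other in rest:
                    s *= sel.get(frozenset((other, last)), 1.0)
                new_size = size * card[last] * s
                candidate = (cost + new_size, new_size, order + (last,))
                if subset not in best or candidate[0] < best[subset][0]:
                    best[subset] = candidate
    return best[frozenset(names)]


# Hypothetical statistics for three triple patterns.
example_card = {"tp1": 1000, "tp2": 10, "tp3": 100}
example_sel = {
    frozenset(("tp1", "tp2")): 0.001,
    frozenset(("tp2", "tp3")): 0.01,
}
example_plan = best_join_order(example_card, example_sel)
```

With these made-up numbers the search avoids starting with the tp1/tp3 cross product, which shows how inaccurate cardinalities or selectivities would directly change the chosen plan.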
Exploiting emergent schemas to make RDF systems more efficient
We build on our earlier finding that more than 95% of the
triples in actual RDF triple graphs have a remarkably tabular structure,
whose schema does not necessarily follow from explicit metadata such as
ontologies, but which an RDF store can automatically derive by looking
at the data using so-called “emergent schema” detection techniques.
In this paper we investigate how computers and in particular RDF stores
can take advantage of this emergent schema to more compactly store
RDF data and more efficiently optimize and execute SPARQL queries.
To this end, we contribute techniques for efficient emergent schema aware
RDF storage and new query operator algorithms for emergent schema
aware scans and joins. In all, these techniques allow RDF schema processors
to fully catch up with relational database techniques in terms of rich
physical database design options and efficiency, without requiring a rigid
upfront schema structure definition.
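The idea of deriving a tabular schema from the data alone can be sketched by grouping subjects that share the same set of properties. This is a deliberately simplified stand-in for the detection techniques the abstract refers to, not the authors' method: the function name `emergent_tables`, the `min_rows` threshold, and the choice to keep only the first value of a multi-valued property are assumptions.

```python
from collections import defaultdict


def emergent_tables(triples, min_rows=2):
    """Group subjects by their exact property set and emit one table per group.

    triples:  iterable of (subject, property, object) tuples.
    min_rows: hypothetical threshold; smaller groups are treated as irregular.
    Returns (tables, leftovers), where tables maps a tuple of column names to
    rows of the form (subject, value_1, ..., value_n).
    """
    props_by_subject = defaultdict(dict)
    for s, p, o in triples:
        # Simplification: keep only the first value of a multi-valued property.
        props_by_subject[s].setdefault(p, o)

    groups = defaultdict(list)
    for s, props in props_by_subject.items():
        groups[frozenset(props)].append(s)

    tables, leftovers = {}, []
    for propset, subjects in groups.items():
        if len(subjects) >= min_rows:
            cols = sorted(propset)
            tables[tuple(cols)] = [
                (s, *[props_by_subject[s][c] for c in cols])
                for s in sorted(subjects)
            ]
        else:
            leftovers.extend(subjects)
    return tables, sorted(leftovers)


# Made-up example data.
example_triples = [
    ("s1", "name", "A"), ("s1", "age", "30"),
    ("s2", "name", "B"), ("s2", "age", "41"),
    ("s3", "label", "X"),
]
example_tables, example_leftovers = emergent_tables(example_triples)
```

In this example `s1` and `s2` form a two-column table, while `s3` remains in the irregular remainder that would still be stored as plain triples.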
Self-organizing Structured RDF in MonetDB
The semantic web uses RDF as its data model, providing ultimate flexibility
for users to represent and evolve data without need of a schema.
Yet, this flexibility poses challenges in implementing efficient RDF
stores, ranging from plans with very many self-joins on a triple table and
difficulties in optimizing these, to a lack of data locality, since without
a notion of multi-attribute data structure, clustered indexing opportunities are lost.
Apart from performance issues, users of huge RDF graphs often have problems
formulating queries as they lack any system-supported notion of the structure in the data.
In this research, we exploit the observation that real RDF data, while not as regularly
structured as relational data, still has the great majority of triples conforming to regular patterns.
We conjecture that a system that would recognize this structure automatically
would both allow RDF stores to become more efficient and also easier to use.
Concretely, we propose to derive self-organizing RDF that stores data
in PSO format in such a way that the regular parts of the data physically
correspond to relational columnar storage; and propose RDFscan/RDFjoin algorithms
that compute star-patterns over these without wasting effort in self-joins.
These regular parts, i.e. tables, are identified on ingestion by a schema discovery
algorithm -- as such users will gain an SQL view of the regular part of the RDF data.
This research aims to produce a state-of-the-art SPARQL frontend for MonetDB
as a by-product, and we already present some preliminary results on this platform.
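The contrast between evaluating a star pattern with self-joins on a triple table and reading it from column-aligned storage can be sketched as follows. These functions are illustrative only and are not the RDFscan/RDFjoin algorithms; the names `star_via_self_joins` and `star_via_aligned_columns`, and the assumption that every property column of a regular table is positionally aligned with one subject list, are hypothetical.

```python
def star_via_self_joins(triples, props):
    """Evaluate a star pattern by one selection per property, joined on subject."""
    per_prop = [{s: o for s, p, o in triples if p == prop} for prop in props]
    subjects = set(per_prop[0])
    for mapping in per_prop[1:]:
        subjects &= set(mapping)  # each extra property costs another join
    return sorted((s, *[m[s] for m in per_prop]) for s in subjects)


def star_via_aligned_columns(subjects, columns, props):
    """Evaluate the same star pattern over columns aligned with `subjects`.

    columns: dict mapping a property to a list of objects whose i-th entry
             belongs to subjects[i] (the assumed 'regular' part of the data).
    No join is needed: a row is simply the i-th entry of every column.
    """
    return sorted(zip(subjects, *(columns[p] for p in props)))


# Made-up example data stored both ways.
example_triples = [
    ("s1", "name", "A"), ("s1", "age", "30"),
    ("s2", "name", "B"), ("s2", "age", "41"),
]
example_subjects = ["s1", "s2"]
example_columns = {"name": ["A", "B"], "age": ["30", "41"]}
```

Both functions return the same rows for the example data; the second does so with a single positional pass instead of one join per additional property.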
Heuristics-based query optimisation for SPARQL
Query optimization in RDF stores is a challenging problem, as SPARQL queries typically contain many more joins than equivalent relational plans and hence lead to a large join order search space. In such cases, cost-based query optimization often is not possible. One practical reason for this is that statistics typically are missing in web-scale settings such as the Linked Open Datasets (LOD). The more profound reason is that, due to the absence of schematic structure in RDF, join-hit ratio estimation requires complicated forms of correlated join statistics, and currently there are no methods to identify the relevant correlations beforehand. For this reason, the use of good heuristics is essential in SPARQL query optimization, even when they are only partially combined with cost-based statistics (i.e., hybrid query optimization). In this paper we describe a set of useful heuristics for SPARQL query optimizers. We present these in the context of a new Heuristic SPARQL Planner (HSP) that is capable of exploiting the syntactic and the structural variations of the triple patterns in a SPARQL query in order to choose an execution plan without the need of any cost model. For this, we define the variable graph and we show a reduction of the SPARQL query optimization problem to the maximum weight independent set problem.
We implemented our planner on top of the MonetDB open source column-store and evaluated its effectiveness against the state-of-the-art RDF-3X engine, and also compared the plan quality with a relational (SQL) equivalent of the benchmarks.
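The reduction to a maximum weight independent set can be illustrated with a small sketch. The abstract does not define the variable graph, so the construction here is an assumption: nodes are variables that occur in at least two triple patterns, a node's weight is the number of patterns it occurs in, and two variables are connected when they share a triple pattern. The function names are hypothetical and the brute-force search is only suitable for tiny queries.

```python
from itertools import combinations


def variable_graph(patterns):
    """Build an (assumed) variable graph from triple patterns.

    patterns: list of (subject, predicate, object) strings; variables start with '?'.
    Returns (weights, edges): weights maps each join variable to the number of
    patterns containing it; edges is a set of frozenset pairs of join variables
    that co-occur in a pattern.
    """
    weights, edges = {}, set()
    for tp in patterns:
        variables = sorted({t for t in tp if t.startswith("?")})
        for v in variables:
            weights[v] = weights.get(v, 0) + 1
        edges.update(frozenset(pair) for pair in combinations(variables, 2))
    # Keep only join variables, i.e. those shared by two or more patterns.
    weights = {v: w for v, w in weights.items() if w >= 2}
    edges = {e for e in edges if all(v in weights for v in e)}
    return weights, edges


def max_weight_independent_set(weights, edges):
    """Exhaustive search for a maximum weight independent set (exponential time)."""
    nodes = sorted(weights)
    best, best_weight = (), 0
    for k in range(1, len(nodes) + 1):
        for combo in combinations(nodes, k):
            if any(frozenset(pair) in edges for pair in combinations(combo, 2)):
                continue  # two chosen variables share a pattern: not independent
            weight = sum(weights[v] for v in combo)
            if weight > best_weight:
                best, best_weight = combo, weight
    return set(best), best_weight


# Made-up example query.
example_patterns = [
    ("?x", "rdf:type", "ex:Person"),
    ("?x", "ex:name", "?n"),
    ("?x", "ex:worksFor", "?o"),
    ("?o", "ex:locatedIn", "?c"),
    ("?c", "ex:name", "?cn"),
]
example_weights, example_edges = variable_graph(example_patterns)
example_choice = max_weight_independent_set(example_weights, example_edges)
```

For this query the join variables are `?x`, `?o`, and `?c`, and the selected set is `{?x, ?c}`: the variables that cover the most triple patterns without competing for the same pattern.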