6 research outputs found
Optimisation techniques for flexible SPARQL queries
RDF datasets can be queried using the SPARQL language but are often irregularly structured and incomplete, which may make precise query formulation hard for users. The SPARQLAR language extends SPARQL 1.1 with two operators, APPROX and RELAX, so as to allow flexible querying over property paths. These operators encapsulate different dimensions of query flexibility, namely approximation and generalisation, and they allow users to query complex, heterogeneous knowledge graphs without needing to know precisely how the data is structured. Earlier work has described the syntax, semantics and complexity of SPARQLAR and demonstrated its practical feasibility, but has also highlighted the need for improving the speed of query evaluation. In the present paper, we focus on the design of two optimisation techniques targeted at speeding up the execution of SPARQLAR queries and on their empirical evaluation on three knowledge graphs: LUBM, DBpedia and YAGO. We show that applying these optimisations can result in substantial improvements in the execution times of longer-running queries (sometimes by one or more orders of magnitude) without incurring significant performance penalties for fast queries.
Combining approximation and relaxation in semantic web path queries
We develop query relaxation techniques for regular path queries and combine them with query approximation in order to support flexible querying of RDF data when the user lacks knowledge of its full structure or where the structure is irregular. In such circumstances, it is helpful if the querying system can perform both approximate matching and relaxation of the user’s query and can rank the answers according to how closely they match the original query. Our framework incorporates both standard notions of approximation based on edit distance and RDFS-based inference rules. The query language we adopt comprises conjunctions of regular path queries, thus including extensions proposed for SPARQL to allow for querying paths using regular expressions. We provide an incremental query evaluation algorithm which runs in polynomial time and returns answers to the user in ranked order.
Rank-aware, Approximate Query Processing on the Semantic Web
Search over the Semantic Web corpus frequently leads to queries having large result sets. So, in order to discover relevant data elements, users must rely on ranking techniques to sort results according to their relevance. At the same time, applications often deal with information needs that do not require complete and exact results. In this thesis, we address the problem of how to process queries over Web data in an approximate and rank-aware fashion.
Workload-sensitive approaches to improving graph data partitioning online
PhD Thesis
Many modern applications, from social networks to network security tools, rely upon
the graph data model, using it as part of an offline analytics pipeline or, increasingly,
for storing and querying data online, e.g. in a graph database management system
(GDBMS). Unfortunately, effective horizontal scaling of this graph data reduces to
the NP-Hard problem of “k-way balanced graph partitioning”.
Owing to the problem’s importance, several practical approaches exist, producing quality graph partitionings. However, these existing systems are unsuitable for partitioning
online graphs, either introducing unnecessary network latency during query processing, being unable to efficiently adapt to changing data and query workloads, or both.
In this thesis we propose partitioning techniques which are efficient and sensitive to
given query workloads, suitable for application to online graphs and query
workloads.
To incrementally adapt partitionings in response to workload change, we propose
TAPER: a graph repartitioner. TAPER uses novel data structures to compute the
probability of expensive inter-partition traversals (ipt) from each vertex, given the
current workload of path queries. Subsequently, it iteratively adjusts an initial partitioning by swapping selected vertices amongst partitions, heuristically maintaining low
ipt and high partition quality with respect to that workload. Iterations are inexpensive
thanks to time and space optimisations in the underlying data structures.
To incrementally create partitionings in response to graph growth, we propose Loom:
a streaming graph partitioner. Loom uses another novel data structure to detect common patterns of edge traversals when executing a given workload of pattern matching
queries. Subsequently, it employs a probabilistic graph isomorphism method to incrementally and efficiently compare sub-graphs in the stream of graph updates to
these common patterns. Matches are assigned within individual partitions if possible,
thereby also reducing ipt and increasing partitioning quality w.r.t. the given workload.
Both partitioner and repartitioner are extensively evaluated with real/synthetic graph
datasets and query workloads. The headline results include that TAPER can reduce
ipt by up to 80% over a naive existing partitioning and can maintain this reduction in
the event of workload change, through additional iterations. Meanwhile, Loom reduces
ipt by up to 40% over a state-of-the-art streaming graph partitioner.
Flexible query processing of SPARQL queries
SPARQL is the predominant language for querying RDF data, which is the standard
model for representing web data and more specifically Linked Open Data (a
collection of heterogeneous connected data). Datasets in RDF form can be hard to
query by a user if she does not have full knowledge of the structure of the dataset.
Moreover, many datasets in Linked Data are often extracted from actual web page
content which might lead to incomplete or inaccurate data.
We extend SPARQL 1.1 with two operators, APPROX and RELAX, previously
introduced in the context of regular path queries. Using these operators we are able
to support flexible querying over the property path queries of SPARQL 1.1. We call
this new language SPARQLAR.
Using SPARQLAR users are able to query RDF data without fully knowing the
structure of a dataset. APPROX and RELAX encapsulate different aspects of query flexibility: finding different answers and finding more answers, respectively. This
means that users can access complex and heterogeneous datasets without the need
to know precisely how the data is structured.
One of the open problems we address is how to combine the APPROX and
RELAX operators with a pragmatic language such as SPARQL. We also devise an
implementation of a system that evaluates SPARQLAR queries in order to study the
performance of the new language.
We begin by defining the semantics of SPARQLAR and the complexity of query
evaluation. We then present a query processing technique for evaluating SPARQLAR
queries based on a rewriting algorithm and prove its soundness and completeness.
During the evaluation of a SPARQLAR query we generate multiple SPARQL 1.1
queries that are evaluated against the dataset. Each such query will generate answers
with a cost that indicates their distance with respect to the exact form of the original
SPARQLAR query.
Our prototype implementation incorporates three optimisation techniques that
aim to enhance query execution performance: the first optimisation is a pre-computation
technique that caches the answers of parts of the queries generated by the rewriting
algorithm. These answers will then be reused to avoid the re-execution of those sub-queries. The second optimisation utilises a summary of the dataset to discard
queries that are known not to return any answer. The third optimisation technique
uses the query containment concept to discard queries whose answers would
be returned by another query at the same or lower cost.
We conclude by conducting a performance study of the system on three different
RDF datasets: LUBM (Lehigh University Benchmark), YAGO and DBpedia.
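The cost-ranked evaluation described in the abstract above (one flexible query is rewritten into several exact queries, each carrying a cost that measures its distance from the original) can be illustrated with a minimal Python sketch. This is not the SPARQLAR rewriting algorithm itself: it assumes a toy in-memory triple set, treats a query as a plain sequence of predicate labels, and uses only two unit-cost edits (deleting a label, or replacing it with a wildcard). The names `TRIPLES`, `ANY`, `rewrites`, `evaluate` and `flexible_query` are hypothetical and chosen for illustration only.

```python
import heapq
from itertools import count

# Toy RDF graph as (subject, predicate, object) triples; all data is made up.
TRIPLES = {
    ("alice", "worksAt", "acme"),
    ("acme", "locatedIn", "london"),
    ("bob", "studiesAt", "ucl"),
    ("ucl", "locatedIn", "london"),
}

ANY = "_"  # hypothetical wildcard standing for "any predicate"


def rewrites(path, max_cost):
    """Yield (cost, path) pairs in non-decreasing cost order.

    Each step applies one unit-cost edit: delete a label, or replace a
    label with ANY. This is a simplified stand-in for edit-based
    approximation, not the rewriting rules of any real system.
    """
    tie = count()  # tie-breaker so the heap never compares paths directly
    heap = [(0, next(tie), tuple(path))]
    seen = {tuple(path)}
    while heap:
        cost, _, current = heapq.heappop(heap)
        yield cost, current
        if cost == max_cost:
            continue  # do not expand beyond the cost budget
        for i in range(len(current)):
            candidates = [current[:i] + current[i + 1:]]  # deletion
            if current[i] != ANY:
                candidates.append(current[:i] + (ANY,) + current[i + 1:])
            for candidate in candidates:
                if candidate and candidate not in seen:
                    seen.add(candidate)
                    heapq.heappush(heap, (cost + 1, next(tie), candidate))


def evaluate(path, triples):
    """Return the (start, end) node pairs connected by the predicate path."""
    nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}
    pairs = {(n, n) for n in nodes}
    for label in path:
        pairs = {
            (start, o)
            for start, mid in pairs
            for s, p, o in triples
            if s == mid and (label == ANY or p == label)
        }
    return pairs


def flexible_query(path, triples, max_cost=1):
    """Map each answer to the lowest cost at which some rewrite returns it."""
    best = {}
    for cost, rewritten in rewrites(path, max_cost):
        for answer in evaluate(rewritten, triples):
            best.setdefault(answer, cost)  # first hit is the cheapest
    return best
```

Calling `flexible_query(("worksAt", "locatedIn"), TRIPLES, max_cost=1)` returns the exact answer `("alice", "london")` at cost 0 and additional answers such as `("bob", "london")` at cost 1. Because rewrites are generated in cost order, an answer already returned by a cheaper query is never re-ranked by a more expensive one, which is loosely the intuition behind discarding queries whose answers another query returns at the same or lower cost.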
Ranked Similarity Search of Scientific Datasets: An Information Retrieval Approach
In the past decade, the amount of scientific data collected and generated by scientists has grown dramatically. This growth has intensified an existing problem: in large archives consisting of datasets stored in many files, formats and locations, how can scientists find data relevant to their research interests? We approach this problem in a new way: by adapting Information Retrieval techniques, developed for searching text documents, to the world of (primarily numeric) scientific data. We propose an approach that uses a blend of automated and curated methods to extract metadata from large repositories of scientific data. We then perform searches over this metadata, returning results ranked by similarity to the search criteria. We present a model of this approach, and describe a specific implementation thereof performed at an ocean-observatory data archive and now running in production. Our prototype implements scanners that extract metadata from datasets that contain different kinds of environmental observations, and a search engine with a candidate similarity measure for comparing a set of search terms to the extracted metadata. We evaluate the utility of the prototype by performing two user studies; these studies show that the approach resonates with users, and that our proposed similarity measure performs well when analyzed using standard Information Retrieval evaluation methods. We ran performance tests to explore how continued archive growth will affect our goal of interactive response, developed and applied techniques that mitigate the effects of that growth, and show that the techniques are effective. Lastly, we describe some of the research needed to extend this initial work into a true Google for data.