DBpedia SPARQL Benchmark – Performance Assessment with Real Queries on Real Data
Triple stores are the backbone of increasingly many Data Web applications. It is thus evident that the performance of those stores is mission critical for individual projects as well as for data integration on the Data Web in general. Consequently, it is of central importance during the implementation of any of these applications to have a clear picture of the weaknesses and strengths of current triple store implementations. In this paper, we propose a generic SPARQL benchmark creation procedure, which we apply to the DBpedia knowledge base. Previous approaches often compared relational databases and triple stores and, thus, settled on measuring performance against a relational database which had been converted to RDF by using SQL-like queries. In contrast to those approaches, our benchmark is based on queries that were actually issued by humans and applications against existing RDF data not resembling a relational schema. Our generic procedure for benchmark creation is based on query-log mining, clustering and SPARQL feature analysis. We argue that a pure SPARQL benchmark is more useful to compare existing triple stores and provide results for the popular triple store implementations Virtuoso, Sesame, Jena-TDB, and BigOWLIM. The subsequent comparison of our results with other benchmark results indicates that the performance of triple stores is by far less homogeneous than suggested by previous benchmarks.
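To make the "SPARQL feature analysis" step concrete, here is a minimal Python sketch of how logged queries could be turned into feature counts for subsequent clustering; the feature list, regular expressions, and log format are illustrative assumptions, not the authors' actual tooling.

    import re
    from collections import Counter

    # SPARQL features one might track in such an analysis (illustrative subset).
    FEATURES = {
        "FILTER": r"\bFILTER\b",
        "OPTIONAL": r"\bOPTIONAL\b",
        "UNION": r"\bUNION\b",
        "DISTINCT": r"\bDISTINCT\b",
        "ORDER BY": r"\bORDER\s+BY\b",
        "REGEX": r"\bREGEX\s*\(",
    }

    def feature_vector(query: str) -> dict:
        """Binary feature vector for a single SPARQL query string."""
        q = query.upper()
        return {name: bool(re.search(pattern, q)) for name, pattern in FEATURES.items()}

    def summarize_log(queries: list) -> Counter:
        """Count how often each feature occurs across a query log."""
        counts = Counter()
        for q in queries:
            counts.update(name for name, present in feature_vector(q).items() if present)
        return counts

    log = [
        "SELECT DISTINCT ?p WHERE { <http://dbpedia.org/resource/Berlin> ?p ?o }",
        "SELECT ?s WHERE { ?s a ?c . OPTIONAL { ?s ?p ?o } FILTER regex(str(?o), 'univ') }",
    ]
    print(summarize_log(log))

Feature vectors of this kind are one possible input to the clustering step that groups similar real-world queries into benchmark query templates.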
Substring filtering for low-cost linked data interfaces
Recently, Triple Pattern Fragments (TPFs) were introduced as a low-cost server-side interface for cases where high numbers of clients need to evaluate SPARQL queries. Scalability is achieved by moving part of the query execution to the client, at the cost of elevated query times. Since the TPF interface purposely does not support complex constructs such as SPARQL filters, queries that use them need to be executed mostly on the client, resulting in long execution times. We therefore investigated the impact of adding a literal substring matching feature to the TPF interface, with the goal of improving query performance while maintaining low server cost. In this paper, we discuss the client/server setup and compare the performance of SPARQL queries on multiple implementations, including Elasticsearch and a case-insensitive FM-index. Our evaluations indicate that these improvements allow for faster query execution without significantly increasing the load on the server. Offering the substring feature on TPF servers allows users to obtain faster responses for filter-based SPARQL queries. Furthermore, substring matching can be used to support other filters such as complete regular expressions or range queries.
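A rough sketch of what such an extra interface feature could look like from the client side is given below; the endpoint URL and the objectSubstring parameter name are assumptions made for illustration, not the specified TPF interface.

    import requests

    FRAGMENT_URL = "http://example.org/dbpedia2015"  # hypothetical TPF endpoint

    def plain_fragment(predicate: str, page: int = 1) -> str:
        """Plain TPF request: the client must download all matching triples
        and evaluate any substring FILTER itself."""
        resp = requests.get(FRAGMENT_URL,
                            params={"predicate": predicate, "page": page},
                            headers={"Accept": "application/trig"})
        return resp.text

    def substring_fragment(predicate: str, substring: str, page: int = 1) -> str:
        """With a (hypothetical) substring feature, the server returns only triples
        whose object literal contains the given substring."""
        resp = requests.get(FRAGMENT_URL,
                            params={"predicate": predicate,
                                    "objectSubstring": substring,  # assumed parameter name
                                    "page": page},
                            headers={"Accept": "application/trig"})
        return resp.text

The point of the design is that the second request transfers far fewer triples, so the client-side part of FILTER evaluation shrinks while the server-side work stays cheap.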
On construction, performance, and diversification for structured queries on the semantic desktop
[no abstract]
Efficient Extraction and Query Benchmarking of Wikipedia Data
Knowledge bases are playing an increasingly important role in integrating information between systems and over the Web. Today, most knowledge bases cover only specific domains, they are created by relatively small groups of knowledge engineers, and it is very cost-intensive to keep them up-to-date as domains change. In parallel, Wikipedia has grown into one of the central knowledge sources of mankind and is maintained by thousands of contributors. The DBpedia (http://dbpedia.org) project makes use of this large collaboratively edited knowledge source by extracting structured content from it, interlinking it with other knowledge bases, and making the result publicly available. DBpedia has had a great effect on the Web of Data and has become a crystallization point for it. Furthermore, many companies and researchers use DBpedia and its public services to improve their applications and research approaches.
However, the DBpedia release process is heavy-weight and the releases are sometimes based on data that is several months old. Hence, a strategy to keep DBpedia always in synchronization with Wikipedia is highly desirable. In this thesis we propose the DBpedia Live framework, which reads a continuous stream of updated Wikipedia articles and processes it on-the-fly to obtain RDF data, updating the DBpedia knowledge base with the newly extracted data. DBpedia Live also publishes the newly added/deleted facts in files, in order to enable synchronization between our DBpedia endpoint and other DBpedia mirrors. Moreover, the new DBpedia Live framework incorporates several significant features, e.g. abstract extraction, ontology changes, and changeset publication.
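As an illustration of how a mirror could consume the published added/deleted files, a minimal sketch follows; the file names, gzip packaging, and the local SPARQL Update endpoint are assumptions, not the exact DBpedia Live conventions.

    import gzip
    import requests

    ENDPOINT = "http://localhost:8890/sparql"  # assumed local mirror update endpoint

    def apply_changeset(added_path: str, removed_path: str, graph: str) -> None:
        """Apply one changeset (N-Triples files of added and removed facts) to a mirror."""
        with gzip.open(removed_path, "rt", encoding="utf-8") as f:
            removed = f.read()
        with gzip.open(added_path, "rt", encoding="utf-8") as f:
            added = f.read()
        update = (
            f"DELETE DATA {{ GRAPH <{graph}> {{ {removed} }} }} ;\n"
            f"INSERT DATA {{ GRAPH <{graph}> {{ {added} }} }}"
        )
        requests.post(ENDPOINT, data={"update": update}).raise_for_status()

    # Hypothetical usage with one changeset pair:
    # apply_changeset("000001.added.nt.gz", "000001.removed.nt.gz", "http://dbpedia.org")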
Basically, knowledge bases, including DBpedia, are stored in triplestores in order to facilitate accessing and querying their respective data. Furthermore, triplestores constitute the backbone of increasingly many Data Web applications. It is thus evident that the performance of those stores is mission critical for individual projects as well as for data integration on the Data Web in general. Consequently, it is of central importance during the implementation of any of these applications to have a clear picture of the weaknesses and strengths of current triplestore implementations. We introduce a generic SPARQL benchmark creation procedure, which we apply to the DBpedia knowledge base. Previous approaches often compared relational databases and triplestores and, thus, settled on measuring performance against a relational database which had been converted to RDF by using SQL-like queries. In contrast to those approaches, our benchmark is based on queries that were actually issued by humans and applications against existing RDF data not resembling a relational schema. Our generic procedure for benchmark creation is based on query-log mining, clustering and SPARQL feature analysis. We argue that a pure SPARQL benchmark is more useful to compare existing triplestores and provide results for the popular triplestore implementations Virtuoso, Sesame, Apache Jena-TDB, and BigOWLIM. The subsequent comparison of our results with other benchmark results indicates that the performance of triplestores is by far less homogeneous than suggested by previous benchmarks.
Further, one of the crucial tasks when creating and maintaining knowledge bases is validating their facts and maintaining the quality of their inherent data. This task includes several subtasks, and in this thesis we address two of the major ones, namely fact validation and provenance, and data quality. The fact validation and provenance subtask aims at providing sources for facts in order to ensure correctness and traceability of the provided knowledge. This subtask is often addressed by human curators in a three-step process: issuing appropriate keyword queries for the statement to check using standard search engines, retrieving potentially relevant documents, and screening those documents for relevant content. The drawbacks of this process are manifold. Most importantly, it is very time-consuming, as the experts have to carry out several search processes and must often read several documents. We present DeFacto (Deep Fact Validation), an algorithm for validating facts by finding trustworthy sources for them on the Web. DeFacto aims to provide an effective way of validating facts by supplying the user with relevant excerpts of webpages as well as useful additional information, including a score for the confidence DeFacto has in the correctness of the input fact. The data quality maintenance subtask, on the other hand, aims at evaluating and continuously improving the quality of the data in knowledge bases. We present a methodology for assessing the quality of knowledge bases' data, which comprises a manual and a semi-automatic process. The manual process has two phases: the first is the detection of common quality problems and their representation in a quality problem taxonomy; the second is the evaluation of a large number of individual resources, according to the quality problem taxonomy, via crowdsourcing. This process is accompanied by a tool wherein a user assesses an individual resource and evaluates each fact for correctness. The semi-automatic process involves the generation and verification of schema axioms. We report the results obtained by applying this methodology to DBpedia.
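The three-step curation process that DeFacto automates can be sketched in outline as follows; the search backend and the excerpt-scoring function are placeholders, not the actual DeFacto components.

    def keyword_queries(subject: str, predicate: str, obj: str) -> list:
        """Step 1: derive keyword queries for the fact (simplified verbalisation)."""
        return [f'"{subject}" {predicate} "{obj}"', f"{subject} {predicate} {obj}"]

    def retrieve_documents(query: str, search) -> list:
        """Step 2: retrieve potentially relevant web documents via a search backend."""
        return search(query)  # `search` is a placeholder for any web search API

    def confidence(fact, documents, score_excerpt) -> float:
        """Step 3: screen documents and aggregate excerpt-level evidence into a confidence score."""
        scores = [score_excerpt(fact, doc) for doc in documents]
        return max(scores, default=0.0)

In the manual workflow each of these steps is carried out by an expert; automating them is what makes the validation scalable.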
Graph Processing in Main-Memory Column Stores
Increasingly, both novel and traditional business applications leverage the advantages of a graph data model, such as the offered schema flexibility and an explicit representation of relationships between entities. As a consequence, companies are confronted with the challenge of storing, manipulating, and querying terabytes of graph data for enterprise-critical applications. Although these business applications operate on graph-structured data, they still require direct access to the relational data and typically rely on an RDBMS to keep a single source of truth and access.
Existing solutions performing graph operations on business-critical data either use a combination of SQL and application logic or employ a graph data management system. For the first approach, relying solely on SQL results in poor execution performance caused by the functional mismatch between typical graph operations and the relational algebra. To make matters worse, graph algorithms exhibit a tremendous variety in structure and functionality caused by their often domain-specific implementations, and can therefore hardly be integrated into a database management system other than with custom coding. Since the majority of these enterprise-critical applications exclusively run on relational DBMSs, employing a specialized system for storing and processing graph data is typically not sensible. Besides the maintenance overhead of keeping the systems in sync, combining graph and relational operations is hard to realize as it requires data transfer across system boundaries.
A basic ingredient of graph queries and algorithms are traversal operations, which are a fundamental component of any database management system that aims at storing, manipulating, and querying graph data. Well-established graph traversal algorithms are standalone implementations relying on optimized data structures. Integrating graph traversals as an operator into a database management system requires a tight integration into the existing database environment and the development of new components, such as a graph topology-aware optimizer with accompanying graph statistics, graph-specific secondary index structures to speed up traversals, and an accompanying graph query language.
In this thesis, we introduce and describe GRAPHITE, a hybrid graph-relational data management system. GRAPHITE is a performance-oriented graph data management system built as part of an RDBMS, allowing the processing of graph data to be seamlessly combined with the processing of relational data in the same system. We propose a columnar storage representation for graph data to leverage the already existing and mature data management and query processing infrastructure of relational database management systems. At the core of GRAPHITE we propose an execution engine solely based on set operations and graph traversals.
Our design is driven by the observation that different graph topologies expose different algorithmic requirements to the design of a graph traversal operator. We derive two graph traversal implementations targeting the most common graph topologies and demonstrate how graph-specific statistics can be leveraged to select the optimal physical traversal operator. To accelerate graph traversals, we devise a set of graph-specific, updateable secondary index structures to improve the performance of vertex neighborhood expansion. Finally, we introduce a domain-specific language with an intuitive programming model to extend graph traversals with custom application logic at runtime. We use the LLVM compiler framework to generate efficient code that tightly integrates the user-specified application logic with our highly optimized built-in graph traversal operators.
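To illustrate the combination of set operations and traversals over a columnar adjacency layout, here is a toy Python sketch of a level-synchronous neighborhood expansion; it only mirrors the general idea and is not the GRAPHITE operator implementation.

    # Columnar (CSR-like) adjacency: offsets[v] .. offsets[v+1] indexes into targets.
    offsets = [0, 2, 3, 4, 4]          # 4 vertices
    targets = [1, 2, 3, 3]             # edges: 0->1, 0->2, 1->3, 2->3

    def traverse(start: int, max_hops: int) -> set:
        """Level-synchronous traversal: repeatedly expand the frontier set,
        removing already discovered vertices via a set-difference operation."""
        discovered = {start}
        frontier = {start}
        for _ in range(max_hops):
            expanded = {targets[i] for v in frontier
                        for i in range(offsets[v], offsets[v + 1])}
            frontier = expanded - discovered      # set operation on vertex sets
            if not frontier:
                break
            discovered |= frontier
        return discovered

    print(traverse(0, max_hops=2))   # {0, 1, 2, 3}

Depending on graph topology, the frontier expansion can be implemented differently (e.g., scanning the edge column versus probing per-vertex ranges), which is exactly the choice the physical traversal operators address.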
Our experimental evaluation shows that GRAPHITE can outperform native graph management systems by several orders of magnitude while providing all the features of an RDBMS, such as transaction support, backup and recovery, and security and user management. GRAPHITE thus provides a promising alternative to specialized graph management systems, which lack many of these features and require expensive data replication and maintenance processes.
Thinking outside the graph: scholarly knowledge graph construction leveraging natural language processing
Despite improved digital access to scholarly knowledge in recent decades, scholarly communication remains exclusively document-based.
The document-oriented workflows in science publication have reached the limits of adequacy, as highlighted by recent discussions on the increasing proliferation of scientific literature, the deficiencies of peer review, and the reproducibility crisis.
In this form, scientific knowledge remains locked in representations that are inadequate for machine processing.
As long as scholarly communication remains in this form, we cannot take advantage of all the advancements taking place in machine learning and natural language processing techniques.
Such techniques would facilitate the transformation from purely text-based representations into (semi-)structured semantic descriptions that are interlinked in a collection of big federated graphs.
We are in dire need of a new age of semantically enabled infrastructure adept at storing, manipulating, and querying scholarly knowledge.
Equally important is a suite of machine assistance tools designed to populate, curate, and explore the resulting scholarly knowledge graph.
In this thesis, we address the issue of constructing a scholarly knowledge graph using natural language processing techniques.
First, we tackle the issue of developing a scholarly knowledge graph for structured scholarly communication that can be populated and constructed automatically.
We co-design and co-implement the Open Research Knowledge Graph (ORKG), an infrastructure capable of modeling, storing, and automatically curating scholarly communications.
Then, we propose a method to automatically extract information into knowledge graphs.
With Plumber, we create a framework to dynamically compose open information extraction pipelines based on the input text.
Such pipelines are composed from community-created information extraction components in an effort to consolidate individual research contributions under one umbrella.
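The dynamic composition idea can be sketched as chaining interchangeable components behind a common interface; the component names and state format below are hypothetical, not actual Plumber modules.

    from typing import Callable

    # Each stage maps the running extraction state to an updated state.
    Stage = Callable[[dict], dict]

    def compose(stages: list) -> Stage:
        """Compose individually contributed components into one extraction pipeline."""
        def pipeline(state: dict) -> dict:
            for stage in stages:
                state = stage(state)
            return state
        return pipeline

    def coref_resolver(state):      # hypothetical community component
        state["text"] = state["text"].replace("It", "The model")
        return state

    def triple_extractor(state):    # hypothetical community component
        state["triples"] = [("The model", "achieves", "state-of-the-art results")]
        return state

    run = compose([coref_resolver, triple_extractor])
    print(run({"text": "It achieves state-of-the-art results."}))

Because every component exposes the same interface, the framework can pick a different chain of components for different input texts.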
We further present MORTY as a more targeted approach that leverages automatic text summarization to create, from the scholarly article's text, structured summaries containing all required information.
In contrast to the pipeline approach, MORTY only extracts the information it is instructed to, making it a more valuable tool for various curation and contribution use cases.
Moreover, we study the problem of knowledge graph completion.
With exBERT, we perform knowledge graph completion tasks such as relation and entity prediction on scholarly knowledge graphs by means of textual triple classification.
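A minimal sketch of the textual-triple-classification idea: a candidate triple is verbalised as a short text sequence and scored by a binary text classifier; the serialisation format and the classifier are placeholders, not the released exBERT setup.

    def verbalise(head: str, relation: str, tail: str) -> str:
        """Turn a candidate triple into a text sequence for a sequence classifier."""
        return f"{head} [SEP] {relation} [SEP] {tail}"

    def plausible(head: str, relation: str, tail: str, classifier) -> bool:
        """`classifier` is any text-classification callable returning P(triple is true)."""
        return classifier(verbalise(head, relation, tail)) > 0.5

    # Tail prediction then reduces to ranking candidate tails by classifier score, e.g.:
    # best = max(candidates, key=lambda t: classifier(verbalise(head, relation, t)))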
Lastly, we use the structured descriptions collected from manual and automated sources alike with a question answering approach that builds on the machine-actionable descriptions in the ORKG.
We propose JarvisQA, a question answering interface operating on tabular views of scholarly knowledge graphs, i.e., ORKG comparisons.
JarvisQA is able to answer a variety of natural language questions, and retrieve complex answers on pre-selected sub-graphs.
These contributions are key to the broader agenda of studying the feasibility of natural language processing methods on scholarly knowledge graphs, and lay the foundation for determining which methods can be used in which cases.
Our work indicates the challenges and issues involved in automatically constructing scholarly knowledge graphs, and opens up future research directions.
Ontology Ranking: Finding the Right Ontologies on the Web
Ontology search, which is the process of finding ontologies or ontological terms for user-defined queries from an ontology collection, is an important task for facilitating ontology reuse in ontology engineering. Ontology reuse is desired to avoid the tedious process of building an ontology from scratch and to limit the design of several competing ontologies that represent similar knowledge. Since many organisations in both the private and public sectors are publishing their data in RDF, they increasingly need to find or design ontologies for data annotation and/or integration. In general, multiple ontologies exist for a given domain; therefore, finding the best-matching ontologies or their terms is required to facilitate manual or dynamic ontology selection for both ontology design and data annotation.
Ranking is a crucial component of the ontology retrieval process, which aims at listing the 'relevant' ontologies or their terms as high as possible in the search results to reduce human intervention. Most existing ontology ranking techniques inherit one or more information retrieval ranking parameters. They linearly combine the values of these parameters for each ontology to compute a relevance score against a user query and rank the results in descending order of that score. A significant aspect of achieving an effective ontology ranking model is to develop novel metrics and dynamic techniques that can optimise the relevance score of the most relevant ontology for a user query.
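The linear models referred to here compute, for each candidate ontology, a weighted sum of its retrieval parameter values; a minimal sketch follows, with parameter names and weights chosen purely for illustration.

    def relevance(params: dict, weights: dict) -> float:
        """Linear ranking model: weighted sum of normalised retrieval parameters."""
        return sum(weights[name] * value for name, value in params.items())

    weights = {"text_match": 0.5, "popularity": 0.3, "centrality": 0.2}  # assumed weights
    candidates = {
        "onto_a": {"text_match": 0.9, "popularity": 0.2, "centrality": 0.4},
        "onto_b": {"text_match": 0.6, "popularity": 0.8, "centrality": 0.7},
    }
    ranked = sorted(candidates, key=lambda o: relevance(candidates[o], weights), reverse=True)
    print(ranked)   # ontology identifiers in descending order of relevance score

A learning-to-rank approach, as proposed later in the thesis, would instead learn how such parameters should be weighted and combined from relevance judgements rather than fixing the weights by hand.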
In this thesis, we present extensive research in ontology retrieval and ranking, in which several research gaps in the existing literature are identified and addressed. First, we begin the thesis with a review of the literature and propose a taxonomy of Semantic Web data (i.e., ontologies and linked data) retrieval approaches, which allows us to identify potential research directions in the field. In the remainder of the thesis, we address several of the identified shortcomings in the ontology retrieval domain. We develop a framework for the empirical and comparative evaluation of different ontology ranking solutions, which has not been studied in the literature so far. Second, we propose an effective relationship-based concept retrieval framework and a concept ranking model based on a learning-to-rank approach, which addresses the limitations of the existing linear ranking models. Third, we propose RecOn, a framework that helps users find the best-matching ontologies for a multi-keyword query, where the relevance score of an ontology to the query is computed by formulating and solving the ontology recommendation problem as a linear optimisation problem. Finally, the thesis also reports on an extensive comparative evaluation of our proposed solutions against several other state-of-the-art techniques using real-world ontologies. This thesis will be useful for researchers and practitioners interested in ontology search and in methods and performance benchmarks for ranking approaches to ontology search.
Design of knowledge-based systems for automated deployment of building management services
Despite its high potential, the buildings sector lags behind in reducing its energy demand. Tremendous savings can be achieved by deploying building management services during operation; however, the manual deployment of these services must be undertaken by experts and is a tedious, time- and cost-consuming task. It requires detailed expert knowledge to match the diverse requirements of services with the present constellation of envelope, equipment and automation system in a target building. To enable the widespread deployment of these services, this knowledge-intensive task needs to be automated. Knowledge-based methods solve this task; however, their widespread adoption is hampered, and solutions proposed in the past do not stick to basic principles of state-of-the-art knowledge engineering methods. To fill this gap we present a novel methodological approach for the design of knowledge-based systems for the automated deployment of building management services. The approach covers the essential steps and best practices: (1) representation of terminological knowledge of a building and its systems based on well-established knowledge engineering methods; (2) representation and capturing of assertional knowledge on a real building portfolio based on open standards; and (3) use of the acquired knowledge for the automated deployment of building management services to increase the energy efficiency of buildings during operation. We validate the methodological approach by deploying it in a real-world large-scale European pilot on a diverse portfolio of buildings and a novel set of building management services. In addition, a novel ontology, which reuses and extends existing ontologies, is presented. The authors would like to gratefully acknowledge the generous funding provided by the European Union's Horizon 2020 research and innovation programme through the MOEEBIUS project under grant agreement No. 680517.
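The matching of service requirements against the captured knowledge about a building can be sketched very simply; the service names and data-point labels below are illustrative assumptions, not part of the presented ontology.

    # Hypothetical service requirements: data points a service needs in a building.
    SERVICE_REQUIREMENTS = {
        "night_setback": {"zone_temperature_sensor", "heating_setpoint_actuator"},
        "demand_response": {"power_meter", "hvac_on_off_actuator"},
    }

    # Assertional knowledge about one building, e.g. captured from its automation system.
    building_points = {"zone_temperature_sensor", "heating_setpoint_actuator", "co2_sensor"}

    def deployable_services(points: set) -> list:
        """Return the services whose required data points are all present in the building."""
        return [svc for svc, required in SERVICE_REQUIREMENTS.items()
                if required <= points]

    print(deployable_services(building_points))   # ['night_setback']

In the actual approach this matching is performed over the ontology-based terminological and assertional knowledge rather than flat string sets, but the deployment decision has the same shape: check whether a building satisfies a service's requirements.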
Data integration support for offshore decommissioning waste management
Offshore oil and gas platforms have a design life of about 25 years, whereas the techniques and tools used for managing their data are constantly evolving. Therefore, data captured about platforms during their lifetimes will be in varying forms. Additionally, due to the many stakeholders involved with a facility over its life cycle, the information representation of its components varies. These challenges make data integration difficult. Over the years, the application of data integration technology in the oil and gas industry has focused on meeting the needs of asset life cycle stages other than decommissioning. This is the case because most assets are only now reaching the end of their design lives.
Currently, limited work has been done on integrating life cycle data for offshore decommissioning purposes, and reports by industry stakeholders underscore this need.
This thesis proposes a method for the integration of the common data types relevant in oil and gas decommissioning. The key features of the method are that it (i) ensures semantic homogeneity using knowledge representation languages (Semantic Web) and domain-specific reference data (ISO 15926); and (ii) allows stakeholders to continue to use their current applications. Prototypes of the framework have been implemented using open source software applications and performance measurements made.
The work of this thesis has been motivated by the business case of reusing offshore decommissioning waste items. The framework developed is generic and can be applied whenever there is a need to integrate and query disparate data involving oil and gas assets. The prototypes presented show how the data management challenges associated with assessing the suitability of decommissioned offshore facility items for reuse can be addressed. The performance of the prototypes shows that significant time and effort is saved compared to the state-of-the-art solution. The ability to do this effectively and efficiently during decommissioning will advance the oil and gas industry's transition toward a circular economy and help save on costs.
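The semantic-homogeneity idea, mapping each stakeholder's local vocabulary onto shared reference classes so that one query spans all sources, can be sketched as follows; the class names and mappings are illustrative only, not actual ISO 15926 identifiers.

    # Local vocabularies of two stakeholders mapped onto a shared reference class.
    REFERENCE_MAPPING = {
        ("operator_db", "Xmas tree"): "ref:ValveAssembly",           # illustrative mapping
        ("contractor_db", "christmas_tree_unit"): "ref:ValveAssembly",
    }

    records = [
        {"source": "operator_db", "type": "Xmas tree", "id": "P-101", "condition": "good"},
        {"source": "contractor_db", "type": "christmas_tree_unit", "id": "CT-7", "condition": "corroded"},
    ]

    def find_by_reference_class(ref_class: str) -> list:
        """Query disparate records through the shared reference class they are mapped to."""
        return [r for r in records
                if REFERENCE_MAPPING.get((r["source"], r["type"])) == ref_class]

    # All valve assemblies across stakeholders, regardless of local naming:
    print(find_by_reference_class("ref:ValveAssembly"))

The prototypes realise this idea with Semantic Web languages and ISO 15926 reference data instead of in-memory dictionaries, but the integration principle, resolving local terms to shared reference identifiers before querying, is the same.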