164 research outputs found
Enhancing Virtual Ontology Based Access over Tabular Data with Morph-CSV
Ontology-Based Data Access (OBDA) has traditionally focused on providing a
unified view of heterogeneous datasets, either by materializing integrated data
into RDF or by performing on-the fly querying via SPARQL query translation. In
the specific case of tabular datasets represented as several CSV or Excel
files, query translation approaches have been applied by considering each
source as a single table that can be loaded into a relational database
management system (RDBMS). Nevertheless, constraints over these tables are not
represented; thus, neither consistency among attributes nor indexes over tables
are enforced. As a consequence, efficiency of the SPARQL-to-SQL translation
process may be affected, as well as the completeness of the answers produced
during the evaluation of the generated SQL query. Our work is focused on
applying implicit constraints on the OBDA query translation process over
tabular data. We propose Morph-CSV, a framework for querying tabular data that
exploits information from typical OBDA inputs (e.g., mappings, queries) to
enforce constraints that can be used together with any SPARQL-to-SQL OBDA
engine. Morph-CSV relies on both a constraint component and a set of constraint
operators. For a given set of constraints, the operators are applied to each
type of constraint with the aim of enhancing query completeness and
performance. We evaluate Morph-CSV in several domains: e-commerce with the BSBM
benchmark; transportation with a benchmark using the GTFS dataset from the
Madrid subway; and biology with a use case extracted from the Bio2RDF project.
We compare and report the performance of two SPARQL-to-SQL OBDA engines,
without and with the incorporation of Morph-CSV. The observed results suggest
that Morph-CSV can speed up total query execution time by up to two
orders of magnitude while producing the complete set of query answers.
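The constraint-enforcement idea behind Morph-CSV can be illustrated with a minimal, hypothetical sketch (not taken from the Morph-CSV implementation): a CSV source is loaded into an RDBMS with explicit column types, a uniqueness constraint, and an index, so that the SQL emitted by a SPARQL-to-SQL translator runs against a typed, indexed table rather than raw text columns. The GTFS-like `stops` file and column names are invented for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical CSV source: a GTFS-like "stops" file.
raw = io.StringIO(
    "stop_id,stop_name,stop_lat\n"
    "1,Sol,40.4169\n"
    "2,Atocha,40.4065\n"
)

conn = sqlite3.connect(":memory:")
# Enforce implicit constraints that the raw CSV does not carry:
# typed columns, a primary key, and an index over a filtered attribute.
conn.execute("""CREATE TABLE stops (
    stop_id   INTEGER PRIMARY KEY,  -- uniqueness constraint
    stop_name TEXT NOT NULL,
    stop_lat  REAL)""")
conn.execute("CREATE INDEX idx_stop_name ON stops(stop_name)")

reader = csv.DictReader(raw)
conn.executemany(
    "INSERT INTO stops VALUES (:stop_id, :stop_name, :stop_lat)",
    list(reader))

# SQL as a SPARQL-to-SQL engine might emit it; the index now backs the filter.
rows = conn.execute(
    "SELECT stop_id, stop_lat FROM stops WHERE stop_name = 'Sol'").fetchall()
print(rows)  # [(1, 40.4169)]
```

SQLite's type affinity converts the CSV's string values into integers and reals on insert, which is the kind of consistency-among-attributes guarantee the abstract notes is missing when each file is loaded as an untyped table.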
Spider4SPARQL: A Complex Benchmark for Evaluating Knowledge Graph Question Answering Systems
With the recent spike in the number and availability of Large Language Models
(LLMs), it has become increasingly important to provide large and realistic
benchmarks for evaluating Knowledge Graph Question Answering (KGQA) systems. So
far the majority of benchmarks rely on pattern-based SPARQL query generation
approaches. The subsequent natural language (NL) question generation is
conducted through crowdsourcing or other automated methods, such as rule-based
paraphrasing or NL question templates. Although some of these datasets are of
considerable size, their pitfall lies in their pattern-based generation
approaches, which do not always generalize well to the vague and linguistically
diverse questions asked by humans in real-world contexts.
In this paper, we introduce Spider4SPARQL - a new SPARQL benchmark dataset
featuring 9,693 previously existing, manually generated NL questions and 4,721
unique, novel SPARQL queries of varying complexity. In addition to
the NL/SPARQL pairs, we also provide their corresponding 166 knowledge graphs
and ontologies, which cover 138 different domains. Our complex benchmark
enables novel ways of evaluating the strengths and weaknesses of modern KGQA
systems. We evaluate the benchmark with state-of-the-art KGQA systems as well as
LLMs, which achieve only up to 45% execution accuracy, demonstrating that
Spider4SPARQL is a challenging benchmark for future research.
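Execution accuracy of the kind reported above is typically computed by running the predicted and gold SPARQL queries and comparing their result sets. A minimal sketch, with invented result data standing in for actual query executions:

```python
def execution_accuracy(pairs):
    """Fraction of questions whose predicted query returns exactly the
    gold query's result set (order-insensitive, duplicates ignored)."""
    if not pairs:
        return 0.0
    correct = sum(1 for pred, gold in pairs
                  if set(map(tuple, pred)) == set(map(tuple, gold)))
    return correct / len(pairs)

# Hypothetical (predicted rows, gold rows) for three questions.
pairs = [
    ([("Madrid",)],           [("Madrid",)]),             # exact match
    ([("Sol",), ("Atocha",)], [("Atocha",), ("Sol",)]),   # same set, order differs
    ([("42",)],               [("41",)]),                 # wrong answer
]
print(round(execution_accuracy(pairs), 2))  # 0.67
```

Comparing result sets rather than query strings is what makes the metric tolerant of syntactically different but semantically equivalent SPARQL queries.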
Virtual Knowledge Graphs: An Overview of Systems and Use Cases
In this paper, we present the virtual knowledge graph (VKG) paradigm for data integration and access, also known in the literature as Ontology-based Data Access. Instead of structuring the integration layer as a collection of relational tables, the VKG paradigm replaces the rigid structure of tables with the flexibility of graphs that are kept virtual and embed domain knowledge. We explain the main notions of this paradigm, its tooling ecosystem, and significant use cases in a wide range of applications. Finally, we discuss future research directions.
Semantic Data Management in Data Lakes
In recent years, data lakes emerged as a way to manage large amounts of
heterogeneous data for modern data analytics. One way to prevent data lakes
from turning into inoperable data swamps is semantic data management. Some
approaches propose the linkage of metadata to knowledge graphs based on the
Linked Data principles to provide more meaning and semantics to the data in the
lake. Such a semantic layer may be utilized not only for data management but
also to tackle the problem of data integration from heterogeneous sources, in
order to make data access more expressive and interoperable. In this survey, we
review recent approaches with a specific focus on the application within data
lake systems and scalability to Big Data. We classify the approaches into (i)
basic semantic data management, (ii) semantic modeling approaches for enriching
metadata in data lakes, and (iii) methods for ontology-based data access. In
each category, we cover the main techniques and their background, and compare
latest research. Finally, we point out challenges for future work in this
research area, which needs a closer integration of Big Data and Semantic Web
technologies.
Towards Analytics Aware Ontology Based Access to Static and Streaming Data (Extended Version)
Real-time analytics that requires integration and aggregation of
heterogeneous and distributed streaming and static data is a typical task in
many industrial scenarios such as diagnostics of turbines at Siemens. The OBDA
approach has great potential to facilitate such tasks; however, it has a
number of limitations in dealing with analytics that restrict its use in
important industrial applications. Based on our experience with Siemens, we
argue that in order to overcome those limitations OBDA should be extended and
become analytics, source, and cost aware. In this work we propose such an
extension. In particular, we propose an ontology, mapping, and query language
for OBDA, where aggregate and other analytical functions are first class
citizens. Moreover, we develop query optimisation techniques that enable
efficient processing of analytical tasks over static and streaming data. We
implement our approach in a system and evaluate our system with Siemens turbine
data.
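The kind of analytics-aware processing the paper targets can be sketched as a windowed aggregate over streaming measurements joined with static asset metadata. The sketch below is a toy illustration under invented assumptions (sensor-to-turbine mapping, temperature readings, window size, alert threshold), not the paper's system:

```python
from collections import defaultdict, deque

# Static data: sensor -> turbine, as would come from relational tables.
sensor_to_turbine = {"s1": "T100", "s2": "T100", "s3": "T200"}

# Streaming data: (sensor_id, temperature) readings arriving in order.
stream = [("s1", 70.0), ("s2", 74.0), ("s3", 90.0),
          ("s1", 72.0), ("s3", 94.0), ("s2", 76.0)]

WINDOW = 4  # keep the last 4 readings per turbine

windows = defaultdict(lambda: deque(maxlen=WINDOW))
alerts = []
for sensor, temp in stream:
    turbine = sensor_to_turbine[sensor]   # join streaming with static data
    w = windows[turbine]
    w.append(temp)
    avg = sum(w) / len(w)                 # aggregate over the sliding window
    if avg > 90.0:                        # analytical condition (diagnostics)
        alerts.append((turbine, round(avg, 1)))

print(alerts)  # [('T200', 92.0)]
```

Treating the aggregate (`avg`) as a first-class part of the query, rather than post-processing outside the OBDA engine, is the essence of the "analytics-aware" extension the abstract argues for.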
Optique: Zooming in on Big Data
Despite the dramatic growth of data accumulated by enterprises, obtaining value out of it is extremely challenging. In particular, the data access bottleneck prevents domain experts from getting the right piece of data within a constrained time frame. The Optique Platform unlocks access to Big Data by giving end users support for directly formulating their information needs through an intuitive visual query interface. The submitted query is then transformed into highly optimized queries over the data sources, which may include streaming data, exploiting massive parallelism in the backend whenever possible. The Optique Platform thus responds to one major challenge posed by Big Data in data-intensive industrial settings.