Federated Query Processing over Heterogeneous Data Sources in a Semantic Data Lake
Data provides the basis for emerging scientific and interdisciplinary data-centric applications with the potential of improving the quality of life for citizens. Big Data plays an important role in promoting both manufacturing and scientific development through industrial digitization and emerging interdisciplinary research. Open data initiatives have encouraged the publication of Big Data by exploiting the decentralized nature of the Web, allowing for the availability of heterogeneous data generated and maintained by autonomous data providers. Consequently, the growing volume of data consumed by different applications raises the need for effective data integration approaches able to process large volumes of data represented in different formats, schemas, and models, which may also include sensitive data, e.g., financial transactions, medical procedures, or personal data. Data Lakes are composed of heterogeneous data sources kept in their original format, which reduces the overhead of materialized data integration. Query processing over Data Lakes requires the semantic description of the data collected from heterogeneous data sources. A Data Lake with such semantic annotations is referred to as a Semantic Data Lake. Transforming Big Data into actionable knowledge demands novel and scalable techniques for enabling not only Big Data ingestion and curation into the Semantic Data Lake, but also efficient large-scale semantic data integration, exploration, and discovery. Federated query processing techniques utilize source descriptions to identify relevant data sources and to find efficient execution plans that minimize the total execution time and maximize the completeness of answers. Existing federated query processing engines employ a coarse-grained description model in which the semantics encoded in the data sources are ignored.
Such descriptions may lead to the erroneous selection of data sources for a query and to the unnecessary retrieval of data, thus affecting the performance of the query processing engine. In this thesis, we address the problem of federated query processing against heterogeneous data sources in a Semantic Data Lake. First, we tackle the challenge of knowledge representation and propose a novel source description model, RDF Molecule Templates, to describe the knowledge available in a Semantic Data Lake. RDF Molecule Templates (RDF-MTs) describe data sources in terms of an abstract description of entities belonging to the same semantic concept. Then, we propose a technique for data source selection and query decomposition, the MULDER approach, as well as query planning and optimization techniques, Ontario, that exploit the characteristics of heterogeneous data sources described using RDF-MTs and provide uniform access to heterogeneous data sources. We then address the challenge of enforcing privacy and access control requirements imposed by data providers. We introduce a privacy-aware federated query technique, BOUNCER, able to enforce privacy and access control regulations during query processing over data sources in a Semantic Data Lake. In particular, BOUNCER exploits RDF-MT-based source descriptions to express privacy and access control policies as well as to enforce them automatically during source selection, query decomposition, and planning. Furthermore, BOUNCER implements query decomposition and optimization techniques able to identify query plans over data sources that not only contain the relevant entities to answer a query, but are also regulated by policies that allow access to these entities. Finally, we tackle the problem of interest-based update propagation and co-evolution of data sources.
We present a novel approach for interest-based RDF update propagation that consistently maintains a full or partial replication of large datasets and deals with co-evolution.
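The role RDF-MTs play in source selection can be sketched in a few lines of Python. This is a minimal illustration, not the actual MULDER/Ontario data structures: the class, predicate, and source names are invented, and a template is reduced to an RDF class, its predicates, and the sources that can answer it.

```python
from dataclasses import dataclass

@dataclass
class RDFMoleculeTemplate:
    """Abstract description of entities sharing one semantic concept (sketch)."""
    rdf_class: str    # e.g. an RDF class in the unified schema
    predicates: set   # predicates observed for entities of this class
    sources: set      # data sources (wrappers) that can answer this template

def select_sources(templates, star_predicates):
    """Pick sources whose template covers all predicates of a star-shaped subquery."""
    return {
        src
        for mt in templates
        if star_predicates <= mt.predicates
        for src in mt.sources
    }

# Two hypothetical sources describing the same concept with different coverage.
mts = [
    RDFMoleculeTemplate("ex:Drug", {"ex:name", "ex:interactsWith"}, {"mysql1"}),
    RDFMoleculeTemplate("ex:Drug", {"ex:name"}, {"csv1"}),
]
print(select_sources(mts, {"ex:name", "ex:interactsWith"}))  # {'mysql1'}
```

A query asking only for `ex:name` would be routed to both sources, while the richer star pattern prunes the CSV source, which is the kind of fine-grained pruning a coarse-grained description model cannot do.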
Compliance Using Metadata
Everybody talks about the data economy. Data is collected, stored, processed, and re-used. In the EU, the GDPR creates a framework with conditions (e.g., consent) for the processing of personal data. But there are also other legal provisions containing requirements and conditions for the processing of data. Even today, most of those are hard-coded into workflows or database schemas, if at all. Data lakes are polluted with unusable data because nobody knows about usage rights or data quality. The approach presented here makes the data lake intelligent: it remembers usage limitations and promises made to the data subject or the contractual partner. Data can be used because the associated risk can be assessed. Such a system easily reacts to new requirements. If processing is recorded back into the data lake, this information makes it possible to prove compliance, which can be shown to authorities on demand as an audit trail. The concept is best exemplified by the SPECIAL project https://specialprivacy.eu (Scalable Policy-aware Linked Data Architecture For Privacy, Transparency and Compliance). SPECIAL has several use cases, but the basic framework is applicable beyond those cases.
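The core check, does a processing request fall within the promises recorded for the data, can be sketched as follows. This is a deliberate simplification: SPECIAL's actual policies are expressed in richer, ontology-based vocabularies, and the fields and names below are illustrative only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UsagePolicy:
    """Simplified consent record attached to data in the lake (illustrative)."""
    purposes: frozenset    # purposes the data subject consented to
    recipients: frozenset  # parties allowed to process the data

def is_permitted(policy, purpose, recipient):
    """A request complies if both its purpose and recipient fall within consent."""
    return purpose in policy.purposes and recipient in policy.recipients

consent = UsagePolicy(frozenset({"billing", "service"}), frozenset({"controller"}))
print(is_permitted(consent, "billing", "controller"))    # True
print(is_permitted(consent, "marketing", "controller"))  # False
```

Recording each such decision alongside the data is what yields the audit trail described above.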
Semantic Data Management in Data Lakes
In recent years, data lakes emerged as a way to manage large amounts of
heterogeneous data for modern data analytics. One way to prevent data lakes
from turning into inoperable data swamps is semantic data management. Some
approaches propose the linkage of metadata to knowledge graphs based on the
Linked Data principles to provide more meaning and semantics to the data in the
lake. Such a semantic layer may be utilized not only for data management but
also to tackle the problem of data integration from heterogeneous sources, in
order to make data access more expressive and interoperable. In this survey, we
review recent approaches with a specific focus on the application within data
lake systems and scalability to Big Data. We classify the approaches into (i)
basic semantic data management, (ii) semantic modeling approaches for enriching
metadata in data lakes, and (iii) methods for ontology-based data access. In
each category, we cover the main techniques and their background, and compare
the latest research. Finally, we point out challenges for future work in this
research area, which needs a closer integration of Big Data and Semantic Web
technologies.
How to feed the Squerall with RDF and other data nuts?
Advances in Data Management methods have resulted in a wide array of storage solutions with varying query capabilities and support for different data formats. Traditionally, heterogeneous data was transformed off-line into a unique format and migrated to a unique data management system before being uniformly queried. However, with the increasing number of heterogeneous data sources, many of which are dynamic, modern applications prefer to access the original, fresh data directly. Addressing this requirement, we designed and developed Squerall, a software framework that enables querying original large and heterogeneous data on-the-fly without prior data transformation. Squerall is built from the ground up with extensibility in mind, e.g., for supporting more data sources. Here, we explain Squerall's extensibility aspect and demonstrate step by step how to add support for RDF data, a new extension to the previously supported range of data sources.
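Squerall itself is implemented in Scala, so the following Python sketch only illustrates the extensibility pattern the demonstration describes: a wrapper interface plus a registry, so that a new source type (here, RDF) is added by registering one more connector without touching the engine core. All class and method names are hypothetical.

```python
class Connector:
    """Hypothetical wrapper interface: load a source into an intermediate
    representation that the engine can then join uniformly."""
    def load(self, path):
        raise NotImplementedError

class CSVConnector(Connector):
    def load(self, path):
        return f"relation from CSV at {path}"

class RDFConnector(Connector):
    """The 'new data nut': RDF support added as just another connector."""
    def load(self, path):
        return f"relation from RDF triples at {path}"

CONNECTORS = {}

def register(fmt, connector):
    """Extension point: make a format queryable by registering its wrapper."""
    CONNECTORS[fmt] = connector

register("csv", CSVConnector())
register("rdf", RDFConnector())  # the extension step described in the paper

print(CONNECTORS["rdf"].load("drugs.nt"))  # relation from RDF triples at drugs.nt
```

The engine only ever talks to the `Connector` interface, which is why adding a format does not require changing query processing itself.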
Conceptual Model of a Federated Data Lake
Valuable insights are frequently only available after combining and analysing data from multiple sources. This paper presents a Conceptual Model of a Federated Data Lake as a contribution to formalizing the required components and their relationships, in order to identify and address them in the implementation of a comprehensive system. Such a system supports on-the-fly query processing over multiple heterogeneous sources and provides adequate data management by highlighting the concepts of a Data Lake and by focusing on the Metadata Management domain as an engine for the integration of several Data Lakes.
Query Optimization Techniques For Scaling Up To Data Variety
Even though Data Lakes are efficient in terms of data storage, they increase the complexity of query processing; this can lead to expensive query execution. Hence, novel techniques for generating query execution plans are needed. Those techniques have to be able to exploit the main characteristics of Data Lakes. Ontario is a federated query engine capable of processing queries over heterogeneous data sources. Ontario uses source descriptions based on RDF Molecule Templates, i.e., abstract descriptions of the properties belonging to the entities in the unified schema of the data in the Data Lake. This thesis proposes new heuristics tailored to the problem of query processing over heterogeneous data sources, including heuristics specifically designed for certain data models. The proposed heuristics are integrated into the Ontario query optimizer. Ontario is compared to state-of-the-art RDF query engines in order to study the overhead introduced by considering heterogeneity during query processing. The results of the empirical evaluation suggest that there is no significant overhead when considering heterogeneity. Furthermore, the baseline version of Ontario is compared to two different sets of additional heuristics, i.e., heuristics specifically designed for certain data models and heuristics that do not consider the data model. The analysis of the obtained experimental results shows that source-specific heuristics are able to improve query performance. Ontario's optimization techniques generate effective and efficient query plans that can be executed over heterogeneous data sources in a Data Lake.
Optimizing Federated Queries Based on the Physical Design of a Data Lake
The optimization of query execution plans is known to be crucial
for reducing the query execution time. In particular, query optimization
has been studied thoroughly for relational databases
over the past decades. Recently, the Resource Description Framework
(RDF) became popular for publishing data on the Web. As
a consequence, federations composed of different data models
like RDF and relational databases evolved. One type of these
federations is the Semantic Data Lake, where every data source is
kept in its original data model and semantically annotated with
ontologies or controlled vocabularies. However, state-of-the-art
query engines for federated query processing over Semantic Data
Lakes often rely on optimization techniques tailored for RDF. In
this paper, we present query optimization techniques guided
by heuristics that take the physical design of a Data Lake into
account. The heuristics are implemented on top of Ontario, a
SPARQL query engine for Semantic Data Lakes. Using source-specific
heuristics, the query engine is able to generate more efficient
query execution plans by exploiting the knowledge about
indexes and normalization in relational databases. We show that
heuristics which take the physical design of the Data Lake into
account are able to speed up query processing.
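One such physical-design-aware heuristic can be sketched as follows: when two star-shaped subqueries resolve to the same relational source, the join is pushed down into a single SQL query so the RDBMS can exploit its indexes; otherwise the federated engine joins the intermediate results itself. The plan representation below is invented for illustration and is not Ontario's internal one.

```python
def plan_join(left, right):
    """Decide where a join between two subqueries is executed (sketch)."""
    same_source = left["source"] == right["source"]
    relational = left["model"] == right["model"] == "relational"
    if same_source and relational:
        # Push the join into the RDBMS, where indexes and normalization apply.
        return {"op": "pushdown_join", "source": left["source"]}
    # Otherwise join the intermediate results inside the federated engine.
    return {"op": "engine_join", "children": [left, right]}

a = {"source": "mysql1", "model": "relational"}
b = {"source": "mysql1", "model": "relational"}
c = {"source": "fuseki1", "model": "rdf"}
print(plan_join(a, b)["op"])  # pushdown_join
print(plan_join(a, c)["op"])  # engine_join
```

The point of the heuristic is exactly the speed-up reported above: shipping one SQL join to the source avoids transferring two large intermediate results to the engine.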
Enhancing Virtual Ontology Based Access over Tabular Data with Morph-CSV
Ontology-Based Data Access (OBDA) has traditionally focused on providing a
unified view of heterogeneous datasets, either by materializing integrated data
into RDF or by performing on-the-fly querying via SPARQL query translation. In
the specific case of tabular datasets represented as several CSV or Excel
files, query translation approaches have been applied by considering each
source as a single table that can be loaded into a relational database
management system (RDBMS). Nevertheless, constraints over these tables are not
represented; thus, neither consistency among attributes nor indexes over tables
are enforced. As a consequence, efficiency of the SPARQL-to-SQL translation
process may be affected, as well as the completeness of the answers produced
during the evaluation of the generated SQL query. Our work is focused on
applying implicit constraints on the OBDA query translation process over
tabular data. We propose Morph-CSV, a framework for querying tabular data that
exploits information from typical OBDA inputs (e.g., mappings, queries) to
enforce constraints that can be used together with any SPARQL-to-SQL OBDA
engine. Morph-CSV relies on both a constraint component and a set of constraint
operators. For a given set of constraints, the operators are applied to each
type of constraint with the aim of enhancing query completeness and
performance. We evaluate Morph-CSV in several domains: e-commerce with the BSBM
benchmark; transportation with a benchmark using the GTFS dataset from the
Madrid subway; and biology with a use case extracted from the Bio2RDF project.
We compare and report the performance of two SPARQL-to-SQL OBDA engines,
without and with the incorporation of Morph-CSV. The observed results suggest
that Morph-CSV is able to speed up the total query execution time by up to two
orders of magnitude, while it is able to produce all the query answers.
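The core idea, enforcing constraints that raw CSV files cannot express before the SPARQL-to-SQL engine queries them, can be sketched with sqlite3. The GTFS-like column names and the particular constraint chosen (a primary key, which both deduplicates rows and gives the RDBMS an index) are illustrative only, not Morph-CSV's actual constraint operators.

```python
import csv
import io
import sqlite3

CSV_TEXT = """stop_id,stop_name
1,Sol
2,Atocha
1,Sol
"""

def load_with_constraints(csv_text, table, key, conn):
    """Load a CSV into an RDBMS while enforcing a derived primary-key constraint."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    cols = list(rows[0])
    conn.execute(
        f'CREATE TABLE {table} ({", ".join(c + " TEXT" for c in cols)}, '
        f"PRIMARY KEY ({key}))"
    )
    # INSERT OR IGNORE drops duplicate keys, so the loaded table is consistent.
    conn.executemany(
        f'INSERT OR IGNORE INTO {table} VALUES ({",".join("?" * len(cols))})',
        [tuple(r[c] for c in cols) for r in rows],
    )

conn = sqlite3.connect(":memory:")
load_with_constraints(CSV_TEXT, "stops", "stop_id", conn)
print(conn.execute("SELECT COUNT(*) FROM stops").fetchone()[0])  # 2
```

Because the constraint is enforced at load time, the SQL produced by any downstream SPARQL-to-SQL engine runs against an indexed, duplicate-free table instead of a flat text dump.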
Responsible Knowledge Management in Energy Data Ecosystems
This paper analyzes the challenges and requirements of establishing energy data ecosystems (EDEs) as data-driven infrastructures that overcome the limitations of currently fragmented energy applications. It proposes a new data- and knowledge-driven approach for management and processing. This approach aims to extend the analytics services portfolio of various energy stakeholders and achieve two-way flows of electricity and information for optimized generation, distribution, and electricity consumption. The approach is based on semantic technologies to create knowledge-based systems that will aid machines in integrating and processing resources contextually and intelligently. Thus, a paradigm shift in the energy data value chain is proposed towards transparency and the responsible management of data and knowledge exchanged by the various stakeholders of an energy data space. The approach can contribute to innovative energy management and the adoption of new business models in future energy data spaces. © 2022 by the authors. Licensee MDPI, Basel, Switzerland.