Combining Provenance Management and Schema Evolution
The combination of provenance management and schema evolution using the CHASE algorithm is the focus of our research in the area of research data management. The aim is to combine the construction of a CHASE inverse mapping to calculate the minimal part of the original database, the minimal sub-database, with a CHASE-based schema mapping for schema evolution.
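The core idea can be illustrated with a minimal sketch, assuming a single source-to-target dependency R(x, y) -> S(x, y). All names here are illustrative, not the authors' actual implementation: the forward chase materializes the target, and a "CHASE inverse" recovers the minimal sub-database of R needed to reproduce a given subset of S.

```python
# Hypothetical sketch: forward chase for the dependency R(x, y) -> S(x, y),
# plus an inverse that recovers the minimal sub-database of the source.

def chase_step(source_R):
    """Forward chase: copy every R-tuple into the target relation S."""
    return {("S",) + t for t in source_R}

def chase_inverse(target_subset, source_R):
    """Recover the minimal part of R whose chase produces target_subset."""
    return {t for t in source_R if ("S",) + t in target_subset}

R = {(1, "a"), (2, "b"), (3, "c")}
S = chase_step(R)

# Only two target tuples are of interest, so only two source tuples
# (the minimal sub-database) are needed to reproduce them.
needed = chase_inverse({("S", 1, "a"), ("S", 3, "c")}, R)
# needed == {(1, "a"), (3, "c")}
```

Real chase procedures handle arbitrary tuple-generating dependencies and labeled nulls; this toy copy-dependency only conveys the minimal-sub-database idea.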
Explain3D: Explaining Disagreements in Disjoint Datasets
Data plays an important role in applications, analytic processes, and many
aspects of human activity. As data grows in size and complexity, there is a
pressing need for tools that promote understanding of, and explanations over,
data-related operations. Data management research on explanations has assumed
that data resides in a single dataset under one common schema. In reality,
today's data is frequently unintegrated, coming from different sources with
different schemas. When different datasets provide different answers to
semantically similar questions, understanding the reasons for the
discrepancies is challenging and cannot be handled by existing single-dataset
solutions.
In this paper, we propose Explain3D, a framework for explaining the
disagreements across disjoint datasets (3D). Explain3D focuses on identifying
the reasons for the differences in the results of two semantically similar
queries operating on two datasets with potentially different schemas. Our
framework leverages the queries to perform a semantic mapping across the
relevant parts of their provenance; discrepancies in this mapping point to
causes of the queries' differences. Exploiting the queries gives Explain3D an
edge over traditional schema matching and record linkage techniques, which are
query-agnostic. Our work makes the following contributions: (1) We formalize
the problem of deriving optimal explanations for the differences of the results
of semantically similar queries over disjoint datasets. (2) We design a 3-stage
framework for solving the optimal explanation problem. (3) We develop a
smart-partitioning optimizer that improves the efficiency of the framework by
orders of magnitude. (4) We experiment with real-world and synthetic data to
demonstrate that Explain3D can derive precise explanations efficiently.
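The approach of comparing provenance across two semantically similar queries can be sketched as follows. This is a hedged illustration with hypothetical data and schemas, not Explain3D's actual algorithm: two datasets answer "total amount per city" but disagree, and comparing the contributing rows (the provenance) per city points at the cause.

```python
# Two datasets with different schemas answering a semantically similar query.
ds1 = [  # schema: (city, amount)
    ("Berlin", 10), ("Berlin", 5), ("Hamburg", 7),
]
ds2 = [  # schema: (location, revenue, year)
    ("Berlin", 10, 2020), ("Hamburg", 7, 2020),
]

def totals_with_provenance(rows, city_idx, amount_idx):
    """Group rows by city, returning each total with its contributing rows."""
    prov = {}
    for row in rows:
        prov.setdefault(row[city_idx], []).append(row)
    return {c: (sum(r[amount_idx] for r in rs), rs) for c, rs in prov.items()}

q1 = totals_with_provenance(ds1, city_idx=0, amount_idx=1)
q2 = totals_with_provenance(ds2, city_idx=0, amount_idx=1)

# Where the totals differ, rows in one provenance set without a counterpart
# in the other are candidate explanations for the disagreement.
discrepancies = {}
for city in q1.keys() & q2.keys():
    t1, p1 = q1[city]
    t2, p2 = q2[city]
    if t1 != t2:
        extra = [r for r in p1 if r[1] not in {s[1] for s in p2}]
        discrepancies[city] = (t1, t2, extra)
# discrepancies == {"Berlin": (15, 10, [("Berlin", 5)])}
```

The sketch hard-codes the semantic mapping between the two schemas (column indices); deriving that mapping from the queries themselves is what gives the framework its edge over query-agnostic schema matching.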
Enhanced Inversion of Schema Evolution with Provenance
Long-term data-driven studies have become indispensable in many areas of
science. Often, the formats, structures, and semantics of data change over
time: the data sets evolve. Studies spanning several decades therefore have
to cope with changing database schemas. At some point, this evolution leads
to a large number of schemas that have to be stored and managed, which is
costly and time-consuming. Yet, for research data to be reproducible, each
database version must be reconstructable with little effort, so that a
previously published result can be validated and reproduced at any time.
Nevertheless, in many cases such an evolution cannot be fully reconstructed.
This article classifies the 15 most frequently used schema modification
operators and defines the associated inverse for each operation. To avoid
information loss, it furthermore defines which additional provenance
information has to be stored. We define four classes dealing with dangling
tuples, duplicates, and provenance-invariant operators, and present each
class by one representative.
By using and extending the theory of schema mappings and their inverses for
queries, data analysis, why-provenance, and schema evolution, we are able to
combine data analysis applications with provenance under evolving database
structures, enabling the reproducibility of scientific results over longer
periods of time. While most of the inverses of schema mappings used for
analysis or evolution are not exact but only quasi-inverses, adding
provenance information enables us to reconstruct a sub-database of the
research data that is sufficient to guarantee reproducibility.
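The role of provenance in inverting a schema modification operator can be sketched with DROP COLUMN, a classic non-invertible operator: the dropped values are lost, so the plain inverse is only a quasi-inverse. Storing the dropped values as provenance makes the reconstruction exact. All names below are illustrative, not the paper's notation.

```python
# Hedged sketch: DROP COLUMN with provenance capture, and its exact inverse.

def drop_column(table, col):
    """Forward evolution step: remove `col`, keeping its values as provenance."""
    provenance = {i: row[col] for i, row in enumerate(table)}
    evolved = [{k: v for k, v in row.items() if k != col} for row in table]
    return evolved, provenance

def inverse_drop_column(evolved, col, provenance):
    """Inverse step: reinsert the column from the stored provenance."""
    return [dict(row, **{col: provenance[i]}) for i, row in enumerate(evolved)]

old = [{"id": 1, "unit": "cm"}, {"id": 2, "unit": "kg"}]
new, prov = drop_column(old, "unit")

# Without `prov`, the original table cannot be recovered; with it,
# the round trip is exact.
restored = inverse_drop_column(new, "unit", prov)
# restored == old
```

Operators such as RENAME COLUMN are provenance-invariant (exactly invertible without extra information); the point of the classification is to identify which operators, like DROP COLUMN here, need this additional stored provenance.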
Linked Data - the story so far
The term “Linked Data” refers to a set of best practices for publishing and connecting structured data on the Web. These best practices have been adopted by an increasing number of data providers over the last three years, leading to the creation of a global data space containing billions of assertions: the Web of Data. In this article, the authors present the concept and technical principles of Linked Data, and situate these within the broader context of related technological developments. They describe progress to date in publishing Linked Data on the Web, review applications that have been developed to exploit the Web of Data, and map out a research agenda for the Linked Data community as it moves forward.
e-Social Science and Evidence-Based Policy Assessment: Challenges and Solutions
Peer reviewed preprint