Long-term data-driven studies have become indispensable in many areas of
science. Often, the formats, structures and semantics of data change over
time: the data sets evolve. Therefore, studies over several decades in
particular have to consider changing database schemas. The evolution of these
databases leads at some point to a large number of schemas, which have to be
stored and managed, a costly and time-consuming task. However, for research
data to be reproducible, each database version must be reconstructable
with little effort, so that a previously published result can be validated and
reproduced at any time.
Nevertheless, in many cases, such an evolution cannot be fully
reconstructed. This article classifies the 15 most frequently used schema
modification operators and defines the associated inverse for each operator.
To avoid information loss, it furthermore defines which additional
provenance information has to be stored. We define four classes dealing with
dangling tuples, duplicates and provenance-invariant operators. Each class is
presented by one representative.
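
As a minimal illustration of why an inverse may need additional provenance
information, consider dropping a column. The sketch below is not one of the
operator definitions of this article; the names drop_column and
add_column_back, the row layout, and the side table keyed by row position are
all assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, object]


def drop_column(rows: List[Row], column: str) -> Tuple[List[Row], Dict[int, object]]:
    """Hypothetical forward operator: remove `column` from every row.

    The removed values are returned as a side table keyed by row position;
    this side table plays the role of the stored provenance information.
    """
    side = {i: row[column] for i, row in enumerate(rows)}
    reduced = [{k: v for k, v in row.items() if k != column} for row in rows]
    return reduced, side


def add_column_back(
    rows: List[Row], column: str, side: Optional[Dict[int, object]] = None
) -> List[Row]:
    """Hypothetical inverse: re-add `column`.

    Without the side table the values are unknown and filled with None,
    so the original relation is only approximated.
    """
    side = side or {}
    return [{**row, column: side.get(i)} for i, row in enumerate(rows)]


# Illustrative data (invented for this example).
original = [
    {"id": 1, "station": "A", "temp": 20.5},
    {"id": 2, "station": "B", "temp": 21.0},
]
reduced, side = drop_column(original, "temp")
restored_exact = add_column_back(reduced, "temp", side)
restored_lossy = add_column_back(reduced, "temp")
```

With the side table, restored_exact equals the original relation; without it,
restored_lossy contains only null values in the re-added column, which is the
kind of information loss that stored provenance is meant to prevent.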
By using and extending the theory of schema mappings and their inverses for
queries, data analysis, why-provenance, and schema evolution, we are able to
combine data analysis applications with provenance under evolving database
structures, in order to enable the reproducibility of scientific results over
longer periods of time. While most of the inverses of schema mappings used for
analysis or evolution are not exact, but only quasi-inverses, adding provenance
information enables us to reconstruct a sub-database of research data that is
sufficient to guarantee reproducibility.
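
The idea of a sufficient sub-database can be sketched with a toy query. The
example below is only in the spirit of why-provenance and is not the formal
construction used in this article; the function mean_above, the tuple
identifiers, and the data are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

Row = Dict[str, float]


def mean_above(rows: List[Row], threshold: float) -> Tuple[Optional[float], Set[float]]:
    """Hypothetical analysis query: mean of all values above `threshold`.

    Besides the result, it returns the ids of the tuples that contributed
    to it (the witnesses of the result).
    """
    used = [r for r in rows if r["value"] > threshold]
    result = sum(r["value"] for r in used) / len(used) if used else None
    witness = {r["id"] for r in used}
    return result, witness


# Illustrative research data (invented for this example).
database = [
    {"id": 1, "value": 3.0},
    {"id": 2, "value": 8.0},
    {"id": 3, "value": 10.0},
]

result, witness = mean_above(database, 5.0)

# Keep only the witness tuples: a sub-database of the original data.
sub_database = [r for r in database if r["id"] in witness]

# Re-running the query on the sub-database reproduces the published result.
reproduced, _ = mean_above(sub_database, 5.0)
```

Here the tuple with id 1 is never needed to reproduce the result, so only the
two witness tuples would have to be reconstructable in later schema versions.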