Abstracting PROV provenance graphs: A validity-preserving approach
Data provenance is a structured form of metadata designed to record the activities and datasets involved in data production, as well as their dependency relationships. The PROV data model, released by the W3C in 2013, defines a schema and constraints that together provide a structural and semantic foundation for provenance. This enables the interoperable exchange of provenance between data producers and consumers. When the provenance content is sensitive and subject to disclosure restrictions, however, a principled way of hiding parts of the provenance before communicating it to certain parties is required. In this paper we present a provenance abstraction operator that achieves this goal. It maps a graphical representation of a PROV document PG1 to a new abstract version PG2, ensuring that (i) PG2 is a valid PROV graph, and (ii) the dependencies that appear in PG2 are justified by those that appear in PG1. These two properties ensure that further abstraction of abstract PROV graphs is possible. A guiding principle of the work is that of minimum damage: the resultant graph is altered as little as possible, while ensuring that the two properties are maintained. The operator developed is implemented as part of a user tool, described in a separate paper, that lets owners of sensitive provenance information control the abstraction by specifying an abstraction policy.
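Property (ii) above can be illustrated with a small sketch. It is not the operator from the paper: it simply collapses a chosen group of nodes into one abstract node and keeps only edges that have a counterpart in the original graph. The function name `abstract_graph`, the node names, and the edge-set representation are assumptions made for illustration, and the sketch performs no PROV validity checking (property (i)).

```python
def abstract_graph(edges, groups):
    """Collapse nodes according to `groups` (node -> abstract node name).

    `edges` is a set of (source, target) dependency pairs. Every edge in the
    result is the image of at least one original edge, so each abstract
    dependency is justified by a dependency in the input graph.
    NOTE: illustrative only; PROV typing and validity constraints are ignored.
    """
    abstracted = set()
    for src, dst in edges:
        s = groups.get(src, src)
        d = groups.get(dst, dst)
        if s != d:  # drop edges that fall entirely inside a collapsed group
            abstracted.add((s, d))
    return abstracted


# Hypothetical graph: report <- analyse <- clean_data <- clean <- raw_data
pg1 = {
    ("report", "analyse"),
    ("analyse", "clean_data"),
    ("clean_data", "clean"),
    ("clean", "raw_data"),
}
# Hide the intermediate steps behind a single abstract node.
groups = {"analyse": "pipeline", "clean_data": "pipeline", "clean": "pipeline"}
pg2 = abstract_graph(pg1, groups)
```

Because the result is again a plain edge set, the same function can be applied to `pg2`, which mirrors the abstract's point that abstract graphs can be abstracted further.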
Hypothetical Reasoning via Provenance Abstraction
Data analytics often involves hypothetical reasoning: repeatedly modifying
the data and observing the induced effect on the computation result of a
data-centric application. Previous work has shown that fine-grained data
provenance can help make such an analysis more efficient: instead of a costly
re-execution of the underlying application, hypothetical scenarios are applied
to a pre-computed provenance expression. However, storing provenance for
complex queries and large-scale data leads to a significant overhead, which is
often a barrier to the incorporation of provenance-based solutions.
To this end, we present a framework that reduces provenance size.
Our approach is based on reducing the provenance granularity using user-defined
abstraction trees over the provenance variables; the granularity is based on
the anticipated hypothetical scenarios. We formalize the tradeoff between
provenance size and supported granularity of the hypothetical reasoning, and
study the complexity of the resulting optimization problem, and provide efficient
algorithms for tractable cases and heuristics for others. We experimentally
study the performance of our solution for various queries and abstraction
trees. Our study shows that the algorithms generally lead to substantial
speedup of hypothetical reasoning, with a reasonable loss of accuracy.
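The size-versus-granularity tradeoff described above can be sketched in a few lines. This is a minimal illustration, not the framework's algorithm: it assumes the provenance is a polynomial stored as a dictionary from monomials to coefficients, and that a cut of the abstraction tree is given as a child-to-ancestor mapping. The names `abstract`, `evaluate`, and the variables `a1`, `a2`, `b`, `c` are hypothetical.

```python
from collections import Counter

# Provenance polynomial 3*a1*b + 2*a2*b + 4*c.
# Each monomial is a sorted tuple of variable names mapped to its coefficient.
provenance = {("a1", "b"): 3, ("a2", "b"): 2, ("c",): 4}

# Assumed cut of an abstraction tree: a1 and a2 are both replaced by "a".
abstraction = {"a1": "a", "a2": "a"}


def abstract(poly, mapping):
    """Rename variables to their abstract ancestors and merge equal monomials."""
    out = Counter()
    for monomial, coeff in poly.items():
        coarse = tuple(sorted(mapping.get(v, v) for v in monomial))
        out[coarse] += coeff
    return dict(out)


def evaluate(poly, valuation):
    """Apply a hypothetical scenario; unassigned variables default to 1."""
    total = 0
    for monomial, coeff in poly.items():
        term = coeff
        for v in monomial:
            term *= valuation.get(v, 1)
        total += term
    return total


coarse = abstract(provenance, abstraction)
# Deleting everything under "a" gives the same answer on both representations.
fine_result = evaluate(provenance, {"a1": 0, "a2": 0})
coarse_result = evaluate(coarse, {"a": 0})
```

The coarse polynomial has fewer monomials, but a scenario that changes only `a1` can no longer be expressed on it. That loss of granularity is the price paid for the smaller size.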
Approximated Summarization of Data Provenance
Many modern applications involve collecting large amounts of data from multiple sources, and then aggregating and manipulating it in intricate ways. The complexity of such applications, combined with the size of the collected data, makes it difficult to understand how the resulting information was derived. Data provenance has proven helpful in this respect; however, maintaining and presenting the full and exact provenance information may be infeasible due to its size and complexity. We therefore introduce the notion of approximated summarized provenance, which provides a compact representation of the provenance at the possible cost of information loss. Based on this notion, we present a novel provenance summarization algorithm that uses the semantics of the underlying data and the intended use of provenance to output a summary of the input provenance. Experiments measure the conciseness and accuracy of the resulting provenance summaries, and the improvement in provenance usage time.