Efficiently Processing Workflow Provenance Queries on SPARK
In this paper, we investigate how we can leverage the Spark platform for
efficiently processing provenance queries on large volumes of workflow
provenance data. We focus on processing provenance queries at the attribute-value
level, which is the finest granularity available. We propose a novel weakly
connected component based framework which is carefully engineered to quickly
determine a minimal volume of data containing the entire lineage of the queried
attribute-value. This minimal volume of data is then processed to determine
the provenance of the queried attribute-value. The proposed framework computes
weakly connected components on the workflow provenance graph and then uses the
workflow dependency graph to partition the large components into collections
of weakly connected sets. We study the
effectiveness of the proposed framework through experiments on a provenance
trace obtained from a real-life unstructured text curation workflow. On
provenance graphs containing up to 500M nodes and edges, we show that the
proposed framework answers provenance queries in real time and easily
outperforms the naive approaches.
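
The core idea of the abstract, restricting a lineage query to the weakly connected component that contains the queried attribute-value, can be illustrated with a small single-machine sketch. This is not the paper's Spark implementation, and it does not cover the further partitioning of large components into weakly connected sets. The function names, the `(source, derived)` edge representation, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict, deque


def weakly_connected_components(edges):
    """Map each node to a component label with union-find, ignoring edge direction."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for src, dst in edges:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_src] = root_dst

    return {node: find(node) for node in list(parent)}


def lineage(edges, component_of, target):
    """Return every ancestor of target, scanning only edges in target's component."""
    label = component_of[target]

    # Keep only the edges that live in the same component as the queried value.
    parents = defaultdict(list)
    for src, dst in edges:
        if component_of[dst] == label:
            parents[dst].append(src)

    # Breadth-first traversal backwards along the derivation edges.
    seen, queue = set(), deque([target])
    while queue:
        node = queue.popleft()
        for ancestor in parents.get(node, []):
            if ancestor not in seen:
                seen.add(ancestor)
                queue.append(ancestor)
    return seen


# Hypothetical provenance edges: (source value, derived value).
edges = [("a", "b"), ("b", "c"), ("d", "c"), ("x", "y")]
component_of = weakly_connected_components(edges)
ancestors_of_c = lineage(edges, component_of, "c")
```

In this toy graph, the query for `c` never touches the edge between `x` and `y`, because that edge belongs to a different component. On a large provenance graph, the same pruning is what keeps the volume of data scanned per query small.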