Automatic vs Manual Provenance Abstractions: Mind the Gap
In recent years the need to simplify or to hide sensitive information in
provenance has given rise to research on provenance abstraction. In the context
of scientific workflows, existing research provides techniques to
semi-automatically create abstractions of a given workflow description, which are in
turn used as filters over the workflow's provenance traces. An alternative
approach that is commonly adopted by scientists is to build workflows with
abstractions embedded into the workflow's design, such as using sub-workflows.
This paper reports on the comparison of manual versus semi-automated approaches
in a context where result abstractions are used to filter report-worthy results
of computational scientific analyses. Specifically, we take a real-world
workflow containing user-created design abstractions and compare these with
abstractions created by the ZOOM UserViews and Workflow Summaries systems. Our
comparison shows that semi-automatic and manual approaches largely overlap from
a process perspective; however, there is a dramatic mismatch in terms of data
artefacts retained in an abstracted account of derivation. We discuss reasons
and suggest future research directions.
Comment: Preprint accepted to the 2016 workshop on the Theory and Applications
of Provenance, TAPP 2016
PROV-IO+: A Cross-Platform Provenance Framework for Scientific Data on HPC Systems
Data provenance, or data lineage, describes the life cycle of data. In
scientific workflows on HPC systems, scientists often seek diverse provenance
(e.g., origins of data products, usage patterns of datasets). Unfortunately,
existing provenance solutions cannot address the challenges due to their
incompatible provenance models and/or system implementations. In this paper, we
analyze four representative scientific workflows in collaboration with the
domain scientists to identify concrete provenance needs. Based on the
first-hand analysis, we propose a provenance framework called PROV-IO+, which
includes an I/O-centric provenance model for describing scientific data and the
associated I/O operations and environments precisely. Moreover, we build a
prototype of PROV-IO+ to enable end-to-end provenance support on real HPC
systems with little manual effort. The PROV-IO+ framework can support both
containerized and non-containerized workflows on different HPC platforms with
flexibility in selecting various classes of provenance. Our experiments with
realistic workflows show that PROV-IO+ can address the provenance needs of the
domain scientists effectively with reasonable performance (e.g., less than 3.5%
tracking overhead for most experiments). Moreover, PROV-IO+ outperforms a
state-of-the-art system (i.e., ProvLake) in our experiments.
Provenance in bioinformatics workflows
In this work, we used the PROV-DM model to manage data provenance in workflows of genome projects. This provenance model allows the storage of details of one workflow execution, e.g., raw and produced data and computational tools, their versions and parameters. Using this model, biologists can access details of one particular execution of a workflow, compare results produced by different executions, and plan new experiments more efficiently. In addition, a provenance simulator was created, which facilitates the inclusion of provenance data of one genome project workflow execution. Finally, we discuss one case study, which aims to identify genes involved in specific metabolic pathways of Bacillus cereus, as well as to compare this isolate with other phylogenetically related bacteria from the Bacillus group. B. cereus is an extremophilic bacterium, collected in warm water in the Midwestern Region of Brazil; its DNA samples were sequenced with an NGS machine.
Using a Model-driven Approach in Building a Provenance Framework for Tracking Policy-making Processes in Smart Cities
The significance of provenance in various settings has emphasised its
potential in the policy-making process for analytics in Smart Cities. At
present, there exists no framework that can capture the provenance in a
policy-making setting. This research therefore aims at defining a novel
framework, namely, the Policy Cycle Provenance (PCP) Framework, to capture the
provenance of the policy-making process. However, it is not straightforward to
design the provenance framework due to a number of associated policy design
challenges. The design challenges revealed the need for an adaptive system for
tracking policies; therefore, a model-driven approach has been considered in
designing the PCP framework. Also, the suitability of a networking approach is
proposed for designing workflows for tracking the policy-making process.
Comment: 15 pages, 5 figures, 2 tables, Proc. of the 21st International
Database Engineering & Applications Symposium (IDEAS 2017)
Sharing interoperable workflow provenance: A review of best practices and their practical application in CWLProv
Background: The automation of data analysis in the form of scientific workflows has become a widely adopted practice in many fields of research. Computationally driven data-intensive experiments using workflows enable Automation, Scaling, Adaption and Provenance support (ASAP). However, there are still several challenges associated with the effective sharing, publication and reproducibility of such workflows due to the incomplete capture of provenance and lack of interoperability between different technical (software) platforms.
Results: Based on best practice recommendations identified from literature on workflow design, sharing and publishing, we define a hierarchical provenance framework to achieve uniformity in the provenance and support comprehensive and fully re-executable workflows equipped with domain-specific information. To realise this framework, we present CWLProv, a standard-based format to represent any workflow-based computational analysis to produce workflow output artefacts that satisfy the various levels of provenance. We utilise open source community-driven standards: interoperable workflow definitions in Common Workflow Language (CWL), structured provenance representation using the W3C PROV model, and resource aggregation and sharing as workflow-centric Research Objects (RO) generated along with the final outputs of a given workflow enactment. We demonstrate the utility of this approach through a practical implementation of CWLProv and evaluation using real-life genomic workflows developed by independent groups.
Conclusions: The underlying principles of the standards utilised by CWLProv enable semantically-rich and executable Research Objects that capture computational workflows with retrospective provenance such that any platform supporting CWL will be able to understand the analysis, re-use the methods for partial re-runs, or reproduce the analysis to validate the published findings.
Submitted to GigaScience (GIGA-D-18-00483)
Causality and the semantics of provenance
Provenance, or information about the sources, derivation, custody or history
of data, has been studied recently in a number of contexts, including
databases, scientific workflows and the Semantic Web. Many provenance
mechanisms have been developed, motivated by informal notions such as
influence, dependence, explanation and causality. However, there has been
little study of whether these mechanisms formally satisfy appropriate policies
or even how to formalize relevant motivating concepts such as causality. We
contend that mathematical models of these concepts are needed to justify and
compare provenance techniques. In this paper we review a theory of causality
based on structural models that has been developed in artificial intelligence,
and describe work in progress on a causal semantics for provenance graphs.
Comment: Workshop submission