118 research outputs found
The lifecycle of provenance metadata and its associated challenges and opportunities
This chapter outlines some of the challenges and opportunities associated
with adopting provenance principles and standards in a variety of disciplines,
including data publication and reuse, and information sciences
DataHub: Collaborative Data Science & Dataset Version Management at Scale
Relational databases have limited support for data collaboration, where teams
collaboratively curate and analyze large datasets. Inspired by software version
control systems like git, we propose (a) a dataset version control system,
giving users the ability to create, branch, merge, difference and search large,
divergent collections of datasets, and (b) a platform, DataHub, that gives
users the ability to perform collaborative data analysis building on this
version control system. We outline the challenges in providing dataset version
control at scale.Comment: 7 page
Scientific Workflow Repeatability through Cloud-Aware Provenance
The transformations, analyses and interpretations of data in scientific
workflows are vital for the repeatability and reliability of scientific
workflows. This provenance of scientific workflows has been effectively carried
out in Grid based scientific workflow systems. However, recent adoption of
Cloud-based scientific workflows present an opportunity to investigate the
suitability of existing approaches or propose new approaches to collect
provenance information from the Cloud and to utilize it for workflow
repeatability in the Cloud infrastructure. The dynamic nature of the Cloud in
comparison to the Grid makes it difficult because resources are provisioned
on-demand unlike the Grid. This paper presents a novel approach that can assist
in mitigating this challenge. This approach can collect Cloud infrastructure
information along with workflow provenance and can establish a mapping between
them. This mapping is later used to re-provision resources on the Cloud. The
repeatability of the workflow execution is performed by: (a) capturing the
Cloud infrastructure information (virtual machine configuration) along with the
workflow provenance, and (b) re-provisioning the similar resources on the Cloud
and re-executing the workflow on them. The evaluation of an initial prototype
suggests that the proposed approach is feasible and can be investigated
further.Comment: 6 pages; 5 figures; 3 tables in Proceedings of the Recomputability
2014 workshop of the 7th IEEE/ACM International Conference on Utility and
Cloud Computing (UCC 2014). London December 201
Distilling Structure in Scientific Workflows
International audienceIn this work, we have conducted a series of experiments to better understand the structure of scientific workflows. In particular, we have investigated techniques to understand why scientific workflows may or may not have a series-parallel structure
Reproducibility of scientific workflows execution using cloud-aware provenance (ReCAP)
© 2018, Springer-Verlag GmbH Austria, part of Springer Nature. Provenance of scientific workflows has been considered a mean to provide workflow reproducibility. However, the provenance approaches adopted so far are not applicable in the context of Cloud because the provenance trace lacks the Cloud information. This paper presents a novel approach that collects the Cloud-aware provenance and represents it as a graph. The workflow execution reproducibility on the Cloud is determined by comparing the workflow provenance at three levels i.e., workflow structure, execution infrastructure and workflow outputs. The experimental evaluation shows that the implemented approach can detect changes in the provenance traces and the outputs produced by the workflow
- …