118 research outputs found

    The lifecycle of provenance metadata and its associated challenges and opportunities

    Full text link
    This chapter outlines some of the challenges and opportunities associated with adopting provenance principles and standards in a variety of disciplines, including data publication and reuse, and information sciences

    DataHub: Collaborative Data Science & Dataset Version Management at Scale

    Get PDF
    Relational databases have limited support for data collaboration, where teams collaboratively curate and analyze large datasets. Inspired by software version control systems like git, we propose (a) a dataset version control system, giving users the ability to create, branch, merge, difference and search large, divergent collections of datasets, and (b) a platform, DataHub, that gives users the ability to perform collaborative data analysis building on this version control system. We outline the challenges in providing dataset version control at scale.Comment: 7 page

    Classification of Scientific Workflows Based on Reproducibility Analysis

    Get PDF

    Scientific Workflow Repeatability through Cloud-Aware Provenance

    Full text link
    The transformations, analyses and interpretations of data in scientific workflows are vital for the repeatability and reliability of scientific workflows. This provenance of scientific workflows has been effectively carried out in Grid based scientific workflow systems. However, recent adoption of Cloud-based scientific workflows present an opportunity to investigate the suitability of existing approaches or propose new approaches to collect provenance information from the Cloud and to utilize it for workflow repeatability in the Cloud infrastructure. The dynamic nature of the Cloud in comparison to the Grid makes it difficult because resources are provisioned on-demand unlike the Grid. This paper presents a novel approach that can assist in mitigating this challenge. This approach can collect Cloud infrastructure information along with workflow provenance and can establish a mapping between them. This mapping is later used to re-provision resources on the Cloud. The repeatability of the workflow execution is performed by: (a) capturing the Cloud infrastructure information (virtual machine configuration) along with the workflow provenance, and (b) re-provisioning the similar resources on the Cloud and re-executing the workflow on them. The evaluation of an initial prototype suggests that the proposed approach is feasible and can be investigated further.Comment: 6 pages; 5 figures; 3 tables in Proceedings of the Recomputability 2014 workshop of the 7th IEEE/ACM International Conference on Utility and Cloud Computing (UCC 2014). London December 201

    Distilling Structure in Scientific Workflows

    Get PDF
    International audienceIn this work, we have conducted a series of experiments to better understand the structure of scientific workflows. In particular, we have investigated techniques to understand why scientific workflows may or may not have a series-parallel structure

    Reproducibility of scientific workflows execution using cloud-aware provenance (ReCAP)

    Get PDF
    © 2018, Springer-Verlag GmbH Austria, part of Springer Nature. Provenance of scientific workflows has been considered a mean to provide workflow reproducibility. However, the provenance approaches adopted so far are not applicable in the context of Cloud because the provenance trace lacks the Cloud information. This paper presents a novel approach that collects the Cloud-aware provenance and represents it as a graph. The workflow execution reproducibility on the Cloud is determined by comparing the workflow provenance at three levels i.e., workflow structure, execution infrastructure and workflow outputs. The experimental evaluation shows that the implemented approach can detect changes in the provenance traces and the outputs produced by the workflow
    • …
    corecore