21 research outputs found
CWLProv - Interoperable Retrospective Provenance capture and its challenges
<p>The automation of data analysis in the form of scientific workflows is a widely adopted practice in many fields of research nowadays. Computationally driven data-intensive experiments using workflows enable <strong>A</strong>utomation, <strong>S</strong>caling, <strong>A</strong>daption and <strong>P</strong>rovenance support (ASAP).</p>
<p>However, there are still several challenges associated with the effective sharing, publication, understandability and reproducibility of such workflows due to the incomplete capture of provenance and the dependence on particular technical (software) platforms. This paper presents <strong>CWLProv</strong>, an approach for retrospective provenance capture utilizing open source community-driven standards involving application and customization of workflow-centric <a href="http://www.researchobject.org/">Research Objects</a> (ROs).</p>
<p>The ROs are produced as an output of a workflow enactment defined in the <a href="http://www.commonwl.org/">Common Workflow Language</a> (CWL) using the CWL reference implementation and its data structures. The approach aggregates and annotates all the resources involved in the scientific investigation including inputs, outputs, workflow specification, command line tool specifications and input parameter settings. The resources are linked within the RO to enable re-enactment of an analysis without depending on external resources.</p>
<p>The workflow provenance profile is represented in W3C recommended standard <a href="https://www.w3.org/TR/prov-n/">PROV-N</a> and <a href="https://www.w3.org/Submission/prov-json/">PROV-JSON</a> format to capture retrospective provenance of the workflow enactment. The workflow-centric RO produced as an output of a CWL workflow enactment is expected to be interoperable, reusable, shareable and portable across different platforms.</p>
<p>This paper describes the need and motivation for <a href="https://github.com/common-workflow-language/cwltool/tree/provenance">CWLProv</a> and the lessons learned in applying it for ROs using CWL in the bioinformatics domain.</p>
Capturing interoperable reproducible workflows with Common Workflow Language
We present our ongoing work on integrating Research Object practices with Common Workflow Language, capturing and describing prospective and retrospective provenance. Accepted for talk at RO2018.
Web version at http://s11.no/2018/cwl.htm
Sharing interoperable workflow provenance: A review of best practices and their practical application in CWLProv
Background: The automation of data analysis in the form of scientific workflows has become a widely adopted practice in many fields of research. Computationally driven data-intensive experiments using workflows enable Automation, Scaling, Adaption and Provenance support (ASAP). However, there are still several challenges associated with the effective sharing, publication and reproducibility of such workflows due to the incomplete capture of provenance and lack of interoperability between different technical (software) platforms.
Results: Based on best practice recommendations identified from literature on workflow design, sharing and publishing, we define a hierarchical provenance framework to achieve uniformity in the provenance and support comprehensive and fully re-executable workflows equipped with domain-specific information. To realise this framework, we present CWLProv, a standard-based format to represent any workflow-based computational analysis to produce workflow output artefacts that satisfy the various levels of provenance. We utilise open source community-driven standards; interoperable workflow definitions in Common Workflow Language (CWL), structured provenance representation using the W3C PROV model, and resource aggregation and sharing as workflow-centric Research Objects (RO) generated along with the final outputs of a given workflow enactment. We demonstrate the utility of this approach through a practical implementation of CWLProv and evaluation using real-life genomic workflows developed by independent groups.
Conclusions: The underlying principles of the standards utilised by CWLProv enable semantically-rich and executable Research Objects that capture computational workflows with retrospective provenance such that any platform supporting CWL will be able to understand the analysis, re-use the methods for partial re-runs, or reproduce the analysis to validate the published findings.
Submitted to GigaScience (GIGA-D-18-00483)
Applying the FAIR Principles to Computational Workflows
Recent trends within computational and data sciences show an increasing recognition and adoption of computational workflows as tools for productivity, reproducibility, and democratized access to platforms and processing know-how. As digital objects to be shared, discovered, and reused, computational workflows benefit from the FAIR principles, which stand for Findable, Accessible, Interoperable, and Reusable. The Workflows Community Initiative's FAIR Workflows Working Group (WCI-FW), a global and open community of researchers and developers working with computational workflows across disciplines and domains, has systematically addressed the application of both FAIR data and software principles to computational workflows. We present our recommendations with commentary that reflects our discussions and justifies our choices and adaptations. Like the software and data principles on which they are based, these are offered to workflow users and authors, workflow management system developers, and providers of workflow services as guide rails for adoption and fodder for discussion. Workflows are becoming more prevalent as documented, automated instruments for data analysis, data collection, AI-based predictions, and simulations. The FAIR recommendations for workflows that we propose in this paper will maximize their value as research assets and facilitate their adoption by the wider community
Understanding role of provenance in bioinformatics workflows and enabling interoperable computational analysis sharing
© 2018 Dr Farah Zaib Khan
The automation of computational analyses in data-intensive domains such as genomics through scientific workflows is a widely adopted practice in many fields of research nowadays. Computationally driven data-intensive experiments using workflows enable Automation, Scaling, Adaption and Provenance support (ASAP). Provenance data collection is an essential factor for any computational workflow-centric research to achieve reproducibility, transparency and support trust in the published results. At present, the capture of provenance information across the plethora of workflow management systems and custom software platforms in the bioinformatics domain is not well supported and, as such, there exist numerous challenges associated with the effective sharing, publication, understandability, reproducibility and repeatability of scientific workflows.
This thesis focuses on providing a unified, interoperable and systematised view of provenance with specific focus on workflow environments in the bioinformatics domain. We identify and overcome the current disconnect between various workflows systems and their existing provenance representations. Through empirical analysis of complex genomic data analysis workflows using three exemplar workflow systems, we identify implicit assumptions that arise. These assumptions produce an incomplete view of provenance resulting in insufficient details that impact on workflow enactment requirements and ultimately on the reproducibility of the given analysis. We propose a set of recommendations to mitigate against such assumptions and enable workflow systems to document and capture complete provenance information that can subsequently be used for re-enacting workflows in other contexts and potentially using other workflow platforms.
Based on this empirical case study and pragmatic analysis of related literature, we define a hierarchical provenance framework offering "Levels of Provenance and Resource Sharing". Each level of this framework addresses specific provenance recommendations and supports the capture of rich provenance information, with the topmost layer enabling the sharing of comprehensive and executable workflows utilising retrospective provenance. To realise this framework, we leverage community-driven, domain-neutral, platform-independent and open-source standards to implement "CWLProv" - a format for the methodical representation of provenance supporting workflow enactment aggregating resources specific to the given enactment and associated workflow configuration settings. We realise CWLProv through the Common Workflow Language (CWL) for workflow definition and utilise Research Objects (ROs) for resource aggregation and PROV-Data Model (PROV-DM) to support the capture of retrospective provenance information as required for subsequent workflow enactments.
To demonstrate the applicability of CWLProv, we extend an existing workflow executor (cwltool) to provide a reference implementation that generates metadata and provenance-rich interoperable workflow-centric ROs. This approach aggregates and preserves data and methods needed to support the coherent sharing of computational analyses and experiments. Evaluation of CWLProv using real-life bioinformatics pipelines highlights the utility of the approach, demonstrating the interoperability of workflow analyses and the benefits to research reproducibility more generally.
CWL run of Somatic Variant Calling Workflow: CWLProv 0.5.0 Research Object
The somatic variant calling workflow included in this case study is designed by Blue Collar Bioinformatics (bcbio), a community-driven initiative to develop best-practice pipelines for variant calling, RNA-seq and small RNA analysis workflows. According to the documentation, the goal of this project is to facilitate the automated analysis of high throughput data by making the resources quantifiable, analyzable, scalable, accessible and reproducible.
All the underlying tools are containerized, facilitating software use in the workflow. The somatic variant calling workflow defined in CWL is available on GitHub and equipped with a well defined test dataset.
This dataset folder is a CWLProv Research Object that captures the Common Workflow Language execution provenance, see https://w3id.org/cwl/prov/0.5.0 or use https://pypi.org/project/cwlprov/ to explore
Steps to reproduce
To build the research object again, use Python 3 on macOS. Built on:
Processor: 2.8GHz Intel Core i7
Memory: 16GB
OS: macOS High Sierra, Version 10.13.3
Storage: 250GB
To run the workflow:
pip3 install cwltool==1.0.20180912090223
git clone https://github.com/FarahZKhan/bcbio_test_cwlprov
cd bcbio_test_cwlprov/somatic/somatic-workflow/
cwltool --provenance somaticwf_0.5.0_mac main-somatic.cwl main-somatic-samples.json
To package the research object:
zip -r somaticwf_0.5.0_mac.zip somaticwf_0.5.0_mac/
sha256sum somaticwf_0.5.0_mac.zip > somaticwf_0.5.0_mac.zip.sha256
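The sha256sum command comes from GNU coreutils and may not be available on every macOS install. Since the build already relies on Python 3, the sidecar file can also be produced with the standard library. The following is a minimal sketch under that assumption; the helper names sha256_of and write_sha256_sidecar are hypothetical and not part of cwltool or cwlprov.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_sha256_sidecar(archive):
    """Write '<digest>  <name>' next to the archive, in sha256sum's line format."""
    archive = Path(archive)
    sidecar = archive.with_name(archive.name + ".sha256")
    sidecar.write_text(f"{sha256_of(archive)}  {archive.name}\n")
    return sidecar


if __name__ == "__main__":
    # Archive name taken from the packaging step above.
    print(write_sha256_sidecar("somaticwf_0.5.0_mac.zip"))
```

Reading in chunks keeps memory use flat even when the zipped research object is several gigabytes.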
The cloned git repository is a fork of https://github.com/bcbio/test_bcbio_cwl. It was obtained using:
wget -O test_bcbio_cwl.tar.gz https://github.com/bcbio/test_bcbio_cwl/archive/master.tar.gz
The content is from an archived version from the documentation here: https://bcbio-nextgen.readthedocs.io/en/latest/contents/cwl.html#install-bcbio-vm-with-containers
Mirrored from Mendeley Data https://data.mendeley.com/datasets/97hj93mkfd/
CWL run of Alignment Workflow: CWLProv 0.6.0 Research Object
This dataset folder is a CWLProv Research Object that captures the Common Workflow Language execution provenance, see CWLProv 0.6.0 or use the cwlprov Python tool to explore.
The CWL alignment workflow included in this case study is designed by Data Biosphere. It adapts the alignment pipeline originally developed at Abecasis Lab, The University of Michigan. This workflow is part of the NIH Data Commons initiative and comprises four stages.
The first step, Pre-align, accepts a Compressed Alignment Map (CRAM) file (a compressed format for BAM files developed by the European Bioinformatics Institute (EBI)) and the human genome reference sequence as input. Using SAMtools utilities such as view, sort and fixmate, it returns a list of fastq files that serve as input for the next step.
The next step, Align, also accepts the human reference genome as input, along with the output files from Pre-align, and uses BWA-mem to generate aligned reads as BAM files. SAMBLASTER is used to mark duplicate reads and SAMtools view to convert read files from SAM to BAM format.
The BAM files generated by Align are then sorted with SAMtools sort.
Finally, these sorted alignment files are merged into a single sorted BAM file using SAMtools merge in the Post-align step.
Steps to reproduce
This analysis was run using a 16-core Linux cloud instance with 64GB RAM and pre-installed docker.
Install gsutil:
export CLOUD_SDK_REPO="cloud-sdk-$(lsb_release -c -s)"
echo "deb http://packages.cloud.google.com/apt $CLOUD_SDK_REPO main" | \
sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | \
sudo apt-key add -
sudo apt-get update && sudo apt-get install google-cloud-sdk
Get the data and make the analysis environment ready:
git clone https://github.com/FarahZKhan/topmed-workflows.git
cd topmed-workflows
git checkout cwlprov_testing
cd aligner/sbg-alignment-cwl
# this custom script downloads the Google bucket files listed in a JSON file and creates a local JSON file
# it needs gsutil to be installed
git clone https://github.com/DailyDreaming/fetch_gs_frm_json.git
# Wait... this should download ~18Gb.
python2.7 fetch_gs_frm_json/dl_gsfiles_frm_json.py topmed-alignment.sample.json
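Because the download is large, it can be worth checking free disk space before starting it. The sketch below is an illustration only: the function name has_room_for and the 25% safety margin are assumptions, and the check covers the download alone, not the intermediate files the workflow later writes.

```python
import shutil


def has_room_for(path, needed_gb, margin=1.25):
    """Return True if the filesystem holding `path` has needed_gb * margin free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= needed_gb * margin * 1000**3


if __name__ == "__main__":
    # 18 GB is the approximate download size quoted in the comment above;
    # the margin is an arbitrary allowance, not a figure from the instructions.
    print(has_room_for(".", needed_gb=18))
```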
Run the following commands to create the CWLProv Research Object:
time cwltool --no-match-user --provenance alignment_0.6.0_linux --tmp-outdir-prefix=/CWLProv_workflow_testing/intermediate_temp/temp --tmpdir-prefix=/CWLProv_workflow_testing/intermediate_temp/temp topmed-alignment.cwl topmed-alignment.sample.json.new
zip -r alignment_0.6.0_linux.zip alignment_0.6.0_linux
sha256sum alignment_0.6.0_linux.zip > alignment_0.6.0_linux.zip.sha25
Mirror of Mendeley Data upload https://data.mendeley.com/datasets/6wtpgr3kbj/
CWL run of RNA-seq Analysis Workflow: CWLProv 0.5.0 Research Object
This workflow adapts the approach and parameter settings of Trans-Omics for Precision Medicine (TOPMed). The RNA-seq pipeline originated from the Broad Institute. There are five steps in total in the workflow, starting from:
Read alignment using STAR which produces aligned BAM files including the Genome BAM and Transcriptome BAM.
The Genome BAM file is processed using Picard MarkDuplicates, producing an updated BAM file containing information on duplicate reads (such reads can indicate biased interpretation).
SAMtools index is then employed to generate an index for the BAM file, in preparation for the next step.
The indexed BAM file is processed further with RNA-SeQC which takes the BAM file, human genome reference sequence and Gene Transfer Format (GTF) file as inputs to generate transcriptome-level expression quantifications and standard quality control metrics.
In parallel with transcript quantification, isoform expression levels are quantified by RSEM. This step depends only on the output of the STAR tool, and additional RSEM reference sequences.
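The dependencies between the five steps described above can be sketched as a small graph, which makes explicit that RSEM only waits for STAR and can therefore run alongside the rest of the chain. This is an illustrative model rather than the CWL definition itself: the step labels are shorthand chosen here, not identifiers from the workflow file, and graphlib requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Each step maps to the steps whose output it consumes, per the description above.
# The labels are illustrative shorthand, not identifiers from the CWL workflow.
DEPENDS_ON = {
    "STAR": set(),
    "MarkDuplicates": {"STAR"},
    "SAMtools index": {"MarkDuplicates"},
    "RNA-SeQC": {"SAMtools index"},
    "RSEM": {"STAR"},
}


def execution_waves(depends_on):
    """Group steps into waves; every step appears in the earliest wave it can start in."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(DEPENDS_ON), start=1):
        print(number, ", ".join(wave))
```

RSEM lands in the second wave next to MarkDuplicates, which is the earliest point at which it can start; a scheduler is free to let it overlap with the later steps as well.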
For testing and analysis, the workflow authors provided example data created by down-sampling the read files of TOPMed public-access data. Chromosome 12 was extracted from the Homo sapiens assembly 38 reference sequence and is also provided by the workflow authors, as are the required GTF and RSEM reference data files. The workflow is well documented, and a detailed set of instructions describing the steps performed to down-sample the data is provided for transparency. The availability of example input data, the use of containerization for the underlying software and detailed documentation were important factors in choosing this specific CWL workflow for CWLProv evaluation.
This dataset folder is a CWLProv Research Object that captures the Common Workflow Language execution provenance, see https://w3id.org/cwl/prov/0.5.0 or use https://pypi.org/project/cwl
Steps to reproduce
To build the research object again, use Python 3 on macOS. Built with:
Processor: 2.8GHz Intel Core i7
Memory: 16GB
OS: macOS High Sierra, Version 10.13.3
Storage: 250GB
Install cwltool
pip3 install cwltool==1.0.20180912090223
Install Git LFS
Downloading the data along with the git repository requires Git LFS:
https://www.atlassian.com/git/tutorials/git-lfs#installing-git-lfs
Get the data and make the analysis environment ready:
git clone https://github.com/FarahZKhan/cwl_workflows.git
cd cwl_workflows/
git checkout CWLProvTesting
./topmed-workflows/TOPMed_RNAseq_pipeline/input-examples/download_examples.sh
Run the following commands to create the CWLProv Research Object:
cwltool --provenance rnaseqwf_0.5.0_mac --tmp-outdir-prefix=/CWLProv_workflow_testing/intermediate_temp/temp --tmpdir-prefix=/CWLProv_workflow_testing/intermediate_temp/temp topmed-workflows/TOPMed_RNAseq_pipeline/rnaseq_pipeline_fastq.cwl topmed-workflows/TOPMed_RNAseq_pipeline/input-examples/Dockstore.json
zip -r rnaseqwf_0.5.0_mac.zip rnaseqwf_0.5.0_mac
sha256sum rnaseqwf_0.5.0_mac.zip > rnaseqwf_0.5.0_mac.zip.sha256
The https://github.com/FarahZKhan/cwl_workflows repository is a frozen snapshot from https://github.com/heliumdatacommons/TOPMed_RNAseq_CWL commit 027e8af41b906173aafdb791351fb29efc044120
Mirror of Mendeley Data upload https://data.mendeley.com/datasets/xnwncxpw42/
common-workflow-language/cwlprov-py: cwlprov-py 0.1.1
The cwlprov Python tool is a command line interface to validate and inspect CWLProv Research Objects that capture workflow runs, typically executed in a Common Workflow Language implementation.
Installation
You'll need Python 3.
To install from pip try:
pip3 install cwlprov
If you would rather install from this source code:
pip3 install .
If you would like to use the cwltool rerun feature you may also need:
pip3 install cwlref-runner
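To confirm which of these packages ended up in the active Python environment, the standard library can report installed distribution versions. This is a small sketch, not part of cwlprov itself; the helper name installed_version is hypothetical, and importlib.metadata requires Python 3.8 or later.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names as used in the pip3 commands above, plus cwltool.
    for name in ("cwlprov", "cwlref-runner", "cwltool"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```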
Usage
Use cwlprov --help to see all options.
Checksums
SHA256:
c0bca8a038f130dc67f02d4a1c08757a3d7dd33b2219599e71267211a4e484b1 cwlprov-0.1.1-py3-none-any.whl
aa12a6fd99875fd1adc40a12fe3e75812318bb2edb9e087b228d73920cfbe7ab cwlprov-0.1.1.tar.g
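A downloaded release file can be compared against the published digests in the same way `sha256sum -c` would. The sketch below assumes the files sit in the current directory; the function names are hypothetical, and only the wheel entry is included because the second file name in the list above is cut off.

```python
import hashlib
from pathlib import Path

# Digest copied from the Checksums list above (wheel entry only).
LISTING = """\
c0bca8a038f130dc67f02d4a1c08757a3d7dd33b2219599e71267211a4e484b1  cwlprov-0.1.1-py3-none-any.whl
"""


def parse_checksum_lines(text):
    """Map file name -> hex digest from '<digest>  <name>' lines."""
    expected = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            digest, name = parts
            # sha256sum marks binary mode with a leading '*' on the name.
            expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def verify_downloads(text, directory="."):
    """Return {name: True, False or None}; None means the file was not found."""
    results = {}
    for name, digest in parse_checksum_lines(text).items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = None
            continue
        actual = hashlib.sha256(path.read_bytes()).hexdigest()
        results[name] = actual == digest
    return results


if __name__ == "__main__":
    for name, ok in verify_downloads(LISTING).items():
        status = {True: "OK", False: "MISMATCH", None: "missing"}[ok]
        print(f"{name}: {status}")
```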