
21 research outputs found

CWLProv - Interoperable Retrospective Provenance capture and its challenges

Author: Crusoe Michael R.
Khan Farah Zaib
Lonie Andrew
Sinnott Richard
Soiland-Reyes Stian
Publication venue
Publication date: 27/03/2018
Field of study

The automation of data analysis in the form of scientific workflows is a widely adopted practice in many fields of research nowadays. Computationally driven data-intensive experiments using workflows enable Automation, Scaling, Adaption and Provenance support (ASAP). However, there are still several challenges associated with the effective sharing, publication, understandability and reproducibility of such workflows due to the incomplete capture of provenance and the dependence on particular technical (software) platforms. This paper presents CWLProv, an approach for retrospective provenance capture utilizing open source community-driven standards involving application and customization of workflow-centric Research Objects (ROs; http://www.researchobject.org/). The ROs are produced as an output of a workflow enactment defined in the Common Workflow Language (CWL; http://www.commonwl.org/) using the CWL reference implementation and its data structures. The approach aggregates and annotates all the resources involved in the scientific investigation including inputs, outputs, workflow specification, command line tool specifications and input parameter settings. The resources are linked within the RO to enable re-enactment of an analysis without depending on external resources. The workflow provenance profile is represented in W3C recommended standard PROV-N (https://www.w3.org/TR/prov-n/) and PROV-JSON (https://www.w3.org/Submission/prov-json/) format to capture retrospective provenance of the workflow enactment. The workflow-centric RO produced as an output of a CWL workflow enactment is expected to be interoperable, reusable, shareable and portable across different platforms. This paper describes the need and motivation for CWLProv (https://github.com/common-workflow-language/cwltool/tree/provenance) and the lessons learned in applying it for ROs using CWL in the bioinformatics domain.
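To make the idea of retrospective provenance in PROV-JSON more concrete, the following is a minimal sketch that uses only the Python standard library. It builds a tiny PROV-JSON document in which one activity (a workflow run) generates one entity (an output file). The `ex:` prefix, its namespace URL, the function name and the identifiers are hypothetical and chosen for illustration only; this is not the actual CWLProv profile layout, which is considerably richer.

```python
import json


def minimal_prov_json(run_id: str, output_name: str, started: str, ended: str) -> dict:
    """Build a tiny PROV-JSON document: one activity generating one entity."""
    activity = f"ex:{run_id}"
    entity = f"ex:{output_name}"
    return {
        # Hypothetical namespace, used only for this illustration.
        "prefix": {"ex": "https://example.org/run/"},
        # The workflow run is modelled as a PROV activity with start/end times.
        "activity": {
            activity: {"prov:startTime": started, "prov:endTime": ended},
        },
        # The output file is modelled as a PROV entity.
        "entity": {entity: {"prov:label": output_name}},
        # The relation that links the output back to the run that produced it.
        "wasGeneratedBy": {
            "_:gen1": {"prov:entity": entity, "prov:activity": activity},
        },
    }


if __name__ == "__main__":
    doc = minimal_prov_json(
        "run1", "aligned_reads", "2018-03-27T10:00:00", "2018-03-27T10:05:00"
    )
    print(json.dumps(doc, indent=2))
```

A real provenance profile would also record the inputs that were used, the agents involved and the individual workflow steps; the sketch only shows the general shape of the serialization.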

ZENODO

The University of Manchester - Institutional Repository

The Francis Crick Institute

Capturing interoperable reproducible workflows with Common Workflow Language

Author: Crusoe Michael R.
Goble Carole
Khan Farah Zaib
Lonie Andrew
Sinnott Richard
Soiland-Reyes Stian
Publication venue
Publication date: 16/07/2018
Field of study

We present our ongoing work on integrating Research Object practices with Common Workflow Language, capturing and describing prospective and retrospective provenance. Accepted for talk at RO2018. Web version at http://s11.no/2018/cwl.htm

ZENODO

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

The University of Manchester - Institutional Repository

The Francis Crick Institute

Sharing interoperable workflow provenance: A review of best practices and their practical application in CWLProv

Author: Crusoe Michael R.
Goble Carole
Khan Farah Zaib
Lonie Andrew
Sinnott Richard O.
Soiland-Reyes Stian
Publication venue
Publication date: 04/12/2018
Field of study

Background: The automation of data analysis in the form of scientific workflows has become a widely adopted practice in many fields of research. Computationally driven data-intensive experiments using workflows enable Automation, Scaling, Adaption and Provenance support (ASAP). However, there are still several challenges associated with the effective sharing, publication and reproducibility of such workflows due to the incomplete capture of provenance and lack of interoperability between different technical (software) platforms. Results: Based on best practice recommendations identified from literature on workflow design, sharing and publishing, we define a hierarchical provenance framework to achieve uniformity in the provenance and support comprehensive and fully re-executable workflows equipped with domain-specific information. To realise this framework, we present CWLProv, a standard-based format to represent any workflow-based computational analysis to produce workflow output artefacts that satisfy the various levels of provenance. We utilise open source community-driven standards; interoperable workflow definitions in Common Workflow Language (CWL), structured provenance representation using the W3C PROV model, and resource aggregation and sharing as workflow-centric Research Objects (RO) generated along with the final outputs of a given workflow enactment. We demonstrate the utility of this approach through a practical implementation of CWLProv and evaluation using real-life genomic workflows developed by independent groups. Conclusions: The underlying principles of the standards utilised by CWLProv enable semantically-rich and executable Research Objects that capture computational workflows with retrospective provenance such that any platform supporting CWL will be able to understand the analysis, re-use the methods for partial re-runs, or reproduce the analysis to validate the published findings. Submitted to GigaScience (GIGA-D-18-00483

ZENODO

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

The University of Manchester - Institutional Repository

Applying the FAIR Principles to Computational Workflows

Author: Aloqalaa Meznah
Belhajjame Khalid
Crusoe Michael R.
Gadelha Luiz
Garijo Daniel
Gehlen Karsten Peters-von
Goble Carole
Gustafsson Ove Johan Ragnar
Juty Nick
Kanwal Sehrish
Khan Farah Zaib
Kinoshita Bruno de Paula
Köster Johannes
Pouchard Line
Rannow Randy K.
Soiland-Reyes Stian
Soranzo Nicola
Sufi Shoaib
Sun Ziheng
Vilne Baiba
Wilkinson Sean R.
Wouters Merridee A.
Yuen Denis
Publication venue: arXiv
Publication date: 04/10/2024
Field of study

Recent trends within computational and data sciences show an increasing recognition and adoption of computational workflows as tools for productivity, reproducibility, and democratized access to platforms and processing know-how. As digital objects to be shared, discovered, and reused, computational workflows benefit from the FAIR principles, which stand for Findable, Accessible, Interoperable, and Reusable. The Workflows Community Initiative's FAIR Workflows Working Group (WCI-FW), a global and open community of researchers and developers working with computational workflows across disciplines and domains, has systematically addressed the application of both FAIR data and software principles to computational workflows. We present our recommendations with commentary that reflects our discussions and justifies our choices and adaptations. Like the software and data principles on which they are based, these are offered to workflow users and authors, workflow management system developers, and providers of workflow services as guide rails for adoption and fodder for discussion. Workflows are becoming more prevalent as documented, automated instruments for data analysis, data collection, AI-based predictions, and simulations. The FAIR recommendations for workflows that we propose in this paper will maximize their value as research assets and facilitate their adoption by the wider community

The University of Manchester - Institutional Repository

Understanding role of provenance in bioinformatics workflows and enabling interoperable computational analysis sharing

Author: Khan Farah Zaib
Publication venue
Publication date: 01/01/2018
Field of study

© 2018 Dr Farah Zaib Khan. The automation of computational analyses in data-intensive domains such as genomics through scientific workflows is a widely adopted practice in many fields of research nowadays. Computationally driven data-intensive experiments using workflows enable Automation, Scaling, Adaption and Provenance support (ASAP). Provenance data collection is an essential factor for any computational workflow-centric research to achieve reproducibility, transparency and support trust in the published results. At present, capture of provenance information across the plethora of workflow management systems and custom software platforms in the bioinformatics domain is not well supported and, as such, there exist numerous challenges associated with the effective sharing, publication, understandability, reproducibility and repeatability of scientific workflows. This thesis focuses on providing a unified, interoperable and systematised view of provenance with specific focus on workflow environments in the bioinformatics domain. We identify and overcome the current disconnect between various workflow systems and their existing provenance representations. Through empirical analysis of complex genomic data analysis workflows using three exemplar workflow systems, we identify implicit assumptions that arise. These assumptions produce an incomplete view of provenance resulting in insufficient details that impact on workflow enactment requirements and ultimately on the reproducibility of the given analysis. We propose a set of recommendations to mitigate against such assumptions and enable workflow systems to document and capture complete provenance information that can subsequently be used for re-enacting workflows in other contexts and potentially using other workflow platforms. Based on this empirical case study and pragmatic analysis of related literature, we define a hierarchical provenance framework offering "Levels of Provenance and Resource Sharing".
Each level of this framework addresses specific provenance recommendations and supports the capture of rich provenance information, with the topmost layer enabling the sharing of comprehensive and executable workflows utilising retrospective provenance. To realise this framework, we leverage community-driven, domain-neutral, platform-independent and open-source standards to implement "CWLProv" - a format for the methodical representation of provenance supporting workflow enactment aggregating resources specific to the given enactment and associated workflow configuration settings. We realise CWLProv through the Common Workflow Language (CWL) for workflow definition and utilise Research Objects (ROs) for resource aggregation and PROV-Data Model (PROV-DM) to support the capture of retrospective provenance information as required for subsequent workflow enactments. To demonstrate the applicability of CWLProv, we extend an existing workflow executor (cwltool) to provide a reference implementation that generates metadata and provenance-rich interoperable workflow-centric ROs. This approach aggregates and preserves data and methods needed to support the coherent sharing of computational analyses and experiments. Evaluation of CWLProv using real-life bioinformatics pipelines is demonstrated to highlight the utility of the approach demonstrating the interoperability of workflow analyses and the benefits to research reproducibility more generally

University of Melbourne Institutional Repository

CWL run of Somatic Variant Calling Workflow:CWLProv 0.5.0 Research Object

Author: Khan Farah Zaib
Soiland-Reyes Stian
Publication venue: Mendeley Data
Publication date: 04/12/2018
Field of study

The somatic variant calling workflow included in this case study is designed by Blue Collar Bioinformatics (bcbio), a community-driven initiative to develop best-practice pipelines for variant calling, RNA-seq and small RNA analysis workflows. According to the documentation, the goal of this project is to facilitate the automated analysis of high throughput data by making the resources quantifiable, analyzable, scalable, accessible and reproducible. All the underlying tools are containerized, facilitating software use in the workflow. The somatic variant calling workflow defined in CWL is available on GitHub and equipped with a well defined test dataset.

This dataset folder is a CWLProv Research Object that captures the Common Workflow Language execution provenance, see https://w3id.org/cwl/prov/0.5.0 or use https://pypi.org/project/cwlprov/ to explore.

Steps to reproduce

To build the research object again, use Python 3 on macOS. Built on:

Processor: 2.8GHz Intel Core i7
Memory: 16GB
OS: macOS High Sierra, Version 10.13.3
Storage: 250GB

To run the workflow:

    pip3 install cwltool==1.0.20180912090223
    git clone https://github.com/FarahZKhan/bcbio_test_cwlprov
    cd bcbio_test_cwlprov/somatic/somatic-workflow/
    cwltool --provenance somaticwf_0.5.0_mac main-somatic.cwl main-somatic-samples.json

To package the research object:

    zip -r somaticwf_0.5.0_mac.zip somaticwf_0.5.0_mac/
    sha256sum somaticwf_0.5.0_mac.zip > somaticwf_0.5.0_mac.zip.sha256

The cloned git repository is a fork of https://github.com/bcbio/test_bcbio_cwl. It was obtained using:

    wget -O test_bcbio_cwl.tar.gz https://github.com/bcbio/test_bcbio_cwl/archive/master.tar.gz

The content is from an archived version from the documentation here: https://bcbio-nextgen.readthedocs.io/en/latest/contents/cwl.html#install-bcbio-vm-with-containers

Mirrored from Mendeley Data https://data.mendeley.com/datasets/97hj93mkfd/
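The packaging step in this record (zip followed by sha256sum) can also be sketched in Python for platforms where those command line tools are not available. This is an illustrative equivalent using only the standard library, not part of the published record; the function names `sha256_of` and `package_research_object` are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def package_research_object(ro_dir: Path) -> Path:
    """Zip ro_dir next to itself and write a sha256sum-style sidecar file."""
    ro_dir = Path(ro_dir)
    # Creates <parent>/<name>.zip containing the folder <name>/ and its content.
    archive = Path(
        shutil.make_archive(
            str(ro_dir), "zip", root_dir=ro_dir.parent, base_dir=ro_dir.name
        )
    )
    # Same "<digest>  <file name>" layout that sha256sum prints.
    sidecar = archive.with_name(archive.name + ".sha256")
    sidecar.write_text(f"{sha256_of(archive)}  {archive.name}\n")
    return archive
```

Calling `package_research_object(Path("somaticwf_0.5.0_mac"))` would write the archive and its `.sha256` file alongside the research object folder, assuming that folder exists.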

ZENODO

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

The University of Manchester - Institutional Repository

CWL run of Alignment Workflow:CWLProv 0.6.0 Research Object

Author: Khan Farah Zaib
Soiland-Reyes Stian
Publication venue: Mendeley Data
Publication date: 04/12/2018
Field of study

This dataset folder is a CWLProv Research Object that captures the Common Workflow Language execution provenance, see CWLProv 0.6.0 or use the cwlprov Python tool to explore.

The CWL alignment workflow included in this case study is designed by Data Biosphere. It adapts the alignment pipeline originally developed at Abecasis Lab, The University of Michigan. This workflow is part of the NIH Data Commons initiative and comprises four stages. The first step, Pre-align, accepts a Compressed Alignment Map (CRAM) file (a compressed format for BAM files developed by the European Bioinformatics Institute (EBI)) and the human genome reference sequence as input and, using underlying SAMtools utilities such as view, sort and fixmate, returns a list of fastq files which can be used as input for the next step. The next step, Align, also accepts the human reference genome as input along with the output files from Pre-align and uses BWA-mem to generate aligned reads as BAM files. SAMBLASTER is used to mark duplicate reads and SAMtools view to convert read files from SAM to BAM format. The BAM files generated after Align are sorted with SAMtools sort. Finally, these sorted alignment files are merged to produce a single sorted BAM file using SAMtools merge in the Post-align step.

Steps to reproduce

This analysis was run using a 16-core Linux cloud instance with 64GB RAM and pre-installed Docker.

Install gsutil:

    export CLOUD_SDK_REPO="cloud-sdk-$(lsb_release -c -s)"
    echo "deb http://packages.cloud.google.com/apt $CLOUD_SDK_REPO main" | \
        sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
    curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | \
        sudo apt-key add -
    sudo apt-get update && sudo apt-get install google-cloud-sdk

Get the data and make the analysis environment ready:

    git clone https://github.com/FarahZKhan/topmed-workflows.git
    cd topmed-workflows
    git checkout cwlprov_testing
    cd aligner/sbg-alignment-cwl
    # this is a custom script download google bucket files from json files and create a local json
    # it needs gsutil to be installed though
    git clone https://github.com/DailyDreaming/fetch_gs_frm_json.git
    # Wait... this should download ~18Gb.
    python2.7 fetch_gs_frm_json/dl_gsfiles_frm_json.py topmed-alignment.sample.json

Run the following commands to create the CWLProv Research Object:

    time cwltool --no-match-user --provenance alignmnentwf0.6.0 --tmp-outdir-prefix=/CWLProv_workflow_testing/intermediate_temp/temp --tmpdir-prefix=/CWLProv_workflow_testing/intermediate_temp/temp topmed-alignment.cwl topmed-alignment.sample.json.new
    zip -r alignment_0.6.0_linux.zip alignment_0.6.0_linux
    sha256sum alignment_0.6.0_linux.zip > alignment_0.6.0_linux.zip.sha25

Mirror of Mendeley Data upload https://data.mendeley.com/datasets/6wtpgr3kbj/

ZENODO

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

The University of Manchester - Institutional Repository

CWL run of RNA-seq Analysis Workflow:CWLProv 0.5.0 Research Object

Author: Khan Farah Zaib
Soiland-Reyes Stian
Publication venue: Mendeley Data
Publication date: 04/12/2018
Field of study

This workflow adapts the approach and parameter settings of Trans-Omics for precision Medicine (TOPMed). The RNA-seq pipeline originated from the Broad Institute. There are five steps in total in the workflow. It starts with read alignment using STAR, which produces aligned BAM files including the Genome BAM and Transcriptome BAM. The Genome BAM file is processed using Picard MarkDuplicates, producing an updated BAM file containing information on duplicate reads (such reads can indicate biased interpretation). SAMtools index is then employed to generate an index for the BAM file, in preparation for the next step. The indexed BAM file is processed further with RNA-SeQC, which takes the BAM file, human genome reference sequence and Gene Transfer Format (GTF) file as inputs to generate transcriptome-level expression quantifications and standard quality control metrics. In parallel with transcript quantification, isoform expression levels are quantified by RSEM. This step depends only on the output of the STAR tool and additional RSEM reference sequences. For testing and analysis, the workflow author provided example data created by down-sampling the read files of TOPMed public access data. Chromosome 12 was extracted from the Homo sapiens Assembly 38 reference sequence and provided by the workflow authors. The required GTF and RSEM reference data files are also provided. The workflow is well documented, and a detailed set of instructions for the steps performed to down-sample the data is also provided for transparency. The availability of example input data, use of containerization for underlying software and detailed documentation are important factors in choosing this specific CWL workflow for CWLProv evaluation.
This dataset folder is a CWLProv Research Object that captures the Common Workflow Language execution provenance, see https://w3id.org/cwl/prov/0.5.0 or use https://pypi.org/project/cwl

Steps to reproduce

To build the research object again, use Python 3 on macOS. Built with:

Processor: 2.8GHz Intel Core i7
Memory: 16GB
OS: macOS High Sierra, Version 10.13.3
Storage: 250GB

Install cwltool:

    pip3 install cwltool==1.0.20180912090223

Install git lfs. The data download with the git repository requires the installation of Git lfs: https://www.atlassian.com/git/tutorials/git-lfs#installing-git-lfs

Get the data and make the analysis environment ready:

    git clone https://github.com/FarahZKhan/cwl_workflows.git
    cd cwl_workflows/
    git checkout CWLProvTesting
    ./topmed-workflows/TOPMed_RNAseq_pipeline/input-examples/download_examples.sh

Run the following commands to create the CWLProv Research Object:

    cwltool --provenance rnaseqwf_0.6.0_linux --tmp-outdir-prefix=/CWLProv_workflow_testing/intermediate_temp/temp --tmpdir-prefix=/CWLProv_workflow_testing/intermediate_temp/temp topmed-workflows/TOPMed_RNAseq_pipeline/rnaseq_pipeline_fastq.cwl topmed-workflows/TOPMed_RNAseq_pipeline/input-examples/Dockstore.json
    zip -r rnaseqwf_0.5.0_mac.zip rnaseqwf_0.5.0_mac
    sha256sum rnaseqwf_0.5.0_mac.zip > rnaseqwf_0.5.0_mac_mac.zip.sha256

The https://github.com/FarahZKhan/cwl_workflows repository is a frozen snapshot from https://github.com/heliumdatacommons/TOPMed_RNAseq_CWL commit 027e8af41b906173aafdb791351fb29efc044120

Mirror of Mendeley Data upload https://data.mendeley.com/datasets/xnwncxpw42/
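When such a run has to be scripted, the cwltool invocation from this record can be assembled as an argument list rather than a hand-typed string. The sketch below only builds and prints the command. It uses the options that appear in the record (`--provenance`, `--tmp-outdir-prefix`, `--tmpdir-prefix`); the helper name `cwltool_provenance_command` and the example paths are hypothetical.

```python
import shlex
from typing import List, Optional


def cwltool_provenance_command(
    ro_dir: str, workflow: str, job: str, tmp_prefix: Optional[str] = None
) -> List[str]:
    """Assemble a cwltool call that writes a provenance research object to ro_dir."""
    cmd = ["cwltool", "--provenance", ro_dir]
    if tmp_prefix is not None:
        # Keep intermediate and temporary files under one chosen prefix.
        cmd += [f"--tmp-outdir-prefix={tmp_prefix}", f"--tmpdir-prefix={tmp_prefix}"]
    cmd += [workflow, job]
    return cmd


if __name__ == "__main__":
    command = cwltool_provenance_command(
        "my_run_ro", "workflow.cwl", "job.json", tmp_prefix="/tmp/cwl/temp"
    )
    # Print a shell-quoted form for inspection; nothing is executed here.
    print(shlex.join(command))
```

The resulting list could be handed to `subprocess.run(command, check=True)` on a machine where cwltool is installed; passing a list avoids shell quoting problems with paths that contain spaces.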

ZENODO

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

The University of Manchester - Institutional Repository

common-workflow-language/cwlprov-py: cwlprov-py 0.1.1

Author: Farah Zaib Khan
Stian Soiland-Reyes
Publication venue
Publication date
Field of study

The cwlprov Python tool is a command line interface to validate and inspect CWLProv Research Objects that capture workflow runs, typically executed in a Common Workflow Language implementation.

Installation

You'll need Python 3. To install from pip try:

    pip3 install cwlprov

If you would rather install from this source code:

    pip3 install .

If you would like to use the cwltool rerun feature you may also need:

    pip3 install cwlref-runner

Usage

Use cwlprov --help to see all options.

Checksums

SHA256:

    c0bca8a038f130dc67f02d4a1c08757a3d7dd33b2219599e71267211a4e484b1  cwlprov-0.1.1-py3-none-any.whl
    aa12a6fd99875fd1adc40a12fe3e75812318bb2edb9e087b228d73920cfbe7ab  cwlprov-0.1.1.tar.g
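Published SHA256 checksums such as the ones in this record are only useful if a downloaded file is actually compared against them. The following is a minimal standard-library sketch of that check, assuming the common "digest, whitespace, file name" line layout; the function names `parse_checksums` and `matches_checksum` are hypothetical and are not part of the cwlprov tool.

```python
import hashlib
import hmac
from pathlib import Path
from typing import Dict


def parse_checksums(text: str) -> Dict[str, str]:
    """Map file name -> expected SHA-256 hex digest from 'digest name' lines."""
    expected = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only lines that look like a 64-character hex digest plus a name.
        if len(parts) == 2 and len(parts[0]) == 64:
            expected[parts[1]] = parts[0].lower()
    return expected


def matches_checksum(path: Path, expected_hex: str) -> bool:
    """Return True if the file's SHA-256 digest equals the expected hex digest."""
    actual = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    return hmac.compare_digest(actual, expected_hex.lower())
```

For a downloaded wheel, `matches_checksum(path_to_wheel, parse_checksums(listing)[wheel_name])` would return `False` if the file had been corrupted or altered in transit. Very large files would be better hashed in chunks than read into memory at once.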

ZENODO

Common-Workflow-Language/Cwlprov-Py: Cwlprov-Py 0.1.1

Author: Information Management
Khan Farah Zaib
Soiland-Reyes Stian
Publication venue: Zenodo
Publication date: 25/10/2018
Field of study

The University of Manchester - Institutional Repository