72 research outputs found

    The khmer software package: enabling efficient nucleotide sequence analysis

    The khmer package is a freely available software library for working efficiently with fixed length DNA words, or k-mers. khmer provides implementations of a probabilistic k-mer counting data structure, a compressible De Bruijn graph representation, De Bruijn graph partitioning, and digital normalization. khmer is implemented in C++ and Python, and is freely available under the BSD license at https://github.com/dib-lab/khmer/
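As an illustration of what a probabilistic k-mer counting data structure can look like, the following is a minimal Count-Min Sketch in Python. It is a sketch under stated assumptions, not khmer's implementation or API: the class name `KmerCountSketch`, the SHA-256 based hashing, and the table sizes are all hypothetical choices made for this example. Reverse-complement handling is deliberately left out.

```python
import hashlib


class KmerCountSketch:
    """Illustrative Count-Min Sketch for k-mers (hypothetical, not khmer's API).

    Reported counts can be overestimates because of hash collisions,
    but they are never lower than the true count.
    """

    def __init__(self, k, table_sizes):
        self.k = k
        # One counter table per size; distinct sizes make collisions
        # in one table unlikely to repeat in the others.
        self.tables = [[0] * size for size in table_sizes]

    def _slots(self, kmer):
        # Assumption: one 64-bit hash value, reduced modulo each table size.
        digest = hashlib.sha256(kmer.encode("ascii")).digest()
        value = int.from_bytes(digest[:8], "big")
        return [value % len(table) for table in self.tables]

    def add(self, kmer):
        for table, slot in zip(self.tables, self._slots(kmer)):
            table[slot] += 1

    def get(self, kmer):
        # The minimum across tables is the least-inflated estimate.
        return min(
            table[slot] for table, slot in zip(self.tables, self._slots(kmer))
        )

    def consume(self, sequence):
        # Slide a window of length k across the sequence.
        for i in range(len(sequence) - self.k + 1):
            self.add(sequence[i:i + self.k])


sketch = KmerCountSketch(k=4, table_sizes=[1009, 1013, 1019])
sketch.consume("ACGTACGTAC")
print(sketch.get("ACGT"))  # at least 2, the true count in this sequence
```

Memory use is fixed by the table sizes rather than by the number of distinct k-mers, which is the trade-off that makes this kind of structure attractive for large sequence data sets.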

    Supercomputing with MPI meets the Common Workflow Language standards: an experience report

    Use of standards-based workflows is still somewhat unusual among high-performance computing users. In this paper we describe the experience of using the Common Workflow Language (CWL) standards to describe the execution, in parallel, of MPI-parallelised applications. In particular, we motivate and describe the simple extension to the specification which was required, as well as our implementation of this within the CWL reference runner. We discuss some of the unexpected benefits, such as simple use of HPC-oriented performance measurement tools, and CWL software requirements interfacing with HPC module systems. We close with a request for comment from the community on how these features could be adopted within versions of the CWL standards. Comment: Submitted to 15th Workshop on Workflows in Support of Large-Scale Science (WORKS20

    CWL Viewer: The Common Workflow Language Viewer

    The Common Workflow Language (CWL) project emerged from the BOSC 2014 Codefest as a grassroots, multi-vendor working group to tackle the portability of data analysis workflows. Its specification for describing workflows and command line tools aims to make them portable and scalable across a variety of computing platforms. At its heart CWL is a set of structured text files (YAML) with various extensibility points to the format. However, the CWL syntax and multi-file collections are not conducive to workflow browsing, exchange and understanding: for this we need a visualization suite. CWL Viewer is a richly featured CWL visualization suite that graphically presents and lists the details of CWL workflows with their inputs, outputs and steps. It also packages the CWL files into a downloadable Research Object Bundle including attribution, versioning and dependency metadata in the manifest, allowing it to be easily shared. The tool operates over any workflow held in a GitHub repository. Other features include: path visualization from parent and child nodes; nested workflow support; workflow graph download in a range of image formats; a gallery of previously submitted workflows; and support for private git repositories and public GitHub including live updates over versioned workflows. The CWL Viewer is the de facto standard CWL visualization suite and has been enthusiastically received by the CWL community. Project Website: https://view.commonwl.org/ Source Code: https://github.com/common-workflow-language/cwlviewer https://doi.org/10.5281/zenodo.823535 Software License: Apache License, Version 2.0 Submitted abstract: CWL Viewer: The Common Workflow Language Viewer Technical Report: Reproducible Research using Research Objects https://doi.org/10.5281/zenodo.823295 CWL Viewer is live at https://view.commonwl.org/ Abstract peer-reviewed and accepted for poster+talk at BOSC 2017

    Channeling Community Contributions to Scientific Software: A Sprint Experience

    In 2014, the khmer software project participated in a two-day global sprint coordinated by the Mozilla Science Lab. We offered a mentored experience in contributing to a scientific software project for anyone who was interested. We provided entry-level tasks and worked with contributors as they worked through our development process. The experience was successful on both a social and a technical level, bringing in 13 contributions from 9 new contributors and validating our development process. In this experience paper we describe the sprint preparation and process, relate anecdotal experiences, and draw conclusions about what other projects could do to enable a similar outcome. The khmer software is developed openly at http://github.com/dib-lab/khmer/

    CWLProv - Interoperable Retrospective Provenance capture and its challenges

    The automation of data analysis in the form of scientific workflows is a widely adopted practice in many fields of research nowadays. Computationally driven data-intensive experiments using workflows enable Automation, Scaling, Adaption and Provenance support (ASAP). However, there are still several challenges associated with the effective sharing, publication, understandability and reproducibility of such workflows due to the incomplete capture of provenance and the dependence on particular technical (software) platforms. This paper presents CWLProv, an approach for retrospective provenance capture utilizing open source community-driven standards involving application and customization of workflow-centric Research Objects (ROs) (http://www.researchobject.org/). The ROs are produced as an output of a workflow enactment defined in the Common Workflow Language (CWL) (http://www.commonwl.org/) using the CWL reference implementation and its data structures. The approach aggregates and annotates all the resources involved in the scientific investigation including inputs, outputs, workflow specification, command line tool specifications and input parameter settings. The resources are linked within the RO to enable re-enactment of an analysis without depending on external resources. The workflow provenance profile is represented in W3C recommended standard PROV-N (https://www.w3.org/TR/prov-n/) and PROV-JSON (https://www.w3.org/Submission/prov-json/) format to capture retrospective provenance of the workflow enactment. The workflow-centric RO produced as an output of a CWL workflow enactment is expected to be interoperable, reusable, shareable and portable across different platforms. This paper describes the need and motivation for CWLProv (https://github.com/common-workflow-language/cwltool/tree/provenance) and the lessons learned in applying it for ROs using CWL in the bioinformatics domain.
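To make the idea of retrospective provenance in PROV-JSON more concrete, here is a minimal sketch that builds a small provenance record for one workflow run and serializes it. It follows the general layout of the W3C PROV-JSON submission (top-level `prefix`, `entity`, `activity`, and relation maps such as `used` and `wasGeneratedBy`), but it is not the CWLProv profile: the function name `build_provenance`, the `ex:` namespace, and all identifiers are assumptions made up for this example.

```python
import json


def build_provenance(run_id, input_name, output_name):
    """Return a PROV-JSON style dict for one run (illustrative identifiers only)."""
    run = f"ex:{run_id}"
    return {
        # Hypothetical namespace used only for this example.
        "prefix": {"ex": "http://example.org/"},
        "entity": {
            f"ex:{input_name}": {"prov:label": "workflow input"},
            f"ex:{output_name}": {"prov:label": "workflow output"},
        },
        "activity": {
            run: {"prov:label": "workflow enactment"},
        },
        # The run consumed the input ...
        "used": {
            "_:u1": {"prov:activity": run, "prov:entity": f"ex:{input_name}"},
        },
        # ... and produced the output.
        "wasGeneratedBy": {
            "_:g1": {"prov:entity": f"ex:{output_name}", "prov:activity": run},
        },
    }


record = build_provenance("run1", "reads.fastq", "aligned.bam")
print(json.dumps(record, indent=2))
```

A consumer can walk from an output entity through `wasGeneratedBy` to the run and then through `used` to the inputs, which is the kind of lineage question retrospective provenance is meant to answer.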

    Capturing interoperable reproducible workflows with Common Workflow Language

    We present our ongoing work on integrating Research Object practices with Common Workflow Language, capturing and describing prospective and retrospective provenance. Accepted for talk at RO2018. Web version at http://s11.no/2018/cwl.htm

    Sharing interoperable workflow provenance: A review of best practices and their practical application in CWLProv

    Background: The automation of data analysis in the form of scientific workflows has become a widely adopted practice in many fields of research. Computationally driven data-intensive experiments using workflows enable Automation, Scaling, Adaption and Provenance support (ASAP). However, there are still several challenges associated with the effective sharing, publication and reproducibility of such workflows due to the incomplete capture of provenance and lack of interoperability between different technical (software) platforms. Results: Based on best practice recommendations identified from literature on workflow design, sharing and publishing, we define a hierarchical provenance framework to achieve uniformity in the provenance and support comprehensive and fully re-executable workflows equipped with domain-specific information. To realise this framework, we present CWLProv, a standard-based format to represent any workflow-based computational analysis to produce workflow output artefacts that satisfy the various levels of provenance. We utilise open source community-driven standards; interoperable workflow definitions in Common Workflow Language (CWL), structured provenance representation using the W3C PROV model, and resource aggregation and sharing as workflow-centric Research Objects (RO) generated along with the final outputs of a given workflow enactment. We demonstrate the utility of this approach through a practical implementation of CWLProv and evaluation using real-life genomic workflows developed by independent groups. Conclusions: The underlying principles of the standards utilised by CWLProv enable semantically-rich and executable Research Objects that capture computational workflows with retrospective provenance such that any platform supporting CWL will be able to understand the analysis, re-use the methods for partial re-runs, or reproduce the analysis to validate the published findings. Submitted to GigaScience (GIGA-D-18-00483

    FAIR Computational Workflows

    Computational workflows describe the complex multi-step methods that are used for data collection, data preparation, analytics, predictive modelling, and simulation that lead to new data products. They can inherently contribute to the FAIR data principles: by processing data according to established metadata; by creating metadata themselves during the processing of data; and by tracking and recording data provenance. These properties aid data quality assessment and contribute to secondary data usage. Moreover, workflows are digital objects in their own right. This paper argues that FAIR principles for workflows need to address their specific nature in terms of their composition of executable software steps, their provenance, and their development. Accepted for Data Intelligence special issue: FAIR best practices 2019. Carole Goble acknowledges funding by BioExcel2 (H2020 823830), IBISBA1.0 (H2020 730976) and EOSCLife (H2020 824087). Daniel Schober's work was financed by Phenomenal (H2020 654241) at the initiation-phase of this effort, current work in kind contribution. Kristian Peters is funded by the German Network for Bioinformatics Infrastructure (de.NBI) and acknowledges BMBF funding under grant number 031L0107. Stian Soiland-Reyes is funded by BioExcel2 (H2020 823830). Daniel Garijo, Yolanda Gil, gratefully acknowledge support from DARPA award W911NF-18-1-0027, NIH award 1R01AG059874-01, and NSF award ICER-1740683

    Methods Included: Standardizing Computational Reuse and Portability with the Common Workflow Language

    A widely used standard for portable multilingual data analysis pipelines would enable considerable benefits to scholarly publication reuse, research/industry collaboration, regulatory cost control, and to the environment. Published research that used multiple computer languages for its analysis pipelines would include a complete and reusable description of that analysis that is runnable on a diverse set of computing environments. Researchers would be able to collaborate more easily and reuse these pipelines, adding or exchanging components regardless of programming language used; collaborations with and within the industry would be easier; approval of new medical interventions that rely on such pipelines would be faster. Time would be saved and environmental impact would also be reduced, as these descriptions contain enough information for advanced optimization without user intervention. Workflows are widely used in data analysis pipelines, enabling innovation and decision-making for modern society. In many domains the analysis components are numerous and written in multiple different computer languages by third parties. However, without a standard for reusable and portable multilingual workflows, reusing published multilingual workflows, collaborating on open problems, and optimizing their execution are severely hampered. Moreover, only a standard for multilingual data analysis pipelines that was widely used would enable considerable benefits to research-industry collaboration, regulatory cost control, and to preserving the environment. Prior to the start of the CWL project, there was no standard for describing multilingual analysis pipelines in a portable and reusable manner. Even today, although there exist hundreds of single-vendor and other single-source systems that run workflows, none is a general, community-driven, and consensus-built standard.

    The khmer software package: enabling efficient nucleotide sequence analysis [version 1; referees: 2 approved, 1 approved with reservations]

    The khmer package is a freely available software library for working efficiently with fixed length DNA words, or k-mers. khmer provides implementations of a probabilistic k-mer counting data structure, a compressible De Bruijn graph representation, De Bruijn graph partitioning, and digital normalization. khmer is implemented in C++ and Python, and is freely available under the BSD license at https://github.com/dib-lab/khmer/