12 research outputs found

    An Open Science Peer Review Oath

    One of the foundations of the scientific method is to be able to reproduce experiments and corroborate the results of research that has been done before. However, with the increasing complexities of new technologies and techniques, coupled with the specialisation of experiments, reproducing research findings has become a growing challenge. Clearly, scientific methods must be conveyed succinctly, and with clarity and rigour, in order for research to be reproducible. Here, we propose steps to help increase the transparency of the scientific method and the reproducibility of research results: specifically, we introduce a peer-review oath and accompanying manifesto. These have been designed to offer guidelines to enable reviewers (with the minimum friction or bias) to follow and apply open science principles, and support the ideas of transparency, reproducibility and ultimately greater societal impact. Introducing the oath and manifesto at the stage of peer review will help to check that the research being published includes everything that other researchers would need to successfully repeat the work. Peer review is the lynchpin of the publishing system: encouraging the community to consciously (and conscientiously) uphold these principles should help to improve published papers, increase confidence in the reproducibility of the work and, ultimately, provide strategic benefits to authors and their institutions.

    Approaches for Distributing Large Scale Bioinformatic Analyses

    Ever since high-throughput DNA sequencing became economically feasible, the amount of biological data has grown exponentially. This has been one of the biggest drivers in introducing high-performance computing (HPC) to the field of biology. Unlike physics and mathematics, biology education has not had a strong focus on programming or algorithmic development. This has forced many biology researchers to start learning a whole new skill set, and has introduced new challenges for those managing the HPC clusters. The aim of this thesis is to investigate the problems that arise when novice users use an HPC cluster for bioinformatics data analysis, and to explore approaches for mitigating them. In Paper 1 we quantify and visualise these problems and contrast them with the more computer-experienced user groups already using the HPC cluster. In Paper 2 we introduce a new workflow system (SciPipe), implemented as a Go library, as a way to organise and manage analysis steps. Paper 3 is aimed at cloud computing and how containerised tools can be used to run workflows without having to worry about software installations. In Paper 4 we demonstrate a fully automated cloud-based system for image-based cell profiling. Starting with a robotic arm in the lab, it covers all the steps from cell culture and microscopy to storing the cell-profiling results in a database and visualising them in a web interface.
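
    To give a flavour of the workflow system introduced in Paper 2, the sketch below is a minimal two-step SciPipe workflow in Go, closely following the hello-world pattern from the SciPipe documentation; exact method names and path-formatting details may differ between library versions.

```go
package main

import (
	// SciPipe workflow library (github.com/scipipe/scipipe)
	sp "github.com/scipipe/scipipe"
)

func main() {
	// A workflow with a name and a cap of 4 concurrently running tasks.
	wf := sp.NewWorkflow("hello_world", 4)

	// First process: write a greeting to a file.
	hello := wf.NewProc("hello", "echo 'Hello' > {o:out}")
	hello.SetOut("out", "hello.txt")

	// Second process: read the first file and write an extended greeting.
	world := wf.NewProc("world", "echo $(cat {i:in}) World > {o:out}")
	world.SetOut("out", "{i:in|%.txt}_world.txt")

	// Wire the output of "hello" into the input of "world" and run.
	world.In("in").From(hello.Out("out"))
	wf.Run()
}
```

    Each analysis step is declared as a shell-command pattern with named input ({i:...}) and output ({o:...}) placeholders, which lets the library keep the dependency graph explicit while still wrapping ordinary command-line bioinformatics tools.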

    Tracking the NGS revolution : managing life science research on shared high-performance computing clusters

    Background: Next-generation sequencing (NGS) has transformed the life sciences, and many research groups are newly dependent upon computer clusters to store and analyze large datasets. This creates challenges for e-infrastructures accustomed to hosting computationally mature research in other sciences. Using data gathered from our own clusters at UPPMAX computing center at Uppsala University, Sweden, where core hour usage of ∼800 NGS and ∼200 non-NGS projects is now similar, we compare and contrast the growth, administrative burden, and cluster usage of NGS projects with projects from other sciences. Results: The number of NGS projects has grown rapidly since 2010, with growth driven by entry of new research groups. Storage used by NGS projects has grown more rapidly since 2013 and is now limited by disk capacity. NGS users submit nearly twice as many support tickets per user, and 11 more tools are installed each month for NGS projects than for non-NGS projects. We developed usage and efficiency metrics and show that computing jobs for NGS projects use more RAM than non-NGS projects, are more variable in core usage, and rarely span multiple nodes. NGS jobs use booked resources less efficiently for a variety of reasons. Active monitoring can improve this somewhat. Conclusions: Hosting NGS projects imposes a large administrative burden at UPPMAX due to large numbers of inexperienced users and diverse and rapidly evolving research areas. We provide a set of recommendations for e-infrastructures that host NGS research projects. We provide anonymized versions of our storage, job, and efficiency databases.
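
    The abstract does not spell out its efficiency metrics, so the sketch below is only a hypothetical booked-versus-used formulation of the kind such monitoring might compute; the type and field names are illustrative and are not taken from the UPPMAX job or efficiency databases.

```go
package main

import "fmt"

// JobSample is a hypothetical record of a finished cluster job; the field
// names are illustrative and not taken from the databases described above.
type JobSample struct {
	BookedCores int     // cores reserved through the scheduler
	WallSeconds float64 // elapsed wall-clock time of the job
	CPUSeconds  float64 // CPU time actually consumed by all processes
	BookedRAMGB float64 // memory requested at submission
	PeakRAMGB   float64 // peak memory actually used
}

// CoreEfficiency is one plausible booked-versus-used metric: CPU time
// consumed divided by the core-seconds that were booked.
func CoreEfficiency(j JobSample) float64 {
	booked := float64(j.BookedCores) * j.WallSeconds
	if booked == 0 {
		return 0
	}
	return j.CPUSeconds / booked
}

// RAMEfficiency relates peak memory use to the memory that was booked.
func RAMEfficiency(j JobSample) float64 {
	if j.BookedRAMGB == 0 {
		return 0
	}
	return j.PeakRAMGB / j.BookedRAMGB
}

func main() {
	// A job that booked 16 cores for an hour but ran mostly single-threaded.
	j := JobSample{BookedCores: 16, WallSeconds: 3600, CPUSeconds: 4000, BookedRAMGB: 128, PeakRAMGB: 90}
	fmt.Printf("core efficiency: %.2f, RAM efficiency: %.2f\n", CoreEfficiency(j), RAMEfficiency(j))
}
```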

    MaRe : Processing Big Data with application containers on Apache Spark

    Background: Life science is increasingly driven by Big Data analytics, and the MapReduce programming model has been proven successful for data-intensive analyses. However, current MapReduce frameworks offer poor support for reusing existing processing tools in bioinformatics pipelines. Furthermore, these frameworks do not have native support for application containers, which are becoming popular in scientific data processing. Results: Here we present MaRe, an open source programming library that introduces support for Docker containers in Apache Spark. Apache Spark and Docker are the MapReduce framework and container engine that have gathered the largest open source communities; thus, MaRe provides interoperability with the cutting-edge software ecosystem. We demonstrate MaRe on 2 data-intensive applications in life science, showing ease of use and scalability. Conclusions: MaRe enables scalable data-intensive processing in life science with Apache Spark and application containers. When compared with current best practices, which involve the use of workflow systems, MaRe has the advantage of providing data locality, ingestion from heterogeneous storage systems, and interactive processing. MaRe is generally applicable and available as open source software.
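
    MaRe itself exposes a Scala API on top of Apache Spark, so the following Go sketch is not MaRe code; it only illustrates the underlying per-partition pattern the abstract describes, namely mounting one chunk of data into a Docker container and running an existing command-line tool over it. The paths, image name and command are made up for illustration.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// runToolOnPartition mounts a local directory holding one data partition into
// a Docker container and runs a shell command over it, writing results back
// to the same directory. It illustrates the per-partition map step described
// above; it is not the MaRe API, which is a Scala library on Apache Spark.
func runToolOnPartition(partitionDir, image, command string) error {
	absDir, err := filepath.Abs(partitionDir)
	if err != nil {
		return err
	}
	cmd := exec.Command("docker", "run", "--rm",
		"-v", absDir+":/data",
		image, "sh", "-c", command)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	// Hypothetical example: count A/T characters in the text files of one
	// partition using only standard tools inside a stock Ubuntu image.
	err := runToolOnPartition("./partition-0", "ubuntu:22.04",
		"cat /data/*.txt | grep -o '[AT]' | wc -l > /data/at_count.txt")
	if err != nil {
		fmt.Fprintln(os.Stderr, "container run failed:", err)
	}
}
```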

    Lessons learned from implementing a national infrastructure in Sweden for storage and analysis of next-generation sequencing data

    Analyzing and storing data and results from next-generation sequencing (NGS) experiments is a challenging task, hampered by ever-increasing data volumes and frequent updates of analysis methods and tools. Storage and computation have grown beyond the capacity of personal computers and there is a need for suitable e-infrastructures for processing. Here we describe UPPNEX, an implementation of such an infrastructure, tailored to the needs of data storage and analysis of NGS data in Sweden, serving various labs and multiple instruments from the major sequencing technology platforms. UPPNEX comprises resources for high-performance computing, large-scale and high-availability storage, an extensive bioinformatics software suite, up-to-date reference genomes and annotations, a support function with system and application experts, as well as a web portal and support ticket system. UPPNEX applications are numerous and diverse, and include whole-genome, de novo and exome sequencing, targeted resequencing, SNP discovery, RNA-Seq, and methylation analysis. Over 300 projects utilize UPPNEX, including large undertakings such as the sequencing of the flycatcher and the Norwegian spruce. We describe the strategic decisions made when investing in hardware, setting up maintenance and support, and allocating resources, and we illustrate major challenges such as managing data growth. We conclude by summarizing our experiences and observations with UPPNEX to date, providing insights into the successful and less successful decisions made.

    Applications and raw data for SciPipe genomics and transcriptomics case studies

    Accompanying applications and raw data for the genomics and transcriptomics (RNA-Seq) case studies for SciPipe [1], available at https://github.com/pharmbio/scipipe-demo

    [1] http://scipipe.org

    Rapid development of cloud-native intelligent data pipelines for scientific data streams using the HASTE Toolkit

    BACKGROUND: Large streamed datasets, characteristic of life science applications, are often resource-intensive to process, transport and store. We propose a pipeline model, a design pattern for scientific pipelines, where an incoming stream of scientific data is organized into a tiered or ordered "data hierarchy". We introduce the HASTE Toolkit, a proof-of-concept cloud-native software toolkit based on this pipeline model, to partition and prioritize data streams to optimize use of limited computing resources. FINDINGS: In our pipeline model, an "interestingness function" assigns an interestingness score to data objects in the stream, inducing a data hierarchy. From this score, a "policy" guides decisions on how to prioritize computational resource use for a given object. The HASTE Toolkit is a collection of tools to adopt this approach. We evaluate it with 2 microscopy imaging case studies. The first is a high-content screening experiment, where images are analyzed in an on-premise container cloud to prioritize storage and subsequent computation. The second considers edge processing of images for upload into the public cloud for real-time control of a transmission electron microscope. CONCLUSIONS: Through our evaluation, we created smart data pipelines capable of effective use of storage, compute, and network resources, enabling more efficient data-intensive experiments. We note a beneficial separation between the scientific concerns of data priority and the implementation of this behaviour for different resources in different deployment contexts. The toolkit allows intelligent prioritization to be "bolted on" to new and existing systems, and is intended for use with a range of technologies in different deployment scenarios. (Spjuth and Hellander shared senior authorship.)
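
    The interestingness function and policy described above can be pictured as two small function types: one scores an incoming object, the other maps that score to a tier that governs storage and compute priority. The Go sketch below is a conceptual illustration under that reading; the type names, fields and thresholds are assumptions and do not come from the HASTE Toolkit itself.

```go
package main

import "fmt"

// DataObject stands in for one item of a scientific data stream, e.g. a
// microscopy image; the type and its fields are illustrative, not HASTE API.
type DataObject struct {
	ID    string
	Focus float64 // a hypothetical image-quality measure in [0, 1]
}

// InterestingnessFunction assigns each object a score that induces the
// data hierarchy described above.
type InterestingnessFunction func(DataObject) float64

// Policy maps a score onto a tier, which downstream components use to
// prioritize storage and computation for that object.
type Policy func(score float64) (tier string)

func main() {
	// A toy interestingness function: sharper images score higher.
	var interestingness InterestingnessFunction = func(obj DataObject) float64 {
		if obj.Focus > 1 {
			return 1
		}
		return obj.Focus
	}

	// A toy three-tier policy: analyze now, keep for later, or downsample.
	var policy Policy = func(score float64) string {
		switch {
		case score >= 0.8:
			return "analyze-immediately"
		case score >= 0.3:
			return "store-for-later"
		default:
			return "downsample-or-discard"
		}
	}

	for _, obj := range []DataObject{{"img-001", 0.95}, {"img-002", 0.41}, {"img-003", 0.05}} {
		score := interestingness(obj)
		fmt.Printf("%s -> score %.2f -> tier %s\n", obj.ID, score, policy(score))
	}
}
```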