
    Scalable genomic data management system on the cloud

    Thanks to the huge amount of sequenced data that is becoming available, building scalable solutions for supporting query processing and data analysis over genomics datasets is increasingly important. This paper presents GDMS, a scalable Genomic Data Management System for querying region-based genomic datasets; the focus of the paper is on the deployment of the system on a cluster hosted by CINECA.

    BioCloud Search EnGene: Surfing Biological Data on the Cloud

    The massive production and spread of biomedical data around the web introduces new challenges in identifying computational approaches that provide quality search and browsing of web resources. This paper presents BioCloud Search EnGene (BSE), a cloud application that facilitates searching and integration of the many layers of biological information offered by large-scale public genomic repositories. Grounded in the concept of dataspace, BSE is built on top of a cloud platform that severely curtails issues associated with scalability and performance. Like popular online gene portals, BSE adopts a gene-centric approach: researchers can find the information they are interested in by means of a simple “Google-like” query interface that accepts standard gene identifiers as keywords. We present the BSE architecture and functionality and discuss how our strategies help tackle big data problems in querying gene-based web resources. BSE is publicly available at: http://biocloud-unica.appspot.com/

    Design and evaluation of a genomics variant analysis pipeline using GATK Spark tools

    Scalable and efficient processing of genome sequence data, i.e. for variant discovery, is key to the mainstream adoption of high-throughput technology for disease prevention and for clinical use. Achieving scalability, however, requires a significant effort to enable the parallel execution of the analysis tools that make up the pipelines. This is facilitated by the new Spark versions of the well-known GATK toolkit, which offer a black-box approach by transparently exploiting the underlying MapReduce architecture. In this paper we report on our experience implementing a standard variant discovery pipeline using GATK 4.0 with Docker-based deployment over a cluster. We provide a preliminary performance analysis, comparing the processing times and cost to those of the new Microsoft Genomics Services.

    Leveraging OpenStack and Ceph for a Controlled-Access Data Cloud

    While traditional HPC has satisfied and continues to satisfy most workflows, a new generation of researchers has emerged looking for sophisticated, scalable, on-demand, and self-service control of compute infrastructure in a cloud-like environment. Many also seek safe harbors to operate on or store sensitive and/or controlled-access data in a high-capacity environment. To cater to these modern users, the Minnesota Supercomputing Institute designed and deployed Stratus, a locally hosted cloud environment powered by the OpenStack platform and backed by Ceph storage. The subscription-based service complements existing HPC systems by satisfying the following unmet needs of our users: a) on-demand availability of compute resources, b) long-running jobs (i.e., > 30 days), c) container-based computing with Docker, and d) adequate security controls to comply with controlled-access data requirements. This document provides an in-depth look at the design of Stratus with respect to security and compliance with the NIH's controlled-access data policy. Emphasis is placed on lessons learned while integrating OpenStack and Ceph features into a so-called "walled garden", and how those technologies influenced the security design. Many features of Stratus, including tiered secure storage with the introduction of a controlled-access data "cache", fault-tolerant live migrations, and fully integrated two-factor authentication, depend on recent OpenStack and Ceph features.
    Comment: 7 pages, 5 figures, PEARC '18: Practice and Experience in Advanced Research Computing, July 22-26, 2018, Pittsburgh, PA, US