46 research outputs found

    A case study for cloud based high throughput analysis of NGS data using the globus genomics system

    Next generation sequencing (NGS) technologies produce massive amounts of data requiring a powerful computational infrastructure, high quality bioinformatics software, and skilled personnel to operate the tools. We present a case study of a practical solution to this data management and analysis challenge that simplifies terabyte scale data handling and provides advanced tools for NGS data analysis. These capabilities are implemented using the “Globus Genomics” system, which is an enhanced Galaxy workflow system made available as a service that offers users the capability to process and transfer data easily, reliably and quickly to address end-to-end NGS analysis requirements. The Globus Genomics system is built on Amazon's cloud computing infrastructure. The system takes advantage of elastic scaling of compute resources to run multiple workflows in parallel and it also helps meet the scale-out analysis needs of modern translational genomics research.

    Cloud Computing for Next-Generation Sequencing Data Analysis

    High-throughput next-generation sequencing (NGS) technologies have evolved rapidly and are reshaping the scope of genomics research. The substantial decrease in the cost of NGS techniques in the past decade has led to its rapid adoption in biological research and drug development. Genomics studies of large populations are producing a huge amount of data, giving rise to computational issues around the storage, transfer, and analysis of the data. Fortunately, cloud computing has recently emerged as a viable option to quickly and easily acquire the computational resources for large-scale NGS data analyses. Some cloud-based applications and resources have been developed specifically to address the computational challenges of working with very large volumes of data generated by NGS technology. In this chapter, we will review some cloud-based systems and solutions for NGS data analysis, discuss the practical hurdles and limitations in cloud computing, including data transfer and security, and share the lessons we learned from the implementation of Rainbow, a cloud-based tool for large-scale genome sequencing data analysis

    Experiences Building Globus Genomics: A Next-Generation Sequencing Analysis Service using Galaxy, Globus, and Amazon Web Services

    We describe Globus Genomics, a system that we have developed for rapid analysis of large quantities of next-generation sequencing (NGS) genomic data. This system achieves a high degree of end-to-end automation that encompasses every stage of data analysis including initial data retrieval from remote sequencing centers or storage (via the Globus file transfer system); specification, configuration, and reuse of multi-step processing pipelines (via the Galaxy workflow system); creation of custom Amazon Machine Images and on-demand resource acquisition via a specialized elastic provisioner (on Amazon EC2); and efficient scheduling of these pipelines over many processors (via the HTCondor scheduler). The system allows biomedical researchers to perform rapid analysis of large NGS datasets in a fully automated manner, without software installation or a need for any local computing infrastructure. We report performance and cost results for some representative workloads.

    Improving the Performance of Cloud-based Scientific Services

    Cloud computing provides access to a large scale set of readily available computing resources at the click of a button. The cloud paradigm has commoditised computing capacity and is often touted as a low-cost model for executing and scaling applications. However, there are significant technical challenges associated with selecting, acquiring, configuring, and managing cloud resources which can restrict the efficient utilisation of cloud capabilities. Scientific computing is increasingly hosted on cloud infrastructure—in which scientific capabilities are delivered to the broad scientific community via Internet-accessible services. This migration from on-premise to on-demand cloud infrastructure is motivated by the sporadic usage patterns of scientific workloads and the associated potential cost savings without the need to purchase, operate, and manage compute infrastructure—a task that few scientific users are trained to perform. However, cloud platforms are not an automatic solution. Their flexibility is derived from an enormous number of services and configuration options, which in turn result in significant complexity for the user. In fact, naïve cloud usage can result in poor performance and excessive costs, which are then directly passed on to researchers. This thesis presents methods for developing efficient cloud-based scientific services. Three real-world scientific services are analysed and a set of common requirements are derived. To address these requirements, this thesis explores automated and scalable methods for inferring network performance, considers various trade-offs (e.g., cost and performance) when provisioning instances, and profiles application performance, all in heterogeneous and dynamic cloud environments. 
Specifically, network tomography provides the mechanisms to infer network performance in dynamic and opaque cloud networks; cost-aware automated provisioning approaches enable services to consider, in real-time, various trade-offs such as cost, performance, and reliability; and automated application profiling allows a huge search space of applications, instance types, and configurations to be analysed to determine resource requirements and application performance. Finally, these contributions are integrated into an extensible and modular cloud provisioning and resource management service called SCRIMP. Cloud-based scientific applications and services can subscribe to SCRIMP to outsource their provisioning, usage, and management of cloud infrastructures. Collectively, the approaches presented in this thesis are shown to provide order of magnitude cost savings and significant performance improvement when employed by production scientific services

    High performance computing in the cloud

    In recent years, interest in both scientific and business workflows has increased. A workflow is composed of a series of tools that must be executed in a predefined order to perform an analysis. Traditionally, these workflows were executed manually, with the output of one tool passed to the next one in the analysis process. Many applications that execute workflows automatically have appeared recently. These applications ease the work of users running their analyses. In addition, from the computational point of view, some workflows require a significant amount of resources. Consequently, workflow execution has moved from single workstations to distributed environments such as Grids or Clouds. Data management and task scheduling are required to execute workflows efficiently in such environments. In this thesis, we propose a cloud-based HPC environment, focusing on task scheduling, resource auto-scaling, data management and simplifying access to the resources with software clients. First, the cloud computing infrastructure is devised, which includes the base software (i.e. OpenStack) plus several additional modules aimed at improving authentication (i.e. LDAP) and data management (i.e. GridFTP, Globus Online and CloudFuse). Second, built on top of this infrastructure, the TORQUE distributed resource manager and the Maui scheduler have been configured to schedule and distribute tasks to the cloud-based workers. To reduce the number of idle nodes and the incurred cost of the active cloud resources, we also propose a configurable auto-scaling technique, which is able to scale the execution cluster depending on the workload. Additionally, to simplify task submission to the TORQUE execution cluster, we have interconnected the Galaxy workflow management system with it, so that users benefit from a simple way to execute their tasks.
Finally, we conducted an experimental evaluation, composed of a number of different studies with synthetic and real-world applications, to show the behaviour of the auto-scaled execution cluster managed by TORQUE and Maui. All experiments were performed in an OpenStack cloud computing environment, and the benchmarked applications belong to a benchmarking suite specially designed for workflow scheduling in cloud computing environments. Cybershake, Ligo and Montage were the synthetic applications selected from the benchmarking suite. GECKO and a GWAS pipeline represent the real-world test use cases, both having a diverse and heterogeneous set of tasks.

The numerous technological advances in data acquisition techniques allow the massive production of enormous amounts of data in diverse fields such as astronomy, health and social networks. Nowadays, only a small part of this data can be analysed because of the lack of computational resources. High Performance Computing (HPC) strategies represent the only way to analyse such an overwhelming amount of data. However, in general, HPC techniques require the use of big and expensive computing and storage infrastructures, usually not affordable or available for most users. Cloud computing, where users pay for the resources they need and when they actually need them, appears as an interesting alternative. Besides the savings in hardware infrastructure, cloud computing offers further advantages such as the removal of installation, administration and supplying requirements. In addition, it enables users to use better hardware than they can usually afford, to scale resources according to their needs, and to benefit from greater fault tolerance, amongst others. The efficient utilisation of HPC resources becomes a fundamental task, particularly in cloud computing.
We need to consider the cost of using HPC resources, especially in the case of cloud-based infrastructures, where users have to pay for storing, transferring and analysing data. Therefore, generic task scheduling and auto-scaling techniques are essential to exploit the computational resources efficiently. It is equally important to make these tasks user-friendly through the development of tools and applications (software clients) that act as an interface between the user and the infrastructure.
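
The abstract above says the execution cluster is scaled depending on the workload but does not state the rule used. As a rough illustration only, the sketch below shows one simple workload-based policy: size the cluster so that each worker handles a fixed number of tasks, clamped between a minimum and a maximum. The function names, the `tasks_per_worker` ratio, and the limits are assumptions made for this example, not the thesis's actual technique or any TORQUE/Maui API.

```python
import math


def desired_workers(queued_tasks, running_tasks, tasks_per_worker=4,
                    min_workers=1, max_workers=16):
    """Return the worker count a simple threshold policy would aim for.

    All parameters are hypothetical tuning knobs for this sketch.
    """
    if tasks_per_worker <= 0:
        raise ValueError("tasks_per_worker must be positive")
    demand = queued_tasks + running_tasks
    needed = math.ceil(demand / tasks_per_worker)
    # Clamp to the configured cluster size limits.
    return max(min_workers, min(max_workers, needed))


def scaling_action(current_workers, queued_tasks, running_tasks, **limits):
    """Positive result: workers to start. Negative: idle workers to stop."""
    target = desired_workers(queued_tasks, running_tasks, **limits)
    return target - current_workers
```

With these defaults, a cluster of 5 workers facing 10 queued and 2 running tasks would be asked to release 2 workers, which is how such a policy reduces the cost of idle nodes.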

    Laniakea: an open solution to provide Galaxy "on-demand" instances over heterogeneous cloud infrastructures

    Background: While the popular workflow manager Galaxy is currently made available through several publicly accessible servers, there are scenarios where users can be better served by full administrative control over a private Galaxy instance, including, but not limited to, concerns about data privacy, customisation needs, prioritisation of particular job types, tools development, and training activities. In such cases, a cloud-based Galaxy virtual instance represents an alternative that equips the user with complete control over the Galaxy instance itself without the burden of the hardware and software infrastructure involved in running and maintaining a Galaxy server. Results: We present Laniakea, a complete software solution to set up a “Galaxy on-demand” platform as a service. Building on the INDIGO-DataCloud software stack, Laniakea can be deployed over common cloud architectures usually supported both by public and private e-infrastructures. The user interacts with a Laniakea-based service through a simple front-end that allows a general setup of a Galaxy instance, and then Laniakea takes care of the automatic deployment of the virtual hardware and the software components. At the end of the process, the user gains access with full administrative privileges to a private, production-grade, fully customisable, Galaxy virtual instance and to the underlying virtual machine (VM). Laniakea features deployment of single-server or cluster-backed Galaxy instances, sharing of reference data across multiple instances, data volume encryption, and support for VM image-based, Docker-based, and Ansible recipe-based Galaxy deployments. A Laniakea-based Galaxy on-demand service, named Laniakea@ReCaS, is currently hosted at the ELIXIR-IT ReCaS cloud facility. Conclusions: Laniakea offers to scientific e-infrastructures a complete and easy-to-use software solution to provide a Galaxy on-demand service to their users.
Laniakea-based cloud services will help in making Galaxy more accessible to a broader user base by removing most of the burdens involved in deploying and running a Galaxy service. In turn, this will facilitate the adoption of Galaxy in scenarios where classic public instances do not represent an optimal solution. Finally, the implementation of Laniakea can be easily adapted and expanded to support different services and platforms beyond Galaxy

    Managing Workflows on top of a Cloud Computing Orchestrator for using heterogeneous environments on e-Science

    Scientific workflows (SWFs) are widely used to model processes in e-Science. SWFs are executed by means of workflow management systems (WMSs), which orchestrate the workload on top of computing infrastructures. The advent of cloud computing infrastructures has opened the door to using on-demand infrastructures to complement or even replace local infrastructures. However, new issues have arisen, such as the integration of hybrid resources or the compromise between infrastructure reutilisation and elasticity. In this article, we present an ad hoc solution for managing workflows exploiting the capabilities of cloud orchestrators to deploy resources on demand according to the workload and to combine heterogeneous cloud providers (such as on-premise clouds and public clouds) and traditional infrastructures (clusters) to minimise costs and response time. The work does not propose yet another WMS but demonstrates the benefits of the integration of cloud orchestration when running complex workflows. The article shows several configuration experiments from a realistic comparative genomics workflow called Orthosearch, to migrate memory-intensive workload to public infrastructures while keeping other blocks of the experiment running locally. The article computes running time and cost and suggests best practices.

    The authors acknowledge the support of the EUBrazilCC project, funded by the European Commission (STREP 614048) and the Brazilian MCT/CNPq N. 13/2012, for the use of its infrastructure. The authors would also like to thank the Spanish 'Ministerio de Economia y Competitividad' for the project 'Clusters Virtuales Elasticos y Migrables sobre Infraestructuras Cloud Hibridas' with reference TIN2013-44390-R.

    Carrión Collado, AA.; Caballer Fernández, M.; Blanquer Espert, I.; Kotowski, N.; Jardim, R.; Dávila, AMR. (2017). Managing Workflows on top of a Cloud Computing Orchestrator for using heterogeneous environments on e-Science. International Journal of Web and Grid Services. 13(4):375-402. doi:10.1504/IJWGS.2017.10003225

    Jetstream: A self-provisioned, scalable science and engineering cloud environment

    The paper describes the motivation behind Jetstream, its functions, hardware configuration, software environment, user interface, design, use cases, relationships with other projects such as Wrangler and iPlant, and challenges in implementation. Funded by the National Science Foundation Award #ACI - 144560

    Management of generic and multi-platform workflows for exploiting heterogeneous environments on e-Science

    Scientific Workflows (SWFs) are widely used to model applications in e-Science. In this programming model, scientific applications are described as a set of tasks that have dependencies among them. During the last decades, scientific workflows have been executed successfully on the available computing infrastructures (supercomputers, clusters and grids) using software programs called Workflow Management Systems (WMSs), which orchestrate the workload on top of these computing infrastructures. However, because each computing infrastructure has its own architecture and each scientific application exploits one of these infrastructures efficiently, it is necessary to organize the way in which they are executed. WMSs need to get the most out of all the available computing and storage resources. Traditionally, scientific workflow applications have been extensively deployed in high-performance computing infrastructures (such as supercomputers and clusters) and grids. In recent years, however, the advent of cloud computing infrastructures has opened the door to using on-demand infrastructures to complement or even replace local infrastructures. This has raised new issues, such as the integration of hybrid resources or the compromise between infrastructure reutilization and elasticity, all on the basis of cost-efficiency. The main contribution of this thesis is an ad-hoc solution for managing workflows exploiting the capabilities of cloud computing orchestrators to deploy resources on demand according to the workload and to combine heterogeneous cloud providers (such as on-premise clouds and public clouds) and traditional infrastructures (supercomputers and clusters) to minimize costs and response time. The thesis does not propose yet another WMS, but demonstrates the benefits of the integration of cloud orchestration when running complex workflows.
The thesis shows several configuration experiments and multiple heterogeneous backends from a realistic comparative genomics workflow called Orthosearch, to migrate memory-intensive workload to public infrastructures while keeping other blocks of the experiment running locally. The running time and cost of the experiments are computed and best practices are suggested.

    Carrión Collado, AA. (2017). Management of generic and multi-platform workflows for exploiting heterogeneous environments on e-Science [unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/86179
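
The abstract above describes a scientific workflow as a set of tasks with dependencies among them. A minimal sketch of that model follows, using Python's standard `graphlib` module to derive an order in which tasks could run. The task names are hypothetical placeholders and are not taken from Orthosearch or from the thesis; a real WMS would also handle resource selection and parallel dispatch, which this sketch leaves out.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
workflow = {
    "align": {"fetch_reads"},
    "call_variants": {"align"},
    "annotate": {"call_variants"},
    "report": {"annotate", "align"},
}


def execution_order(dependencies):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(workflow)
```

Every task appears in `order` only after all of the tasks it depends on, which is the property a workflow manager relies on before dispatching work to any backend.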