167 research outputs found

    Selected approaches and frameworks to carry out genomic data provision and analysis on the cloud

    Full text link
    While High Performance Computing clouds allow researchers to process large amounts of genomic data, complex resource and software configuration tasks must be carried out beforehand. The current trend exposes applications and data as services, simplifying access to clouds. This paper examines commonly used cloud-based genomic analysis services, introduces the approach of exposing data as services and proposes two new solutions (HPCaaS and Uncinus) which aim to automate service development, deployment process and data provision. By comparing and contrasting these solutions, we identify key mechanisms of service creation, execution and data access required to support non-computing specialists employing clouds

    Exploiting Parallel R in the Cloud with SPRINT

    Get PDF
    BACKGROUND: Advances in DNA Microarray devices and next-generation massively parallel DNA sequencing platforms have led to an exponential growth in data availability but the arising opportunities require adequate computing resources. High Performance Computing (HPC) in the Cloud offers an affordable way of meeting this need. OBJECTIVES: Bioconductor, a popular tool for high-throughput genomic data analysis, is distributed as add-on modules for the R statistical programming language but R has no native capabilities for exploiting multi-processor architectures. SPRINT is an R package that enables easy access to HPC for genomics researchers. This paper investigates: setting up and running SPRINT-enabled genomic analyses on Amazon’s Elastic Compute Cloud (EC2), the advantages of submitting applications to EC2 from different parts of the world and, if resource underutilization can improve application performance. METHODS: The SPRINT parallel implementations of correlation, permutation testing, partitioning around medoids and the multi-purpose papply have been benchmarked on data sets of various size on Amazon EC2. Jobs have been submitted from both the UK and Thailand to investigate monetary differences. RESULTS: It is possible to obtain good, scalable performance but the level of improvement is dependent upon the nature of algorithm. Resource underutilization can further improve the time to result. End-user’s location impacts on costs due to factors such as local taxation. Conclusions: Although not designed to satisfy HPC requirements, Amazon EC2 and cloud computing in general provides an interesting alternative and provides new possibilities for smaller organisations with limited funds

    Experiences with workflows for automating data-intensive bioinformatics

    Get PDF
    High-throughput technologies, such as next-generation sequencing, have turned molecular biology into a data-intensive discipline, requiring bioinformaticians to use high-performance computing resources and carry out data management and analysis tasks on large scale. Workflow systems can be useful to simplify construction of analysis pipelines that automate tasks, support reproducibility and provide measures for fault-tolerance. However, workflow systems can incur significant development and administration overhead so bioinformatics pipelines are often still built without them. We present the experiences with workflows and workflow systems within the bioinformatics community participating in a series of hackathons and workshops of the EU COST action SeqAhead. The organizations are working on similar problems, but we have addressed them with different strategies and solutions. This fragmentation of efforts is inefficient and leads to redundant and incompatible solutions. Based on our experiences we define a set of recommendations for future systems to enable efficient yet simple bioinformatics workflow construction and execution.Pubblicat

    The BioExcel methodology for developing dynamic, scalable, reliable and portable computational biomolecular workflows

    Get PDF
    Developing complex biomolecular workflows is not always straightforward. It requires tedious developments to enable the interoperability between the different biomolecular simulation and analysis tools. Moreover, the need to execute the pipelines on distributed systems increases the complexity of these developments. To address these issues, we propose a methodology to simplify the implementation of these workflows on HPC infrastructures. It combines a library, the BioExcel Building Blocks (BioBBs), that allows scientists to implement biomolecular pipelines as Python scripts, and the PyCOMPSs programming framework which allows to easily convert Python scripts into task-based parallel workflows executed in distributed computing systems such as HPC clusters, clouds, containerized platforms, etc. Using this methodology, we have implemented a set of computational molecular workflows and we have performed several experiments to validate its portability, scalability, reliability and malleability.This work has been supported by Spanish Ministry of Science and Innovation MCIN/AEI/10.13039/501100011033 under contract PID2019-107255GB-C21, by the Generalitat de Catalunya under contracts 2017-SGR-01414 and 2017-SGR1110, by the European Commission through the BioExcel Center of Excellence (Horizon 2020 Framework program) under contracts 823830, and 675728. This work is also partially supported by the CECH project which has been co-funded with 50% by the European Regional Development Fund under the framework of the ERFD Operative Programme for Catalunya 2014-2020, with a grant of 1.527.637,88€.Peer ReviewedPostprint (author's final draft

    Service and cloud computing supporting genomic analysis of the mammalian species

    Full text link
     This research focused on building Software as a Service clouds to support mammalian genomic applications such as personalized medicine. Outcomes of this research included a Software as a Service cloud framework, the Uncinus research cloud and novel genomic analysis software. Results have been published in high ranking peer-reviewed international journals

    StreamFlow: cross-breeding cloud with HPC

    Get PDF
    Workflows are among the most commonly used tools in a variety of execution environments. Many of them target a specific environment; few of them make it possible to execute an entire workflow in different environments, e.g. Kubernetes and batch clusters. We present a novel approach to workflow execution, called StreamFlow, that complements the workflow graph with the declarative description of potentially complex execution environments, and that makes it possible the execution onto multiple sites not sharing a common data space. StreamFlow is then exemplified on a novel bioinformatics pipeline for single-cell transcriptomic data analysis workflow.Comment: 30 pages - 2020 IEEE Transactions on Emerging Topics in Computin

    Large-Scale Data Management and Analysis (LSDMA) - Big Data in Science

    Get PDF

    CERN openlab Whitepaper on Future IT Challenges in Scientific Research

    Get PDF
    This whitepaper describes the major IT challenges in scientific research at CERN and several other European and international research laboratories and projects. Each challenge is exemplified through a set of concrete use cases drawn from the requirements of large-scale scientific programs. The paper is based on contributions from many researchers and IT experts of the participating laboratories and also input from the existing CERN openlab industrial sponsors. The views expressed in this document are those of the individual contributors and do not necessarily reflect the view of their organisations and/or affiliates

    Programming models to support data science workflows

    Get PDF
    Data Science workflows have become a must to progress in many scientific areas such as life, health, and earth sciences. In contrast to traditional HPC workflows, they are more heterogeneous; combining binary executions, MPI simulations, multi-threaded applications, custom analysis (possibly written in Java, Python, C/C++ or R), and real-time processing. Furthermore, in the past, field experts were capable of programming and running small simulations. However, nowadays, simulations requiring hundreds or thousands of cores are widely used and, to this point, efficiently programming them becomes a challenge even for computer sciences. Thus, programming languages and models make a considerable effort to ease the programmability while maintaining acceptable performance. This thesis contributes to the adaptation of High-Performance frameworks to support the needs and challenges of Data Science workflows by extending COMPSs, a mature, general-purpose, task-based, distributed programming model. First, we enhance our prototype to orchestrate different frameworks inside a single programming model so that non-expert users can build complex workflows where some steps require highly optimised state of the art frameworks. This extension includes the @binary, @OmpSs, @MPI, @COMPSs, and @MultiNode annotations for both Java and Python workflows. Second, we integrate container technologies to enable developers to easily port, distribute, and scale their applications to distributed computing platforms. This combination provides a straightforward methodology to parallelise applications from sequential codes along with efficient image management and application deployment that ease the packaging and distribution of applications. We distinguish between static, HPC, and dynamic container management and provide representative use cases for each scenario using Docker, Singularity, and Mesos. Third, we design, implement and integrate AutoParallel, a Python module to automatically find an appropriate task-based parallelisation of affine loop nests and execute them in parallel in a distributed computing infrastructure. It is based on sequential programming and requires one single annotation (the @parallel Python decorator) so that anyone with intermediate-level programming skills can scale up an application to hundreds of cores. Finally, we propose a way to extend task-based management systems to support continuous input and output data to enable the combination of task-based workflows and dataflows (Hybrid Workflows) using one single programming model. Hence, developers can build complex Data Science workflows with different approaches depending on the requirements without the effort of combining several frameworks at the same time. Also, to illustrate the capabilities of Hybrid Workflows, we have built a Distributed Stream Library that can be easily integrated with existing task-based frameworks to provide support for dataflows. The library provides a homogeneous, generic, and simple representation of object and file streams in both Java and Python; enabling complex workflows to handle any data type without dealing directly with the streaming back-end.Els fluxos de treball de Data Science s’han convertit en una necessitat per progressar en moltes àrees científiques com les ciències de la vida, la salut i la terra. A diferència dels fluxos de treball tradicionals per a la CAP, els fluxos de Data Science són més heterogenis; combinant l’execució de binaris, simulacions MPI, aplicacions multiprocés, anàlisi personalitzats (possiblement escrits en Java, Python, C / C ++ o R) i computacions en temps real. Mentre que en el passat els experts de cada camp eren capaços de programar i executar petites simulacions, avui dia, aquestes simulacions representen un repte fins i tot per als experts ja que requereixen centenars o milers de nuclis. Per aquesta raó, els llenguatges i models de programació actuals s’esforcen considerablement en incrementar la programabilitat mantenint un rendiment acceptable. Aquesta tesi contribueix a l’adaptació de models de programació per a la CAP per afrontar les necessitats i reptes dels fluxos de Data Science estenent COMPSs, un model de programació distribuïda madur, de propòsit general, i basat en tasques. En primer lloc, millorem el nostre prototip per orquestrar diferent programari per a que els usuaris no experts puguin crear fluxos complexos usant un únic model on alguns passos requereixin tecnologies altament optimitzades. Aquesta extensió inclou les anotacions de @binary, @OmpSs, @MPI, @COMPSs, i @MultiNode per a fluxos en Java i Python. En segon lloc, integrem tecnologies de contenidors per permetre als desenvolupadors portar, distribuir i escalar fàcilment les seves aplicacions en plataformes distribuïdes. A més d’una metodologia senzilla per a paral·lelitzar aplicacions a partir de codis seqüencials, aquesta combinació proporciona una gestió d’imatges i una implementació d’aplicacions eficients que faciliten l’empaquetat i la distribució d’aplicacions. Distingim entre la gestió de contenidors estàtica, CAP i dinàmica i proporcionem casos d’ús representatius per a cada escenari amb Docker, Singularity i Mesos. En tercer lloc, dissenyem, implementem i integrem AutoParallel, un mòdul de Python per determinar automàticament la paral·lelització basada en tasques de nius de bucles afins i executar-los en paral·lel en una infraestructura distribuïda. AutoParallel està basat en programació seqüencial, requereix una sola anotació (el decorador @parallel) i permet a un usuari intermig escalar una aplicació a centenars de nuclis. Finalment, proposem una forma d’estendre els sistemes basats en tasques per admetre dades d’entrada i sortida continus; permetent així la combinació de fluxos de treball i dades (Fluxos Híbrids) en un únic model. Conseqüentment, els desenvolupadors poden crear fluxos complexos seguint diferents patrons sense l’esforç de combinar diversos models al mateix temps. A més, per a il·lustrar les capacitats dels Fluxos Híbrids, hem creat una biblioteca (DistroStreamLib) que s’integra fàcilment amb els models basats en tasques per suportar fluxos de dades. La biblioteca proporciona una representació homogènia, genèrica i simple de seqüències contínues d’objectes i arxius en Java i Python; permetent gestionar qualsevol tipus de dades sense tractar directament amb el back-end de streaming.Los flujos de trabajo de Data Science se han convertido en una necesidad para progresar en muchas áreas científicas como las ciencias de la vida, la salud y la tierra. A diferencia de los flujos de trabajo tradicionales para la CAP, los flujos de Data Science son más heterogéneos; combinando la ejecución de binarios, simulaciones MPI, aplicaciones multiproceso, análisis personalizados (posiblemente escritos en Java, Python, C/C++ o R) y computaciones en tiempo real. Mientras que en el pasado los expertos de cada campo eran capaces de programar y ejecutar pequeñas simulaciones, hoy en día, estas simulaciones representan un desafío incluso para los expertos ya que requieren cientos o miles de núcleos. Por esta razón, los lenguajes y modelos de programación actuales se esfuerzan considerablemente en incrementar la programabilidad manteniendo un rendimiento aceptable. Esta tesis contribuye a la adaptación de modelos de programación para la CAP para afrontar las necesidades y desafíos de los flujos de Data Science extendiendo COMPSs, un modelo de programación distribuida maduro, de propósito general, y basado en tareas. En primer lugar, mejoramos nuestro prototipo para orquestar diferentes software para que los usuarios no expertos puedan crear flujos complejos usando un único modelo donde algunos pasos requieran tecnologías altamente optimizadas. Esta extensión incluye las anotaciones de @binary, @OmpSs, @MPI, @COMPSs, y @MultiNode para flujos en Java y Python. En segundo lugar, integramos tecnologías de contenedores para permitir a los desarrolladores portar, distribuir y escalar fácilmente sus aplicaciones en plataformas distribuidas. Además de una metodología sencilla para paralelizar aplicaciones a partir de códigos secuenciales, esta combinación proporciona una gestión de imágenes y una implementación de aplicaciones eficientes que facilitan el empaquetado y la distribución de aplicaciones. Distinguimos entre gestión de contenedores estática, CAP y dinámica y proporcionamos casos de uso representativos para cada escenario con Docker, Singularity y Mesos. En tercer lugar, diseñamos, implementamos e integramos AutoParallel, un módulo de Python para determinar automáticamente la paralelización basada en tareas de nidos de bucles afines y ejecutarlos en paralelo en una infraestructura distribuida. AutoParallel está basado en programación secuencial, requiere una sola anotación (el decorador @parallel) y permite a un usuario intermedio escalar una aplicación a cientos de núcleos. Finalmente, proponemos una forma de extender los sistemas basados en tareas para admitir datos de entrada y salida continuos; permitiendo así la combinación de flujos de trabajo y datos (Flujos Híbridos) en un único modelo. Consecuentemente, los desarrolladores pueden crear flujos complejos siguiendo diferentes patrones sin el esfuerzo de combinar varios modelos al mismo tiempo. Además, para ilustrar las capacidades de los Flujos Híbridos, hemos creado una biblioteca (DistroStreamLib) que se integra fácilmente a los modelos basados en tareas para soportar flujos de datos. La biblioteca proporciona una representación homogénea, genérica y simple de secuencias continuas de objetos y archivos en Java y Python; permitiendo manejar cualquier tipo de datos sin tratar directamente con el back-end de streaming
    corecore