5 research outputs found

    BioWorkbench: A High-Performance Framework for Managing and Analyzing Bioinformatics Experiments

    Advances in sequencing techniques have led to exponential growth in biological data, demanding the development of large-scale bioinformatics experiments. Because these experiments are computation- and data-intensive, they require high-performance computing (HPC) techniques and can benefit from specialized technologies such as Scientific Workflow Management Systems (SWfMS) and databases. In this work, we present BioWorkbench, a framework for managing and analyzing bioinformatics experiments. The framework automatically collects provenance data, including both performance data from workflow execution and data from the scientific domain of the workflow application. Provenance data can be analyzed through a web application that abstracts a set of queries to the provenance database, simplifying access to provenance information. We evaluate BioWorkbench using three case studies: SwiftPhylo, a phylogenetic tree assembly workflow; SwiftGECKO, a comparative genomics workflow; and RASflow, a RASopathy analysis workflow. We analyze each workflow from both computational and scientific-domain perspectives using queries to a provenance and annotation database. Some of these queries are available as a pre-built feature of the BioWorkbench web application. Through the provenance data, we show that the framework is scalable and achieves high performance, reducing the execution time of the case studies by up to 98%. We also show how the application of machine learning techniques can enrich the analysis process.
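    The abstract does not describe BioWorkbench's provenance schema or query interface; the sketch below only illustrates the kind of performance query such a provenance database can abstract for the web application. It assumes a hypothetical SQLite table task_provenance with per-task start and end timestamps; the actual BioWorkbench schema and tooling may differ.

```python
# Minimal sketch: rank workflow tasks by total execution time from a provenance
# database. The database file, table name, and columns are assumptions for
# illustration only, not BioWorkbench's actual schema.
import sqlite3

def slowest_tasks(db_path: str, limit: int = 10):
    """Return (task_name, total_seconds) pairs for the longest-running tasks."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(
            """
            SELECT task_name, SUM(end_time - start_time) AS total_seconds
            FROM task_provenance
            GROUP BY task_name
            ORDER BY total_seconds DESC
            LIMIT ?
            """,
            (limit,),
        )
        return cur.fetchall()
    finally:
        conn.close()

if __name__ == "__main__":
    for name, seconds in slowest_tasks("provenance.db"):
        print(f"{name}: {seconds:.1f} s")
```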

    Irregular alignment of arbitrarily long DNA sequences on GPU

    The use of Graphics Processing Units (GPUs) to accelerate computational applications is increasingly widespread due to their affordability, flexibility, and performance. However, achieving top performance comes at the price of restricted data-parallelism models. In the case of sequence alignment, most GPU-based approaches focus on accelerating the Smith-Waterman dynamic programming algorithm because of its regularity. Nevertheless, its quadratic complexity makes it impractical for comparing long sequences, so heuristic methods are required to reduce the search space. We present GPUGECKO, a CUDA implementation of the sequential, seed-and-extend sequence-comparison algorithm GECKO. Our proposal includes optimized kernels based on collective operations, capable of producing arbitrarily long alignments while dealing with heterogeneous and unpredictable load. Contrary to other state-of-the-art methods, GPUGECKO employs a batching mechanism that prevents memory exhaustion by not requiring all alignments to fit into device memory at once, which makes it possible to run massive comparisons exhaustively with improved sensitivity while also providing up to a 6x average speedup with respect to the CUDA acceleration of BLASTN.

    Funding for open access publishing: Universidad Málaga/CBUA. This work has been partially supported by the European project ELIXIR-EXCELERATE (grant no. 676559), the Spanish national project Plataforma de Recursos Biomoleculares y Bioinformáticos (ISCIII-PT13.0001.0012 and ISCIII-PT17.0009.0022), the Fondo Europeo de Desarrollo Regional (UMA18-FEDERJA-156, UMA20-FEDERJA-059), the Junta de Andalucía (P18-FR-3130), the Instituto de Investigación Biomédica de Málaga (IBIMA), and the University of Málaga.
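    GPUGECKO itself is implemented in CUDA and its internals are not given here; the Python sketch below only illustrates the general batching idea described above, i.e. grouping an unpredictable number of seed extensions into chunks that fit a fixed device-memory budget instead of materializing every alignment at once. The Seed record, its estimated_bytes field, and the budget value are hypothetical placeholders, not part of GECKO or GPUGECKO.

```python
# Minimal sketch of a memory-budgeted batching loop. In a GPU implementation each
# batch would be handed to the extension kernels; here we only print batch sizes.
from dataclasses import dataclass
from typing import Iterable, Iterator, List

@dataclass
class Seed:
    query_pos: int
    target_pos: int
    estimated_bytes: int  # rough device memory needed to extend this seed (assumed)

def batches(seeds: Iterable[Seed], budget_bytes: int) -> Iterator[List[Seed]]:
    """Group seeds into batches whose combined footprint stays within the budget."""
    batch: List[Seed] = []
    used = 0
    for seed in seeds:
        if batch and used + seed.estimated_bytes > budget_bytes:
            yield batch
            batch, used = [], 0
        batch.append(seed)
        used += seed.estimated_bytes
    if batch:
        yield batch

if __name__ == "__main__":
    # Toy run: five 300-byte seeds against a 1 KiB budget split into two batches.
    seeds = [Seed(i * 100, i * 90, 300) for i in range(5)]
    for i, batch in enumerate(batches(seeds, budget_bytes=1024)):
        print(f"batch {i}: {len(batch)} seeds")
```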

    Computational Workflow for the Fine-Grained Analysis of Metagenomic Samples

    The development of new data-acquisition technologies has led to an enormous availability of information in almost every field of scientific research, while also enabling a specialization that results in domain-specific software developments. To make it easier for end users to obtain results from their data, a new computing paradigm has gained strength: automatic workflows for processing information, which have prevailed thanks to the support they provide for assembling a complete and robust processing system. Bioinformatics is a clear example, where many institutions offer specific processing services that, in general, need to be combined to obtain a global result. Workflow management systems such as Galaxy [1], Swift [2], or Taverna [3] are used for the analysis of data obtained, among other sources, from new DNA sequencing technologies such as Next Generation Sequencing [4], which produce huge amounts of data in the field of genomics and, in particular, metagenomics. Metagenomics studies the species present in an uncultured sample collected directly from the environment, and studies of interest try to observe variations in the composition of samples in order to identify significant differences that correlate with characteristics (phenotype) of the individuals from which the samples were taken; this includes the functional analysis of the species present in a metagenome in order to understand their consequences. Analyzing complete genomes is already computationally demanding, so analyzing metagenomes, which contain not the genome of a single species but those of the several species coexisting in the sample, becomes a herculean task. Metagenomic analysis therefore requires efficient algorithms capable of processing these data effectively and in reasonable time. Some of the difficulties to overcome are (1) comparing samples against reference databases, (2) mapping reads to genomes using similarity estimators, (3) the processed data are usually large and need practical forms of access, (4) the particularity of each sample requires specific, new programs for its analysis, (5) the visual representation of n-dimensional results for comprehension, and (6) the quality-control and confidence-verification processes at each stage. To address this, we present a complete yet adaptable workflow, divided into composable and reusable modules connected through defined data structures, which also allows easy extension and customization to satisfy the demands of new experiments.
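    The workflow's actual modules and data structures are not specified in this abstract; the sketch below only illustrates the modular, composable idea it describes. The Read record and the stage names quality_filter and assign_taxon are hypothetical stand-ins, not the workflow's real components.

```python
# Minimal sketch: each stage is a small, reusable callable that consumes and
# produces a defined record type, so stages can be swapped, reordered, or
# extended per experiment.
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List, Optional

@dataclass
class Read:
    read_id: str
    sequence: str
    taxon: Optional[str] = None

# A stage consumes a stream of reads and yields a filtered or annotated stream.
Stage = Callable[[Iterable[Read]], Iterator[Read]]

def quality_filter(min_length: int) -> Stage:
    """Drop reads shorter than min_length (a stand-in for a real QC step)."""
    def run(reads: Iterable[Read]) -> Iterator[Read]:
        return (r for r in reads if len(r.sequence) >= min_length)
    return run

def assign_taxon(reference: Dict[str, str]) -> Stage:
    """Toy 'mapping' stage: label each read by exact prefix lookup in a reference dict."""
    def run(reads: Iterable[Read]) -> Iterator[Read]:
        for r in reads:
            r.taxon = reference.get(r.sequence[:4], "unclassified")
            yield r
    return run

def pipeline(reads: Iterable[Read], stages: List[Stage]) -> List[Read]:
    """Chain the stages so each module can be developed and tested independently."""
    for stage in stages:
        reads = stage(reads)
    return list(reads)

if __name__ == "__main__":
    sample = [Read("r1", "ACGTACGT"), Read("r2", "AC")]
    stages = [quality_filter(4), assign_taxon({"ACGT": "Escherichia coli"})]
    for read in pipeline(sample, stages):
        print(read.read_id, read.taxon)
```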

    A Modular Parallel Pipeline Architecture for GWAS Applications in a Cluster Environment

    A Genome-Wide Association Study (GWAS) is an important bioinformatics method for associating variants with traits, identifying causes of diseases, and increasing plant and crop production. There are several optimizations for improving GWAS performance, including running applications in parallel. However, it can be difficult for researchers to work with different data types and workflows using existing approaches. A potential solution to this problem is to model GWAS algorithms as a set of modular tasks. In this thesis, a modular pipeline architecture for GWAS applications is proposed that can leverage a parallel computing environment as well as store and retrieve data using a shared data cache. To show that the proposed architecture increases the performance of GWAS applications, two case studies are conducted in which the proposed architecture is implemented on a bioinformatics pipeline package called TASSEL and a GWAS application called FaST-LMM, using both Apache Spark and Dask as the parallel processing framework and Redis as the shared data cache. The case studies implement parallel processing modules and shared data cache modules according to the specifications of the proposed architecture. Based on the case studies, a number of experiments are conducted that compare the performance of the implemented architecture in a cluster environment with that of the original programs. The experiments reveal that the modified applications indeed run faster than the original sequential programs. However, the modified applications do not scale with cluster resources, because the sequential part of the operations prevents the parallelization from achieving linear scalability. Finally, an evaluation of the architecture was conducted based on feedback from software developers and bioinformaticians. The evaluation reveals that the domain experts find the architecture useful; the implementations provide a sufficient performance improvement and are easy to use, although a GUI-based implementation would be preferable.
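    The thesis's actual module interfaces are not given in this abstract; the sketch below only illustrates the general pattern it describes, combining Dask for parallel task execution with Redis as a shared cache for intermediate results. The run_association function, the cache key scheme, and the local Redis instance are assumptions for illustration, not the TASSEL or FaST-LMM implementations.

```python
# Minimal sketch: per-chromosome GWAS tasks run in parallel with Dask, with
# results memoized in a shared Redis cache so repeated runs can skip finished work.
# Assumes a Redis server on localhost; run_association is a stand-in for a real
# association test (e.g. one FaST-LMM invocation).
import pickle

import dask
import redis

cache = redis.Redis(host="localhost", port=6379)

def run_association(chromosome: int) -> dict:
    # Placeholder for one modular GWAS task on a single chromosome.
    return {"chromosome": chromosome, "top_p_value": 1e-8 / chromosome}

def cached_association(chromosome: int) -> dict:
    """Fetch the result from the shared cache, computing and storing it on a miss."""
    key = f"assoc:chr{chromosome}"
    hit = cache.get(key)
    if hit is not None:
        return pickle.loads(hit)
    result = run_association(chromosome)
    cache.set(key, pickle.dumps(result))
    return result

def gwas_pipeline(chromosomes):
    """Schedule one cached task per chromosome and run them in parallel."""
    tasks = [dask.delayed(cached_association)(c) for c in chromosomes]
    return dask.compute(*tasks)

if __name__ == "__main__":
    for record in gwas_pipeline(range(1, 23)):
        print(record)
```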