92 research outputs found

    Genome sequence alignment in processing-In-memory architectures

    Get PDF
    Finalmente, también realizamos un estudio experimental de varias arquitecturas con diferentes tecnologías de memoria (DDR y HBM) y núcleos de procesamiento de distintos tipos, explotando, en algunos casos, procesamiento en la memoria (PIM). La aplicación de referencia es Bowtie2, una aplicación completa para el alineamiento de secuencias en el genoma. La implementación y evaluación de estas arquitecturas se realiza utilizando un simulador arquitectural basado en gem5.La combinación de la aparición de un cuello de botella en el acceso a los datos y la creciente importancia de las aplicaciones de procesamiento intensivo de datos, muy limitadas por el sistema de memoria, crea un importante problema que debe ser abordado. Por ello, en esta tesis nos proponemos afrontar este problema e intentar reducir su efecto en la medida de lo posible. El principal objetivo de esta tesis es el diseño de nuevas soluciones arquitecturales y algorítmicas para superar el problema del cuello de botella conocido como memory-wall y mejorar el rendimiento de aplicaciones con gran uso de memoria que no son capaces de beneficiarse lo suficiente de las jerarquías de memoria actuales. Además, creemos que es esencial centrarse en la eficiencia energética, un factor cuya importancia crece cada día y uno de los factores más limitantes en la computación de alto rendimiento. Las principales contribuciones de esta tesis son: Primero, analizamos el comportamiento de aplicaciones con accesos de memoria aleatorios, que no aprovechan correctamente las nuevas arquitecturas de memoria con jerarquías cache profundas. Específicamente, analizamos la estructura de datos FM-index y un algoritmo de búsqueda de secuencias basado en esa estructura, ampliamente usado en el alineamiento de secuencias en el genoma. Después de este análisis y de obtener un conocimiento más detallado del cuello de botella de la memoria, proponemos una nueva versión de FM-index que permite reducir el consumo de ancho de banda de memoria, de forma que mejora significativamente el rendimiento computacional. Posteriormente, proponemos una nueva arquitectura energéticamente eficiente, basada en un cubo de memoria en 3D (3D-Stacked) al que añadimos unos núcleos sencillos de bajo consumo en su capa lógica. Esta arquitectura permite la ejecución cerca de los datos (near-data-processing

    FPGA acceleration of DNA sequence alignment: design analysis and optimization

    Get PDF
    Existing FPGA accelerators for short read mapping often fail to utilize the complete biological information in sequencing data for simple hardware design, leading to missed or incorrect alignment. In this work, we propose a runtime reconfigurable alignment pipeline that considers all information in sequencing data for the biologically accurate acceleration of short read mapping. We focus our efforts on accelerating two string matching techniques: FM-index and the Smith-Waterman algorithm with the affine-gap model which are commonly used in short read mapping. We further optimize the FPGA hardware using a design analyzer and merger to improve alignment performance. The contributions of this work are as follows. 1. We accelerate the exact-match and mismatch alignment by leveraging the FM-index technique. We optimize memory access by compressing the data structure and interleaving the access with multiple short reads. The FM-index hardware also considers complete information in the read data to maximize accuracy. 2. We propose a seed-and-extend model to accelerate alignment with indels. The FM-index hardware is extended to support the seeding stage while a Smith-Waterman implementation with the affine-gap model is developed on FPGA for the extension stage. This model can improve the efficiency of indel alignment with comparable accuracy versus state-of-the-art software. 3. We present an approach for merging multiple FPGA designs into a single hardware design, so that multiple place-and-route tasks can be replaced by a single task to speed up functional evaluation of designs. We first experiment with this approach to demonstrate its feasibility for different designs. Then we apply this approach to optimize one of the proposed FPGA aligners for better alignment performance.Open Acces

    High Performance Computing for DNA Sequence Alignment and Assembly

    Get PDF
    Recent advances in DNA sequencing technology have dramatically increased the scale and scope of DNA sequencing. These data are used for a wide variety of important biological analyzes, including genome sequencing, comparative genomics, transcriptome analysis, and personalized medicine but are complicated by the volume and complexity of the data involved. Given the massive size of these datasets, computational biology must draw on the advances of high performance computing. Two fundamental computations in computational biology are read alignment and genome assembly. Read alignment maps short DNA sequences to a reference genome to discover conserved and polymorphic regions of the genome. Genome assembly computes the sequence of a genome from many short DNA sequences. Both computations benefit from recent advances in high performance computing to efficiently process the huge datasets involved, including using highly parallel graphics processing units (GPUs) as high performance desktop processors, and using the MapReduce framework coupled with cloud computing to parallelize computation to large compute grids. This dissertation demonstrates how these technologies can be used to accelerate these computations by orders of magnitude, and have the potential to make otherwise infeasible computations practical

    Porting and optimizing BWA-MEM2 using the Fujitsu A64FX processor

    Get PDF
    Sequence alignment pipelines for human genomes are an emerging workload that will dominate in the precision medicine field. BWA-MEM2 is a tool widely used in the scientific community to perform read mapping studies. In this paper, we port BWA-MEM2 to the AArch64 architecture using the ARMv8-A specification, and we compare the resulting version against an Intel Skylake system both in performance and in energy-to-solution. The porting effort entails numerous code modifications, since BWA-MEM2 implements certain kernels using x86_64 specific intrinsics, e.g., AVX-512. To adapt this code we use the recently introduced Arm's Scalable Vector Extensions (SVE). More specifically, we use Fujitsu's A64FX processor, the first to implement SVE. The A64FX powers the Fugaku Supercomputer that led the Top500 ranking from June 2020 to November 2021. After porting BWA-MEM2 we define and implement a number of optimizations to improve performance in the A64FX target architecture. We show that while the A64FX performance is lower than that of the Skylake system, A64FX delivers 11.6% better energy-to-solution on average. All the code used for this article is available at https://gitlab.bsc.es/rlangari/bwa-a64fxThis work has been partially supported by the Spanish Ministry of Economy and Competitiveness (contracts PID2019-107255GB-C21 / AEI /10.13039/501100011033 and PID2019-105660RB-C21 / AEI / 10.13039/501100011033), Gobierno de Aragon (T5820R research group), the Generalitat de Catalunya (contracts 2017-SGR-1328 and 2017-SGR1414), and the European Union’s Horizon 2020 research and innovation program (Mont-Blanc 2020 project, grant agreement 779877). Finally, A. Armejach and M. Moreto have been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Juan de la Cierva fellowship no. IJCI-2017-33945 and Ramon y Cajal fellowship no. RYC-2016-21104, respectively.Peer ReviewedPostprint (author's final draft

    Porting and optimizing BWA-MEM2 using the Fujitsu A64FX processor

    Get PDF
    Sequence alignment pipelines for human genomes are an emerging workload that will dominate in the precision medicine field. BWA-MEM2 is a tool widely used in the scientific community to perform read mapping studies. In this paper, we port BWA-MEM2 to the AArch64 architecture using the ARMv8-A specification, and we compare the resulting version against an Intel Skylake system both in performance and in energy-to-solution. The porting effort entails numerous code modifications, since BWA-MEM2 implements certain kernels using x86 64 specific intrinsics, e.g., AVX-512. To adapt this code we use the recently introduced Arm’s Scalable Vector Extensions (SVE). More specifically, we use Fujitsu’s A64FX processor, the first to implement SVE. The A64FX powers the Fugaku Supercomputer that led the Top500 ranking from June 2020 to November 2021. After porting BWA-MEM2 we define and implement a number of optimizations to improve performance in the A64FX target architecture. We show that while the A64FX performance is lower than that of the Skylake system, A64FX delivers 11.6% better energy-to-solution on average. All the code used for this article is available at https://gitlab.bsc.es/rlangari/bwa-a64fx

    Castell: a heterogeneous cmp architecture scalable to hundreds of processors

    Get PDF
    Technology improvements and power constrains have taken multicore architectures to dominate microprocessor designs over uniprocessors. At the same time, accelerator based architectures have shown that heterogeneous multicores are very efficient and can provide high throughput for parallel applications, but with a high-programming effort. We propose Castell a scalable chip multiprocessor architecture that can be programmed as uniprocessors, and provides the high throughput of accelerator-based architectures. Castell relies on task-based programming models that simplify software development. These models use a runtime system that dynamically finds, schedules, and adds hardware-specific features to parallel tasks. One of these features is DMA transfers to overlap computation and data movement, which is known as double buffering. This feature allows applications on Castell to tolerate large memory latencies and lets us design the memory system focusing on memory bandwidth. In addition to provide programmability and the design of the memory system, we have used a hierarchical NoC and added a synchronization module. The NoC design distributes memory traffic efficiently to allow the architecture to scale. The synchronization module is a consequence of the large performance degradation of application for large synchronization latencies. Castell is mainly an architecture framework that enables the definition of domain-specific implementations, fine-tuned to a particular problem or application. So far, Castell has been successfully used to propose heterogeneous multicore architectures for scientific kernels, video decoding (using H.264), and protein sequence alignment (using Smith-Waterman and clustalW). It has also been used to explore a number of architecture optimizations such as enhanced DMA controllers, and architecture support for task-based programming models. ii

    MR-CUDASW - GPU accelerated Smith-Waterman algorithm for medium-length (meta)genomic data

    Get PDF
    The idea of using a graphics processing unit (GPU) for more than simply graphic output purposes has been around for quite some time in scientific communities. However, it is only recently that its benefits for a range of bioinformatics and life sciences compute-intensive tasks has been recognized. This thesis investigates the possibility of improving the performance of the overlap determination stage of an Overlap Layout Consensus (OLC)-based assembler by using a GPU-based implementation of the Smith-Waterman algorithm. In this thesis an existing GPU-accelerated sequence alignment algorithm is adapted and expanded to reduce its completion time. A number of improvements and changes are made to the original software. Workload distribution, query profile construction, and thread scheduling techniques implemented by the original program are replaced by custom methods specifically designed to handle medium-length reads. Accordingly, this algorithm is the first highly parallel solution that has been specifically optimized to process medium-length nucleotide reads (DNA/RNA) from modern sequencing machines (i.e. Ion Torrent). Results show that the software reaches up to 82 GCUPS (Giga Cell Updates Per Second) on a single-GPU graphic card running on a commodity desktop hardware. As a result it is the fastest GPU-based implemen- tation of the Smith-Waterman algorithm tailored for processing medium-length nucleotide reads. Despite being designed for performing the Smith-Waterman algorithm on medium-length nucleotide sequences, this program also presents great potential for improving heterogeneous computing with CUDA-enabled GPUs in general and is expected to make contributions to other research problems that require sensitive pairwise alignment to be applied to a large number of reads. Our results show that it is possible to improve the performance of bioinformatics algorithms by taking full advantage of the compute resources of the underlying commodity hardware and further, these results are especially encouraging since GPU performance grows faster than multi-core CPUs
    • …
    corecore