15 research outputs found

    Porting and optimizing BWA-MEM2 using the Fujitsu A64FX processor

    Sequence alignment pipelines for human genomes are an emerging workload that will dominate in the precision medicine field. BWA-MEM2 is a tool widely used in the scientific community to perform read mapping studies. In this paper, we port BWA-MEM2 to the AArch64 architecture using the ARMv8-A specification, and we compare the resulting version against an Intel Skylake system both in performance and in energy-to-solution. The porting effort entails numerous code modifications, since BWA-MEM2 implements certain kernels using x86_64-specific intrinsics, e.g., AVX-512. To adapt this code we use the recently introduced Arm's Scalable Vector Extensions (SVE). More specifically, we use Fujitsu's A64FX processor, the first to implement SVE. The A64FX powers the Fugaku Supercomputer that led the Top500 ranking from June 2020 to November 2021. After porting BWA-MEM2 we define and implement a number of optimizations to improve performance in the A64FX target architecture. We show that while the A64FX performance is lower than that of the Skylake system, A64FX delivers 11.6% better energy-to-solution on average. All the code used for this article is available at https://gitlab.bsc.es/rlangari/bwa-a64fx
    This work has been partially supported by the Spanish Ministry of Economy and Competitiveness (contracts PID2019-107255GB-C21 / AEI / 10.13039/501100011033 and PID2019-105660RB-C21 / AEI / 10.13039/501100011033), Gobierno de Aragon (T5820R research group), the Generalitat de Catalunya (contracts 2017-SGR-1328 and 2017-SGR1414), and the European Union's Horizon 2020 research and innovation program (Mont-Blanc 2020 project, grant agreement 779877). Finally, A. Armejach and M. Moreto have been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Juan de la Cierva fellowship no. IJCI-2017-33945 and Ramon y Cajal fellowship no. RYC-2016-21104, respectively.
    Peer Reviewed. Postprint (author's final draft).
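
The energy-to-solution comparison in the abstract above can be illustrated with a small calculation. This is only a sketch: it assumes energy-to-solution is average power multiplied by runtime, and the power and runtime figures are invented for illustration; they are not measurements from the paper. The function names are hypothetical.

```python
def energy_to_solution(avg_power_watts, runtime_seconds):
    """Energy in joules for one run, assuming constant average power."""
    return avg_power_watts * runtime_seconds


def relative_saving(baseline_joules, candidate_joules):
    """Fraction of the baseline energy saved by the candidate system."""
    return (baseline_joules - candidate_joules) / baseline_joules


# Invented figures (NOT from the paper): a faster but more power-hungry
# system versus a slower but more frugal one.
fast_system = energy_to_solution(avg_power_watts=100.0, runtime_seconds=10.0)
slow_system = energy_to_solution(avg_power_watts=60.0, runtime_seconds=15.0)
saving = relative_saving(fast_system, slow_system)
```

With these invented inputs the slower system takes 50% longer yet still uses less energy, which is the kind of trade-off the abstract reports.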

    Reconfigurable acceleration of genetic sequence alignment: A survey of two decades of efforts

    Genetic sequence alignment has always been a computational challenge in bioinformatics. Depending on the problem size, software-based aligners can take multiple CPU-days to process the sequence data, creating a bottleneck in the bioinformatic analysis flow. Reconfigurable accelerators can achieve high performance for such computation by providing massive parallelism, but at the expense of programming flexibility, and thus have not been commensurately used by practitioners. Therefore, this paper aims to provide a thorough survey of the proposed accelerators by giving a qualitative categorization based on their algorithms and speedup. A comprehensive comparison between works is also presented to guide selection for biologists and to provide insight into future research directions for FPGA scientists.

    FPGA acceleration of DNA sequence alignment: design analysis and optimization

    Existing FPGA accelerators for short read mapping often fail to utilize the complete biological information in sequencing data for the sake of simple hardware design, leading to missed or incorrect alignments. In this work, we propose a runtime reconfigurable alignment pipeline that considers all information in sequencing data for the biologically accurate acceleration of short read mapping. We focus our efforts on accelerating two string matching techniques, FM-index and the Smith-Waterman algorithm with the affine-gap model, both of which are commonly used in short read mapping. We further optimize the FPGA hardware using a design analyzer and merger to improve alignment performance. The contributions of this work are as follows. 1. We accelerate the exact-match and mismatch alignment by leveraging the FM-index technique. We optimize memory access by compressing the data structure and interleaving the access with multiple short reads. The FM-index hardware also considers complete information in the read data to maximize accuracy. 2. We propose a seed-and-extend model to accelerate alignment with indels. The FM-index hardware is extended to support the seeding stage while a Smith-Waterman implementation with the affine-gap model is developed on FPGA for the extension stage. This model can improve the efficiency of indel alignment with comparable accuracy versus state-of-the-art software. 3. We present an approach for merging multiple FPGA designs into a single hardware design, so that multiple place-and-route tasks can be replaced by a single task to speed up functional evaluation of designs. We first experiment with this approach to demonstrate its feasibility for different designs. Then we apply this approach to optimize one of the proposed FPGA aligners for better alignment performance.
    Open Access
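
The exact-match step that the abstract above attributes to the FM-index can be sketched in software as a backward search. This is a toy illustration, not the authors' hardware design: the index is built naively and uncompressed, the `$` sentinel is assumed to be absent from the text, and the names `build_fm_index` and `count_exact` are hypothetical.

```python
from collections import Counter


def build_fm_index(text):
    """Build a toy FM-index (C table, Occ table, length) for `text`."""
    text += "$"  # sentinel: assumed absent from text and smaller than all symbols
    # Naive suffix array: fine for a demo, far too slow for a genome.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the symbol preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in suffix_array)
    # c_table[c] = number of symbols in text strictly smaller than c.
    counts = Counter(text)
    c_table, total = {}, 0
    for sym in sorted(counts):
        c_table[sym] = total
        total += counts[sym]
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {sym: [0] * (len(bwt) + 1) for sym in counts}
    for i, ch in enumerate(bwt):
        for sym in occ:
            occ[sym][i + 1] = occ[sym][i] + (1 if sym == ch else 0)
    return c_table, occ, len(bwt)


def count_exact(pattern, index):
    """Count exact occurrences of `pattern` using backward search."""
    c_table, occ, n = index
    lo, hi = 0, n  # current suffix-array interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("banana")
```

Each pattern symbol costs two table lookups regardless of text length, which is why the memory-access pattern, rather than arithmetic, tends to be the focus of optimization.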

    Performance characterization and acceleration of genome-mapping tools on HPC environments

    Nowadays, the efficient analysis and exploitation of genomic information is paramount to future advancements in the healthcare sector, such as better diagnosis techniques and the development of improved disease treatments. In the past decades, the exponential increase in biological data production has fostered the development of more efficient genomic pipelines. For that, modern genome analysis requires better and more scalable algorithms, and improved high-performance implementations that can exploit current hardware accelerators. For most genome analysis pipelines, sequence mapping is one of the most computationally intensive and time-consuming processing stages. The ultimate goal of this work is to propose techniques to accelerate read mapping, leveraging novel algorithms and hardware vector extensions. In this thesis, we present a thorough performance characterization of the most widely used genome-mapping tools and propose acceleration techniques that can effectively improve the performance of these tools. To that end, first, we identify the most time-consuming kernels, their performance bottlenecks, and the underlying causes of inefficiency. Afterwards, we design and implement an accelerated version of one of the most time-consuming steps: pairwise sequence alignment. For that, we propose to replace the classical dynamic-programming algorithm, used within these tools, with the recently proposed wavefront alignment algorithm (WFA). Moreover, we design and implement the first fully vectorized version of the WFA, leveraging Intel's AVX2 and AVX-512 instructions, to further accelerate sequence-to-sequence alignment. As a result, we demonstrate that our vectorized WFA implementation outperforms the original scalar WFA implementation by 1.1x to 2.4x. In turn, this renders speedups from 2.4x up to 826.7x compared to the most widely used alignment algorithm, KSW2 (used within Minimap2 and BWA-MEM2).
    We conclude that these tools can be significantly accelerated by selecting better algorithms (like the WFA) and leveraging fine-tuned implementations that can exploit hardware resources available in current high performance computing (HPC) processors.
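
The wavefront idea mentioned in the abstract above can be sketched for the simplest scoring scheme. This is a scalar, edit-distance-only illustration and not the thesis's implementation, which is vectorized; the gap-affine recurrences of the WFA are not shown here. The function name `wfa_edit_distance` is hypothetical.

```python
def wfa_edit_distance(a, b):
    """Edit distance between a and b computed with wavefronts.

    For each score s, wavefront[k] holds the furthest offset h into b that is
    reachable on diagonal k = h - v (v is the offset into a) with s edits.
    """
    n, m = len(a), len(b)

    def extend(k, h):
        # Slide along diagonal k for free while characters match.
        v = h - k
        while v < n and h < m and a[v] == b[h]:
            v += 1
            h += 1
        return h

    target_k = m - n  # diagonal that contains the end cell (n, m)
    wavefront = {0: extend(0, 0)}
    score = 0
    while wavefront.get(target_k, -1) < m:
        score += 1
        next_wavefront = {}
        for k in range(-score, score + 1):
            candidates = []
            if k - 1 in wavefront:  # insertion: consume one character of b
                candidates.append(wavefront[k - 1] + 1)
            if k in wavefront:      # mismatch: consume one character of each
                candidates.append(wavefront[k] + 1)
            if k + 1 in wavefront:  # deletion: consume one character of a
                candidates.append(wavefront[k + 1])
            # Keep only points that lie inside the alignment matrix.
            valid = [h for h in candidates if h <= m and 0 <= h - k <= n]
            if valid:
                next_wavefront[k] = extend(k, max(valid))
        wavefront = next_wavefront
    return score
```

The work done grows with the number of edits rather than with the full product of the sequence lengths, which is why similar sequences align quickly under this scheme.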

    IMPROVING BWA-MEM WITH GPU PARALLEL COMPUTING

    Due to the many advances made in designing algorithms, especially the ones used in bioinformatics, it is becoming harder and harder to improve their efficiency. Therefore, hardware acceleration using General-Purpose computing on Graphics Processing Units (GPGPU) has become a popular choice. BWA-MEM is an important part of the BWA software package for sequence mapping. Because of its high speed and accuracy, we choose to parallelize this popular short DNA sequence mapper. BWA has been a prevalent single-node tool in genome alignment, and its acceleration has been widely studied since the first version of the BWA package came out. This thesis presents the Big Data GPGPU distributed BWA-MEM, a tool that combines GPGPU acceleration and distributed computing. The four hardware parallelization techniques used are CPU multi-threading, GPU paralleled, CPU distributed, and GPU distributed. The GPGPU distributed software typically outperforms other parallelization versions. The alignment is performed on a distributed network, and each node in the network executes a separate GPGPU paralleled version of the software. We parallelize the chain2aln function in three levels. In Level 1, the function ksw_extend2, an algorithm based on Smith-Waterman, is parallelized to handle extension on one side of the seed. In Level 2, the function chain2aln is parallelized to handle chain extension, where all seeds within the same chain are extended. In Level 3, part of the function mem_align1_core is parallelized for extending multiple chains. Due to the program's complexity, the parallelization work was limited to the GPU version of ksw_extend2 at parallelization Level 3. However, we have successfully combined Spark with BWA-MEM and ksw_extend2 at parallelization Level 1, which has shown that the proposed framework is possible. The paralleled Level 3 GPU version of ksw_extend2 demonstrated noticeable speed improvement with the test data set.
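
The abstract above describes ksw_extend2 as an algorithm based on Smith-Waterman. As a point of reference, here is a minimal scalar sketch of Smith-Waterman local alignment scoring with affine gaps (the Gotoh recurrences). It is not ksw_extend2 itself, which works from a seed and has its own termination rules; the function name and the default scoring parameters are assumptions chosen for illustration.

```python
def smith_waterman_affine(a, b, match=2, mismatch=-3, gap_open=2, gap_extend=1):
    """Best local alignment score; a gap of length k costs gap_open + k * gap_extend."""
    neg_inf = float("-inf")
    n_cols = len(b) + 1
    h_prev = [0] * n_cols        # H: best score of an alignment ending at (i-1, j)
    f_prev = [neg_inf] * n_cols  # F: best score ending at (i-1, j) in a vertical gap
    best = 0
    for i in range(1, len(a) + 1):
        h_cur = [0] * n_cols
        f_cur = [neg_inf] * n_cols
        e = neg_inf              # E: best score ending at (i, j) in a horizontal gap
        for j in range(1, n_cols):
            # Either extend an existing gap or open a new one from H.
            e = max(e - gap_extend, h_cur[j - 1] - gap_open - gap_extend)
            f_cur[j] = max(f_prev[j] - gap_extend,
                           h_prev[j] - gap_open - gap_extend)
            diag = h_prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: scores never drop below zero.
            h_cur[j] = max(0, diag, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best
```

Cells on the same anti-diagonal do not depend on each other, which is the property that data-parallel implementations of this recurrence typically exploit.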

    ALFALFA: fast and accurate mapping of long next generation sequencing reads
