14 research outputs found
WFA-GPU: Gap-affine pairwise read-alignment using GPUs
Motivation: Advances in genomics and sequencing technologies demand faster and more scalable analysis methods that can process longer sequences with higher accuracy. However, classical pairwise alignment methods, based on dynamic programming (DP), impose impractical computational requirements to align long and noisy sequences like those produced by PacBio and Nanopore technologies. The recently proposed WFA algorithm paves the way for more efficient alignment tools, improving time and memory complexity over previous methods. However, high-performance computing (HPC) platforms require efficient parallel algorithms and tools to exploit the computing resources available on modern accelerator-based architectures.
Results: This paper presents WFA-GPU, a GPU (Graphics Processing Unit)-accelerated tool to compute exact gap-affine alignments based on the WFA algorithm. We present the algorithmic adaptations and performance optimizations that allow exploiting the massively parallel capabilities of modern GPU devices to accelerate the alignment computations. In particular, we propose a CPU-GPU co-design capable of performing inter-sequence and intra-sequence parallel sequence alignment, combining a succinct WFA-data representation with an efficient GPU implementation. As a result, we demonstrate that our implementation outperforms the original multi-threaded WFA implementation by up to 4.3× and up to 18.2× when using heuristic methods on long and noisy sequences. Compared to other state-of-the-art tools and libraries, WFA-GPU is up to 29× faster than other GPU implementations and up to four orders of magnitude faster than other CPU implementations. Furthermore, WFA-GPU is the only GPU solution capable of correctly aligning long reads using a commodity GPU.
This research was supported by the European Union Regional Development Fund within the framework of the ERDF Operational Program of Catalonia 2014-2020 with a grant of 50% of total cost eligible under the DRAC project [001-P-001723] and Lenovo-BSC Contract-Framework Contract (2020). This work has also been funded by the Spanish Ministerio de Ciencia e Innovacion MCIN AEI/10.13039/501100011033 under contracts [PID2020-113614RB-C21] and [TIN2015-65316-P], NextGenerationEU/PRTR (project TED2021-132634A-I00), and by the Generalitat de Catalunya GenCat-DIUiE (GRR) (contracts [2021-SGR-00574], [2017-SGR-1328], [2017-SGR-313], and [2017-SGR-1414]). M.M. was partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship number [RYC-2016-21104]. S.M. was supported by Juan de la Cierva fellowship grant [IJC2020-045916-I] funded by MCIN/AEI/10.13039/501100011033 and by European Union NextGenerationEU/PRTR. Q.A. was supported by the Spanish Ministerio de Ciencia e Innovacion under grant [PRE2021-101059].
Peer Reviewed. Postprint (published version)
Accelerating edit-distance sequence alignment on GPU using the wavefront algorithm
Sequence alignment remains a fundamental problem with practical applications ranging from pattern recognition to computational biology. Traditional algorithms based on dynamic programming are hard to parallelize, require significant amounts of memory, and fail to scale for large inputs. This work presents eWFA-GPU, a GPU (graphics processing unit)-accelerated tool to compute the exact edit-distance sequence alignment based on the wavefront alignment algorithm (WFA). This approach exploits the similarities between the input sequences to accelerate the alignment process while requiring less memory than other algorithms. Our implementation takes full advantage of the massively parallel capabilities of modern GPUs to accelerate the alignment process. In addition, we propose a succinct representation of the alignment data that successfully reduces the overall amount of memory required, allowing the exploitation of the fast shared memory of a GPU. Our results show that our GPU implementation outperforms the baseline edit-distance WFA implementation running on a 20-core machine by 3-9×. As a result, eWFA-GPU is up to 265 times faster than state-of-the-art CPU implementations, and up to 56 times faster than state-of-the-art GPU implementations.
This work was supported in part by the European Union's Horizon 2020 Framework Program through the DeepHealth Project under Grant 825111; in part by the European Union Regional Development Fund within the Framework of the European Regional Development Fund (ERDF) Operational Program of Catalonia 2014–2020 with a Grant of 50% of Total Cost Eligible through the Designing RISC-V-based Accelerators for next-generation Computers Project under Grant 001-P-001723; in part by the Ministerio de Ciencia e Innovacion (MCIN) Agencia Estatal de Investigación (AEI)/10.13039/501100011033 under Contract PID2020-113614RB-C21 and Contract TIN2015-65316-P; and in part by the Generalitat de Catalunya (GenCat)-Departament de Recerca i Universitats (DIUiE) (GRR) under Contract 2017-SGR-313, Contract 2017-SGR-1328, and Contract 2017-SGR-1414. The work of Miquel Moreto was supported in part by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal Fellowship under Grant RYC-2016-21104.
Peer Reviewed. Postprint (published version)
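For readers unfamiliar with the wavefront idea mentioned in the abstract above, the following is a minimal sequential Python sketch of edit-distance computation by wavefronts. It is not the eWFA-GPU implementation: the function name `wavefront_edit_distance`, the dictionary-based wavefront, and the variable names are assumptions made for illustration, and none of the GPU parallelism or the succinct data representation described in the abstract is modeled here.

```python
def wavefront_edit_distance(pattern: str, text: str) -> int:
    """Edit distance via furthest-reaching points per diagonal (illustrative sketch)."""
    n, m = len(pattern), len(text)
    final_k = m - n  # diagonal k = j - i that contains the end cell (n, m)

    def extend(i: int, j: int) -> int:
        # Slide along the diagonal for free while characters match.
        while i < n and j < m and pattern[i] == text[j]:
            i += 1
            j += 1
        return j

    # wavefront[k] = furthest text offset j reached on diagonal k at the current score
    wavefront = {0: extend(0, 0)}
    score = 0
    while wavefront.get(final_k, -1) < m:
        score += 1
        next_wavefront = {}
        for k in range(-score, score + 1):
            candidates = []
            if k in wavefront:        # mismatch: stay on diagonal k, advance both
                candidates.append(wavefront[k] + 1)
            if k - 1 in wavefront:    # insertion: consume one extra text character
                candidates.append(wavefront[k - 1] + 1)
            if k + 1 in wavefront:    # deletion: consume one extra pattern character
                candidates.append(wavefront[k + 1])
            # Keep only candidates that stay inside the alignment matrix.
            valid = [j for j in candidates if 0 <= j <= m and 0 <= j - k <= n]
            if valid:
                j = max(valid)
                next_wavefront[k] = extend(j - k, j)
        wavefront = next_wavefront
    return score


print(wavefront_edit_distance("kitten", "sitting"))
```

Because the work grows with the number of differences rather than with the full matrix size, similar sequences finish after only a few wavefronts, which is the property the abstract refers to when it says the approach exploits similarities between the inputs.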
OpenCL-based FPGA accelerator for semi-global approximate string matching using diagonal bit-vectors
An FPGA accelerator for the computation of the semi-global Levenshtein distance between a pattern and a reference text is presented. The accelerator offers an important benefit by reducing the execution time of read-mappers used in short-read genomic sequencing. Previous attempts to solve the same problem in FPGA use the Myers algorithm following a column approach to compute the dynamic programming table. We use an approach based on diagonals that allows for some resource savings while maintaining a very high throughput of 1 alignment per clock cycle. The design is implemented in OpenCL and tested on two FPGA accelerators. The maximum performance obtained is 91.5 MPairs/s for 100 × 120 sequences and 47 MPairs/s for 300 × 360 sequences, the highest ever reported for this problem.
This research was supported by the EU Regional Development Fund under the DRAC project [001-P-001723], by the MINECO-Spain (contract TIN2017-84553-C2-1-R), by the MICIU-Spain (contract RTI2018-095209-B-C22) and by the Catalan government (contracts 2017-SGR-1624, 2017-SGR-313, 2017-SGR-1328). M.M. was partially supported by the MINECO under RYC-2016-21104. We thank Intel for granting us access to the DevCloud system and letting us join the HARP research program. The presented HARP-2 results were obtained on resources hosted at the Paderborn Center for Parallel Computing (PC2) in the Intel Hardware Accelerator Research Program (HARP2).
Peer Reviewed. Postprint (author's final draft)
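To make the quantity in the abstract above concrete, here is a minimal Python sketch of the semi-global Levenshtein distance between a pattern and a text. It uses a plain row-by-row dynamic programming table, not the Myers bit-vector formulation or the diagonal-based hardware design the abstract describes; the function name `semi_global_distance` is an assumption made for illustration.

```python
def semi_global_distance(pattern: str, text: str) -> int:
    """Smallest edit distance between the pattern and any substring of the text."""
    # Row 0 is all zeros: a match may start at any text position at no cost.
    prev = [0] * (len(text) + 1)
    for i, pc in enumerate(pattern, 1):
        cur = [i]  # column 0: i pattern characters aligned against an empty text
        for j, tc in enumerate(text, 1):
            cur.append(min(
                prev[j - 1] + (pc != tc),  # match or substitution
                prev[j] + 1,               # pattern character left unmatched
                cur[j - 1] + 1,            # extra text character inside the match
            ))
        prev = cur
    # Taking the minimum over the last row lets the match end anywhere in the text.
    return min(prev)


print(semi_global_distance("abd", "xxabcxx"))
```

The free first row and the minimum over the last row are what distinguish the semi-global variant from the global Levenshtein distance, where both sequences must be consumed end to end.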
GenArchBench: A genomics benchmark suite for Arm HPC processors
Arm usage has substantially grown in the High-Performance Computing (HPC) community. Japanese supercomputer Fugaku, powered by Arm-based A64FX processors, held the top position on the Top500 list between June 2020 and June 2022, currently sitting in the fourth position. The recently released 7th generation of Amazon EC2 instances for compute-intensive workloads (C7g) is also powered by Arm Graviton3 processors. Projects like European Mont-Blanc and U.S. DOE/NNSA Astra are further examples of Arm's growing presence in HPC. In parallel, over the last decade, the rapid improvement of genomic sequencing technologies and the exponential growth of sequencing data have placed a significant bottleneck on the computational side. While most genomics applications have been thoroughly tested and optimized for x86 systems, just a few are prepared to perform efficiently on Arm machines. Moreover, these applications do not exploit the newly introduced Scalable Vector Extensions (SVE).
This paper presents GenArchBench, the first genome analysis benchmark suite targeting Arm architectures. We have selected computationally demanding kernels from the most widely used tools in genome data analysis and ported them to Arm-based A64FX and Graviton3 processors. Overall, the GenArch benchmark suite comprises 13 multi-core kernels from critical stages of widely-used genome analysis pipelines, including base-calling, read mapping, variant calling, and genome assembly. Our benchmark suite includes different input data sets per kernel (small and large), each with a corresponding regression test to verify the correctness of each execution automatically. Moreover, the porting features the usage of the novel Arm SVE instructions, algorithmic and code optimizations, and the exploitation of Arm-optimized libraries. We present the optimizations implemented in each kernel and a detailed evaluation and comparison of their performance on four different HPC machines (i.e., A64FX, Graviton3, Intel Xeon Skylake Platinum, and AMD EPYC Rome). Overall, the experimental evaluation shows that Graviton3 outperforms other machines on average. Moreover, we observed that the performance of the A64FX is significantly constrained by its small memory hierarchy and latencies. Additionally, as proof of concept, we study the performance of a production-ready tool that exploits two of the ported and optimized genomic kernels.
This work has been partially supported by the Spanish Ministry of Science and Innovation MCIN/AEI/10.13039/501100011033 (contracts PID2019-107255GB-C21, PID2019-105660RB-C21, PID2022-136454NB-C22, and TED2021-132634A-I00), by the Generalitat de Catalunya, Spain (contract 2021-SGR-763), by the Gobierno de Aragón (T58_23R research group), by the European Union NextGenerationEU/PRTR, and by Lenovo BSC Contract-Framework Contract (2020).
Peer Reviewed. Postprint (published version)
Effect of adding anionic salts to the prepartum ration of dairy cows
A comparative study was carried out to assess the differences observed when changing the composition of the prepartum ration in a group of late-gestation Friesian dairy cows. The trial was conducted on a commercial farm.
The pH data, together with data on production, milk quality, and veterinary incidents, were analyzed statistically with the SAS program and compared against data from before the change in the prepartum ration. The evolution of urinary pH was significant with respect to the day of urine sampling; that is, as the cows approached calving, their pH decreased to intermediate values of 6.3. Significant differences were found between the ration with salts and the ration without them (p<0.05) for the first 30 days of production, suggesting that the salts help increase the production of multiparous cows; by contrast, no significant differences were found when the first 60 days of lactation were considered. Attention should be paid to the production of primiparous cows, whose milk production decreased following the introduction of the prepartum ration. Finally, no differences attributable to the change of ration were detected in the fat, protein, lactose, or solids-not-fat (ESM, extracte sec magre) concentration of the milk. After analyzing the full set of diseases, no significant changes were observed between animals fed a conventional ration and those that also received anionic salts.