131 research outputs found
Porting and optimizing BWA-MEM2 using the Fujitsu A64FX processor
Sequence alignment pipelines for human genomes are an emerging workload that will dominate in the precision medicine field. BWA-MEM2 is a tool widely used in the scientific community to perform read mapping studies. In this paper, we port BWA-MEM2 to the AArch64 architecture using the ARMv8-A specification, and we compare the resulting version against an Intel Skylake system both in performance and in energy-to-solution. The porting effort entails numerous code modifications, since BWA-MEM2 implements certain kernels using x86_64-specific intrinsics, e.g., AVX-512. To adapt this code we use the recently introduced Arm’s Scalable Vector Extensions (SVE). More specifically, we use Fujitsu’s A64FX processor, the first to implement SVE. The A64FX powers the Fugaku Supercomputer that led the Top500 ranking from June 2020 to November 2021. After porting BWA-MEM2 we define and implement a number of optimizations to improve performance in the A64FX target architecture. We show that while the A64FX performance is lower than that of the Skylake system, A64FX delivers 11.6% better energy-to-solution on average. All the code used for this article is available at https://gitlab.bsc.es/rlangari/bwa-a64fx
Porting and optimizing BWA-MEM2 using the Fujitsu A64FX processor
Sequence alignment pipelines for human genomes are an emerging workload that will dominate in the precision medicine field. BWA-MEM2 is a tool widely used in the scientific community to perform read mapping studies. In this paper, we port BWA-MEM2 to the AArch64 architecture using the ARMv8-A specification, and we compare the resulting version against an Intel Skylake system both in performance and in energy-to-solution. The porting effort entails numerous code modifications, since BWA-MEM2 implements certain kernels using x86_64-specific intrinsics, e.g., AVX-512. To adapt this code we use the recently introduced Arm's Scalable Vector Extensions (SVE). More specifically, we use Fujitsu's A64FX processor, the first to implement SVE. The A64FX powers the Fugaku Supercomputer that led the Top500 ranking from June 2020 to November 2021. After porting BWA-MEM2 we define and implement a number of optimizations to improve performance in the A64FX target architecture. We show that while the A64FX performance is lower than that of the Skylake system, A64FX delivers 11.6% better energy-to-solution on average. All the code used for this article is available at https://gitlab.bsc.es/rlangari/bwa-a64fx This work has been partially supported by the Spanish Ministry of Economy and Competitiveness (contracts PID2019-107255GB-C21 / AEI / 10.13039/501100011033 and PID2019-105660RB-C21 / AEI / 10.13039/501100011033), Gobierno de Aragon (T5820R research group), the Generalitat de Catalunya (contracts 2017-SGR-1328 and 2017-SGR1414), and the European Union’s Horizon 2020 research and innovation program (Mont-Blanc 2020 project, grant agreement 779877). Finally, A. Armejach and M. Moreto have been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Juan de la Cierva fellowship no. IJCI-2017-33945 and Ramon y Cajal fellowship no. RYC-2016-21104, respectively. Peer Reviewed. Postprint (author's final draft)
Multi-view information fusion using multi-view variational autoencoders to predict proximal femoral strength
The aim of this paper is to design a deep learning-based model to predict
proximal femoral strength using multi-view information fusion. Method: We
developed new models using multi-view variational autoencoder (MVAE) for
feature representation learning and a product of expert (PoE) model for
multi-view information fusion. We applied the proposed models to an in-house
Louisiana Osteoporosis Study (LOS) cohort with 931 male subjects, including 345
African Americans and 586 Caucasians. With an analytical solution of the
product of Gaussian distribution, we adopted variational inference to train the
designed MVAE-PoE model to perform common latent feature extraction. We
performed genome-wide association studies (GWAS) to select 256 genetic variants
with the lowest p-values for each proximal femoral strength and integrated
whole genome sequence (WGS) features and DXA-derived imaging features to
predict proximal femoral strength. Results: The best prediction model for fall
fracture load was acquired by integrating WGS features and DXA-derived imaging
features. The designed models achieved the mean absolute percentage error of
18.04%, 6.84% and 7.95% for predicting proximal femoral fracture loads using
linear models of fall loading, nonlinear models of fall loading, and nonlinear
models of stance loading, respectively. Compared to existing multi-view
information fusion methods, the proposed MVAE-PoE achieved the best
performance. Conclusion: The proposed models are capable of predicting proximal
femoral strength using WGS features and DXA-derived imaging features. Though
this tool is not a substitute for FEA using QCT images, it would make improved
assessment of hip fracture risk more widely available while avoiding the
increased radiation dosage and clinical costs from QCT. Comment: 16 pages, 3 figures
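
The abstract above mentions an analytical solution for the product of Gaussian distributions. As a rough illustration only, the following Python sketch shows the standard closed form for fusing one-dimensional Gaussian experts: precisions (inverse variances) add, and the fused mean is the precision-weighted average of the expert means. The function name `product_of_gaussians` is hypothetical and not taken from the paper, and whether the authors also include a prior expert or use this exact parameterization is an assumption left open here.

```python
from typing import Sequence, Tuple


def product_of_gaussians(
    means: Sequence[float], variances: Sequence[float]
) -> Tuple[float, float]:
    """Fuse 1-D Gaussian experts N(mean_i, variance_i) into a single Gaussian.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    if len(means) != len(variances) or len(means) == 0:
        raise ValueError("need one variance per mean and at least one expert")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be strictly positive")

    # Precision is the inverse variance; precisions of independent experts add.
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)

    fused_variance = 1.0 / total_precision
    # The fused mean weights each expert's mean by its precision.
    fused_mean = fused_variance * sum(m * p for m, p in zip(means, precisions))
    return fused_mean, fused_variance
```

Under this form, a confident view (small variance) pulls the fused mean toward its own estimate, which is one reason a product of experts is a natural fit for combining views of differing reliability.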
LIPIcs, Volume 274, ESA 2023, Complete Volume
GenArchBench: Porting and Optimizing a Genomics Benchmark Suite to Arm-based HPC Processors
Arm usage has substantially grown in the High-Performance Computing (HPC) community. Japanese supercomputer Fugaku, powered by Arm-based A64FX processors, held the top position on the Top500 list between June 2020 and June 2022, currently sitting in the second position. The recently released 7th generation of Amazon EC2 instances for compute-intensive workloads (C7g) is also powered by Arm Graviton3 processors. Projects like European Mont-Blanc and U.S. DOE/NNSA Astra are further examples of Arm's irruption in HPC. In parallel, over the last decade, the rapid improvement of genomic sequencing technologies and the exponential growth of sequencing data have placed a significant bottleneck on the computational side. While the majority of genomics applications have been thoroughly tested and optimized for x86 systems, just a few are prepared to perform efficiently on Arm machines, let alone exploit the advantages of the newly introduced Scalable Vector Extensions (SVE). This thesis presents GenArchBench, the first genome analysis benchmark suite targeting Arm architectures. We have selected a set of computationally demanding kernels from the most widely used tools in genome data analysis and ported them to Arm-based A64FX and Graviton3 processors. The porting features the usage of the novel Arm SVE instructions, algorithmic and code optimizations, and the exploitation of Arm-optimized libraries. All in all, the GenArch benchmark suite comprises 13 multi-core kernels from critical stages of widely-used genome analysis pipelines, including base-calling, read mapping, variant calling, and genome assembly. Moreover, our benchmark suite includes different input data sets per kernel (small and large), each with a corresponding regression test to verify the correctness of each execution automatically.
In this work, we present the optimizations implemented in each kernel and a detailed performance evaluation and comparison of their performance on four different architectures (i.e., A64FX, Graviton3, Intel Xeon Platinum, and AMD EPYC). Additionally, as proof of the impact of this work, we study the performance improvement in a production-ready genomics pipeline using the GenArchBench optimized kernels.
RawHash: Enabling Fast and Accurate Real-Time Analysis of Raw Nanopore Signals for Large Genomes
Nanopore sequencers generate electrical raw signals in real-time while
sequencing long genomic strands. These raw signals can be analyzed as they are
generated, providing an opportunity for real-time genome analysis. An important
feature of nanopore sequencing, Read Until, can eject strands from sequencers
without fully sequencing them, which provides opportunities to computationally
reduce the sequencing time and cost. However, existing works utilizing Read
Until either 1) require powerful computational resources that may not be
available for portable sequencers or 2) lack scalability for large genomes,
rendering them inaccurate or ineffective.
We propose RawHash, the first mechanism that can accurately and efficiently
perform real-time analysis of nanopore raw signals for large genomes using a
hash-based similarity search. To enable this, RawHash ensures the signals
corresponding to the same DNA content lead to the same hash value, regardless
of the slight variations in these signals. RawHash achieves an accurate
hash-based similarity search via an effective quantization of the raw signals
such that signals corresponding to the same DNA content have the same quantized
value and, subsequently, the same hash value.
We evaluate RawHash on three applications: 1) read mapping, 2) relative
abundance estimation, and 3) contamination analysis. Our evaluations show that
RawHash is the only tool that can provide high accuracy and high throughput for
analyzing large genomes in real-time. When compared to the state-of-the-art
techniques, UNCALLED and Sigmap, RawHash provides 1) 25.8x and 3.4x better
average throughput and 2) an average speedup of 32.1x and 2.1x in the mapping
time, respectively.
Source code is available at https://github.com/CMU-SAFARI/RawHash
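
The quantize-then-hash idea described in the abstract above can be illustrated with a small Python sketch. This is not RawHash's actual quantization or hashing scheme: the value range, the number of bits per value, the window length, and the names `quantize` and `window_hashes` are all assumptions chosen for illustration. The sketch only shows how coarse bucketing can make two slightly different signals produce identical hash values.

```python
from typing import List, Sequence


def quantize(value: float, lo: float = -3.0, hi: float = 3.0, bits: int = 4) -> int:
    """Map a signal value to one of 2**bits equal-width buckets over [lo, hi].

    The range and bit width are illustrative assumptions.
    """
    levels = 1 << bits
    clipped = min(max(value, lo), hi)
    bucket = int((clipped - lo) / (hi - lo) * levels)
    return min(bucket, levels - 1)  # keep value == hi inside the top bucket


def window_hashes(signal: Sequence[float], k: int = 3, bits: int = 4) -> List[int]:
    """Pack each window of k consecutive quantized values into one integer."""
    quantized = [quantize(v, bits=bits) for v in signal]
    hashes = []
    for start in range(len(quantized) - k + 1):
        packed = 0
        for bucket in quantized[start:start + k]:
            # Shift the running value left and append the next bucket's bits.
            packed = (packed << bits) | bucket
        hashes.append(packed)
    return hashes


# Two noisy readings of the same hypothetical signal.
reading_a = [0.10, -1.20, 2.05, 0.70]
reading_b = [0.12, -1.18, 2.01, 0.73]
same = window_hashes(reading_a) == window_hashes(reading_b)
```

A limitation of this naive bucketing is that values close to a bucket boundary can still land in different buckets under small noise, so a practical scheme needs a more careful quantization than the equal-width one assumed here.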
LIPIcs, Volume 244, ESA 2022, Complete Volume