More Bang for Your Buck: Improved use of GPU Nodes for GROMACS 2018
We identify hardware that is optimal to produce molecular dynamics
trajectories on Linux compute clusters with the GROMACS 2018 simulation
package. To this end, we benchmark GROMACS performance on a diverse set of
compute nodes and relate it to the costs of the nodes, which may include their
lifetime costs for energy and cooling. In agreement with our earlier
investigation using GROMACS 4.6 on hardware of 2014, the performance to price
ratio of consumer GPU nodes is considerably higher than that of CPU nodes.
However, with GROMACS 2018, the optimal CPU to GPU processing power balance has
shifted even more towards the GPU. Hence, nodes optimized for GROMACS 2018 and
later versions enable a significantly higher performance to price ratio than
nodes optimized for older GROMACS versions. Moreover, the shift towards GPU
processing makes it possible to cheaply upgrade old nodes with recent GPUs,
yielding essentially the same performance as comparable brand-new hardware.
Comment: 41 pages, 13 figures, 4 tables. This updated version includes the
following improvements: most notably, added benchmarks for two coarse-grain
MARTINI systems, VES and BIG, resulting in a new Figure 13; fixed typos; made
the text clearer in some places; added two more benchmarks for the MEM and RIB
systems (E3-1240v6 + RTX 2080 / 2080 Ti).
An investigation of the performance portability of OpenCL
This paper reports on the development of an MPI/OpenCL implementation of LU, an application-level benchmark from the NAS Parallel Benchmark Suite. An account of the design decisions addressed during the development of this code is presented, demonstrating the importance of memory arrangement and work-item/work-group distribution strategies when applications are deployed on different device types. The resulting platform-agnostic, single-source application is benchmarked on a number of different architectures, and is shown to be 1.3–1.5× slower than native FORTRAN 77 or CUDA implementations on a single node and 1.3–3.1× slower on multiple nodes. We also explore the potential performance gains of OpenCL's device fissioning capability, demonstrating up to a 3× speed-up over our original OpenCL implementation.
Developing performance-portable molecular dynamics kernels in OpenCL
This paper investigates the development of a molecular dynamics code that is highly portable between architectures. Using OpenCL, we develop an implementation of Sandia's miniMD benchmark that achieves good levels of performance across a wide range of hardware: CPUs, discrete GPUs and integrated GPUs.
We demonstrate that the performance bottlenecks of miniMD's short-range force calculation kernel are the same across these architectures, and detail a number of platform-agnostic optimisations that improve its performance by at least 2x on all hardware considered. Our complete code is shown to be 1.7x faster than the original miniMD, and at most 2x slower than implementations individually hand-tuned for a specific architecture.
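The short-range force kernel named above is, at its core, a cutoff-limited pairwise loop. As an illustration only (this is not the miniMD or OpenCL code itself), a minimal Python sketch of Lennard-Jones force accumulation with a cutoff might look as follows; the epsilon, sigma, and cutoff parameters are hypothetical defaults:

```python
import numpy as np

def lj_forces(pos, epsilon=1.0, sigma=1.0, cutoff=2.5):
    """Accumulate Lennard-Jones forces for all pairs within a cutoff.

    pos: (N, 3) array of particle positions. This O(N^2) double loop is
    the part that miniMD-style codes optimise with neighbour lists and
    vectorised or GPU kernels; here it is kept deliberately simple.
    """
    n = len(pos)
    forces = np.zeros_like(pos)
    cutoff2 = cutoff * cutoff
    for i in range(n):
        for j in range(i + 1, n):
            rij = pos[i] - pos[j]
            r2 = float(np.dot(rij, rij))
            if r2 >= cutoff2:
                continue  # pair lies outside the short-range cutoff
            inv_r2 = sigma * sigma / r2
            inv_r6 = inv_r2 ** 3
            # Force magnitude over r, for U = 4*eps*((s/r)^12 - (s/r)^6)
            f_scalar = 24.0 * epsilon * inv_r6 * (2.0 * inv_r6 - 1.0) / r2
            forces[i] += f_scalar * rij  # Newton's third law: equal and
            forces[j] -= f_scalar * rij  # opposite contribution to j
    return forces
```

Because every architecture must execute this same pair loop, its bottlenecks (memory access pattern for `pos`, and how pairs are distributed across work-items) are shared, which is the observation the paper builds on.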
Best bang for your buck: GPU nodes for GROMACS biomolecular simulations
The molecular dynamics simulation package GROMACS runs efficiently on a wide
variety of hardware from commodity workstations to high performance computing
clusters. Hardware features are well exploited with a combination of SIMD,
multi-threading, and MPI-based SPMD/MPMD parallelism, while GPUs can be used as
accelerators to compute interactions offloaded from the CPU. Here we evaluate
which hardware produces trajectories with GROMACS 4.6 or 5.0 in the most
economical way. We have assembled and benchmarked compute nodes with various
CPU/GPU combinations to identify optimal compositions in terms of raw
trajectory production rate, performance-to-price ratio, energy efficiency, and
several other criteria. Though hardware prices are naturally subject to trends
and fluctuations, general tendencies are clearly visible. Adding any type of
GPU significantly boosts a node's simulation performance. For inexpensive
consumer-class GPUs this improvement is equally reflected in the
performance-to-price ratio. Although memory issues in consumer-class GPUs could
pass unnoticed since these cards do not support ECC memory, unreliable GPUs can
be sorted out with memory checking tools. Apart from the obvious determinants
for cost-efficiency like hardware expenses and raw performance, the energy
consumption of a node is a major cost factor. Over the typical hardware
lifetime until replacement of a few years, the costs for electrical power and
cooling can become larger than the costs of the hardware itself. Taking that
into account, nodes with a well-balanced ratio of CPU and consumer-class GPU
resources produce the maximum amount of GROMACS trajectory over their lifetime.
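The lifetime-cost argument above can be made concrete with a little arithmetic. The numbers below are purely hypothetical placeholders, not figures from the paper; the sketch only shows how energy and cooling can dominate the total cost of ownership and thereby shift the performance-per-cost ranking:

```python
def lifetime_cost(node_price_eur, power_watts, years=5.0,
                  eur_per_kwh=0.25, cooling_overhead=0.4):
    """Total cost of ownership: purchase price plus energy and cooling.

    cooling_overhead models cooling as a fraction of the node's own
    energy use. All defaults are illustrative assumptions.
    """
    hours = years * 365.0 * 24.0
    energy_kwh = power_watts / 1000.0 * hours
    energy_cost = energy_kwh * eur_per_kwh * (1.0 + cooling_overhead)
    return node_price_eur + energy_cost

def perf_per_cost(ns_per_day, node_price_eur, power_watts, **kw):
    """Trajectory produced per euro of lifetime cost (ns/day per EUR)."""
    return ns_per_day / lifetime_cost(node_price_eur, power_watts, **kw)

# Hypothetical comparison: CPU-only node vs. node with a consumer GPU.
cpu_node = perf_per_cost(ns_per_day=20.0, node_price_eur=3000.0, power_watts=300.0)
gpu_node = perf_per_cost(ns_per_day=60.0, node_price_eur=4000.0, power_watts=500.0)
```

With these invented numbers, five years of power and cooling for the CPU node (about 4600 EUR) already exceed its 3000 EUR purchase price, illustrating why the abstracts weigh energy costs so heavily.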
State-of-the-art in Smith-Waterman Protein Database Search on HPC Platforms
Searching a biological sequence database is a common and repeated task in bioinformatics and molecular biology. The Smith–Waterman algorithm is the most accurate method for this kind of search. Unfortunately, this algorithm is computationally demanding, and the situation has worsened due to the exponential growth of biological data in recent years. For that reason, the scientific community has made great efforts to accelerate Smith–Waterman biological database searches on a wide variety of hardware platforms. We give a survey of the state-of-the-art in Smith–Waterman protein database search, focusing on four hardware architectures: central processing units, graphics processing units, field-programmable gate arrays and Xeon Phi coprocessors. After briefly describing each hardware platform, we analyse the temporal evolution, contributions, limitations, experimental work and results of each implementation. Additionally, as energy efficiency is becoming more important every day, we also survey performance/power-consumption works. Finally, we give our view on the future of Smith–Waterman protein searches considering next generations of hardware architectures and their upcoming technologies.
Instituto de Investigación en Informática, Universidad Complutense de Madrid
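For reference, the Smith–Waterman recurrence that all of the surveyed implementations accelerate fits in a few lines. This is a plain, unoptimised Python sketch with linear gap penalties (the hardware implementations surveyed typically use affine gaps and heavy vectorisation); the match, mismatch, and gap scores are illustrative:

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b.

    H[i][j] = max(0,
                  H[i-1][j-1] + s(a[i-1], b[j-1]),  # match/mismatch
                  H[i-1][j]   + gap,                # gap in b
                  H[i][j-1]   + gap)                # gap in a
    The answer is the maximum over the whole matrix, which is what
    makes the alignment local rather than global.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

The quadratic time and the data dependence of each cell on its three neighbours are exactly what makes this kernel expensive and what the CPU, GPU, FPGA, and Xeon Phi implementations in the survey parallelise, typically along anti-diagonals.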
Comparing Performance and Portability between CUDA and SYCL for Protein Database Search on NVIDIA, AMD, and Intel GPUs
The heterogeneous computing paradigm has led to the need for portable and
efficient programming solutions that can leverage the capabilities of various
hardware devices, such as NVIDIA, Intel, and AMD GPUs. This study evaluates the
portability and performance of the SYCL and CUDA languages for one fundamental
bioinformatics application (Smith-Waterman protein database search) across
different GPU architectures, considering single and multi-GPU configurations
from different vendors. The experimental work showed that, while both CUDA and
SYCL versions achieve similar performance on NVIDIA devices, the latter
demonstrated remarkable code portability to other GPU architectures, such as
AMD and Intel. Furthermore, the architectural efficiency rates achieved on
these devices were superior in 3 of the 4 cases tested. This brief study
highlights the potential of SYCL as a viable solution for achieving both
performance and portability in the heterogeneous computing ecosystem.Comment: This article was accepted for publication in 2023 IEEE 35th
International Symposium on Computer Architecture and High Performance
Computing (SBAC-PAD