Towards enhancing coding productivity for GPU programming using static graphs
The main contribution of this work is to increase the coding productivity of GPU programming by using the concept of Static Graphs. GPU capabilities have increased significantly in terms of performance and memory capacity. However, problems remain in terms of scalability and in the amount of work that a GPU can perform at a time. To minimize the overhead associated with launching GPU kernels, and to maximize the use of GPU capacity, we have combined the new CUDA Graph API with the CUDA programming model (including CUDA math libraries) and the OpenACC programming model. As test cases we use two different, well-known and widely used problems in HPC and AI: the Conjugate Gradient method and Particle Swarm Optimization. In the first test case (Conjugate Gradient) we focus on the integration of Static Graphs with CUDA. Here we significantly outperform the NVIDIA reference code, reaching an acceleration of up to 11× thanks to a better implementation that benefits from the new CUDA Graph capabilities. In the second test case (Particle Swarm Optimization), we complement the OpenACC functionality with CUDA Graph, again achieving accelerations of up to one order of magnitude, with average speedups ranging from 2× to 4×, and performance very close to a reference, optimized CUDA code. Our main target is a higher coding productivity model for GPU programming based on Static Graphs, which provides, in a very transparent way, a better exploitation of the GPU capacity. Combining Static Graphs with two of the most important current GPU programming models (CUDA and OpenACC) considerably reduces the execution time with respect to using CUDA and OpenACC alone, achieving accelerations of up to more than one order of magnitude.
Finally, we propose an interface to incorporate the concept of Static Graphs into the OpenACC Specifications.
This research was funded by the EPEEC project from the European Union’s Horizon 2020 Research and Innovation program under grant agreement No. 801051. This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan, accessed on 13 April 2022).
Peer Reviewed. Postprint (published version)
CLAIRE -- Parallelized Diffeomorphic Image Registration for Large-Scale Biomedical Imaging Applications
We study the performance of CLAIRE -- a diffeomorphic multi-node, multi-GPU
image-registration algorithm, and software -- in large-scale biomedical imaging
applications with billions of voxels. At such resolutions, most existing
software packages for diffeomorphic image registration are prohibitively
expensive. As a result, practitioners first significantly downsample the
original images and then register them using existing tools. Our main
contribution is an extensive analysis of the impact of downsampling on
registration performance. We study this impact by comparing full-resolution
registrations obtained with CLAIRE to lower-resolution registrations for
synthetic and real-world imaging datasets. Our results suggest that
registration at full resolution can yield a superior registration quality --
but not always. For example, downsampling a synthetic image from to
decreases the Dice coefficient from 92% to 79%. However, the
differences are less pronounced for noisy or low-contrast high-resolution
images. CLAIRE allows us not only to register images of clinically relevant
size in a few seconds but also to register images at unprecedented resolution
in a reasonable time. The highest resolution considered is CLARITY images of
size . To the best of our knowledge, this is the
first study on image registration quality at such resolutions.
Comment: 32 pages, 9 tables, 8 figures
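The CLAIRE abstract reports registration quality as a Dice coefficient. As a reminder of what that number measures, here is a minimal sketch of the standard Dice overlap, 2|A ∩ B| / (|A| + |B|), for two binary label masks. The function name `dice_coefficient` and the toy masks are illustrative assumptions; this is not code from CLAIRE, and the abstract does not say how the authors compute the metric.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice overlap 2|A ∩ B| / (|A| + |B|) for two flattened binary masks."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labeled as foreground in both masks.
    overlap = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    # Total foreground voxels across the two masks.
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    if total == 0:
        # Convention assumed here: two empty masks agree perfectly.
        return 1.0
    return 2.0 * overlap / total


# Toy 1D "images" (hypothetical data): a reference label map and a
# registered label map that is shifted by one voxel.
reference = [0, 1, 1, 1, 1, 0, 0, 0]
registered = [0, 0, 1, 1, 1, 1, 0, 0]
score = dice_coefficient(reference, registered)
print(f"Dice = {score:.2f}")
```

A value of 1 means the two label maps coincide exactly and 0 means they do not overlap at all, so the drop from 92% to 79% quoted in the abstract corresponds to a noticeably poorer overlap after downsampling.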