Exploring Fully Offloaded GPU Stream-Aware Message Passing

Kandalla, Krishna; Kaplan, Larry; Namashivayam, Naveen; Pagel, Mark; White III, James B

Publication date: 27 June 2023

Abstract

Modern heterogeneous supercomputing systems comprise CPUs, GPUs, and high-speed network interconnects. Communication libraries that support efficient data transfers involving GPU memory buffers typically require the CPU to orchestrate the data transfer operations. A new offload-friendly communication strategy, stream-triggered (ST) communication, was explored to allow the synchronization and data movement operations to be offloaded from the CPU to the GPU. An implementation based on Message Passing Interface (MPI) one-sided active-target synchronization was used as an exemplar to illustrate the proposed strategy. A latency-sensitive nearest-neighbor microbenchmark was used to explore various performance aspects of the implementation. The offloaded implementation shows significant on-node performance advantages over standard MPI active RMA (36%) and point-to-point (61%) communication. The current multi-node improvement is smaller (23% faster than standard active RMA but 11% slower than point-to-point), but plans are in progress to pursue further improvements.

Comment: 12 pages, 17 figures
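To make the CPU-orchestrated versus GPU-offloaded contrast concrete, here is a toy Python model. It is not the paper's implementation and not a real MPI or GPU API. It assumes the post-start-complete-wait (PSCW) form of MPI active-target synchronization (MPI_Win_post, MPI_Win_start, MPI_Put, MPI_Win_complete, MPI_Win_wait) and models a GPU stream as an in-order queue. The names Stream, host_orchestrated, stream_triggered, EPOCH_OPS, and the "cpu:"/"gpu:" labels are hypothetical. In the host-orchestrated variant, the CPU must synchronize the stream before every communication epoch. In the stream-triggered variant, the epoch operations are enqueued on the stream behind the compute kernels, so the host waits only once.

```python
from collections import deque
from dataclasses import dataclass, field

# Operation names mirror the MPI PSCW active-target calls (assumption: PSCW form).
EPOCH_OPS = ("post", "start", "put", "complete", "wait")


@dataclass
class Stream:
    """Toy in-order GPU stream: enqueued operations run in FIFO order."""
    pending: deque = field(default_factory=deque)
    trace: list = field(default_factory=list)

    def enqueue(self, op: str) -> None:
        # Enqueueing is non-blocking for the host.
        self.pending.append(op)

    def synchronize(self) -> None:
        # Host blocks until all previously enqueued work has run.
        while self.pending:
            self.trace.append(self.pending.popleft())


def host_orchestrated(iterations: int) -> tuple[list, int]:
    """Hypothetical baseline: the CPU drives each RMA epoch itself."""
    stream, syncs = Stream(), 0
    for _ in range(iterations):
        stream.enqueue("pack_kernel")
        stream.synchronize()  # CPU must wait for the GPU data to be ready
        syncs += 1
        for op in EPOCH_OPS:
            stream.trace.append(f"cpu:{op}")  # issued directly from the host
        stream.enqueue("unpack_kernel")
    stream.synchronize()  # final wait for the last unpack kernel
    return stream.trace, syncs + 1


def stream_triggered(iterations: int) -> tuple[list, int]:
    """Hypothetical ST variant: epoch operations are enqueued on the stream."""
    stream = Stream()
    for _ in range(iterations):
        stream.enqueue("pack_kernel")
        for op in EPOCH_OPS:
            stream.enqueue(f"gpu:{op}")  # ordered behind the kernel, no host wait
        stream.enqueue("unpack_kernel")
    stream.synchronize()  # the only host-side wait
    return stream.trace, 1


if __name__ == "__main__":
    for variant in (host_orchestrated, stream_triggered):
        trace, syncs = variant(3)
        print(variant.__name__, "stream synchronizations:", syncs)
```

For three iterations, host_orchestrated reports four stream synchronizations and stream_triggered reports one, while both produce the same order of kernel and communication steps. The model ignores timing entirely; it only shows where the host has to block.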

Available version: arXiv.org e-Print Archive (oai:arXiv.org:2306.15773)