Irregular communication often limits both the performance and scalability of
parallel applications. Typically, each application implements irregular
communication with point-to-point messages, and any optimizations are added
directly into the application. As a result, these optimizations lack
portability. There is no easy way to optimize point-to-point messages within
MPI, as the interface for single messages provides no information on the
collection of all communication to be performed. However, the persistent
neighborhood collective API, introduced in the MPI 4 standard, provides an interface
for portable optimizations of irregular communication within MPI libraries.
This paper presents methods for optimizing irregular communication within
neighborhood collectives, analyzes the impact of replacing point-to-point
communication in existing codebases such as Hypre BoomerAMG with neighborhood
collectives, and finally demonstrates a speedup of up to 1.32x for sparse matrix-vector
multiplication within a BoomerAMG solve through the use of our optimized
neighborhood collectives. We analyze multiple implementations of
neighborhood collectives, including a standard implementation, which simply
wraps point-to-point communication, as well as multiple
implementations of locality-aware aggregation. All optimizations are available
in an open-source codebase, MPI Advance, which sits on top of MPI, allowing
optimizations to be added to existing codebases regardless of the system MPI
installation.