Desynchronization and Wave Pattern Formation in MPI-Parallel and Hybrid Memory-Bound Programs
Analytic, first-principles performance modeling of distributed-memory
parallel codes is notoriously imprecise. Even for applications with extremely
regular and homogeneous compute-communicate phases, simply adding communication
time to computation time often does not yield a satisfactory prediction of
parallel runtime, due to deviations from the expected simple lockstep pattern
caused by system noise, variations in communication time, and inherent load
imbalance. In this paper, we highlight the specific cases of provoked and
spontaneous desynchronization of memory-bound, bulk-synchronous pure MPI and
hybrid MPI+OpenMP programs. Using simple microbenchmarks we observe that
although desynchronization can introduce increased waiting time per process, it
does not necessarily cause lower resource utilization but can lead to an
increase in available bandwidth per core. In the case of significant
communication overhead, even natural noise can shove the system into a state of automatic
overlap of communication and computation, improving the overall time to
solution. The saturation point, i.e., the number of processes per memory domain
required to achieve full memory bandwidth, is pivotal in the dynamics of this
process and the emerging stable wave pattern. We also demonstrate how hybrid
MPI+OpenMP programming can prevent desirable desynchronization by eliminating
the bandwidth bottleneck among processes. A Chebyshev filter diagonalization
application is used to demonstrate some of the observed effects in a realistic
setting.
Comment: 18 pages, 8 figures
The HPCG benchmark: analysis, shared memory preliminary improvements and evaluation on an Arm-based platform
The High-Performance Conjugate Gradient (HPCG) benchmark complements the LINPACK benchmark in the performance evaluation coverage of large High-Performance Computing (HPC) systems. Due to its lower arithmetic intensity and higher memory pressure, HPCG is recognized as a more representative benchmark for data-center and irregular memory access pattern workloads, and its popularity and acceptance are therefore rising within the HPC community. As only a small fraction of the reference version of the HPCG benchmark is parallelized with shared memory techniques (OpenMP), in this report we introduce two OpenMP parallelization methods. Due to the increasing importance of the Arm architecture in HPC, we evaluate our HPCG code at scale on a state-of-the-art HPC system based on the Cavium ThunderX2 SoC. We consider our work a contribution to the Arm ecosystem: along with this technical report, we plan to release our code to help tune the HPCG benchmark within the Arm community.
Postprint (author's final draft)