2 research outputs found

Auto-Tuning MPI Collective Operations on Large-Scale Parallel Systems

Authors: Fang J, Huang C, Juan C, Pan X, Sun X, Tang T, Wang H, Wang Z, Wu F, Xie M, Yuan Y, Zheng W
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 03/10/2019

MPI libraries are widely used in applications of high performance computing. Yet, effective tuning of MPI collectives on large parallel systems is an outstanding challenge. This process often follows a trial-and-error approach and requires expert insights into the subtle interactions between software and the underlying hardware. This paper presents an empirical approach to choose and switch MPI communication algorithms at runtime to optimize the application performance. We achieve this by first modeling offline, through microbenchmarks, to find how the runtime parameters with different message sizes affect the choice of MPI communication algorithms. We then apply the knowledge to automatically optimize new unseen MPI programs. We evaluate our approach by applying it to NPB and HPCC benchmarks on a 384-node computer cluster of the Tianhe-2 supercomputer. Experimental results show that our approach achieves, on average, 22.7% (up to 40.7%) improvement over the default setting.
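
The two-phase idea in the abstract (measure offline, then select at runtime by message size) can be illustrated with a minimal sketch. This is not the paper's model: it swaps in a plain lookup table keyed on message size, and the algorithm names, sizes, and timings below are hypothetical values chosen only for illustration.

```python
from bisect import bisect_right

# Hypothetical offline microbenchmark results: for each message size (bytes),
# the time (microseconds) of each candidate collective algorithm.
# These are made-up numbers, not measurements from the paper.
BENCH = {
    1024: {"binomial_tree": 12.0, "ring": 30.0},
    65536: {"binomial_tree": 95.0, "ring": 80.0},
    1048576: {"binomial_tree": 900.0, "ring": 410.0},
}


def build_table(bench):
    """Offline phase: return sorted (message_size, fastest_algorithm) pairs."""
    return [
        (size, min(times, key=times.get))
        for size, times in sorted(bench.items())
    ]


def choose_algorithm(table, msg_size):
    """Runtime phase: use the entry for the largest benchmarked size that
    does not exceed msg_size, falling back to the smallest entry."""
    sizes = [size for size, _ in table]
    idx = max(bisect_right(sizes, msg_size) - 1, 0)
    return table[idx][1]


TABLE = build_table(BENCH)
```

Under these assumed timings, `choose_algorithm(TABLE, 512)` selects `binomial_tree` and `choose_algorithm(TABLE, 70000)` selects `ring`. A real tuner would presumably also account for other runtime parameters, such as process count, which this sketch leaves out.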

Crossref

White Rose Research Online

Auto-tuning MPI Collective Operations on Large-Scale Parallel Systems

Authors: Chen Juan, Fang Jianbin, Huang Chun, Pan Xiaodong, Sun Xiaole, Tang Tao, Wang Hao, Wang Zheng, Zheng Wenxu
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 03/10/2019

MPI libraries are widely used in applications of high performance computing. Yet, effective tuning of MPI collectives on large parallel systems is an outstanding challenge. This process often follows a trial-and-error approach and requires expert insights into the subtle interactions between software and the underlying hardware. This paper presents an empirical approach to choose and switch MPI communication algorithms at runtime to optimize the application performance. We achieve this by first modeling offline, through microbenchmarks, to find how the runtime parameters with different message sizes affect the choice of MPI communication algorithms. We then apply the knowledge to automatically optimize new unseen MPI programs. We evaluate our approach by applying it to NPB and HPCC benchmarks on a 384-node computer cluster of the Tianhe-2 supercomputer. Experimental results show that our approach achieves, on average, 22.7% (up to 40.7%) improvement over the default setting.

Lancaster E-Prints