Towards a deeper understanding of hybrid programming

Abstract

With the end of Dennard scaling, future high-performance computers are expected to consist of distributed nodes, each comprising a growing number of cores with direct access to the node's shared memory. However, many parallel applications still use a pure message-passing programming model based on the Message Passing Interface (MPI) and therefore may not make optimal use of shared memory resources. As argued in this work, the pure message-passing approach is not necessarily the best fit for current and future supercomputing architectures. In this thesis, I therefore present a detailed performance analysis of so-called hybrid programming models, which aim to improve performance by combining a shared memory model with the message-passing model on current symmetric multiprocessor (SMP) systems. First, inter-node communication performance is investigated in the context of (hybrid) message-passing programs. A novel performance model for estimating communication performance on current SMP nodes is presented. As demonstrated, in contrast to the commonly used classic postal performance model, the new model predicts inter-node communication performance more accurately when processes communicate simultaneously and the network interface controller saturates on current multicore architectures. The implications of the new model for hybrid programs are discussed. In addition, I demonstrate the (current) difficulties of multithreaded MPI communication based on results obtained for a multithreaded ping-pong benchmark. Moreover, I show how intra-node MPI communication performance can be improved significantly for small to medium-sized messages by reducing message-passing overhead and/or improving cache usage. This is achieved through a direct copy in shared memory using either the hybrid MPI+MPI or the MPI+OpenMP programming method.
Furthermore, I contrast and evaluate in depth several (pure and hybrid) implementation options for a structured-grid sparse matrix-vector multiplication. These options differ in how hybrid parallelism is exploited at the application level (coarse-grained vs. fine-grained problem decomposition) and in the hybrid programming system used (pure MPI vs. MPI+MPI vs. MPI+OpenMP). I discuss their performance factors, such as locality, overhead, efficient use of MPI's derived datatypes, and the serial fraction in Amdahl's law. Moreover, I experimentally demonstrate how a coarse-grained hybrid application design can be used to control these factors, resulting in significant performance improvements in communication and/or synchronization (compared to a pure MPI parallelization) for both the hybrid MPI+MPI and MPI+OpenMP parallel programming approaches and for different grid decompositions.

U of I Only
Graduate College Thesis Office approved request from author to change restriction to U of I Access for a period of two years. Implemented by [email protected] on 2018-08-08 at 11:35 AM CD
