Node-Aware Improvements to Allreduce
The \texttt{MPI\_Allreduce} collective operation is a core kernel of many
parallel codebases, particularly for reductions over a single value per
process. The commonly used recursive-doubling allreduce algorithm achieves the
lower bound on message count, making it optimal for small reduction sizes under
node-agnostic performance models. However, this algorithm sends duplicate
messages between sets of nodes. Node-aware optimizations in MPICH remove
duplicate messages through use of a single master process per node, yielding a
large number of inactive processes at each inter-node step. In this paper, we
present an algorithm that uses the multiple processes available per node to
reduce the maximum number of inter-node messages communicated by a single
process, improving the performance of allreduce operations, particularly for
small message sizes.
Comment: 10 pages, 11 figures, ExaMPI Workshop at SC1
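The recursive-doubling scheme the abstract refers to can be sketched as a sequential simulation: at step k, each "process" p exchanges its partial result with partner p XOR 2^k, and after log2(P) steps every process holds the full reduction. This is an illustrative sketch (not MPI code, and not the paper's node-aware variant); the function name and the power-of-two assumption are ours.

```python
def recursive_doubling_allreduce(values, op=lambda a, b: a + b):
    """Simulate recursive-doubling allreduce over len(values) ranks.

    At step k, rank p pairs with rank p XOR 2**k; both combine their
    partials, so after log2(P) steps every rank has the full result.
    Assumes a power-of-two rank count, as the classic algorithm does.
    """
    p = len(values)
    assert p & (p - 1) == 0, "sketch assumes a power-of-two process count"
    partial = list(values)
    step = 1
    while step < p:
        # All exchanges in a step are pairwise and symmetric, so we can
        # compute the whole step from the previous step's partials.
        new = [op(partial[rank], partial[rank ^ step]) for rank in range(p)]
        partial = new
        step <<= 1
    return partial

print(recursive_doubling_allreduce([1, 2, 3, 4]))  # → [10, 10, 10, 10]
```

Each rank sends exactly log2(P) messages, which is the lower bound the abstract mentions; the duplicate inter-node traffic arises because several ranks on the same node may pick partners on the same remote node in a given step.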
Improving the Performance and Resilience of MPI Parallel Jobs with Topology and Fault-Aware Process Placement
HPC systems keep growing in size to meet the ever-increasing demand for
performance and computational resources. Apart from increased performance,
large scale systems face two challenges that hinder further growth: energy
efficiency and resiliency. At the same time, applications seeking increased
performance rely on advanced parallelism for exploiting system resources, which
leads to increased pressure on system interconnects. At large system scales,
increased communication locality can be beneficial both in terms of application
performance and energy consumption. To this end, several studies focus on
deriving a mapping of an application's processes to system nodes such that
communication cost is reduced. A common approach is to express both
the application's communication patterns and the system architecture as graphs
and then solve the corresponding mapping problem. Apart from communication
cost, the completion time of a job can also be affected by node failures. Node
failures may result in job abortions, requiring job restarts. In this paper, we
address the problem of assigning processes to system resources with the goal of
reducing communication cost while also taking into account node failures. The
proposed approach is integrated into the Slurm resource manager. Evaluation
results show that, in scenarios where few nodes have a low outage probability,
the proposed process placement approach achieves a notable decrease in the
completion time of batches of MPI jobs. Compared to the default process
placement approach in Slurm, the reduction is 18.9% and 31%, respectively, for
two different MPI applications.
Comment: 21 pages, 8 figures, added Acknowledgements section
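The placement problem the abstract describes can be illustrated with a toy greedy heuristic: given a process communication matrix and a per-node outage probability, co-locate heavily communicating processes and prefer reliable nodes. The function name, the scoring rule, and the inputs below are illustrative assumptions, not the paper's actual algorithm or its Slurm integration.

```python
def place_processes(comm, node_capacity, outage):
    """Greedy topology- and fault-aware placement sketch.

    comm[i][j]    -- communication volume between processes i and j
    node_capacity -- process slots per node
    outage        -- per-node failure probability
    Returns a dict mapping each process to a node index.
    """
    n_procs = len(comm)
    # Visit nodes from most to least reliable.
    nodes = sorted(range(len(outage)), key=lambda n: outage[n])
    free = {n: node_capacity for n in nodes}
    placement = {}
    # Place the most communication-heavy processes first.
    order = sorted(range(n_procs), key=lambda i: -sum(comm[i]))
    for proc in order:
        best, best_score = None, None
        for node in nodes:
            if free[node] == 0:
                continue
            # Favor nodes already hosting this process's partners,
            # breaking ties toward lower outage probability.
            local = sum(comm[proc][q] for q, nd in placement.items() if nd == node)
            score = (local, -outage[node])
            if best_score is None or score > best_score:
                best, best_score = node, score
        placement[proc] = best
        free[best] -= 1
    return placement

# Hypothetical input: processes (0,1) and (2,3) communicate heavily,
# node 0 is more reliable (1% outage) than node 1 (20% outage).
comm = [[0, 10, 1, 1],
        [10, 0, 1, 1],
        [1, 1, 0, 10],
        [1, 1, 10, 0]]
print(place_processes(comm, node_capacity=2, outage=[0.01, 0.2]))
# → {0: 0, 1: 0, 2: 1, 3: 1}
```

The heuristic keeps each heavy pair on one node (reducing inter-node traffic) and fills the low-outage node first, which mirrors the two objectives the abstract combines; the actual work frames this as a graph mapping problem rather than a greedy pass.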