Search CORE

3 research outputs found

Understanding Performance Variability on the Aries Dragonfly Network

Author: Groves Taylor
Gu Yizi
Wright Nicholas J
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2017
Field of study

This work evaluates performance variability in the Cray Aries dragonfly network and characterizes its impact on MPI Allreduce. The execution time of Allreduce is limited by the performance of the slowest participating process, which can vary by more than an order of magnitude. We utilize counters from the network routers to provide a better understanding of how competing workloads can influence performance. Specifically, we examine the relationships between message size, process counts, Aries counters and the Allreduce communication-Time. Our results suggest that competing traffic from other jobs can significantly impact performance on the Aries Dragonfly Network. Furthermore, we show that Aries network counters are a valuable tool, explaining up to 70% of the performance variability for our experiments on a large-scale production system

Crossref

eScholarship - University of California

Task Packing: Efficient task scheduling in unbalanced parallel programs to maximize CPU utilization

Author: Farreras Esclusa Montse
Fornés de Juan Jordi
Utrera Iglesias Gladys Miriam
Publication venue: 'Elsevier BV'
Publication date: 01/12/2019
Field of study

Load imbalance in parallel systems can be generated by external factors to the currently running applications like operating system noise or the underlying hardware like a heterogeneous cluster. HPC applications working on irregular data structures can also have difficulties to balance their computations across the parallel tasks. In this article we extend, improve and evaluate more deeply the Task Packing mechanism proposed in a previous work. The main idea of the mechanism is to concentrate the idle cycles of unbalanced applications in such a way that one or more CPUs are freed from execution. To achieve this, CPUs are stressed with just useful work of the parallel application tasks, provided performance is not degraded. The packing is solved by an algorithm based on the Knapsack problem, in a minimum number of CPUs and using oversubscription. We design and implement a more efficient version of such mechanism. To that end, we perform the Task Packing “in place”, taking advantage of idle cycles generated at synchronization points of unbalanced applications. Evaluations are carried out on a heterogeneous platform using FT and miniFE benchmarks. Results showed that our proposal generates low overhead. In addition the amount of freed CPUs are related to a load imbalance metric which can be used as a prediction for it.Peer ReviewedPostprint (author's final draft

UPCommons. Portal del coneixement obert de la UPC

Optimization of MPI Collective Communication Operations

Author: Luo Xi
Publication venue: TRACE: Tennessee Research and Creative Exchange
Publication date: 15/05/2020
Field of study

High-performance computing (HPC) systems keep growing in scale and heterogeneity to satisfy the increasing need for computation, and this brings new challenges to the design of Message Passing Interface (MPI) libraries, especially with regard to collective operations.The implementations of state-of-the-art MPI collective operations heavily rely on synchronizations, and these implementations magnify noise across the participating processes, resulting in significant performance slowdowns. Therefore, I create a new collective communication framework in Open MPI, using an event-driven design to relax synchronizations and maintain the minimal data dependencies of MPI collective operations.The recent growth in hardware heterogeneity results in increasingly complex hardware hierarchies and larger communication performance differences.Hence, in this dissertation, I present two approaches to perform hierarchical collective operations, and both can exploit the different bandwidths of hardware in heterogeneous systems and maximizing concurrent communications.Finally, to provide a fast and accurate autotuning mechanism for my framework, I design a new autotuning approach by combining two existing methods. This new approach significantly reduces the search space to save the autotuning time and is still able to provide accurate estimations.I evaluate my work with microbenchmarks and applications at different scales. Microbenchmark results show my work speedups MPI_Bcast and MPI_Allreduce up to 7.34X and 4.86X, respectively, on 4096 processes.In terms of applications, I achieve a 24.3% improvement for Hovorod and a 143% improvement for ASP on 1536 processes as compared to the current Open MPI

University of Tennessee, Knoxville: Trace