4 research outputs found

    Performance modeling for systematic performance tuning

    Final Report for Enhancing the MPI Programming Model for PetaScale Systems

    This project performed research into enhancing the MPI programming model in two ways: developing improved algorithms and implementation strategies, tested and realized in the MPICH implementation, and exploring extensions to the MPI standard to better support petascale and exascale systems.

    Performance-Detective: Automatic Deduction of Cheap and Accurate Performance Models

    The many configuration options of modern applications make it difficult for users to select a performance-optimal configuration. Performance models help users understand system performance and choose a fast configuration. Existing performance modeling approaches for applications and configurable systems require either a full-factorial experiment design or a sampling design based on heuristics, resulting in high costs for achieving accurate models. Furthermore, they require repeated execution of experiments to account for measurement noise. We propose Performance-Detective, a novel code analysis tool that deduces insights into the interactions of program parameters. We use these insights to derive the smallest necessary experiment design and to avoid repeated measurements where possible, significantly lowering the cost of performance modeling. We evaluate Performance-Detective in two case studies, reducing the number of measurements from up to 3125 to only 25 and thus the cost to only 2.9% of the previously needed core hours, while the resulting model maintains an accuracy of 91.5%, compared to 93.8% when using all 3125 measurements.
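    The reduction from 3125 to 25 measurements is consistent with a full-factorial design over five parameters with five candidate values each (5^5 = 3125) collapsing to two performance-relevant parameters (5^2 = 25). Below is a minimal sketch of that idea in Python, with entirely hypothetical parameter names; the actual parameters and the interactions Performance-Detective deduces are application-specific.

        from itertools import product

        # Hypothetical parameter space: 5 parameters with 5 candidate values
        # each; 5**5 = 3125 matches the full-factorial count in the abstract.
        parameters = {
            "processes":  [1, 2, 4, 8, 16],
            "threads":    [1, 2, 4, 8, 16],
            "block_size": [32, 64, 128, 256, 512],
            "grid_size":  [100, 200, 400, 800, 1600],
            "iterations": [10, 50, 100, 500, 1000],
        }

        # Full-factorial design: every combination of every parameter value.
        full_factorial = list(product(*parameters.values()))
        print(len(full_factorial))  # 3125

        # Suppose code analysis shows that only `processes` and `grid_size`
        # influence runtime (the kind of insight the tool deduces). Then it
        # suffices to vary those two and pin the rest to default values.
        relevant = ["processes", "grid_size"]
        defaults = {name: values[0] for name, values in parameters.items()}

        reduced_design = []
        for combo in product(*(parameters[name] for name in relevant)):
            config = dict(defaults)
            config.update(zip(relevant, combo))
            reduced_design.append(config)

        print(len(reduced_design))  # 25 measurements instead of 3125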

    PARALiA: a performance aware runtime for auto-tuning linear algebra on heterogeneous systems

    Dense linear algebra operations appear very frequently in high-performance computing (HPC) applications, rendering their performance crucial to achieving optimal scalability. As many modern HPC clusters contain multi-GPU nodes, BLAS operations are frequently offloaded to GPUs, necessitating the use of optimized libraries to ensure good performance. Unfortunately, multi-GPU systems come with two significant optimization challenges: data transfer bottlenecks, and problem splitting and scheduling across multiple workers (GPUs) with distinct memories. We demonstrate that current multi-GPU BLAS methods for tackling these challenges target very specific problem and data characteristics, resulting in serious performance degradation for any slightly deviating workload. Additionally, an even more critical decision is omitted because it cannot be addressed with current scheduler-based approaches: determining which devices should be used for a given routine invocation. To address these issues, we propose a model-based approach: using performance estimation to provide problem-specific autotuning at runtime. We integrate this autotuning into an end-to-end BLAS framework named PARALiA, which couples autotuning with an optimized task scheduler, leading to near-optimal data distribution and performance-aware resource utilization. We evaluate PARALiA on an HPC testbed with 8 NVIDIA V100 GPUs across a large and diverse dataset, improving the average performance of GEMM by 1.7× and energy efficiency by 2.5× over the state of the art and demonstrating the adaptability of our performance-aware approach to future heterogeneous systems.
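    The decision PARALiA adds over scheduler-based approaches is estimating the runtime of each candidate device set and selecting the cheapest one before executing the routine. Below is a minimal Python sketch of that estimate-then-select idea, assuming an invented toy cost model and made-up device numbers rather than PARALiA's actual model or API.

        from dataclasses import dataclass
        from itertools import combinations

        @dataclass(frozen=True)
        class Device:
            name: str
            gflops: float     # sustained GEMM throughput in GFLOP/s (assumed)
            link_gbps: float  # host-device bandwidth in GB/s (assumed)

        def predicted_gemm_time(devices, m, n, k):
            """Toy cost model: split work evenly, pay host-device transfers.

            Real runtimes overlap transfers with compute and split work by
            device speed; this only illustrates estimate-then-select.
            """
            flops = 2.0 * m * n * k                      # GEMM operation count
            bytes_moved = 8.0 * (m * k + k * n + m * n)  # FP64 A, B and C
            compute = flops / sum(d.gflops * 1e9 for d in devices)
            transfer = max(bytes_moved / len(devices) / (d.link_gbps * 1e9)
                           for d in devices)
            return compute + transfer

        def select_devices(pool, m, n, k):
            """Pick the device subset with the lowest predicted runtime."""
            candidates = [subset
                          for r in range(1, len(pool) + 1)
                          for subset in combinations(pool, r)]
            return min(candidates, key=lambda s: predicted_gemm_time(s, m, n, k))

        pool = [Device("gpu0", 7000, 12), Device("gpu1", 7000, 12),
                Device("gpu2", 7000, 2)]  # gpu2 sits behind a slow link
        best = select_devices(pool, 8192, 8192, 8192)
        print([d.name for d in best])    # the slow-link GPU is left out

    With these toy numbers the model excludes the GPU behind the slow interconnect, the kind of device-selection decision the abstract argues pure scheduling cannot make.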