
    Learning Intermediate Representations using Graph Neural Networks for NUMA and Prefetchers Optimization

    There is a large space of NUMA and hardware prefetcher configurations that can significantly impact the performance of an application. Previous studies have demonstrated how a model can automatically select configurations based on the dynamic properties of the code to achieve speedups. This paper demonstrates how the static Intermediate Representation (IR) of the code can guide NUMA/prefetcher optimizations without the prohibitive cost of performance profiling. We propose a method to create a comprehensive dataset that includes a diverse set of intermediate representations along with optimal configurations. We then apply a graph neural network model to validate this dataset. We show that our model, based only on the static intermediate representation, achieves 80% of the performance gains provided by expensive strategies based on dynamic performance profiling. We further develop a hybrid model that uses both static and dynamic information. Our hybrid model achieves the same gains as the dynamic models, but at a reduced cost, by profiling only 30% of the programs.
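    To make the approach concrete, below is a minimal, hedged sketch of a graph neural network classifier over program graphs built from static IR, written with PyTorch Geometric. The node features, edge construction, configuration count, and class names are illustrative assumptions and not the paper's actual dataset or model architecture.

    # Minimal sketch: a GNN that maps an IR graph (instructions as nodes,
    # data/control-flow dependences as edges) to one of several candidate
    # NUMA/prefetcher configurations. Feature sizes and class count are assumed.
    import torch
    import torch.nn.functional as F
    from torch_geometric.nn import GCNConv, global_mean_pool

    class IRConfigClassifier(torch.nn.Module):
        def __init__(self, num_node_features, num_configs, hidden=64):
            super().__init__()
            self.conv1 = GCNConv(num_node_features, hidden)
            self.conv2 = GCNConv(hidden, hidden)
            self.head = torch.nn.Linear(hidden, num_configs)

        def forward(self, x, edge_index, batch):
            # Message passing over the IR graph
            x = F.relu(self.conv1(x, edge_index))
            x = F.relu(self.conv2(x, edge_index))
            # Pool node embeddings into a single program-level embedding
            x = global_mean_pool(x, batch)
            # Score each candidate NUMA/prefetcher configuration
            return self.head(x)

    # Toy usage: one IR graph with 3 instruction nodes and 2 dependence edges.
    x = torch.randn(3, 16)                           # 16-dim node features (assumed)
    edge_index = torch.tensor([[0, 1], [1, 2]]).t()  # edges 0->1, 1->2
    batch = torch.zeros(3, dtype=torch.long)         # all nodes belong to graph 0
    model = IRConfigClassifier(num_node_features=16, num_configs=8)
    logits = model(x, edge_index, batch)             # scores over 8 configurations (assumed)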

    Piecewise Holistic Autotuning of Parallel Programs with CERE

    Current architecture complexity requires fine-tuning of compiler and runtime parameters to achieve the best performance. Autotuning substantially improves default parameters in many scenarios, but it is a costly process requiring long iterative evaluations. We propose an automatic piecewise autotuner based on CERE (Codelet Extractor and REplayer). CERE decomposes applications into small pieces called codelets: each codelet maps to a loop or to an OpenMP parallel region and can be replayed as a standalone program. Codelet autotuning achieves better speedups at a lower tuning cost. By grouping codelet invocations with the same performance behavior, CERE reduces the number of loops or OpenMP regions to be evaluated. Moreover, unlike whole-program tuning, CERE customizes the set of best parameters for each specific OpenMP region or loop. We demonstrate the CERE tuning of compiler optimizations, number of threads, thread affinity, and scheduling policy on both NUMA and heterogeneous architectures. Over the NAS benchmarks, we achieve an average speedup of 1.08x after tuning. Tuning a codelet is 13x cheaper than whole-program evaluation and predicts the tuning impact with 94.7% accuracy. Similarly, exploring thread configurations and scheduling policies for a Black-Scholes solver on a heterogeneous big.LITTLE architecture is over 40x faster using CERE.
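    The piecewise idea can be illustrated with a short, hedged sketch: replay each codelet standalone under candidate OpenMP configurations and keep the best parameters per codelet. The replay_codelet wrapper, codelet names, replay binary naming, and parameter grid below are hypothetical stand-ins for an actual CERE workflow, not its real command-line interface.

    # Sketch of per-codelet autotuning: each codelet is replayed standalone
    # under candidate (threads, schedule) pairs and the fastest pair is kept.
    import itertools
    import os
    import subprocess
    import time

    CODELETS = ["bt_rhs_loop", "cg_sparse_matvec"]   # hypothetical codelet names
    THREADS = [2, 4, 8, 16]
    SCHEDULES = ["static", "dynamic", "guided"]

    def replay_codelet(codelet, threads, schedule):
        # Hypothetical wrapper: assumes a standalone replay binary named
        # "<codelet>.replay" exists in the current directory.
        env = {**os.environ,
               "OMP_NUM_THREADS": str(threads),
               "OMP_SCHEDULE": schedule}
        start = time.perf_counter()
        subprocess.run([f"./{codelet}.replay"], env=env, check=True)
        return time.perf_counter() - start

    best = {}
    for codelet in CODELETS:
        timings = {(t, s): replay_codelet(codelet, t, s)
                   for t, s in itertools.product(THREADS, SCHEDULES)}
        best[codelet] = min(timings, key=timings.get)

    print(best)   # best (threads, schedule) pair for each codelet

    Grouping invocations with the same performance behavior, as the abstract describes, would further cut cost by replaying only one representative invocation per group instead of every configuration on every invocation.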