Automated Scheduling Algorithm Selection and Chunk Parameter Calculation in OpenMP
Increasing node and cores-per-node counts in supercomputers make scheduling and load balancing critical for exploiting parallelism. OpenMP applications can achieve high performance through careful selection of the scheduling kind and chunk parameter on a per-loop, per-application, and per-system basis from a portfolio of advanced scheduling algorithms (Korndörfer et al., 2022). This selection is time-consuming, challenging, and may need to change during execution. We propose Auto4OMP, a novel approach for automated load balancing of OpenMP applications. With Auto4OMP, we introduce three scheduling-algorithm selection methods and an expert-defined chunk parameter for the schedule clause's kind and chunk arguments, respectively. Auto4OMP extends the schedule(auto) and chunk-parameter implementation in LLVM's OpenMP runtime library to automatically select a scheduling algorithm and calculate a chunk parameter during execution. Auto4OMP infers loop characteristics from the loop's execution over the application's time steps. The experiments performed in this work show that Auto4OMP improves application performance by up to 11% compared to LLVM's schedule(auto) implementation and outperforms manual selection. Auto4OMP improves the performance of MPI+OpenMP applications by explicitly minimizing thread-level and implicitly reducing process-level load imbalance.
PRISM: an intelligent adaptation of prefetch and SMT levels
Current microprocessors include hardware to optimize specific workloads. In general, these hardware knobs are set to a default configuration when the machine boots. This default behavior cannot be beneficial for all types of workloads, and the knobs are controlled by no one but the end user, who needs to know which configuration is best for the running workload. Two of these knobs are: (1) the Simultaneous Multithreading (SMT) level, which specifies the number of threads that can run simultaneously on a physical CPU, and (2) the data prefetch engine, which manages prefetches from memory. Parallel programming models are here to stay, and one programming model that succeeds in letting programmers parallelize applications easily is Open Multi-Processing (OpenMP). Moreover, microprocessor architectures are becoming so complex that end users cannot afford to optimize their workloads for every architectural detail. These architectural knobs can help increase performance, but an automatic and adaptive system is needed to manage them. In this work we propose an independent library for OpenMP runtimes that increases performance by up to 220% (14.7% on average) while reducing dynamic power consumption by up to 13% (2% on average) on a real POWER8 processor.
Machine Learning in Compiler Optimization
In the last decade, machine-learning-based compilation has moved from an obscure research niche to a mainstream activity. In this article, we describe the relationship between machine learning and compiler optimisation and introduce the main concepts of features, models, training, and deployment. We then provide a comprehensive survey and a road map for the wide variety of different research areas. We conclude with a discussion of open issues in the area and potential research directions. This paper provides both an accessible introduction to the fast-moving area of machine-learning-based compilation and a detailed bibliography of its main achievements.
Adaptive parallelism mapping in dynamic environments using machine learning
Modern-day hardware platforms are parallel and diverse, ranging from mobile devices to data centres. Mainstream parallel applications execute on the same system, competing for resources, and this resource contention may lead to a drastic degradation in a program's performance. In addition, the execution environment, composed of workloads and hardware resources, is dynamic and unpredictable. Efficiently matching program parallelism to machine parallelism under such uncertainty is hard. The mapping policies that determine the optimal allocation of work to threads should anticipate these variations.
This thesis proposes solutions for mapping parallel programs in dynamic environments. It employs predictive modelling techniques to determine the best degree of parallelism. First, this thesis proposes a machine-learning-based model to determine the optimal thread number for a target program co-executing with varying workloads. For this purpose, the offline-trained model uses static code features and dynamic runtime information as input.
Next, this thesis proposes a novel solution that monitors the offline model and adjusts its decisions in response to environment changes. It develops a second predictive model that determines what the future environment should look like if the current thread prediction were optimal. Depending on how close this prediction is to the actual environment, the predicted thread numbers are adjusted.
Furthermore, considering the multitude of potential execution scenarios in which no single policy is best suited to all cases, this work proposes an approach based on the idea of a mixture of experts. It considers a number of offline experts, or mapping policies, each specialized for a given scenario, and learns online which expert is optimal for the current execution. When evaluated on highly dynamic executions, these solutions surpass the default, state-of-the-art adaptive, and analytic approaches.
Runtime Empirical Selection of Loop Schedulers on Hyperthreaded SMPs
Hyperthreaded (HT) and simultaneous multithreaded (SMT) processors are now available in commodity workstations and servers. This technology is designed to increase throughput by executing multiple concurrent threads on a single physical processor. These multiple threads share the processor's functional units and on-chip memory hierarchy in an attempt to make better use of idle resources. Most OpenMP applications have been written assuming a Symmetric Multiprocessor (SMP), not an SMT, model. Threads executing on the same physical processor have interactions in data locality and resource sharing that do not occur on traditional SMPs. This work focuses on tuning the behavior of OpenMP applications executing on SMPs with SMT processors. We propose two adaptive loop schedulers that determine effective hierarchical schedules for individual parallel loops. We compare the performance of our two proposed schedulers against several standard schedulers and the per-region adaptive scheduler proposed by Zhang et al. using the SPEC and NAS OpenMP benchmark suites. We show that both of our proposed schedulers outperform all other schedulers on average, and increase speedup on average by over 25% when all thread contexts are used.