OmpSs Task offload by Sainz Manteca, Florentino
OmpSs task oﬄoad
Florentino Sainz
Facultat d’Informa`tica de Barcelona (FIB)
Universitat Polite`cnica de Catalunya
A thesis submitted for the degree of
Master in Innovation and Research in Informatics (MIRI-HPC)
1st of July, 2014
Advisor: Vicenc¸ Beltran Querol, BSC-CNS
Tutor: Jesu´s Jose´ Labarta Mancho, BSC-CNS & UPC-DAC
To my parents for their unwavering support, in all my endeavors.
Acknowledgements
I want to thank all people who helped me during this project, specially the
ones involved in the development of OmpSs tools. I also want to thank
Vicenc¸ Beltran and Jesu´s Labarta for their guidance and support during
this work.
The research leading to these results has received funding from the Euro-
pean Community’s Seventh Framework Programme (FP7/2007-2013) un-
der Grant Agreement n◦287530 and 610476.
Abstract
Exascale performance requires a level of energy eﬃciency only achiev-
able with specialized hardware. Hence, to build a general purpose HPC
system with exascale performance diﬀerent types of processors, memory
technologies and interconnection networks will be necessary. Heteroge-
neous hardware is already present on some top supercomputer systems
that are composed of diﬀerent compute nodes, which at the same time,
contains diﬀerent types of processors and memories. However, heteroge-
neous hardware is much harder to manage and exploit than homogeneous
hardware, further increasing the complexity of the applications that run on
HPC systems.
Most HPC applications useMPI to implement a rigid Single ProgramMul-
tiple Data (SPMD) execution model that no longer ﬁts the heterogeneous
nature of the underlying hardware. However, MPI provides a powerful and
ﬂexible MPI Comm spawn API call that was designed to exploit dynam-
ically heterogeneous hardware but at the expense of a higher complexity,
which has hindered a wider adoption of this API.
In this master thesis, we have extended the OmpSs programming model to
oﬄoad dynamically MPI kernels, replacing the low-level and error prone
MPI Comm spawn call with the high-level and easy to use OmpSs prag-
mas. The evaluation shows that our proposal dramatically simpliﬁes the
dynamic oﬄoading of MPI kernels while keeping the same performance
and scalability as MPI Comm spawn.
Contents
1 Introduction 1
2 State of the art 3
2.1 Intra-node heterogeneity . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.1 CUDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.2 OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.3 Intel Oﬄoad . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.4 Higher level programming models . . . . . . . . . . . . . . . 8
2.2 Cluster level heterogeneity . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.1 rCUDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.2 Virtual OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.3 MPI dynamic process spawn . . . . . . . . . . . . . . . . . . 9
3 OmpSs 11
3.1 Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2 Mercurium compiler . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 Nanox Runtime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4 Implementation of the OmpSs Oﬄoad 16
4.1 Nanox Runtime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.1.1 Oﬄoad mechanism . . . . . . . . . . . . . . . . . . . . . . . 18
4.1.2 Data management . . . . . . . . . . . . . . . . . . . . . . . . 19
4.1.3 Threads and scheduling . . . . . . . . . . . . . . . . . . . . . 19
4.1.4 Dynamic compilation of oﬄoad Plugin . . . . . . . . . . . . 20
4.2 Mercurium compiler . . . . . . . . . . . . . . . . . . . . . . . . . . 21
iv
CONTENTS CONTENTS
4.2.1 Clause extensions . . . . . . . . . . . . . . . . . . . . . . . . 21
4.2.2 Task generation . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.2.3 Oﬄoading global variables . . . . . . . . . . . . . . . . . . . 23
4.2.4 Add custom compilers . . . . . . . . . . . . . . . . . . . . . 23
4.2.5 Dynamic compilation of oﬄoad Plugin . . . . . . . . . . . . 23
4.3 Executable generation . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5 Evaluation 26
5.1 Performance of oﬄoaded code . . . . . . . . . . . . . . . . . . . . . 26
5.2 Overhead comparison with native MPI . . . . . . . . . . . . . . . . . 33
5.3 Real applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.3.1 SRMIP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.3.2 TurboRVB . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.3.3 FWI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6 Conclusions and Future Work 47
v
List of Figures
3.1 Mercurium compilation ﬂow . . . . . . . . . . . . . . . . . . . . . . 14
5.1 Calculate forces particle communications among iterations. Repre-
sents how each rank owns a local partition of the particles which will
be exchanged with other nodes in order to calculate the accumulated
forces for that partition . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.2 OmpSs NBody Simulation oﬄoad (Single iteration communications).
Master processes will spawn two worker processes which can commu-
nicate internally in the Xeon Phi and have them calculating the forces.
Total Communications = 2N/NP + N . . . . . . . . . . . . . . . . . 29
5.3 Intel Oﬄoad NBody Simulation (Single iteration communications).
Master processes will spawn one worker process each which can’t
communicate internally in the Xeon Phi and have them calculating the
forces. Total Communications = N/NP + 2N . . . . . . . . . . . . . 31
5.4 Single node performance of each of the diﬀerent versions of the NBody
Benchmark. We can see how it’s slower in the host and all the other
versions, including our oﬄoad, obtain similar performance. . . . . . . 32
5.5 Weak Scaling Speedup of diﬀerent versions of NBody Benchmark . . 32
5.6 Strong scaling speedup of diﬀerent versions of NBody Benchmark . . 33
5.7 OmpSs vs Native MPI Spawn. Each bar shows how time is distributed
in MPI Spawn time, which is the time that MPI takes to spawn the
processes, and the extra operations we introduce to setup our oﬄoad
mechanism. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.8 OmpSs vs Native MPI execution . . . . . . . . . . . . . . . . . . . . 39
vi
LIST OF FIGURES LIST OF FIGURES
5.9 SRMIP master-to-workers partitioning and work sharing before apply-
ing Oﬄoad . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.10 SRMIP master-to-workers partitioning and work sharing after apply-
ing Oﬄoad . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.11 FWI execution 1. Shows how work is oﬄoaded to level 2 nodes and
how level 3 nodes calculate the shots. . . . . . . . . . . . . . . . . . 44
5.12 FWI execution 2. Shows how each group of 4 level 3 nodes process
diﬀerent shots at the same time independently. . . . . . . . . . . . . . 46
vii
List of Tables
5.1 NBody Execution Device Schema . . . . . . . . . . . . . . . . . . . 29
5.2 Comparison of the number of code lines with and without OmpSs oﬄoad 43
viii
Chapter 1
Introduction
Since the race to Exascale computing started and new architectures were proposed,
supercomputers are evolving from homogeneous systems in which every node has the
same hardware conﬁguration to truly heterogeneous systems, where there are diﬀerent
sets of nodes for diﬀerent purposes, and each node also contains diﬀerent processors,
memories and interconnection networks.
A good example of this trend is the Stampede supercomputer hosted at the Texas
Advanced Computing Center (TACC). This supercomputer is mainly composed of
compute nodes with two Xeon processors and one Xeon Phi processor attached, but
there are also compute nodes with two Xeon Phi processors instead of one, compute
nodes with 1TB of memory or compute nodes equipped with NVIDIA K20 GPUs. An-
other example is the DEEP system [1] that is composed of a cluster of Xeon processors
linked with an Inﬁniband network and a cluster of Xeon Phi processors linked with an
Extoll network.
On these systems, the traditionally used Single Program Multiple Data (SPMD)
execution model is not adequate to eﬀectively exploit the underlyning resources. Most
applications have diﬀerent computational phases, and each of these phases may run
best on a diﬀerent type and/or number of nodes. For instance, one compute phase may
scale poorly with the number of nodes, so it better runs on a small number of powerful
nodes, while other phases may scale very well and are more eﬃciently executed on a
large number of nodes with accelerators.
Thus, to eﬀectively exploit an heterogeneous system, most applications should have a
1
1. INTRODUCTION
Multiple Program Multiple Data (MPMD) execution model, in which each computa-
tional phase runs on top of the hardware that better suits its needs. MPI provides the
MPI Comm spawn to properly implement an MPMD model. This API call enables
the dynamic spawn of new MPI processes on additional compute nodes that can run
a diﬀerent program, which is connected and can communicate with the original one.
However, the usage of this low-level API is complex and error prone.
The very nature of MPMD programs make them diﬃcult to implement and the rea-
son about because the programmer must not only manually manage the intracommu-
nications of each spawned MPI program, but also the intercommunications required
between the diﬀerent MPI programs. These include the data transfers across MPI pro-
grams as well as the necessary synchronizations to orchestrate the whole program exe-
cution. The complexity of implementing this approach is reﬂected by the low number
of HPC applications that currently has an MPMD execution model.
We have extended our OmpSs [2] data-ﬂow programming model to support the
dynamic oﬄoad of MPI tasks, providing a practical way to implement MPMD applica-
tions without any of the complexities associated with the direct use ofMPI Comm spawn
API. We have developed a simple API to dynamically allocate nodes/MPI processes,
which returns a MPI intercommunicator that encloses all the newly created MPI pro-
cesses. Additionally, OmpSs has been extended with a new onto(comm, rank) clause
that specify in which speciﬁcMPI process a task has to run. Our extended compiler and
runtime system transparently manage all data transfers and synchronizations required.
2
Chapter 2
State of the art
The growing popularity of hardware accelerators has encouraged academic and indus-
trial researchers to develop novel programming models to make the most of these new
highly parallel compute devices with a moderate eﬀort. The applications that run on
these hybrid systems, composed of multi-cores and hardware accelerators, usually have
some parts that do not scale well and run better on multi-cores while other highly par-
allel parts can exploit the full potential of hardware accelerators. Hence most parallel
programming models explicitly provide mechanism to split and coordinate applica-
tions to succesfully run on hybrid systems. The rest of the section divides previous
research eﬀorts on this topic in intra-node and inter-node heterogenity.
2.1 Intra-node heterogeneity
At the intra-node level, much work has been done and several programming models
have been implemented, like OpenCL [3], CUDA [4], Intel Oﬄoad, OpenACC or
OpenMP 4.0. All of them provide mechanisms to divide applications in two parts, the
one that runs on the multi-core side, which is usually known as the host part, and the
the part that runs on the accelerator, that is usually known as a kernel.
CUDA was the ﬁrst mainstream programming model widely used to exploit GPUs.
A standardization eﬀort to exploit hardware accelerators have led to the development
of OpenCL, which is the only programming model that can exploit a variety of accel-
erators, including the Intel Xeon Phi accelerators. In the rest of this Section, we will
summarize the most widely used programming models to exploit local accelerators.
3
2. STATE OF THE ART 2.1 Intra-node heterogeneity
2.1.1 CUDA
CUDA (formerly Compute Uniﬁed Device Architecture) was ﬁrst introduced back in
2007 by NVIDIA to exploit the GPUs computational power.
CUDA is composed by a group of development tools created by NVIDIA which
allows programmers to use a programming language similar to C and C++ in order
program GPUs.
The language provides a host API which is used to copy data to/from the GPU and
launch kernels inside the GPU, and a C-like programming language, in which kernels
are written. These kernels are pieces of code/functions which will run inside the GPU.
Listing 2.1: Kernel written in CUDA C
__global__ void saxpy(int n, float a, float* x, float* y)
{
int i= blockIdx.x*blockDim.x+threadIdx.x;
if(i < n)
y[i] = a * x[i] + y[i];
}
Listing 2.2: CUDA program using host API to launch kernels
int main(void)
{
int N = 1<<20;
float *x, *y, *d_x, *d_y;
x = (float*)malloc(N*sizeof(float));
y = (float*)malloc(N*sizeof(float));
cudaMalloc(&d_x, N*sizeof(float));
cudaMalloc(&d_y, N*sizeof(float));
for (int i = 0; i < N; i++) {
x[i] = 1.0f;
y[i] = 2.0f;
}
cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
// Perform SAXPY on 1M elements
4
2. STATE OF THE ART 2.1 Intra-node heterogeneity
saxpy<<<(N+255)/256, 256>>>(N, 2.0, d_x, d_y);
cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
float maxError = 0.0f;
for (int i = 0; i < N; i++)
maxError = max(maxError , abs(y[i]-4.0f));
}
At listing 2.1 we can see how the programmer will obtain the id of the thread
(consecutive numbers) and will make each thread handle a diﬀerent element of the
vectors. The amount of threads launched in the GPU will be speciﬁed by using the
host API as seen at listing 2.2 at the time of launching the kernel.
2.1.2 OpenCL
OpenCL was ﬁrst developed by Apple and then submitted to the Khronos group, which
published the ﬁrst speciﬁcation at the end of 2008. It’s philosophy and features are
similar to CUDA but it also covers other devices (CPU, Xeon Phi, FPGAs, GPUs...).
The complexity of developing an OpenCL application is, like in CUDA, clearly split
in two diﬀerent parts.
The ﬁrst one is related to the development of optimized OpenCL C kernels like
the saxpy kernel we can see in Listing 2.3 that can exploit accelerator hardware. The
second one includes compiling the kernel, calling it and moving data from the host to
the accelerator and viceversa, for this, OpenCL provides a low-level API to interact
with the accelerator.
Listing 2.3: Kernel written in OpenCL C
__kernel void saxpy(int n, float a,
__global float* x, __global float* y)
{
int i = get_global_id(0);
if(i < n)
y[i] = a * x[i] + y[i];
}
Listing 2.4: OpenCL program using host API to launch kernels
5
2. STATE OF THE ART 2.1 Intra-node heterogeneity
int main(int argc, char** argv)
{
float a, h_x[1024], h_y[1024];
// Init a, h_x and h_y;
cl_uint numPlats;
clGetPlatformIDs(0, 0, &numPlats);
cl_platform_id Plat[numPlats];
clGetPlatformIDs(numPlats, Plat, 0);
clGetDeviceIDs(Plat[i], DEV, 1, &id,0);
cl_context ctx =
clCreateContext(0, 1, &id, 0, 0, 0);
cl_command_queue cmd =
clCreateCommandQueue(ctx, id, 0, 0);
cl_program program =
clCreateProgramWithSource(ctx, 1, KernelSrc , 0, 0);
clBuildProgram(program, 0, 0, 0, 0, 0);
cl_kernel ko_saxpy = clCreateKernel( program, "saxpy", 0);
cl_mem d_x = clCreateBuffer(ctx, 0, sizeof(float) * n, 0, 0);
cl_mem d_y = clCreateBuffer(ctx, 0, sizeof(float) * n, 0, 0);
clEnqueueWriteBuffer(cmd, d_x, CL_TRUE, 0, sizeof(float) * n, h_x
, 0, 0, 0);
clEnqueueWriteBuffer(cmd, d_y, CL_TRUE, 0, sizeof(float) * n, h_y
, 0, 0, 0);
clSetKernelArg(ko_saxpy , 0, 4, &n);
clSetKernelArg(ko_saxpy , 1, 4, &a);
size_t sze = sizeof(cl_mem);
clSetKernelArg(ko_saxpy , 2, sze, &d_x);
clSetKernelArg(ko_saxpy , 3, sze, &d_y);
size_t global = 1024, local = 128;
clEnqueueNDRangeKernel(cmd, ko_saxpy,
1, 0, &global, &local, 0, 0, 0);
clFinish(commands);
clEnqueueReadBuffer( commands, d_y, CL_TRUE, 0, 4 * n, h_y, 0, 0,
0 );
6
2. STATE OF THE ART 2.1 Intra-node heterogeneity
clReleaseMemObject(d_x);
clReleaseMemObject(d_y);
clReleaseProgram(program);
clReleaseKernel(ko_saxpy);
clReleaseCommandQueue(commands);
clReleaseContext(ctx);
return 0;
}
At Listing 2.3 we can see how the programmer will obtain the id of the thread
(consecutive numbers) and will make each thread handle a diﬀerent element of the
vectors. The amount of threads launched in the GPU will be speciﬁed by using the
host API as seen at the time of launching the kernel as seen at listing 2.4. We can also
see that the verbosity of this host API is quite high.
2.1.3 Intel Oﬄoad
Intel Oﬄoad is a technology developed by Intel which allows users to run C/C++ code
at Intel Xeon Phi accelerators which are in the same node as the host Xeon E5.
It provides an extension to the compiler which is able to interpret pragmas which
specify the regions of code which will be executed on remote devices. Once the ex-
ecution reach these regions of code, it will stop the program and transfer all the data
needed to the Xeon Phi and begin the execution of that code. Once the code is ﬁnished,
data will be copied back to the host and the execution will continue.
Listing 2.5: Intel Oﬄoad code
void saxpy(int n, float a, float* x, float* y)
{
#pragma offload target(mic:0) in(x : length(n) alloc_if(1)
free_if(1)) inout(y : length(n) alloc_if(1) free_if(1))
for (int i=0; i<n; i++) {
y[i] = a * x[i] + y[i];
}
}
At Listing 2.3 we can see how the programmer will execute the function specifying
that the code will be oﬄoaded to a ”mic” (Xeon Phi) and the data it needs. After the
7
2. STATE OF THE ART 2.2 Cluster level heterogeneity
oﬄoad code has ﬁnished the host CPU will ﬁnish the function and continue with the
regular code.
2.1.4 Higher level programming models
The complexity of programming in CUDA or OpenCL can be alleviated by using some
higher level programming models, like OpenMP 4.0 [5], OpenACC [6] which substi-
tute APIs for pragmas and transform sequential C code to CUDA/OpenCL kernels.
Other languages like OmpSs do not generate kernels, but they handle the integration
of the kernel with the main application in a transparent way.
2.2 Cluster level heterogeneity
The previous programming models were not designed to work on distributed environ-
ments. However, there are a few technologies such as rCUDA [7] or Virtual OpenCL [8],
which extends CUDA and OpenCL respectivelly and allow programmers to transpar-
ently use remote devices as local devices. Additionally, MPI-2 standard already in-
cluded some functionalities to deal with heterogeneous cluster, but not with hardware
accelerators in mind.
2.2.1 rCUDA
rCUDA is a middle-ware that enables CUDA remoting over a commodity network.
That is, the middle-ware allows an application to use a CUDA-compatible graphics
processing unit (GPU) installed in a remote computer as if it were installed in the com-
puter where the application is being executed. rCUDA intercepts calls to the CUDA
API in an application and sends them to a client(application)-server(node with GPUs)
distributed architecture which allows them to execute CUDA code in remote nodes
without having a local GPU. This technology can help to save energy or have clus-
ters with very heterogeneous architectures, but still has the same problem than native
CUDA, devices cannot communicate inside of a kernel and communicating between
two diﬀerent devices has to be done using a host CPU.
8
2. STATE OF THE ART 2.2 Cluster level heterogeneity
2.2.2 Virtual OpenCL
The VCL cluster platform is the OpenCL equivalent to rCUDA. However, in both
technologies the transparent virtualization of the hardware accelerators may prevent
programmers to make the most of them at some scenarios, as one usual bottleneck is
the data-transfer between the host and the accelerator.
2.2.3 MPI dynamic process spawn
MPI-2 standard introduced a call (MPI Comm spawn) to enable the dynamic creation
of new MPI processes. This API call enables the creation of dynamic and malleable
distributed applications that can even run on heterogeneous clusters. This ﬂexibility
comes at the expenses of a large increase on the application complexity because the
programmmer must explicitly coordinate bothMPI parts, which do not follow the usual
SPMD execution model of most MPI applications. The lack of heterogeneous clusters
until recently and the high complexity to eﬀectively use this features has hindered a
wider adoption of this technique.
Nowadays, programming models only support homogeneous clusters of hardware
accelerators, where MPI is combined with a programming model that supports hard-
ware accelerators at the node level. However, the emerging number of heterogeneous
clusters, which combine diﬀerent types of nodes with hardware accelerators, requires
new approaches to eﬀectively exploit them. In order to exploit cluster-level hetero-
geneity with many nodes with diﬀerent characteristics, we propose our OmpSs Collec-
tive Oﬄoad model in which one or many nodes can oﬄoad work to a group of remote
nodes which will execute a diﬀerent part of the algorithm (a MPI kernel), while being
able to directly communicate between them. Thus, our approach is as simple as the
Intel Oﬄoad but as powerful as the MPI Comm spawn.
One example of this kind of architectures is the one proposed under DEEP project
in which a stand-alone cluster of Intel Xeon Phi accelerators connected with a high
performance network will be used to oﬄoad MPI Kernels from another cluster of Xeon
processors.
With the new Intel Xeon Phi based accelerators, there is a new scenario, an accel-
erator which can execute x86 code and supports MPI. In this scenario, programmers
are able to run their applications in this accelerator with minor changes. In order to
9
2. STATE OF THE ART 2.2 Cluster level heterogeneity
get their best performance, parallel code should run on the accelerator and serial code
on regular CPUs. Both Intel OpenCL or Intel Oﬄoad Model [9] were implemented
in order to address this issue, but they are limited in the sense that accelerators can-
not communicate between themselves. This limitation can be overcome by oﬄoading
using the mechanisms oﬀered by MPI dynamic process spawn.
10
Chapter 3
OmpSs
OmpSs is a directive-based programming model that enables the execution of sequen-
tial programs in a data-ﬂow way. The programmer only needs to specify the data which
is going to be read (in) and written (out) inside a function (task). Once this is provided,
the code will be compiled by Mercurium compiler which will generate tasks to be
executed by Nanox++ runtime.
3.1 Tasks
Data dependencies are speciﬁed with the clauses in/out/inout which specify how data is
accessed in a task and can be used to build a graph which controls that there are no race
conditions between tasks, thus warranting a correct execution. The way dependencies
work is that following instantiation order, a new task will not be able to start if a
previous task is writing into the same data (out/inout) or if the task has to write into
data (out/inout) which is being used by some previous task which is still executing.
We can see an example of how OmpSs works in Listing 3.1. OmpSs has a team
of worker threads. The main thread will start executing the application and generating
tasks which are added to a DAG (Directed acyclic graph), once these tasks have all
the dependencies satisﬁed, they are moved to a ready queue. Threads will steal these
tasks from the ready queue and execute them. In our code, we can see that the ﬁrst task
outputs forces and the second task has forces as input, this means that second task will
not start until ﬁrst task ﬁnishes. By looking at dependencies we can see that almost
11
3. OMPSS 3.1 Tasks
Listing 3.1: Nbody OmpSs
void parallelCalcForces(particle_t* local, particle_t* remote,
force_t* forces, int np, int tsteps) {
int rank_size;
MPI_Comm_size(MPI_COMM_WORLD , &rank_size);
for (int t = 0; t < timesteps; t++) {
particles_block_t * remote = local;
for(int i=0; i < rank_size; i++){
#pragma omp task in([n_blocks] local, [n_blocks] remote)
inout([n_blocks] forces)
calculate_forces(forces, local, remote, n_blocks);
#pragma omp task in([n_blocks] remote) out([n_blocks] tmp)
exchange_particles(remote, tmp, n_blocks, rank, rank_size , i)
;
remote=tmp;
}
#pragma omp task inout([n_blocks] local ) inout([n_blocks]
forces)
update_particles(n_blocks, local, forces, time_interval);
}
#pragma omp taskwait
}
12
3. OMPSS 3.2 Mercurium compiler
every task is dependant from the previous one, but communication and computation
can be overlapped. In OmpSs tasks can be nested, so calculate forces function can be
a highly parallel function too.
The program will wait for every task to ﬁnish by using pragma omp taskwait di-
rective.
3.2 Mercurium compiler
Mercurium is a source-to-source compilation infrastructure aimed at fast prototyping.
Current supported languages are C, C++ and Fortran. Mercurium is mainly used in
Nanox environment to implement OpenMP but since it is quite extensible it has been
used to implement other programming models or compiler transformations, examples
include Cell Superscalar, Software Transactional Memory, Distributed SharedMemory
or the ACOTES project, just to name a few.
Extending Mercurium is achieved using a plugin architecture, where plugins repre-
sent several phases of the compiler. These plugins are written in C++ and dynamically
loaded by the compiler according to the chosen conﬁguration. Code transformations
are implemented in terms of source code.
Internally, the compiler is divided in the front-end, which uses bison to parse the
source code and transform it into a tree which will be analysed or modiﬁed by the
back-end. And the back-end, in which plug-ins and phases are implemented. During
this project two phases were implemented:
• A simple phase which adds a function call to the ﬁrst line of the main/PRO-
GRAM and will be used by our ”workers” and also by some other extensions of
OmpSs whenever they have to change the behaviour of the user’s main.
• The main device phase which handles the code generation of oﬄoad tasks.
At Figure 3.1 we can see how mercurium will compile a program, the input source
will be parsed by the frontend and the tree will be generated, after doing so, this code
will be processed by the OmpSs/OMP code generator, which will process pragmas/-
tasks in case the code has it and will generate some parts of code needed by the runtime
13
3. OMPSS 3.2 Mercurium compiler
Figure 3.1: Mercurium compilation ﬂow
14
3. OMPSS 3.3 Nanox Runtime
in order to execute a task. After doing so, the device provider (speciﬁed with the tar-
get or onto clauses) will generate the code needed by that particular device in order to
execute the task and call any special compiler if needed by that concrete device.
After all the code has been generated (by modifying the tree), Mercurium will write
this tree into a ﬁle or multiple ﬁles which will be compiled by the native compiler or
with compilers speciﬁed by the devices, after doing so, Mercurium will merge all the
object ﬁles into a single object ﬁle/executable.
3.3 Nanox Runtime
Nanox is a runtime designed to support parallel environments, mainly OmpSs.
Nanox provides services to support task level parallelism using data dependen-
cies as a way to synchronize them. These tasks are implemented as execution threads
whenever possible. In addition, Nanox supports keeping coherency among diﬀerent
memory spaces (like GPUs, remote nodes...).
The main objective of Nanox is to be used to develop new parallel environments.
In order to help with that task, its extensible by plugins which can implement diﬀerent
features:
• Task scheduling
• Support to other devices
• Instrumentation
• Resource management
Many of the features of Nanox are also implemented as plugins. Like support to
CUDA/OpenCL devices, Extrae tracing and others.
However, Nanox is not designed to be used directly by application developers, but
instead as a backend for Programming Models such as OmpSs and OpenMP. These
programming models are supported by Mercurium Compiler.
15
Chapter 4
Implementation of the OmpSs Oﬄoad
In this chapter we will cover modiﬁcations done to our runtime and compiler in order
to perform OmpSs oﬄoading.
We already saw an example of how OmpSs works in Listing 3.1. Our objective
with dynamic oﬄoading is that regular OmpSs tasks which are normally executed in
the local node can be oﬄoaded to remote MPI processes transparently.
We have augmented OmpSs with a new API call to dynamically allocate nodes
(deep booster alloc) and a new OmpSs clause (onto) to easily oﬄoad task to these
newly allocated nodes. In Listing 4.1 we can see how both the API and the clause
are used in order to perform a vector sum in an allocated remote node. A detailed
description of the new API call and OmpSs clause are provided in next Section. In
order to execute this vector sum we have to perform the following actions, which will
be performed by OmpSs toolchain:
• Mercurium will generate one binary for each target architecture (in this case
one binary for Xeon and one for Xeon Phi). The binaries can be started either
in master or slave mode. There is only one binary that starts in Master mode,
which then executes the main/PROGRAM of the application. Once the main ap-
plication allocates additional MPI processes the corresponding binary in SLAVE
mode is executed. The binaries executed in slave mode, after initialization, waits
for orders from the master.
• After compiling, during the execution of the master, user will perform the alloca-
tion of the slaves using deep booster allocAPI call (which usesMPI Comm spawn multiple).
16
4. IMPLEMENTATION OF THE OMPSS OFFLOAD
Listing 4.1: Oﬄoad vector sum
void main(){
MPI_Comm workers;
deep_booster_alloc(MPI_COMM_SELF , 1, 1, &workers);
int a[N], b[N];
initVectors(a, b);
#pragma omp task in(b[0:N]) inout(a[0:N]) onto(workers ,0)
{
#pragma omp parallel for
for (int i=0; i<N;++i) {
a[i]=a[i]+b[i];
}
}
#pragma omp taskwait
printVector(a, N);
deep_booster_free(&workers);
MPI_Finalize();
}
• Once the remote node is available and all dependencies are satisﬁed, local Nanox++
will send both arrays to the remote node and will send the order to execute the
task, which after every data has been received, can be executed.
• After the task ﬁnishes, the remote node will send a signal to the master, which
will free the dependencies blocked by that task.
• After the dependencies are freed, the taskwait will ﬁnish and Nanox++will copy
the data back to the host. After doing so, the main program can continue printing
the results and freeing the nodes.
This is a simple example with no communications between the nodes, but in real
programs, communications may be performed between oﬄoaded tasks. More elabo-
rated examples are shown later in Evaluation Section.
17
4. IMPLEMENTATION OF THE OMPSS OFFLOAD 4.1 Nanox Runtime
4.1 Nanox Runtime
Nanox++ runtime had to be extended to allow allocation of remote nodes and task
oﬄoading to these nodes. In order to achieve this some modiﬁcations had to be done:
4.1.1 Oﬄoad mechanism
In order to allocate remote nodes, MPI Comm spawn multiple MPI call is used, this
call is not user-friendly and it is not widely used, but it can be very useful for hetero-
geneous architectures. In Listing 4.1 it’s used when the user calls our APIs to allocate
nodes.
Allocation is quite expensive in terms of execution time, with our model, allocation
is performed once for each node/group of nodes, then they can be reused as many times
as needed.
In order to allow users to easily allocate nodes, we extended Nanox++ so it pro-
vides a new user-level API which allows to allocate remote nodes, this API has the
following interface:
Listing 4.2: Oﬄoad node reservation API
DEEP_Booster_alloc(MPI_Comm SpawningIntercomm , int NNODES, int PPN,
MPI_Comm SpawnedIntercomm);
DEEP_Booster_free(MPI_Comm SpawnedIntercomm);
DEEP Booster alloc: This call allows users to allocate a group of remote nodes
for task execution. The environment variables of the remote nodes, as well as their
physical location can be speciﬁed by using a MPI-like hostﬁle or an environment vari-
able. Apart from allocating nodes using MPI Comm spawn multiple, it creates every
Nanox++ structure in both parts of the oﬄoad (threads, cache, etc) needed in order
to manage the remote node internally. More than one parent node may allocate and
launch tasks on a single remote node, but only one of them will be able to launch tasks
on it at the same time. NNODES speciﬁes the amount of nodes which the user wants to
allocate and PPN speciﬁes the number of processes per node which will be allocated.
18
4. IMPLEMENTATION OF THE OMPSS OFFLOAD 4.1 Nanox Runtime
DEEP Booster free: This call allows users to free remote nodes which were allo-
cated before, from this point they can not be used to oﬄoad anymore.
4.1.2 Data management
When executing tasks on remote nodes, data has to be moved to the remote location so
it is available when executing these tasks, in order to perform this movement, Nanox++
already has a directory/cache which manages data from diﬀerent memory spaces. In
Listing 4.1 it is used when the data has to be moved from the master node to the remote
node prior to the execution of the task and when the data comes back to the master node
in the taskwait .
In order to beneﬁt from Nanox++ cache, our device has to implement most basic
operations (allocate, free and copies), both in the local node and in the remote node,
so we implement every function needed in order to dispatch copies from the local
node, these functions will send an MPI message to a daemon which will be running
in the remote node, this daemon will do the remote allocation/free, or in case of data
transfers, place them in the right place.
In data transfers, there is an important optimization, which is data shared between
diﬀerent tasks which will be executed on diﬀerent remote nodes. In this case, instead
of sending the data through the host node, we will send one instruction to each node or-
dering them to transfer the data directly using the network between both nodes, which
should be faster than the one which connects the remote node with the host.
In addition to managing single-level cache, Nanox++ runtime will run in both
sides of the oﬄoad (local and remote), so when passing data from one level to another,
caches of both runtimes have to be consistent, this is useful whenever the user wants to
do multi-level oﬄoading or exploit other separate memory address space devices, like
GPUs, inside the remote node.
4.1.3 Threads and scheduling
Nanox++ can balance load between diﬀerent remote nodes dynamically, so if the pro-
grammer does not specify in which node each task must be ran, they will be launched
19
4. IMPLEMENTATION OF THE OMPSS OFFLOAD 4.1 Nanox Runtime
on any node which is available. In order to be able to do this, a thread representing
these remote nodes has to be implemented inside the runtime. This thread will make
sure that all the data is on the remote device and dependencies are satisﬁed before send-
ing the task. In Listing 4.1 it is used to manage dependencies and how and when the
task is executed, also to manage when the task ﬁnishes and the taskwait can proceed.
Scheduling policies also had to be modiﬁed, so programmer can specify in which
node he wants to enforce the task to be running on. This can be done by using the
onto clause which can be seen in Listing 4.1. In this clause, there are three possible
scenarios:
• No restrictions: Nanox will choose any of the oﬄoaded nodes to execute and are
currently free.
• Only communicator: Nanox will choose any free node to execute the task, re-
stricted to the the ones which are in this communicator and are not executing
anything.
• Communicator+Rank: Nanox will only run this task on the speciﬁed node.
4.1.4 Dynamic compilation of oﬄoad Plugin
One of the main problems that appear when developing libraries which use MPI is that
they have to work with any of the implementations of the MPI Standard. But each
implementation has a diﬀerent header and use diﬀerent kind of structures to represent
the data used by MPI (MPI Comm,...), so they are not binary compatible with each
other.
A workaround for this problem would be to compile one version of OmpSs for each
implementation of MPI. But there are many implementations and versions, specially in
HPC world, where each provider usually gives a optimized version for their machine,
so this solution would not be easy to maintain.
Our solution to this problem is to make the device as a Nanox Plug-in which is
totally independent from the core and is not compiled together with it. Whenever
Nanox is installed, the sources of the oﬄoad plug-in will be distributed instead of
20
4. IMPLEMENTATION OF THE OMPSS OFFLOAD 4.2 Mercurium compiler
being compiled. In addition to the sources, the ﬂags and the compiler name which
were used to compile Nanox, will be saved into the installation folder.
These sources and ﬂags will be later used by Mercurium to compile the plugin with
the same ﬂags and compiler than Nanox, and also with the same version of MPI that
the user uses to compile his application.
4.2 Mercurium compiler
Currently, OmpSs programming model is already supported by Mercurium, so most of
the code transformations needed to generate the source code of a task is provided by
the base version of the compiler, this includes copying and allocating private variables,
placing them in a data structure, and generating the information which Nanox++ needs
to manage data and dependencies. Mercurium also detects when scalar variables have
to be moved (ﬁrstprivate) from the host to the device, like N in the previous sample
4.2.1 Clause extensions
Mercurium has to support the onto clause as seen in Listing 4.1, so support for this
extra clause was added in the compiler. If this onto clause is detected, the mpi oﬄoad
device is assumed, so it will be processed by our phase.
4.2.2 Task generation
When generating tasks, an outline function/procedure is generated. The behaviour is
diﬀerent depending on the task:
• Outline tasks: this outline function is a wrapper which translates some pointers
when needed by the data management mechanism and calls the original user
function/procedure.
• Inline tasks: the task region is encapsulated into a function/procedure, variables
are renamed so they match the original user code, and after doing so, the outline
function can call this encapsulated function as if it was an outline task.
21
4. IMPLEMENTATION OF THE OMPSS OFFLOAD 4.2 Mercurium compiler
In our MPI oﬄoad approach, this outline function is split in two diﬀerent parts, a
host function which oﬄoads the tasks and waits for it to ﬁnish, and a remote/oﬄoaded
function which will actually execute the code.
When passing a task from the host to the oﬄoad, arguments have to be sent, in order
to do this, we pack every immediate/pointer argument into a C/Fortran structure/type
and create an MPI datatype which sends them to the oﬄoaded function. This allows us
to send these arguments in a single message while allowing MPI to translate datatypes
whenever it is needed (diﬀerent architectures).
The generated outline function for oﬄoad tasks follows this ﬂow:
1. Call host code of a task
2. Send task identiﬁer to the remote node
3. Pack task arguments and send it to the remote node via MPI
4. Ask Nanox++ to translate array/copies addresses from local pointers to their
equivalent remote pointers
5. Execute remote code of the task
• Receive data structure
• Execute outline with data structure
• Send task ﬁnish signal to the host
6. Receive task-end signal
When tasks are generated, our compiler will store both the host function and the remote
function into an array which is merged at linking time, so when generating the exe-
cutable ﬁles, every task which was generated in our executable or any library pointed
by it will be available on this unordered array.
22
4. IMPLEMENTATION OF THE OMPSS OFFLOAD 4.2 Mercurium compiler
4.2.3 Oﬄoading global variables
During previous steps, we did not mention global variables, which are not initialized
in the remote nodes, as we explained before, in order to oﬄoad these variables, they
are privatized and their value gets packed inside the oﬄoaded region. This works ﬁne
for most cases, but if the code of the task has a call to other procedure/function and this
other procedure/function uses these global variables the value of those variables will
not be available on that procedure/function.
To ﬁx this problem, before calling a task, we set the global variable pointing to
the local variable used inside the task, this way we make sure that every function will
be accessing the same variable and the right value will be copied back to the host
whenever it is needed. This is safe because only one task can be oﬄoaded at a time
on the same node, and using OmpSs programming model, if you are writing to the
same global variable in two tasks, they will be serialized, so they will not execute
concurrently on diﬀerent nodes.
4.2.4 Add custom compilers
In addition to these changes, three new proﬁles have been added toMercurium. mpimcc
(C), mpimcxx (C++) and mpimfc (Fortran). These proﬁles allow to automatically link
with MPI (needed in order to perform oﬄoad) and are capable of generating code for
both architectures (Intel k1om and x86). Using these compilers, generating the exe-
cutables should be as easy as running the same compilation sequence with and without
using the ﬂag –mmic.
4.2.5 Dynamic compilation of oﬄoad Plugin
As explained in Section 4.2.1, Nanox does not compile with Oﬄoad support but dis-
tributes the sources and the compilation ﬂags.
Mercurium oﬄoad proﬁle will compile user code normally by using native MPI
wrappers, as any other proﬁle would do.
23
4. IMPLEMENTATION OF THE OMPSS OFFLOAD4.3 Executable generation
Whenever a executable has to be generated, Mercurium will read the ﬂags dis-
tributed by Nanox installation and compile the Oﬄoad Plugin, which will be linked
with user code and Nanox.
This way, the Oﬄoad Plugin (which is the part of Nanox which depends on MPI)
will be compiled with the MPI implementation available in the environment, allowing
users to use any MPI available in the system instead of being constrained to the one
which was used to compile Nanox Runtime.
4.3 Executable generation
In order to oﬄoad, programmer has to provide as many executables as architectures
present in the oﬄoad, all of them named with the convention binary.arch. As previ-
ously stated, we provide the compiler executables to do this easily with k1om/mic and
x86 64/intel64 architectures, but this implementation may be used with other architec-
tures as long as this name convention is respected.
This executables are generated by our compiler, and they have two diﬀerent be-
haviours, the slave behaviour and the one which executes the original application code.
In the slave behaviour, which is enabled automatically by our oﬄoad system, the
executable will act as a daemon process which receives orders from another process
and executes whatever tasks are needed. As we want to keep non data transfers over-
heads as low as possible, the message which initializes tasks will be an integer. This
integer is the index of the task in our array of tasks. So ﬁnding which task we have to
execute is as fast as ﬁnding the index of the task in the host array, sending it through
the network and accessing the remote index the device in array.
As stated previously, in the remote node/accelerator, our runtime is available, which
means that our programming model is also available inside the remote node. This al-
lows the programmer to take advantage of OmpSs by using tasks also in the remote
node, including any CUDA or OpenCL device which is available in that node.
This technique for indexing tasks, even though it is fast, has a problem, as we pre-
viously stated, these arrays have no particular order, so it has to be synchronized at
initialisation time. When nodes are allocated, after the remote process is initialised,
24
4. IMPLEMENTATION OF THE OMPSS OFFLOAD4.3 Executable generation
some data structures are sent from the host node to the remote node. By using this data
structures, device will make sure that its tasks are on the same order than on the host,
so our way to identify task by using an ID works correctly.
With this behaviour, we get a small overhead at node allocation time, which only
happens a few times (usually one or two times at initialization time), but improve the
performance on each task call, which are executed many times.
25
Chapter 5
Evaluation
The objective of this Chapter is to compare the performance of oﬄoaded MPI kernels
and with native MPI oﬄoading. In addition we will show some real applications which
were ported to our Oﬄoad model.
5.1 Performance of oﬄoaded code
Our main objective is to show that the performance of our approach is similar to exe-
cuting the algorithm natively. We have evaluated our approach with a N-body simula-
tion benchmark implemented using diﬀerent programming models or code distribution
between host and oﬄoaded parts.
These results have been taken on Stampede Supercomputer using Intel compilers
and Intel MPI library. Our (per-node) test conﬁguration consists in:
Main Processor 2x8 Xeon E5-2680 2.7GHz
Memory Per Host 32GB 8x4G 4C 1600MHz
Coprocessor 1 Xeon Phi SE10P 1.1GHz
Co-processor Memory 8GB GDDR5
Host-Co interconnect QPI 8.0 GT/s PCI-e
Network protocol used for the tests was Ethernet emulated over inﬁniband because
no direct MIC-to-MIC physical communication over inﬁniband is available yet.
As seen in Listing 5.1 the computational part of our implementation is divided
in two diﬀerent parts, calculate forces O(n2) which calculates the forces produced be-
tween the particles for one iteration and update particles O(n), which based on previous
26
5. EVALUATION 5.1 Performance of oﬄoaded code
Listing 5.1: NBody OmpSs oﬄoaded
void parallelCalcForces(particle_t* local, particle_t* tmp, force_t*
forces, int np, int tsteps) {
int mpi_rank , mpi_size;
MPI_Comm_rank(MPI_COMM_WORLD , &mpi_rank);
MPI_Comm_size(MPI_COMM_WORLD , &mpi_size);
deep_booster_alloc(MPI_COMM_WORLD , mpi_size ,1,&workers);
for (int t = 0; t < tsteps; t++) {
#pragma omp task in([n_blocks] local) inout([n_blocks] forces,
[n_blocks] tmp) onto(workers,mpi_rank)
{
int mpi_size;
MPI_Comm_size(MPI_COMM_WORLD , &mpi_size);
particles_block_t * remote = local;
for(int i=0; i < mpi_size; i++){
calculate_forces(forces, local, remote, n_blocks);
exchange_particles(remote, tmp, n_blocks , rank, rank_size
, i);
remote=tmp;
}
}
#pragma omp task inout([n_blocks] local ) inout([n_blocks]
forces)
update_particles(n_blocks, local, forces, time_interval);
}
#pragma omp taskwait
deep_booster_free(MPI_COMM_WORLD ,mpi_size ,1,&workers);
}
27
5. EVALUATION 5.1 Performance of oﬄoaded code
Rank 1
Rank 2
Rank 3
Rank 4
0 N, N>0T T
forces local_p exchanged_p
Figure 5.1: Calculate forces particle communications among iterations. Represents
how each rank owns a local partition of the particles which will be exchanged with
other nodes in order to calculate the accumulated forces for that partition
calculate forces updates each particle speed and position. These parts are performed
on every iteration.
In Figure 5.1 we can see how calculate forces is divided between many MPI pro-
cesses, so one process calculates partial forces between his own particles, and then
exchanges his particles with his neighbours in order to calculate another group of par-
tial forces. After doing this interchange with every other process in the computation,
the forces will be fully calculated. Afterwards, particle positions and speeds are calcu-
lated on update particles.
In table 5.1 we can see in which devices each part of our benchmark is executed.
For the Intel Oﬄoad execution, communication has to be performed through the hosts,
because as explained before, those programming models do not allow communications
between diﬀerent nodes to be performed inside of kernels.
We can see a comparison of the data communications needed done during the
28
5. EVALUATION 5.1 Performance of oﬄoaded code
Table 5.1: NBody Execution Device Schema
Programming Model (MPI+...) Setup & check Calc forces O(n2) exchange particles O(n), MPI update part, O(n)
Native OmpSs(A) Xeon Phi Xeon Phi Xeon Phi Xeon Phi
OmpSs Oﬄoad(B) Host Xeon Phi Xeon Phi Xeon Phi
OmpSs Oﬄoad(C) Host Xeon Phi Xeon Phi Host
Intel Oﬄoad(E) Host Xeon Phi Host Xeon Phi
MPI Group 1
2x Xeon E5
2x Xeon E5
MPI_COMM_WORLD
MPI Group 2
Xeon Phi
Xeon Phi
MPI_COMM_WORLD
2N/NP particles
(Using OmpSs Offload)
2N/NP particles
(Using OmpSs Offload)
N Particles
(using MPI)
NP= Number of Processes
Figure 5.2: OmpSs NBody Simulation oﬄoad (Single iteration communications).
Master processes will spawn two worker processes which can communicate internally
in the Xeon Phi and have them calculating the forces. Total Communications = 2N/NP
+ N
benchmark using our oﬄoad and Intel Oﬄoad in Figures 5.2 and 5.3.
When implementing this simulation with Intel Oﬄoad or OmpSs oﬄoad, there
are some diﬀerences, some of them come because of the fact that Intel Oﬄoad is not
capable of communicating between the MICs.
By looking at listings 5.1 and 5.2, we can compare oﬄoad directives, syntactically
both directives look similar, but in OmpSs oﬄoad, programmer only has to specify the
dependencies/copies, as data allocation and transfers are handled automatically by our
runtime instead of having to specify allocate and free on pragmas whenever program-
mer wants to reuse data like when using Intel Oﬄoad. Also in OmpSs Oﬄoad user can
oﬄoad the internal loop as a whole, as remote processes can exchange particles using
MPI, in Intel Oﬄoad this has to be performed in the local node
29
5. EVALUATION 5.1 Performance of oﬄoaded code
Listing 5.2: NBody implemented with Intel Oﬄoad
void parallelCalcForces(particle_t* local, particle_t* tmp, force_t*
forces, int np, int tsteps) {
int mpi_rank , mpi_size;
MPI_Comm_rank(MPI_COMM_WORLD , &mpi_rank);
MPI_Comm_size(MPI_COMM_WORLD , &mpi_size);
for (int t = 0; t < tsteps; t++) {
particles_block_t * remote = local;
for(int i=0; i < mpi_size; i++){
#pragma offload target(mic) in(local[0:n_blocks] :
alloc_if(t==0) free_if(0)) in(remote[0:n_blocks]
: free_if(i==rank_size -1) alloc_if(i<=1) out(
forces[0:n_blocks] : alloc_if(t==0) free_if(0))
calculate_forces(forces, local, remote, n_blocks);
exchange_particles(remote, tmp, n_blocks , rank, rank_size
, i);
remote=tmp;
}
#pragma offload target(mic) inout(local[0:n_blocks] :
alloc_if(0) free_if(t==tsteps -1)) inout(forces[0:n_blocks]
: alloc_if(0) free_if(t==timesteps -1))
update_particles(n_blocks, local, forces, time_interval);
}
}
30
5. EVALUATION 5.1 Performance of oﬄoaded code
MPI Group 1
2x Xeon E5
2x Xeon E5
MPI_COMM_WORLD
Xeon Phi
Xeon Phi
N/NP+N particles
(Using Intel Offload)
N/NP+N particles
(Using Intel Offload)
N Particles
(using MPI)
NP= Number of Processes
Figure 5.3: Intel Oﬄoad NBody Simulation (Single iteration communications). Master
processes will spawn one worker process each which can’t communicate internally in
the Xeon Phi and have them calculating the forces. Total Communications = N/NP +
2N
As seen in Figure 5.4, single node performance is almost the same when compar-
ing native versions with oﬄoad versions, so we can say that there is no performance
losses when executing code which was oﬄoaded when comparing to executing code
in native mode. Data transfers from master node to oﬄoad node will have a cost when
oﬄoading, but they are negligible in this case. Apart from this, we can see that, as
expected, host execution is much slower than Xeon Phi for this benchmark.
Same results can be seen on Figure 5.5 regarding scalability, in which we see that
the performance of the oﬄoad implementation is the same than executing the code
natively on the MICs. Performance keeps constant from 1 to 128 nodes, so we show
that OmpSs oﬄoading does not introduce any particular problem which prevent the
oﬄoaded parts of the code to scale and perform as as well as they do if they were
executed natively.
We executed our benchmark on Stampede Supercomputer with 128 nodes, from
Figure 5.6 we can see results for strong-scaling test. In this comparison, we can com-
pare with Intel Oﬄoad implementation, which seems to be performing better than our
31
5. EVALUATION 5.1 Performance of oﬄoaded code
 0
 10000
 20000
 30000
 40000
 50000
 60000
 70000
 80000
 90000
B
ill
io
n 
ite
rs
 p
er
 s
ec
on
d
Single node performance(4194304 particles total)
Native(Host)
Native(Phi)
OmpSs Offload(A)
OmpSs Offload(B)
Intel Offload
Figure 5.4: Single node performance of each of the diﬀerent versions of the NBody
Benchmark. We can see how it’s slower in the host and all the other versions, including
our oﬄoad, obtain similar performance.
 1
 2
 4
 8
 16
 32
 64
 128
 1  2  4  8  16  32  64  128
S
pe
ed
up
# Nodes
Weak Scaling (1048576 p. per node)
OmpSs Offload(A)
OmpSs Offload(B)
Intel Offload
Native(Phi)
Figure 5.5: Weak Scaling Speedup of diﬀerent versions of NBody Benchmark
32
5. EVALUATION 5.2 Overhead comparison with native MPI
 1
 2
 4
 8
 16
 32
 64
 128
 1  2  4  8  16  32  64  128
S
pe
ed
up
# Nodes
Strong scaling (4194304 particles total)
OmpSs Offload(A)
OmpSs Offload(B)
Intel Offload
Native(Phi)
Figure 5.6: Strong scaling speedup of diﬀerent versions of NBody Benchmark
implementation when theoretically it should perform worse as we allow direct MIC-to-
MIC communications. This is caused by Stampede not having an optimized InﬁniBand
driver on the MICs (yet), so transferring data between hosts using InﬁniBand and then
transferring data using QDR bus between host and MIC is actually faster than transfer-
ring data between MIC and MIC (which uses Ethernet and goes through the host).
5.2 Overhead comparison with native MPI
Now that we saw that code is not executed slower by just oﬄoading it, we have to
compare the performance versus a theoretical native MPI implementation. In order to
do this, there are two possible sources of overhead, the one whichDEEP Booster Alloc
introduces compared with MPI Comm Spawn and the one in oﬄoading tasks.
At Figure 5.7 we can see the time taken by each step of the spawn, we can see the
overhead of initializing our runtime and oﬄoad structures is negligible compared with
native MPI Comm spawn. In addition to this, scalability is good when increasing the
number of nodes, as the spawn can be done in parallel on each node.
In addition to allocation performance, we studied the performance of oﬄoading
simple tasks without computation nor big data transfers (with data cached in the re-
mote node). In order to do this we launch 1000 simple tasks in 1, 2 and 4 remote pro-
33
5. EVALUATION 5.2 Overhead comparison with native MPI
 0
 0.5
 1
 1.5
 2
 2.5
 3
 3.5
 4
1 2 4 8 16 32
Ti
m
e 
(s
ec
on
ds
)
# Workers
Native MPI Spawn time
Parallel Runtime Initialization Time
Offload/Nanox structures creation time
Figure 5.7: OmpSs vs Native MPI Spawn. Each bar shows how time is distributed in
MPI Spawn time, which is the time that MPI takes to spawn the processes, and the
extra operations we introduce to setup our oﬄoad mechanism.
cesses running in a Xeon Phi, considering the time of a dummy MPI task as sending
two messages (struct+integer) and receiving one integer. Native MPI achieves a peak
throughput of 35.000 dummy tasks per second while OmpSs achieves a throughput of
25.000 dummy tasks per second.
Listing 5.3: Static test OmpSs
#define N_STEPS 500
void sum(int* const __restrict__ vector1, int* const __restrict__
vector2,const int arraySize) {
int rank;
MPI_Comm_rank(MPI_COMM_WORLD ,&rank);
for (int i=0; i<100000; ++i) vector2[i%arraySize]+=
rank;
}
int main( int argc, char **argv )
{
const int ARR_SIZE=10000000;
int number_of_spawns = 2;
MPI_Comm workers;
34
5. EVALUATION 5.2 Overhead comparison with native MPI
double start1=wall_time();
deep_booster_alloc(MPI_COMM_WORLD ,number_of_spawns ,1,&
workers);
double start=wall_time();
int vector1[ARR_SIZE];
int vector2[ARR_SIZE];
int vector3[ARR_SIZE];
int vector4[ARR_SIZE];
for (int i=0; i<ARR_SIZE; ++i) {
vector1[i]=0;
vector2[i]=0;
vector3[i]=0;
vector4[i]=0;
}
for (int ns=0; ns<N_STEPS; ns++) {
#pragma omp task inout(vector1[0;ARR_SIZE],vector2
[0;ARR_SIZE]) onto(workers ,0)
{
sum(vector1,vector2,ARR_SIZE);
}
#pragma omp task inout(vector3[0;ARR_SIZE],vector4
[0;ARR_SIZE]) onto(workers ,1)
{
sum(vector3,vector4,ARR_SIZE);
}
#pragma omp task inout(vector1[0;ARR_SIZE],vector2
[0;ARR_SIZE],vector3[0;ARR_SIZE],vector4[0;
ARR_SIZE])
for (int i=0; i<ARR_SIZE; ++i) {
vector1[i]+=10;
vector2[i]+=10;
vector3[i]+=10;
vector4[i]+=10;
}
}
#pragma omp taskwait
double end=wall_time();
printf("Vectorvaluesis%d\n",vector1[50]);
printf("OmpSsTotalexecutiontime:%gs.\n", end - start);
printf("OmpSsSpawnexecutiontime:%gs.\n", end1 - start1)
;
35
5. EVALUATION 5.2 Overhead comparison with native MPI
deep_booster_free(&workers);
MPI_Finalize();
}
Listing 5.4: Static test native MPI
#define N_STEPS 500
#define ARR_SIZE 10000000
void sum(int* const __restrict__ vector1, int* const __restrict__
vector2,const int arraySize) {
int rank;
MPI_Comm_rank(MPI_COMM_WORLD ,&rank);
for (int i=0; i<100000; ++i) vector2[i%arraySize]+=
rank;
}
int main( int argc, char **argv )
{
if (argc>1){
int provided;
MPI_Init_thread(0,0,MPI_THREAD_MULTIPLE ,&provided);
MPI_Comm parent;
int arraySize;
int rank;
MPI_Comm_rank(MPI_COMM_WORLD ,&rank);
MPI_Comm_get_parent(&parent);
MPI_Recv(&arraySize ,1,MPI_INT ,0,100,parent,
MPI_STATUS_IGNORE);
int* vector1=malloc(sizeof(int)*arraySize);
int* vector2=malloc(sizeof(int)*arraySize);
//For N_STEPS, recv, add, send
for (int ns=0; ns<N_STEPS; ns++) {
MPI_Recv(vector1,arraySize ,MPI_INT ,0,100,
parent,MPI_STATUS_IGNORE);
MPI_Recv(vector2,arraySize ,MPI_INT ,0,100,
parent,MPI_STATUS_IGNORE);
36
5. EVALUATION 5.2 Overhead comparison with native MPI
sum(vector1,vector2,arraySize);
MPI_Send(vector1,arraySize ,MPI_INT ,0,100,
parent);
MPI_Send(vector2,arraySize ,MPI_INT ,0,100,
parent);
}
free(vector1);
free(vector2);
MPI_Finalize();
return 0;
}
int number_of_spawns = 2;
MPI_Comm workers;
MPI_Status status;
int size, again;
int provided;
MPI_Init_thread(&argc,&argv,MPI_THREAD_MULTIPLE ,&provided);
MPI_Comm_size(MPI_COMM_WORLD , &size);
char *array_of_commands[number_of_spawns];
char **array_of_argv[number_of_spawns];
MPI_Info array_of_info[number_of_spawns];
int n_process[number_of_spawns];
int i=0;
for (i=0; i<number_of_spawns; i++){
n_process[i]=1;
char *argvv[] = { "dummyarg", 0};
array_of_argv[i]=argvv;
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "host", "mic0"); //Set MIC ip
address here
array_of_commands[i]="./vecsum.mic";
array_of_info[i]=info;
}
double start1=wall_time();
MPI_Comm_spawn_multiple(number_of_spawns ,array_of_commands ,
array_of_argv , n_process ,
array_of_info , 0, MPI_COMM_WORLD , &workers,
MPI_ERRCODES_IGNORE);
37
5. EVALUATION 5.2 Overhead comparison with native MPI
double end1=wall_time();
double start=wall_time();
int vector1[ARR_SIZE];
int vector2[ARR_SIZE];
int vector3[ARR_SIZE];
int vector4[ARR_SIZE];
int arraySize=ARR_SIZE;
for (int i=0; i<ARR_SIZE; ++i) {
vector1[i]=0;
vector2[i]=0;
vector3[i]=0;
vector4[i]=0;
}
MPI_Send(&arraySize ,1,MPI_INT ,0,100,workers);
MPI_Send(&arraySize ,1,MPI_INT ,1,100,workers);
//For N_STEPS, send recv and add
for (int ns=0; ns<N_STEPS; ns++) {
MPI_Send(&vector1,ARR_SIZE,MPI_INT ,0,100,workers);
MPI_Send(&vector2,ARR_SIZE,MPI_INT ,0,100,workers);
MPI_Send(&vector3,ARR_SIZE,MPI_INT ,1,100,workers);
MPI_Send(&vector4,ARR_SIZE,MPI_INT ,1,100,workers);
MPI_Recv(&vector1,ARR_SIZE,MPI_INT ,0,100,workers,
MPI_STATUS_IGNORE);
MPI_Recv(&vector2,ARR_SIZE,MPI_INT ,0,100,workers,
MPI_STATUS_IGNORE);
MPI_Recv(&vector3,ARR_SIZE,MPI_INT ,1,100,workers,
MPI_STATUS_IGNORE);
MPI_Recv(&vector4,ARR_SIZE,MPI_INT ,1,100,workers,
MPI_STATUS_IGNORE);
for (int i=0; i<ARR_SIZE; ++i) {
vector1[i]+=10;
vector2[i]+=10;
vector3[i]+=10;
vector4[i]+=10;
}
}
double end=wall_time();
printf("Vectorvaluesis%d\n",vector1[50]);
38
5. EVALUATION 5.3 Real applications
 1
 10
 100
8 80 800 8000 80000
To
ta
l e
xe
cu
tio
n 
tim
e 
(s
ec
on
ds
)
Data size per task (KB)
OmpSs vs native MPI (1500 tasks)
Native
OmpSs
Figure 5.8: OmpSs vs Native MPI execution
printf("Totalexecutiontime:%gs.\n", end - start);
printf("Spawnexecutiontime:%gs.\n", end1 - start1);
MPI_Finalize();
}
As seen in Listings 5.4 and 5.3 we implemented a very simple program where
all native MPI tasks are hard-coded, something which in complex programs where
the control ﬂow is more complex and depends on decisions taken in the host will not
be easy to code. Even in this simple program, in addition to diﬀerences in logical
complexity, the amount of useful lines of code is 44 in OmpSs and 100 in native MPI.
At Figure 5.8 we can see that in our system the average overhead per task is neg-
ligible for large tasks, and quite small (around 3%) for small tasks. Ideally oﬄoaded
sections should be large tasks or kernels which can communicate between them using
native MPI, so overheads should not be a critical problem.
5.3 Real applications
Our objective with this oﬄoad model is not only to get as good performance as possi-
ble when oﬄoading, but allowing users to oﬄoad their software without much eﬀort.
39
5. EVALUATION 5.3 Real applications
Master
Master
Merge outputs
Worker
Worker
Worker
Worker
Send/Recv work
& results
Figure 5.9: SRMIP master-to-workers partitioning and work sharing before applying
Oﬄoad
In order to do this, three external applications have been ported to our oﬄoad sys-
tem. These applications are part of the DEEP/DEEP-ER project and are still being
improved. These applications implement a master-slave model and almost full oﬄoad
(I/O on hosts and everything else on Phis), which is one of the ways to use our model.
5.3.1 SRMIP
This C application does seismic imaging and follows a master-slave pattern, as the code
of the application is not public, we ported a prototype provided by the owners of the
code, in this prototype MPI processes are split into groups, being the ﬁrst rank in each
group the master of the group. This approach works correctly in single architecture
machines, but in multi architecture machines, there may be a problem as programmer
will have to control where he wants to execute each one of the workers.
In each group work comes from the master, as seen in Listing 5.9, which sends a
work order to the slaves and waits until the slaves ﬁnish processing the work. Being
this operation bound to the amount of processes, so each master has the same amount
of workers. After every worker ﬁnishes, masters merge the results of all the groups
into a single result.
With our oﬄoad model, only masters will be launched in the application, as seen
in Figure 5.10, and each master will allocate as many workers as needed, so having a
40
5. EVALUATION 5.3 Real applications
Master
Master
Merge outputs
Worker
Worker
Worker
Worker
Worker
Figure 5.10: SRMIP master-to-workers partitioning and work sharing after applying
Oﬄoad
dynamic amount of workers per master. User can get a dynamic amount of workers per
master without all the hassle which comes from managing which ranks are workers and
masters in each group. In addition to this, workers can be seen by every master (this
is optional, depending on the allocation strategy), so diﬀerent masters can use the pool
of workers globally, allowing to perform load balancing between diﬀerent masters.
The resulting pseudo-code of the application can be seen in 5.5.
As explained before, the structure of this application ﬁts our programming model
very well, so apart from oﬄoading tasks we were able to hide most of the complexity
of implementing a master-slave mechanism inside MPI, so only 4 MPI Calls (commu-
nication between the masters) are needed in our version, the number of lines of code
needed which can be seen in Table 5.2:
41
5. EVALUATION 5.3 Real applications
Listing 5.5: SRMIP oﬄoad pseudo-code
void main(int argc, char* argv[]) {
MPI_Com_size(MPI_COM_WORLD , &masters);
// Allocate and init image_t data structure
image_t global_pool , master_image , worker_pool[workers];
MPI_Comm comm_workers;
DEEP_Booster_alloc(MPI_COMM_WORLD , n_workersm , 1, &
comm_workers);
int my_jobs = jobs / masters;
for(int i=0; i<my_jobs; i++) {
int idx = i % workers;
#pragma omp task out(worker_pool[idx]) onto(
comm_workers)
worker(worker_pool[idx]);
#pragma omp task inout(master_image) in(worker_pool
[idx]);
accum(master_image , worker_pool[idx])
}
#pragma omp taskwait
mpi_global_accum(global_pool , master_image);
}
42
5. EVALUATION 5.3 Real applications
Table 5.2: Comparison of the number of code lines with and without OmpSs oﬄoad
SRMIP results Original OmpSs Oﬄoad
# Code Lines 415 314
# MPI Calls 14 4
# OmpSs Pragmas 0 3
5.3.2 TurboRVB
This Fortran application is a MPI based application where every process performs an
initialisation which reads data from ﬁles and performs some MPI communications,
after doing so, it performs the computation and writes the output. It works well on the
MICs, so the objective with this application was only to execute it completely in the
MICs without having to rely if ﬁles were available inside the MIC ﬁle-system.
In order to oﬄoad this application, the initialization phase was split in two diﬀerent
phases, a local initialization phase where data from ﬁles is read and some of the input
parameters are allocated. After doing so, each process will allocate one remote MIC
node and the rest of the computation will be oﬄoaded to that node. Once the oﬄoad
begins, a post-initialization phase is performed, and all temporary arrays are allocated
in remote nodes with the parameters read in the cluster part of the execution.
The main challenge when oﬄoading this application came from their extensive use
of global/module variables inside the oﬄoaded region, as according to OpenMP stan-
dard, variables which are private for a task are only private for the task region, but not
for calls to external functions in diﬀerent modules which use the same variable. This
restriction which only applies for real single memory space programs made oﬄoading
big codes very diﬃcult. As we had implemented an oﬄoad model in which we have
another instance of the process in which global variables are somehow private, in the
sense that are not used by anyone except tasks and do not share value with the main
instance of the program, we can modify them without aﬀecting the ones in the main
program, so whenever we oﬄoad a task which uses global variables, we copy the value
of the private variable into the region pointed by the global variable, and make the pri-
vate variable point the global variable. This way, both the task code and the calls to
functions which are performed inside the task code, will use the same variable.
43
5. EVALUATION 5.3 Real applications
Figure 5.11: FWI execution 1. Shows how work is oﬄoaded to level 2 nodes and how
level 3 nodes calculate the shots.
5.3.3 FWI
Full Wave Inversion is an oil related application which has been developed by the
CASE department of BSC.
A very simpliﬁed pseudo-code of this application can be seen at listing 5.6, we can
see that this application has a master process (level 1) and two levels of Oﬄoad (levels
2 and 3). The master process will allocate the ﬁrst level of oﬄoads and afterwards
each worker of this ﬁrst level will allocate its own second level oﬄoad workers. Once
all the nodes have been allocated, they are ready to execute all the tasks compose the
application.
First the master will oﬄoad one shot to process to each of the ﬁrst level workers,
which will decompose this shot in as many parts as needed and send the shot to the
second-level workers so they process the shot. After all of this work has been ﬁnished
and all shots have been processed, a similar workﬂow will be executed in order to
merge these shots.
By looking at ﬁgure 5.11 we can see a trace obtained with Extrae of a short execu-
tion of this program with 12 shots and a distribution of 1x4x4 (17 total nodes) workers.
We can see how ﬁrst the second level boosters are allocated in the ﬁrst part of the pro-
gram and how after doing so, second level workers will decompose the shot and send
44
5. EVALUATION 5.3 Real applications
Listing 5.6: FWI pseudo-code
void main(int argc, char* argv[]) {
MPI_Com_size(MPI_COM_WORLD , &masters);
// Allocate slaves
DEEP_Booster_alloc(MPI_COMM_SELF , n_slaves , &comm_slaves);
for(int i=0; i<n_slaves; i++) {
// Order slaves to allocate workers
#pragma omp task onto(comm_slaves ,i)
DEEP_Booster_alloc(MPI_COMM_SELF , n_workers , &comm_workers);
}
for(int i=0; i<nshots; i++) {
#pragma omp task out(shots[i]) onto(comm_slaves)
processShot(shot[i]);
}
for(int i=0; i<nshots; i++) {
#pragma omp task in(shots[i]) inout(merger) onto(comm_slaves)
merge_shot(shots[i],merger);
}
}
void processShot(int shotID) {
double decomposedShot[n_workers]=readAndDecShot(shotID);
for(int i=0; i<n_workers; i++) {
#pragma omp task inout(decomposedShot[i]) onto(comm_workers ,
i)
processDecomposedShot(decomposedShot[i]);
}
}
45
5. EVALUATION 5.3 Real applications
Figure 5.12: FWI execution 2. Shows how each group of 4 level 3 nodes process
diﬀerent shots at the same time independently.
the computationally intensive part of the algorithm to be executed at the third level of
the oﬄoad. At ﬁgure 5.12 we can see a zoomed view of how this computational part of
the algorithm is executed in 4 diﬀerent groups of 4 workers each which perform MPI
communications internally.
46
Chapter 6
Conclusions and Future Work
This master thesis shows how OmpSs, a programming model that runs sequentially
written applications following a task based data-ﬂow execution model, has been aug-
mented with the capability of oﬄoadingMPI tasks to remote nodes dynamically spawned
during the execution of an application.
With the use of a simple and concise syntax, OmpSs can oﬄoad these tasks to re-
mote nodes. Our results show competitive performance when oﬄoading MPI tasks, as
the performance obtained is equivalent to the native execution of these MPI tasks and
the amount of tasks per second which can be oﬄoaded is very high. Thus, our oﬄoad-
ing extension is very similar to the one provided by the Intel Oﬄoading, but the last
one is restricted to computational kernels that can not contain any call to MPI, while
our collective oﬄoad fully support it. We believe our OmpSs oﬄoading capabilities
will help to exploit current and future heterogeneous clusters, providing application de-
velopers an eﬀective tool to code complex applications with a MPMP execution model
that really ﬁts the underlying hardware.
There are several areas of future work that we plan to work on, like integrating our
oﬄoad technique with existing job managers so users do not need to specify destination
nodes manually with host ﬁles or extending the uses of this technique to facilitate
migration and malleability of MPI applications.
47
References
[1] D. A. Mallon, N. Eicker, M. E. Innocenti, G. Lapenta, T. Lippert, and E. Suarez,
“On the scalability of the clusters-booster concept: a critical assessment of the
deep architecture,” in Proceedings of the Future HPC Systems: the Challenges of
Power-Constrained Performance. ACM, 2012, p. 3. 1
[2] A. Duran, E. Ayguade´, R. M. Badia, J. Labarta, L. Martinell, X. Martorell, and
J. Planas, “Ompss: a proposal for programming heterogeneous multi-core archi-
tectures,” Parallel Processing Letters, vol. 21, no. 02, pp. 173–193, 2011. 2
[3] K. O. W. Group et al., “The opencl speciﬁcation,” A. Munshi, Ed, 2008. 3
[4] C. Nvidia, “Compute uniﬁed device architecture programming guide,” 2007. 3
[5] “Openmp 4.0 (july 2013),” http://www.openmp.org/mp-
documents/OpenMP4.0.0.pdf, 2013, [Online; accessed 20-Dec-2013]. 8
[6] O. W. Group et al., “The openacc application programming interface,” 2011. 8
[7] J. Duato, A. J. Pena, F. Silla, R. Mayo, and E. S. Quintana-Orti, “rcuda: Reduc-
ing the number of gpu-based accelerators in high performance clusters,” in High
Performance Computing and Simulation (HPCS), 2010 International Conference
on. IEEE, 2010, pp. 224–231. 8
[8] A. Barak and A. Shiloh, “The mosix virtual opencll (vcl) cluster platform,” in
Proc. Intel European Research and Innovation Conference, 2011. 8
[9] C. J. Newburn, R. Deodhar, S. Dmitriev, R. Murty, R. Narayanaswamy,
J. Wiegert, F. Chinchilla, and R. McGuire, “Oﬄoad compiler runtime for the
48
REFERENCES REFERENCES
intel R© xeon phitm coprocessor,” in Supercomputing. Springer, 2013, pp. 239–
254. 10
[10] J. Bueno, L. Martinell, A. Duran, M. Farreras, X. Martorell, R. M. Badia,
E. Ayguade, and J. Labarta, “Productive cluster programming with ompss,” in
Euro-Par 2011 Parallel Processing. Springer, 2011, pp. 555–566.
[11] M. Snir, S. W. Otto, D. W. Walker, J. Dongarra, and S. Huss-Lederman, MPI: the
complete reference. MIT press, 1995.
[12] L. Dagum and R. Menon, “Openmp: an industry standard api for shared-memory
programming,” Computational Science & Engineering, IEEE, vol. 5, no. 1, pp.
46–55, 1998.
[13] J. Desouza, B. Kuhn, B. R. De Supinski, V. Samofalov, S. Zheltov, and
S. Bratanov, “Automated, scalable debugging of mpi programs with intel R© mes-
sage checker,” in Proceedings of the second international workshop on Software
engineering for high performance computing system applications. ACM, 2005,
pp. 78–82.
[14] N. Eicker, T. Lippert, T. Moschny, and E. Suarez, “The deep project-pursuing
cluster-computing in the many-core era,” in Parallel Processing (ICPP), 2013
42nd International Conference on. IEEE, 2013, pp. 885–892.
[15] R. L. Graham, T. S. Woodall, and J. M. Squyres, “Open mpi: A ﬂexible high
performance mpi,” in Parallel Processing and Applied Mathematics. Springer,
2006, pp. 228–239.
[16] N. Eicker, T. Lippert, J. S. Center, and F. Ju¨lich, “Deep–an accelerated cluster
architecture for exascale computing,” Intel European Exascale Labs, p. 12.
49
