

# Partnership for Advanced Computing in Europe

# Selection of Task Implementations in the Nanos++ Runtime

Judit Planas<sup>a,b</sup>, Rosa M. Badia<sup>a,b,c</sup>, Eduard Ayguadé<sup>a,b</sup>, Jesús Labarta<sup>a,b</sup>

{judit.planas, rosa.m.badia, eduard.ayguade, jesus.labarta}@bsc.es

<sup>a</sup> Barcelona Supercomputing Center, Barcelona, Spain <sup>b</sup> Universitat Politècnica de Catalunya, Barcelona, Spain

<sup>c</sup> Artificial Intelligence Research Institute (IIIA), Spanish National Research Council (CSIC), Madrid, Spain

# Abstract

New heterogeneous systems and hardware accelerators can give higher levels of computational power to high performance computers. However, this does not come for free, since the more heterogeneity the system presents, the more complex becomes the programming task in terms of resource utilization.

OmpSs is a task-based programming model and framework focused on the automatic parallelization of sequential applications. We present a set of extensions to this framework: we show how the application programmer can expose different specialized versions of tasks (i.e. pieces of specific code targeted and optimized for a particular architecture) and how the framework will choose between these versions at runtime to obtain the best performance achievable for the given application. From our results, obtained in a multi-GPU system, we can prove that our project gives flexibility to application's source code and can potentially increase application's performance.

# 1. Introduction

Since the emergence of the current multi-core processors and with the promise of the future many-core processors with hundreds to thousands of cores, both academia and industry have reacted with proposals for extending current programming methodologies or with new programming models to support the concurrency of these devices. Between them, we can find extensions to already existing languages or new programming models, like CUDA [1] or OpenCL [2].

Heterogeneity and hierarchy of the many-core processors are basic aspects that have to be considered in these proposals. The forecast speaks about scalable multi-core architectures composed of streamlined processor cores and accelerators, with on-chip network and advanced power management technologies.

We can then expect that future many-core chips will have both heterogeneity (different type of processors, including accelerators) and hierarchy (for example, organized in clusters of cores, possibly with associated local memory).

Programming models should be able to support this heterogeneity and hierarchy in such a way that the application is unaware of the underlying hardware and that can dynamically adapt to it.

The OmpSs programming model combines ideas from OpenMP [3] and StarSs [4]: it enhances OpenMP with support for irregular and asynchronous parallelism and heterogeneous architectures. It incorporates data-flow concepts that allow the compiler (Mercurium) and the runtime (Nanos++) to automatically move data as necessary and perform different kinds of optimizations.

OmpSs is able to run applications on systems that combine shared memory processors (SMPs) and GPUs. Moreover, we can even use clusters of SMPs and/or GPUs transparently from the application point of view with OmpSs [5].

The contribution of this paper is the addition of a new scheduler to the OmpSs runtime. Our motivations are code performance at low-cost maintenance: as new architectures appear, new programming paradigms do too and applications may become obsolete in a relatively short amount of time. With this new feature of the OmpSs runtime, we offer the ability to join separate pieces of code –i.e. new task implementations– to the original application without having to modify it. We can then rewrite certain parts of the old application to improve performance, fit new architectures or even adjust to user needs and join these rewritten parts to the original code, just like a puzzle.

This new scheduler is able to choose the most appropriate task implementation (at runtime) each time a task must be executed. As tasks are executed, the scheduler learns and keeps track of their behavior so that it can make more accurate decisions in the immediate future of the application execution.

The main targets of the scheduler are to make source code maintenance easier, give more flexibility towards portability between heterogeneous platforms and improve application's performance by means of resource (and thread) cooperation.

# 2. Motivation

Generally, there is not a single piece of code that fits all the existing hardware architectures, and even if we find that code, it will not be the best option (in terms of performance, energy consumption, ...) for all of them. Thus, it is not unusual to find different ways of implementing the same algorithm.

As an example, there are uncountable versions of the matrix multiplication algorithm. Figure 1 shows a simple implementation, with no optimizations that could run in several different architectures. However, this is not the optimal version for any of them. A few years ago, the IBM Cell/BE [6] architecture appeared. The user could run the code in Figure 1 in the SPE accelerators, but he would not get any benefit. Thus, by that time several implementations of the matrix multiplication algorithm appeared specifically targeted to the SPE architecture. The result was that the user had to rewrite a central portion of his code in order to take advantage of that powerful brand new accelerator.

As history repeats, we are currently experiencing the same effects with the emergence of general purpose GPUs (GPGPUs): a small amount of GPGPUs can give the same peak performance as a supercomputer, so it seems worth using them to make computational applications much faster. Nevertheless, in this case the problem is even worse than what happened with Cell/BE: GPUs cannot run normal CPU code. The only way is to write a specific version for the GPU architecture, something not trivial nowadays.

In short, with the emergence of new architectures that promise providing more performance to the applications, the code improvement and maintenance is getting more complicated and more expensive. We present in this paper a solution to alleviate this problem and we explain the details by giving several examples.

### 2.1. Example

In order to illustrate our proposal, we will show an example of a tiled matrix multiplication algorithm. The basic implementation of the computation with the appropriate OmpSs task directives is shown in Figure 1. The matrices

are stored in tiles of BS x BS elements. The **input** and **inout** clauses express data dependencies between tasks and ensure that the computation over the different blocks is done in the correct order.

But, as this implementation is very simple and non-optimal we might want to try calling a GPU kernel instead. Figure 2 shows the additional code that we would have to add to the original matrix multiplication code in order to use GPUs. The **device** clause specifies that the task should be run in a GPU architecture. The **copy\_deps** clause indicates that the data specified by the dependence clauses should be readily available in the GPU memory before the task starts its execution. In this case, we directly call the CUBLAS library.

So, by adding just the piece of code in Figure 2, OmpSs with our proposed extension will evaluate the performance of the two implementations of *matmul\_tile* task and choose the most suitable version at each time. There is no hard limit on the number of task versions, so we could add as many as we want (i.e. CBLAS library on CPU, specific implementation for SPE, ...). Figure 3 shows another example of a CUDA implementation of *matmul\_tile* task. This implementation configures and calls a hand-coded CUDA kernel called *gemm\_kernel*.

```
1 #pragma omp target device (smp) copy_deps
                                                                  1 #pragma omp target device(cuda) implements(matmul_tile) \
2 #pragma omp task inout([BS*BS]C) input([BS*BS]A,[BS*BS]B)
                                                                  2
                                                                       copy deps
3 void matmul_tile (float *A, float *B, float *C , int BS)
                                                                  3 #pragma omp task inout([BS*BS]C) input([BS*BS]A,[BS*BS]B)
4 {
                                                                  4 void matmul_tile (float *A, float *B, float *C , int BS)
    int i, j, k;
5
                                                                  5 {
6
     for (i = 0; i < BS; i++)
                                                                  6
                                                                       cublasSgemm ('T','T', BS, BS, BS, 1.0, A, BS, B, BS, 1.0, C, BS);
7
       for (j = 0; j < BS; j++)
                                                                  7 }
                                                                  Figure 2: OmpSs Matrix Multiplication task calling
8
          for (k = 0; k < BS; k++) {
                                                                  CUBLAS
             C[i*BS+j] += A[i*BS+k] * B[k*BS+j];
9
          }
10
                                                                  1 #pragma omp target device(cuda) implements(matmul_tile) \
11 }
                                                                  2
                                                                        copy_deps
12
                                                                  3 #pragma omp task inout([BS*BS]C) input([BS*BS]A,[BS*BS]B)
13 void matmul (int m, int l, int n, float **A, float **B,
                                                                  4 void matmul_tile_cuda(float *A,float *B,float *C,int BS)
14 float **C, int BS)
                                                                  5 {
15 {
                                                                  6
                                                                       dim3 block(16, 16);
16
     int i, j, k;
                                                                  7
                                                                       dim3 grid(NB/block.x, NB/block.y);
17
      for (i = 0; i < m; i++)
                                                                  8
18
        for (j = 0; j < n; j++)
                                                                  9
                                                                       gemm_kernel<<<grid, block>>> (A, B, C);
           for (k = 0; k < 1; k++) {
19
                                                                  10 }
20
              matmul_tile(A[i*l+k],B[k*n+j],C[i*n+j],BS);
21
           }
22 }
                                                                  Figure 3: OmpSs Matrix Multiplication task calling a
Figure
        1: A simple C implementation of Matrix
Multiplication
                                                                  hand-coded CUDA kernel
```

3

# 3. OmpSs Programming Model

The OmpSs programming model [7] covers different homogeneous and heterogeneous architectures used nowadays and is open to cover future systems designed with new emerging architectures.

OmpSs is based on the OpenMP programming model with some modifications to its execution model: it uses a thread-pool execution model instead of the traditional OpenMP fork-join model. There is a master thread that starts the execution and several other threads that cooperate executing the work it creates from work sharing or task constructs. Therefore, there is no need for a **parallel** region. Nesting of constructs allows other threads to generate work as well.

OmpSs also differs from OpenMP in its memory model: it assumes that multiple address spaces may exist. So, shared data may reside in memory locations that are not directly accessible from all processing elements in the system. Therefore, all parallel code can only safely access private data and shared data that have been marked explicitly with OmpSs extended syntax. The runtime takes care of where data resides and manages data transfers as tasks consume or produce them. Data can be replicated on different memory spaces and coherency is transparently managed by the runtime.

In addition, OmpSs allows annotating function declarations or definitions with a task directive. In this case, any call to the function creates a new task that will execute the function body. The data environment of the task is captured from the function arguments. It integrates the StarSs dependence support [8] and allows annotating tasks with three clauses: **input**, **output**, **inout**. They allow expressing, respectively, that a given task depends on some data produced before, that will produce some data, or both. The syntax in the clause allows specifying scalars, arrays, pointers and pointed data. Data addresses and sizes do not need to be constant at compile time since they are detected at runtime. Also, the **task wait** construct (used as a barrier after parallel code) is extended as well with the **on** clause, which allows the encountering task to block until some data is produced.

The **target** construct [9] supports heterogeneity and data motion and can be applied to either tasks or functions. It accepts the **device** clause to express heterogeneity. It is used to specify which devices should run the associated code (e.g., cell, gpu, smp, ...). It also accepts the **copy\_in, copy\_out** and **copy\_inout** clauses that are used to specify the memory spaces where data must be available and where they are produced. Another accepted clause is the **copy\_deps** clause which is used to specify that if the attached construct has any dependence clauses, then they will also have copy semantics (i.e., **input** will also be considered **copy\_in, output copy\_out** and **inout copy\_inout**). The different copy clauses do not necessarily imply a copy before and after the execution of each task. This allows the runtime to take advantage of devices with access to the shared memory or implement different data caching and prefetching techniques without the user needing to modify his code. To make sure that data that have moved to a device is valid again in the host, SMP tasks must also use the copy clauses or appear after an OpenMP flush (either explicit or implicit). The **taskwait** construct has also been extended with the **noflush** clause which allows synchronizing tasks without flushing all the data on remote devices. Finally, the **target** construct also accepts the **implements** clause which is used to specify that the annotated task is an implementation of another task<sup>a</sup>. The information kept in this last clause will be used at execution time by our proposed scheduler to group different versions of the same task together and decide which version is run each time.

Thanks to the flexible design and implementation of OmpSs runtime, it is very easy to extend any of its features, like adding a new scheduler or even the support for a new architecture. We can add a new feature as a new plug-in and later on, when we run an application, we can decide which plug-ins should be enabled through configuration arguments or environment variables. Thus, it is very easy to run several times the same application using, for example, different schedulers since there is no need to recompile neither the OmpSs runtime nor the application; we just have to set the appropriate environment variables or configuration arguments just before each execution.

<sup>&</sup>lt;sup>a</sup> Although OmpSs syntax accepts the **implements** clause, none of the available OmpSs runtime schedulers take it into account: only the main implementation of each task will be run and the rest will be ignored.

#### 4. Versioning Scheduler

This section is focused on the implementation details of our proposal. It is divided into two different parts: the first part talks about the syntax and compiler support needed and the second one explains the runtime implementation.

# 4.1. Syntax and Compiler Support

As described before, tasks must be annotated in a certain way to make the runtime aware of all task implementations.

Between all the existing task implementations, only one will be the main implementation and the other implementations will be versions of it. This distinction is only a compiler issue and will not affect the runtime execution; from the runtime point of view, all task versions are treated equally.

Figure 4 shows an example of a source code with three task versions implementing the same task. The main version is called *main\_impl* and the other two versions are *another\_impl* and *yet\_another\_one*. Note that all of them receive exactly the same parameters and have the same inputs and outputs. The **implements** clause is only valid for functions annotated as tasks (it cannot be used in inlined tasks) and its argument always references the main implementation. It is not possible to create an implementation of another implementation if the second one is not the main version (i.e. yet another one task cannot be an implementation of another implementation about task's device: the main implementation does not need to be SMP-targeted, the programmer can give more than one implementation for each device or even the same implementation can be targeted to more than one device (provided that all devices specified in the device clause are able to run the code). For each set of task implementations, the compiler creates a structure with the necessary information for the runtime to identify the different task versions as a set of implementations for the same task. Basically, this structure contains a list of devices where the task can be executed and a pointer to the corresponding task function for each device.

```
1 #pragma omp target device (cuda) copy_deps
2 #pragma omp task input ([N]A) output ([N]B) inout ([N]C)
3 void main_impl (int N, float *A, float *B, float *C)
4 {
5 // Task code
6 }
7 #pragma omp target device (cuda) implements (main impl) \
8 copy deps
9 #pragma omp task input ([N]A) output ([N]B) inout ([N]C)
10 void another_impl (int N, float *A, float *B, float *C)
11 {
12 // Task code
13 }
14 #pragma omp target device (cuda) implements (main_impl) \
15 copy deps
16 #pragma omp task input ([N]A) output ([N]B) inout ([N]C)
17 void yet another one (int N, float *A, float *B, float *C)
18 {
19 // Task code
20 }
```

Figure 4: Example of different task versions

### 4.2. Runtime Implementation

We have implemented a new scheduling policy for the OmpSs runtime (Nanos++): the versioning scheduler. It is based on the already existing dependency-aware scheduler which tries to follow task dependency chains in order to promote data locality and minimize data transfers in a fast and simple way.

### 4.2.1. Data structures

The versioning scheduler keeps and updates several data structures during the whole application execution that collect information related to each set of task implementations. Table 1 and Figure 5 show the schema of the structures.

| TaskVersionSet | DataSize | <versionid, #exec="" exectime,=""></versionid,> | ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ |
|----------------|----------|-------------------------------------------------|---------------------------------------|
|                | DataSize | <versionid, #exec="" exectime,=""></versionid,> | Worker 1 Task List                    |
|                |          | <versionid, #exec="" exectime,=""></versionid,> |                                       |
| TaskVersionSet |          | <versionid, #exec="" exectime,=""></versionid,> | Workèr 2 Task List                    |
|                | DataSize | <versionid, #exec="" exectime,=""></versionid,> |                                       |
|                |          | <versionid, #exec="" exectime,=""></versionid,> | Worker n Task List                    |
|                | DataSize | <versionid, #exec="" exectime,=""></versionid,> | =                                     |
| TaskVersionSet | DataSize | <versionid, #exec="" exectime,=""></versionid,> | Figure 5: Worker's task queue         |
|                |          | <versionid, #exec="" exectime,=""></versionid,> |                                       |

Table 1: Task VersionSet data structure

A *TaskVersionSet* (Table 1) represents a set of task versions. Since data set size needed by each task at run time is taken into account, each set is divided into different groups, according to the amount of data needed (*DataSize*) by each task instance<sup>b</sup>. For each task version set and data set size, the information is kept per task implementation (*VersionId*). The information associated to a task implementation is the number of executions #*Exec* and their mean execution time *ExecTime*.

Each OmpSs worker thread<sup>c</sup> has its own task queue, as shown in Figure 5. Each element of the list is a pointer to a task and it will be used at runtime to assign tasks to threads and keep track of the amount of work each thread has in order to balance the task execution between all the available threads and resources.

# 4.2.2. Runtime scheduling policy

Before explaining the scheduling policy, we need to give several definitions of scheduler-related concepts:

- Mean execution time of a task version: Each time a task is run, its execution time is recorded and its mean execution time is updated as the arithmetic mean<sup>d</sup> of all the task executions. This value is used by the scheduler as the estimated execution time of that task version for future executions.

<sup>&</sup>lt;sup>b</sup> Calling a task may imply some data movements, so the scheduler takes into account data set sizes of tasks. Each task's parameter size is counted just once, even if it is an input/output parameter.

<sup>&</sup>lt;sup>c</sup> Each worker thread is currently devoted to only one device -SMP, GPU, etc.- and there can be as many workers as machine resources -cores, GPGPUs, etc.

<sup>&</sup>lt;sup>d</sup> Optionally, we could try computing a weighted mean to give more weight to recent execution information and less weight to past information, but we have not tried this option yet.

- Fastest executor of a task version set: The fastest task version of the task version set. We will use this concept to refer to the OmpSs workers that can run such task versions.
- Earliest executor of a task version set: The OmpSs worker that can finish the execution of a task version at the earliest time. It will be usually the fastest executor, but in some cases, when the fastest executor is busy running other tasks, the earliest executor may be another OmpSs worker.

We can divide application's execution into two different phases from the scheduler's point of view: the initial learning phase and the reliable information phase respectively. We consider that the initial learning phase finishes when the scheduler has enough reliable information about the execution of task version sets.

The initial learning phase consists of picking task versions from ready tasks in a Round-Robin fashion and distributing them among OmpSs workers. For each task version run, its execution time is recorded and thus data structures of Table 1 are filled with this information. We force the scheduler to run each task version at least lambda (a user configurable threshold) times during the initial learning phase. Once all tasks versions belonging to the same task version set and the same data set size have been run at least lambda times, we consider that the scheduler has enough reliable information and it switches to the reliable information phase<sup>e</sup> for the given task version set and the scheduler can have different criteria for the ready tasks that picks from the task graph, depending whether their corresponding task version set has enough reliable information or not.

During the reliable information phase, the scheduler tries to assign each task version to its earliest executor. To make this decision, it takes into account who is the fastest executor of the task version set, but also how busy is each worker –by checking each worker's task list-. In this phase, execution information is also recorded exactly in the same way as the previous phase: for each task version, its execution time is computed and its corresponding mean execution time is updated accordingly, so, somehow, the scheduler is always learning and recording execution information.

When the data set size of a ready task differs from what the scheduler has registered in its previous executions, it is considered like a new set of task implementations and it switches back to the initial learning phase behavior until it has again enough reliable information to move forward to the next phase, the reliable information phase.

The learning costs of our proposed scheduler are very application-dependent. Although the scheduler never stops learning (because it is always updating *TaskVersionSet* structures), we could say that the duration of the initial learning corresponds to the learning cost. Its cost is still application-dependent. For example, if one of the task versions is much slower than the others, the impact of the learning cost will be higher. In the same way, a short run with a few task instances (less than 10) will be highly affected by the learning costs (applications with 50-100 or more task instances have low learning costs).

# 5. Evaluation

This section covers the evaluation work that we have done in order to measure the performance of the presented OmpSs scheduler while running OmpSs applications using SMP and GPU specialized kernels.

# 5.1. Methodology

In order to evaluate our environment we selected a set of OmpSs applications and measured their scalability and performance with our scheduler, comparing it to the other schedulers of the runtime environment.

Additionally, and with the purpose of having a better understanding of the results, we also measured the amount of data transferred between host and device memory over the whole execution of the application. It may happen that the amount of data transferred is much bigger than the total size of application's data, because we may need to transfer data back and forth several times and all these memory transfers are taken into account. The amount of data is classified between three categories (we abbreviate "transfers" as "Tx"):

<sup>&</sup>lt;sup>e</sup> Changing from one phase to the other just means that the scheduler changes the criteria to assign task versions to workers.

- Device Tx: when using two GPU devices, the amount of data transferred between these devices.
- Output Tx: the total amount of data transferred from all GPU memory spaces to the host memory space (main memory).
- Input Tx: the total amount of data transferred from the host memory space (main memory) to any of the GPU devices. If a piece of data is transferred to two different devices, both transfers are taken into account.

Finally, we also counted for the versioning scheduler the number of times each implementation was run. This gives us an idea of how SMP and GPU devices cooperate together with application's execution.

# 5.1.1. Environment

We used MinoTauro machine at the Barcelona Supercomputing Center, a multi-GPU system running Linux with two Intel Xeon E5649 6-Core at 2.53 GHz and two NVIDIA GPUs M2090 with 512 CUDA cores and 6 GB of memory each. The total amount of memory of the system is 24 GB of memory.

All the codes were compiled with OmpSs compiler version 1.3.5.8 with optimization -O3 level. GCC version 4.4.4 and CUDA version 4.0 were used as back-end compilers for the SMP and GPU codes respectively. The OmpSs version is 0.7a.

# 5.1.2. Experiments

We run the selected applications with different configurations of numbers of cores and GPU devices to obtain the performance or execution time of each application.

For each application, we show the results of running the regular application (where each annotated task of the source code is targeting one device) with the hybrid version of the application (where annotated tasks can have one or more implementations for different devices). For the regular version of the application, we chose the fastest combinations of task implementations, and this is used to evaluate the quality of our scheduler. For the hybrid version, we gave several SMP and GPU implementations of tasks and let the versioning scheduler to choose at run time.

We used three different OmpSs schedulers to compare the results:

- Dependencies-aware scheduler: In order to reduce the number of transfers between devices, this is a simple implementation of a scheduler that tries to find chains of dependencies and schedule consecutive tasks of the same chain to the same device. Its decisions are fast, but in some cases cannot fully exploit data locality.
- Affinity scheduler: This scheduler is a smarter implementation that tries to minimize the amount of transfers between devices. For each task, it evaluates the amount of data that should be transferred to a certain device in order to execute the task. The scheduler chooses the device where the minimum amount of data must be transferred. We can exploit data locality this way, and reduce significantly the time spent in memory transfers.
- Versioning scheduler: This is the scheduler that we present in this paper which is described in Section 4.

Since the dependency-aware and affinity schedulers do not support having more than one implementation for a task, we can only run hybrid applications with several task implementations using the versioning scheduler.

Even though OmpSs tries to minimize the amount of data transfers, they still represent a significant amount of execution time, so we configured OmpSs to overlap data transfers with task execution. We also combined this feature with prefetching task data to achieve higher performance. These features do not depend on the scheduling policy we are using, so we could enable them for all the three different scheduling policies.

### 5.2. Results

We present here the evaluation of the versioning scheduler presented in this paper. We are going to compare the results of three applications (matrix multiplication, Cholesky factorization and PBPI) run with the three different scheduling policies mentioned before. We will also evaluate how the number of resources impacts application's performance.

#### 5.2.1. Matrix Multiplication

The application performs a dense matrix multiplication of two square matrices. Each matrix is divided into tiles; each created task performs a matrix multiplication operation on a given block of the destination matrix each product of a sub-block of the two input matrices is a task. We used three different kernels to do this computation: the CUBLAS kernel and a hand-coded CUDA implementation (both for a GPU architecture) and an SMP-targeted kernel calling the CBLAS library. The matrix size is 16384 x 16384 double-precision floating point elements (2 GB) and the block size is 1024 x 1024 elements (8 MB).

We present the results of two different application versions we tested. We omitted an SMP-only version of matrix multiplication algorithm because its performance is too low to be comparable to the tested versions:

- mm-gpu: only the GPU version of a matrix multiplication task is given. The task calls the *cublasDgemm* function from CUBLAS library.
- mm-hyb: hybrid application with three different implementations for the matrix multiplication task are given: the main implementation is the same as the one given in the mm-gpu. One of the other two implementations runs on the GPU, too, and calls a hand-coded CUDA kernel that performs the multiplication. The last implementation is an SMP-targeted task that calls the CBLAS library to compute the result.

Figure 6 shows the performance results of the two tested versions of matrix multiplication enabling the different schedulers. We can see that for the mm-gpu version there is no difference between using the affinity scheduler (mmgpu-aff) or the dependency-aware scheduler (mm-gpu-dep). In addition, the application shows the linear scalability when using one or two GPUs. There is no difference between using one, two, four or eight SMP threads because there is no parallelism to exploit for the SMP threads.

The only OmpSs scheduler that exploits the hybrid version of the application is the versioning scheduler, so we can only present the results of running the matrix multiplication with this scheduler (mm-hyb-ver). For a small number of threads, we can see that the overall performance is slightly lower due to several reasons: firstly, the overheads of sharing data between different memory spaces and secondly, because the execution time of the SMP version of matrix multiplication tasks is much higher than the execution time of the GPU version (SMP task duration is about 60 times the GPU task duration). Nevertheless, the more SMP worker threads collaborate in the application execution, the more benefit versioning scheduler takes despite the fact that more data is transferred.

The performance benefit may look very small in this case, but we cannot expect a huge speed-up because the peak performance of eight SMP cores is still quite far from the performance of a single GPU: one SMP core represents less than 1% of the machine's peak performance and one GPU represents around 45% of the peak.

100%





Figure 6: Matrix multiplication performance results

Figure 8: Matrix multiplication task statistics for versioning scheduler

Selection of Task Implementations in the Nanos++ Runtime



Figure 7: Data transferred for matrix multiplication

Figure 7 shows the amount of data transferred for each execution. GA represents the mm-gpu version run with the affinity scheduler, GD represents the same version run with the dependency-aware scheduler and HV represents the mm-hyb version run with the versioning scheduler. Because part of the computation is done on SMP devices and partial results are shared between CPU and GPU memory spaces, the amount of data transfers for the mm-hyb-ver increases. As we increase the number of SMP workers, we can see that memory transfers increase too, because more work is done by SMP workers and, thus, they need to share more data between SMP and GPU memory spaces. The versioning scheduler is also transferring data between GPU devices due to a lack of data locality.

Finally, we show in Figure 8 the number of times each version is run for the mm-hyb version. As we mentioned before, the application provides three different task versions: SMP version (that calls CBLAS library), CUDA version (that calls a hand-coded CUDA kernel) and CUBLAS version (that calls CUBLAS library). The fastest implementation (the CUBLAS version) is picked most of the times while the CUDA version is called only a few times at the beginning of the execution, because there is a faster implementation for the same device (its corresponding portion in the chart of Figure 8 is in the middle of each bar, but it is almost invisible). The SMP worker threads keep picking the SMP version while the GPUs are busy (except for the final part of the computation, where only the GPUs run the fastest implementation to avoid losing performance), and still take about 10% of the work on average that helps improving application's performance. We can notice that as we add more SMP workers, more work is done by them. And for the same number of SMP worker threads, we can see that they do more work when there is only one GPU, because the GPU computation is slower and they have more time (and more chances) to pick tasks.

# 5.2.2. Cholesky Factorization

The Cholesky factorization is a matrix operation commonly used to solve normal equations in linear least square problems. It mainly calculates a triangular matrix (L) from a symmetric and positive definite matrix (A). The product of this triangular matrix L and its transposed copy is A: Cholesky(A) = L, where  $LL^{t} = A$ .

The source code is the main algorithm of a tiled Cholesky factorization. The matrix A is organized in blocks of 2048 x 2048 single-precision floating point elements (4 GB), with a total of 32768 x 32768 elements (16 MB). The annotated application primitives (tasks) operate on these blocks. There are four annotated tasks: potrf, syrk, gemm and trsm. For the last three tasks we give a single GPU-targeted implementation that calls MAGMA [10] or CUBLAS libraries. For the potrf, we give two different implementations: one calls CBLAS and runs on the CPU and the other one calls MAGMA and runs on the GPU.

In order to get good performance in this application, it is important to carefully schedule the execution of potrf, because in Cholesky's task graph, there are some points where all the following tasks depend on the potrf task. So, it acts like a bottleneck and if it is not run as soon as its data dependencies are satisfied, there is less parallelism to exploit and, thus, we observe a slowdown in application's performance.

The different application versions we used are described below:

- potrf-smp: only SMP-targeted implementation of *potrf* task is given. Although we add the "smp" suffix, the other three tasks are always run on the GPU, because running them on the CPU would take too much time for the amount of data they are computing.
- potrf-gpu: a single GPU-targeted implementation is given for each task.
- potrf-hyb: two different implementations (SMP and GPU versions) are given for the potrf task. The other three tasks are always run on the GPU.

We show the results for the three Cholesky versions in the figures below. Performance results are shown in Figure 9. Running the *potrf* task in the SMP implies several data transfers from and to the GPU memory, plus the SMP version is slower than the GPU version. Thus, potrf-smp is the version that achieves less performance in all cases.



Figure 9: Cholesky factorization performance results



Figure 11. Cholesky task statistics for versioning scheduler



Figure 10. Data transferred for Cholesky

Although we can see the same performance trend for the potrf-hyb-ver version as for the matrix multiplication test (as we increase the number of SMP workers, the performance of the versioning scheduler gets better than the other tested versions and schedulers), the situation is different here. In this case, there is a small number of task instances, so the initial learning phase of the versioning scheduler impacts on application's performance. However, as we can see in Figure 10, having more SMP worker threads benefits performance for two reasons: the initial learning phase takes less time and a smaller amount of data is transferred compared to the other schedulers. In this case, the affinity scheduler cannot exploit data locality because there is load imbalance, so there is one GPU that steals tasks from the other one and this increases the number of memory transfers. However, it still achieves good performance because the data transfers can be overlapped with computation.

Figure 11 shows the percentage of times each task version has been run for the potrf-hyb-ver version. In contrast to the matrix multiplication case, the scheduler decides to assign all the work to the GPUs because they run faster and

there is not enough ahead scheduling (due to Cholesky's data dependency graph complexity) to assign a slow SMP task version to an SMP worker thread.

# 5.2.3. PBPI

PBPI is a parallel implementation of a Bayesian phylogenetic inference method for DNA sequence data. This solution is based on the construction of phylogenetic trees from DNA or AA sequences using a Markov chain Monte Carlo (MCMC) sampling method. There are two factors that determine the computation time: the length of the Markov chains used later to approximate the probability of the phylogenetic trees and the time actually needed to evaluate the likelihood values at each generation. The data set size used for this application is 50000 elements (500 MB).

In the implementation used in this paper, three different tasks are defined for each of the three computational loops that account for the majority of the execution time of the program.

In order to simplify the presentation of the results, we have provided up to two implementations for each of the first two computational loops. The third computational loop, although it is still a task, has a single SMP-targeted version.

We present the results of three different application versions:

- pbpi-smp: only the SMP versions of each of the three tasks are given. In this case, data always stay in the host memory and no data transfers will be needed.
- pbpi-gpu: for the first two computational loops (that have been taskified), we give only the GPU version. The third computational loop is always run on SMP architecture.
- pbpi-hyb: we give two implementations for the first two computational loops: the first ones run on the GPU and the other ones run on an SMP device. The third computational loop is always run on SMP architecture.

Unfortunately, this application has no floating point operations, so we cannot give its results in GFLOP/s like we did with the other two applications presented in this paper; we have to report its total execution time instead<sup>f</sup>.

Figure 12 shows the execution time of each version (remember that lower is better in this chart). We can see that pbpi-smp versions run faster than the pbpi-gpu versions. This is due to the fact that sending all the computational work of first and second loops to the GPU is not worth, since all the data will have to be transferred back and forth to run the third loop on the SMP workers and memory transfers cannot be overlapped properly due to data dependences.

However, the versioning scheduler is able to find the appropriate balance between SMP and GPU execution to take advantage and decrease the execution time. Although the amount of data transfers is higher, as shown in Figure 13, thanks to the look-ahead scheduling, it is able to overlap more data transfers with computation, so that we can see some benefit.

Figures 14 and 15 show the percentage of times each task version has been run for the first and second loop respectively. For the first loop, the versioning scheduler decides to send it most of the times to the GPU, but the execution of tasks of the second loop is shared between GPU and SMP. Although it may not be clear in the chart (because hundreds of thousands of tasks are run for the second loop), the SMP version is run many times (thousands of times) and this helps balancing the trade-off between sending data back and forth and running the tasks on SMP workers (the task itself is between three and four times slower for the SMP versions, but the time of transferring all the data to the GPU is high enough to consider sending all the work to the SMP workers).

<sup>&</sup>lt;sup>f</sup> Although we could also report matrix multiplication and Cholesky results as their execution time to make all the charts homogeneous, we think that giving their results in GFLOP/s is better for the reader understanding, since this measure for such applications is widely used among the Computer Scientific community.

Analysis and Optimization of Task-based Parallel Programming Languages



Figure 12. PBPI execution time results







Figure 14. PBPI task statistics for versioning scheduler (first loop)

Figure 15. PBPI task statistics for versioning scheduler (second loop)

#### **Related Work** 6.

Heterogeneous architectures that combine different types of computational units (i.e., GPUs and traditional processors) are becoming more common. However, the programmability of these nodes is an issue since the programmer needs to take into account several aspects at a time: parallelism, different programming styles (i.e., traditional programming languages like C/C++ for the CPUs and CUDA for the GPUs), and the existence of different memory spaces and therefore the management of the data transfers between them.

With regard to traditional node level parallel programming models, OpenMP is a widely adopted approach that focuses on shared memory systems. It was designed for productivity, initially focused on loop parallelism. Recently it has been extended with task based parallelism in its version 3.0 [3].

Cilk [11] is a task-based programming model based on the identification of tasks with the spawn keyword, and the sync statement is used to wait for spawned tasks. Both OpenMP and Cilk consider nested tasks (tasks that generate new tasks), but data dependency detection is not supported and additional synchronization points are required. Although Cilk only supported parallel tasks, Cilk++ also supports parallel loops.

NVIDIA GPUs became very popular in the recent years and are extensively integrated in the HPC clusters. CUDA is almost the de-facto standard for programming GPUs. CUDA [1] is an extension to C++ and it is based on kernels that are executed concurrently by several threads. With CUDA, the programmer is not only responsible of writing the application code and computational kernels, but also of performing memory allocation and managing the data transfers from host memory to device memory. In [12] tools to better map the algorithms to the memory hierarchy have been proposed. In this approach, programmers provide straight-forward implementations of the application kernels using only global memory and the CUDA-lite tools do the transformations automatically to exploit local memories.

OpenCL has been proposed as an alternative to program accelerators that can also be used to program general purpose multi-cores [2]. Although the portability is a strong aspect of OpenCL, it offers a too low-level API to the programmer, exposing her to explicitly manage the data and threads. For example, it is the responsibility of the programmer to build the program executable or to move data between the cores and accelerators.

In Lee et al. [13] propose OpenMPC, which is an OpenMP-to-CUDA translation system, which performs a sourceto-source conversion of a standard OpenMP program to a CUDA program and applies various optimizations to achieve high performance. The compiler interprets OpenMP semantics under the CUDA programming model and identifies kernel regions (code sections to be executed on a GPU) and transforms eligible kernel regions into CUDA kernel functions and inserts necessary memory transfer code to move data between CPU and GPU. Compared to OmpSs, OpenMPC focuses on loop parallelization, while OmpSs focuses on the exploitation of task-based concurrency.

HiCUDA [14] proposes the use of a set of directives that give hints to the compiler about regions of code that can be exploited in the GPUs, and data directionality. Mint [15] implements a translator that transforms stencil computations expressed in C, into CUDA code. Right now Mint does not support multi-GPU environments.

The CAPS HMPP [16] toolkit is a set of compiler directives, tools and software runtime that supports parallel programming in C and Fortran. HMPP works based on codelets that define functions that will be run in a hardware accelerator. These codelets can either be hand-written for a specific architecture or be generated by some code generator. Offload [17] is a programming model for offloading portions of C++ applications to run on accelerators. Code to be offloaded is wrapped in an offload block, indicating that the code should be compiled for an accelerator, and executed asynchronously as a separate thread. Call graphs rooted at an offload block are automatically identified and compiled for the accelerator. Data movement between host and accelerator memories is also handled automatically. The Sequoia [18] alternative focuses on the mapping of the application kernels onto the appropriate engines to exploit the memory hierarchy. The PGI Accelerator Compilers [19] and the Cray OpenMP Accelerator compilers [20] provide support for NVIDIA GPUs. Both compiler systems recognize regions of code annotated with a special pragma, and they outline the code to be run on GPUs. Data directionality clauses are also incorporated in both approaches.

StarPU [21] is based on a tasking API and also on the integration of a data-management facility with a task execution engine. With regard to data management, StarPU proposes a high level library that automates data transfers throughout heterogeneous machines [22]. In StarPU codelets are defined as an abstraction of a task (e.g., a matrix multiplication) that can be executed on a core or offloaded onto an accelerator using an asynchronous continuation passing paradigm. StarPU offers low level scheduling mechanisms (e.g., work stealing) so that scheduler programmers can use them in a high level fashion, regardless of the underlying (possibly heterogeneous) target architecture. OmpSs and StarPU present several similarities with regard the execution model, but StarPU is not proposing a programming model and therefore the programmer is confronted with lower APIs and execution details that are hidden in the OmpSs case.

In [23] the authors describe a runtime framework to schedule Direct Acyclic Graphs (DAGs) in heterogeneous parallel platforms. The proposal is based four important criteria: Suitability, Locality, Availability and Criticality (SLAC) and show that all these criteria must be considered by a heterogeneous runtime framework in order to achieve good performance under varying application and platform characteristics. Again as StarPU, the authors propose a runtime scheduler while OmpSs is proposing a whole programming model and execution model.

# 7. Conclusions and Future Work

In this work we have presented the implementation of a new feature for the OmpSs model. OmpSs runtime has been extended with the ability to run different implementations of a task, collect information about these task executions, learn how they perform and, finally, decide by itself which is the most appropriate implementation to run each time a task is called. This new feature has been presented as a new scheduling policy for the OmpSs framework: the versioning scheduler.

With this new scheduler, the programmer can write hybrid applications where more than one implementation for one or more devices (SMP, GPU, ...) is given for their tasks. This feature enhances the programmability of applications and makes its maintenance easier, because the programmer, at any time, can develop a new implementation for an already existent task in his code that targets the same or a different device and that can potentially improve application's performance.

From our results, we observe that, in most of the cases, the versioning scheduler out-performs the other existent schedulers for the OmpSs runtime and at the same time, gives more flexibility to the programmer. Only in a few cases the versioning scheduler slightly slows down the application compared to the other OmpSs schedulers.

Nevertheless, we have detected some limitations in our scheduler when using it in some specific applications: for example, the amount of data transfers is not optimal because data locality is not taken into account. We are going to provide the versioning scheduler with data locality information as future work in order to further improve the performance of applications.

In addition, also as future work, we are aware of some scheduler weaknesses and we are going to enhance them. For example, each set of tasks implementations has its own execution information depending on the size of the data that the task needs. It is true that the execution time of a task can potentially depend on the size of the data that it computes or processes, but this implementation decision means that if the data needed by two calls to the same task differs by only 1 byte, the scheduler will consider that these calls are completely different and will not reuse the data collected at the first call when the task is called for the second time. In this case, it would be better to group data sizes in a reasonable range so that different calls to a task that process different amounts of data of the same magnitude would be grouped together. With this optimization, the initial learning phase of the scheduler would take less time, so better decisions would be taken earlier.

Finally, as new features, the versioning scheduler should be able to combine the good aspects of both affinity-aware and dependency-aware schedulers. The scheduler should also offer the possibility to receive external hints for tasks versions: for example, read an XML file with additional information about tasks versions. This file can be written by the programmer, but it could also be written by OmpSs runtime in a previous execution of the application.

# Acknowledgements

This work has been supported by the European Commission through the ENCORE project (FP7-248647), the TERAFLUX project (FP7-249013), the TEXT project (FP7-261580), the HiPEAC-2 Network of Excellence (FP7/ICT 217068), the Intel-BSC Exascale Lab collaboration project, the funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement nr. RI-283493, the support of the Spanish Ministry of Education (TIN2007-60625, CSD2007-00050 and FPU program) and the Generalitat de Catalunya (2009-SGR-980).

#### References

- [1] NVIDIA CUDA Compute Unified Device Architecture Version 2.0, NVIDIA Corporation, 2008.
- [2] Khronos OpenCLWorking Group, The OpenCL Specification, version 1.0.29, 8 December 2008. [Online]. Available: http://khronos.org/registry/cl/specs/opencl-1.0.29.pdf
- [3] OpenMP ARB, "OpenMP Application Program Interface, v.3.0," May 2008.
- [4] J. M. Perez, R. M. Badia, and J. Labarta, "A dependency aware task-based programming environment for multi-core architectures," IEEE Int. Conference on Cluster Computing, pp. 142–151, September 2008.
- [5] J. Bueno-Hedo, J. Planas, A. Duran, R. M. Badia, X. Martorell, E. Ayguad´e, and J. Labarta, "Productive programming of gpu clusters with ompss," in Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium, ser. IPDPS2012, May 2012. [Online]. Available: http://www.ipdps.org/ipdps2012/2012 advance program.html
- [6] C. Johns and D. Brokenshire, "Introduction to the Cell Broadband Engine Architecture," IBM Journal of Research and Development, vol. 51, pp. 503–519, September 2007.
- [7] E. Ayguade, R. Badia, P. Bellens, D. Cabrera, A. Duran, M. Gonzalez, F. Igual, D. Jimenez-Gonzalez, J. Labarta, L. Martinell, X. Martorell, R. Mayo, J. Perez, J. Planas, and E. Quintana-Orti, "Extending OpenMP to Survive the Heterogeneous Multi-core Era," International Journal of Parallel Programming, vol. 38, no. 5-6, pp. 440–459, June 2010.
- [8] A. Duran, J. M. Perez, E. Eduard Ayguade, R. M. Badia, and J. Labarta, "Extending the OpenMP Tasking Model to Allow Dependent Tasks," in OpenMP in a New Era of Parallelism. Springer Berlin/Heidelberg, 2008, pp. 111–122.
- [9] E. Ayguade, R. M. Badia, D. Cabrera, A. Duran, M. Gonzalez, F. Igual, D. Jimenez, J. Labarta, X. Martorell, R. Mayo, J. M. Perez, and E. S. Quintana-Orti, "A Proposal to Extend the OpenMP Tasking Model for Heterogeneous Architectures," in IWOMP: Evolving OpenMP in an Age of Extreme Parallelism, vol. 5568. Dresden, Germany: Springer, June 2009, pp. 154–167.
- [10] R. Nath, S. Tomov, and J. Dongarra, "An Improved MAGMA GEMM for Fermi GPUs," University of Tennessee Computer Science (also LAPACK Working Note 227), Tech. Rep. UTCS-10-655, July 2010.
- [11] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou, "Cilk: an efficient multithreaded runtime system," SIGPLAN Not., vol. 30, no. 8, pp. 207–216, 1995.
- [12] S.-Z. Ueng, M. Lathara, S. S. Baghsorkhi, and W. mei W. Hwu, "CUDA-lite: Reducing GPU Programming Complexity," in In Languages and Compilers for Parallel Computing (LCPC) 21st Annual Workshop, August 2008.
- [13] S. Lee and R. Eigenmann, "Openmpc: Extended openmp programming and tuning for gpus," in Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '10. Washington, DC, USA: IEEE Computer Society, 2010, pp. 1–11. [Online]. Available: http://dx.doi.org/10.1109/SC.2010.36
- [14] T. D. Han and T. S. Abdelrahman, "hicuda: High-level gpgpu programming," IEEE Transactions on Parallel and Distributed Systems, vol. 22, pp. 78–90, 2011.
- [15] D. Unat, X. Cai, and S. B. Baden, "Mint: Realizing cuda performance in 3d stencil methods with annotated c," in Proceedings of the 25th ACM International Conference on Supercomputing, ser. ICS '11. New York, NY, USA: ACM, 2011, pp.214–224. [Online]. Available: http://doi.acm.org/10.1145/1995896.1995932
- [16] R. Dolbeau, S. Bihan, and F. Bodin, "HMPP: A Hybrid Multicore Parallel Programming Environment," in Workshop on General Processing Using GPUs, 2006.
- [17] P. Cooper, U. Dolinsky, A. F. Donaldson, A. Richards, C. Riley, and G. Russell, "Offload automating code migration to heterogeneous multicore systems," in Lecture Notes in Computer Science, HiPEAC Conference 2010, 2010, pp. 307–321.
- [18] T. J. Knight, J. Y. Park, M. Ren, M. Houston, M. Erez, K. Fatahalian, A. Aiken, W. J. Dally, and P. Hanrahan, "Compilation for explicitly managed memory hierarchies," in Proceedings of the 2007 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2007.
- [19] Portland Group Inc., "PGI Accelerator Compilers," Sep 2011.
- [20] Alistair Hart and Harvey Richardson and Alan Gray and Karthee Sivalingham, "Directive-based programming for GPUs, accelerators and HPC," December 2010. [Online]. Available: www.many-core.group.cam.ac.uk/ukgpucc2/talks/Hart.pdf
- [21] C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier, "StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures," Concurrency and Computation: Practice and Experience, Euro-Par 2009 best papers issue, 2010, accepted for publication, to appear. [Online]. Available: http://hal.inria.fr/inria-00550877
- [22] C. Augonnet and R. Namyst, "A unified runtime system for heterogeneous multicore architectures," in Proceedings of the International Euro-Par Workshops 2008, HPPC'08, ser. Lecture Notes in Computer Science, vol. 5415. Las Palmas de Gran Canaria, Spain: Springer, Aug. 2008, pp. 174–183. [Online]. Available: http://hal.inria.fr/inria-00326917
- [23] J. A. Pienaar, A. Raghunathan, and S. Chakradhar, "Mdr: performance model driven runtime for heterogeneous parallel platforms," in Proceedings of the international conference on Supercomputing, ser. ICS '11. New York, NY, USA: ACM, 2011, pp.225–234. [Online]. Available: http://doi.acm.org/10.1145/1995896.1995933