# Auto-tuned OpenCL kernel co-execution in OmpSs for heterogeneous systems

B. Pérez, E. Stafford, J. L. Bosque, R. Beivide

*Department of Computer Science and Electronics. Universidad de Cantabria. Santander, Spain.* {perezpavonb, stafforde, bosquejl, beivide}@u...es

S. Mateo, X. Teruel, X. Martorell, E. Ayguadé

*Barcelona Supercomputing Center. Universidad Politécnica de Cataluña. Barcelona, Spain.* {sergi.mateo, xavier.teruel, xavier.martorell, eduard.ayguade}@bsc.es

**Accepted Manuscript.** To appear in: J. Parallel Distrib. Comput. PII: S0743-7315(18)30818-9. DOI: https://doi.org/10.1016/j.jpdc.2018.11.001. Reference: YJPDC 3971. Received date: 7 April 2018. Revised date: 7 September 2018. Accepted date: 1 November 2018.

This is an unedited manuscript that has been accepted for publication. It will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form.

### Abstract

The emergence of heterogeneous systems has been very notable recently. The nodes of the most powerful computers integrate several compute accelerators, like GPUs. Profiting from such node configurations is not a trivial endeavour. OmpSs is a framework for task based parallel applications that allows the execution of OpenCL kernels on different compute devices. However, it does not support the co-execution of a single kernel on several devices.
This paper presents an extension of OmpSs that rises to this challenge, and presents Auto-Tune, a load balancing algorithm that automatically adjusts its internal parameters to suit the hardware capabilities and application behaviour. The extension allows programmers to take full advantage of the computing devices with negligible impact on the code. It takes care of two main issues: first, the automatic distribution of datasets and the management of device memory address spaces; second, the implementation of a set of load balancing algorithms to adapt to the particularities of applications and systems. Experimental results reveal that the co-execution of single kernels on all the devices in the node is beneficial in terms of performance and energy consumption, and that Auto-Tune gives the best overall results.

Keywords: Heterogeneous systems, OmpSs programming model, OpenCL, co-execution

#### 1. Introduction

The undeniable success of computing accelerators in the supercomputing scene nowadays is due not only to their high performance, but also to their outstanding energy efficiency. Interestingly, this success comes in spite of the fact that efficiently programming machines with these devices is far from trivial. Not long ago, the most powerful machines would have a set of identical processors. To further increase the computing power, now they are sure to integrate some sort of accelerator device, like GPGPUs or Intel Xeon Phi. In fact, architects are integrating several such devices in the nodes of recent HPC systems. The trend nowadays is towards highly heterogeneous systems, with computing devices of very different capabilities. The challenge for the programmers is to take full advantage of this vast computing power.

But it seems that the rapid development of heterogeneous systems has caught the programming language stakeholders unaware. As a result, there is a lack of a convenient language, or framework, to fully exploit modern multi-GPU heterogeneous systems.
This leaves the programmer to face these complex systems alone. It is true that several frameworks exist, like CUDA [1] and OpenCL [2], that can be used to program GPUs. However, they all regard heterogeneous systems as a collection of independent devices, and not as a whole. These enable programmers to access the computing power of the devices, but do not help them to squeeze all the performance out of the heterogeneous system, as each device must be handled independently. Guided by the host-device model introduced by these frameworks, programmers usually offload tasks, or kernels, to accelerator devices one at a time, meaning that during the completion of a task the rest of the machine is left idle. Hence, the excellent performance of these machines is tarnished by an energy efficiency lower than could be expected. With several devices in one system, using only one at a time is a considerable waste. Some programmers have seen this flaw, and have tried to divide the computing tasks among all the devices of the system [3, 4, 5]. But it is an expensive path in terms of coding effort, portability and scalability.

This paper proposes the development of a means to assist the programmer with this task because, code length and complexity considerations aside, load balancing data-parallel applications on heterogeneous systems is a complex and multifaceted problem. It requires deciding what portions of the data-set of a given kernel are offloaded to the different devices so that they all complete it at the same time [6, 7, 8].

To achieve this, it is necessary to consider the behaviour of the kernels themselves. When the data-set of a kernel is divided in equally sized portions, or packages, it can be expected that each one will require the same execution time. This happens in well behaved, regular kernels, but it is not always the case. The execution time of the packages of some kernels may have a wide variation, or even be unpredictable. These are considered irregular kernels.
While the balancing of a regular kernel can be decided prior to the execution, achieving near optimal performance, the same cannot be said about irregular ones. Their unpredictable nature forces the use of a dynamic approach that marshals the different computing devices at execution time. This, however, increases the number of synchronisation points between devices, which will have some overhead, reducing the performance and efficiency of the system. In conclusion, the diverse nature of kernels prevents the success of a single data-division strategy in maximising the performance and efficiency of a heterogeneous system.

Aside from kernel behaviour, the other key factor for load distribution is the configuration of the heterogeneous system. For the load to be well balanced, each device must get the right amount of work, adapted to the capabilities of the device itself. Therefore, a work distribution that has been hand-tuned for a given system is likely to underperform on a different one.

The OmpSs programming model is an ideal starting point in the path to hassle-free kernel co-execution. It provides support for task parallelism due to its benefits in terms of performance, cross-platform flexibility and reduction of data motion [9]. The programmer divides the code in interrelating tasks and OmpSs essentially orchestrates their parallel execution, maintaining their control and data dependences. To that end, OmpSs uses the information supplied by the programmer, via code annotations with pragmas, to determine at runtime which parts of the code can be run in parallel. It enhances OpenMP with support for irregular and asynchronous parallelism, as well as support for heterogeneous architectures. OmpSs is able to run applications on symmetric multiprocessor (SMP) systems with GPUs, through the OpenCL and CUDA APIs [10].

However, OmpSs can only assign kernels to single devices, therefore not supporting co-execution of kernels.
An experienced programmer could decompose the kernel into smaller tasks so that OmpSs could send them to the devices. But there would be no guarantee that the resources would be efficiently used or the load properly balanced. The programmer would also be left alone in terms of dividing the input data and combining partial results. This would lead to longer code, which would be harder to maintain.

As a solution to the above problems, this article presents an OmpSs extension which enables the efficient co-execution of massively data-parallel OpenCL kernels in heterogeneous systems. This has the advantage of providing a natural way to program using all the available resources that was not previously available in OmpSs. Manually achieving an equivalent functionality would require rethinking the application, creating different tasks with adequate granularities and even implementations. Moreover, this extra manual work would need to be repeated if the system configuration changed. By automatically using all the available resources, regardless of the system configuration, the extension performs kernel co-execution and extracts the maximum performance of these systems. It takes care of load balancing, input data partitioning and output data composition.

The experimental results presented here show that, for all the benchmarks used, being able to co-execute kernels on multiple devices has a positive impact on performance. In fact, the results indicate that it is possible to reach an efficiency of the heterogeneous system over 0.85. Furthermore, the results also show that, although the systems exhibit higher power demand, the shorter execution time grants a notable reduction in the energy consumption. Indeed, the average energy efficiency improvement observed is 53%.
The main contributions of this article are the following:

- The OmpSs programming model is extended with a new scheduler that allows a single OpenCL kernel instance to be executed by all the devices of a heterogeneous system.
- The scheduler implements two classic load balancing algorithms, Static and Dynamic, for regular and irregular applications.
- Aiming to give the best performance on both kinds of applications, two new algorithms are presented, HGuided and Auto-Tune, which is a parameterless version of the former.
- An exhaustive experimental study is presented, which corroborates that using the whole system is beneficial in terms of energy consumption as well as performance.

The rest of this paper is organized as follows. Section 2 presents background concepts key to the understanding of the paper. Next, Section 3 describes the details of the load balancing algorithms. It is followed by Section 4, which covers the implementation of the OmpSs extension. Section 5 presents the experimental methodology and discusses its results. Finally, Section 7 offers some conclusions and future work.

#### 2. Background

This section explains the main concepts of the OmpSs programming model that will be used throughout the remainder of the article.

OmpSs is a programming model based on OpenMP and StarSs, which has been extended in order to allow the inclusion of CUDA and OpenCL kernels in Fortran and C/C++ applications, as a simple solution to execute on heterogeneous systems [9, 10]. It supports the creation of data-flow driven parallel programs that, through the asynchronous parallel execution of tasks, can take advantage of the computing resources of a heterogeneous machine. The programmer declares the tasks through compiler directives (pragmas) in the source code of the application. These are used at runtime to determine when the tasks may be executed in parallel.
OmpSs is built on top of two tools:

- Mercurium is a source-to-source compiler that processes the high-level directives, and transforms the input code into a parallel application [11]. In this manner, the programmer is relieved of low-level details like thread creation, synchronization and communication, as well as the offloading of kernels in a heterogeneous system.
- Nanos++ is a run-time library that provides the necessary services for the execution of the parallel program [12]. Among others, these include task creation and synchronization, but also data marshaling and device management.

In the pragma annotations, the programmer specifies the data dependences between the tasks. Then, when the execution of the parallel program commences, a thread pool is created. Of these, only the master thread is active, and it uses the services of the run-time library to generate tasks, identified by work descriptors, and add them to a dependence graph. The master thread then schedules the execution of the tasks to the threads in the pool as soon as their input dependences are satisfied.

In terms of heterogeneous systems, OmpSs provides a target directive that indicates a set of devices in which a given task can run. In addition to a task, the target directive can be applied to a function definition. OmpSs also offers the ndrange clause that, together with the data-directionality clauses in and out, guides the data transfer between the devices and the host CPU, so the programmer perceives a single unified address space.

However, OmpSs does not support the execution of a single kernel instance in several devices. The extension proposed in this article modifies the Nanos++ runtime system so that it can automatically divide a kernel into sub-kernels and manage the different memory address spaces. In order to make the co-execution efficient, four load balancing algorithms have been implemented to suit the behavior of different applications.

#### 3. Load Balancing Algorithms

The behavior of the algorithms is illustrated in Figure 1. It shows the ideal case in which, in the execution of a regular application, all devices finish simultaneously, thus achieving perfect load balancing.

Figure 1: Depiction of how the four algorithms perform the data division among three devices. The work groups assigned to each device, identified by numbers, are joined in packages shown as larger rounded boxes. Note that the execution time of work groups in the CPU is four times larger than in the GPUs.

#### 3.1. Static algorithm

This algorithm works before the kernel starts its execution by dividing the dataset in as many packages as devices are in the system. The division relies on knowing the computing power of the devices in advance. Then the execution time of each device can be equalized by proportionally dividing the dataset among the devices. As a consequence, there is no idle time in any device, which would signify a waste of resources. The idea of assigning a single package to each device is depicted in Figure 1.

A formal description of the algorithm can be made considering a heterogeneous system with n devices. Each device i has computational power $P_i$, which is defined as the amount of work that a device can complete per time unit, including the communication overhead. This value depends on the architecture of the device but also on the application that is being run. These powers are input parameters of the algorithm and can be extracted by a simple profiled execution. The application will execute a kernel over W work-items, grouped in G work-groups of fixed size $L_s = \frac{W}{G}$. Since the work-groups do not communicate among themselves, it makes sense to distribute the workload taking the work-group as the atomic unit. Each device i will have an execution time of $T_i$. Then the execution time of the heterogeneous system will be that of the last device to finish its work, or $T_H = \max_{i=1}^n T_i$.
Also, since the whole system is capable of executing W work-items in $T_H$, it follows that the total computational power of the heterogeneous system is $P_H = \frac{W}{T_H}$. Note that it can also be computed as the sum of the individual powers of the devices.

$$P_H = \frac{W}{T_H} = \sum_{i=1}^n P_i$$

The goal of the Static algorithm is to determine the number of work-groups to assign to each device, so that all the devices finish their work at the same time. This means finding a tuple $\{\alpha_1, \ldots, \alpha_n\}$, where $\alpha_i$ is the number of work-groups assigned to the device i, such that:

$$T_H = T_1 = \dots = T_n \Leftrightarrow \frac{L_s \alpha_1}{P_1} = \dots = \frac{L_s \alpha_n}{P_n}$$

This set of equations can be generalised and solved as follows:

$$T_H = \frac{L_s \alpha_i}{P_i} \Leftrightarrow \alpha_i = \frac{T_H P_i}{L_s} = \frac{T_H P_i G}{W} = \frac{P_i G}{\sum_{i=1}^n P_i}$$

Since $\alpha_i$ is the number of work-groups, its value must be an integer. For this reason, the expression used by the algorithm is:

$$\alpha_i = \left\lfloor \frac{P_i G}{\sum_{i=1}^n P_i} \right\rfloor$$

If there is not an exact solution with integers then $\sum_{i=1}^{n} \alpha_i < G$. In this case, the remaining work-groups are assigned to the most powerful device.

The advantage of the Static algorithm is that it minimises the number of synchronisation points. This makes it perform well when facing regular loads with known computing powers that are stable throughout the dataset. However, it is not adaptable, so its performance may suffer with irregular loads.

#### 3.2. Dynamic algorithm

Some applications do not present a constant load during their executions. To adapt to their irregularities, the Dynamic algorithm divides the dataset into small packages of equal size. The number of packages is well above the number of devices in the heterogeneous system.
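Before going into the details of the Dynamic algorithm, the Static partition of Section 3.1 can be summarised in a short sketch. The Python function below is only an illustration of the floor expression and of the rule that assigns leftover work-groups to the most powerful device; the name `static_partition` and the example powers are assumptions made here, not part of the OmpSs extension.

```python
from math import floor


def static_partition(powers, num_groups):
    """Return the number of work-groups assigned to each device.

    powers     -- computational power P_i of each device (any consistent unit)
    num_groups -- total number of work-groups G
    """
    total_power = sum(powers)
    # alpha_i = floor(P_i * G / sum of powers), as in the Static algorithm.
    alphas = [floor(p * num_groups / total_power) for p in powers]
    # Work-groups lost to rounding are given to the most powerful device.
    remainder = num_groups - sum(alphas)
    alphas[powers.index(max(powers))] += remainder
    return alphas


# Hypothetical system: two GPUs, each four times faster than the CPU.
print(static_partition([4.0, 4.0, 1.0], 100))  # -> [45, 44, 11]
```

When several devices share the highest power, this sketch gives the remainder to the first of them, which is an arbitrary choice made for illustration.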
In the Dynamic algorithm, during the execution of the kernel, a master thread in the host is in charge of assigning packages to the different devices, following the next strategy:

1. The master splits the G work-groups into packages, each with the package size specified by the user. This number must be a multiple of the work-group size. If the number of work-items is not divisible by the package size, the last package will be smaller.
2. The master launches one package on each device, including the host itself if it is desired.
3. The master waits for the completion of any package.
4. When device i completes the execution of a package:
   (a) The device returns the partial results corresponding to the processed package.
   (b) The master stores the partial results.
   (c) If there are outstanding packages, a new one is launched on device i.
   (d) If all the devices are idle and there are no more packages, the master jumps to step 5.
   (e) The master returns to step 3.
5. The master ends when all the packages have been processed and the results have been received.

This behaviour is illustrated in Figure 1. The dataset is divided in small, fixed size packages and the devices process them achieving equal execution time. As a consequence, this algorithm adapts to the irregular behaviour of some applications. However, each completed package represents a synchronisation point between the device and the host, where data is exchanged and a new package is launched. This overhead has a noticeable impact on performance. The Dynamic algorithm takes the size of the packages as a parameter.

#### 3.3. HGuided algorithm

The two above algorithms are well known approaches to the problem of load balancing in general. But none satisfies three key aspects: first, taking into account the heterogeneity of the system; second, controlling the overhead of the synchronisation; and third, giving reasonable performance with regular and irregular applications.
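As a point of comparison for what follows, the master loop of the Dynamic algorithm (Section 3.2) can be mimicked with a small simulation. This is a minimal sketch under strong assumptions: each device has a constant speed, synchronisation and data transfer cost nothing, and the names `dynamic_schedule`, `speeds` and the example values are hypothetical. It is not the Nanos++ scheduler code.

```python
import heapq


def dynamic_schedule(num_items, package_size, speeds):
    """Simulate the Dynamic algorithm; return (finish_time, packages_per_device).

    num_items    -- total number of work-items W
    package_size -- fixed package size, in work-items
    speeds       -- work-items per time unit processed by each device
    """
    # Step 1: split the work into fixed-size packages; the last may be smaller.
    packages = [min(package_size, num_items - start)
                for start in range(0, num_items, package_size)]
    counts = [0] * len(speeds)
    # Priority queue of (time at which the device becomes idle, device index).
    idle_at = [(0.0, dev) for dev in range(len(speeds))]
    heapq.heapify(idle_at)
    finish = 0.0
    # Steps 2 to 4: give the next package to whichever device is idle first.
    for size in packages:
        now, dev = heapq.heappop(idle_at)
        done = now + size / speeds[dev]
        counts[dev] += 1
        finish = max(finish, done)
        heapq.heappush(idle_at, (done, dev))
    # Step 5: every package has been processed.
    return finish, counts


# Hypothetical run: 1000 work-items, packages of 100, two GPUs and one CPU.
print(dynamic_schedule(1000, 100, [400.0, 400.0, 100.0]))  # -> (1.25, [5, 4, 1])
```

In this toy run the devices do not finish at the same time, because the last fixed-size package is processed while the other devices are already idle. Every package also stands for one synchronisation point, which the simulation counts but does not charge for.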
To satisfy these three aspects, a new load balancing algorithm called HGuided was proposed, which is based on the Guided method from OpenMP. The main difference between the HGuided and the Dynamic algorithms is the size and quantity of the packages. In Dynamic, the size of the packages is constant, while in HGuided they vary throughout the execution and between the devices. As execution progresses, the size of the packages decreases with the remaining workload. This size is weighted with the relative computational capacity of each device. This way the less powerful devices (CPUs in this case) run smaller packages than they would in a homogeneous system, and the more powerful run larger packages. The package size for device i is calculated as follows:

$$package\_size_i = \left\lfloor \frac{G_r}{kN} \cdot \frac{P_i}{\sum_{j=1}^{N} P_j} \right\rfloor$$

Note that the first term gives a diminishing size of the packages, as a function of the number of pending work-groups $G_r$, the number of devices N and the constant k. The latter is introduced due to the unpredictable behavior of the irregular applications. It limits the maximum package size and, in the experimental evaluation of Section 5, was fixed to 2. The second term adjusts the package size with the ratio of the computing capacity of the device $P_i$ to the total capacity of the system.

On the other hand, in the Dynamic algorithm, the programmer sets the number of packages for each execution. However, in HGuided the number of packages is not known in advance, since the size of the packages depends on the order in which they are assigned to the devices. This can differ greatly between runs, especially in irregular applications. Therefore, this algorithm reduces the number of synchronisation points and the corresponding overhead compared to the Dynamic. Figure 1 shows how the size of the packages is large at the beginning of the execution, and decreases towards the end.

#### 3.4. Auto-Tune algorithm

The HGuided algorithm strikes a balance between adaptiveness and overheads, which makes it a good all-around solution that adequately distributes the workload of both regular and irregular applications. However, it still requires two parameters to be provided by the programmer: the computing power and the minimum package size. These have a key impact on performance and are dependent on both the application to be executed and the system itself. Moreover, the HGuided algorithm is quite sensitive to these parameters, so choosing an adequate value for them is sometimes a demanding task that requires a thorough experimental analysis. The sensitivity of the HGuided algorithm to its parameters is further analyzed in Section 2.2. In addition, determining the minimum package size parameter is complicated, especially for GPUs, because it is essential to do a sweep to obtain a value that gives good results. The computing capability is easier to evaluate. It only requires obtaining the relative capacities of the devices.

Figure 2: Evolution of the computing power per device.

The Auto-Tune algorithm is an evolution of the previous algorithm that achieves near optimal performance for both regular and irregular loads without the hassle of parameters. It uses the same formula to calculate the package size, but uses nominal parameter values that are adjusted at runtime, and handles the minimum package size differently depending on the device that each package will be sent to.

The computing power for the first package launched at each device is calculated using the theoretical GFLOPs of the hardware. These can be obtained at the installation of OmpSs either by querying the available devices or by running a simple benchmark. The computing power for subsequent packages is adjusted taking into account the computing speed displayed by each device. This is calculated as the average number of work-items processed per second for the last three packages launched to each device.
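The package size formula shared by HGuided and Auto-Tune, together with the runtime estimate of the computing power, can be sketched as follows. This is an illustrative Python sketch, not the actual plugin code: the names `package_size` and `PowerEstimator` are hypothetical, the clamping to a minimum size is a simplification, the nominal power is assumed to be expressed in the same unit as the measured speed, and averaging the per-package speeds is one possible reading of "average number of work-items processed per second".

```python
from collections import deque
from math import floor


def package_size(pending_groups, powers, device, k=2, min_size=1):
    """Package size, in work-groups, for one device.

    pending_groups -- number of pending work-groups G_r
    powers         -- current computing power P_j of every device
    device         -- index i of the device requesting work
    k              -- constant limiting the maximum package size
    min_size       -- minimum package size (assumed clamp)
    """
    n = len(powers)
    size = floor(pending_groups / (k * n) * powers[device] / sum(powers))
    # Never go below the minimum, nor above the work that is left.
    return min(pending_groups, max(size, min_size))


class PowerEstimator:
    """Computing power as the mean speed of the last three packages."""

    def __init__(self, nominal_power, window=3):
        self.nominal_power = nominal_power
        self.samples = deque(maxlen=window)  # old samples drop out automatically

    def record(self, work_items, seconds):
        """Store the speed observed for one completed package."""
        self.samples.append(work_items / seconds)

    @property
    def power(self):
        # Until a package has completed, fall back to the nominal value.
        if not self.samples:
            return self.nominal_power
        return sum(self.samples) / len(self.samples)


# Hypothetical powers: two GPUs four times faster than the CPU.
print(package_size(1000, [4.0, 4.0, 1.0], device=0))  # -> 74
print(package_size(1000, [4.0, 4.0, 1.0], device=2))  # -> 18
```

Note that this sketch switches each device from its nominal power to the measured one as soon as that device completes a package, which is a simplification of the behaviour described in the text.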
By using the average speed of the last packages, a gradual adaptiveness is attained that keeps the algorithm resistant to bursts of irregularity that would not be representative of the actual speed for the next packages.

Figure 2 depicts the evolution of the computing power during the execution of one of the applications used for experimentation. The nominal computing powers are used at the beginning of the execution until all the devices have finished at least one package. Then the computing powers are updated at runtime. In the figure, the nominal power for the GPU was higher than the actual one for the application. Note that this does not harm the load balancing, as all the devices are kept busy and do not delay the completion of the benchmark.

Package size also has an influence on the computing speed of throughput based architectures, such as GPUs. Consequently, package size must be kept relatively high to prevent an inefficient use of the hardware and overheads. However, this is also a potential source of imbalance. If the computing power of the devices differs greatly, a high minimum package size that reduces overheads is likely to be too big for slow devices, namely CPUs, which would cause delays. To prevent this, the Auto-Tune algorithm uses different minimum values for CPUs and GPUs.

The value selected for the CPU is one work-group per CPU core, so no hardware is left unused and imbalance is avoided. This is because the CPU is not a throughput device, so its computing speed is usually much less sensitive to package size than that of the GPUs. Moreover, CPUs are often the least powerful device of the system, so using a small minimum package size with them will improve the load balancing.

Two values are considered for the GPU minimum package size. First, the equations implemented in the CUDA Occupancy Calculator are used to obtain the minimum number of work-groups that will achieve maximum occupancy for the current kernel and GPU.
The CUDA Occupancy Calculator is part of the CUDA Toolkit since version 4.1. This value is a lower bound for the minimum package size, but might be too low if the application launches a large amount of work-items, producing too many packages and high overheads. To prevent this, the number of work-items is also analyzed and the final minimum package size is set to the maximum between the value obtained by the Occupancy Calculator and 5% of the work-items. This percentage has been experimentally set to keep the number of packages low and avoid performance degradation in the GPU.

These enhancements give forth an algorithm with improved adaptiveness, which delivers comparable performance to the HGuided approach for a fraction of the effort. It completely eliminates the need to provide any parameter and saves a great deal of pre-processing time per application and system, as will be seen in Section 5.3.

#### 4. Implementation

As stated before, the OmpSs infrastructure relies on the combination of two components: Mercurium, which is a source-to-source compiler, and Nanos++, which is a runtime capable of managing tasks, their data and the Task Dependence Graph (TDG) they generate. As a first approach, the new load balancing algorithms have been implemented focusing on making the changes as self-contained as possible and minimizing the impact on the OmpSs specification, Mercurium and the rest of Nanos++. As a result, neither directives nor clauses have been added to Mercurium.

Nanos++ implements a set of different schedulers that deal with the management of the tasks submitted to the runtime. To offer the work distribution strategies for a single OpenCL task presented in the previous section, a new scheduler has been implemented as a Nanos++ plugin, which has been called maat. The parameters of the algorithms are the following:

- The device computing powers for Static and HGuided.
- The package size for Dynamic.
- The minimum package size for HGuided.

To avoid altering the OmpSs specification, the selected algorithm and its parameters are set through environment variables, which is the normal way to configure the scheduler in Nanos++.

Figure 3 represents the outline of an OmpSs implementation of the Binomial benchmark used later in the experimentation. It shows how a call to a function defined as a task is followed by a wait. The header of that function, which is shown in Figure 4, indicates that the task must be run in an OpenCL device, as well as its launch parameters, input and output data. Figure 5 displays the environment variables that need to be set to run the task with each of the four algorithms presented in Section 3. As shown, the selection of the auto-tune algorithm eliminates the need of specifying any other load balancing related parameter.

```
// Initializations
binomial_options(NUM_STEPS, SAMPLES, randArray, output);
#pragma omp taskwait
// Free resources
```

Figure 3: Basic outline of an OmpSs application.

```
#pragma omp target device(opencl) copy_deps ndrange(1, samples*(numSteps+1), numSteps+1)
#pragma omp task in([samples]randArray) \
                 out([samples]output)
__kernel void binomial_options(int numSteps, int samples, const __global float4*
```

Figure 4: Header file for the task.

Despite the efforts made to minimize the impact on Mercurium, a minor change was unavoidable. The original implementation did not make OpenCL kernel configuration parameters available to Nanos++. This information is necessary for the operation of the plugin, as it defines the amount of work that will be performed. Nanos++ work descriptors do not hold this information either.
Consequently, a new Mercurium work descriptor creation function has been implemented which behaves like the original but includes these parameters.

```
# Static load balancing
NX_SCHEDULE=maat
NX_ALGORITHM=static
NX_GPU_POWER=34.0

# Dynamic load balancing
NX_SCHEDULE=maat
NX_ALGORITHM=dynamic
NX_DYN_PACKAGE_SIZE=409600

# HGuided load balancing
NX_SCHEDULE=maat
NX_ALGORITHM=hguided
NX_GPU_POWER=34.0
NX_MIN_PACKAGE_SIZE=115200

# Auto-Tune load balancing
NX_SCHEDULE=maat
NX_ALGORITHM=auto-tune
```

Figure 5: Environment variables to use standard OmpSs and the different load balancing algorithms.

When a work descriptor is submitted, the new scheduler manages its division in as many work descriptors as the selected algorithm and parameters require. These work descriptors are considered as children of the one submitted, and represent an aggregate workload equivalent to that of their parent.

For the Static and Dynamic algorithms, in which the number and size of the packages are known when the launch of the workload is made, all the work descriptors are created at the submission of their parent. They are stored in the scheduler and returned when a thread is idle and receptive to another task. In the case of the HGuided and Auto-Tune algorithms, the packages have varying sizes that depend on the prior execution and the device that will run them. As a consequence, the children work descriptors are created when required by an idle thread, considering the device it manages and the progress of the execution.

Each of the children work descriptors is identical to its parent except in two key aspects. First, they have different OpenCL parameters, namely offset and global\_work\_size, defining the workload of the package they represent. There is no constraint on the number of dimensions of the OpenCL data-set of the parent task, as the work division is always performed along the first dimension.
Second, the output data is just a portion of that of its parent, which is conveniently offset so the results are written adequately. This is represented by an independent CopyData object, holding the start address and size that the package will have to work on. As a result, coherence problems are avoided in the OmpSs directory. Apart from the aforementioned details, data transfer relies on the methods used by standard OmpSs.

To perform the correspondence between work descriptors and output data, an assumption is made: each OpenCL work-item will produce the result for the position of the output buffers indexed by its identifier. This may seem a strong requirement, but it is met by most kernels widely used in industry and research. Input data is replicated in the memory of all the devices, as there is no way to predict the parts that will be read by each of the work-items. This might appear as an important source of overhead, but the experimental results of Section 5 indicate otherwise, as good performance is obtained.

The creation of the children work descriptors is performed by a modified version of the duplicateWD function that does this extra work. This function is also responsible for making the OpenCL parameters of the divided work descriptors available to the Mercurium code, which will trigger the actual kernel launch.

Once the submission of the original work descriptor is completed, the *done* function is called. This is a Nanos++ function that is used to signal the completion of a work descriptor. It also waits for the completion of the children of the calling work descriptor. In this way, no task dependent on the divided one will be run until all the children resulting from the work distribution are completed, so the dependencies of the task graph are maintained.

#### 5. Evaluation

This Section begins with a description of the system and the benchmarks used in the experiments, as well as definitions of the metrics used in the evaluation.
Additionally, experimental results are shown and analyzed.

#### 5.1. System Set-up

The test machine has two processor chips, two GPUs and 16 GBs of DDR3 memory. The chips are Intel Xeon E5-2620, with six cores that can run two threads each at 2.0 GHz. They are connected via QPI, which allows OpenCL to detect them as a single device. Thus, any reference to the CPU considers both processors. The GPUs are NVIDIA Kepler K20m with 13 SIMD lanes, 2496 cores and 5 GBytes of VRAM each. These are connected to the system using independent PCIe 2.0 slots. The experiments build upon a baseline system which uses a single GPU but considers the static energy of all the devices, regardless of whether they are used, since they are considered accelerators which, if left unused, are a potential source of inefficiency.

Six applications have been chosen for the experimentation. Three of them, NBody, Krist and Perlin, are part of the OmpSs examples offered by BSC. The other three, Binomial, Sparse Matrix and Vector product (SpMV) and Rap, have been specifically adapted to OmpSs from existing OpenCL applications. The first four (NBody, Krist, Binomial and Perlin) are regular, meaning that all the work-groups represent a similar amount of work. On the contrary, SpMV and Rap are irregular, which implies that each work-group represents a different amount of work. The parameters associated to each of the load balancing algorithms have been set to maximize performance. The computing power for a device/application pair has been obtained as the relative performance of the device, with respect to that of the fastest device for the application.

Perlin implements an algorithm that generates noise pixels to improve the realism of moving graphics. Krist is used in crystallography to find the exact shape of a molecule using Röntgen diffraction on single crystals or powders. Rap is an implementation of the Resource Allocation Problem.
It has a certain pattern in its irregularity, because each successive package represents an amount of work larger than the previous one.

The evaluation of the performance of the benchmarks is done through their response time. This includes the time required for the communication between the host and the devices, comprising input data and result transfer, as well as the execution time of the kernel itself. The benchmarks are executed in two scenarios: the heterogeneous system, which takes advantage of the GPUs and the CPU, and the baseline, which only uses one GPU. Note that in both instances the same version of the program is run, as there is no need to modify the source or recompile, only to set environment variables.

Based on these response times, two metrics are analyzed. The first is the speedup for each benchmark when comparing the baseline and the heterogeneous system response times. Note that, for the employed benchmarks, the CPU is much less powerful than the GPUs, so the maximum achievable speedup using the three devices is not 3, but a fraction over 2 which depends on the computing power of the CPU for the application. The speedup for each application using a perfectly balanced work distribution is shown in Table 1. These values give an idea of the advantage of using the complete system. They were derived from the response time $T_i$ of each device as shown in Equation 1.

$$S_{max} = \frac{1}{\max_{i=1}^{n} \{T_i\}} \sum_{i=1}^{n} T_i \qquad (1)$$

Table 1: Maximum achievable speedup per application.

| Application  | NBody | Krist | Binomial | Perlin | SpMV | RAP  |
|--------------|-------|-------|----------|--------|------|------|
| Max. Speedup | 2.61  | 2.2   | 2.03     | 2.04   | 2.0  | 2.16 |

The second metric is the load balancing efficiency, obtained by dividing the reached speedup by the maximum speedup, shown in Table 1. The obtained value ranges between 0 and 1, giving an idea of the usage of the heterogeneous system.
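The two metrics can be written down compactly. The following is a minimal sketch rather than the evaluation scripts used for the paper: the function names are hypothetical, `max_speedup` applies Equation 1 literally to whatever per-device values it receives, and the numbers in the usage lines are invented for illustration. In particular, the per-device values are assumed here to be figures normalised to the fastest device, which is an assumption of this sketch.

```python
from typing import Mapping, Sequence

# Maximum achievable speedups as listed in Table 1.
MAX_SPEEDUP: Mapping[str, float] = {
    "NBody": 2.61, "Krist": 2.2, "Binomial": 2.03,
    "Perlin": 2.04, "SpMV": 2.0, "RAP": 2.16,
}


def max_speedup(t: Sequence[float]) -> float:
    """Equation 1: sum of the per-device values divided by the largest one."""
    return sum(t) / max(t)


def speedup(baseline_time: float, hetero_time: float) -> float:
    """Speedup of the heterogeneous system over the single-GPU baseline."""
    return baseline_time / hetero_time


def efficiency(achieved: float, maximum: float) -> float:
    """Load balancing efficiency: reached speedup over maximum speedup."""
    return achieved / maximum


# Hypothetical per-device values: two equal GPUs and a slower CPU.
s_max = max_speedup([1.0, 1.0, 0.61])

# Hypothetical response times (seconds) for NBody.
s = speedup(baseline_time=10.0, hetero_time=4.5)
print(f"S_max = {s_max:.2f}, S = {s:.2f}, "
      f"efficiency = {efficiency(s, MAX_SPEEDUP['NBody']):.2f}")
```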
Efficiencies close to 1 indicate the best usage of the system is being made. The measured values do not reach this ideal because of the communication and synchronization times between the host and the devices.

#### 5.2. Energy measurement

To evaluate the energy efficiency of the system it is necessary to take into account the power drawn by each device. Modern computing devices include Performance Management Units (PMUs) that allow applications to measure and control the power consumption. However, the power measured is associated to the device and not to the kernel or process in execution. Together with the fact that it is impractical to add measurement code to all the test applications, this led to the development of a power monitoring tool named Sauna. It takes a program as its parameter, and is able to configure the PMUs of the different devices in the system and run the program while performing periodic power measurements.

This tool required an unexpected amount of thought for its development. Since it had to monitor several PMUs, it had to adapt to the particularities of each one while giving consistent and homogeneous output data. For instance, each device has a different way to access its PMUs. Recent versions of the Linux kernel provide access to the Running Average Power Limit (RAPL) registers [13] of the Intel processors, which provide accumulative energy readings. In contrast, NVIDIA provides a library to access their PMUs, but this NVIDIA Management Library (NVML) [14] gives instant power measurements.

During the development of Sauna, it was observed that these energy or power readings have an impact on the kernel or process execution. Therefore, finding an adequate sampling period is an important task. To strike a balance between the overhead that was observed in the GPUs with high sampling rates and the accuracy loss that is inherent to lower ones, it was decided to use 45ms as the sampling period.

The performance and the energy consumption can be combined in a single metric representing the energy efficiency of the system. This paper uses the Energy Delay Product (EDP) [15] for this purpose.

#### 5.3. Parameter sensitivity

As explained in Section 3, the Static, Dynamic and HGuided algorithms require different parameters for their operation. These have to be provided by the programmer and are one of the key factors for a successful load balancing. However, determining the most adequate values for a workload is not trivial, as they may differ greatly between applications and device configurations. Consequently, the selection of parameters is often a work intensive process, usually based on experimentation.

Figure 6: Parameter sensitivity analysis. (a) Execution time with different computing powers for the Static algorithm. (b) Execution time with different numbers of packages for the Dynamic algorithm. (c) Execution time with different computing powers for the HGuided algorithm. (d) Execution time with different minimum package sizes for the HGuided algorithm.

The importance of adequately choosing the parameter values is illustrated in Figure 6, which displays the execution time for the applications when varying the parameters for each of the algorithms. Note that for the HGuided algorithm, when one of the parameters is modified, the other is set to the identified optimal value. As shown in the figure, for each of the parameters the applications show very different behaviors, ranging from minimal sensitivity to greatly degraded performance, sometimes revealing a clear relation with the parameter value, as is for example the case of Rap for the minimum package size. Moreover, the applications are not affected equally by the parameters.
For example, Binomial is highly sensitive to the computing power in the Static algorithm and moderately sensitive to almost insensitive to the rest of the parameters, while Rap behaves just the opposite: it is insensitive to the Static computing power and tremendously sensitive to the other parameters.

Considering these results, it is obvious that, in order to achieve an accurate load balancing, an experimental tuning of the algorithm parameters is often a must. The Auto-Tune algorithm frees the programmer from this burden by automatically adjusting the parameters, matching and even surpassing the performance of the HGuided.

#### 5.4. Experimental results

The experiments presented in this section have been developed with the optimal values for the parameters required by each algorithm, obtained in the previous section. This implies that the results for the Static, Dynamic and HGuided algorithms are the best that can be achieved, but require a great effort to tune the parameters.

Figure 7 shows the speedup obtained for each application calculated with respect to their execution time using the baseline system, as was explained in Section 5.1. This section also showed that the maximum achievable speedup depends on the application. These values, presented in Table 1, are shown in the graph as horizontal lines above each benchmark. Additionally, the geometric mean is shown, which includes both the four regular benchmarks and the two irregular ones.

Figure 7: Speedup per application.

From the results of the geometric mean it can be seen that the best result is obtained by the Auto-Tune algorithm, closely followed by the Static, the HGuided and finally the Dynamic. Furthermore, it should be emphasized that the Auto-Tune algorithm is also the easiest to use, because it does not require finding optimal values for any parameter.

A detailed analysis of the speedups reveals that the Static algorithm is the best option for regular applications.
This is because they require no adaptivity, so they benefit from the minimum overhead introduced by the Static algorithm. However, except in the case of Perlin, which is very sensitive to overheads as can be seen in the results for all the algorithms but the Static, the Auto-Tune algorithm achieves good results with less configuration effort. The other two algorithms achieve good results, but suffer from a problem that reduces performance. If one of the last packages is assigned to the slowest device, it is likely to delay the execution of the whole application. This problem could be avoided by increasing the number of packages, but in that case overheads come into play, which also degrade performance. The HGuided algorithm, due to its very nature, partially solves this issue.

For irregular applications, the best results are obtained by the Auto-Tune and HGuided algorithms. Their adaptive nature favours load balancing in these applications, where the workload of each work-group is completely unknown and unpredictable. On the other hand, the reduction in synchronization points reduces the runtime overhead, which is inherent to this type of algorithm. This is the reason why the HGuided and Auto-Tune algorithms deliver equal or better performance than the simple Dynamic algorithm, as they introduce less overhead. Finally, the Static algorithm fails to balance the load because it cannot cope with the unpredictability of these applications.

Figure 8: Efficiency of the heterogeneous system.

The load balancing efficiency gives an idea of how well a load is balanced. A value of one represents that all the devices have been working all the time, thus achieving the maximum speedup. In Figure 8 the geometric mean efficiencies show that the best result is achieved by Auto-Tune with an efficiency around 0.85.
In addition, there is at least one load balancing algorithm for every application that achieves an efficiency over 0.9 or even as high as 0.98, reached by Binomial and Perlin with the Static. This is true even for the irregular applications, in which obtaining a balanced work distribution is significantly harder.

Nowadays, performance is not the only figure of merit used to evaluate computing systems. Their energy consumption and efficiency are also very important. Figure 9 gives an idea of the energy saving achieved by the whole heterogeneous system with respect to the baseline, in which only one GPU is used while the other devices are idle and still consuming. This would be the case of a current HPC system, in which failing to use all the available resources may represent an energy waste. Therefore, the figure shows for each benchmark the energy consumption of each algorithm normalized to the baseline consumption, meaning that less is better.

Figure 9: Normalised energy consumption per application.

The values of the geometric mean indicate that the algorithms that consume less energy are Static and Auto-Tune, with a saving of almost 20% compared to the baseline. Regarding the individual benchmarks, it is always possible to find an algorithm where the normalised energy is less than one. Moreover, all the algorithms reduce consumption, despite using the whole system. The use of more devices necessarily increases the instantaneous power at any time. But, since the execution time is reduced, the total energy consumption is also less. Furthermore, since idle devices still consume energy, making all devices contribute work is beneficial.

The analysis of the algorithms shows a strong correlation between performance and energy saving. Consequently, the best algorithm for regular applications is the Static, with an average saving of 26.5%. However, for irregular applications, it wastes 7.4% of energy. On the other hand, the Auto-Tune gives an average energy saving of 16%.

Regarding the results of concrete benchmarks, it is interesting to comment on Krist. The highest energy saving in this benchmark is provided by Auto-Tune, although it is not the best in performance. There are only two particular benchmarks where the use of the whole system employs more energy than the baseline. These are Perlin with Dynamic, HGuided and Auto-Tune, and SpMV with Static. This is because, in these cases, the gain in performance is too small and cannot compensate for the increased power consumption involved in using the complete system.

Another interesting metric is the energy efficiency, which combines performance with consumption. With the dual goal of low energy and fast execution in mind, the *Energy Delay Product (EDP)* is the product of the consumed energy and the execution time of the application. Figure 10 shows the EDP of the algorithms normalised to the EDP of the baseline.

Figure 10: Normalised EDP per application.

Since the EDP is a combination of the two above metrics, the previous results are further corroborated. Therefore Auto-Tune also achieves the best energy efficiency results on geometric mean, followed by Static, HGuided and Dynamic. Attending to the individual algorithms, their relative advantages are magnified. Although the Static algorithm on regular applications shows a significant reduction of the EDP of 65%, the same is not true on irregular ones, where it is reduced by only 12.4%. In contrast, the Auto-Tune is more reliable, as it achieves a similar reduction on both kinds of applications: 48% on regular and 57% on irregular.

#### 6. Related Work

Heterogeneity has taken computing platforms by storm, ranging from HPC systems to hand-held devices. The reason for this is their outstanding performance and energy efficiency. However, making the most of heterogeneous systems also poses new challenges.
The extra computing power also involves new decisions on how to use all the available hardware, which currently have to be made by the programmer without much help from the programming frameworks and runtimes. The keys to make programming easy again are system abstraction, so the heterogeneous devices are handled transparently, and load balancing, so the resources are adequately used. Nevertheless, related as they are, these problems are often addressed separately.

The strategies for co-execution presented in this paper are built upon the system abstractions already offered by OmpSs, and focus particularly on the load balancing problem. However, some related system abstraction research works are worth mentioning. Such is the case of DistCL [16], which is a framework that enables the distribution of a kernel over a GPU cluster by using user defined meta-functions. These are callbacks that represent the memory access pattern of each device, so the programmer can instruct the framework on how to distribute the data and reduce data transfers. In [17], the GPUs of the system are abstracted and the addresses accessed by each device are computed using a sampling run on the host of some select work-items. The authors of [18] attain abstraction via kernel transformations and a static kernel analysis that determines which of the data need to be replicated or can be split.

Regarding the load balancing problem alone, there are two main approaches found in the literature: *static* and *dynamic*, which in turn can be adaptive or not. Regarding static methods, Lee *et al.* [19] propose the automatic modification of OpenCL code that executes on a single device, so the load is balanced among several. De la Lama *et al.* [20] propose a library that implements static load balancing by encapsulating standard OpenCL calls. The work presented in [21] uses machine learning techniques to come up with an offline model that predicts an ideal static load partitioning.
However, this model does not consider irregularity. Similarly, Zhong et al. [22] use performance models to identify an ideal static work distribution. In [18] the focus is on the distribution of a single kernel execution to the available devices via code modifications. Qilin [23] is a training-based work distribution method that proposes to balance the load using a database containing execution-time data for all the programs the system has executed and a linear regression model. This technique is only useful in systems that run the same applications frequently.

In the dynamic approach, [24, 25] propose different techniques and runtimes. However, these focus on task distribution and not on the co-execution of a single data parallel kernel. The work of [26] deals with the dynamic distribution of TBB parallel for loops, adapting block size at each step to improve balancing. FluidicCL [5] does focus on co-execution, but for systems with a CPU and a GPU. SnuCL [4] also tackles data parallelism, but is mostly centered on the distribution of the load among different nodes using an OpenCL-like library.

Kaleem et al. in [7] and Boyer et al. in [6] propose adaptive methods that use the execution time of the first packages to distribute the remaining load. However, they focus on a CPU/GPU scenario and do not scale well to configurations with more devices. Navarro et al. [26] propose a dynamic, adaptive algorithm for TBB that uses a fixed package size for the GPU and a variable one for the CPU to try to achieve good balancing. This work was extended in [27], proposing an adaptive package size for the GPU too. This is also based on using small initial packages to identify a package size that obtains near optimal performance.

In the traditional research area of dynamic loop scheduling, [28] presents Factoring, an algorithm with variable chunk sizes that addresses the problem of irregularity, referred to as iteration variance. However, it does not consider heterogeneity.
HDSS [29] is a more recent work that proposes a load balancing algorithm that dynamically learns the computational power of each processor during an adaptive phase and then schedules the remainder of the workload using a weighted self-scheduling scheme during the completion phase. However, this algorithm assumes that the packages launched in the initial phase are representative of the whole load, which might not be true for irregular kernels. Besides, package size decreases linearly during the completion phase, which may produce unnecessary overheads as substantiated in this paper.

Scogland et al. [30] propose several work distribution schemes that fit different accelerated OpenMP computing patterns. However, they do not propose a single solution to the load balancing problem. The library presented in [31] also implements several load balancing algorithms and proposes the HGuided, which adapts to irregularity and considers heterogeneity. This library is also used in Xeon Phi based systems in [32]. However, it requires certain parameters from the programmer that may not be easy to obtain and uses linearly decreasing packages that might incur overheads.

Some papers propose algorithms to distribute the workload between CPU and GPU taking performance and power into account. For instance, GreenGPU dynamically distributes work to GPU and CPU, minimizing the energy wasted on idling and waiting for the slower device [33]. To maximize energy savings while allowing marginal performance degradation, it dynamically throttles the frequencies of CPU, GPU and memory, based on their utilizations. Wang and Ren [34] propose a power-efficient load distribution method for single applications on CPU-GPU systems. The method coordinates inter-processor work distribution and frequency scaling to minimize energy consumption under a length constraint. SPARTA is a throughput-aware runtime task allocator for Heterogeneous Many-Core platforms [35].
It analyzes tasks at runtime and uses the obtained information to schedule the next tasks maximizing energy-efficiency.

With respect to the problem of transparently managing a heterogeneous system, the authors of [36] propose a framework for OpenCL that enables the transparent use of distributed GPUs. In this same vein, Cabezas et al. [3] present an interesting architecture-supported take on efficient, transparent data distribution among several GPUs. Nevertheless, these works overlook load balancing, which is essential when trying to make the most of several heterogeneous devices. Maestro [37] implements concepts related to the abstraction of the system, but the load balancing algorithm it proposes requires training.

You [38], Zhong [8] and Aji [39] do address load balancing while abstracting the underlying system and data movement. Nevertheless, their focus is on task-parallelism instead of on the co-execution of a single data-parallel kernel. Kim et al. [17] approach the problem by implementing an OpenCL framework that provides the programmer with a view of a single device by transparently managing the memory of the devices. Their approach is based on a Static load balancing strategy, so it cannot adapt to irregularity. Besides, they only consider systems with several identical GPUs, lacking the adaptability that OmpSs offers.

There are also some contributions that focus on scheduling and load balancing for OmpSs tasks. For instance, the scheduler presented in [40] is closer to the idea of co-execution. It holds several implementations of a task, targeted for different devices, that will be run iteratively. The scheduler stores the execution time of each implementation, so it can take load balancing decisions on which implementation is best to use. However, the programmer is responsible for mapping the computation onto several iterative tasks, which may not be an easy and natural approach for the application at hand.

#### 7. Conclusions and Future Work

This paper presents a new scheduler for the OmpSs programming model that allows a single OpenCL kernel instance to be efficiently co-executed using all the devices in a heterogeneous system. The scheduler has been conceived so that it is fully transparent to the programmer, who only needs to select the algorithm and set its parameters through environment variables.

Similarly to OpenMP, the scheduler provides different load balancing algorithms. These include the classic Static and Dynamic algorithms, as well as a version of the Guided, called HGuided, that takes into account the heterogeneity of the system. Achieving good results with these algorithms requires the tuning of several parameters. Therefore, this paper also presents a novel load balancing algorithm called Auto-Tune, which is capable of automatically determining suitable values for internal parameters throughout the execution.

Judging by the results of all the experiments presented in this paper, two conclusions can be reached. First, the use of kernel co-execution on modern heterogeneous systems is very important, as the executed benchmarks showed a significant improvement in performance, energy consumption and efficiency. Second, although there are some particular cases in which the Static algorithm outperforms the Auto-Tune algorithm, the latter achieves excellent results without a tedious and time-consuming phase of parameter optimization, which would be necessary for each new benchmark or system.

According to our experimental results, Auto-Tune is capable of taking advantage of the whole heterogeneous system, with an average efficiency of 0.85. Since all the compute devices of the machine are used, the execution time is reduced and consequently, an energy saving of 16% has been observed. The combination of these two improvements gives a reduction of the EDP close to 50%.
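How the two improvements combine can be checked with simple arithmetic, since the EDP is the product of energy and execution time. The sketch below is only illustrative: the function name is hypothetical, the 16% energy saving is the figure reported above, and the normalised execution time of 0.6 is an invented value chosen for the example, not a measurement from the paper.

```python
def normalised_edp(norm_energy: float, norm_time: float) -> float:
    """EDP relative to the baseline: (E / E_base) * (T / T_base)."""
    return norm_energy * norm_time


norm_energy = 1.0 - 0.16   # 16% energy saving with respect to the baseline
norm_time = 0.6            # hypothetical normalised execution time

edp = normalised_edp(norm_energy, norm_time)
print(f"normalised EDP = {edp:.3f}, reduction = {(1.0 - edp) * 100:.1f}%")
```

Under these assumed inputs the normalised EDP falls to roughly one half of the baseline, which illustrates why a moderate energy saving combined with a shorter execution time yields a much larger EDP reduction.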
The future of this exposion will see compatibility with new devices, like Intel Xeon Phi, FPG as a integrated GPUs. From the OmpSs perspective, a modification of the pragma pecification would allow the programmer to select different algorithms or parameters for different kernels of the same application. It would be integrated to extend the evaluation to different systems and device configuration. # Acknow. gm/nts This work has been supported by the University of Cantabria with grant CND-2014-18166, the Generalitat de Catalunya under grant 2014-SGR-1051, the Spanish Ministry of Economy, Industry and Competitiveness under contracts TIN2016-76635-C2-2-R (AEI/FEDER, UE) and TIN2015-65316-P. The Spanish Government through the Programa Severo Ochoa (SEV-2015-0493). The European Research Council under grant agreement No 321253 European Com- munity's Seventh Framework Programme [FP7/2007-2013] and Horizon 2020 under the Mont-Blanc Projects, grant agreement No 288777 of 10 10 2 and 671697 and the European HiPEAC Network. #### References 760 765 775 - [1] J. Nickolls, I. Buck, M. Garland, K. Skadron, Scale Jarallel programming with cuda, Queue 6 (2) (2008) 40–53. - [2] J. E. Stone, D. Gohara, G. Shi, OpenCL: A parallel programming standard for heterogeneous computing systems, IEEL Des. Test 12 (3) (2010) 66–73. - [3] J. Cabezas, I. Gelado, J. E. Stone, N. Invarro, D. B. Kirk, W. m. Hwu, Runtime and architecture support for efficient data exchange in multi-accelerator applications, IEEL Transactions on Parallel and Distributed Systems 26 (5) (2015) 140<sup>-1418</sup>. - [4] J. Kim, S. Seo, J. Lee, J. Nah, G. Jo, J. Lee, SnuCL: An opencl framework for heterogeneous C U/GP I clusters, in: Proceedings of the ACM ICS, ACM, New York, NY, CAA 2012, pp. 341–352. - P. Pandit, R. Gorndarajan, Fluidic kernels: Cooperative execution of opencl programs on multiple heterogeneous devices, in: Proceedings of Annual IFFE/ACM CGO, ACM, 2014, p. 273:283. - [6] M. Boyer, <sup>V</sup>. 
Skadron, S. Che, N. Jayasena, Load Balancing in a Changing Wor'd: Γ ealing with Heterogeneity and Performance Variability, in: Proc. of the 'CN. Int. Conference on Computing Frontiers, ACM, New York, NY, U. A, 2013, pp. 21:1–21:10. - [1] R. Kaleem, R. Barik, T. Shpeisman, B. T. Lewis, C. Hu, K. Pingali, Adaptive heterogeneous scheduling for integrated GPUs, in: Proc. of PACT, ACM, New York, NY, USA, 2014, pp. 151–162. - [8] J. Zhong, B. He, Kernelet: High-throughput GPU kernel executions with dynamic slicing and scheduling, CoRR abs/1303.5164. - [9] A. Duran, E. Ayguadé, R. M. Badia, J. Labarta, L. Martinel, X. M. rtorell, J. Planas, Ompss: A proposal for programming heter geneous multi-core architectures, Parallel Processing Letters 21 (02) (2011, 173–193. - <sup>785</sup> [10] F. Sainz, S. Mateo, V. Beltran, J. L. Bosque, X. Martor ll, E. Ayguad', Leveraging ompss to exploit hardware accelerates, in. L. Symp. on Computer Architecture and High Performance Computing 2014, pp. 112–119. - [11] Mercurium C/C++/Fortran source-to-source co. piler, last accessed April 2018. - URL https://github.com/bsc-pm/mcxx 795 800 - [12] Nanos++ Runtime Library, last consed April 2018. URL https://github.com/bsc-ph/nanox - [13] E. Rotem, A. Naveh, D. Rajwan, A. Ananthakrishnan, E. Weissmann, Power management architecture of the 2nd generation Intel Core microarchitecture, formerly command Sandy Bridge, in: IEEE Int. HotChips Symp. on High-Perf Chips, 011. - [14] NVIDIA, Management Library (NVML), last accessed November 2016. URL https://deve.orgr.nvidia.com/nvidia-management-library-nvml - [15] E. Castillo, C. Can, rero, A. Borrego, J. L. Bosque, Financial applications on multi-cpv and multi-gpu architectures, J. Supercomput. 71 (2) (2015) 729–759. - [16] T. D. S. Gurfinkel, J. Anderson, N. E. 
**Borja Pérez** graduated in Computer Science from the University of Cantabria in 2014 and is currently pursuing a PhD degree in Technology and Science. He is a Pre-PhD researcher in the Dept. of Computer Engineering and Electronics of the University of Cantabria. His research interests include heterogeneous systems, from both the architecture and the programming viewpoints, and high performance computing.

**Esteban Stafford** received the M.Sc. degree in Telecommunication Engineering from the University of Cantabria in 2001 and obtained the PhD degree in Computer Science in 2015. He is currently a part-time professor in the Dept. of Computer Engineering and Electronics of the University of Cantabria. His research interests include computer architecture, parallel computers and interconnection networks.

**Jose Luis Bosque** graduated in Computer Science from Universidad Politécnica de Madrid in 1994. He received the PhD degree in Computer Science and Engineering in 2003 and the Extraordinary Ph.D. Award from the same University. He joined the Universidad de Cantabria in 2006, where he is currently Associate Professor in the Department of Computer and Electronics. His research interests include high performance computing, heterogeneous systems and interconnection networks.

**Ramón Beivide** received the B.Sc. and M.Sc. degrees in Computer Science from the Universidad Autónoma de Barcelona (UAB) in 1981 and 1982, and the Ph.D. degree, also in Computer Science, from the Universidad Politécnica de Cataluña (UPC) in 1985. He joined the Universidad de Cantabria in 1991, where he is currently a Professor in the Dept. of Computer Engineering and Electronics. His research interests include computer architecture, interconnection networks, coding theory and graph theory.

**Sergi Mateo** obtained his bachelor's degree in Computer Science from the Universitat Politècnica de Catalunya in 2012.
Since 2011, he has been working as a compiler engineer in the Programming Models group at Barcelona Supercomputing Center. In 2014 he joined the OpenMP Language Committee. His areas of interest are compilers, domain specific languages, parallel programming models and performance analysis.

**Xavier Teruel** received the Computer Engineering degree and the Master on Computer Architecture, Network and Systems at Technical University of Catalonia (UPC) in 2006 and 2008, respectively. Since 2006 Xavier has been working as a researcher within the group of Parallel Programming Models in the Computer Science department at the Barcelona Supercomputing Center (BSC). His research interests include the areas of operating systems, programming languages, compilers, runtime systems and applications for high-performance computing and multiprocessor systems. He has published several papers in international workshops, conferences and journals on these topics.

**Xavier Martorell** received the M.S. and Ph.D. degrees in Computer Science from Universitat Politècnica de Catalunya (UPC) in 1991 and 1999, respectively. He has been an associate professor in the Computer Architecture Department at UPC since 2001, teaching on operating systems. His research interests cover the areas of parallelism, runtime systems, compilers and applications for high-performance multiprocessor systems. Since 2005 he has been the manager of the team working on Parallel Programming Models at the Barcelona Supercomputing Center. He has participated in several European projects dealing with parallel environments (Nanos, Intone, POP, SARC, ACOTES). He is currently participating in the European HiPEAC, ENCORE, Montblanc and DEEP projects.

**Eduard Ayguadé** received the Engineering degree in Telecommunications in 1986 and the Ph.D. degree in Computer Science in 1989, both from the Universitat Politècnica de Catalunya (UPC), Spain. Since 1987 he has been lecturing on computer organization and architecture and optimizing compilers. Since 1997 he has been full professor of the Computer Architecture Department at UPC. He is currently associate director for research on Computer Sciences at the Barcelona Supercomputing Center (BSC). His research interests cover the areas of multicore architectures, and programming models and compilers for high-performance architectures. He has published more than 300 publications on these topics and participated in several research projects with other universities and industries, in the framework of the European Union programmes or in direct collaboration with technology leading companies.

© 2019. This manuscript version is made available under the CC-BY-NC-ND 4.0 license