Système unifié de transformation de code et d'éxécution pour un passage aux architectures multi-coeurs hétérogènes by LI, Pei
Unified system of code transformation and execution for
heterogeneous multi-core architectures.
Pei Li
To cite this version:
Pei Li. Unified system of code transformation and execution for heterogeneous multi-core
architectures.. Software Engineering [cs.SE]. Universite´ de Bordeaux, 2015. English. <NNT :
2015BORD0441>. <tel-01342119>
HAL Id: tel-01342119
https://tel.archives-ouvertes.fr/tel-01342119
Submitted on 5 Jul 2016
HAL is a multi-disciplinary open access
archive for the deposit and dissemination of sci-
entific research documents, whether they are pub-
lished or not. The documents may come from
teaching and research institutions in France or
abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est
destine´e au de´poˆt et a` la diffusion de documents
scientifiques de niveau recherche, publie´s ou non,
e´manant des e´tablissements d’enseignement et de
recherche franc¸ais ou e´trangers, des laboratoires
publics ou prive´s.
THÈSE PRÉSENTÉE 
POUR OBTENIR LE GRADE DE
DOCTEUR DE
L’UNIVERSITÉ DE BORDEAUX
École Doctorale de Mathématiques et Informatique
SPÉCIALITÉ: Informatique 
Par Pei Li
Système unifié de transformation de code et d'exécution 
pour un passage aux architectures multi-coeurs 
hétérogènes
Sous la direction de : Raymond Namyst
Encadrante de thèse : Elisabeth Brunet
Soutenue le 17 Décembre 2015
Membres du jury :
M. ROMAN Jean           Professeur Institut Polytechnique de Bordeaux  Président
M. MEHAUT Jean-François Professeur Université de Grenoble 1                  Rapporteur
M. SENS Pierre  Professeur Université Pierre et Marie Curie        Rapporteur
M. CARRIBAULT Patrick Chercheur  CEA        Examinateur
M. NAMYST Raymond Professeur Université de Bordeaux 1     Directeur de thèse
Mme. BRUNET Elisabeth Maître de conférence Institut Télécom SudParis   Encadrante de thèse 

Remerciement
Cette the`se est le fruit de travail qui appartient non seulement a` moi, mais aussi a` tous
les personnes qui m’a supporte´ mon travail et ma vie pendant ces 3 ans de the`se. Je
profite de cette occasion ici pour exprimer ma since`re gratitude.
Je tiens a` remercier en tout premier lieu mon encadrant Elisabeth Brunet. Je vous
remercie de m’avoir propose´ ce sujet de the`se, m’avoir fait confiance et m’avoir accueilli
au sein de votre e´quipe. Grace a` vous, j’ai pu entrer dans un nouveau monde ou` je
n’avais jamais explore´. Quand j’ai rencontre´ des proble`mes sur mes recherches, vous e´tiez
toujours la premie`re personne qui m’a encourage´ et m’a donne´ la suggestion. Je remercie
e´galement mon directeur de the`se Raymond Namyst. Vous m’avez guide´ la direction de
recherche pendant ma the`se. Sans votre conseil, tous ce travail n’aurait pas e´te´ possible!
Je vous remercie aussi de m’avoir accueilli au sein de l’e´quipe Runtime d’INRIA Bordeaux,
j’ai donc pu profiter la ressource de recherche et les e´quipements de expe´rimentation de
tre`s haute qualite´.
Je remercie chaleureusement les membres de mon jury de the`se. Je remercie tout
d’abord mes rapporteurs Jean-Franc¸ois Mehaut et Pierre Sens pour avoir pris le temps
d’e´valuer mon travail. Je remercie Jean Roman et Patrick Carribault pour avoir accepte´
mon invitation et participer au jury.
Je remercie tous les membres et les doctorants de de´partement informatique de TELE-
COM SudParis. Un grand merci a` Franc¸ois Trahay, Gae¨l Thomas et Christian Parrot
qui m’ont beaucoup aide´ et m’ont beaucoup inspire´ pendant la pre´paration de the`se. Un
e´norme merci a` Brigitte Houassine qui m’a aide´ sur tous les de´marches administratives.
Je remercie Alda Gancarski, Chantal Taconet, Denis Conan, Sophie Chabridon, Olivier
Berger, Christian Bac, Amel Mammar et tous les autres que je n’ai pas cite´s ici. Je remer-
cie tous les membres de l’e´quipe Runtime de Inria Bordeaux, particulie`rement, merci a`
Denis Barthou, Samuel Thibault, Marie-Christine Counilh et Sylvain Henry pour m’avoir
enseigne´ les reconnaissances sur StarPU runtime.
Je remercie mes anciens colle`gues Rachid Habel, Alain Muller, Soufiane Baghadadi
pour avoir partage´ leurs connainssances sur la compilation. Un e´norme merci a` tous mes
co-bureaux qui m’ont supporte´ pendant ces trois ans. Je e´galement remercie Fabienne
Je´ze´quel et Mounira Bachir qui m’ont encadre´ mon stage de Master et m’ont propose´ a`
mon encadrant actuel.
Enfin, Je remercie les membres de ma famille pour leur aide et soutien. J’ai commence´
mes e´tudes en France depuis le 3 septembre 2009. Pendant ces 6 ans, j’ai eu tre`s peu de
l’occasion de leur rendre visite, mais vous m’avez toujour supporte´ et m’avez encourage´.
Vous eˆtes toujour les personnes plus importantes dans ma vie.

Re´sume´
Re´sume´ en franc¸ais :
Les travaux de recherche pre´sente´s dans cette the`se se positionnent dans le domaine du
calcul haute performance ; plus particulie`rement dans la de´mocratisation de l’exploitation
efficace des plates-formes de calcul he´te´roge`nes. En effet, les exigences de performance
des applications de simulation scientifique me`nent a` une queˆte perpe´tuelle de puissance
de calcul. Actuellement, le paysage architectural des plates-forme est tourne´ vers l’exploi-
tation de co-processeurs, tels les GPU et les Xeon Phi, mate´riel satellite du processeur
principal aux performances surpuissantes sur des cas d’utilisation idoines. Depuis 2007,
les GPU (pour Graphical Processing Unit) inte`grent des milliers de coeurs au design peu
sophistique´ capables de traiter efficacement simultane´ment des milliers de taˆches. Plus
re´cemment est apparu le Intel Xeon Phi, co-processeur qualifie´ de many-core car il pos-
se`de plus de coeurs, plus de threads et des unite´s d’exe´cution vectorielles plus larges que
le processeur Intel Xeon, son homologue standard. Les coeurs du Intel Xeon Phi sont
certes moins rapides si on les conside`re individuellement mais la performance cumule´e
est bien supe´rieure si l’ensemble des ressources est correctement mobilise´e a` l’exe´cution
d’une application paralle`le. Le duo de teˆte Tianhe-2/Titan du Top500 de ces deux der-
nie`res anne´es, classement recensant les 500 machines les plus puissantes, atteste cette
tendance : Tianhe-2 est un super-calculateur he´te´roge`ne compose´ de 32.000 processeurs
Intel Xeon et de 48.000 co-processeurs de type Xeon Phi, tandis que Titan voit ses 18688
AMD processeurs seconde´s par 18688 Nvidia Telsa GPU.
Au niveau applicatif, l’exploitation conjointe de ces ressources de calcul aux profils
he´te´roge`nes est un re´el de´fi informatique que ce soit en terme de portabilite´ logicielle, aux
vues de la diversite´ de mode`les de programmation de chaque mate´riel, ou de portabilite´
de performances avec notamment les couˆts de de´port de calcul sur de telles ressources.
La portabilite´ logicielle pourrait passer par l’utilisation de standards de programmation
tels OpenCL ou OpenACC, qui permettent d’exploiter conjointement l’ensemble des res-
sources d’une machine, a` savoir les processeurs principaux et leurs co-processeurs. Ce-
pendant, leur mode`le de programmation est statique. C’est a` l’utilisateur de de´crire quel
calcul exe´cuter sur quelle ressource. L’e´quilibrage de charge entre les ressources he´te´ro-
ge`nes est donc laisse´ a` la charge du programmeur. Ainsi, meˆme si la portabilite´ logicielle
est assure´e d’une plate-forme a` l’autre, le changement du nombre de ressources ou de leur
capacite´ de calcul impliquent le re-de´veloppement de l’application. Il existe des environ-
nements d’exe´cution qui interfacent les diffe´rents co-processeurs et prennent en charge la
Output
STEPOCL configuration
file
Inputs
STEPOCL
Code Generator
Full multi-device 
OpenCL application
CPU Xeon Phi
STEPOCL runtime
GPU ...
OpenCL compute kernels
Figure 1 – Vue globale de STEPOCL.
dimension e´quilibrage de charge tels StarPU [6].
Cependant, les outils existants ne reconside`rent pas la granularite´ des taˆches de cal-
cul de´finies par le programmeur alors que les donne´es a` traiter sont toujours massives
et qu’il est de plus en plus fre´quent d’avoir plusieurs co-processeurs au sein d’une meˆme
machine. Dans ce contexte, il devient inte´ressant de conside´rer la distribution d’un cal-
cul sur plusieurs ressources de calcul he´te´roge`nes, d’autant que les calculs adapte´s a` une
exe´cution sur co-processeur sont ge´ne´ralement massivement paralle`les. Plusieurs aspects
non-fonctionnels sont a` prendre en charge par le programmeur comme de´terminer le par-
titionnement de charge de travail suivant les capacite´s de calcul des ressources devant
participer au traitement, le maintien de la cohe´rence des donne´es, l’e´change de donne´es
interme´diaires, etc. Leur mise en ouvre de manie`re portable et efficace est inde´niablement
fastidieuse et sujette a` erreur pour un programmeur meˆme expe´rimente´. C’est pourquoi
l’objectif de cette the`se est de proposer une solution de programmation paralle`le he´te´ro-
ge`ne qui permet de faciliter le processus de codage et garantir la qualite´ du code. Nous
proposons ici un nouvel outil STEPOCL qui comporte deux volets, comme illustre´ par la
Figure 3.6 : un ge´ne´rateur de code conc¸u pour ge´ne´rer une application OpenCL comple`te
capable d’exploiter des architectures he´te´roge`nes a` partir de noyaux de calcul et de leur
description basique e´crite graˆce a` un DSL (Domain Specific Language) ; un environne-
ment d’exe´cution capable de ge´rer dynamiquement des proble`mes comme l’e´quilibre de
charge, la distribution de noyaux de calcul sur plusieurs co-processeurs, la gestion des
communications et synchronisation, maintien de la cohe´rence de donne´es, etc.
Le ge´ne´rateur de code de STEPOCL prend en entre´e des noyaux de calcul e´crits en
OpenCL et un fichier de configuration e´crit graˆce a` un DSL base´ sur XML afin de ge´ne´rer
une application OpenCL comple`te faisant appel au support d’exe´cution de STEPOCL.
Le DSL de STEPOCL permet de de´crire tous les aspects non-fonctionnels des calculs
a` re´aliser comme la fac¸on dont doivent eˆtre subdivise´es les donne´es (par exemple, si
elles doivent l’eˆtre et si oui, suivant quels axes), leur taille, le flot de controˆle attendu
entre les diffe´rents noyaux de calcul, etc. dans l’application a` ge´ne´rer. Le code source
ge´ne´re´ est un application OpenCL capable d’exploiter plusieurs device OpenCL, code
que le programmeur est ensuite libre de modifier. Apre`s une phase d’initialisation de
l’environnement OpenCL, le code commence par la de´tection des ressources de calcul
effectives auxquelles sont associe´es un facteur de performance, facteur positionne´ graˆce a`
un module d’e´chantillonnage hors ligne. Vient ensuite la phase de de´ploiement des donne´es
et des calculs sur les ressources se´lectionne´es. La distribution des calculs est de´termine´e
a` partir de la taille des donne´es a` traiter, de la forme de partitionnement des calculs
donne´e dans le fichier de configuration et des capacite´s de calcul de chacune. Les donne´es
ne´cessaires au calcul de chaque partition sont identifie´es graˆce a` une analyse polye´drique
des acce`s aux donne´es assure´e par le compilateur PIPS avant d’eˆtre copie´es dans leurs
me´moires locales respectives. Les donne´es sont par la suite maintenus au maximum en
me´moire locale. Seules les donne´es frontie`res sont communique´es aux ressources qui en ont
besoin. Ces donne´es frontie`res sont e´galement identifie´es graˆce aux analyses de donne´es
produites par PIPS et sont transfe´re´es graˆce aux me´canismes de copies d’OpenCL via
la me´moire globale. Une fois les calculs acheve´s, les donne´es sont collecte´es en me´moire
globale afin de produire le re´sultat. Au dela` du calcul des re´gions suivant le nombre effectif
de participants au calcul et de l’e´change des donne´es au cours de l’exe´cution, le support
d’exe´cution de STEPOCL s’occupe d’e´quilibrer dynamiquement la charge de travail des
applications ite´rant sur un meˆme noyau de calcul en monitorant le temps d’exe´cution de
chaque ite´ration sur chaque ressource de calcul.
STEPOCL a donne´ lieu a` deux publications d’inte´reˆt qui ont mis en relief bien des
pistes d’ame´lioration et perspectives a` long terme. En premier lieu, il s’agira de pousser
la simplification a` l’extreˆme du fichier de configuration en utilisant des outils d’analyse
de flot de controˆle a` la compilation et des outils de profiling d’exe´cution permettant
d’affiner la granularite´ des noyaux de calcul. A` plus long terme, le contexte d’utilisation
de STEPOCL pourra eˆtre e´largi afin de cibler des objectifs applicatifs diffe´rents, comme
la re´duction d’e´nergie, avec l’utilisation d’autres strate´gies d’ordonnancement et le passage
a` des architectures diffe´rentes.
STEPOCL a e´te´ e´value´ sur trois cas d’application classiques : un stencil 2D 5 point,
une multiplication de matrices et un proble`me a` N corps, chacun pre´sentant des noyaux de
calcul adapte´s a` une exe´cution sur acce´le´rateurs de type GPU ou Xeon Phi car fortement
paralle`les. En terme de re´partition des donne´es, le stencil et la multiplication de matrice
sont des cas assez similaires dans le sens ou` les donne´es vont pouvoir eˆtre distribue´es
sur les ressources ; alors que dans le cas du N-Body, la re´plication de la structure de
donne´es stockant les particules est ne´cessaire. Pour chacun de ces cas test, diffe´rents points
d’e´valuation ont e´te´ conside´re´s. En premier lieu, est compare´ le volume du code ge´ne´re´ par
STEPOCL a` partir d’un noyau de calcul donne´ et du fichier de configuration associe´ afin
de produire une application OpenCL multi-device comple`te en comparaison de celui de
l’application mono-device de re´fe´rence tire´e de benchmarks de la litte´rature. Le fichier de
configuration e´tant massivement plus court et plus simple a` e´crire que toute la machinerie
ne´cessaire a` l’utilisation d’un environnement OpenCL, STEPOCL simplifie le cycle de
de´veloppement d’une application OpenCL comple`te en apportant en supple´ment une
dimension multi-device. STEPOCL est ensuite e´value´ suivant des crite`res de performance
en temps d’exe´cution. Les e´valuations ont e´te´ mene´es sur deux plates-formes mate´rielles
he´te´roge`nes diffe´rentes. La premie`re Hannibal allie la force de calcul de trois GPU de
type NVidia Quadro FX5800 a` un biprocesseur quad-coeur Intel Xeon X5550 ; tandis
que sur la seconde, le surpuissant processeur principal, un bi-processeur Intel Xeon E5-
2670 comptant 2 fois 10 coeurs de calcul, est e´quipe´ de deux acce´le´rateurs de type Intel
Xeon Phi offrant chacun 61 coeurs. Chacune de ces ressources est exploite´e au travers de
l’imple´mentation OpenCL fournie par Intel a` l’exception des GPU qui est adresse´ graˆce au
support OpenCL de´die´ de NVidia. Tout d’abord, l’application produite par STEPOCL
est compare´e a` sa version de re´fe´rence dont le noyau de calcul a e´te´ extrait et utilise´ pour
la ge´ne´ration du code. Pour les trois cas tests, les performances du code OpenCL ge´ne´re´
par STEPOCL s’exe´cutant sur un seul et meˆme acce´le´rateur sont comparables a` celle
de la version originale. Ainsi, STEPOCL permet de produire un code OpenCL complet
aux performances satisfaisantes ayant le potentiel de distribuer les calculs sur plusieurs
ressources et ce, avec un effort de programmation moindre. Les codes ge´ne´re´s sont ensuite
exe´cute´s sur les ressources he´te´roge`nes des plates-formes de test. Chaque application est
e´value´e sur des configurations ou` seul le processeur principal est active´, puis seul un co-
processeur, puis deux pour enfin arriver a` l’exploitation totale des ressources he´te´roge`nes
de la machine. L’e´quilibrage de charge de calcul re´alise´ a` partir de l’e´chantillonnage hors-
ligne de STEPOCL permet d’exploiter de manie`re conjointe toutes les ressources dans
les trois cas applicatifs. De plus, dans le cas du stencil et de la multiplication de matrices,
le fait que les donne´es puissent eˆtre distribue´es sur les diffe´rentes ressources permet de
traiter des jeux de donne´es plus larges en paralle`le. Avec l’application originale, seule une
exe´cution sur le processeur hoˆte de la plate-forme Hannibal permettait de mener a` bien
le calcul car la taille de la me´moire des GPU est trop limite´e.
Mots-cle´s : Calcul Haute Performance, Paralle´lisme, Architectures he´te´roge`nes, OpenCL,
ge´ne´ration de code, e´quilibrage de charge
Re´sume´ en Anglais:
Heterogeneous architectures have been widely used in the domain of high performance
computing. However developing applications on heterogeneous architectures is time con-
suming and error-prone because going from a single accelerator to multiple ones indeed
requires to deal with potentially non-uniform domain decomposition, inter-accelerator
data movements, and dynamic load balancing.
The aim of this thesis is to propose a solution of parallel programming for novice devel-
opers, to ease the complex coding process and guarantee the quality of code. We lighted
and analysed the shortcomings of existing solutions and proposed a new programming
tool called STEPOCL along with a new domain specific language designed to simplify
the development of an application for heterogeneous architectures. We evaluated both
the performance and the usefulness of STEPOCL. The result show that: (i) the per-
formance of an application written with STEPOCL scales linearly with the number of
accelerators, (ii) the performance of an application written using STEPOCL competes
with an handwritten version, (iii) larger workloads run on multiple devices that do not
fit in the memory of a single device, (iv) thanks to STEPOCL, the number of lines of
code required to write an application for multiple accelerators is roughly divided by ten.
Keywords: High-Performance Computing, Parallelism, Heterogeneous Architectures,
OpenCL
Contents
1 Introduction 1
1.1 Objective of thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Outline and contribution . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 The rise of heterogeneous computing 5
2.1 Heterogeneous platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Multi-core processor . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.2 GPU computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.3 Intel Xeon Phi . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.4 AMD APU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Exploitation of heterogeneous architectures . . . . . . . . . . . . . . . . . 9
2.2.1 Computing on CPUs . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 Computing on accelerators . . . . . . . . . . . . . . . . . . . . . . 13
2.2.3 Computing on heterogeneous architectures . . . . . . . . . . . . . 16
2.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3 STEPOCL 23
3.1 Programming on heterogeneous multi-device architectures in OpenCL . . 24
3.1.1 Background of OpenCL . . . . . . . . . . . . . . . . . . . . . . . 24
3.1.2 Challenge of using multiply compute devices . . . . . . . . . . . . 27
3.2 Component of STEPOCL . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.1 Compute kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.2 STEPOCL configuration file . . . . . . . . . . . . . . . . . . . . 32
3.2.3 STEPOCL Output of OpenCL application . . . . . . . . . . . . 34
3.3 STEPOCL internal mechanism . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.1 Automatic device management . . . . . . . . . . . . . . . . . . . 35
3.3.2 Automatic workload partition . . . . . . . . . . . . . . . . . . . . 36
3.3.3 Automatic adjusting workload . . . . . . . . . . . . . . . . . . . . 37
3.3.4 Automatic data consistency management . . . . . . . . . . . . . . 37
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4 Implementation 39
4.1 Analysis of region . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1.1 PIPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.1.2 Analysis of regions with PIPS . . . . . . . . . . . . . . . . . . . . 40
4.1.3 Analysis of OpenCL kernel . . . . . . . . . . . . . . . . . . . . . . 42
i
4.2 Oﬄine profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.3 Workload partition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3.1 Data space partition . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.4 Workload balancing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.5 Data transmission between multiple devices . . . . . . . . . . . . . . . . 50
4.6 Generation of Host code . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5 Evaluation 55
5.1 Test cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.2 Volume of the generated source code . . . . . . . . . . . . . . . . . . . . 61
5.3 Performance evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.3.1 Experimental platforms . . . . . . . . . . . . . . . . . . . . . . . 63
5.3.2 Evaluation of the profiler . . . . . . . . . . . . . . . . . . . . . . . 64
5.3.3 Comparison with the reference codes . . . . . . . . . . . . . . . . 64
5.3.4 Stencil . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.3.5 Matrix multiplication . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.3.6 N-body . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.4 Analysis of overhead of communication . . . . . . . . . . . . . . . . . . . 68
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6 Conclusion 71
6.1 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.3 Perspectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Appendices 79
ii
List of Figures
1 Vue globale de STEPOCL. . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1 A basic diagram of a dual-core processor . . . . . . . . . . . . . . . . . . 6
2.2 Floating-Point Operations per Second for the CPU and GPU. . . . . . . 7
2.3 Memory bandwidth for the CPU and GPU. . . . . . . . . . . . . . . . . 8
2.4 The GPU devotes more transistors to data processing. . . . . . . . . . . 9
2.5 Intel Xeon Phi microarchitecture. . . . . . . . . . . . . . . . . . . . . . . 11
2.6 Theads model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.7 Host-accelerator model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.8 Data Partitioning in homogeneous way . . . . . . . . . . . . . . . . . . . 19
2.9 Comparison of original kernel and partitioned kernel . . . . . . . . . . . . 20
3.1 OpenCL Platform Model (image source: KHRONOS group). . . . . . . . 24
3.2 OpenCL Work-Groups and Work-Items. . . . . . . . . . . . . . . . . . . 25
3.3 Matrix production on single device . . . . . . . . . . . . . . . . . . . . . 28
3.4 Matrix production on multiple device . . . . . . . . . . . . . . . . . . . . 29
3.5 Data consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.6 Overview of the STEPOCL environment. . . . . . . . . . . . . . . . . . 31
3.7 Device management: using different kernel . . . . . . . . . . . . . . . . . 35
3.8 Device management: using a common kernel . . . . . . . . . . . . . . . . 36
4.1 Over-approximated analysis . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2 Workload partitioning for multi-device platform with single context . . . 45
4.3 Workload partitioning for multi-device platform with multiply contexts . 45
4.4 Available data partitioning over multiple devices . . . . . . . . . . . . . . 46
4.5 Data partition with ghost region . . . . . . . . . . . . . . . . . . . . . . 47
4.6 The process of Data partition . . . . . . . . . . . . . . . . . . . . . . . . 48
4.7 Overview of the load balancing algorithm used in the STEPOCL runtime. 49
4.8 Ping-pong effect of workload adjustment. . . . . . . . . . . . . . . . . . . 49
4.9 Data transmission between two devices . . . . . . . . . . . . . . . . . . . 51
5.1 5-point 2D-stencil computation. . . . . . . . . . . . . . . . . . . . . . . . 56
5.2 the structure of generated 2D stencil code . . . . . . . . . . . . . . . . . 63
5.3 Workload adjustment performance of the 3D-stencil application. . . . . . 65
5.4 Performance of the 5-point 2D-stencil application. The horizontal axis
corresponds to the size the input and output matrices required to solve
the problem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
iii
5.5 Performance of the matrix multiplication application. The horizontal axis
corresponds to the summed size of the A, B, and C matrices. . . . . . . . 67
5.6 Performance of the N-body application. . . . . . . . . . . . . . . . . . . . 68
5.7 Partitioning data by column . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.8 Partitioning data by row . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.9 Data partition of a 2D table . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.1 Distributions of a 2D table on 4 devices . . . . . . . . . . . . . . . . . . . 73
iv
List of Tables
4.1 PIPS overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2 OpenCL scalar data type . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3 OpenCL vector data type . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.4 Get information about an OpenCL device . . . . . . . . . . . . . . . . . 44
4.5 read and write buffer objects . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.1 Generated STEPOCL code size (in lines). . . . . . . . . . . . . . . . . . 62
5.2 Distribution of the lines of code of the generated 2D stencil application . 62
5.3 Experimental platform outline. . . . . . . . . . . . . . . . . . . . . . . . 64
5.4 Relative performance of STEPOCL as compared to a native OpenCL
implementation on Hannibal. . . . . . . . . . . . . . . . . . . . . . . . 65
v
Chapter 1
Introduction
Que peu de temps suffit pour
changer toutes les choses.
—Victor Hugo
Contents
1.1 Objective of thesis . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Outline and contribution . . . . . . . . . . . . . . . . . . . . 3
Since the first general-purpose electronic computer ENIAC was created in 1946, sci-
entists have never stopped the pace of creating new computers with higher performance
and computational power. Scientists had kept enhancing the performance by increasing
CPU frequency and complexity of CPU architecture until they meet the bottleneck of
heat dissipation. Meanwhile, integrated circuit transistor technology has almost reached
its physical limit. The number of transistors contained in processor chip can be used as
a rough estimate of its complexity and performance. Moore’s law which is an empirical
observation states that the number of transistors of a typical processor chip doubles ev-
ery 18 to 24 months. However today, there is not much space for increasing the density
of transistors in a single CPU chip, because extreme miniaturization of electronic gates
makes the effects of phenomena like electromigration and subthreshold leakage become
much more significant [24]. These factors make scientist to investigate new solutions:
parallelism.
Instead of using single high frequency unit, the trend of computer architecture has
turned to use more but relatively slow processing units (multi-core processors). In the
early 2009s, most desktop CPUs have become multi-cores. Since then, more and more
types of multi-core processor have been used in general purpose computing and heteroge-
neous architectures which are usually composed of a host multi-core processor (CPU) and
some auxiliary specially designed processors (called accelerators, such as GPU and Intel
Xeon Phi) have become very important in the domain of high performance computing.
In Top500 [5] fastest supercomputer lists of 2015, both Tianhe-2 (ranked as No. 1) and
Titan (ranked as No. 2) used different type of processors: Tianhe-2 consists of 32,000
Intel Xeon CPUs and 48,000 Intel Xeon Phi coprocessors; Titan consists of 18,688 AMD
Opteron CPUs and 18,688 Nvidia Tesla GPUs.
1
Compared to homogeneous architectures, heterogeneous architectures have greater ad-
vantage on computing performance. These systems gain performance not just by adding
cores, but also by incorporating specialized processing capabilities to handle particular
tasks. Heterogeneous architectures utilize multiple processor types to benefit the best of
each devices. For example, GPU processing, apart from its well-known 3D graphics ren-
dering capabilities, can also perform mathematics computations on very large data sets,
while CPUs can run the operating system and perform traditional serial tasks, such as
data transfer management. Moreover, GPU and Xeon Phi have vector processing capa-
bilities that enable them to perform parallel operations on very large sets of data at much
lower power consumption relative to the serial processing of similar data sets on CPUs.
Followed with the rise of heterogeneous architectures, applications such as augmented
reality, rich media composition, numerical simulations makes heterogeneous architectures
usage more productive.
However the development of such applications is a challenge due to the usage of var-
ious hardware and many APIs (application program interface) together with a goal for
power-performance efficiency. Designing applications for multi-accelerator heterogeneous
systems requires that developers have rich experience of parallel programming and some
strong knowledge of architectures. Firstly, developers need to choose the most appropri-
ate set of computing devices for a given application. Then they have to write compute
kernels optimized for the selected accelerators. In order to achieve the best possible per-
formance, the developer has to identify an efficient partitioning of the workload among
the accelerators. This efficient partitioning is not only related to the theoretical perfor-
mance of the accelerator, but also to the actual application and its workload. Then, the
developer has to write code to coordinate the execution of multiple kernels running on
different devices, to partition the workload among them and to perform data movements
and synchronization between them. Implementing these features is time-consuming and
error-prone. Moreover, the developer often implements these features for a specific hard-
ware and has to drastically modify his code for a different target.
As we will present in this documents, existing programming tools for heterogeneous
architectures typically define APIs to deploy and execute compute kernels on the pro-
cessing units of accelerators, however, none of these programming tools are tailored for
multi-accelerator application development. The challenges mentioned above are still left
for developers to resolve.
1.1 Objective of thesis
The main objective of this thesis is to study heterogeneous computing at programming
level and to propose a unified system that could automatically generate code for hetero-
geneous multi-device architectures. We want to relief the developers from tedious and
time consuming coding process, and make them more focus on solving the problem itself.
More specifically, this thesis will focus on studying how to solve following problems:
• How to split and distribute a parallel task to multiple devices;
• How to automatically balance the workload of whole system and manage the com-
munications between devices;
2
• How to automatically generate the whole application without too much coding
effort.
1.2 Outline and contribution
Chapter 2 introduces the state of the art of heterogeneous computing including the com-
puting devices equipped on current heterogeneous architectures and programming tools
for developing applications.
In Chapter 3, we first introduced the background of OpenCL which is the basis of
our main work. Then we presented our contribution: a new programming tool called
STEPOCL. STEPOCL separates the functional aspects of the code, i.e., the compute
kernels, from the non-functional ones. The non-functional aspects of the code are de-
scribed in a domain specific language. It mainly consists in describing the data layout
and the kernel of the application. Using a domain specific language improves the produc-
tivity of the developer by decreasing the number of lines of codes required to implement a
full multi-device application. Moreover, the code written in the domain specific language
is not limited to a specific hardware setting, which increases the portability of the code.
Finally, it also has the advantage of avoiding many errors caused by the use of a low-level
language. Based on the compute kernels and on the description of the non-functional
aspects, STEPOCL automatically generates a full OpenCL application, but also an of-
fline profiler for this application. The profiler computes an efficient workload partitioning
for a specific set of accelerators and a specific workload. The generated application han-
dles the initialization of the OpenCL environment, which includes the discovery of the
accelerators, the mapping of the data to the accelerators according to the oﬄine profil-
ing results, and the launch of the kernels over the accelerators. During execution, the
generated application also automatically exchanges the data between the accelerators to
maintain the data consistency thanks to a polyhedral data analysis, and, at the end of
the execution, the generated application retrieves the results from the accelerators.
Chapter 4 introduces the implementation of main modules of STEPOCL including
analysis of region, oﬄine profiling, runtime system and code generation. These modules
ensure that the generated applications can profit the full potential of any multiple devices
heterogeneous architectures.
Chapter 5 presents the evaluation of generated applications. We evaluate STEPOCL
with three application kernels (a 5-point 2D-stencil, a matrix multiplication and an N-
body application) on two multi-device heterogeneous machines that combine CPUs with
different accelerators: CPU+GPUs and CPU+Xeon Phis. Our main results show that:
• When running on multiple heterogeneous devices 1, the performance of the code
generated by STEPOCL scales linearly with the number of devices.
• As compared to the same applications written directly in OpenCL and provided
with the OpenCL version of AMD, the applications written with STEPOCL have
similar performance.
• Thanks to STEPOCL, we are able to run large workloads on multiple devices that
do not fit in the memory of a single device.
1A device is either a CPU or an accelerator.
3
• As compared to the generated code, STEPOCL is able to divide by ten the num-
ber of lines of code of the application. Also as compared to the native OpenCL
applications, STEPOCL is able to divide by five their number of lines of code.
Furthermore, while the applications shipped with the OpenCL version of AMD only
runs on a single device, the STEPOCL applications are able to run on multiple
heterogeneous devices.
At last, chapter 6 concludes the whole works of this thesis and presents future work.
4
Chapter 2
The rise of heterogeneous computing
Essayez de ne pas devenir un
homme de succe`s, mais plutoˆt
essayez de devenir un homme de
valeur.
—Albert Einstein
Contents
2.1 Heterogeneous platforms . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Multi-core processor . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.2 GPU computing . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.3 Intel Xeon Phi . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.4 AMD APU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Exploitation of heterogeneous architectures . . . . . . . . . 9
2.2.1 Computing on CPUs . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 Computing on accelerators . . . . . . . . . . . . . . . . . . . . 13
2.2.3 Computing on heterogeneous architectures . . . . . . . . . . . 16
2.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
This chapter introduces heterogeneous computing from two aspects: hardware archi-
tecture and programming on heterogeneous architectures.
2.1 Heterogeneous platforms
A computer is usually equipped with two types of processor: CPU (central processing
units) and GPU (graphics processor unit). Before the 1990s, CPUs still play important
role in massive computing while GPUs are merely displaying the monitor. In order to
improve the performance of computers, engineers focused on improving the performance
of CPU mainly by increasing in the frequency and complicating the architecture (mainly
by using more transistors on a CPU die). However, if the frequency gets too high, the
5
Processor
L1 Cache
L2 Cache
Core 0 Core 1
L1 Cache
L2 Cache
System Bus
System Memory
(a) Multi-core processor seperate L2
Processor
L1 Cache
L2 Cache
Core 0 Core 1
L1 Cache
System Bus
System Memory
(b) multi-core processor shared L2
Figure 2.1 – A basic diagram of a dual-core processor
chip will melt by the extreme heat. That’s the reason why most current CPUs runs
about 2 GHz to 3 GHz, but not more than 4GHz. Meanwhile, due to the physical limit,
increasing the density of transistor on a single chip also meet the bottleneck.
2.1.1 Multi-core processor
In order to keep improving the performance over the last 10 years, engineers found other
ways – using more CPU cores on the same CPU chip. Figure 2.1 presents two typical
multi-core processor architectures. The advantage of multi-core processors is that mul-
tiply tasks (such as running a software or scientific computation) can be dispatched to
different CPU core, each core can work at relatively low frequency but we can still get
a faster experience. The dual core CPUs and quad core CPUs on the market today are
good enough for regular tasks in daily life. However, for scientists and engineers, CPUs
are still not powerful enough. Meanwhile, GPUs, due to their special architecture and
extraordinary data parallel processing ability, have become very important in the domain
of high performance computing.
2.1.2 GPU computing
The original GPUs were designed to rapidly render images in a frame buffer intended
for output to a display. In the last 20 years, the improvement of GPUs have been pri-
marily driven by the demand for realtime, high-definition graphic, the evolution of GPU
hardware architecture has gone from a specific single core, function hardware pipeline
implementation made only for graphics, to a set of parallel and programmable cores for
more general purpose computations [22]. Modern GPU has become a highly parallel,
6
multithreaded, manycore processor with tremendous computational horsepower and very
high memory bandwidth, as illustrated by Figure 2.2 and Figure 2.3.
Figure 2.2 – Floating-Point Operations per Second for the CPU and GPU.
The reason behind the difference of floating-point capability between the CPU and the
GPU is that the architecture of modern GPU is specialized for compute-intensive, highly
parallel computation – exactly what graphics rendering is about – and therefore designed
such that more transistors are devoted to data processing rather than data caching and
flow control, as illustrated by Figure 2.4.
More specifically, the GPU is well designed for dealing with the problems that can
be expressed as data-parallel computations (the same program is executed on many data
elements in parallel). Because the same program is executed for each data element, there
is a lower requirement for sophisticated flow control and because it is executed on many
data elements and has high arithmetic intensity, the memory access latency can be hidden
with calculations instead of big data caches. Many applications that process large data
sets can use a data-parallel programming model to speed up the computations, from
image rendering or signal processing to physics simulation or computational biology. The
success of GPU computing opened a new page of heterogeneous computing, and new
accelerators (such as Intel Xeon Phi [26] ) begin to emerge.
2.1.3 Intel Xeon Phi
Intel Xeon Phi Coprocessor is the brand name for all Intel Many Integrated Core Archi-
tecture (Intel MIC Architecture) based products. The Intel Xeon Phi coprocessors are
7
Figure 2.3 – Memory bandwidth for the CPU and GPU.
designed to extend the reach of applications that have demonstrated the ability to fully
utilize the scaling capabilities of Intel Xeon processor-based systems and fully exploit
available processor vector capabilities or memory bandwidth. Compared to traditional
Intel Xeon processor, the Intel Xeon Phi coprocessors provide extra power-efficient scal-
ing, vector support, and local memory bandwidth. Each Intel Xeon Phi coprocessor
equips more than 50 cores (it varies between generations) interconnected by a high-speed
bidirectional ring as presented in Figure 2.5. Each core has four hardware threads and
clocked at 1 GHz or more. The many core architecture makes Intel Xeon Phi coprocessors
competitive with GPUs on the performance of parallel computing. Moreover, Intel Xeon
Phi coprocessor offers more programmability. Applications that show positive results
with GPUs should always benefit from Intel Xeon Phi coprocessors because the same
fundamentals of vectorization or bandwidth must be present. The opposite is not true,
the flexibility of an Intel Xeon Phi includes support for applications that cannot run on
GPUs [18].
2.1.4 AMD APU
Current CPUs and GPUs have been designed as separate processing elements, each has
a separate memory space. In order to launch an execution on GPU, the required data
located on CPU memory needs to explicitly be copied to GPU memory and then copied
8
Figure 2.4 – The GPU devotes more transistors to data processing.
back again. AMD Accelerated Processing Unit (APU) used Heterogeneous System
Architecture (HSA) that integrates CPU and GPU on the same bus with unified shared
main memory. As a result, APU performs zero data movement between CPU, GPU and
other accelerators. The design of unified memory reduces communication latency between
CPU and other accelerators. Moreover, unified memory is more practical on programming
level.
2.2 Exploitation of heterogeneous architectures
Since heterogeneous architectures become more and more popular, the computing com-
munity is building tools and libraries to ease the use of these heterogeneous systems.
2.2.1 Computing on CPUs
The parallelism of a computer program on multi-core CPUs can be achieved through using
threads model. In threads model, a parallel main program can have multiple concurrent
execution paths, and each path is executed by a thread. In Figure 2.6, program main.out
initializes two arrays, and then creates four tasks (threads), each tasks will be scheduled
and run by operating system concurrently. Each thread has local data and shares a
global memory. The synchronization mechanism are usually applied to avoid the conflict
of operation on same global memory address, such as avoiding that several threads update
the same global address at same time.
Pthreads
Pthreads is a POSIX (Portable Operating System Interface) standard [4] for threads.
It provides an API (Application Programming Interface) for creating and manipulating
threads. Pthreads is implemented in a pthread.h header and a thread library. Although
Pthreads only supports C language, it is available on many mainstream operating sys-
tems such as FreeDSB, Linux, Mac OS x and Windows.
The example in Listing 2.1 creates 4 threads with pthread create function, and each
thread prints its own id. Listing 2.2 shows the execution result in Listing 2.2.
9
Listing 2.1 – Pthreads application in C
#inc lude <pthread . h>
#inc lude <s t d i o . h>
#d e f i n e NUM THREADS 4
void ∗ p r i n t t h r e a d i d ( void ∗argument )
{
long thread id ;
th r ead id = ( long ) argument ;
p r i n t f ( ”I am thread %ld !\n” , th r ead id ) ;
p th r ead ex i t (NULL) ;
}
i n t main ( i n t argc , char ∗argv [ ] )
{
pthread t thread [NUM THREADS] ;
i n t e r r ;
long t ;
f o r ( t =0; t<NUM THREADS; t++)
{
p r i n t f ( ”Creat ing thread %ld !\n” , t ) ;
e r r = pthread c r ea t e (&thread [ t ] ,NULL, p r i n t th r e a d i d , ( void ∗) t ) ;
i f ( e r r ){
p r i n t f ( ”ERROR!\n” ) ;
e x i t (−1);
}
}
pth r ead ex i t (NULL)
}
Listing 2.2 – Execution result
Creat ing thread 0 !
Creat ing thread 1 !
I am thread 0 !
Creat ing thread 2 !
Creat ing thread 3 !
I am thread 1 !
I am thread 2 !
I am thread 3 !
10
Figure 2.5 – Intel Xeon Phi microarchitecture.
OpenMP
OpenMP is an API that supports multi-platform shared memory multiprocessing pro-
gramming in C, C++ and Fortran. As an industrial standard, OpenMP can run on
most platforms including Linux, Mac OS X and Windows platforms. OpenMP program
is composed of a set of compiler directives, runtime library routines and environment
variables. OpenMP uses the fork-join model of parallel execution. At the beginning,
OpenMP program starts as a single thread: the master thread. The master thread exe-
cutes sequentially until it reaches the parallel region construct. The master thread then
creates a group of parallel threads to execute the statements that are enclosed in parallel
region construct. Once the group of threads finish the execution of the statements in
parallel region construct, they will be synchronized and deleted, leaving only a master
thread.
OpenMP directives can be inserted directly into serial code and achieve a considerable
improvement of performance. Listing 2.3 presents a program that initiates a table with
multiple threads. The number of threads used can be defined as environment variables.
Listing 2.4 presents an example of setting OpenMP environment. In this case, eight
threads will be used to initialize in parallel tableA of Listing2.3.
Listing 2.4 – Setting OpenMP environment variables
export OMP NUM THREADS=8
11
...
… 
do i=1,n
    A[i]=i
    B[i]=A[i]
end do
call func1()
call func2()
call func3()
call func4()
…
...
Tim
e  lin e
T :thread
T1
T2
T3
T4
main.out
Figure 2.6 – Theads model
OpenMP is well suited for shared memory multi-core architectures. However, if pro-
grammers want to use more computing resources such as multiple nodes on large-scale
clusters, they have to use other APIs, such as MPI [11] (Message Passing Interface), to
manage computation and communication on distributed memory environment.
STEP and dSTEP
STEP [23] is an OpenMP directives based programming tools that transforms a pro-
gram into an MPI source program and produces MPI code close to hand-written MPI
code. It is developed by HP2 (Haute Performance et Paralle´lisme) team of TELECOM
SudParis. The programmer adds OpenMP directives and then STEP generates a MPI
program automatically. STEP extends the usage of OpenMP (which is restricted to
Listing 2.3 – Initializing a table in parallel
i n t main ( i n t argc , char ∗ argv [ ] ) {
const i n t s i z e = 100 ;
i n t tableA [ s i z e ] ;
#pragma omp p a r a l l e l f o r
f o r ( i n t n=0; n<s i z e ; n++)
tableA [ s i z e ] = n∗n
. . . . . .
}
12
shared-memory architectures) to distributed-memory architectures such as clusters of
many-cores. Thanks to the region analysis of compiler PIPS [16], STEP can efficiently
generate the communications between devices. Habel et al. improved the data partition-
ing module of STEP, and named the new tool as dSTEP [13]. dSTEP provides a large
set of distribution types for data partitioning. These distribution types are unified in
a ”dstep distribute” directive. It also provide a ”dstep gridify” directive to express the
computation distribution and the schedule constraints of loop iterations.
At the beginning of this thesis, the main limitation of STEP and dSTEP is that they
do not fully support the computation on the accelerators such as GPUs and Intel Xeon
Phi (dSTEP supports CUDA). Besides, the data partitioning of dSTEP is homogeneous:
every sub-data has same size. Considering the varieties of hardware performance of
computing devices, a homogeneous data partitioning may lead to poor performance.
2.2.2 Computing on accelerators
Some programming platforms provide accessibility to multiple type accelerators such as
GPU and co-processors. Different from multi-core CPU architecture, the memory on the
accelerator may be completely separated from main memory (which is shared by multi-
core CPU). Most of programming platforms for diverse computing resources use an oﬄoad
model which includes a single host and one or more accelerators which can be any kind
of processor (CPU, GPU and Xeon Phi etc.). Figure 2.7 presents a processing flow of
an oﬄoad model. First of all, developers allocate data region on main memory (host
memory). Then data region are transferred from main memory to remote memory on
accelerators. In the third step, the host instructs the process to accelerators, accelerators
execute computing functions submitted from host. At last, the results are copied to main
memory. Due to the fact that accelerators do not share a common memory, programmers
have to manually maintain the consistency of data in each remote memory.
host
Accelerator
(CPU, GPU or
other processors)Main
Memory
Memory on
accelerator
step 1
step 2
step 4
step 3
step1 : allocate data
step2 : copy data 
step3 : compute on data 
 
step4 : copy  result 
Figure 2.7 – Host-accelerator model
13
Listing 2.5 – The syntax of OpenACC directives
#d e f i n e SIZE 1000
f l o a t a [ SIZE ] ;
f l o a t b [ SIZE ] ;
i n t main ( )
{
i n t i ;
// I n i t i a l i z e arrays .
f o r ( i = 0 ; i < SIZE ; ++i ) {
a [ i ] = ( f l o a t ) random ()%1000;
b [ i ] = 0 .0 f ;
}
// 1D s t e n c i l computing
#pragma acc k e r n e l s copyin ( a ) copy (b)
f o r ( i = 1 ; i < SIZE−1; ++i ) {
b [ i ]=( a [ i−1]+a [ i ]+a [ i +1])/3
}
re turn 0 ;
}
CUDA
CUDA (Computer Unified Device Architecture) is a parallel computing platform and
programming model created by NVIDIA [1]. It allows software developers to benefit the
computing performance of CUDA-enable GPU for general purpose processing.
The CUDA platform is composed of CUDA libraries and compiler directives. CUDA
supports several industry-standard programming language including C, C++ and For-
tran. Unlike OpenMP which achieves parallelism by inserting directives on serial code,
CUDA users need to define specific functions (called kernel) for GPU computing. A ker-
nel is executed by a number threads, each having a unique threadID. CUDA organizes
threads into a hierarchy of a grid of thread blocks. A grid contains a set of thread blocks
with unique bockID, and each blocks contains same amount of threads. The threads in a
block can access to a shared memory space private to the block and can be synchronized
through synchronization function.
CUDA supports most of operating systems such as Microsoft Windows, Linux and
Mac OS. However, CUDA only works with Nvidia GPUs.
OpenACC
OpenACC (Open Accelerators) [29] is a programming standard for programming on het-
erogeneous CPU/GPU systems in C/C++ and FORTRAN. OpenACC standard is de-
veloped by Nvidia, Cray, CAPS and PGI. Similar to OpenMP, OpenACC provides a set
of compiler directives to specify parallel regions, control flow instructions on accelerators
and manages data movements on remote memory of accelerators.
Listing 2.5 presents an example of an OpenACC program that performs a 1D 3-points
stencil computation with oﬄoad mode. The OpenACC directives in Listing 2.5 tells the
14
Listing 2.6 – OpenACC in OpenMP threads
. . . . . .
numgpus = acc get num dev ices ( a c c d e v i c e n v i d i a ) ;
#pragma omp p a r a l l e l num threads (numgpus )
{
gpunum = omp get thread num ( ) ;
acc se t dev ice num ( gpunum , a c c d e v i c e n v i d i a ) ;
#pragma acc data copy ( x [ 0 : n ] )
{
// thread tnum c o p i e s x to GPU gpunum
. . . . . .
// launch computation
. . . . . .
}
}
. . . . . .
compiler following information:
• #pragma acc: This is an OpenACC directive.
• kernels : This is a parallel regions (also called kernel region), which contains work-
sharing loops.
• copyin(a): Array a needs to copied from the host memory to the device memory.
• copy(b): Array b needs to copied from the host memory to the device memory, and
the assigned value of array b on device memory needs to be copied back to the host.
As we can see from Listing 2.5, OpenACC eases the way we program on heteroge-
neous architectures. However, OpenACC is not suited to multiply device programming.
In order to concurrently use multiple GPUs on a machine, programmers have to mix
OpenACC with other API (such as OpenMP or MPI [11]). Listing 2.6 presents an ex-
ample of OpenACC program using multiple Nvidia GPU. By using OpenMP directives,
each OpenMP thread uses a different GPU. Then, programmers have to manage work-
load balancing, data movement and communications with hybrid directives (OpenMP
and OpenACC directives) which is quite confusing and error-prone.
OpenCL
OpenCL is designed for writing programs that execute across heterogeneous platforms
consisting of CPUs, GPUs, field-programmable gate arrays (FPGAs) and other proces-
sors. As an open standard maintained by Khronos Group, OpenCL is supported by most
of hardware manufacturers such as Apple, IBM, Intel, Nvidia, Qualcomm, Samsung, etc.
OpenCL is also supported by most computer systems, such as Windows, Linux and Unix.
OpenCL also defines computing functions as kernels for the execution on compute
device. A work-item in OpenCL plays the same role as a thread in CUDA. Work-items
15
are organised into work-groups. OpenCL achieves parallelism by simultaneously executing
kernel by each work-item. A main purpose of OpenCL is to unify the programming model
of heterogeneous platforms. OpenCL views a processor of any type (CPU, GPU, FGPAs
...) on a machine as a compute device. All devices are logically defined a common abstract
memory hierarchy. Thus an OpenCL program is portable across different platforms,
although the performance has not necessarily the same portability. Our main works in
this thesis are based on the OpenCL framework and our primary objective is to improve
the portability of performance of OpenCL applications. More details about OpenCL is
presented in Section 3.1.
2.2.3 Computing on heterogeneous architectures
Although some programming languages can natively support using multiple computing
resources, many studies further provide facility to exploit full power of heterogeneous
architecture.
Bolt C++ Template Library
Bolt [2] is an OpenCL–based C++ template library optimized for GPUs. Compared to
Standard OpenCL API, Bolt provides a more simplified STL-like interface, it can directly
interact with host memory structures such as std::vector or host arrays (e.g. float∗). Bolt
also provides common optimized routines like sort(), reduce(). The library itself can select
automatically where to execute such a routine (GPU or CPU). No OpenCL API calls are
required since Bolt library handles all initialization of the OpenCL environment and the
communication between devices. Listing 2.7 presents how to use Bolt to sort a random
array. However, the usage of Bolt is limited to built-in routines and it is only available
on AMD devices.
Listing 2.7 – Sort a random array with Bolt
#inc lude <bo l t / c l / s o r t . h>
#inc lude <vector>
#inc lude <algorithm>
i n t main ( )
{
// genera te a random array on ho s t
std : : vector<int> array ( 1 0 2 4 ) ;
s td : : generate ( array . begin ( ) , array . end ( ) , rand ) ;
// sor t , run on b e s t d e v i c e in the p la t form
bo l t : : c l : : s o r t ( array . begin ( ) , array . end ( ) ) ;
r e turn 0 ;
}
Boost.Compute
Boost.Compute [3] is a GPU/parallel-computing library for C++ based on OpenCL. It
provides an STL-like C++ interface which is very similar to Bolt. It contains common
algorithms (e.g. transform(), sort()) along with common containers (e.g. vector<T>,
16
flat set<T>). In Boost.Compute, the interaction between host and computing devices is
managed by an object of the command queue class which corresponds clCommandQueue
in standard OpenCL. Listing 2.8 presents how to use Boost.Compute to sort a random
array. Boost.Compute as well as Bolt is very well suited for implementing data initializa-
tions or launching tasks with existed built-in routines. However, expressing some tasks
which require more complicated data structures cannot be supported. The studies of Pe-
ter et al. [25] also prove that Boost.Compute and Bolt provide high compute performance
for simple tasks, but very low performance for complex tasks.
Listing 2.8 – Sort a random array with Boost.Compute
#inc lude <vector>
#inc lude <algorithm>
#inc lude <boost /compute . hpp>
namespace compute = boost : : compute ;
i n t main ( )
{
// g e t the d e f a u l t compute d e v i c e
compute : : dev i c e gpu = compute : : system : : d e f a u l t d e v i c e ( ) ;
// c r e a t e a compute c o n t e x t and command queue
compute : : context ctx ( gpu ) ;
compute : : command queue queue ( ctx , gpu ) ;
// genera te random numbers on the h os t
std : : vector<f l o a t> h o s t ve c to r ( 1 0 2 4 ) ;
s td : : generate ( ho s t v e c to r . begin ( ) , ho s t v e c t o r . end ( ) , rand ) ;
// c r e a t e v e c t o r on the d e v i c e
compute : : vector<f l o a t> d e v i c e v e c t o r (1024 , ctx ) ;
// copy data to the d e v i c e
compute : : copy ( ho s t v e c to r . begin ( ) , ho s t v e c t o r . end ( ) ,
d e v i c e v e c t o r . begin ( ) , queue ) ;
// s o r t data on the d e v i c e
compute : : s o r t ( d e v i c e v e c t o r . begin ( ) ,
d e v i c e v e c t o r . end ( ) , queue ) ;
// copy data back to the h os t
compute : : copy ( d e v i c e v e c t o r . begin ( ) , d e v i c e v e c t o r . end ( ) ,
h o s t v e c t o r . begin ( ) , queue ) ;
r e turn 0 ;
}
BOAST
BOAST [10] is an automatic source-to-source transformation tool which optimizes loop
structures in order to find the best performance configuration for a given type of com-
puting resource. According to the configurations which are defined by users, BOAST can
generate code in C or Fortran. The main interest for our work is the idea of architecture-
specific optimization used in BOAST. Due to the difference of hardware features on
heterogeneous architectures, the generic methods of code transformation may lead to a
poor performance. In order to optimize the performance for a specific architecture, we
have to tune the code accordingly. The techniques of code transformation could also be
17
used for the optimization of OpenCL kernel in our future works.
StarPU
StarPU [6] is a software that helps programmers to exploit the computing power of het-
erogeneous multi-device architectures. StarPU is a C library. The core component of
StarPU is a run-time support library which is used for scheduling the tasks on hetero-
geneous multi-core systems. An execution of StarPU is defined by two data structures:
codelet and task. A codelet describes a computational kernel and indicates runnable
architectures such CPUs or GPUs. A task describes which data can be accessed by
codelet. Once a task is submitted, StarPU will automatically assess accessibility of
computing resources and will schedule the task to ensure load balancing over heteroge-
neous systems.
StarPU has implemented a rich set of scheduling policies such as eager, random
and heft (heterogeneous earliest finish time), etc. These features provide great facility
for the implementation of task-parallel applications. However, StarPU is not perfect
at partitioning data-parallel tasks. StarPU provides an interface which can partition a
data region homogeneously, however a perfect partition should consider the diversity of
performance of available computing devices, then partition the data region in appropriate
proportion. Moreover, programmers still have to manually manage communication and
consistency of data which is time-consuming and error-prone.
StarPU provides an OpenCL extension called SOCL [14]. Unlike a standard OpenCL
application, SOCL is not dedicated to a specific kind of device nor to a specific hard-
ware vendor, it virtualizes all existing available devices (CPUs and GPUs) as one unique
OpenCL device on a unified OpenCL Platform. Thus, programmers no longer need to
create extra OpenCL context and OpenCL command queues for each CPU or GPU, all
computing kernel just go to only one command queue which is responsible for the exe-
cution on virtual machine. Once computing kernels are submitted, SOCL transparently
distributes computing kernels over physical devices at runtime.
As an extension of StarPU, SOCL provides powerful automatic scheduling abilities on
heterogeneous architectures and simplified the programming interface of OpenCL. But
the key issues are still unsolved. SOCL is still not able to handle workload partition of
data-parallel tasks and manage the data consistency between devices.
libWater
libWater [12] is a C/C++ library-based extension of the OpenCL programming model
that aims to alleviate programming process on large-scale clusters. libWater simplified the
programming interface of OpenCL and introduces a new device query language (DQL)
for accessing distributed devices on different compute nodes. Listing 2.9 presents the
basic syntax of DQL.
Listing 2.9 – DQL of libWater
SELECT [ALL | TOP k | POS i ]
FROM NODE [ n [ , . . . ] ]
WHERE [ r e s t r i c t i o n s a t t r i b u t e va lue s ]
ORDER BY [ a t t r i b u t e [ , . . . ] ]
18
libWater eliminates and replaces some redundant OpenCL instruction ( such as the
initialization of OpenCL platforms and contexts) by its own API which allows program-
mers to more focus on the development of computing kernels. Meanwhile, the distributed
runtime system of libWater dispatches the OpenCL commands to the addressed devices
and transparently moves data across the cluster nodes. libWater also enhances the event
system of OpenCL by enabling inter-context and inter-node device synchronization and
improves the efficiency of runtime system by analysing the collected event information.
Although libWater provides a convenient interface to access the distributed devices and
dispatch the work task on large-scale clusters, it has not yet provided any solution for
workload partition which is also a big problem for programmers.
Kim’s framework for multiple GPUs
Kim et al.[20] propose an OpenCL framework to manage multiple GPUs within a node.
This OpenCL framework combines multiple GPUs as a virtual single GPU. Programmers
only need to provide an application for single GPU, the framework takes in charge the
deployment of the code on the multiple GPUs transparently by partitioning the work-
load equally among the different devices. In order to partition the workload precisely,
the runtime system applies a run-time memory access range analysis to the kernel by
performing a sampling run and identifies an optimal workload distribution for the kernel.
The sampling code is generated from the OpenCL kernel code by using a OpenCL-C-to-
C translator, it only accesses to memory and performs no computation. Meanwhile, the
runtime maintains virtual device memory that is allocated in the main memory and keeps
it consistent to the memories of the multiple GPU devices. They also use an OpenCL-C-
to-CUDA-C translator to generates the CUDA kernel code for the distributed OpenCL
kernel.
Figure 2.8 – Data Partitioning in homogeneous way
However Kim et al. only focuse on homogeneous multiple device systems and data
independent tasks. Their OpenCL framework can not work with heterogeneous devices.
Due to the diversity of performance between devices, an equal data partitioning (such as
Figure 2.8 ) can not efficiently exploit full computing power of heterogeneous multi-device
systems. Moreover the overhead of sampling run and the overhead of translation from
OpenCL-C kernel to CUDA are not negligible.
19
SKMD
Lee Janghaeng et al. propose a tool, called SKMD, that generates multi-device source
code from a single-device kernel [17]. This system transparently orchestrates collaborative
execution of a single data-parallel kernel across multiple asymmetric CPUs and GPUs.
The programmer is responsible for developing a single data-parallel kernel in OpenCL,
while the system automatically partitions the workload across an arbitrary set of devices,
generates kernels to execute the partial workloads, dynamically adjusts the workload
across devices and merges the partial outputs together.
SKMD implements workload partition through assigning partial work-group on origi-
nal kernel. As showed in figure 2.9, SKMD adds two parameters WG from and WG to to
represent a target range of work-group index to be computed on a device. In other world,
each computing device only runs (WG from−WG to + 1) work-groups of the kernel. At
end of the execution, SKMD also utilises these two parameters to retrieve the computing
result from discrete device memory.
1  __kernel void add_CPU(
2      __global float *input,
3      __global float *output,
4      int WG_from, int WG_to)
5  {
6       int idx = get_group_id(0);
7       int idy = get_group_id(1);
8       int size_x = get_num_groups(0);
9       int tag_id = idx + idy * size_x;
10     // check whether to execute
11     if (tag_id < WG_from || tag_id > WG_to)
12         return;
13
14     int tid = get_global_id(1) 
15 * get_global_size(0)
16          + get_global_id(0);
17
18     output[tid] = input[tid]+output[tid];
19 }
1  __kernel void add_CPU(
2      __global float *input,
3      __global float *output,
4     )
5  {
6
7     int tid = get_global_id(1) 
8 * get_global_size(0)
9            + get_global_id(0);
10
11   output[tid] = input[tid]+output[tid];
12 }
Figure 2.9 – Comparison of original kernel and partitioned kernel
To do so, SKMD duplicates all data on every device. Thus each device needs to allo-
cate all the data, even if the device will not use it. This unnecessary memory redundancy
for the whole system limits the problem size.
InsiemeTP
Klaus Kofler et al. [21] propose an approach to automatically optimize task partition-
ing for different problem sizes and different heterogeneous architectures. They use the
Insieme[19] source-to-source compiler to translate a single device OpenCL program into
a multi-device OpenCL program.
The architecture of their framework consists of two main phases: training and deploy-
ment. At the training phase, a set of applications is executed with different partitioning
20
strategies. The statistics (including static program features, runtime feature and the time
of execution) of each execution is used for building a task partitioning prediction model.
The prediction model is based on Artificial Neural Networks approach. At the deploy-
ment phase, the Insieme compiler analyse the static feature of input code and translate
it into a multi-device version. After analysing the static feature of input code, prediction
model will predict the best task partitioning, and the runtime system will take charge of
the remaining execution.
However, due to the limitation of training set, the prediction model still need to be
improved. Moreover, The framework proposed by Klaus Kofler et al. didn’t provide any
mechanism to ensure data consistency.
2.3 Conclusion
In this chapter, we have introduced existing standard programming languages and pro-
gramming tools for heterogeneous architectures. Considering the performance and porta-
bility [28, 7, 27] of existing programming language, only OpenACC and OpenCL are
supported by many vendors. However OpenACC is not suited to multi-accelerator ar-
chitectures. Most of existing programming tools are designed for one specific problem
which is not convenient enough for novice programmers. In our research, we proposed a
programming tool STEPOCL that generates optimised OpenCL code for heterogeneous
multi-device architectures. STEPOCL guarantees the portability and performance of
OpenCL applications.
21
22
Chapter 3
STEPOCL
L’homme qui sait re´fe´chir est
celui qui a la force illimite´e.
—Honore´ de Balzac
Contents
3.1 Programming on heterogeneous multi-device architectures
in OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.1.1 Background of OpenCL . . . . . . . . . . . . . . . . . . . . . . 24
3.1.2 Challenge of using multiply compute devices . . . . . . . . . . 27
3.2 Component of STEPOCL . . . . . . . . . . . . . . . . . . . . 30
3.2.1 Compute kernels . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.2 STEPOCL configuration file . . . . . . . . . . . . . . . . . . . 32
3.2.3 STEPOCL Output of OpenCL application . . . . . . . . . . . 34
3.3 STEPOCL internal mechanism . . . . . . . . . . . . . . . . . 35
3.3.1 Automatic device management . . . . . . . . . . . . . . . . . . 35
3.3.2 Automatic workload partition . . . . . . . . . . . . . . . . . . . 36
3.3.3 Automatic adjusting workload . . . . . . . . . . . . . . . . . . 37
3.3.4 Automatic data consistency management . . . . . . . . . . . . 37
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
In this chapter, we present more details about programming on heterogeneous multi-
device architectures in OpenCL. Then, we introduce our contribution STEPOCL, which
is an OpenCL based programming tool, along with a new domain specific language de-
signed for simplifying the development of an application for heterogeneous multi-device
architectures.
23
Figure 3.1 – OpenCL Platform Model (image source: KHRONOS group).
3.1 Programming on heterogeneous multi-device ar-
chitectures in OpenCL
As we presented in Chapter 2, OpenCL provides portability and performance across
heterogeneous platforms. It combines a unified programming interface with a variant of
the C language to use different parallel processing devices together (e.g. CPU, GPU and
Xeon Phi).
3.1.1 Background of OpenCL
Before further investigating programming on multi-device architecture, we briefly intro-
duce some basic concepts of OpenCL.
Platform mode
The platform model of OpenCL consists of a host (a CPU) connected to one or more
OpenCL compute devices. A device can be a CPU, a GPU or any other processor
supported by the OpenCL vendor. The OpenCL devices are further divided into compute
units which are further divided into one or more processing element (PEs) where the
computations on a device occur. Figure 3.1 shows an overview of the OpenCL platform
model.
Programming model
OpenCL program consists of two parts: a host code and a computing function. The
developer first defines a computing function, called a kernel, a basic unit of executable
code which is executed on an OpenCL device. In host code, developer sets up the envi-
ronment for the execution of kernel and orchestrates the copy of the data used by the
kernel on the device before launching the execution of kernel on a chosen device. When
24
...
...
...
... ... ......Gy
Gx
sx = 0
sy = 0
work item
(wxSx+sx,
wySy+sy)
sx = Sx-1
sy = 0
work item
(wxSx+sx,
wySy+sy)
...
sx = 0
sy = Sy-1
work item
(wxSx+sx,
wySy+sy)
sx = Sx-1
sy = Sy-1
work item
(wxSx+sx,
wySy+sy)
...
... ......
work group (wx,wy)
Sx
Sy
Figure 3.2 – OpenCL Work-Groups and Work-Items.
a kernel is submitted for execution by the host, an index space is defined. Each point
in this index space presents an instance of executing kernel. In OpenCL, an instance
of executing kernel is called a work-item. Work-item is identified by its coordinates in
index space. These coordinates are the global ID for work-item which is given by a spe-
cific OpenCL primitive (the get_global_id primitive). Work-items are organized into
work-groups. The work-groups provide a more coarse-grained decomposition of the index
space. Work-groups are assigned a unique work-group ID with the same dimensionality
as the index space used for the work-item. In each work-group, work-items are assigned
a unique local ID, thus a single work-item can be uniquely identified by its global ID or
by a combination of its local ID and work-group ID.
Figure 3.2 illustrates this notion of work-group: a 2D-kernel index space is split into
Gx×Gy work-groups, each of which consists of Sx× Sy work-items.
OpenCL context and command-queue
OpenCL context defines the environment of kernel’s execution. An OpenCL context
consists of following components:
• Devices: a set of OpenCL devices to be used by the host.
• Kernels: the computing functions that run on OpenCL devices.
• Program objects: the source code program of kernel.
• Memory objects: the objects that contain data which are visible to all devices within
a context.
25
Command-queue is used for the interactions between the host and the OpenCL de-
vices. Each OpenCL device is attached by a command-queue after the definition of con-
text. With command-queue, host can deliver the commands, such as kernel execution,
memory movement and synchronization, through command-queue.
Memory model
The memory model of OpenCL defines four distinct memory regions:
• Host memory : Host memory region is visible only to the host. OpenCL defines
only how the host memory interacts with OpenCL objects and constructs.
• Global memory : This memory region is visible for all work-items.Every work-item
can do the read or write action on any element in global memory.
• Local memory : This memory region is local to a work-group. The elements in
local memory are only shared by the work-items in that work-group. For the local
memory, OpenCL provides efficient synchronization mechanisms to guarantee the
values seen by a set of work-items in a work-group are consistent.
• Private memory : This memory region is private to a work-item. The elements in
one work-item’s private memory are not visible to other work-items.
The OpenCL device memory and host memory supports global memory. The in-
teraction between host memory and global memory can be achieved by copying data
(clEnqueueRead/WriteBuffer) or by mapping and unmapping regions of a memory ob-
ject (clEnqueueMapBuffer/clEnqueueUnmapObject).
Meanwhile, OpenCL defines four address space qualifiers: global, local, con-
stant and private. These qualifiers, used in variable declaration, specify the region of
memory that is used to allocate the object.
• The global address space name is used to indicate that a variable is allocated on
global memory.
• The local address space name is used to indicate that a variable will be allocated
in local memory, it will be shared by all work-items of a work-group.
• The constant address space name is used to indicate that a variable is allocated
on global memory but in read-only mode.
• The other unqualified variables inside a kernel function are private, these vari-
ables are only performed READ/WRITE by actual work-items.
Parallel models
OpenCL supports two different parallel programming models: data parallelism and task
parallelism.
Data-Parallel Programming Model The basic idea of data-parallel programming is
operating concurrently on a collection of data elements. When an OpenCL kernel is
launched by host code, OpenCL index space automatically maps onto OpenCL memory
26
objects, which means the identification of each work-item ( global ID or local ID ) also
maps onto memory objects. If the kernel doesn’t contain any branch statements, each
work-item will execute identical operations on a subset of data items (depends on its
global ID). If branch statements exist in a kernel, each work-item still execute the same
program, but may accomplish different computing tasks. OpenCL also defines vector
instructions and types to support Single Instruction Multiple Data (SIMD) models.
Task-Parallel Programming Model OpenCL permits that several submitted kernels
can be executed at the same time. In this case, programmers must define and manipulate
the execution of concurrent tasks for each computing devices. Because the tasks and the
computing capacities of devices vary widely, distributing them so that they all finish at
about the same time can be difficult.
In this thesis, we focus on dealing with massive data-parallel tasks which are too large
for single computing device’s memory. We split large tasks into several small subtasks.
Then we use task-parallel programming model to schedule the execution of subtasks.
Based on this idea, we developed STEPOCL. It can generate workload partitioning and
scheduling modules automatically in C language.
Execution of OpenCL program
Overall, a typical execution of an OpenCL application should be follows the steps below:
• First of all, a CPU host initiates OpenCL initiates OpenCL environment where the
contexts, compute devices, program objects and command-queues are defined.
• CPU host defines an N-dimensional index space over some regions of DRAM mem-
ory where the input data allocated. Every index of this N-dimensional index space
will be a work-item and each work-item executes the same kernel.
• CPU host defines the size of work-group for these work-items. Each work-item in a
work-group shares a common local memory. If the size of work-group is undefined,
OpenCL implementation will determine itself how to be break the global work-items
into appropriate work-group instances, but the performance can be unpredictable.
• The global memory supported by OpenCL devices will load DRAM memory, then
OpenCL devices execute each work-groups. On NVIDIA hardware, the multiproces-
sor will execute 32 threads (32 threads is a ”warp group”) at once. If the work-group
contains more threads than this, they will be serialized.
• When the execution of kernel is finished, the memory object produced from the
execution will be copied back to DRAM memory.
3.1.2 Challenge of using multiply compute devices
More compute devices means more parallelism, which leads to better performance. How-
ever, developing an application for multi-device architectures is difficult.
27
=×
A[4][4] Work-items in
kernel index space
B[4][4] C[4][4]
Figure 3.3 – Matrix production on single device
Scheduling is totally hand-made by the programmer
In OpenCL, the process of utilising multiple devices for kernel computation is not done
automatically. Programmer should manually controls the process on host code. When
using a single device, programmer submits all kernels to the command queue associated
with that device. While using multiple devices, programmer has to create a command
queue for each device. These command queues either share a same context or each of
them has its own context. Then programmer must decide how to distribute the kernels
across multiple devices. One of the challenges which has a great impact on the per-
formance of parallel application on heterogeneous multi-device architectures is workload
balancing. In order to benefit from the full potential of heterogeneous architecture, pro-
grammers are responsible for partitioning and distributing the workload across multiply
devices and avoiding the situation in which some devices overloaded while others are un-
derutilized. Manually handling the workload balancing requires that programmers have a
solid knowledge of hardware architecture and a rich experience of parallel programming.
Data partitioning
An OpenCL application can contain several kernels. Once these kernels are distributed to
different compute devices for the execution, programmers should schedule and synchronize
the execution process. If the OpenCL application is data parallel, programmers should
manually divide the kernel space and the corresponding data space. Sometimes a precise
data partition can be very complicated. Taking a matrix multiplication problem as an
example.
Given two matrix A[n][m] and B[m][p], the matrix production A∗B is a matrix C[n][p]
where the element C[i][j] in matrixC is calculated as below:
C[i][j] =
m∑
k=1
A[i][k] ∗B[k][j] (3.1)
Figure 3.3 illustrates the production of two matrices A and B on single OpenCL
device. Each work-item Iij calculates a element C[i][j] through reading all elements of
matrix A in row i and all elements of matrix B in column j. All data of matrix A, B and
C are allocated on the same physical memory of a OpenCL device. However, if we want
to develop a parallel multiply device version, matrix A, B and C should be split into
28
=×
A[4][4] Work-items in
kernel index space
B[4][4] C[4][4]
Dev 1 Dev 2
Dev 3 Dev 4
Dev 1
Dev 3 Dev 4
Dev 2Dev 1 & Dev 2 
Dev 3 & Dev 4 
Dev 1
   &
Dev 3
Dev 2
   &
Dev 4
Figure 3.4 – Matrix production on multiple device
Work-item
Data 
Border region 
device a (CPU) device b (GPU)
Replicated region 
Partition workload into two devices
Figure 3.5 – Data consistency
submatrix. Then these submatrix will be allocated on different device memories for the
execution of kernel. Figure 3.4 illustrates the workload partition of matrix multiplication
problem using four OpenCL devices. The work-items are split into four sets, each set
calculates a part of matrix C on one of four OpenCL devices. In order to follow the data
dependence, matrix A, B and C are equally split into submatrix which are distributed
(or duplicated) to difference devices for further execution. Programmer is responsible for
manually estimating the size and position of subregions where the data of submatrix are
stored. Due to data dependence, matrix A, B and C are split into different shapes, man-
ually estimating the size and position of these submatrix will become more complicated
if more heterogeneous devices are used.
29
Communication
Another challenge is to maintain data consistency across multiple device memory. List-
ing 3.1 presents a sequential 2D-stencil iteration. Each update calculates a weighted
average of the old cell with its neighbours in the six cardinal direction. In a parallel
version for heterogeneous multi-device architecture, the original data will be split into
several chunks (the number of chunks is equal to the adequate number of devices follow-
ing their availability and capabilities.) and be transferred into corresponding dedicated
device memory. Figure 3.5 illustrates a possible workload partition for a Jacobi itera-
tion. The data space are split into two subsets. However, due to the data dependencies,
two subsets have overlapped areas which is a replicated region in Figure 3.5. In order
to maintain data consistency across multiple device memory, programmers should have
correctly synchronized the preview computation and update the replicated region before
each launch of OpenCL kernel for next Jacobi iteration.
However implementing these features is time-consuming and error-prone, there is a
huge demand for programming tools that help the novices designing applications for
heterogeneous multi-device systems.
Listing 3.1 – Sequential 2D-stencil iteration
f o r ( i n t n = 0 , n < t , n++)
{
f o r ( i n t y = 1 ; y < DIM Y − 1 ; ++y )
f o r ( i n t x = 1 ; x < DIM X − 1 ; ++x ){
b [ y ] [ x ] = alpha ∗ a [ y ] [ x ] + beta ∗
( a [ y − 1 ] [ x ] + a [ y + 1 ] [ x ] +
a [ y ] [ x − 2 ] + a [ y ] [ x + 2 ] ) ;
}
swap ( a , b ) ;
}
3.2 Component of STEPOCL
STEPOCL programming tool is created for facilitating the development of OpenCL
applications. It provides an environment for developing efficient and portable parallel ap-
plications on multiple heterogeneous OpenCL devices. Programmer only need to focus on
developing the OpenCL computing kernel and let STEPOCL to manage data partition,
workload balancing and data consistency.
As illustrated in Figure 3.6, the core component of STEPOCL is a code generator,
which takes as input a list of raw OpenCL kernels (see Subsection 3.2.1) together with a
configuration file (see Subsection 3.2.2) which describes:
• the layout of the data, which expresses how the data shall be split among the
devices,
• the association between a compute kernel and a specific device type (see Subsec-
tion 3.2.2), and
• the expected control flow of the application to generate.
30
Output
STEPOCL configuration
file
Inputs
STEPOCL
Code Generator
Full multi-device 
OpenCL application
CPU Xeon Phi
STEPOCL runtime
GPU ...
OpenCL compute kernels
Figure 3.6 – Overview of the STEPOCL environment.
Based on this input, STEPOCL then generates a complete OpenCL program (see
Subsection 3.2.3) able to exploit concurrently different accelerators, e.g., CPU, GPU and
Xeon Phi that runs on top of the STEPOCL runtime.
At bootstrap, the generated code sets up a unified OpenCL multi-device environment
and distributes the workload among the accelerators based on the results of the oﬄine
profiler. Then, during the run, the generated code maintains data consistency between
the devices by using the result of a data flow analysis performed during compilation.
3.2.1 Compute kernels
The input kernels are regular OpenCL kernels, which express the computation to execute
on a device. As a basic example for the remaining of this Section, we provide a 1D-stencil
kernel written for a single generic device in Listing 3.2.
Listing 3.2 – Generic OpenCL 1D-stencil kernel.
k e r n e l void g e n e r i c s t e n c i l 1 D ( g l o b a l f l o a t ∗A, g l o b a l f l o a t ∗B) {
const unsigned i n t i = g e t g l o b a l i d (0 ) + 1 ;
B[ i ]= (A[ i−1]+A[ i ]+A[ i +1] )/3 ;
}
The developer may prepare several versions of the compute kernel to achieve the best
performance on a specific device type. For instance, in order to favour data cache effects
of a CPU in the 1D-stencil, it is more efficient to process data by tile, i.e. by block
of elements instead of element-by-element, as presented in Listing 3.3. In this case, the
computation performed by the CPU remains the same, but the amount of work-items
differs.
Listing 3.3 – Tiled OpenCL 1D-stencil kernel.
k e r n e l void t i l e d s t e n c i l 1 D ( g l o b a l f l o a t ∗A, g l o b a l f l o a t ∗B) {
const unsigned i n t i = g e t g l o b a l i d (0 ) + 1 ;
f o r ( i n t k = 0 ; k < 4 ; k ++)
B[ i+k]= (A[ i+k−1]+A[ i+k]+A[ i+k +1])/3 ;
}
31
3.2.2 STEPOCL configuration file
The configuration file takes the form of a tree, implemented in XML format. It defines
the arguments of the kernel, how the kernels are mapped to the device type, and the
control flow of the program.
Argument
An argument describes an array of values, transferred to the devices and accessed by the
work-items. It is defined by three elements:
• A name (ID), later used in the kernel section to refer to the argument.
• The type (data_type) of its tokens.
• Its size, which further specifies the dimension and length of the array.
For instance, the configuration file given in Listing 3.4 defines the arguments of our
1D-stencil example: arrays A and B are two 1D vectors, which contain 1026 float values
each.
Listing 3.4 – Argument description of the 1D-stencil kernel.
<argument>
<ID> A </ID>
<data type> f l o a t </data type>
<a r g s i z e>
<d im s i z e a x i s=x> 1026 </dim s ize>
</a r g s i z e>
</argument>
<argument>
<ID> B </ID>
<data type> f l o a t </data type>
<a r g s i z e>
<d im s i z e a x i s=x> 1026 </dim s ize>
</a r g s i z e>
</argument>
Kernel
The kernel section of the configuration file describes the mapping between the kernels
and the device types. It contains three elements:
• A name element. It references the kernel in the subsequent control flow section.
• A data_split element. For each argument, it gives along which dimension its
related data should be split.
• At least one implem element. An implem element associates a compute kernel code
to a specific device and defines the size of a tile, i.e. the size of the data to compute
by each work-item. The default tile size is 1.
32
The kernel section may contain several implem elements in order to associate different
kernel codes (funcname) to different devices (device_type). The DEFAULT implem is
used when no other implem corresponds to the device.
As introduced in Section 2.2.2, the programmer can also specify the size of the work-
groups (work-group) used to design a specific kernel version.
Listing 3.5 presents the kernel configuration of the 1D-stencil. The user expresses
that arrays A and B shall be split by column (i.e. split along the x axis). Three versions
of the kernel are provided. The first version (the tiled_stencil1D kernel) targets CPU
devices and performs its computation on tiles of four elements. The second version (the
GPU_stencil1D kernel) is designed for GPU devices. It specifies the aggregation of work-
items by groups of 16 along the x axis, which means that a group of 16 work-items can
share their memory. The third kernel version (the generic_stencil1D kernel) is the
more basic as it uses neither tile nor work-group. This version is also the more generic
as it can be used on any other available device.
Listing 3.5 – 1D-stencil kernel information.
<kerne l>
<name> Stenc i l1D </name>
<d a t a s p l i t>
<a x i s ID=B> x </axis>
<a x i s ID=A> x </axis>
</d a t a s p l i t>
<implem>
<platform>I n t e l (R) OpenCL</platform>
<dev ice type> CPU </dev ice type>
<funcname> t i l e d s t e n c i l 1 D </funcname>
<t i l e >
<t a rge t a rg> B </ta rge t a rg>
<s i z e a x i s=x> 4 </s i z e>
</ t i l e >
</implem>
<implem>
<platform>NVIDIA CUDA</platform>
<dev ice type> GPU </dev ice type>
<funcname> GPU stencil1D </funcname>
<work group>
<s i z e a x i s=x> 16 </s i z e>
</work group>
</implem>
<implem>
<platform>I n t e l (R) OpenCL</platform>
<dev ice type> ACCELERATOR </dev ice type>
<funcname> Acc stenc i l 1D </funcname>
</implem>
</kerne l>
Control
The third component of the STEPOCL configuration file describes the program control
flow. The control flow is basically the meta-algorithm of the application and is used by
the data analysis pass to ensure consistency. This component describes the number of
iterations of the kernel to launch (loop keyword) and how data are exchanged between
two iterations (switch keyword). For instance, in Listing 3.6, the STEPOCL application
executes 10 times the Stencil1D and has to switch the arguments A and B between two
iterations.
33
Listing 3.6 – 1D-stencil kernel execution.
<cont ro l>
<loop i t e r a t i o n s =10 >
<exec> Stenc i l1D </exec>
<switch>
<arg ID=A> B </arg>
</switch>
</loop>
</cont ro l>
3.2.3 STEPOCL Output of OpenCL application
Based on the kernel(s) and on the configuration file, STEPOCL generates a multi-device
OpenCL host program. The generated codes are saved in two files run.c and it’s header
file run.h. run.c predefines the modules of initialisation of OpenCL environment, data
partition, communication and scheduling. Function func in file run.c is the core program
which defines the execution process of host code.
Listing 3.7 – run.c of 1D-stencil code
#inc lude ”run . h”
/∗∗∗∗∗ p r e d e f i n i t i o n o f f unc t i ons ∗∗∗∗∗∗/
. . . . . . . .
/∗∗∗∗∗ hos t code ∗∗∗∗∗∗/
void func ( f l o a t ∗ h idata , f l o a t ∗ h odata ){
. . . . . . . .
}
Listing 3.7 and listing 3.8 presents the method of using generated code. We take
still 1D-stencil computation as an example. Once STEPOCL generated the host code,
the only thing left for programmer is to initialize the data of input in main.c file. In
listing 3.8, we initialized two arrays h idata and h odata, then we called function func for
the 1D-stencil computation. According to predefined configuration file, func will manage
automatically all technical aspect of parallel programming problems that we mention in
section 3.1.2.
Listing 3.8 – main.c of 1D-stencil code
#inc lude ”run . h”
#d e f i n e XDIM 1000002
#d e f i n e LINESIZE XDIM
#d e f i n e TOTALSIZE XDIM
i n t main ( i n t argc , char ∗ argv [ ] ) {
f l o a t ∗ h idata = NULL ;
f l o a t ∗ h odata = NULL ;
const unsigned i n t mem size = TOTALSIZE∗ s i z e o f ( f l o a t ) ;
h ida ta = ( f l o a t ∗) mal loc ( mem size ) ;
h odata = ( f l o a t ∗) mal loc ( mem size ) ;
f o r ( unsigned i n t i = 0 ; i < TOTALSIZE; i ++){
h idata [ i ]= i ;
h odata [ i ] = 0 . 0 ;
}
func ( h idata , h odata ) ;
r e turn 0 ;
}
34
3.3 STEPOCL internal mechanism
3.3.1 Automatic device management
In a regular OpenCL program, programmer is responsible for manually initiating OpenCL
environment, choosing available device, creating context and then associating kernels
and command queues for each device. STEPOCL automatically generates these parts.
According to the specification described in Figure 3.5, the generated host code searches
devices which matches the specified requirement from different OpenCL platforms. On
an OpenCL platform, host code creates a context for each type of compute devices. Then
each context associates an OpenCL kernel and each device is associated to a command
queue. Figure 3.7 shows the concept of device management in OpenCL environment.
STEPOCL user is also able to use one context for all devices on the same platform.
For example, in listing 3.9, the device type defined under the platform is ALL. Thus
STEPOCL will create a context for all devices on the same platform, each devices run
the same kernel(Figure 3.8).
Begin
Nvidia platformIntel platform
CPU Acc. Acc. GPU GPU GPU
Context 1 Context 2 Context 3
Kernel
 for Intel CPU
Kernel
 for Intel Acc.
Kernel
 for Nvidia GPU
Q
ueue 2
Q
ueue 3
Q
ueue 4
Q
ueue 5
Q
ueue 6
Q
ueue 1
Figure 3.7 – Device management: using different kernel
Listing 3.9 – 1D-stencil kernel information.
<kerne l>
<name> Stenc i l1D </name>
<d a t a s p l i t>
<a x i s ID=B> x </axis>
<a x i s ID=A> x </axis>
</d a t a s p l i t>
<implem>
<platform>NVIDIA CUDA</platform>
<dev ice type> ALL </dev ice type>
<funcname> s t e n c i l </funcname>
<work group>
35
<s i z e a x i s=x> 16 </s i z e>
</work group>
</implem>
<implem>
<platform>I n t e l (R) OpenCL</platform>
<dev ice type> ALL </dev ice type>
<funcname> s t e n c i l </funcname>
</implem>
</kerne l>
3.3.2 Automatic workload partition
STEPOCL runtime evaluates the performance of selected compute devices, and then
automatically performs the workload partition. STEPOCL profiler (see Section 4.2)
collects the static data of compute devices with OpenCL API, then each selected device
is associated with a ratio which represents its relative performance capacity on whole
system.
Begin
Nvidia platformIntel platform
CPU Acc. Acc. GPU GPU GPU
Context 1 Context 2
Kernel for 
general devices
Q
ueue 2
Q
ueue 3
Q
ueue 4
Q
ueue 5
Q
ueue 6
Q
ueue 1
Figure 3.8 – Device management: using a common kernel
Based on the ratio and on the tile size (provided by the configuration file, see Sec-
tion 3.2.2) of each device, the generated code computes the proportion of workload for each
device, then partitions kernel space and data space (see Section 4.3) with corresponding
proportion. Meanwhile, the compiler of STEPOCL (see Section 4.1.1) launches region
analysis on OpenCL kernel, the analyse result is used for data partitioning and managing
the data consistency.
36
3.3.3 Automatic adjusting workload
Iterative applications, which are typical in HPC, relaunch kernels until the result meets
the requirement. For example, in each time step n of Jacobi iteration in listing 3.1,
OpenCL kernel will be executed on each selected compute device. STEPOCL runtime
keeps watching over the status of execution on each device, then it adjusts the workload
(if necessary) to keep the load balancing on whole systems (see Section 4.4).
3.3.4 Automatic data consistency management
In OpenCL, data consistency on global shared memory can be achieved with barrier
or fence. However, this mechanism doesn’t work across multiple devices. Programmers
have to manually synchronize the execution of kernel using command queues, then update
intermediate results on dedicated memory of each device. With the information of region
analysis and the proportion of data partition, STEPOCL estimates the size and position
of regions where the data need to be updated. Then, STEPOCL automatically generates
the module of kernel synchronization and the module of communication for intermediate
result update (see Section 4.5).
3.4 Conclusion
In this chapter, we have introduced STEPOCL which is our answer to the existing
problems of programming on heterogeneous architectures. STEPOCL automatically
generates the host code of OpenCL, programmers only need to provide compute kernels
and a configuration file where the information of input data and execution are specified.
The elements and attributes defined in configuration file are familiar to OpenCL program-
mer since STEPOCL uses the same concepts of OpenCL objects. The generated host
code is capable of performing workload partition, workload balancing and management
of data consistency on any multiple devices heterogeneous architectures.
37
38
Chapter 4
Implementation
Plus tu travailles dur, plus tu es
chanceux.
—Thomas Fuller
Contents
4.1 Analysis of region . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1.1 PIPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.1.2 Analysis of regions with PIPS . . . . . . . . . . . . . . . . . . . 40
4.1.3 Analysis of OpenCL kernel . . . . . . . . . . . . . . . . . . . . 42
4.2 Oﬄine profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.3 Workload partition . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3.1 Data space partition . . . . . . . . . . . . . . . . . . . . . . . . 46
4.4 Workload balancing . . . . . . . . . . . . . . . . . . . . . . . . 48
4.5 Data transmission between multiple devices . . . . . . . . . 50
4.6 Generation of Host code . . . . . . . . . . . . . . . . . . . . . 52
4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
This chapter presents the main modules implemented in STEPOCL including anal-
ysis of regions, oﬄine profiling, workload partition, workload balancing, communication.
The results of region analysis and the information of oﬄine profiling are used for workload
partitioning and the management of communication. The module of workload balancing
is implemented for monitoring the status of each computing devices, STEPOCL adjusts
the proportion of workload on each device if the unbalanced status is detected. The
structure of generated code are presented at the end of chapter.
4.1 Analysis of region
Workload partitioning requires precise array data flow analysis. In this section, we con-
cisely introduce compiler PIPS [16], a compiler, and how to analyse array data flow that
allows us to generate them.
39
INPUTS C Fortran 77
ANALYSES Array Privatization Array Section Privatization Array Element Region
Control Flow Graph Continuation Conditions Dependences
Preconditions Program Complexity Reduction Detection
Transformers Use-Def Chains Call Graph
Memory Effects Scalar variable Privatization
RESTRUCTURATIONS Atomization Cloning Control Restructuration
Useless Definition Elimination Declaration Cleaning Dead Code Elimination
OUTPUTS Call Graph Control Flow Graph Dependence Graph
Parallel Fortran 77 C Optimised Fortran 77
TRANSFORMATIONS Coarse Grain Parallelization Expression Optimizations Forward Substitution
Loop Distribution Loop Interchange Loop Normalize
Loop Reductions Loop Unrolling Partial Evaluation
Table 4.1 – PIPS overview
4.1.1 PIPS
PIPS is a source-to-source compilation framework. It takes as input Fortran or C codes
and uses inter-procedural techniques for program analyses such as precondition [15] and
array region [9, 23] computation. PIPS can also perform code transformation such as
loop interchange, tiling or can even generate parallel Fortran or C code. Table 4.1 lists
the modules and functions of PIPS.
PIPS provides a line interface tpips to PIPS users. As a scripting language of the
PIPS project, tpips can access to Unix shell and to PIPS properties.
4.1.2 Analysis of regions with PIPS
Exact Region Over-approximation
(convex polyhedron)
Figure 4.1 – Over-approximated analysis
In PIPS, array regions are represented by convex polyhedra [8]. They are used to
summarise accesses to array elements. Due to region representation, the analysis can be
over-approximated as Figure 4.1 presented. In PIPS, EXACT region represents exactly
40
matched array region, and MAY region represents the array region which can be over-
approximated. The READ and WRITE regions represent the effects of statements and
procedures on sets of array elements. PIPS also provides IN and OUT regions analysis
to represent array data flow. For a block of statements or a procedure, an IN region
is the subset of the corresponding READ region containing the array elements that are
imported and an OUT region is the subset of the corresponding WRITE region that will
be read outside of block.
Program example for array regions analysis. The program given in the List-
ing 4.1 is a C program which doubles the value of elements in a one-dimensional array.
The Main function initializes a one-dimension array with ten elements, then calls a loop
to update the elements in the array.
Listing 4.1 – C program for array regions analysis
#inc lude <s t d i o . h>
#d e f i n e N 10
void loop ( i n t l , i n t u , i n t array [N] )
{
i n t i ;
f o r ( i=l ; i<u−1; i++)
{
array [ i ]=2∗ i ;
}
}
i n t main ( void )
{
i n t array [N ] ;
i n t i , j ;
f o r ( i =0; i<N; i++)
{
array [ i ]= i ;
}
loop (0 ,N, array ) ;
r e turn 0 ;
}
Array region analysis on this example. Based on the program on Listing 4.1,
PIPS analyses array region at every statement level of the abstract syntax tree. For
example, an analysis result inside the loop function can be seen in Listing 4.2.
Listing 4.2 – Array regions analysis by PIPS
#inc lude <s t d i o . h>
#d e f i n e N 10
// <array [PHI1]−W−MAY−{0<=PHI1 , PHI1<=9, l ==0, u==10}>
// <array [PHI1]−OUT−MAY−{0<=PHI1 , PHI1<=9, l ==0, u==10}>
void loop ( i n t l , i n t u , i n t array [N] )
{
i n t i ;
// <array [PHI1]−W−MAY−{0<=PHI1 , PHI1<=9, l ==0, u==10}>
// <array [PHI1]−OUT−MAY−{0<=PHI1 , PHI1<=9, l ==0, u==10}>
f o r ( i=l ; i<u ; i++)
{
// <array [PHI1]−W−EXACT−{PHI1==i , l ==0, u==10, 0<=i , i<=9}>
// <array [PHI1]−OUT−MAY−{PHI1==i , l ==0, u==10, 0<=i , i<=9}>
array [ i ]=2∗ i ;
41
}
}
i n t main ( void )
{
i n t array [N ] ;
i n t i , j ;
// <array [PHI1]−W−EXACT−{0<=PHI1 , PHI1<=9}>
f o r ( i =0; i<N; i++)
{
// <array [PHI1]−W−EXACT−{PHI1==i , 0<=i , i<=9}>
array [ i ]= i ;
}
loop (0 ,N, array ) ;
r e turn 0 ;
}
In Listing 4.3, PHI1 represents the index of array array. The WRITE region (repre-
sented by W ) of array is exactly the sub-array array[PHI1] with: PHI1 == i, 0 ≤ i, i ≤
9. The symbol OUT means that array is updated in current block or statement, it might
be transmitted or updated in later.
Listing 4.3 – Array regions analysis by PIPS
// <array [PHI1]−W−EXACT−{PHI1==i , l ==0, u==10, 0<=i , i<=9}>
// <array [PHI1]−OUT−MAY−{PHI1==i , l ==0, u==10, 0<=i , i<=9}>
The analysis of READ, WRITE, IN and OUT provide the information of data de-
pendencies and the size of each region. In our work, we use PIPS to perform region
analysis on OpenCL kernels, because we require the information of region size to parti-
tion the workload, we also require the information of data dependencies to perform data
exchanges.
4.1.3 Analysis of OpenCL kernel
In the scope of STEPOCL, OpenCL kernels are written in OpenCL C programming
language which is based on the ISO/IEC 9899:1999C language specification (also known
as C99 specification) with specific extensions and restrictions. Some features of OpenCL
C are not supported by the PIPS compiler, thus some data types or functions should be
redefined before passing region analysis with PIPS.
OpenCL type for application Corresponding C type
cl char char
cl uchar unsigned char
cl short short
cl int int
cl uint unsigned int
cl long long
cl ulong unsigned long
cl float float
cl double double
cl half half
Table 4.2 – OpenCL scalar data type
42
Table 4.2 describes the list of OpenCL scalar data types that should be converted
before using PIPS. OpenCL also provides vector data types. They are defined with the
type name (char, int and float etc.) followed by a number n that defines the number
of elements in the vector. These vector data types presented in Table 4.3 should be
converted into corresponding scalar data types to adapt PIPS’s C parser.
OpenCL type for application Description
cl charn A vector of n 8-bit signed two’s complement integer values.
cl ucharn A vector of n 8-bit unsigned integer values
cl shortn A vector of n 16-bit signed two’s complement integer values.
cl ushortn A vector of n 16-bit unsigned integer values.
cl intn A vector of n 32-bit signed two’s complement integer values.
cl uintn A vector of n 32-bit unsigned integer values.
cl longn A vector of n 64-bit signed two’s complement integer values.
cl ulongn A vector of n 64-bit unsigned integer values.
cl floatn A vector of n 32-bit floating-point values.
cl doublen A vector of n 64-bit floating-point values.
Table 4.3 – OpenCL vector data type
The address space qualifiers (in Section 3.1.1) and built-in work-item functions are
redefined beside PIPS for region analysis at single work-item level. Listing 4.4 and List-
ing 4.5 present a classic 1D-stencil kernel and its result of region analysis. Each work-item
performs three READ actions on table A and one WRITE action on Table B. For each
work-item, the relative interval of READ action on table A is [−1, 1] in one dimension
and the interval of WRITE action on table B is [0,0]. Meanwhile, the interval of READ
action on table A presents the range of ghost region. When the large region is split into
subregions, each subregion shall keep the extra ghost region for computational purpose.
In some applications, such as stencil iteration computation, the ghost regions (as inter-
mediate results) need to be updated from other subregion after each iteration. Thanks to
the information of data dependency provided by region analysis, STEPOCL can manage
automatically the communication of intermediate result.
Listing 4.4 – 1D-stencil kernel
k e r n e l void g e n e r i c s t e n c i l 1 D ( g l o b a l f l o a t ∗ A, g l o b a l f l o a t ∗ B)
{
const unsigned i n t i = g e t g l o b a l i d (0)+1;
A = A+1;
B = B+1;
B[ i ] = (A[ i−1]+A[ i ]+A[ i +1] )/3 ;
}
Listing 4.5 – Analysis of 1D-stencil kernel
k e r n e l void g e n e r i c s t e n c i l 1 D ( g l o b a l f l o a t ∗ A, g l o b a l f l o a t ∗ B)
{
const unsigned i n t i = g e t g l o b a l i d (0)+1;
A = A+1;
B = B+1;
// <A[PHI1]−R−EXACT−{PHI1==i+1, i==0}>
// <A[PHI1]−R−EXACT−{PHI1==i , i==0}>
// <A[PHI1]−R−EXACT−{PHI1==i−1, i==0}>
// <B[PHI1]−W−EXACT−{PHI1==i , i==0}>
B[ i ] = (A[ i−1]+A[ i ]+A[ i +1] )/3 ;
}
43
cl device info Description
CL DEVICE GLOBAL MEM CACHE SIZE Return type: cl ulong
Size of global memory cache in bytes.
CL DEVICE GLOBAL MEM CACHE TYPE Return type:cl device mem cache type
Type of global memory cache supported.
CL DEVICE MAX CLOCK FREQUENCY Return type: cl uint
Maximum configured clock frequency of the device in MHz.
CL DEVICE MAX COMPUTE UNITS Return type: cl uint
The number of parallel compute cores on the OpenCL device.
The minimum value is 1.
CL DEVICE NAME Return type: char[]
Device name string.
CL DEVICE PLATFORM Return type: cl platform id
The platform associated with this device.
Table 4.4 – Get information about an OpenCL device
4.2 Oﬄine profiling
In order to ensure the load balancing on heterogeneous architectures, it is important to
have the information about the problem size and the hardware configurations. STEPOCL
oﬄine profiling is used for evaluating the performance of OpenCL devices.
Modern multi-core processors are very complex architectures. There are many factors
that affect a processor performance and the performance model varies among different
processor types. For example, CPU has less latency issues than GPU. Some factors,
such as the impact of cache architecture and the register size, are difficult to analyse.
Meanwhile, the input data size also affects the efficiency of performance.
In order to simplify the problem, STEPOCL assumes that developer want to use
the full potential of whole heterogeneous platforms, which means that the input data
set is large enough to saturate all computing devices. The performance model used by
STEPOCL is expressed in a simple form as Equation 4.1.
Performance = CoreCount ∗ Frequency ∗ InstructionPerCycle
InstructionCount
(4.1)
Considering that OpenCL applications are massively parallel, the Instruction per cy-
cle for each core should be 1. Meanwhile, the instruction count should be equal for all
compute devices, since they execute the same kernel. Thus, there are only Core count and
frequency that we shall consider. OpenCL API clGetDeviceInfo provides the function-
ality that permits programmers to access the critical attributes of an available compute
device. Table 4.4 lists a part of information that can be accessed through clGetDeviceInfo
function.
STEPOCL assesses the performance of each device i (perfi) during the initializa-
tion of OpenCL environment. The performance of whole heterogeneous system can be
expressed as the sum of each perfi (in Equation 4.2, n is the number of available compute
devices).
perf system =
n∑
i=0
perfi (4.2)
At the moment of workload partition, the proportion of workload for each device propi
44
0 1 2 3 4 5 6 7
Device 0 Device 1
Work-items
Data  A A0 A1 A0 A1
Figure 4.2 – Workload partitioning for multi-device platform with single context
A0 A1
0 1 2 3 4 5 6 7
Device 0 Device 1
Work-items
Data  A
Figure 4.3 – Workload partitioning for multi-device platform with multiply contexts
is calculated in Equation 4.3 .
propi =
perfi
perf system
(4.3)
4.3 Workload partition
There are two methods for partitioning an OpenCL workload:
• Redundantly copy all data to each computing device and index the work-items using
global offsets (Figure 4.2).
• Split the data into subsets and index the work-items into the subset (Figure 4.3).
The first method uses only one OpenCL context for all computing devices. The
benefit of this method is that it is an easy way to manage data, developers do not have
to manually code the data partition for each device. However the shortcomings are also
obvious. This method can not work on a mixed platform (e.g. combining an Intel CPU
with Nvidia GPUs and AMD GPUs). Meanwhile, redundantly copying all data to each
computing device consumes a lot of memory resources. The memory available on each
computing device can be different, some devices may not even have enough space to
allocate the whole data.
The second method is to create redundant OpenCL contexts for computing devices.
This method can directly access all computing devices on different OpenCL platforms.
45
Original table Split on axis y
P0
P1
P2
P0 P1 P2
P0
P1
P2
x
y
z
Split on axis x Split on axis z
Figure 4.4 – Available data partitioning over multiple devices
The utilization of device’s memory is also more efficient than the first method, Since no
redundant data is allocated on the device. STEPOCL adopts the second method, and
automatically generates the code of data partition.
4.3.1 Data space partition
The data space is partitioned with the pre-estimated proportion thanks to the profiler
(Section 4.2), however the details of partition depends on the description of STEPOCL
configuration file. Figure 4.4 presents the possible partitions of a 3D data for three
computing devices.
Ghost region
For some applications, updating an element of data region requires neighbouring re-
gion. In order to localize the computation, the subregions that are split from original
region should contain the replicated neighbouring regions (called ghost region). Fig-
ure 4.5 presents the data partition of a 2D table for four computing devices with ghost
region.
The size of ghost region can be calculated from the result of region analysis (Sec-
tion 4.1). The relative interval (α0, α1) of READ action of a single work-item represents
exactly the border information of ghost region.
Partitioning process
Assuming that we have N devices, the proportion of performance for each devices Pi is
propi, the projection of ghost region on axis axisn is (α0, α1), the data size on axisn is
Dsize, the tile size on axisn for each device Pi is tilei and the size of work-group on axisn
for device Pi is wgi. Then the data partition on axisn will be achieved in following steps:
First of all, the data used for computation Dsize c is computed as below.
Dsize c = Dsize− α0 − α1 (4.4)
According to the statistics of device performance, DataP0 (the preallocated data for
device P0) is calculated in equation 4.5
DataP0 = bDsize c ∗ prop0c (4.5)
46
P0
P1
P2
P3
Original table Split in row
P1 P2P0 P3
Split in column
Ghost region
Figure 4.5 – Data partition with ghost region
Next, adjusting the data size DataP0 to match the tile size. The tile size described in
STEPOCL configuration file represents the size of data to compute by each work-item.
We must ensure that each work-item can access to its own data region. Thus the size
of data must be divisible by the tile size. Equation 4.6 presents the process of adjusting
DataP ′0.
DataP ′0 = DataP0 −DataP0%(tile0 ∗ gwi) (4.6)
Then the size of data that will be allocated on device P0 is DataP
′
0 with its ghost
region (Equation 4.7 ).
DataOnP0 Axisn = α0 +DataP
′
0 + α1 (4.7)
Finally, the data size for the remaining devices can be calculated step by step in the
same way as DataOnP0 Axisn. However if the data space is a multi-dimensional table,
the final data size for device t will be calculated as Equation 4.8. In Equation 4.8, n
represents the number of dimension. Figure 4.6 illustrates a complete partition process
on one dimension.
DataOnPt =
n−1∏
i=0
DataOnPt Axisi (4.8)
After data partitioning, each subregion contains the following information: the global
ID, the information of written data region (data region) and the information of ghost
region (ghost region). The global ID of subregion can be represented with relative ID
(re ID x, re ID y) and device partitioning information (the number of devices in the first
dimension numdev x, the number of devices in the second dimension numdev y).
global ID = re ID y × numdev y+
re ID x(2D : global ID)
(4.9)
47
data
α0 α1
prop0
 P0
data
α0 α1
prop0
 P0
data
α0 α1 P0
data
α0 α1
 P0
α1
data
α0 α1
 P0
α1 α1α0
 P1
α0
 Pn-1
Matching tile size
Padding data if necessary
Step1 :
Step2 :
Step3 :
Step4 :
Step5 :
Figure 4.6 – The process of Data partition
global ID = re ID z × numdev y × numdev x+
re ID y × numdev y+
re ID x(3D : global ID)
(4.10)
STEPOCL uses these IDs to locate the neighbouring IDs for each subregion and
keeps these neighbouring information in a data structure called neighbour list.
4.4 Workload balancing
This section presents the mechanisms that are used by STEPOCL to dynamically adjust
the workload of each computing devices. The first workload partitioning may be not
perfectly balanced. STEPOCL follows the status of several executions at the beginning,
then it will adjust the proportion of workload if unbalances are detected.
In order to distinguish data ratio from performance ratio propi, we use r
0
i to represent
the initial data ratio that has to be deployed on the device i at the first time of workload
partition.
Once the computing kernels are executed, STEPOCL runtime observes the elapsed
time dji to perform the iteration j on each device i. After each iteration, STEPOCL
runtime copies all the output arguments that have to be switched with an input argument
in the main memory, re-estimates a new ratio rj+1i based on the elapsed time and the
previous ratio rji , and redeploys the input arguments on the device.
48
Task
CPU
GPU1
Xeon Phi
Partition
Execution
Subtask a
Subtask b
Subtask n
Subtasks
<d0j-1, d1j-1, …, dnj-1>
w0jw1j
wnj
Figure 4.7 – Overview of the load balancing algorithm used in the STEPOCL runtime.
Figure 4.7 summarizes the mechanism. In the Figure, wji represents the workload of
the device i at the iteration j, i.e., the size of the input data multiplied by rji .
A naive way to adjust the ratios after each iteration would be to consider the ratio
between dj, the mean duration at iteration j, and dji , for instance, by defining r
j+1
i as
rji × 1 + dj/dji . The main drawback we may encounter by scheduling with this method
is the occurrence of the ping-pong effect illustrated in Figure 4.8. Let us consider two
homogeneous devices A and B. Let us assume, for instance, that due to the instability
caused by cache misses or some other reasons, the evaluation of the duration of iteration
j for CPUa is over estimated. Then at iteration j + 1, the naive re-balancing formulae
will assign more work-items to CPUb than to CPUa. As a result, at the iteration j, the
oﬄine profiler considers that CPUb works slowly and underloads it. We have observed
that the naive re-balancing formulae can overload CPUa at iteration j + 2 and that this
phenomenon may be amplified.
In order to reduce the risk of this ping-pong effect, STEPOCL adapts the changing
speed of the ratio adjustment from one iteration to another. The main idea behind this
method is to reduce this changing speed whenever an inversion of the direction of variation
of rji is detected. To this end, a value Q, initialized to 1, is incremented after each such
Iteration Workload Partition
j CPU bCPU a
j+1 CPU bCPU a
j+2 CPU bCPU a
Figure 4.8 – Ping-pong effect of workload adjustment.
49
inversion.
We consider that when Q increases, i.e., that after few inversions, the re-balancing
factor dj/dji should become less important because the workloads are converging to an
efficient configuration. For this reason, we define the ratio for iteration j and for device
i as follows:
rj+1i := r
j
i
(
1 +
1
Q
×
(
dj
dji
− 1
))
;
The self-adjustment process keeps on running until σj, defined as the standard de-
viation of execution time on each device for iteration j, becomes small enough. In
STEPOCL, we consider that the workload is ”calibrated” when σj < 0.05 dj. Once
the calibration is complete, the new ratios rj+1i are returned, so that the next invocations
of the generated application distribute the workload efficiently among the tested devices.
4.5 Data transmission between multiple devices
This Section introduces how STEPOCL manages data transmission between multiple
devices.
Read/write memory objects
STEPOCL uses buffer objects for the movement of data in and out of the compute
device memory, from the host’s memory. These memory objects are stored in the host
memory (typically, in RAM) or in the device memory (typically, in GRAM directly on
the graphic card). There are several functions that can be used to read and write memory
object. The Table 4.5 presents five functions that read and write buffer objects.
Function Purpose
clEnqueueReadBuffer Reads data from a buffer
object to host memory
clEnqueueWriteBuffer Writes data from host memory
to a buffer object
clEnqueueReadBufferRect Reads a rectangular portion of data
from a buffer object to host memory
clEnqueueWriteBufferRect Writes a rectangular portion of data
from host memory to a buffer object
clEnqueueCopyBuffer Enqueues a command to copy a buffer
object to another buffer object
Table 4.5 – read and write buffer objects
Exchanging the intermediate results between devices
The intermediate data that needs to be transferred from device A to device B can be
calculated as the intersection of the region that is written on device A and will be read
50
on device B (see equation 4.11).
region aTob = exact write A ∩ exact read B (4.11)
Since STEPOCL defines data region as EXACT WRITE region and ghost region
as EXACT READ region, the equation 4.11 can be redefined as:
region aTob = data region A ∩ ghost region B (4.12)
Then all the data can be transferred with the neighbor list following the process
presented in Listing 4.6.
Listing 4.6 – The data transmission between all devices
f o r each device A
load ( data reg ion A ) from device A
fo r each device B in n e i g h b o r l i s t A
load ( ghost reg ion B ) from device B
trans f e rData ( region aTob )
wait ( ) ;
OpenCL does not assume that data can be transferred directly between devices, so
commands only exist to move from a host to a device, or from a device to host. Copying
data from one device to another requires an intermediate transfer to the host. Figure 4.9
presents the procedure of data transmission between devices.
Figure 4.9 – Data transmission between two devices
The pointer of host memory cannot be simultaneously possessed by several devices,
one device cannot communicate with host until the data transmissions of other devices
have finished.
51
Listing 4.7 – Resulting READ and WRITE array regions of the 1D-stencil kernel.
k e r n e l void g e n e r i c s t e n c i l 1 D ( g l o b a l f l o a t ∗A, g l o b a l f l o a t ∗B) {
const unsigned i n t i = g e t g l o b a l i d (0)+1;
// <A[PHI1]−R−EXACT−{PHI1==i , i==1}>
// <A[PHI1]−R−EXACT−{PHI1==i+1, i==1}>
// <A[PHI1]−R−EXACT−{PHI1==i−1, i==1}>
// <B[PHI1]−W−EXACT−{PHI1==i , i==1}>
B[ i ]= i + (A[ i−1]+A[ i ]+A[ i +1] )/3 ;
}
4.6 Generation of Host code
This Section introduces the overall process of host code generation performed by STEPOCL.
Based on the kernel(s) and on the configuration file, STEPOCL generates a multi-device
OpenCL program. After the initialization of the OpenCL environment, the generated
program contains three main components: the detection component, the deployment
component, and the consistency component. The generated code ends by collecting and
aggregating the result from each device. Algorithm 1 illustrates this process.
Detection component During the initialization (line 1 of Algorithm 1), the generated
code detects the available OpenCL devices and associates a ratio to each device thanks
to the oﬄine profiler (see Section 4.2). It indicates the percentage of the workload that
the generated code has to deploy to each device. After the initialization, the Devices
variable thus contains a list of (device, ratio) pairs.
Deployment component The deployment component (lines 2 to 5 and lines 8 to 12
in Algorithm 1) performs the initial deployment of the kernels on the devices.
First, based on the ratio (provided by the detection component) and on the tile size
(provided by the configuration file, see Section 3.2.2) of each device, the generated code
computes the number of tasks for each device. For a given device, the number of tasks is
simply equal to the size of the output argument multiplied by the ratio and divided by
the size of a tile.
Then, with the data_split axis provided by the configuration file (see Subsection 3.2.2)
and on the list of tasks computed at the previous line, the generated code computes the
subset of the input data accessed by each device (line 4 of Algorithm 1). This computa-
tion relies on the PIPS compiler. PIPS analyzes the instructions and represents memory
accesses as convex polyhedra.As an illustration, Listing 4.7 reports the access patterns to
arrays A and B in the 1D-stencil: the kernel reads A at three different positions (i, i+ 1
and i− 1) while it writes B at the position i. From this information, the generated code
computes the convex hull of the data accessed by the tasks of each device, and stores this
result in the Subdata variable. Due to region representation of PIPS, the size of Subdata
can be overestimated.
Finally, the generated program copies the data to the devices (lines 8 to 9 of Algo-
rithm 1) and deploys the tasks (lines 11 to 12).
Consistency component This component ensures data consistency and tries to min-
imize the data transfers between the devices between two iterations. It includes the lines
52
from 6 to 7 and line 14 of Algorithm 1.
The generated code first identifies the regions that have to be exchanged between the
iterations (line 6 of Algorithm 1). Again, the generated code uses the PIPS analysis.
As presented in Section 4.5, for each argument exchanged between two iterations, the
generated code identifies which regions of the argument are replicated between at least
two devices, accessed in read in the input argument and accessed in write in the output
argument. For each device, the FindDeviceNeighboring function at line 6 identifies both
neighbors and their associated replicated regions.
At the end of an iteration (line 15 of Algorithm 1), STEPOCL retrieves the comput-
ing results from each devices.
Algorithm 1: Generated host code.
Input:
Data host: Data location on the host
Data size: Data size
Data split: Data-splitting plan
Kernel tile: Computation size done by each kernel instance
N iter : Number of iterations
Data:
Devices: List of detected computing devices
Subtasksi: Workload assigned to device i
Subdatai: Data chunk to distribute on device i
Kernel accesses: Data access pattern by the kernel
Neighborsi: Data neighboring of device i
Data devicei: Data location on device i
1 Devices← InitOpenCLEnvironment();
2 Subtasks← PartitionWorkload(Data size,
3 Kernel tile,Devices);
4 Subdata← PartitionData(Data host,Data size,
5 Data split, Subtasks);
6 Neighbors← FindDeviceNeighboring(Subdata,
7 Kernel accesses);
8 foreach devi in Devices do
9 Data devicei ← InitDeviceData(devi, Subdatai);
10 for k=1 to N iter do
11 foreach devi in Devices do
12 InvokeKernel(Subtasksi, Data devicei);
13 WaitForTermination();
14 UpdateSubData(Subdata,Neighbors);
15 foreach devi in Devices do
16 ReadBack(Data host, devi, Data devicei);
17 FinalizeOpenCLEnvironment();
4.7 Conclusion
This chapter introduces the core modules that are implemented in STEPOCL. STEPOCL
uses the result of region analysis and information from oﬄine profiling to perform the
workload partition. Then, STEPOCL runtime manages the process of workload balanc-
ing and the communications between multiple devices.
STEPOCL is well suited to massive data-parallel applications. The data size and cost
of communication can greatly affect the performance. The ideal situation is that the data
size is large enough to saturate the computing power of the whole platform. As for the
53
applications with smaller data size, the cost of communication between devices becomes
more important, and the better performance might be achieved by using fewer computing
devices. In the next chapter, we will evaluate STEPOCL with three applications. For
each application, we will analyse the impact of data size and cost of communication on
performance.
54
Chapter 5
Evaluation
Le ge´nie commence les beaux
ouvrages, mais le travail les
ache`ve.
—Joseph Joubert
Contents
5.1 Test cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.2 Volume of the generated source code . . . . . . . . . . . . . 61
5.3 Performance evaluation . . . . . . . . . . . . . . . . . . . . . . 63
5.3.1 Experimental platforms . . . . . . . . . . . . . . . . . . . . . . 63
5.3.2 Evaluation of the profiler . . . . . . . . . . . . . . . . . . . . . 64
5.3.3 Comparison with the reference codes . . . . . . . . . . . . . . . 64
5.3.4 Stencil . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.3.5 Matrix multiplication . . . . . . . . . . . . . . . . . . . . . . . 66
5.3.6 N-body . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.4 Analysis of overhead of communication . . . . . . . . . . . . 68
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
We evaluate STEPOCL on three test cases: a 5-point 2D-stencil, a matrix multipli-
cation and an N-body application.
We start by describing the target applications context before evaluating the generated
codes according to different criteria:
• we compare the size of handwritten applications to the size of the equivalent appli-
cations generated by STEPOCL;
• we evaluate the accuracy of the oﬄine profiler;
• we compare the performance of the handwritten applications to the performance
of the equivalent applications generated by STEPOCL when running on a single
device;
55
• we evaluate the performance of the generated applications when running on multiple
devices.
5.1 Test cases
In order to assess the STEPOCL usability and performance, we use three applications
(Stencil computation, Matrix multiplication and N-body) that can run on heterogeneous
multi-device platforms.
Figure 5.1 – 5-point 2D-stencil computation.
Stencil computation The first application is a 5-point 2D-stencil computation. It
takes a 2D matrix of scalar values as input and outputs, for each element, a weighted
average of its 4-neighborhood. Figure 5.1 illustrates this computation. We compare the
generated stencil application with an handwritten OpenCL version. We set the configu-
ration file (Listing 5.2) so that:
• It defines two kernel arguments: an input and an output matrices of float elements
(line 3 – line 25 of Listing 5.2), which are marked to be split by lines (line 29 – line
32 of Listing 5.2).
• It defines two kernel versions: the first kernel (reported in Listing 5.1) is designed to
fit CPU and Xeon Phi devices by setting work-items to work over 4-elements wide
tiles while the second kernel is optimized for GPUs and takes advantage of their
shared memory. Each work-group instantiates 16 × 4 work-items that first copy
their data to their local shared memory before entering the computation phase and
processing 4 elements. (line 34 – line 64 of Listing 5.2)
• The target application iterates ten times, switching input and output matrices after
each iteration (line 66 – line 73 of Listing 5.2). The STEPOCL code updates
data in device memory in a fixed pattern :when a sub-matrix, corresponding to
56
the bounded interval [(x1, y1), (x2, y2)], is assigned to a device, the frontiers of its
corresponding output are fetched from all its neighbor devices.
Listing 5.1 – 2D-stencil kernel with a tile of 4 elements.
1 k e r n e l void s t e n c i l ( g l o b a l f l o a t ∗B,
2 g l o b a l f l o a t ∗A,
3 unsigned i n t l i n e s i z e ) {
4 const unsigned i n t x = g e t g l o b a l i d ( 0 ) ;
5 const unsigned i n t y = g e t g l o b a l i d ( 1 ) ;
6 A += l i n e s i z e + 1 ; // OFFSET
7 B += l i n e s i z e + 1 ; // OFFSET
8 f o r ( unsigned i n t k=0; k<4; k++){
9 B[ ( y∗4 + k )∗ l i n e s i z e + x ]
10 = 0.75 ∗ A[ y∗4 + k ] [ x ]
11 + 0.25 ∗ ( t i l e [ y∗4 + k +1] [ x−1]
12 + t i l e [ y∗4 + k +1] [ x+1]
13 + t i l e [ y∗4 + k−1] [ x−1]
14 + t i l e [ y∗4 + k−1] [ x+1] ) ;
15 }
16 }
List 5.2 presents the configuration file of 2D-stencil application.
Listing 5.2 – Configuration file of 2D-stencil application
1 <?xml ve r s i on=”1 .0 ”?>
2 <root>
3 <argument>
4 <ID>B</ID>
5 <data type>f l o a t </data type>
6 <a r g s i z e>
7 <d im s i z e a x i s=”x ”> 1026 </dim s ize> <!−−<dim s ize >1026</dim s ize> −−>
8 <d im s i z e a x i s=”y ”> 1026 </dim s ize> <!−−<dim s ize >1026</dim s ize> −−>
9 </a r g s i z e>
10 </argument>
11 <argument>
12 <ID>A</ID>
13 <data type>f l o a t </data type>
14 <a r g s i z e>
15 <d im s i z e a x i s=”x ”> 1026 </dim s ize>
16 <d im s i z e a x i s=”y ”> 1026 </dim s ize>
17 </a r g s i z e>
18 </argument>
19 <argument>
20 <ID> l i n e s i z e </ID>
21 <data type>unsigned int</data type>
22 <depend>
23 <a x i s ID=”A”>x</axis>
24 </depend>
25 </argument>
26 <kerne l>
27 <name>s t e n c i l </name>
28 <t a rge t a rg>B</ta rge t a rg>
29 <d a t a s p l i t>
30 <a x i s ID=”B”>y</axis>
31 <a x i s ID=”A”>y</axis>
32 </d a t a s p l i t>
33 <!−−mul t ip l e p lat form i s ok−−>
34 <implem>
35 <platform>NVIDIA CUDA</platform>
36 <dev ice type>GPU</dev ice type>
37 <f i l ename>stencilGPU . c l</f i l ename>
38 <t i l e >
39 <s i z e a x i s=”x ”>1</s i z e>
40 <s i z e a x i s=”y ”>4</s i z e>
41 </ t i l e >
42 <work group>
57
43 <s i z e a x i s=”x ”>16</s i z e>
44 <s i z e a x i s=”y ”>4</s i z e>
45 </work group>
46 </implem>
47 <implem>
48 <platform>I n t e l (R) OpenCL</platform>
49 <dev ice type>CPU</dev ice type>
50 <f i l ename>stencilCPU . c l</f i l ename>
51 <t i l e >
52 <s i z e a x i s=”x ”>1</s i z e>
53 <s i z e a x i s=”y ”>4</s i z e>
54 </ t i l e >
55 </implem>
56 <implem>
57 <platform>I n t e l (R) OpenCL</platform>
58 <dev ice type>ACCELERATOR</dev ice type>
59 <f i l ename>stenci lMIC . c l</f i l ename>
60 <t i l e >
61 <s i z e a x i s=”x ”>1</s i z e>
62 <s i z e a x i s=”y ”>4</s i z e>
63 </ t i l e >
64 </implem>
65 </kerne l>
66 <cont ro l>
67 <loop i t e r a t i o n s=”10 ” >
68 <exec>Stenci l2D</exec>
69 <switch>
70 <arg ID=”A”>B</arg>
71 </switch>
72 </loop>
73 </cont ro l>
74 </root>
Matrix multiplication The matrix multiplication application computes the C = A×
B operation on 2D matrices. We compare the STEPOCL matrix multiplication with
the one provided by AMD APP SDK benchmark. We set the related configuration file
so that:
• It defines three kernel arguments: the three matrices of float elements A, B and C
(line 3 – line 26 of Listing 5.3). C and A are marked to be split by lines (line 30 –
line 33 of Listing 5.3), and B is totally replicated on all the devices because of the
data dependency analysis.
• It defines two kernel versions (these two kernels are directly taken from AMD SDK):
the first one targets CPU and Xeon Phi devices using a tile of 4× 4 elements, while
the one designed for GPUs also uses a tile of 4 × 4, but defines work-groups of
4 × 4 work-items in order to work on shared local memory (line 34 – line 64 of
Listing 5.3).
• The generated application executes only one iteration without triggering any com-
munication between devices.
List 5.3 presents the complete configuration file of matrix multiplication.
Listing 5.3 – Configuration file of Matrix multiplication
1 <?xml ve r s i on=”1 .0 ”?>
2 <root>
3 <argument>
4 <ID>matA</ID>
58
5 <data type>f l o a t </data type>
6 <a r g s i z e>
7 <d im s i z e a x i s=”x ”>1024</dim s ize>
8 <d im s i z e a x i s=”y ”>1024</dim s ize>
9 </a r g s i z e>
10 </argument>
11 <argument>
12 <ID>matB</ID>
13 <data type>f l o a t </data type>
14 <a r g s i z e>
15 <d im s i z e a x i s=”x ”>1024</dim s ize>
16 <d im s i z e a x i s=”y ”>1024</dim s ize>
17 </a r g s i z e>
18 </argument>
19 <argument>
20 <ID>matC</ID>
21 <data type>f l o a t </data type>
22 <a r g s i z e>
23 <d im s i z e a x i s=”x ”>1024</dim s ize>
24 <d im s i z e a x i s=”y ”>1024</dim s ize>
25 </a r g s i z e>
26 </argument>
27 <kerne l>
28 <name>f loatMatr ixMult</name>
29 <t a rge t a rg>matC</ta rge t a rg>
30 <d a t a s p l i t>
31 <a x i s ID=”matA”>y</axis>
32 <a x i s ID=”matC”>y</axis>
33 </d a t a s p l i t>
34 <implem>
35 <platform>NVIDIA CUDA</platform>
36 <dev ice type>GPU</dev ice type>
37 <f i l ename>multGPU . c l</f i l ename>
38 <t i l e >
39 <s i z e a x i s=”x ”>4</s i z e>
40 <s i z e a x i s=”y ”>4</s i z e>
41 </ t i l e >
42 <work group>
43 <s i z e a x i s=”x ”>4</s i z e>
44 <s i z e a x i s=”y ”>4</s i z e>
45 </work group>
46 </implem>
47 <implem>
48 <platform>I n t e l (R) OpenCL</platform>
49 <dev ice type>CPU</dev ice type>
50 <f i l ename>mult . c l</f i l ename>
51 <t i l e >
52 <s i z e a x i s=”x ”>4</s i z e>
53 <s i z e a x i s=”y ”>4</s i z e>
54 </ t i l e >
55 </implem>
56 <implem>
57 <platform>I n t e l (R) OpenCL</platform>
58 <dev ice type>ACCELERATOR</dev ice type>
59 <f i l ename>mult . c l</f i l ename>
60 <t i l e >
61 <s i z e a x i s=”x ”>4</s i z e>
62 <s i z e a x i s=”y ”>4</s i z e>
63 </ t i l e >
64 </implem>
65 </kerne l>
66 </root>
N-body The N-body application simulates the collective motions of a large particle set
under Newtonian forces in a 3D space. At each iteration, each particle, defined by a mass
and a velocity, changes both its position and its velocity by using the position and the
59
velocity of all the other particles. We compare the generated N-body application with
the one provided by AMD APP SDK benchmark.
In STEPOCL, we define the following arguments (line 3 – line 42 of Listing 5.4):
• Three integer: the number of particles, a softening factor and an elapsed time in-
terval. All these arguments are replicated because of the data dependency analysis;
• Two arrays of float elements containing the particles positions: one stores the po-
sitions at the previous iteration while the other, the current ones;
• Two arrays of float elements that contain the velocity of each particle: one array
stores the velocities at the previous iteration while the other contains the current
ones.
The application performs ten iterations. After each iteration, it switches the input and
output arrays (line 75 – line 82 of Listing 5.4). Considering the communication, the two
input arrays are replicated on all the devices because of the data dependency analysis.
Indeed, to compute the new position and a new velocity of one particle, all particle
positions and velocities of the previous iteration are used. As the input and output
arrays are switched after an iteration, between each iteration, each device broadcasts its
output to all the other devices.
List 5.4 presents the complete configuration file of N-body.
Listing 5.4 – Configuration file of N-body
1 <?xml ve r s i on=”1 .0 ”?>
2 <root>
3 <argument>
4 <ID>Pos</ID>
5 <data type>f l o a t </data type>
6 <a r g s i z e>
7 <d im s i z e a x i s=”x ”>4096</dim s ize>
8 </a r g s i z e>
9 </argument>
10 <argument>
11 <ID>Vel</ID>
12 <data type>f l o a t </data type>
13 <a r g s i z e>
14 <d im s i z e a x i s=”x ”>4096</dim s ize>
15 </a r g s i z e>
16 </argument>
17 <argument>
18 <ID>numPartic les</ID>
19 <data type>int</data type>
20 </argument>
21 <argument>
22 <ID>delT</ID>
23 <data type>f l o a t </data type>
24 </argument>
25 <argument>
26 <ID>espSqr</ID>
27 <data type>f l o a t </data type>
28 </argument>
29 <argument>
30 <ID>newPosit ion</ID>
31 <data type>f l o a t </data type>
32 <a r g s i z e>
33 <d im s i z e a x i s=”x ”>4096</dim s ize>
34 </a r g s i z e>
35 </argument>
36 <argument>
60
37 <ID>newVelocity</ID>
38 <data type>f l o a t </data type>
39 <a r g s i z e>
40 <d im s i z e a x i s=”x ”>4096</dim s ize>
41 </a r g s i z e>
42 </argument>
43 <kerne l>
44 <name>nbody sim</name>
45 <t a rge t a rg>Pos</ta rge t a rg>
46 <d a t a s p l i t>
47 <a x i s ID=”Pos ”>x</axis>
48 <a x i s ID=”Vel ”>x</axis>
49 <a x i s ID=”newPosit ion ”>x</axis>
50 <a x i s ID=”newVelocity ”>x</axis>
51 </d a t a s p l i t>
52 <implem>
53 <platform>NVIDIA CUDA</platform>
54 <dev ice type>GPU</dev ice type>
55 <f i l ename>nbodyGPU . c l</f i l ename>
56 <t i l e >
57 <s i z e a x i s=”x ”>4</s i z e>
58 </ t i l e >
59 <work group>
60 <s i z e a x i s=”x ”>128</ s i z e>
61 </work group>
62 </implem>
63 <implem>
64 <platform>I n t e l (R) OpenCL</platform>
65 <dev ice type>ALL</dev ice type>
66 <f i l ename>nbody . c l</f i l ename>
67 <t i l e >
68 <s i z e a x i s=”x ”>4</s i z e>
69 </ t i l e >
70 <work group>
71 <s i z e a x i s=”x ”>128</ s i z e>
72 </work group>
73 </implem>
74 </kerne l>
75 <cont ro l>
76 <loop i t e r a t i o n s=”10 ”>
77 <switch>
78 <arg ID=”Pos ”>newPosit ion</arg>
79 <arg ID=”Vel ”>newVelocity</arg>
80 </switch>
81 </loop>
82 </cont ro l>
83 </root>
5.2 Volume of the generated source code
Table 5.2 reports the number of lines of code of the three native OpenCL applications as
well as the size of the three configuration files that STEPOCL used for generating the
tested applications. The table also contains the number of lines of code of the applications
generated by STEPOCL. The kernels consist of a few tens of lines of code. However,
the native OpenCL applications and generated host codes in charge of instantiating the
kernels on the devices consist of several hundreds lines of code.
The native OpenCL applications are only able to run on single device, while the
STEPOCL applications can run on multiple devices. We classify the generated code
between what is part of the STEPOCL runtime and what is part of the application.
The runtime source code is generic and does not vary from one application to another,
while the application code is more specific and may vary depending on the description
61
Test case Native OpenCL application Kernels STEPOCL config. file STEPOCL generated code
(kernels included) (kernels included)
Stencil 490 GPU=51/others=18 74 1153
Mat. Mult. 1212 GPU=103/others=78 66 1216
N-body 1041 81 83 1116
Table 5.1 – Generated STEPOCL code size (in lines).
# of lines of code
Application Runtime
Initialization 89 loc 323 loc
Workload partitioning 224 loc 76 loc
Communication 24 loc 322 loc
Retrieve results 85 loc 10 loc
Total 422 731
Table 5.2 – Distribution of the lines of code of the generated 2D stencil application
file provided by the user or on the OpenCL kernel. The code is also classified depending
on its semantic: the code may be in one of the 4 parts of the generated host code as pre-
sented in Section 3.2.3: the initialization, the workload partitioning, the communication
management or the retrieving of the results. Table 5.2 reports the analysis of generated
2D stencil source code. Most of the lines of code are dedicated to the runtime system.
The generated application itself remains compact (422 lines of code), but may increase
for more complex computational kernels.
The workload partitioning, which represents the largest part of the generated code,
includes the initial distribution of the workload across devices as well as the load bal-
ancing mechanism. This part of the code is more dependant on the type of application.
Thus, most of the source code is located in the application. Retrieving the results is also
mostly performed by the application and the runtime system only provides a function for
transferring data back from the devices.
The other parts of the code are more generic and mostly provided by the run-
time. The initialization mainly consists in detecting the available devices and initializing
the OpenCL environment, while the generated code simply declares variables and calls
STEPOCL runtime functions. The communication management, which is in charge of
synchronizing the devices and managing the data exchanges between devices, is mainly
implemented by the runtime. The STEPOCL runtime implements transfer primitives
as well as functions that perform the polyhedral analysis.
To conclude, we can observe that the STEPOCL configuration files contains only
few tens of lines of code, which is roughly ten time smaller than the generated code. This
result shows that STEPOCL simplifies the development of an application for multiple
device.
62
Begin
Initialize OpenCL environment
Partition and distribute 
workload to all devices
Device a :
Execute Kernel
Device b :
Execute kernel
Retrieve result and
 release the memory
For loop
Loop
finished ?
Data 1 Data 2
Data transimission
Data transimission
Yes
No
Figure 5.2 – the structure of generated 2D stencil code
5.3 Performance evaluation
We now evaluate the performance of the three applications on two heterogeneous plat-
forms.
5.3.1 Experimental platforms
We summarize the characteristics of the two platforms in Table 5.3: the Hannibal
platform is a dual quad-core Intel Xeon CPU with three Quadro FX 5800 GPUs while
63
Name Hannibal Mistral
CPU models 2 x Intel Xeon X5550 2 x Intel Xeon E5-2670
# CPU cores 2 x 4 2 x 10
# threads 2 x 8 2 x 10
CPU frequency 2.66 GHz 2.50GHz
OpenCL support Intel OpenCL 1.1 Intel OpenCL 1.2
Accelerator type NVIDIA GPU Intel Xeon Phi
Models 3 x Quadro FX 5800 2 x Xeon Phi 7120P
# cores 3 x 240 CUDA cores 2 x 61 cores (2 x 244 threads)
Processor clock 1296 MHz 1238MHz
Memory 3 x 4096 MB GDDR3 2 x 16 GB GDDR5
OpenCL support NVidia OpenCL 1.1 Intel OpenCL 1.2
Table 5.3 – Experimental platform outline.
Mistral is a dual 10-cores Intel Xeon CPU with two Xeon Phi accelerators.
5.3.2 Evaluation of the profiler
Before evaluating the performance of the applications, we evaluate the oﬄine profiler by
measuring how it converges towards a balanced distribution of the workload.
Figure 5.3 depicts the workload distribution wji and the measured duration d
j
i of the
CPU and GPU devices on a machine running a 7-point 3D-stencil. Artificially, for the
first iteration of the application, we assign 99 % of the workload to the GPU and the
remaining 1 % to the CPU. After the first iteration, the profiler detects that it assigned
too much workload to the GPU. The workload partition for the second iteration thus
assigns more work-items to the CPU, which results in a more balanced execution time.
Yet, the difference between dt20 and d
t2
1 leads to assigning more work-items to the CPU
for the third iteration. After the third iteration, the profiler detects that the difference
between dt30 and d
t3
1 is small enough and stops the calibration process. In this experiment,
we run a few more iterations in order to make sure that the execution on each device is
stable when the workload distribution does not change.
5.3.3 Comparison with the reference codes
In order to ensure that the code generated by STEPOCL does not degrade the perfor-
mance as compared to a native implementation written directly in OpenCL, we compare
the performance of the generated application with the native OpenCL applications.
As the native OpenCL applications are only designed to run on a single device, we only
use a single GPU on Hannibal. From Table 5.4, we can observe that STEPOCL does
not change the performance of the matrix multiplication and that STEPOCL introduces
an overhead of 1% for stencil and 2% for N-body. From this result, we can thus conclude
that STEPOCL does not significantly modify the performance on a single accelerator.
5.3.4 Stencil
Figure 5.4 presents the performance of the generated stencil code on the two machines.
Our measures underlines the performance scalability achieved by STEPOCL, with GPU
and CPU devices adding up their computational horsepower efficiency.
64
020
40
60
80
100
W
or
k
lo
ad
ra
ti
o
(%
)
t1 t2 t3 t4 t5 t6
0
10
20
30
40
Step of execution
T
im
e
of
ex
ec
u
ti
on
on
ea
ch
d
ev
ic
e
(m
s)
Workload on GPU
Workload on CPU
Execution time on GPU
Execution time on CPU
Figure 5.3 – Workload adjustment performance of the 3D-stencil application.
On Hannibal (see Figure 5.4(a)), we observe that the peak performance achieved
when using all the available devices (91.8 GFLOPS) roughly corresponds to the sum of the
performance of each device running individually (14.5 GFLOPS for 1 CPU, 28.1 GFLOPS
for 1 GPU). The efficiency is not 100 % because of the communication between the devices
that are required when computing the stencil on multiple devices in order to ensure data
consistency.
The results on the mistral platform have a similar trend: the peak performance when
using all the devices (134.1 GFLOPS) approximately corresponds to the sum of the
individual performance (20.6 GFLOPS on 1 CPU, 63.1 GFLOPS on 1 Xeon Phi).
From these two results, we can conclude that STEPOCL scales linearly with the
small number of devices. This result also confirms that the oﬄine profiler seems to
provide efficient ratios able to perfectly balance the load between the devices.
Moreover, STEPOCL pushes forward the memory limits by automatically distribut-
ing data sets which are too large to be processed by a single device, summing up the
memory of multiple devices to handle larger problems. On the Hannibal platform,
where GPUs are equipped with 4 GiB of memory, the 1-GPU version cannot process the
test cases that require more than 4 GiB of memory. Similarly, the 2-GPUs version is
limited to 8 GiB, while the versions that exploit the 3-GPUs can process larger problems.
Application name 2D-Stencil Mat. Mult. N-body
Workload 4096x4096 1024x1024 32768 particles
Relative performance 0.99 1.00 0.98
Table 5.4 – Relative performance of STEPOCL as compared to a native OpenCL im-
plementation on Hannibal.
65
0 GiB 1 GiB 2 GiB 3 GiB 4 GiB 5 GiB 6 GiB 7 GiB 8 GiB
0
20
40
60
80
100
Data size (GiB)
G
F
L
O
P
S
1 CPU 1 GPU 2 GPUs 3 GPUs 1 CPU + 3 GPUs
(a) Performance on Hannibal
0 GiB 1 GiB 2 GiB 3 GiB 4 GiB 5 GiB 6 GiB 7 GiB
0
50
100
Data size (GiB)
G
F
L
O
P
S
1 CPU 1 MIC 2 MICs 1 CPU + 2 MICs
(b) Performance on Mistral
Figure 5.4 – Performance of the 5-point 2D-stencil application. The horizontal axis cor-
responds to the size the input and output matrices required to solve the problem.
5.3.5 Matrix multiplication
Figure 5.5 presents the performance of the generated matrix multiplication on the Han-
nibal and Mistral machines. On Hannibal (see Figure 5.5(a)), the performance
achieved when using both the CPU and the 3 GPUs (184.3 GFLOPS) almost corresponds
to the sum of the performance achieved by each device individually (18.36 GFLOPS on
the CPU, 56.16 GFLOPS on each GPU). On Mistral (see Figure 5.5(b)), when using
the CPU and the 2 Xeon Phis (252.9 GFLOPS), the performance corresponds to 88 % of
the one achieved by each device individually (90.9 GFLOPS on the CPU, 97.4 GFLOPS
on each Xeon Phi). Thus, in term of performance, the STEPOCL generated code scales
up correctly. Moreover, as for the stencil application, the 1-GPU version is bounded by
its inner memory size. Once again, STEPOCL allows to compute bigger problem than
the original code thanks to its multi-device dimension.
66
0 GiB 1 GiB 2 GiB 3 GiB 4 GiB 5 GiB
0
50
100
150
200
Data size (GiB)
G
F
L
O
P
S
1 CPU 1 GPU 2 GPUs 3 GPUs 1 CPU + 3 GPUs
(a) Performance on Hannibal
0 GiB 0.5 GiB 1 GiB 1.5 GiB 2 GiB 2.5 GiB
0
100
200
300
Data size (GiB)
G
F
L
O
P
S
1 CPU 1 MIC 2 MICs 1CPU + 2MICs
(b) Performance on Mistral
Figure 5.5 – Performance of the matrix multiplication application. The horizontal axis
corresponds to the summed size of the A, B, and C matrices.
5.3.6 N-body
Figure 5.6 presents the performance of the generated N-body application on Hannibal
and Mistral. On Hannibal (see Figure 5.6(a)), the performance achieved when using
both the CPU and the 3 GPUs (152.8 GFLOPS) corresponds to 94 % of the cumulated
performance achieved by each device individually (9.38 GFLOPS on the CPU, 51.12
GFLOPS on each GPU). On Mistral (see Figure 5.6(b)), when using the CPU and the
2 Xeon Phis (90 GFLOPS), the performance corresponds to 83 % of the one achieved
by each device individually (16.31 GFLOPS on the CPU, 45.68 GFLOPS on each Xeon
Phi).
As described in Subsection 5.1, the updated data communication between each iter-
ation is highly critical in comparison with our two other test cases: after each iteration,
the computed data is broadcasted to all devices attending the computation. This is why
67
210 212 214 216 218 219 220 221 222
0
50
100
150
Number of particles
G
F
L
O
P
S
1 CPU 1 GPU 2 GPUs 3 GPUs 1 CPU + 3 GPUs
(a) Performance on Hannibal
210 212 214 216 218 219 220 221 222
0
20
40
60
80
100
Number of particles
G
F
L
O
P
S
1 CPU 1 MIC 2 MICs 1 CPU + 2 MICs
(b) Performance on Mistral
Figure 5.6 – Performance of the N-body application.
the generated N-body code does not scale well on small data sets, particularly on Xeon
Phi as cross-device communication are very expensive on this platform. However, the
computation time increases in quadratic time with respect to the workload, while the
communication time only increases linearly. For this reason, with larger workloads, the
N-body application scales linearly with the number of devices.
The N-body application does not deal with memory footprint as all arrays need to be
duplicated. Since computation grows quadratically, it really focuses on the reduction of
the makespan. STEPOCL reaches this objective with its multi-device dimension.
5.4 Analysis of overhead of communication
The time of communication is decided by the bandwidth of network and the size data
which is needed to be transferred. The bandwidth of network is fixed, however the
68
Figure 5.7 – Partitioning data by column
Figure 5.8 – Partitioning data by row
transferring data size can be significantly affected by the partition layout.
Listing 5.5 – Stencil Loop Example
1 f o r ( i n t t =0; t<T; ++t ){
2 f o r ( i n t i =10; i<N−10;++ i ){
3 f o r ( i n t j =10; j<N−10;++ j ){
4 A[ i ] [ j ]=CNST ∗ (B[ i ] [ j +10]+B[ i ] [ j −10]
5 +B[ i −1] [ j ]+B[ i +1] [ j ] ) ;
6 }
7 }
8 swap (A,B) ;
9 }
For example, the Listing 5.5 presents a part of typical stencil loop code. In this case,
updating the matrix A depends more on the data which are relatively allocated on X
axes. If we partition the data space by row (Figure 5.8) instead of partitioning by column
(Figure 5.7), we can reduce significantly the overhead of communication.
Another factor that affects the overhead of communication is the continuity of trans-
ferred region. STEPOCL use clEnqueueReadBufferRect and clEnqueueWriteBufferRect
to transfer a 2D or 3D region between a buffer object and host memory, however the
physical address of a 2D or 3D region is not always continuous. In Figure 5.9, splitting
69
P0
P1
P2
P3
Original table Split in row
P1 P2P0 P3
Split in column
Ghost region
Figure 5.9 – Data partition of a 2D table
table in row can keep the continuity of physical addresses for each ghost region and split-
ting in column will break this continuity. According to our experiment, transferring a
continuous region is much more efficient than transferring an uncontinuous region. Thus,
in the situation of Figure 5.9, splitting in line is a better option.
Lastly, the overhead of communication also could be caused by repeatedly adjust-
ment of workload proportion for each OpenCL devices. For the applications which re-
quire consistently to relaunch computing kernels (such as stencil iteration and N-body),
STEPOCL will observe the execution time on each device and adjust the workload pro-
portion when the workload is obviously unbalanced between devices (see section 4.4 ).
Once the adjustment happens, STEPOCL aggregates all computed data from each de-
vices and redistributes it in adjusted proportion. Thus once a re-adjustment of workload
happens, the size of input data and the bandwidth of network will decide the duration
of readjustment. Fortunately, STEPOCL profiler and the constraint implemented in
runtime ensure that the readjustment will not happen too frequently.
5.5 Conclusion
In this chapter, we have presented the usage of STEPOCL and the performance of gen-
erated code on heterogeneous architectures. According to the results of evaluation, by
using STEPOCL, the number of lines of code to write an application to an application
for multiple devices is drastically reduced. STEPOCL generates more than one thousand
lines of code from a configuration file less than one hundred lines of code. The generated
applications do not degrade the performance as compared to a native implementation
written directly in OpenCL. Thanks to STEPOCL, the mechanisms of workload par-
titioning and workload balancing ensure that the generated applications can exploit the
full potential of computing power of any multiple devices heterogeneous architectures,
the performance of generated applications scales linearly with the number of devices.
70
Chapter 6
Conclusion
Choisissez un travail que vous
aimez, et vous n’aurez jamais a`
travailler un seul jour dans votre
vie.
—Confucius
Contents
6.1 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.3 Perspectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Heterogeneous architectures are being wildly used in the domain of High performance
computing. However, due to the multiple types of devices with different processing ca-
pabilities, programming on these architectures is difficult.
In this thesis, we have introduced STEPOCL, a programming tool that eases the
development of applications for multiple heterogeneous devices. Based on an OpenCL
compute kernel and a STEPOCL configuration file provided by user, STEPOCL gen-
erates an oﬄine profiler to guide the partitioning of the workload and generates the
OpenCL host part of the application. The generated application schedules the workload
according to the profiling results, launches their execution, and performs the necessary
data exchanges between devices.
We evaluated STEPOCL with three applications: a stencil application, a matrix
multiplication, and an N-body application. We measured the performance of these ap-
plications on two different multi-device platforms. Our evaluation shows that, thanks to
STEPOCL, the number of lines of code to write an application for multiple devices is
drastically reduced. Our measurements also show that the code generated by STEPOCL
can run on complex multi-device systems and that its performance scales well with the
number of devices. Using multiple devices also enables to cope with problem sizes that
cannot fit into a single accelerator.
71
6.1 Contribution
The contributions of this thesis are:
1. A programming tool STEPOCL that eases the development of applications for
multiple heterogeneous devices.
2. A domain specific language, based on XML, which consists in describing the execu-
tion scheme and the data layout processed by OpenCL kernels. Using this language
in our context improves the productivity of the developer by decreasing the number
of line of codes required to implement an application, but also has the advantage
of avoiding many bugs caused by the use of a low-level language such as C for the
development.
3. A workload partitioning model which guarantees the balance of workload on mul-
tiple devices.
4. A strategy of managing data transmission which maximizes the re-utilization of
data transferred in devices memory and minimize the data transmission by only
transferring the data that are need for the computation on other devices.
6.2 Limitations
The usability of STEPOCL is limited by the capability of region analysis. STEPOCL
uses compiler PIPS as a tool of region analysis. During our test of region analysis,
we find that PIPS is not capable of analysing the regions which contain complex logical
operations.For example, the result of region analysis may be not correct due to the logical
operations presented in Listing 6.1.
Listing 6.1 – Limitation of region analysis
1
2 k e r n e l void func ( . . . )
3 {
4 . . . . . .
5 {
6 const i n t cbx = x loc ;
7 const i n t cby = ( y loc & 1) ? −1 : 16 ;
8 const i n t bx = ( y loc & 2) ? cbx : cby ;
9 const i n t by = ( y loc & 2) ? cby : cbx ;
10 }
11 . . . . . .
12 }
Another limitation is form of distribution. STEPOCL only performs regular data
partitioning, such as splitting data in line, in column. The number of sub-data is equal
to the number of available devices, thus some distributions, such as distribution in cycles
or in diagonal (distribution (d) and distribution (e) in figure 6.1) are not available in
STEPOCL for now.
72
P1
P2
P3
P4
P1 P2 P3 P4
P1
P3 P4
P2
P1
P3
P1
P3
P2
P4
P2
P4
P1 P4 P3 P2
P2 P1 P4 P3
P3 P2 P1 P4
P4 P3 P2 P1
Double A[80][80]
a) distribute in lines b) distribute in columns
c) distribute in cubes d) distribute in cycles e) distribute in diagonal
Figure 6.1 – Distributions of a 2D table on 4 devices
6.3 Perspectives
Heterogeneous computing is becoming popular in every domain of computing – from
high performance clusters to low-power embedded devices. As a prototype of program-
ming tool, STEPOCL hasn’t considered every aspect of programming on heterogeneous
architectures. There are several directions we would like to explore.
Perspectives in the short term
Reducing the overhead of data transmission
We would like to further reduce the overhead of data transmission between devices. Cur-
rently, STEPOCL only transfers the necessary data which is needed for computing on
other devices, then we launch the kernels once all data is updated on every device. We
would like to ”hide” the overhead of data transmission by overlapping the data transmis-
sion with kernel executions.
Optimisation of OpenCL kernel
Although OpenCL applications are portable across different platforms, their performances
are not equally portable. In our experience, we achieved the best performance by manually
providing device-specific kernels. In the future, we would like to implement an heuristic
73
module and a code transformation module in STEPOCL so that STEPOCL could
automatically optimise computing kernels for specific devices.
Upgrading STEPOCL profiler
Current estimation of the performance on each computing device is based on the hardware
statistics, such as the number of cores and their frequency. We would like to develop an
heuristics not based on hardware statistics but also based on the static and runtime
feature of computing kernel. Thus, the upgraded profiler can provide a more reliable
information for workload partitioning.
Perspectives in the long term
Distributed memory architectures
The current version of STEPOCL only works on shared memory architectures. We
would like to extend STEPOCL to adapt to distributed memory architectures. Thus,
we need to develop a new runtime system which is capable of evaluating the performance
of each node on cluster, performing workload partition and scheduling tasks between
several nodes. We would like to combine STEPOCL with MPI to synchronize not only
the tasks of multiple devices on a single node but also the tasks on different nodes.
Scheduling policies
We would like to implement different scheduling policies to face the challenges of exe-
cuting more complex tasks on multiple devices heterogeneous architectures. Rather than
simply scheduling the tasks to saturate all computing resources, we want to consider
more aspects, such as the priorities of tasks and the efficiency of the use of device memo-
ries. The options of scheduling policies will be implemented in the STEPOCL interface.
Therefore programmer will have more choice for optimising OpenCL applications.
74
Bibliography
[1] What is CUDA. http://www.nvidia.com/object/cuda_home_new.html.
[2] Bolt Docmentation. http://hsa-libraries.github.io/Bolt/html/.
[3] Boost librairies homepage. http://www.boost.org/.
[4] Ieee standard for information technology-portable operating system interface (posix)-
part 1: System application program interface (api)- amendment d: Additional real
time extensions [c language], 1999.
[5] TOP500 list of supercomputers, 2015. http://www.top500.org/lists/2015/06/.
[6] Ce´dric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-Andre´ Wacrenier.
StarPU: a unified platform for task scheduling on heterogeneous multicore architec-
tures. Concurrency and Computation: Practice and Experience, 23(2):187–198, 2011.
[7] Steffen Christgau, Johannes Spazier, Bettina Schnor, Martin Hammitzsch, Andrey
Babeyko, and Joachim Waechter. A comparison of cuda and openacc: Accelerating
the tsunami simulation easywave. In Architecture of Computing Systems (ARCS),
2014 27th International Conference on, pages 1–5, Feb 2014.
[8] Be´atrice Creusillet. Array region analyses and applications. PhD thesis, E´cole Na-
tionale Supe´rieure des Mines de Paris, 1996.
[9] Be´atrice Creusillet and Franc¸ois Irigoin. Interprocedural array region analyses.
In Languages and Compilers for Parallel Computing, volume 1033, pages 46–60.
Springer, 1996.
[10] J. Cronsioe, B. Videau, and V. Marangozova-Martin. Boast: Bringing optimization
through automatic source-to-source transformations. In Embedded Multicore Socs
(MCSoC), 2013 IEEE 7th International Symposium on, pages 129–134, Sept 2013.
[11] Message P Forum. Mpi: A message-passing interface standard. Technical report,
Knoxville, TN, USA, 1994.
[12] Ivan Grasso, Simone Pellegrini, Biagio Cosenza, and Thomas Fahringer. LibWater:
Heterogeneous Distributed Computing Made Easy. In Proceedings of the 27th In-
ternational ACM Conference on International Conference on Supercomputing, pages
161–172. ACM, 2013.
75
[13] Rachid HABEL. High Performance Programming for Hybrid Architectures. Theses,
Ecole Nationale Supe´rieure des Mines de Paris, November 2014.
[14] Sylvain Henry, Alexandre Denis, Denis Barthou, Marie Christine Counilh, and Ray-
mond Namyst. Toward OpenCL Automatic Multi-Device Support. In Euro-Par
2014 Parallel Processing, pages 776–787, 2014.
[15] F. Irigoin. Interprocedural analyses for programming environments. In In Envi-
ronments and Tools for Parallel Scienti Computing, pages 333–350. Elsevier Science
Publisher, 1992.
[16] Franc¸ois Irigoin, Pierre Jouvelot, and Re´mi Triolet. Semantical interprocedural par-
allelization: an overview of the pips project. In ICS, 1991.
[17] Lee Janghaeng, Samadi Mehrzad, Park Yongjun, and Mahlke Scott. Transparent
CPU-GPU Collaboration for Data-parallel Kernels on Heterogeneous Systems. In
Proceedings of PACT ’13, pages 245–256, 2013.
[18] Jim Jeffers and James Reinders. Intel Xeon Phi coprocessor high-performance pro-
gramming. Elsevier Waltham (Mass.), Amsterdam, Boston (Mass.), Heidelberg..., et
al., 2013.
[19] Herbert Jordan, Simone Pellegrini, Peter Thoman, Klaus Kofler, and Thomas
Fahringer. INSPIRE: The insieme parallel intermediate representation. Proceed-
ings of the 22nd International Conference on Parallel Architectures and Compilation
Techniques, 0:7–17, 2013.
[20] Kim Jungwon, Kim Honggyu, Lee Joo Hwan, and Lee Jaejin. Achieving a single
compute device image in OpenCL for multiple GPUs. In Proceedings of the 16th
ACM Symposium on Principles and Practice of Parallel Programming, PPoPP ’11,
pages 277–288. ACM, 2011.
[21] Klaus Kofler, Ivan Grasso, Biagio Cosenza, and Thomas Fahringer. An automatic
input-sensitive approach for heterogeneous task partitioning. In Proceedings of the
27th International ACM Conference on International Conference on Supercomput-
ing, pages 149–160. ACM, 2013.
[22] Chris McClanahan and Georgia Tech. History and evolution of gpu architecture.
[23] Daniel Millot, Alain Muller, Christian Parrot, and Fre´de´rique Silber-Chaussumier.
Step: A distributed openmp for coarse-grain parallelism tool. In OpenMP in a New
Era of Parallelism, volume 5004, chapter 8, pages 83–99. Springer, 2008.
[24] David A. Patterson and John L. Hennessy. Computer Organization and Design,
Fourth Edition, Fourth Edition: The Hardware/Software Interface (The Morgan
Kaufmann Series in Computer Architecture and Design). Morgan Kaufmann Pub-
lishers Inc., San Francisco, CA, USA, 4th edition, 2008.
76
[25] Faber Peter and Groblinger Armin. A comparison of gpgpu computing frameworks on
embedded systems. In 13th IFAC and IEEE Conference on Programmable Devices
and Embedded SystemsaˆA˘TˇPDES 2015, volume 48, pages 240–245. ScienceDirect,
2015.
[26] James Reinders and Intel. An overview of programming for intel xeon processors
and intel xeon phi coprocessors, 2012.
[27] M. Sugawara, S. Hirasawa, K. Komatsu, H. Takizawa, and H. Kobayashi. A compar-
ison of performance tunabilities between opencl and openacc. In Embedded Multicore
Socs (MCSoC), 2013 IEEE 7th International Symposium on, pages 147–152, Sept
2013.
[28] Krishnahari Thouti and S. R. Sathe. Comparison of openmp & opencl parallel
processing technologies. CoRR, abs/1211.2038, 2012.
[29] Sandra Wienke, Paul Springer, Christian Terboven, and Dieter an Mey. Openacc:
First experiences with real-world applications. In Proceedings of the 18th Interna-
tional Conference on Parallel Processing, Euro-Par’12, pages 859–870, Berlin, Hei-
delberg, 2012. Springer-Verlag.
77
78
Appendices
79

List of publications
[1] Pei Li, Elisabeth Brunet, and Raymond Namyst. High performance code generation
for stencil computation on heterogeneous multi-device architectures. In 10th IEEE
International Conference on High Performance Computing and Communications &
2013 IEEE International Conference on Embedded and Ubiquitous Computing, HPC-
C/EUC 2013, Zhangjiajie, China, November 13-15, 2013, pages 1512–1518, 2013.
[2] Pei Li, Elisabeth Brunet, Franc¸ois Trahay, Christian Parrot, Gae¨l Thomas, and Ray-
mond Namyst. Automatic opencl code generation for multi-device heterogeneous
architectures. In 44rd International Conference on Parallel Processing, ICPP 2015,
Beijing, China, September 1-4, 2015, 2015.
81
