Using ArrayOL to Identify Potentially Shareable Data in Thread Work-Groups of GPUs by De Oliveira Rodrigues, Antonio Wendell et al.
{Wendell.Rodrigues, Frederic.Guyomarch, Jean-Luc.Dekeyser}@inria.fr
Contribution
We propose an optimization method for environments based on MDE and
ArrayOL that generate code for GPU architectures. The method uses
information from the dataflow model to detect data that is reused by the
work-items of a thread work-group. The aim is to improve data-access
performance and thereby speed up parallel applications.
Background
OpenCL Automatic Code Generation
Gaspard2 is a framework based on Model Driven Engineering (MDE) and
MARTE. It generates code for several environments, including massively
parallel architectures such as GPUs. Recent changes in Gaspard2 add a
transformation chain (MARTE to OpenCL) that automatically generates code
for the OpenCL API on hybrid architectures. However, optimizations that
programming experts currently identify by hand remain a challenge for
MDE compilers. This approach adds the ability to detect potentially
shareable data.
ArrayOL and Tilers
Tilers are expressed as stereotyped connectors between ports in the
application model. A connector with the stereotype Tiler links a port of
shape N (the pattern) of a part A, repeated over a repetition space of
shape M, to a port of shape P of the containing component. This topology
means that each element of the pattern N at each iteration of M is
mapped to an element of the array P according to the origin, paving and
fitting attributes of the Tiler, following this formula:
$\{\, \mathit{origin} + \mathit{paving} \cdot i + (\mathit{fitting} \cdot j \bmod \mathit{shape}) \mid 0 \le i < M,\ 0 \le j < N \,\}$
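The Tiler formula can be illustrated with a minimal Python sketch that enumerates, for each repetition index i, the array elements referenced by the pattern indices j. The function name `tiler_indices`, its argument layout (paving and fitting given as matrices with one row per array dimension) and the 4x4 example are assumptions made for illustration, not part of Gaspard2. The sketch applies the modulo to the whole sum, which is also an assumption; for tiles that stay inside the array, this gives the same result as applying it to the fitting term only.

```python
from itertools import product


def tiler_indices(origin, paving, fitting, array_shape, rep_shape, pattern_shape):
    """Map each repetition index i to the list of array indices of its tile.

    paving  : matrix with one row per array dimension, one column per
              repetition dimension (hypothetical layout).
    fitting : matrix with one row per array dimension, one column per
              pattern dimension (hypothetical layout).
    """
    def matvec(matrix, vector):
        # Plain matrix-vector product on nested lists.
        return [sum(row[c] * vector[c] for c in range(len(vector))) for row in matrix]

    tiles = {}
    for i in product(*(range(n) for n in rep_shape)):
        shift = matvec(paving, i)  # reference element of the tile for iteration i
        tile = []
        for j in product(*(range(n) for n in pattern_shape)):
            step = matvec(fitting, j)  # offset of pattern element j inside the tile
            tile.append(tuple((o + p + f) % s
                              for o, p, f, s in zip(origin, shift, step, array_shape)))
        tiles[i] = tile
    return tiles


# Example: each work-item (r, c) of a 4x4 repetition space reads row r of a 4x4 array.
rows = tiler_indices(origin=(0, 0),
                     paving=[[1, 0], [0, 0]],   # row index follows i[0]
                     fitting=[[0], [1]],        # pattern walks along the columns
                     array_shape=(4, 4),
                     rep_shape=(4, 4),
                     pattern_shape=(4,))
print(rows[(2, 3)])  # [(2, 0), (2, 1), (2, 2), (2, 3)]
```

In this example the work-items (2, 0) to (2, 3) all receive the same tile, which is the kind of overlap the method looks for.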
Matrix Multiplication Model
• MC = MA ∗ MB: one thread takes one row
from MA and one column from MB and
computes one element of MC;
• modeled using the Papyrus UML tool;
• magen, mbgen and mcprint are allocated onto
the CPU processor;
• the repetitive task m is allocated onto the
GPU processor;
• ports (variables) of GPU tasks are usually
placed in the device global memory.
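To make the data reuse in this model concrete, the following Python sketch counts the global-memory reads issued by one square work-group and the number of distinct elements behind those reads. The function name `group_reuse` and the work-group sizes are hypothetical, chosen only for illustration; the sketch is a plain count and not the detection rule used by the transformation chain.

```python
def group_reuse(n, group):
    """Count reads of MA and MB by one group x group work-group.

    Each work-item (row, col) computes one element of MC from row `row`
    of MA and column `col` of MB, both of length n.
    Returns (total reads, number of distinct elements read).
    """
    reads = 0
    distinct = set()
    for row in range(group):
        for col in range(group):
            for k in range(n):
                distinct.add(("MA", row, k))  # element of the row of MA
                distinct.add(("MB", k, col))  # element of the column of MB
                reads += 2
    return reads, len(distinct)


reads, distinct = group_reuse(n=8, group=4)
print(reads, distinct, reads // distinct)  # 256 64 4
```

With a work-group of 4 x 4 work-items, every element is read four times, once by each work-item that shares the same row of MA or the same column of MB. Such elements are candidates for a single copy that the whole work-group can share.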
Refactored Model
[Figure: refactored model Matrix_Multiplication_byBlock_1024x1024. Recoverable labels: A: Matrix, B: Matrix, Transfer from CPU, gpuGM: HwRAM, gpuLM: HwRAM.]
Without shareable data, the Tiler makes a simple address reference in
Device Global Memory. Otherwise, the Tiler is a data copy from Global
Memory.
$\prod_{k=0}^{dim} S_{wi}[k]$: product of the elements of the shape of the repetition space of the work-item;
$\prod_{k=0}^{dim} S_{array}[k]$: product of the elements of the shape of the external array;
$\prod_{k=0}^{dim} S_{pattern}[k]$: product of the elements of the shape of the internal array (pattern).
(…) > 1 then transfer = true
k: how many array elements are copied by a work-item.
Funding
This work is supported by Nord-Pas de Calais Region, Valeo and GPUTech.
Results and Conclusions
⇒ from a high-level UML/MARTE model we can detect memory optimization opportunities;
⇒ these added optimizations can provide large performance gains (9x in this case).
References
A. Wendell O. Rodrigues, Frédéric Guyomarc’h, and Jean-Luc Dekeyser. An MDE Approach for Automatic Code Generation from MARTE to OpenCL. Technical report, INRIA Lille, RR-7525.
A. Gamatié, S. Le Beux, E. Piel, R. Ben Atitallah, A. Etien, P. Marquet, and J-L. Dekeyser. A model driven design framework for massively parallel embedded systems. ACM Transactions on Embedded Computing Systems (TECS), to appear, 2011.
Open Questions
• How to create a refactoring method suitable
for hybrid environments.
• Benchmarks for other application types and
cases.
• Analysis of reading and writing ranking.
