















Chair  Mr. ..............................
Member  Mr. ..............................






















































































































































































































































































































































































































































































































































































































































































































Frameworks  API                 Standard  Portable  legacy code
FastFlow    Templates           Yes  Yes  Yes  No
TBB         Templates           Yes  Yes  No   No
OpenCL      Keywords + library  Yes  No (not efficiently)  Yes  No
OpenMP      Pragmas             Yes  Yes  Yes  Yes
OpenACC     Pragmas             No   Yes  Yes  Yes

















































































































































































































































































































































































































  /* Metainformation of HPP-DL */
  "class":"hpp",








  "description":"REPARA Reference System. X9DRG-QF (To be filled by O.E.M.)",
  ...
},








/* Definition of GPGPU */
{
  /* Common attributes */
  "class":"gpgpu",
  "id":"platform:0.gpgpu:0",





  /* Definition of links */
  "links":[







































































































































































































/* Sample of a kernel that includes the basic attributes *
 * defined in the specification:                         *
 * - target where the code could be executed,            *
 * - input variables                                     *
 * - and output variables                                *












































































































/* Different examples on the use of the kernel clause */

/* for loop as a kernel */
[[rpr::kernel,





/* compound statement as a kernel */
[[rpr::kernel,






/* expression as a kernel */
[[rpr::kernel,



















/* The kernel will be executed on GPGPU components *
 * of the system. The particular device will be    *
 * chosen by the scheduler. */
[[rpr::kernel, rpr::target(GPGPU),





/* The kernel could be executed on CPU or FPGA device */
[[rpr::kernel, rpr::target(CPU,FPGA),





/* The kernel could be executed on any device */
[[rpr::kernel, rpr::target(ANY),


































/* sequential code */
...

/* Kernel that syncs the previous kernels *
 * before the execution of this one.      *




















/* Other sample of async kernels in a code *
 * region. */
void foo(....){





  /* sequential code */
  ...





} /* Synchronization point */
Listing 4.6: Example of synchronization of asynchronous kernels defined in a nested loop.
/* Asynchronous kernels declared inside a    *
 * structured code block (e.g. loop body) */
for(i=0;i<N;++i){





  /* sequential code */
  ...
}
/* There is an implicit sync operation, where it *



















































































/* A is an input kernel value */
int A[1000];
/* B is an input kernel value, but *
 * only from indices 200 to 300 */
int B[1000];





































































/* J is an output kernel parameter */
int J[1000];
/* Indices 100 to 150 from K are *
 * output for the kernel */
int K[1000];





























































/* rest of the code */

/* Matrix A is transferred to the device and, if not in *









/* part of the host execution where "A" is not used */
...

/* "A" variable is not transferred to the kernel. Again  *
 * the scheduler tries to keep it in the device memory */
[[rpr::kernel, rpr::in(A), rpr::out(A), rpr::keep(A),





/* If variable "A" resides in device memory, it will    *
 * not be transferred from host.                        *
 * "A" variable will be transferred to host and deleted *
 * at the end of kernel execution.                      */
[[rpr::kernel, rpr::out(A),
















/* Matrix "A" is an output of the kernel. It will be kept, if *
 * it is possible, until the next kernel that uses it as a    *
 * parameter (third kernel in this case). */
[[rpr::kernel, rpr::out(A), rpr::keep(A),





/* "B" is used by the kernel and kept if it is possible */
[[rpr::kernel, rpr::in(B), rpr::out(B), rpr::keep(B),





/* "A" and "B" variables may reside in device memory. */
[[rpr::kernel, rpr::in(A,B),



















/* A variable is used outside of kernel *
 * execution but kept in device memory */
[[rpr::kernel, rpr::in(A), rpr::keep(A),








/* ERROR: "A" device variable and "A" host *
 * variable are different */
[[rpr::kernel, rpr::in(A),






/* A kernel with 1000 elements *

















/* Execution of 2d kernel from  *
 * 0 to 1000 and from 0 to 100 *






















/* The kernel will access a matrix in a cross pattern, *








/* The kernel will add all the neighbors of an element *










/* The kernel will use the elements placed in a plane *























































/* Example on the use of the reduce attribute */
[[rpr::kernel, rpr::reduce(add,var),





































































































/* loop statement */
[[rpr::pipeline(), rpr::stream(A,B)]]
for (...){




  /* first stage of pipeline */
  [[rpr::kernel, rpr::out(A),
    ... /* rest of attributes */]]{
    A=...;
  }
  /* second stage of pipeline */
  [[rpr::kernel, rpr::in(A), rpr::out(B),





  /* third stage of pipeline */
  [[rpr::kernel, rpr::in(B),



















/* loop statement                            *
 * A and B are part of the stream            *
 * The internal buffers of the pipeline are: *
 * - Bounded internal buffers                *
 * - Number of elements stored in internal   *
 *   buffers: 8                              *









  /* Stage of the pipeline *




























/* loop statement */
[[rpr::pipeline(), rpr::stream(A,B)]]
while(true){
  /* Information used by the stream will *
   * be created here                     *










  /* Third stage of pipeline */
  // Illegal stream variables are not





















/* Farm by default using         *
 * maximum number of workers and *
 * unordered output */
[[rpr::kernel,
  rpr::farm(),


















/* Map kernel using maximum number *
 * of workers by default. */
[[rpr::kernel, rpr::map(),






/* Map with 4 workers */
[[rpr::kernel, rpr::map(4),






















































  // one per variable







  // one per each dimension.















  // one per variable







  // one per each dimension.



















































/* Examples on the use of the debug attribute.                       *
 * It will show the parameters and their type (i.e. in, out, inout), *
 * characteristics (e.g. type of variable, size).                    *
 * Also, it will show the target device where the code runs and the  *










































































































































































































































































































































































































































































































































































































































































































13: compTransfHostDevEPk ← compTransfHostDevEPk,i = CompInputsi ∧ InputsMovei
14: rateTransfHostDevEPk ← rateTransfHostDevEPk,i = (RateInputsi ∧ InputsMovei) ∗ rate
15: rateTransfDevHostEPk ← rateTransfDevHostEPk,i = RateOutputsi ∗ rate












4: paramLocationhost ← paramLocationhost,i = paramLocationhost,i ∨ InputsMoveToHosti








































Core clock       1.2 GHz  1.0 GHz  1.1 GHz
Computing units  24       2816 SP  224









































































              BFS         Histogram  LBM                                 SGEMM
Hotspots      86,161,175  98,101     62,103,104,127,128,186,459,562,563  28,29,30
AKI           86          101        62,127,128,186,459,562,563          28,29,30
Baseline      86          98,101     62,126,157,186,289,348              28
Not executed  -           -          157,289,348                         -
Nested        -           -          126                                 29,30











































































































































Data size      256    512    1024   2048   4096   8192
CPU partition  16/16  16/16  16/16  4/16   4/16   4/16
GPU partition  0/16   0/16   0/16   12/16  12/16  12/16
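
The partition table above can be read as a split of the work into sixteenths. The following is a minimal sketch under that reading, not the scheduler's actual code: the function names are hypothetical, the cut between 1024 and 2048 is simply what the table lists (sizes in between are not covered by it), and treating the fraction as a share of matrix rows is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// CPU share in sixteenths as listed in the table: 16/16 for sizes up to
// 1024 and 4/16 from 2048 upwards. The exact threshold is an assumption.
unsigned cpu_sixteenths(std::size_t data_size) {
    return data_size <= 1024 ? 16u : 4u;
}

// Splits `rows` into (cpu_rows, gpu_rows). Any rounding remainder is
// given to the GPU so that the two parts always add up to `rows`.
std::pair<std::size_t, std::size_t> split_rows(std::size_t rows,
                                               unsigned cpu_parts) {
    const std::size_t cpu_rows = rows * cpu_parts / 16;
    return std::make_pair(cpu_rows, rows - cpu_rows);
}
```

For a matrix side of 2048, `split_rows(2048, cpu_sixteenths(2048))` assigns 512 rows to the CPU and 1536 to the GPU, which matches the 4/16 and 12/16 entries of the table.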

























Problem size (matrix side)



































Problem size (matrix side)


























































Data size     128  256  384  512   768  1024  1536  2048  3072  4096  6144  8192
Multi.Kernel  X    X    X    3/4X  X    X     X     X     X     X     X     X
UnionKernel   O    O    O    X     O    O     O     O     O     O     O     O













Kernels  1  2  3  4  5  6  7  8  9  10 11 12 13 14
Case 24  X  O  O  O  O  O  O  O  O  O  O  O  O  O
Case 32  X  O  O  O  O  O  O  O  O  O  O  O  O  O
Case 40  X  O  O  O  O  O  O  O  O  O  O  O  O  O
Case 48  X  O  O  O  O  O  O  O  O  O  O  O  O  O
Kernels  15 16 17 18 19 20 21 22 23 24 25 26 27
Case 24  O  O  O  O  O  X  O  O  O  O  O  O  X
Case 32  O  O  O  O  O  O  O  O  O  O  O  O  X
Case 40  O  O  O  O  O  O  O  O  O  O  O  O  X















Problem size (matrix side)





























Problem size (matrix side)























































































Problem size (matrix side)



















































































































































































































































































































































































































































































































































































































[104] REPARA project. Target Platform Description Specification. http://repara-project.eu/?p=174, Feb. 2014.
[105] D. L. Rosenband and T. Rosenband. A design case study: CPU vs. GPGPU




































[116] O. Team. Portable Hardware Locality. https://www.open-mpi.org/projects/hwloc/, Dec. 2015.

















































































































size  magnitude  Represents the size of all memory banks associated to this memory. Memory size capacity using IEC binary prefixes.
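
To illustrate what a magnitude with IEC binary prefixes amounts to, the sketch below converts a string such as "32 GiB" into a byte count. The helper name and the textual form it accepts (digits, optional space, unit) are assumptions made for this example; the specification text above only states that IEC binary prefixes are used.

```cpp
#include <cassert>
#include <cctype>
#include <cstdint>
#include <string>

// Hypothetical helper, not part of HPP-DL: parses "<digits>[spaces]<unit>".
// Returns false for anything else, including SI units such as "KB".
// Overflow of very large values is not checked in this sketch.
bool parse_iec_size(const std::string& text, std::uint64_t& bytes) {
    std::size_t pos = 0;
    std::uint64_t value = 0;
    while (pos < text.size() &&
           std::isdigit(static_cast<unsigned char>(text[pos]))) {
        value = value * 10 + static_cast<std::uint64_t>(text[pos] - '0');
        ++pos;
    }
    if (pos == 0) return false;  // no leading number
    while (pos < text.size() && text[pos] == ' ') ++pos;

    const std::string unit = text.substr(pos);
    unsigned shift = 0;
    if (unit == "B") shift = 0;
    else if (unit == "KiB") shift = 10;  // 2^10 bytes
    else if (unit == "MiB") shift = 20;  // 2^20 bytes
    else if (unit == "GiB") shift = 30;  // 2^30 bytes
    else if (unit == "TiB") shift = 40;  // 2^40 bytes
    else return false;                   // unknown or decimal unit

    bytes = value << shift;
    return true;
}
```

Rejecting "KB" and similar decimal units keeps the binary meaning unambiguous: 4 KiB is always 4096 bytes.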




























































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































  /* Common attributes */
  "class":"gpgpu",
  "id":"platform:0.gpgpu:0",
  "description":"geforce gtx titan",
  "model":"GK110",
  "vendor":"nvidia corporation",




















































































































































































  "description":"Xilinx VC707 Evaluation Kit",
  "vendor":"Xilinx Inc.",
  "model":"VC707",
  "rndacc_winsize":32, /* size of address space in bits, i.e. number of bits in addresses */
  "rndacc_latency":115, /* access latency (ns) */
  "stracc_count":4, /* number of streams */
  "stracc_throughput":15.5, /* throughput in MB/s */



















































































































  /* Definition of DSP processor */
  "class":"processor",
  "id":"platform:0.dsp:0.processor:0",








  /* There are three more processors like this one */
  {
    /* Definition of DSP cores */
    "class":"core",
    "id":"platform:0.dsp:0.processor:0.core:0",






























  /* Definition of memory banks */
  "class":"bank",
  "id":"platform:0.dsp:0.processor:0.memory:0.bank:0",














































  /* The same for each core of each processor */

























































































































































































































































































































































































































































      /* rest of measures */
    ],
    "latency":"0.3 ms"
  },
  ...
  ]
  ...
}
APPENDIX A. HPP-DL
Listing A.22: Resource examples on HPP-DL
{
  "class":"resource",
  "id":"resource:0",
  "description":"",
  "component_ref":"platform:0.memory:0",
  "io_memory_address":["0x200000000-0x24000000"],
  "io_ports":["0xdc00-0xdf00","0xfe00-0xff00"],
  "irq":91
}

{
  "class":"resource",
  "id":"resource:1",
  "description":"",
  "component_ref":"platform:0.fpga:0.memory:0",
  "io_memory_address":["0x200000000-0x400000000"],
  "io_ports":[],
  "irq":0
}
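
The `io_ports` and `io_memory_address` entries in the listing above are strings of the form "0x<begin>-0x<end>". As a hedged illustration of how a consumer of the description might read them, the sketch below splits one such string into two integers and rejects ranges whose end lies below their start. The function name is hypothetical and not part of HPP-DL; only standard-library calls are used.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: parses "0x<begin>-0x<end>" into two integers.
// Returns false on malformed input or when begin is greater than end.
bool parse_hex_range(const std::string& text,
                     std::uint64_t& begin, std::uint64_t& end) {
    const std::size_t dash = text.find('-');
    if (dash == std::string::npos) return false;
    const std::string lo = text.substr(0, dash);
    const std::string hi = text.substr(dash + 1);
    // Both bounds must carry the 0x prefix used in the listing.
    if (lo.compare(0, 2, "0x") != 0 || hi.compare(0, 2, "0x") != 0) {
        return false;
    }
    try {
        std::size_t used_lo = 0, used_hi = 0;
        const unsigned long long b = std::stoull(lo, &used_lo, 16);
        const unsigned long long e = std::stoull(hi, &used_hi, 16);
        // Reject trailing characters after either number.
        if (used_lo != lo.size() || used_hi != hi.size()) return false;
        if (b > e) return false;  // inverted range
        begin = b;
        end = e;
        return true;
    } catch (const std::exception&) {
        return false;  // not a number, or out of range for 64 bits
    }
}
```

Checking the order of the bounds at parse time is a cheap way to catch a mistyped digit in a hand-written description before the range is used.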
Table A.35: Other defined CPU extension capabilities.
Capability  Description
up          SMP kernel running on UP
bts         Branch Trace Store
repgood     rep microcode works well
nopl        The NOPL (0F 1F) instructions
xtopology   cpu topology enum extensions
nonstoptsc  TSC does not stop in C states
Table A.36: Auxiliary CPU capabilities.
Capability  Description
ida         Intel Dynamic Acceleration
arat        Always Running APIC Timer
cpb         AMD Core Performance Boost
epb         IA32_ENERGY_PERF_BIAS support
xsaveopt    Optimized Xsave
pln         Intel Power Limit Notification
pts         Intel Package Thermal Status
dts         Digital Thermal Sensor
hwpstate    AMD HW-PState
A.4. CAPABILITIES 167
Table A.37: Intel CPU virtualization capabilities.
Capability    Description
tprshadow     Intel TPR Shadow
vnmi          Intel Virtual NMI
flexpriority  Intel FlexPriority
ept           Intel Extended Page Table
vpid          Intel Virtual Processor ID
Table A.38: AMD CPU virtualization capabilities.
Capability     Description
npt            Nested Page Table support
lbrv           LBR Virtualization support
svmlock        SVM locking MSR
nripsave       SVM next rip save
tscscale       TSC scaling support
vmcbclean      VMCB clean bits support
flushbyasid    flush-by-ASID support
decodeassists  Decode Assists support
pausefilter    filtered pause intercept
pfthreshold    pause filter threshold
Table A.39: New CPU capabilities.
Capability  Description
fsgsbase    RD/WR FS/GS BASE instructions
bmi1        1st group bit manipulation extensions
hle         Hardware Lock Elision
avx2        AVX2 instructions
smep        Supervisor Mode Execution Protection
bmi2        2nd group bit manipulation extensions
erms        Enhanced REP MOVSB/STOSB
invpcid     Invalidate Processor Context ID
rtm         Restricted Transactional Memory
Table A.40: ARM CPU capabilities 1.
Capability  Description
swp         SWP instruction (atomic read-modify-write)
half        Half word support
thumb       Thumb (16-bit instruction set)
26bit       26 Bit Model
fastmult    32x32→64-bit multiplication
fpa         Floating point accelerator
vfp         VFP (early SIMD vector floating point instructions)
edsp        DSP extensions
java        Jazelle (Java bytecode accelerator)
iwmmxt      SIMD instructions
Table A.41: ARM CPU capabilities 2.
Capability  Description
crunch      MaverickCrunch coprocessor
thumbee     ThumbEE
neon        NEON (second-generation SIMD)
vfpv3       VFP version 3
tls         TLS register
vfpv3d16    VFP version 3 limited to 16 registers
vfpv4       VFP version 4
idiva       SDIV and UDIV hardware division in ARM mode
idivt       SDIV and UDIV hardware division in Thumb mode
vfpd32      VFP version for 32 registers
lpae        Large Physical Address Extension
Table A.42: ARM64 CPU capabilities.
Capability  Description
fp          Floating-point
asimd       Advanced SIMD
evtstrm     Event stream on user space
Table A.43: DSP cores specific capabilities 1.
Capability  Description
fxp         Fixed-point arithmetic support.
flp         Floating-point arithmetic support.
MAPLE-B     Multi-Accelerator Platform Engine for Baseband.
Turbo       Turbo decoding.
Viterbi     Viterbi decoding.
FFT         Fast Fourier transform support.
iFFT        Inverse Fast Fourier transform support.
DFT         Discrete Fourier transform support.
Table A.44: DSP cores specific capabilities 2.
Capability  Description
iDFT        Inverse Discrete Fourier transform support.
EFCOP       Enhanced Filter Coprocessor.
DTF         Data Trace Formatter.
ETB         Embedded Trace Buffer.
EDMA3       Enhanced Direct Memory Access 3.
SWI         Software interrupt capability.
MAC         Multiply-accumulate operation support.
FMA         Fused multiply-add, performed with a single rounding.
IIR         Infinite impulse response filter
FIR         Finite impulse response filter
DPIM        Double-Precision Integer Multiplication
Table A.45: FPGA capabilities.
Capability Name  Description
autocoherency    driver guarantees cache coherent random access
hwswint          device can trigger interrupts at host
mastermode       device can perform master mode memory access
rndacc           device can access host memory in random access mode
rndaccwinmove    device can move the base of the random access window to host memory
scattergather    driver can perform scatter/gather access
stracc           device provides streaming interface
virtaddr         device can use user-space virtual addresses to access host memory
Table A.46: Cache capabilities.
Capability     Description
internal       Internal cache.
write-through  Write policy where data is written to caches and/or memories at the same time. It is incompatible with write-back.
write-back     Write policy where data is written to caches and/or memories with a delay. It is incompatible with write-through.
instruction    Instruction cache. It is incompatible with data or unified.
data           Data cache. It is incompatible with instruction or unified.
unified        Unified data and instruction cache. It is incompatible with data or instruction.
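
The incompatibilities stated in Table A.46 reduce to two rules: a cache lists at most one write policy (write-through or write-back) and at most one content kind (instruction, data or unified). A minimal sketch of such a check follows; the function name is hypothetical and the capability list is assumed to arrive as plain strings.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical validator for the capability list of one cache.
// A capability listed twice is counted twice and therefore rejected.
bool cache_capabilities_consistent(const std::vector<std::string>& caps) {
    int write_policies = 0;  // write-through, write-back
    int content_kinds = 0;   // instruction, data, unified
    for (const std::string& cap : caps) {
        if (cap == "write-through" || cap == "write-back") {
            ++write_policies;
        }
        if (cap == "instruction" || cap == "data" || cap == "unified") {
            ++content_kinds;
        }
    }
    return write_policies <= 1 && content_kinds <= 1;
}
```

A list such as internal, write-back, data passes the check, while one that names both write-through and write-back, or both data and unified, does not.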
