Parallelizing Workload Execution in Embedded and High-Performance
  Heterogeneous Systems by Nunez-Yanez, Jose et al.
Jose Nunez-Yanez,
Mohammad Hosseinabady,
Moslem Amiri
University of Bristol, UK
(eejlny,csxmh,ma17215)@bristol.ac.uk
Andrés Rodríguez,
Rafael Asenjo,
Angeles Navarro
Universidad de Málaga
Spain
(andres,asenjo,angeles)@ac.uma.es
Rubén Gran-Tejero,
Darío Suárez-Gracia
Universidad de Zaragoza
Spain
(rgran,dario)@unizar.es
ABSTRACT
In this paper, we introduce a software-defined framework that enables
the parallel utilization of all the programmable processing resources
available in heterogeneous systems-on-chip (SoCs), including FPGA-based
hardware accelerators and programmable CPUs. Two platforms with
different architectures are considered, and a single C/C++ source code
is used on both of them for the CPU and FPGA resources. Instead of
simply using the hardware accelerator to offload a task from the CPU,
we propose a scheduler that dynamically distributes the tasks among all
the resources to fully exploit all computing devices while minimizing
load imbalance. The multi-architecture study compares an ARMv7 and an
ARMv8 implementation with different numbers and types of CPU cores and
also different FPGA micro-architectures and sizes. We measure that both
platforms benefit from having the CPU cores assist FPGA execution at
the same level of energy requirements.
KEYWORDS
FPGAs, heterogeneous, dynamic scheduler, performance improvement,
energy reduction.
ACM Reference format:
Jose Nunez-Yanez, Mohammad Hosseinabady, Moslem Amiri, Andrés Rodríguez,
Rafael Asenjo, Angeles Navarro, Rubén Gran-Tejero, and Darío
Suárez-Gracia. 2018. Parallelizing Workload Execution in Embedded and
High-Performance Heterogeneous Systems. In Proceedings of 6th Workshop
on High Performance Energy Efficient Embedded Systems, Manchester UK,
January 2018 (HIP3ES 2018), 6 pages.
DOI: 10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
Heterogeneity is seen as a path forward for computers to deliver the
energy and performance computing improvements needed over the
next decade. In heterogeneous architectures, specialized hardware
units accelerate complex tasks. A good example of this trend is the
introduction of GPUs (Graphics Processing Units) for general-purpose
computing combined with multicore CPUs. FPGAs (Field Programmable Gate
Arrays) are an alternative high-performance technology that offers
bit-level parallel computing, in contrast with the word-level
parallelism deployed in GPUs and CPUs. In a typical configuration, the
host CPU employs the FPGA accelerator to offload the work and then
remains idle. In this research, we investigate a cooperative strategy
applied to compute-intensive applications in which both the CPU and the
FPGA perform the same task on different regions of the input data. The
proposed scheduling algorithm dynamically distributes different chunks
of the iteration space between the CPU and an FPGA fabric integrated in
the same die. The objective is to measure whether simultaneous
computing among these devices could be more favourable from an energy
and/or performance point of view compared with offloading to the FPGA
while the CPU idles. The FPGA and CPUs are programmed with the same
C/C++ language using the SDSoC (Software Defined SoC) framework, which
enables very high productivity and simplifies the development of
drivers to interface the processor and logic parts. As shown in Table 1,
we consider two platforms with different scales of compute power: a
low-cost platform with a dual-core ARMv7 CPU and a high-performance
state-of-the-art platform with a quad-core ARMv8 CPU. Testing on both
enables both the validation of the approach and the comparison of their
performance and energy characteristics.
Table 1: Platform Specifications

                      Zynq Z7020                   Zynq Ultrascale+ ZU9
PL LUTs               53.2K                        274K
PL Flip-Flops         106.4K                       548K
PL Block RAMs         140                          1824
PL DSP Slices         220                          2520
Fabrication process   28 nm CMOS                   16 nm FinFET
PS, CPU type          32-bit dual Cortex A9        64-bit quad Cortex A53
PS, CPU frequency     600 MHz                      1.4 GHz
Nominal voltage       1 V                          0.85 V
PL-PS interface       Up to 4 64-bit HP ports      Up to 4 128-bit HP ports
                      1 64-bit ACP coherent port   Up to 2 128-bit HPC coherent
                                                   ports (no L2 allocation)
                                                   1 128-bit ACP port
                                                   (L2 allocation)
arXiv:1802.03316v1 [cs.DC] 9 Feb 2018
2 BACKGROUND AND RELATED WORK
The idea of balancing the workload among devices has been explored
previously in the literature, mainly around systems that combine GPUs
and CPUs. For example, a study with desktop CPUs and GPUs is presented
in [5], where percentages of work are assigned to both devices before
making a selection based on heuristics. With CPUs and GPUs,
energy-aware decisions have also been considered in [2], which requires
proprietary code. Another related work in the context of streaming
applications [10] considers performance and energy when looking for the
optimal mapping of pipeline stages to CPU and on-chip GPU. The
possibility of using GPU+CPU and FPGA simultaneously and
collaboratively has also received attention in diverse application
areas such as medical research [3]. The hardware considered uses
multiple devices connected through a common PCIe backbone, and the
designers optimized how different parts of the application are mapped
to each computing resource. This type of heterogeneous computing can be
considered to connect devices vertically, since the idea is to build a
streaming pipeline that moves processed data from one stage to the
next. Data is captured and initially processed in the FPGA, then moved
with DMA engines to the CPU and GPU components. The heterogeneous
solution achieves a 273× speed-up over a multi-core CPU implementation.
A study of the potential of FPGAs and GPUs to accelerate data center
applications is done in [6]. The paper confirms that FPGA and GPU
platforms can provide compelling energy efficiency gains over
general-purpose processors, but it also indicates that the possible
advantages of FPGAs over GPUs are unclear due to the similar
performance per watt and the significant programming effort of FPGAs.
In any case, it is important to note that the paper does not use
high-level languages to increase FPGA productivity as done in this
work, and the power measurements for the FPGA are based on worst-case
tool estimations and not direct measurements.
In this research, we explore a horizontal collaborative solution more
closely related to the work done in [9]. That work focuses on a
multiple-device solution similar to ours and demonstrates how the
N-body simulation can be implemented in a heterogeneous solution in
which both FPGA and GPU work together to compute the same algorithm
kernel on different portions of particles. While our approach uses a
dynamic scheduling algorithm to compute the optimal split, in [9] the
split is calculated manually, with 2/3 of the workload given to the
FPGA and the rest to the GPU; the collaborative implementation is 22.7×
faster than the CPU-only version.
In summary, we can conclude that the available literature has largely
focused on advancing the programming models to make the use of FPGAs in
heterogeneous systems more productive, comparing the performance of
GPGPUs, FPGAs and CPUs for different types of applications in
large-scale clusters, and creating systems that manually choose the
optimal device for each part of the application and move data among
them. In contrast, in this paper we select a state-of-the-art
high-level design flow based on C/C++ for single-chip heterogeneous
CPU+FPGA systems and extend it to support simultaneous computing with
dynamic workload balancing.
3 PROGRAMMING ENVIRONMENT
3.1 Programming Interface
This section introduces the proposed Heterogeneous Building Blocks
(HBB) library API. It is a C++ template library that takes advantage of
heterogeneous processors and facilitates their usage and configuration.
HBB aims to make programming for heterogeneous processors easier by
automatically partitioning and scheduling the workload among the CPU
cores and the accelerator. It builds on top of the SDS (Xilinx SDSoC
library) and TBB [7] libraries, and it offers a parallel_for() function
template to run on heterogeneous CPU+FPGA systems. In Fig. 1 we depict
an MPSoC with an integrated FPGA and two CPU cores (CC), as in the
low-end platform used in the experimental evaluation. The FPGA itself
can contain a number of FPGA compute units (FC) depending on resource
availability and accelerator configuration.
e le part of Fig. 1 shows the soware stack that supports the
user application. Our library (HBB) oers an abstraction layer that
hides the initialization and management details of TBB and SDS
constructs, thus the user can focus on his own application instead of
dealing with thread management and synchronization. e library
takes care of spliing the iteration space in chunks of iterations
and process each chunk on a CPU core (CC) or a FPGA compute
unit (FC). e size of the chunks that are ooaded to the FC is
constant an provided by the user so that it is big enough to fully
utilize the FC, but small enough to foster work sharing and load
balance among the CCs and FCs. e size of the chunks processed
on the CCs is adaptively computed by our heterogeneous scheduler
as explained in Section 3.2. e right part of Fig. 1 shows that the
internal engine that manages the parallel for() function is a
two-stage pipeline, Stage1(S1) and Stage2(S2), implemented with
the TBB pipeline template. At the top of this part we can see the
iteration space with the chunks that have already been assigned to a
processing resource (in orange for the FPGA and yellow for the two
CPU cores) and the remaining iterations with the iterations that
have not been assigned yet (in white). e right part of the gure
shows an execution of the pipeline with 3 tokens. e tokens
represent the number of chunks of iterations that are processed in
parallel. e time required for the computation of each processed
chunk on a FC or on a CC is recorded. is time is used to update
the relative speed of the FC w.r.t. a CC, that we call f . Factor f will
be required to adaptively adjust the size of the next chunk assigned
to a CC as we will see in Section 3.2.
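The bookkeeping behind factor f can be sketched as follows. This is only an illustration under stated assumptions, not HBB code: the type SpeedTracker, its members, the initial value f = 1, and the choice to keep only the most recent throughput sample per device type are all hypothetical. The text states only that chunk times are recorded and used to update f.

```cpp
#include <cstddef>

// Hypothetical helper (not part of HBB): keeps the throughput of the most
// recent chunk on each device type and derives f = FC speed / CC speed.
struct SpeedTracker {
    double fpgaThroughput = 0.0;  // iterations per second on an FC
    double cpuThroughput  = 0.0;  // iterations per second on one CC
    double f = 1.0;               // assumed starting value until both are measured

    // Record a chunk that finished on an FPGA compute unit.
    void recordFpgaChunk(std::size_t iterations, double seconds) {
        fpgaThroughput = static_cast<double>(iterations) / seconds;
        update();
    }

    // Record a chunk that finished on a CPU core.
    void recordCpuChunk(std::size_t iterations, double seconds) {
        cpuThroughput = static_cast<double>(iterations) / seconds;
        update();
    }

private:
    // f is only meaningful once both device types have been timed.
    void update() {
        if (fpgaThroughput > 0.0 && cpuThroughput > 0.0)
            f = fpgaThroughput / cpuThroughput;
    }
};
```

For instance, if an FC finishes 128 iterations in 0.5 s and a CC finishes 32 iterations in 1 s, the tracker reports f = 8, meaning the FC is eight times faster than one CC on this kernel.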
Fig. 2 shows a main function with all the component initialization
required to make the parallel_for() function template work. This is the
main component of the HBB library, and it is made available by
including the hbb.h header file. The user has to create a Body instance
(line 4) that will later be passed to the parallel_for() function.
Program arguments, like the number of threads and the scheduler
configuration, can be read from the command line, as can be seen in
line 6. The benchmarks that we evaluate accept at least three
command-line arguments: <num_cpu_t>, <num_fpga_t> and <fpga_chunksize>.
The first one sets the number of CPU tokens, which translates into how
many CPU cores will be processing chunks of the iteration space. The
second one can be set only to 0 or 1, to disable or enable the FPGA as
an additional computing resource. The last argument, <fpga_chunksize>,
sets the number of iterations contained in each chunk offloaded to the
FPGA.

Figure 1: Heterogeneous Scheduler (left: software stack with the user
application calling parallel_for(begin, end, body), the HBB library
with class Body and class Scheduler <Dynamic, Static>, TBB, the SDSoC
run-time and O.S.-dependent threads on top of the CPU cores (CC) and
the FPGA compute units (FC); right: iteration space split into FPGA
chunks, CPU chunks and remaining iterations (r), processed over time by
the two-stage pipeline S1/S2 with ntokens tokens)

 1  #include "hbb.h"
 2
 3  int main(int argc, char* argv[]){
 4    Body body;
 5    Params p;
 6    InitParams(argc, argv, &p);
 7    // Instantiate task scheduler
 8    Dynamic * hs = Dynamic::getInstance(&p);
 9    ...
10    hs->parallel_for(begin, end, body);
11    ...
12  }
Figure 2: Using the parallel_for() function template
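The three command-line arguments could be read and validated as in the sketch below. It is not the InitParams() routine of Fig. 2, whose implementation is not shown in this paper: the names BenchParams and parseBenchArgs, the field names and the validation rules are assumptions made for illustration.

```cpp
#include <cstdlib>
#include <stdexcept>

// Hypothetical container for the three documented arguments.
struct BenchParams {
    int numCpuTokens;   // <num_cpu_t>: CPU cores processing chunks
    int numFpgaTokens;  // <num_fpga_t>: 0 disables the FPGA, 1 enables it
    int fpgaChunkSize;  // <fpga_chunksize>: iterations per FPGA chunk
};

// Hypothetical parser: argv[1..3] are expected in the order listed above.
BenchParams parseBenchArgs(int argc, char* argv[]) {
    if (argc < 4)
        throw std::invalid_argument(
            "expected <num_cpu_t> <num_fpga_t> <fpga_chunksize>");

    BenchParams p{std::atoi(argv[1]), std::atoi(argv[2]), std::atoi(argv[3])};

    if (p.numCpuTokens < 0)
        throw std::invalid_argument("<num_cpu_t> must be non-negative");
    if (p.numFpgaTokens != 0 && p.numFpgaTokens != 1)
        throw std::invalid_argument("<num_fpga_t> must be 0 or 1");
    if (p.numCpuTokens + p.numFpgaTokens == 0)
        throw std::invalid_argument("at least one resource must be enabled");
    if (p.numFpgaTokens == 1 && p.fpgaChunkSize <= 0)
        throw std::invalid_argument("<fpga_chunksize> must be positive");
    return p;
}
```

Rejecting a configuration with no active resource and requiring a positive FPGA chunk size only when the FPGA is enabled are design choices of this sketch, not behaviour documented for HBB.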
 1  class Body{
 2
 3  public:
 4    void operatorCPU(int begin, int end) {
 5      for(int i=begin; i!=end; i++){
 6        c[i] = a[i] * b[i]; }
 7    }
 8
 9    void operatorFPGA(int begin, int end){
10      mmult((float*)array_a,(float*)array_b,(float*)array_c,
          begin, end, scalar, status, enable);
11    }
12  };
13  ...
Figure 3: Definition of class Body
Before using the parallel_for() function, the user must implement a
Body class in order to define the body of the parallel loop, as we see
in Fig. 3. This class must implement two methods: one that defines the
code that each CPU core has to execute for an arbitrary chunk of
iterations, and one that does the same for the FPGA device. The
operatorCPU() method (lines 4-7 in Fig. 3) defines the CPU code of the
kernel, and the operatorFPGA() method (lines 9-11) calls a hardware
function that has already been implemented in the FPGA using the SDSoC
development flow. SDSoC automatically manages the data movement from
global memory to the FPGA and back.
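To make the two-method contract concrete, the sketch below defines a stand-in Body and a trivial sequential driver that hands chunks to each method in turn. Everything here is an assumption for illustration: VecMulBody and runChunks are hypothetical names, the FPGA path is emulated by a plain software loop so that the code compiles without SDSoC, and the driver does not reproduce the TBB pipeline or the scheduler of the library.

```cpp
#include <algorithm>
#include <vector>

// Stand-in Body with the same two-method shape as Fig. 3.
class VecMulBody {
public:
    VecMulBody(const std::vector<float>& a, const std::vector<float>& b,
               std::vector<float>& c) : a_(a), b_(b), c_(c) {}

    // CPU version of the kernel for iterations [begin, end).
    void operatorCPU(int begin, int end) {
        for (int i = begin; i != end; ++i) c_[i] = a_[i] * b_[i];
    }

    // Software emulation of the FPGA path; a real build would call the
    // hardware function here instead.
    void operatorFPGA(int begin, int end) {
        for (int i = begin; i != end; ++i) c_[i] = a_[i] * b_[i];
    }

private:
    const std::vector<float>& a_;
    const std::vector<float>& b_;
    std::vector<float>& c_;
};

// Hypothetical sequential driver: alternates FPGA-sized and CPU-sized
// chunks until the iteration space is consumed. Both chunk sizes must be
// positive. Returns the number of chunks dispatched.
template <typename Body>
int runChunks(int begin, int end, int fpgaChunk, int cpuChunk, Body& body) {
    int chunks = 0;
    bool toFpga = true;
    while (begin < end) {
        const int size = std::min(toFpga ? fpgaChunk : cpuChunk, end - begin);
        if (toFpga) body.operatorFPGA(begin, begin + size);
        else        body.operatorCPU(begin, begin + size);
        begin += size;
        toFpga = !toFpga;
        ++chunks;
    }
    return chunks;
}
```

With 10 iterations, an FPGA chunk of 4 and a CPU chunk of 2, the driver issues three chunks: [0, 4) to the FPGA method, [4, 6) to the CPU method and [6, 10) to the FPGA method.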
3.2 Scheduling strategies
This section covers the computation of the chunk size that will be
executed by the CPU cores and the FPGA. We implement different
scheduling policies, but in this work we focus on the dynamic
scheduling strategy.
When dynamic scheduling is selected (see line 8 in Fig. 2), the
argument <fpga_chunksize> sets the FPGA chunk size, S_f, whereas the
CPU chunk size is automatically computed by a heuristic described in
[4] and briefly summarized next. This heterogeneous dynamic scheduler
is a combination of the OpenMP dynamic scheduler [1] for the FPGA
chunks and the OpenMP guided scheduler for the CPU chunks. Assuming
that n is the number of iterations of the parallel_for(), nCores the
number of CPU cores, and r the number of remaining iterations
(initially r = n), the CPU chunk size, S_c, is computed with the
following expression:
\[ S_c = \min\left( \frac{S_f}{f},\; \frac{r}{f + nCores} \right) \]
where f represents how much faster the FPGA is w.r.t. a CPU core, and
it is recomputed each time a chunk is processed, as explained in
Section 3.1. In other words, S_c is either S_f/f (the number of
iterations that a CPU core must perform to consume the same time as the
FPGA) when the number of remaining iterations, r, is sufficiently high,
or r/(f + nCores) (a guided self-scheduling strategy [8]) when there
are few remaining iterations, that is, when r/(f + nCores) < S_f/f.
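The expression for the CPU chunk size translates directly into a few lines of C++. The sketch below is a minimal illustration: the function name cpuChunkSize is hypothetical, and rounding down, forcing a minimum chunk of one iteration and capping the chunk at r are assumptions that the expression itself does not specify.

```cpp
#include <algorithm>
#include <cstddef>

// S_c = min(S_f / f, r / (f + nCores)).
// fpgaChunk is S_f, remaining is r; f must be positive.
std::size_t cpuChunkSize(std::size_t fpgaChunk, double f,
                         std::size_t remaining, unsigned nCores) {
    // Iterations a CC needs to take as long as one FPGA chunk.
    const double byFpga = static_cast<double>(fpgaChunk) / f;
    // Guided share of the remaining iterations.
    const double byGuided = static_cast<double>(remaining) / (f + nCores);

    // Assumption: truncate to an integer number of iterations.
    std::size_t sc = static_cast<std::size_t>(std::min(byFpga, byGuided));
    // Assumption: never hand out an empty chunk while work remains.
    sc = std::max<std::size_t>(sc, 1);
    // Never exceed the remaining iterations (yields 0 when r is 0).
    return std::min(sc, remaining);
}
```

As a worked case with S_f = 128, f = 8 and nCores = 2: for r = 10000 the first term wins and the CPU chunk is 128/8 = 16 iterations, while for r = 100 the guided term wins and the chunk is 100/10 = 10 iterations. The switch between the two regimes happens when r drops below S_f (f + nCores)/f = 160.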
4 BENCHMARK DEVELOPMENT
This preliminary evaluation is based on a well-known benchmark: GEMM
(General Matrix Multiplication). The benchmark is written in C/C++ for
both FPGA and CPU targets, and the FPGA functions are compiled using
the high-level synthesis tools that are part of the SDSoC framework.
Figure 4: Matrix multiplication tiling
The algorithm is based on the tiling strategy depicted in Figure 4, in
which the matrix blocks are shown with different colors. A and B are
the input matrices and C is the output matrix. For example, multiplying
the green block of A with the red block of B will generate the purple
block of C. The matrix size used in the main experiment, 1M elements,
cannot be buffered completely in FPGA memory, so the tiling strategy
becomes necessary. Matrix B cannot be declared as having sequential
access in SDSoC because the blocks inside matrix B are not accessed in
a sequential manner, and for this reason DMA options are not possible.
Matrix A is accessed sequentially but it is read multiple times, so it
cannot be declared as having sequential access either: multiple reads
of the same matrix during a single multiplication do not work correctly
with the DMA. Notice that sequential access is needed to use a DMA
solution based on either the SDSoC scatter-gather or simple DMA. Both
use virtual addresses that must be sequential, although scatter-gather
allows physical addresses that are non-sequential. Since using SDSoC
DMAs is not possible in this benchmark, the interfaces are based on
AXIMM (AXI Memory Master), which can also obtain high performance using
the long burst modes available in AXI.
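The tiling idea can be illustrated with a generic blocked matrix multiplication in plain C++. This is a software sketch only, not the HLS kernel used in the benchmark: the function name tiledGemm, the square row-major layout and the single tile parameter are assumptions, and the hardware version described in this section buffers a number of columns of B rather than square tiles of all three matrices.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// C (n x n) = A (n x n) * B (n x n), all row-major, computed block by
// block so that only tile-sized pieces are touched at a time.
// Preconditions: tile >= 1 and each vector holds n * n elements.
void tiledGemm(const std::vector<float>& A, const std::vector<float>& B,
               std::vector<float>& C, std::size_t n, std::size_t tile) {
    std::fill(C.begin(), C.end(), 0.0f);
    for (std::size_t ii = 0; ii < n; ii += tile) {
        const std::size_t iEnd = std::min(ii + tile, n);
        for (std::size_t jj = 0; jj < n; jj += tile) {
            const std::size_t jEnd = std::min(jj + tile, n);
            for (std::size_t kk = 0; kk < n; kk += tile) {
                const std::size_t kEnd = std::min(kk + tile, n);
                // Multiply block (ii, kk) of A by block (kk, jj) of B and
                // accumulate into block (ii, jj) of C.
                for (std::size_t i = ii; i < iEnd; ++i) {
                    for (std::size_t k = kk; k < kEnd; ++k) {
                        const float a = A[i * n + k];
                        for (std::size_t j = jj; j < jEnd; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
                }
            }
        }
    }
}
```

Each (ii, jj) pair corresponds to one output block of C, which is the unit of work that a chunked scheduler can hand to a CPU core or to an FPGA compute unit.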
Table 2 shows the results of using the same source code for both
devices while varying the number of matrix B columns that are buffered
inside the FPGA (32 in the case of the Zynq and 128 in the case of the
Zynq Ultra). As the number of buffered elements increases, it is
possible to extract more parallelism. The available internal memory in
the Zynq device limits this value to 32, whereas in the case of the
Zynq Ultra the limit of 128 is due to a tool issue that causes
synthesis to fail with larger values. The Zynq device can only
accommodate a single FPGA compute unit, while the Zynq Ultra supports
the deployment of 4 compute units working in parallel. To enable cache
coherence, the ACP port is used in the Zynq device and the HPC ports
are used in the Zynq Ultra device. Cache coherence is important when
the application requires that the CPU and FPGA cores have access to the
same data, both to guarantee correctness and to avoid explicit software
coherency.
Table 2: GEMM hardware resources

                   Zynq Ultra                    Zynq
                   available   used / %          available   used / %
LUTs (K)           274         87.8 / 32.0       53.2        18.1 / 34.0
Flip-Flops (K)     548.1       162.6 / 29.7      106.4       27.3 / 25.7
Block RAMs         1824        1048 / 57.5       140         79 / 56.4
DSP Slices         2520        640 / 25.4        220         160 / 72.7
5 HETEROGENEOUS COMPUTING EVALUATION
The evaluation of the GEMM benchmark is performed on a ZC702 board
equipped with a Zynq 7020 device and a ZCU102 board equipped with a
Zynq Ultrascale+ ZU9 device. These boards contain a PMBus (Power
Management Bus) power control and monitoring system that enables the
reading of power and current values using the ARM CPUs. For the power
measurements, the power values corresponding to the processing system
(CPU cores) and the programmable logic (FPGA) have been added together.
For the energy computation, we multiply this value by the execution
time of the benchmark.
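The power and energy bookkeeping described here amounts to one addition and one multiplication, sketched below. The type PowerSample and both function names are hypothetical, the PMBus readout itself is not shown, and treating power as constant over the run is a simplifying assumption of the sketch.

```cpp
// Hypothetical pair of power readings taken through the PMBus monitor.
struct PowerSample {
    double psWatts;  // processing system (CPU cores)
    double plWatts;  // programmable logic (FPGA)
};

// Total power is the sum of the PS and PL contributions.
double totalPowerWatts(const PowerSample& s) {
    return s.psWatts + s.plWatts;
}

// Energy is total power multiplied by the benchmark execution time,
// assuming the power stays constant during the run.
double energyJoules(const PowerSample& s, double executionSeconds) {
    return totalPowerWatts(s) * executionSeconds;
}
```

With placeholder readings of 0.5 W for the PS and 0.25 W for the PL and a 4 s run, the total power is 0.75 W and the energy is 3 J.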
Figs. 5 and 6 show performance, power and energy consumption when we
explore different chunk sizes for the FPGA (X axis) in our dynamic
scheduling strategy with a fixed matrix size of 1M elements. Note that
the CPU chunk sizes are determined adaptively, as explained in
Section 3.2. Different configurations are evaluated, and the number of
active CPU cores (CC) and FPGA compute units (FC) ranges from 0 to 4.
Fig. 5 shows the performance evaluation of the GEMM benchmark. The
heterogeneous configurations are the fastest for both Zynq and Zynq
Ultrascale. Overall, the Zynq Ultrascale configuration is up to 6.5
times faster than the Zynq device, and the highest performance is
achieved with 4 CPU cores and the 4 FPGA compute units working in
parallel.
Figure 5: Benchmarks performance analysis. Throughput (matrix
elements/s) versus block size (32, 64, 128, 256). (a) GEMM Zynq
Ultrascale: 4 FC, 4 FC + 1 CC, 4 FC + 2 CC, 4 FC + 3 CC, 4 FC + 4 CC,
1 CC, 2 CC, 3 CC, 4 CC. (b) GEMM Zynq: 1 FC, 1 FC + 1 CC, 1 FC + 2 CC,
1 CC, 2 CC.

Figure 6: Benchmarks power and energy. Total power (W) and energy (J)
per core count configuration. (a) GEMM Zynq Ultrascale. (b) GEMM Zynq:
1CC, 2CC, 1FC, 1CC+1FC, 2CC+1FC.

Fig. 6 compares the energy and power results for both systems. The
highest power usage of the Zynq Ultrascale device is 4.2 Watts, while
the Zynq uses 0.8 Watts. This means that power usage is 5.25 times
higher in the Zynq Ultrascale device, and this increase in power means
that the energy values are comparable in both devices. We believe that
as the Zynq Ultrascale compiler improves and larger configurations
become possible, the speed-up factor should increase and differences in
energy efficiency should be more noticeable. Initial results with a
matrix size of 16M elements show that the performance of the Zynq
platform drops from 500K to 50K matrix elements per second, while the
Zynq Ultra reaches 400K matrix elements per second, which is 8× higher.
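A back-of-the-envelope check shows why the energy values come out comparable for the 1M-element experiment. It combines the 5.25× power ratio with the speed-up of up to 6.5× reported for Fig. 5, and it assumes that the peak power and the peak speed-up are observed in the same configuration, which the measurements quoted in the text do not state explicitly. The subscripts ZU (Zynq Ultrascale) and Z (Zynq) are notation introduced here.

```latex
\[
\frac{E_{ZU}}{E_{Z}}
  = \frac{P_{ZU}\, t_{ZU}}{P_{Z}\, t_{Z}}
  = \frac{P_{ZU}/P_{Z}}{t_{Z}/t_{ZU}}
  \approx \frac{5.25}{6.5}
  \approx 0.81
\]
```

Under that assumption the larger device would use roughly 80% of the energy of the smaller one, so the speed-up only just offsets the higher power draw.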
6 CONCLUSION
This paper has presented initial results of a dynamic scheduler that
shares work on FPGA+CPU systems-on-chip, improving performance at the
same level of energy consumption. Two hybrid CPU+FPGA SoCs with
different CPU and FPGA microarchitectures and resources, programmed
with the same single-source programming model, are compared in terms of
performance, power and energy. The experiments show that a noticeable
performance gain can be achieved on both platforms with heterogeneous
computing. Heterogeneous configurations that allow the CPU cores to
collaborate with the FPGA reduce execution times by 25% to 50%. If the
objective is to minimize energy, then the heterogeneous versions tend
to be energy neutral, since the additional power required by the CPU
cores is compensated by the reduction in execution time. The more
powerful Ultrascale platform is significantly faster, but the
additional CPU and FPGA static and dynamic power suggests that it will
be necessary to achieve speed-ups higher than one order of magnitude to
observe meaningful energy savings. Future work includes the
generalization of the methodology to other benchmarks and larger
workloads, and exploring the additional PL-PS interfaces available in
the system.
ACKNOWLEDGMENT
This work was partially supported by Xilinx, the Spanish projects
TIN 2013-42253-P, P11-TIC-08144, TIN2013-46957-C2-1-P,
TIN2016-76635-C2-1-R, gaZ: T48 research group, and UK EPSRC with the
ENPOWER (EP/L00321X/1) and the ENEAC (EP/N002539/1) projects.
REFERENCES
[1] Leonardo Dagum and Ramesh Menon. OpenMP: an industry standard API
for shared-memory programming. Computational Science & Engineering,
IEEE, 5(1):46–55, 1998.
[2] R. Dolbeau, F. Bodin, and G. C. de Verdière. One OpenCL to rule
them all? In 2013 IEEE 6th International Workshop on Multi-/Many-core
Computing Systems (MuCoCoS), pages 1–6, Sept 2013.
[3] Pingfan Meng, Matthew Jacobsen, and Ryan Kastner. FPGA-GPU-CPU
heterogenous architecture for real-time cardiac physiological optical
mapping. In Intl. Conf. on Field-Programmable Technology, FPT'12,
pages 37–42, 2012.
[4] A. Navarro, A. Vilches, F. Corbera, and R. Asenjo. Strategies for
maximizing utilization on multi-CPU and multi-GPU heterogeneous
architectures. The Journal of Supercomputing, 2014.
[5] Prasanna Pandit and R. Govindarajan. Fluidic kernels: Cooperative
execution of OpenCL programs on multiple heterogeneous devices. In
Proceedings of Annual IEEE/ACM International Symposium on Code
Generation and Optimization, CGO '14, pages 273:273–273:283.
[6] S. Prongnuch and T. Wiangtong. Heterogeneous computing platform for
data processing. In 2016 International Symposium on Intelligent Signal
Processing and Communication Systems (ISPACS), pages 1–4, Oct 2016.
[7] James Reinders. Intel Threading Building Blocks: Outfitting C++ for
multi-core processor parallelism. O'Reilly, 2007.
[8] David C. Rudolph and Constantine D. Polychronopoulos. An efficient
message-passing scheduler based on guided self scheduling. In
Proceedings of the 3rd international conference on Supercomputing,
ICS '89, pages 50–61, 1989.
[9] Kuen Hung Tsoi and Wayne Luk. Axel: A heterogeneous cluster with
FPGAs and GPUs. In Proceedings of the 18th Annual ACM/SIGDA
International Symposium on Field Programmable Gate Arrays, FPGA '10,
pages 115–124, 2010.
[10] A. Vilches, A. Navarro, R. Asenjo, F. Corbera, R. Gran, and M. J.
Garzarán. Mapping streaming applications on commodity multi-CPU and GPU
on-chip processors. IEEE Transactions on Parallel and Distributed
Systems, 27(4):1099–1115, April 2016.
