Synchronization / communication techniques for OmpSs@FPGA by Vidal-Piñol, Miquel
Synchronization/Communication techniques for
OmpSs@FPGA
Master Thesis
for the award of the degree of
Master in Innovation and Research in
Informatics (MIRI)
specialized in
High Performance Computing (HPC)
Miquel Vidal Pin˜ol
(miquel.vidal@bsc.es)
Advisor Co-advisor
Daniel Jime´nez Gonza´lez Xavier Martorell Bofill
(djimenez@ac.upc.edu) (xavim@ac.upc.edu)
Computer Architecture Department (DAC)
October 24, 2017
Abstract
HPC machines are introducing more and more heterogeneity in their
architecture on the road to exascale systems. The increasing complexity of
the machines due to the variety of hardware architectures and accelerators
makes efficient programming a task harder than ever. Heterogeneous par-
allel programming models, such as OmpSs@FPGA, help the programmer
handle the most unfriendly parts of working with accelerators.
This master thesis analyzes the OmpSs@FPGA communication system
and proposes a set of techniques to overcome the problems related to it
and potentially improve the performance of the applications.
The results show that the techniques proposed speed up the applica-
tions under certain conditions and, most importantly, solves some of the
limitations that had the previous communication system. In particular, the
new techniques specially improve the explotation of fine-grain parallelism
and open the door to explore new possibilities with regard to data com-
munication and re-use.
Moreover, a tool (autoVivado) that automatically manages the process
of bitstream generation, from the synthesis of the HLS code to the gener-
ation of the device-tree, has been developed as part of this master thesis.
autoVivado has been fully integrated with the OmpSs@FPGA compiler in-
frastructure, providing the programmers a way to transparently generate
parallel heterogenous programs and bitstreams from OmpSs applications
that use FPGA accelerators.
Acknowledgments
I would like to thank my thesis advisor Dani for his valuable counsel,
moral support and GREAT patience during the writing of this master the-
sis, and my co-advisor Xavi and Carlos for their support and helpful tips.
To my friends and colleagues at BSC. Suffering is more bearable when
it’s shared. Thanks for letting me disconnect from work while we argue
where shall we eat or talking nonsense during lunch. Special thanks to
Antonio, Jaume and Ying for their technical support.
To Jancauts, the Wacken squad, the people from Kp, my roommates and
all my friends from Revistes. Thank you all.
I finalment, a la meva famı´lia. A l’Eula`lia, gra`cies per estar sempre alla`
i aguantar-me tot aquest temps; a la meva mare Rosa i al Jaume; al meu
germa` Ferran i a l’Ahinoam; a la meva a`via Gina; i en especial al meu pare
i al meu avi, alla` on sigueu: espero que estigueu orgullosos.
This work is partially supported by the European Union H2020 program
through the AXIOM project (grant ICT-01-2014 GA645496) and HiPEAC
(GA687698), by the Spanish Government through Programa Severo Ochoa
(SEV-2011-0067), by the Spanish Ministerio de Economı´a y Competitivi-
dad under contract Computacio´n de Altas Prestaciones VII (TIN2015-65316-
P), and the Departament d’Innovacio´, Universitats i Empresa de la Gener-
alitat de Catalunya, under project MPEXPAR: Models de Programacio´ i
Entorns d’Execucio´ Paral·lels (2014-SGR-1051).
Contents
List of Figures iii
List of Code snippets v
List of Tables vi
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Document organization . . . . . . . . . . . . . . . . . . . . . 4
2 Background 6
2.1 Field Programmable Gate Arrays . . . . . . . . . . . . . . . . 6
2.1.1 Xilinx Zynq-7000 SoC . . . . . . . . . . . . . . . . . . 7
2.2 OmpSs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 OmpSs heterogeneity: the target construct . . . . . 11
2.2.2 OmpSs@FPGA ecosystem . . . . . . . . . . . . . . . . 12
3 Related work 17
3.1 OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2 OpenMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3 OpenACC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4 Design Space Exploration 20
4.1 Analysis of PS/PL communication interfaces . . . . . . . . . 20
4.2 OmpSs@FPGA instrumentation analysis . . . . . . . . . . . . 23
4.3 Pending tasks limitation . . . . . . . . . . . . . . . . . . . . . 25
5 Our Proposals: Synchronization and Communication Techniques 26
5.1 Synchronous data transfer offloading . . . . . . . . . . . . . . 26
5.1.1 Synchronization and communication protocols . . . . 29
i
5.2 Asynchronous task management . . . . . . . . . . . . . . . . 31
5.2.1 Communication protocol . . . . . . . . . . . . . . . . 32
5.3 Task Batch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.3.1 Synchronization and communication protocols . . . . 37
5.4 Asynchronous task management with task batch capability . 40
6 OmpSs@FPGA toolchain 42
6.1 Mercurium modifications . . . . . . . . . . . . . . . . . . . . 43
6.1.1 Wrapper generation . . . . . . . . . . . . . . . . . . . 43
6.1.2 New Compier and Linker Flags . . . . . . . . . . . . . 49
6.2 autoVivado: Automatic bitstream generation . . . . . . . . . 50
6.2.1 Configuration file . . . . . . . . . . . . . . . . . . . . . 51
6.2.2 Generation step 0: HLS . . . . . . . . . . . . . . . . . 52
6.2.3 Generation step 1: Design . . . . . . . . . . . . . . . . 53
6.2.4 Generation step 2: Synthesis . . . . . . . . . . . . . . 55
6.2.5 Generation step 3: Implementation . . . . . . . . . . . 55
6.2.6 Generation step 4: Bitstream . . . . . . . . . . . . . . 55
6.2.7 Generation step 5: Device tree . . . . . . . . . . . . . 55
6.3 Runtime modifications . . . . . . . . . . . . . . . . . . . . . . 56
6.3.1 Asynchronous submissions of ready tasks . . . . . . 56
6.3.2 Dynamic creation of task batch and submission . . . 57
7 Evaluation 58
7.1 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . 58
7.2 Performance analysis . . . . . . . . . . . . . . . . . . . . . . . 60
7.2.1 Small-sized accelerators . . . . . . . . . . . . . . . . . 61
7.2.2 Medium-sized accelerators . . . . . . . . . . . . . . . 65
7.2.3 Large-sized accelerators . . . . . . . . . . . . . . . . . 69
8 Conclusions and Future Work 74
Acronyms 76
References 77
ii
List of Figures
2.1 SRAM-based FPGA architecture . . . . . . . . . . . . . . . . 6
2.2 Xilinx Zynq PS/PL interfaces block diagram . . . . . . . . . 8
2.3 OmpSs@FPGA compilation flow without bitstream genera-
tion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 OmpSs@FPGA data transfers through DMA . . . . . . . . . 15
2.5 OmpSs@FPGA hardware instrumentation support . . . . . . 16
3.1 OpenCL memory model . . . . . . . . . . . . . . . . . . . . . 18
4.1 PS/PL interfaces read benchmark . . . . . . . . . . . . . . . . 22
4.2 PS/PL interfaces write benchmark . . . . . . . . . . . . . . . 23
4.3 Original FPGA task communication model . . . . . . . . . . 24
4.4 Improved FPGA task communication model . . . . . . . . . 24
5.1 Delayed accelerator computation due to data streaming . . . 27
5.2 OmpSs@FPGA data transfer offloading . . . . . . . . . . . . 28
5.3 Accelerator overlapping communication and computation . 29
5.4 Data transfer offload execution diagram . . . . . . . . . . . . 29
5.5 Task argument struct . . . . . . . . . . . . . . . . . . . . . . . 30
5.6 Task information struct . . . . . . . . . . . . . . . . . . . . . . 31
5.7 Block design of the Asynchronous Task Manager . . . . . . . . 32
5.8 Asynchronous task management execution diagram . . . . . 33
5.9 Asynchronous ready task struct . . . . . . . . . . . . . . . . . 34
5.10 FSM of the asynchronous Ready Task Manager . . . . . . . . . 34
5.11 readyQueue BRAM structure . . . . . . . . . . . . . . . . . . . 35
5.12 Asynchronous finished task struct . . . . . . . . . . . . . . . 36
5.13 FSM of the asynchronous Finished Task Manager . . . . . . . . 36
5.14 Task batch structure . . . . . . . . . . . . . . . . . . . . . . . . 37
5.15 Task batch execution diagram . . . . . . . . . . . . . . . . . . 38
5.16 Task batch header . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.17 Task header . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.18 FSM of the Task Batch Manager . . . . . . . . . . . . . . . . . . 40
iii
5.19 Asynchronous task batch execution diagram . . . . . . . . . 41
6.1 OmpSs@FPGA compilation flow . . . . . . . . . . . . . . . . 43
6.2 Screenshot of autoVivado help command output . . . . . . . 51
7.1 Small-sized accelerator standalone performance . . . . . . . 61
7.2 Small-sized accelerator standalone speedup . . . . . . . . . . 62
7.3 Small-sized accelerator OmpSs performance . . . . . . . . . 63
7.4 Small-sized accelerator OmpSs speedup . . . . . . . . . . . . 64
7.5 Comparison of the new techniques in small-sized accelerators 64
7.6 Medium-sized accelerator standalone performance . . . . . 65
7.7 Medium-sized accelerator standalone speedup . . . . . . . . 66
7.8 Medium-sized accelerator OmpSs performance . . . . . . . . 67
7.9 Medium-sized accelerator OmpSs speedup . . . . . . . . . . 68
7.10 Comparison of the new techniques in medium-sized accel-
erators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
7.11 Large-sized accelerator standalone performance . . . . . . . 70
7.12 Large-sized accelerator standalone speedup . . . . . . . . . . 70
7.13 Large-sized accelerator OmpSs performance . . . . . . . . . 71
7.14 Large-sized accelerator OmpSs speedup . . . . . . . . . . . . 72
7.15 Comparison of the new techniques in large-sized accelerators 73
iv
List of Code snippets
2.1 OmpSs application example . . . . . . . . . . . . . . . . . . . 10
2.2 OmpSs@FPGA application example . . . . . . . . . . . . . . 12
6.1 OmpSs@FPGA code with Vivado HLS directives . . . . . . . 44
6.2 OmpSs@FPGA HLS wrapper header . . . . . . . . . . . . . . 45
6.3 HLS wrapper reads the task info header . . . . . . . . . . . . 45
6.4 OmpSs@FPGA HLS wrapper arguments reading . . . . . . . 46
6.5 OmpSs@FPGA HLS wrapper computation . . . . . . . . . . 47
6.6 OmpSs@FPGA HLS wrapper writes out dependencies . . . 47
6.7 OmpSs@FPGA HLS wrapper sends finished signal . . . . . . 48
6.8 OmpSs@FPGA HLS wrapper hardware instrumentation . . 49
6.9 autoVivado configuration file . . . . . . . . . . . . . . . . . . 52
v
List of Tables
7.1 Communication savings when exploiting small-sized accel-
erators cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7.2 Communication savings when exploiting medium-sized ac-
celerators cache . . . . . . . . . . . . . . . . . . . . . . . . . . 66
7.3 Communication savings when exploiting large-sized accel-
erators cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
vi
Chapter 1
Introduction
As the race to reach the exascale computing continues, the use of hard-
ware accelerators and co-processors has been steadily increasing for the
past years. In the last Top500 list1, almost 1 of every 5 systems uses an
heterogeneous architecture with some kind of accelerator/co-processor,
mainly Nvidia GPUs and Intel Xeon Phis; and half of the Top10 systems
boost their performance using accelerators.
The utilization of hardware accelerators, such as the aforementioned GPUs
and manycore processors for parallel processing, has proven beneficial
both in performance and energy efficiency terms. Although the standalone
performance of these technologies, as well as their interconnection with
CPUs in heterogeneous architectures, has greatly improved in recent years,
they seem to fall short for the ambitious project of building an exascale ma-
chine.
The next big thing in the accelerators community are more application-
specific alternatives such as Field Programmable Gate Arrays (FPGAs)
and Application-Specific Integrated Circuits (ASICs), that have a better
performance per watt than their current counterparts. ASICs provide the
best energy-efficiency and application-specific performance in exchange
for cost, production time and reprogrammability. On the other hand, FP-
GAs sacrifice part of the efficiency and performance for a more balanced
trade-off between these factors.
One of the main characteristics of FPGAs, that make them an attractive al-
ternative, is the ability to reprogram their hardware. However, this acts as
1June 2017 - https://www.top500.org/lists/2017/06/
1
a double-edged sword, as they require the programmer to know a Hard-
ware Description Language (HDL), which are not widely known, and to
spend a lot more time developing the logic than it would be necessary if
using a high-level language.
To ease the burden of reprogrammability, High-Level Synthesis (HLS) com-
pilers transform code written in certain high-level languages, normally
C/C++, into an HDL equivalent. Additionally, some parallel program-
ming models, such as OpenMP [1] or OmpSs [2], have added support for
heterogeneous architectures in order to transparently take care of hardware-
software communication and runtime execution; while other program-
ming models, e.g. OpenCL [3] and OpenACC [4], were already designed
with heterogeneity in mind.
In this work, the OmpSs ecosystem [2] is improved by analyzing and im-
plementing new techniques for synchronization and communication be-
tween the CPU and the FPGA in the OmpSs programming model. In addi-
tion, the current OmpSs ecosystem has been fully extended with hardware
generation at compile time from C code, integrating all synchronization
and communication techniques proposed in this work.
1.1 Motivation
FPGAs implement their logic in arrays of Look-Up Tables (LUTs), which
limit the complexity and magnitude of the designed hardware. Sometimes
it is not possible to implement an entire function into hardware because it
does not fit the physical limitations of the FPGA.
The usual practice to avoid this is to reduce the computational complexity
and/or the data requirements of the accelerated function. Implementing a
blocked version of the function algorithm reduces both factors, shrinking
the size of the accelerator and leading to better and faster compilation pro-
cesses. Moreover, fine-grain accelerators give the implemented hardware
more flexibility to work with workloads of different sizes, and increase the
application potential parallelism.
When dealing with blocked algorithms, the ideal is to maximize FPGA
occupation by having multiple accelerators working concurrently, over-
lapping data communication and execution.
2
Thus, in this multi-accelerator scenario, data communication and synchro-
nization with the CPU are two vital factors to achieve an optimal perfor-
mance.
1.2 Objectives
The main objective of this master thesis is to improve the communication
and synchronization between FPGA and CPU of the OmpSs programming
model ecosystem for FPGA (OmpSs@FPGA). In addition to this, an impor-
tant objective is to develop a fully transparent hardware generation of the
bitstream from a C code, including the support for the new techniques for
communicating and synchronizing. Those objectives are divided in four
sub-objectives:
• Analyze the current hardware/software communication paradigm
• Propose and implement new techniques to improve communication
and minimize synchronization with the CPU
• Develop an automatic tool to allow both:
– The integration of the Mercurium source-to-source with the Xil-
inx toolchain, automatically generating a bitstream from C source
code
– Automatic integration of the new mechanisms of communica-
tion and synchronization with the original accelerators
• Evaluate the performance of the new techniques
By improving data communication and synchronization with the CPU,
we expect to reduce overall execution times as well as obtain better over-
lapping between the different accelerators. Another important aspect of
our proposal techniques is that those overcome the limitation of the Linux
driver of accepting several Direct Memory Access (DMA) submissions.
1.3 Contributions
This master thesis has contributed to 4 conference papers and 2 journal
articles:
3
• The AXIOM software layers. Microprocessors and Microsystems, 2016.
[5]
• General Purpose Task-Dependence Management Hardware for Task-Based
Dataflow Programming Models. IEEE International Parallel and Dis-
tributed Processing Symposium (IPDPS), 2017. [6]
• Picos, A Hardware Task-Dependence Manager for Task-Based Dataflow
Programming Models. International Conference on High Performance
Computing & Simulation (HPCS), 2017. [7]
• Implementation of the K-means algorithm on heterogeneous devices: a use
case based on an industrial dataset. ParaFPGA: Parallel Computing
with FPGAs, 2017. [8]
• The AXIOM Project: IoT on Heterogeneous Embedded Platforms. IEEE
Design & Test. Manuscript submitted for publication. [9]
• Exploiting Parallelism on GPUs and FPGAs with OmpSs. 1st Work-
shop on AutotuniNg and aDaptivity AppRoaches for Energy effi-
cient HPC Systems, 2017. Manuscript in preparation. [10]
In addition, parts of this master thesis are being used on the H2020 AX-
IOM2 project, on the BSC-Ikergune industrial collaboration project and on
the subjects Programacio´ Conscient de l’Arquitectura (PCA-GRAU) and Su-
percomputing for Challenging Applications (SCA-MIRI) of the Facultat d’Infor-
ma`tica de Barcelona (FIB).
It is also planned its use on the H2020 ExaNoDe3, EuroExa4 and TANGO5
projects.
1.4 Document organization
The rest of the document is organized as follows. Chapter 2 introduces
the background of the main elements that conform this thesis: FPGAs
2http://axiom-project.eu/
3http://exanode.eu/
4http://euroexa.eu/
5http://tango-project.eu/
4
and OmpSs. Chapter 3 reviews how other parallel programming mod-
els have addressed the hardware/software communication issue. Chap-
ter 4 presents an analysis of the OmpSs@FPGA ecosystem at a communi-
cation level. Chapter 5 describes the new techniques proposed and some
implementation details. Chapter 7 presents the experimental results. Fi-
nally, Chapter 8 summarizes the main conclusions of this master thesis
and presents an outlook for future work.
5
Chapter 2
Background
2.1 Field Programmable Gate Arrays
A Field Programmable Gate Array (FPGA) is a type of integrated circuit,
structured as a two-dimensional array of logic blocks, whose logic can
be configured after manufacturing. In each logic block, a small memory,
called Look-Up Table (LUT), stores the truth table of an arbitrary n-input
boolean function. The logic blocks communicate between them through a
configurable interconnect network. The array is bordered by a set of I/O
blocks that connect the internal FPGA logic with the I/O pins.
These three elements conform the basic architecture of an FPGA, how-
ever modern implementations also include non-programmable logic such
as on-chip memory blocks, Digital Signal Processing (DSP) blocks, serial
transceivers or analog-to-digital converters.
(a) SRAM cells implementing a LUT (b) SRAM interconnect multiplexer
Figure 2.1: SRAM-based FPGA architecture
6
Most FPGAs base their designs on rewritable memories, such as Static
Random-Access Memory (SRAM) or flash, so they can be programmed
more than once. The boolean function computed by each block, as well as
the interconnection pattern, are defined in a bitstream which is loaded into
the FPGA. This bitstream overwrites the values of the SRAMs, produc-
ing a different boolean function (Figure 2.1a) or interconnect path (Figure
2.1b).
In the following section, we are going to review the FPGA System-on-Chip
(SoC) used in this project: the Xilinx Zynq-7000 SoC.
2.1.1 Xilinx Zynq-7000 SoC
The Zynq-7000 SoC family integrates a Processing System (PS) based on
the 28nm ARMv7 32-bit Cortex-A9 and a Xilinx Programmable Logic (PL)
in a single device.
To analyze and prototype the different communication/synchronization
techniques we used the Xilinx Zynq-7000 ZC706 development board, fea-
turing a dual-core ARM Cortex-A9 and a Xilinx Kintex-7 FPGA.
The main characteristics of the FPGA are the following:
• 218600 LUTs.
• 19.2Mb of Block RAM capacity in 545 36Kb blocks.
• 900 DSP slices.
The communication between the PS and the PL is done using the ARM
AMBA Advanced eXtensible Interface (AXI) protocol. The board offers
three communication interfaces with different characteristics: the General
Purpose (GP) interface, the High-Performance (HP) interface and the Ac-
celerator Coherency Port (ACP) interface. The internal paths of each inter-
face can be seen in Figure 2.2.
The GP port provides non-coherent 32-bit access to the Double Data Rate
(DDR) memory and the on-chip RAM. A pair of master interfaces (dark
green) connected to slaves residing in the PL allow the CPU to initiate
read/write transactions. Analogously, two slave interfaces (light green)
are connected to masters residing in the PL, which can initiate transac-
tions from the FPGA.
7
Four slave HP ports (red) provide non-coherent 64-bit access to the DDR
memory and the on-chip RAM.
One slave ACP port (cyan) provides optionally-coherent 64-bit access to
the DDR memory and the on-chip RAM. Cache coherency can be enabled
or disabled by the programmer, writing into a control register.
Figure 2.2: Xilinx Zynq PS/PL interfaces block diagram
8
The last two interfaces, HP and ACP, are more data-oriented interfaces,
implementing full-duplex buses that can transfer up to 16 bytes of data
at each clock cycle. GP ports are designed to send low-latency control sig-
nals, although they can be used to move data around if there is the need to.
The peak performance and suitability of the different PS/PL interfaces is
described in Chapter 4.
2.2 OmpSs
The OmpSs programming model, developed by the Programming Models
group at Barcelona Supercomputing Center (BSC), is an effort to integrate
different features from the StarSs programming model family and push
them into the OpenMP standard. It is composed of the Mercurium source-
to-source compiler [11] and the Nanos++ runtime.
Its name derives from the combination of the names of the two program-
ming models from which it takes its main design principles: OpenMP 3.0
and StarSs.
OmpSs inherits from OpenMP 3.0 the philosophy to develop parallel code:
start from a sequential program and introduce certain annotations on the
code, called directives, to guide the compiler on the creation of a parallel
version of the code.
Diverging from the initial versions of OpenMP 3.0, which use a fork-join
model, OmpSs uses a thread-pool model. It provides the task directive to
specify regions of code that identify a unit of work, which will eventually
be executed when the runtime sees fit. This approach is specially benefi-
cial when trying to accommodate irregular applications.
Strongly influencing recent OpenMP releases, OmpSs task directive al-
lows the programmer to specify data dependencies and to map the execu-
tion of certain tasks to a specific type of accelerator.
With these data dependencies, the OmpSs runtime generates a depen-
dency graph containing all the tasks created and schedules them taking
into account the given dependencies, relieving the programmer of the bur-
9
den of task scheduling. This significantly contrasts with OpenMP 3.01,
where the programmer has to explicitly express how the parallel code has
to be executed and synchronized between the different parts.
Code snippet 2.1 contains a simple example of a vector multiplication ap-
plication that uses the OmpSs task directive.
1 #pragma omp task in(v a[0:SIZE-1], v b[0:SIZE-1]) out(v c[0:SIZE-1])
2 void vector mult ( f l o a t ∗v a , f l o a t ∗v b , f l o a t ∗v c ) {
3 i n t i ;
4 for ( i = 0 ; i < SIZE ; i ++) {
5 v c [ i ] = v a [ i ] * v b [ i ] ;
6 }
7 }
8
9 i n t main ( ) {
10 f l o a t A[ SIZE ] ;
11 f l o a t B [ SIZE ] ;
12 f l o a t C[ SIZE ] ;
13 i n i t i a l i z e v e c t o r (A, SIZE ) ;
14 i n i t i a l i z e v e c t o r ( B , SIZE ) ;
15 vector mult (A, B , C ) ;
16 #pragma omp taskwait
17 }
Code snippet 2.1: OmpSs application example
Each function call to vector mult will correspond to a task creation with
two input dependencies (v a and v b) and one output dependence (v c), all
of them of size SIZE. In the case of the code, function call at line 15 will
correspond to a task creation with the same code of the function. This task
has A and B as input data, and C as the output result. Line 16, with the
taskwait directive, will wait for all task created so far, directly by the
current task (parent), to finish, in this case vector mult task.
In general, if there is more than one task with input and output depen-
dencies, the runtime will take care of the correct order of the tasks based
in the issue time and their dependencies.
1Since OpenMP 4.0, OpenMP includes tasks with input and output dependencies,
strongly influenced by OmpSs works.
10
2.2.1 OmpSs heterogeneity: the target construct
Heterogeneity in OmpSs is handled through the target construct. It is
used to specify that a given unit of work can be run in a set of devices. The
construct can be applied to either a task construct or a function definition,
and includes a set of clauses that allows the programmer to provide fur-
ther information about the associated code.
The set of clauses currently implemented, along with their meaning, are
the following:
• device(device-name-list) - Specifies the devices where a task
can be offloaded to. If no device-name is specified, the default smp is
used and it is assumed that the associated code will run on a homo-
geneous shared-memory multicore architecture. Current supported
devices also include opencl, cuda and fpga.
• copy in(list-of-variables) - Specifies that the shared vari-
ables in list-of-variables will have to be copied from the host to the
device memory before the associated code can be executed. It is ig-
nored for smp.
• copy out(list-of-variables) - Analogous to the preceding
clause, it specifies a set of shared variables that will be copied from
the device memory to the host, once the associated code has finished
execution.
• copy inout(list-of-variables) - A combination of the previ-
ous two clauses.
• copy deps - Specifies that any dependence clause attached to the
associated code will have copy semantics. Any in will be treated as
copy in, out as copy out and inout as copy inout.
• implements(function-name) - Specifies that the associated code
is an alternative implementation of function-name for the target
device. This alternative can be used instead of the original, if the
runtime considers it appropriate.
• onto(acc id,num instances) - Exclusive to fpga device. Spec-
ifies that the associated code will be synthesized into an FPGA accel-
erator identified by acc id, of which num instances instances will be
created.
11
Following the example started in Code snippet 2.1, one can easily make an
heterogeneous version of the application by adding a single line of code
(line 1), as shown in Code snippet 2.2.
1 #pragma omp target device(fpga,smp) copy deps onto(0,1)
2 #pragma omp task in(v a[0:SIZE-1], v b[0:SIZE-1]) out(v c[0:SIZE-1])
3 void vector mult ( f l o a t ∗v a , f l o a t ∗v b , f l o a t ∗v c ) {
4 i n t i ;
5 for ( i = 0 ; i < SIZE ; i ++) {
6 v c [ i ] = v a [ i ] * v b [ i ] ;
7 }
8 }
9
10 i n t main ( ) {
11 f l o a t A[ SIZE ] ;
12 f l o a t B [ SIZE ] ;
13 f l o a t C[ SIZE ] ;
14 i n i t i a l i z e v e c t o r (A, SIZE ) ;
15 i n i t i a l i z e v e c t o r ( B , SIZE ) ;
16 vector mult (A, B , C ) ;
17 #pragma omp taskwait
18 }
Code snippet 2.2: OmpSs@FPGA application example
Additionally, we can take advantage of the OmpSs@FPGA ecosystem to
fully exploit the FPGA capabilities.
2.2.2 OmpSs@FPGA ecosystem
The development state of the OmpSs@FPGA ecosystem has changed over
the course of this master thesis. Nonetheless, the initial state of the ecosys-
tem is the one described here.
The OmpSs@FPGA ecosystem is able to generate the code that runs on
the host machine with FPGA support (Figure 2.3), however there is not
support to neither generate the HLS code nor the FPGA bitstream. Hard-
ware/software communication is not handled transparently either.
Mercurium compiler analyzes the input source code and detects two main
parts: the host code and the FPGA code.
The host code is transformed to include calls to the Nanos++ runtime to
12
spawn tasks and perform the data movements specified with the copy (in,
out, inout) clauses. A specially designed DMA library implements the
data transfers to and from the FPGA. Then, it is compiled using the GCC
compiler to run on the ARM cores of the SoC.
Figure 2.3: OmpSs@FPGA compilation flow without bitstream generation
On the other hand, Mercurium ignores the FPGA code, which has to be en-
tirely written by the programmer, including all HLS directives regarding
communication interfaces. Hardware design and synthesis and the gen-
eration of the bitstream through Vivado are also part of the programmers
duties.
In this initial approach data communication is based on the AXI-Stream
protocol. A DMA engine is necessary to transform the memory-mapped
13
data from the CPU to streaming data to the accelerator, and vice versa. The
schema of the DMA transfers is shown in the diagram of Figure 2.4.
The CPU has to program the DMA transfer by writing to certain regis-
ters of the DMA engine. Data is read from the CPU memory and fed into
the accelerator. Analogously, data read from the accelerator is written into
the CPU memory. Both reads and writes are done in a sequential manner.
Once either operation is completed, the DMA engine issues a hardware
interrupt to inform the CPU.
(a) OmpSs@FPGA send data to accelerator
14
(b) OmpSs@FPGA retrieve data from accelerator
Figure 2.4: OmpSs@FPGA data transfers through DMA
Communication is handled on the host side by the Nanos++ runtime
through calls inserted by the compiler. On the FPGA side, however, the
programmer has to know how many variables and in which order have to
be read or written, and insert the corresponding DMA calls in the FPGA
code. This error-prone process usually implies several iterations of the
compilation flow to amend those errors, increasing the already-long com-
pilation time.
The OmpSs@FPGA ecosystem also includes instrumentation support to
monitor the accelerator performance. It is based on the OMPT standard,
which is a performance-monitoring interface that is being considered for
integration into the OpenMP standard.
Extensions were developed for the Nanos++ runtime and the Extrae in-
strumentation package to comply with the OMPT standard [12].
To support the instrumentation, a hardware timer was implemented. The
timer stores its current value into a BRAM memory inside the FPGA, ac-
cessible through a BRAM controller. The controller is mapped to a physi-
cal memory accessible from anywhere in the system, either from the CPU
or from the accelerators.
15
Figure 2.5 shows a picture of a Vivado hardware design with an accelera-
tor, the hardware timer, the CPU and their interconnection with the BRAM
controller.
Figure 2.5: Hardware design of an IP accelerator with hardware instru-
mentation support
16
Chapter 3
Related work
In this section we are going to review how the main programming models
targeting FPGA accelerators tackle the issue of communication and syn-
chronization with the processor.
3.1 OpenCL
OpenCL implementations [13, 14, 15] mimic the memory hierarchy ap-
proach it uses for GPUs (Figure 3.1) and describe four different levels of
memories: host, global + constant, local and private.
Host memory is any memory connected to the host processor; global + con-
stant memory are memory chips physically connected to the FPGA which
are also accessible by the processor; local memory is memory inside the
FPGA device implemented using block RAM elements which is accessible
by all accelerator Intellectual Propertys (IPs); private memory is memory
inside the FPGA implemented using registers to minimize latency.
OpenCL FPGA kernels store their variables in local memory by default
and the programmer has to explicitly write which variables have to be in
private memory.
Data present on the host machine is stored in the host memory by default.
To move data to the FPGA, the programmer must allocate space on the
global memory, enqueue the write and read commands on the command
queue and ensure that the copies finish correctly, either by issuing a block-
ing read/write command or waiting for the command event to finish.
17
The communication is done using DMA engines and is initiated by the
processor. The FPGA device has a passive role in this communication
model, not initiating any communication back to the CPU and writing
data only to the innermost levels of the memory hierarchy.
Figure 3.1: OpenCL memory model
3.2 OpenMP
OpenMP target directive was introduced in version 4.0 of the standard
and refined in the latest 4.5 release [16]. There has been some works
[17, 18] implementing the standard with similar CPU-FPGA communica-
tion approaches.
In these implementations the FPGA is understood as a passive accelera-
tor, and all the communication is started by the host CPU. As in a GPU,
local variables of accelerated functions are directly stored in the device
memory; while the rest of the data is allocated in CPU memory and trans-
ferred to the FPGA through DMA engines.
In both implementations data communication is done transparently to the
programmer. On [18] is handled by the Thread-PoolComposer (TPC), a
18
toolchain that is responsible for the synthesis of the hardware accelerators
from kernel code, as well as their invocation and communication in exe-
cution time. On [17], the prototype runtime system is the one in charge of
programming the DMA data transfers when the accelerator IP needs it.
3.3 OpenACC
Current OpenACC implementation for code targeting FPGA [19] relies
on the source-to-source OpenARC compiler which transforms OpenACC
code to an OpenCL equivalent code which is in turn compiled by the Intel
OpenCL compiler [14].
Consequently, data communication follows the model already explained
in Section 3.1.
19
Chapter 4
Design Space Exploration
In order to be able to propose a meaningful set of improvements to the
communication paradigm used, a set of analysis of the aforementioned
paradigm was performed. The results were obtained for a complete set
of benchmarks executed both standalone (i.e. without any runtime system
running) and using the OmpSs programming model.
The analysis were performed using the Xilinx Zynq-7000 ZC706 board
running a Linaro Ubuntu 14.04 and a 4.6 Linux kernel. Communication
between the CPU and the FPGA was done using the DMA library devel-
oped at BSC and the Xilinx DMA driver.
4.1 Analysis of PS/PL communication interfaces
We wanted to know if the choice of communication interface and protocol
affects the overall communication performance. To do so we designed two
benchmarking IP cores using Vivado HLS. One based on the AXI protocol
and the other on the AXI-Stream protocol.
The AXI protocol is a memory mapped one, involving the concept of a
target address and a certain amount of data to be transferred. It com-
municates directly with the DDR controller and is specially useful when
working within an heterogeneous system since the FPGA has a more ho-
mogeneous view of the memory system and can manage data through
memory addresses, although physical ones.
The AXI-Stream protocol acts as a single unidirectional channel where
data is received in a dataflow manner. There is no concept of address
20
space, thus no memory addresses are present, and, when used in con-
junction with a CPU, requires the utilization of DMA engines to convert
memory mapped data to stream data.
The IP cores either read (move data from PS to PL) or write (move data
from PL to PS) a number of 64-bit words. The IP core based on the AXI
protocol, issues petitions directly to the DDR controller inside the Zynq
processor, while the one based on the AXI-Stream protocol uses an AXI
DMA IP core to move data.
Several instances of the IPs were added in the bitstream and connected
to a port of each of the interfaces described in Section 2.1.1. A program
running in the Linux system inside the ARM cores tested the different in-
terfaces by telling the IPs to read or write an incremental number of 64-bit
words, sweeping from 20 to 220, i.e. from 8 bytes to 8MB, which is the max-
imum length that can be moved in a single DMA transfer.
AXI interfaces, whether memory mapped or streaming, are able to move
a word of data each clock cycle. Word size is variable and is user-defined.
On our analysis we set the word size to the maximum for each interface:
64-bit for HP and ACP and 32-bit for GP. Our FPGA was working at a
clock speed of 100MHz, so the expected peak performance was approxi-
mately 763MB/s for HP and ACP and 381MB/s for GP.
(a) AXI protocol
21
(b) AXIS protocol
Figure 4.1: Performance of PS/PL interfaces in a read benchmark.
Figure 4.1 shows the reading performance of the PS/PL interfaces using
both AXI and AXI-Stream protocols. Vertical axis shows MB/s based on
the number of 64-bit words read (x-axis).
As expected, the interface with the highest bandwidth is HP (blue col-
umn), achieving near-optimal bandwidth when moving significant
amounts of data; closely followed by ACP with coherency disabled (yel-
low column). GP (green column) also achieves near-optimal bandwidth
but, since it uses a 32-bit data bus, the bandwidth achieved is half the HP
one. Enabling coherency (orange column) reduces ACP bandwidth more
than 50%. We could not observe any substantial difference between AXI
and AXI-Stream protocols.
(a) AXI protocol
22
(b) AXIS protocol
Figure 4.2: Performance of PS/PL interfaces in a write benchmark.
As for the write benchmark, in Figure 4.2 we can observe that there is no
substantial difference in GP, HP and non-coherent ACP compared to the
read benchmark; but there is a performance boost in coherent ACP writes.
From these results we can conclude that the protocol used does not af-
fect the performance of the interface, however the performance is greatly
affected by the amount of data moved.
Choosing the right interface is vital when dealing with communication-
intensive kernels, as it can yield huge improvements in performance. It
is also important to know whether coherency is a needed feature or not,
since we have seen that using a coherent interface can decrease the com-
munication performance.
4.2 OmpSs@FPGA instrumentation analysis
Running a matrix multiply kernel instrumented with the hardware instru-
mentation support mentioned previously, we were able to monitor the be-
haviour of the accelerator.
Figure 4.3 shows an Extrae trace with several FPGA task executions of
a 64x64 matrix multiply kernel. We divided the accelerator execution in
three main parts, instrumenting the code accordingly: the input DMA
transfers (DMA in), the kernel execution (FPGA acc) and the output DMA
transfers DMA out). DMA submit corresponds to the DMA submit done
by the CPU.
23
One different DMA submit is required per task argument copy, in or out,
before the corresponding DMA transfer starts. This effectively delays the
start of the kernel execution because the accelerator has to wait between
each DMA submit.
Figure 4.3: Extrae trace of an OmpSs@FPGA application submitting each
task argument copy separately
In order to reduce this difference, the Nanos++ runtime was modified to
provide the necessary information of the copies (in and outs) to the ac-
celerator, in just one DMA submit. With this unique submit, the FPGA
accelerator can start all the necessary DMA copies without having to wait
for each DMA submit.
Figure 4.4 shows how having a single DMA submit improves FPGA com-
munication due to shortening the waiting time in the DMA transfers. In
the same time span, five full tasks could be run in the original version
while, in the improved version, six full tasks could be run.
Figure 4.4: Extrae trace of an OmpSs@FPGA application submitting all
task argument copies together
24
4.3 Pending tasks limitation
In several months of development of the OmpSs@FPGA ecosystem we
run hundreds of tests and applications in several environments. We per-
formed experiments with a multitude of implementations of both the run-
time and the compiler. We also used a fair amount of different FPGA
boards, Linux kernels and DMA drivers.
During all this time, we were able to observe a strange behaviour related
to the number of active DMA transfers. Sometimes, specially when the
number of active DMA transfers was high, some of the transfers got lost
and did not actually take place. When this happened, it certainly meant
that the results were incorrect, because some input and/or output trans-
fers did not occur.
We suspected of a race condition and were able to narrow the problem
to either the Linux kernel or the DMA driver. However we were unsuc-
cessful on the search of the bug.
In order to avoid losing DMA transfers, we limited the amount of pend-
ing tasks we had at a given time, to reduce the number of active DMA
transfers and the probability of triggering the bug. After several tests, we
saw that the problem did not occur when having at most 4 pending FPGA
tasks, and set the limit accordingly.
25
Chapter 5
Our Proposals: Synchronization
and Communication Techniques
In this chapter we are going to review the different synchronization and
communication techniques that we proposed based on the analysis previ-
ously performed.
5.1 Synchronous data transfer offloading
In Chapter 4 we analyzed the available communication protocols, AXI and
AXI-Stream, and saw no difference in bandwidth between them. How-
ever, we did observe that using the AXI-Stream protocol can lead to de-
lays on the start of the computation, due to synchronization issues with
the CPU when issuing more than one input DMA transfer.
Moreover, using the AXI-Stream protocol has some implications that were
not perceivable in the benchmarks. In this section, we analyze those impli-
cations and propose the best synchronization and communication mecha-
nism between the FPGA accelerator and the CPU.
Basing the data communication in the AXI-Stream protocol implies the
utilization of DMA engines to convert the memory mapped data in the
CPU to streaming data. While the use of DMA engines is not bad per se,
the data is sent by each engine in a sequential manner.
Therefore, in accelerators with several data dependencies, the accelerator
would be stalled until all data has been copied to the FPGA, because the
HLS compiler has no room for optimization since the data is sent sequen-
26
tially.
The Xilinx Vivado HLS compiler has an optimization phase where it ana-
lyzes the code to look for potential modifications in the accelerator logic to
reduce the overall execution latency. For example, it can detect situations
where the computation can be overlapped with data communication.
Indeed, it includes a tool to analyze the performance of the synthesized
code. Figure 5.1 shows the performance analysis of a matrix multiply ac-
celerator using AXI-Stream-based data communication. On the first row
we have the control steps (not clock cycles) that guide the performance
and on the second column we have the parts in which is divided the exe-
cution. Each of these parts comprises a number of logic operations, num-
bered in the first column, and can be understood as a set of hardware logic
that implements a given function.
The reception of the matrices takes the first 6 control steps (C1-C6), two for
each matrix, effectively delaying the start of the computation until control
step C7. In this particular accelerator, each of the read matrix functions has
a latency of 1024 clock cycles, while the whole matrix multiplication takes
1156 clock cycles.
Figure 5.1: Delayed accelerator computation due to data streaming
One workaround to this problem is to instantiate a DMA engine for every
variable that has to be copied to the FPGA in order for them to work con-
currently and let the HLS compiler optimize the code.
This, however, has several implications:
• More FPGA resources used
• Each DMA engine uses two hardware interrupts, which are very lim-
ited (only 16 on Zynq devices)
• More overhead for the OmpSs@FPGA runtime, having to program
transfers in different DMAs
27
• Not having a static number of DMA engines for each accelerator, and
thus an homogeneous communication structure, makes them more
difficult to manage
A better solution that does not suffer any of these inconveniences is the
first modification that we proposed: to offload the data transfers to the
FPGA and use the memory mapped AXI protocol to copy data.
A single DMA transfer, with all the data dependencies addresses, is now
sent from the CPU, after which the FPGA gains control of the data com-
munication and allows the CPU to tend to other pending tasks. Figure 5.2
shows the schema of this data transfer pattern.
We have shifted the master role in the communication process from the
CPU (through the DMA engine) to the accelerator, which now issues istelf
read/write petitions to the DDR controller.
Figure 5.2: OmpSs@FPGA data transfer offloading
Additionally, we can specify a dedicated AXI port for each data depen-
dence, since it does not imply a major increase in FPGA resource utiliza-
tion. This way, the HLS compiler is able to optimize the code and overlap
data communication and computation, as seen in Figure 5.3.
28
Figure 5.3: Accelerator overlapping communication and computation
5.1.1 Synchronization and communication protocols
Figure 5.4 shows the diagram of the execution of an accelerator with the
data transfers offloaded.
Figure 5.4: Diagram of the execution of an accelerator with data transfer
offloaded
First of all, OmpSs runtime (Nanos++) copies the data dependencies that
are specified with the copy clause to kernel space and obtains the physical
address of each memory region. Virtual addresses can not be translated to
physical addresses because the FPGA does not include a Memory Man-
agement Unit.
Hence the addresses that the accelerator receives have to be physical ad-
dresses. Moreover, the FPGA has no way to handle data fragmentation, so
the data has to be placed contiguously in a memory region.
29
In addition, in order to assure that the memory pages are not physically
flushed from memory, these have to be pinned. And finally, due to coher-
ent issues of the memory used from the FPGA, shared with the CPU cores,
the memory pages have to be non-cached memory.
Inside the accelerator, arguments are identified by a number, starting from
0. The addresses are sent paired with the number that identifies the argu-
ment they represent, so they can be read in any given order.
Once Nanos++ have obtained the physical addresses, it stores them into
a 128-bit struct, along with the argument identifier (argID) and argCached,
which is a flag that specifies if the argument is cached inside the accelera-
tor and its copy can be avoided.
Figure 5.5 shows the memory organization of each of those fields of an
argument. This struct is organized in two 64-bit words. The 64 lowest
most significant bits contains the argID and the argCached fields, and the
64 most significant bits stores the argument address (argAddr).
63 32 31 0
argID argCached
argAddr
Figure 5.5: Task argument struct
Each argument struct is then stored in the task information struct that rep-
resents the entire task. The struct is conformed by the task header and the
task arguments. Figure 5.6 shows the bit organization of this structure in
64-bit words.
The task header contains information that is not part of the accelerated
function, but is used by the accelerator to modify its behaviour. It in-
cludes the task identifier taskID; the addresses to the hardware counter
and the instrumentation buffer instrCounterAddr and instrBufferAddr, only
present if the hardware support for instrumentation is enabled; the des-
tination identifier destID, used to guide interconnection inside the FPGA;
and the compute flag, used to activate or deactivate the computation part
of the accelerator.
30
63 32 31 0
taskID
instrCounterAddr
instrBufferAddr
destID compute

Task
header
args[0]
...
args[N]

Task
arguments
Figure 5.6: Task information struct
Finally, the runtime sends the struct through the DMA in a single transfer
and waits for its completion.
Once the accelerator has received the task info struct, the CPU is informed
through the DMA hardware interrupt that the transfer has finished and
can continue working. The accelerator, at the same time, issues a read pe-
tition to the DDR controller and receives the input data dependencies.
Then, the accelerator performs the task computation and writes the output
dependencies back to kernel space. To finish its role in the task execution,
the accelerator sends through DMA the identifier of the task that has exe-
cuted, taskID, to inform the CPU of the completion of the task.
Nanos++ receives this signal and copies the output data dependencies that
are specified with a copy clause from kernel space to user space.
5.2 Asynchronous task management
In this section we present a mechanism to completely remove the explicit
synchronization between CPU and FPGA. This mechanism gets rid of the
remaining DMA engines in our communication system and overcome any
OS limitation in the number of DMA transfers in flight. It is built on top
of the technique described in the previous section, and it uses the same
31
memory structs.
From the point of view of the Nanos++ runtime, each accelerator was rep-
resented as an instance of a DMA engine where to send the task informa-
tion (whether whole data or memory addresses). We decoupled the set of
accelerators from the CPU by removing the DMAs and adding a specific
hardware to manage ready tasks, submit them into their corresponding
accelerator and handle the finished ones.
Figure 5.7 shows the block design of the manager, called Asynchronous Task
Manager, which is conformed by 5 sub-components: the Ready Task Man-
ager, the Finished Task Manager, the readyQueue BRAM, the finishedQueue
BRAM and the accAvailability BRAM.
The Ready Task Manager has three AXI connections: to the readyQueue BRAM,
to the accAvailability BRAM and to the SoC DDR memory, and an AXI-
Stream output that connects with the accelerators through a crossbar.
On the other hand, the Finished Task Manager has two AXI connections:
to the finishedQueue BRAM and to the accAvailability BRAM, and an input
AXI-Stream port connected to the accelerators through a crossbar.
Figure 5.7: Block design of the Asynchronous Task Manager
5.2.1 Communication protocol
Figure 5.8 shows the diagram of the execution of an accelerator asyn-
chronous task management. Observe that no explicit synchronization be-
32
tween the CPU and the FPGA remains. All interaction is done through
asynchronous accesses to kernel space memory.
Figure 5.8: Diagram of the execution of an accelerator with asynchronous
task management
Nanos++ runtime follows the same procedure as the one explained in the
previous section: it copies all the data dependencies specified with a copy
clause to kernel space, packs the physical address of each data dependence
in the task argument struct (Figure 5.5) and then stores each argument in
the task information struct (Figure 5.6).
Once the task information struct has been generated, the runtime obtains
its physical address and writes it into the asynchronous ready task struct
and stores it in a free slot of the readyQueue BRAM.
Figure 5.9 shows the 64-bit word organization of the asynchronous ready
task struct. The less significant word stores the task information struct
physical address (taskInfoAddr), and the most significant word stores a bit-
mask, argsBitmask, that marks which arguments are ready and can be sent
to the accelerator; size, that specifies the number of 64-bit words that will
be sent to the accelerator (task header + task arguments); the identifier of
the accelerator where to send the task, accID; and a valid field to indicate if
the struct represents a valid task or not.
33
63 56 55 48 47 40 39 32 31 16 15 0
taskInfoAddr
valid accID size argsBitmask
Figure 5.9: Asynchronous ready task struct
At the moment, the argsBitmask field is not used and we consider that all
the arguments are ready. However it could be used in future versions of
the manager to prematurely send some arguments to accelerators in a sort
of prefetching.
At the same time the runtime is initializing the task structs, inside the
FPGA, the Ready Task Manager is already waiting for ready tasks. Its be-
haviour is synthesized in the FSM of Figure 5.10.
Read task X
of acc
N from
readyQueue
Check acc N
availability
Mark acc N
as busy
Invalidate
task X of
readyQueue
Send task X
to acc N
X = X + 1
XN = X
N = N + 1
X = XN+1
start
N = 0
X = 0
Task X is valid
Task X is not valid
Acc N is available
Acc N is busy
Figure 5.10: Finite State Machine of the asynchronous Ready Task Manager
The manager implements an active polling over the readyQueue BRAM,
looking for valid ready tasks. The BRAM is divided in a number of regions
and each of those represents a different accelerator. Figure 5.11 shows
34
the current BRAM structure, containing 1024 slots divided in 16 regions.
Therefore, each accelerator has room for 64 pending tasks. The size of the
BRAM and the number of regions can be tuned in the source code of the
manager. The search for valid tasks is done following a Round-robin algo-
rithm through every region.
0x0000
0x03FF
Accelerator 0
0x0400
0x07FF
Accelerator 1
...
0x3C00
0x4000
Accelerator 15

1024
Slots
Figure 5.11: Division of the readyQueue BRAM in regions.
When the manager finds a valid task, it checks whether the accelerator that
has to execute it is available. The accelerators availability status is stored
in the accAvailability BRAM, which the manager checks out.
If the accelerator is busy, the manager continues the search for valid tasks
for the next accelerator in the Round-robin schedule. If it is available, the
manager marks the accelerator as busy in the accAvailability BRAM, sets to
0 the valid field of the ready task struct in the readyQueue BRAM and sends
the task to the given accelerator.
Concurrently, Finished Task Manager waits for the reception of a finished sig-
nal from one of the accelerators, indicating the completion of a task. Then,
it marks the accelerator as available in the accAvailability BRAM, searches
an empty slot in finishedQueue and stores the finished task information in
35
the asynchronous finished task struct (Figure 5.12).
63 56 55 32 31 0
taskID
valid accID
Figure 5.12: Asynchronous finished task struct
The struct contains the task identifier taskID and the accelerator identifier
that has executed it, accID.
Finished Task Manager behaviour is synthesized in the FSM of Figure 5.13.
Wait for
finished
signal
Mark
accelerator
as available
Check if
slot Y of
finishedQueue
is empty
Y = Y + 1
Write
finished task
to slot Y of
finishedQueue
Y = Y + 1
start
Y = 0
Signal received
Not received
Slot is empty
Slot is occupied
Figure 5.13: Finite State Machine of the asynchronous Finished Task Man-
ager
36
5.3 Task Batch
In parallel with the Asynchronous Task Manager modification, we proposed
another extension to the communication structure described in Section 5.1.
With the objective of reducing DMA transfers and CPU/FPGA synchro-
nization, we developed the concept of FPGA task batching.
A task batch is a special task that does not contain code inside, but a num-
ber of smaller tasks. Its purpose is to be fed into a hardware manager that
reads the task batch information and proceeds to submit the inner tasks
to the accelerators. Figure 5.14 shows our definition of task batch: a set of
interrelated tasks with data dependencies between consecutive tasks.
Figure 5.14: Structure of a task batch
Moreover, defining this type of structure has two additional benefits. Since
we know that the set of tasks must be executed one after the other, the
Nanos++ runtime system does not need to insert the tasks into the depen-
dency graph, reducing the overall OmpSs overhead. In addition, if we
know which data is shared among the tasks, we can re-use it by setting
the argCached flag and avoid unnecessary data copies.
For example, on a blocked matrix multiply algorithm, we would have a
3-level nested loop iterating over the different blocks. The innermost loop
fits exactly the structure of our task batch definition, so we could group all
the iterations of the loop in a single task batch.
All those tasks share one of the three matrix blocks that is read by the first
task and used by the following ones. We can re-use that block by marking
it as cached in the argCached field in the task argument struct (Figure 5.5).
5.3.1 Synchronization and communication protocols
The runtime system stores the data dependencies in kernel space the same
way as in Section 5.1, but instead of storing the data of just one task, it
37
stores the data dependencies of the whole set of tasks that pertain to the
task batch.
Batching tasks relieves the CPU of all the management related to the in-
ner tasks (i.e. handling dependencies, task scheduling, data copies..) and
allows it to focus on other non-dependent tasks (Figure 5.15).
Figure 5.15: Diagram of the execution of a task batch
The hardware manager, called Task Batch Manager, receives the task batch
header (Figure 5.16) that contains the task batch identifier taskBatchID; the
number of tasks within the batch, numTasks; and the destination identifier,
destID, which drives intra-FPGA communication.
63 32 31 0
taskBatchID
destID numTasks
Figure 5.16: Task batch header
38
Each task within the batch is represented as a task header (Figure 5.17)
and a task information struct (Figure 5.6). The task header contains a bit-
mask that marks which arguments are ready, argsBitmask; the number of
arguments of the task, size; and the accelerator that has to execute the task,
accID
63 56 55 48 47 40 39 32 31 16 15 0
accID size argsBitmask
Figure 5.17: Task header
The manager follows the FSM of Figure 5.18. It first reads the header of
the first task. Next it reads each element of the task information struct and
checks argsBitmask to see if the argument is ready. If it is ready, the man-
ager sends it to the accelerator. Analogously, if the argument is not ready,
the manager stores the argument in a temporary buffer in order to send it
afterwards.
In contrast with the previous section, where the argsBitmask field is not
used, here we can actually use it to send arguments that are ready from a
task that can not be executed yet, to idle accelerators.
The first task of every task batch is always a ready task, meaning that all its
input dependencies are met, so the manager always sends its arguments
right away.
Once the first task has been sent, the manager starts reading the remain-
ing tasks of the batch, sending the arguments that are ready and storing in
temporary buffers the ones that are not.
After all the tasks have been processed, the manager waits for the finished
signal of the first task of the batch and proceeds to send the arguments of
the second one that are stored in the temporary buffer. It then waits for
this second task to finish, and repeats the process until all the tasks have
been send to the corresponding accelerators.
Finally, when the finished signal of the last task arrives, the manager sends
its own finished signal to the CPU to inform that the task batch has been
executed in its entirety and can read the output data from kernel space.
39
Read task
batch
information
Read task N
information
Read
argument X
Check
argument X
availability
Send
argument X
Store argu-
ment X in a
temporary
buffer
X = X + 1N = N + 1
Wait for
task Y
to finish
Send task Y
remaining
arguments
Y = Y + 1
Send
finished
signal
start
N = 0
X = 0
Y = 0
if N >= numTasks
N = 0
if N < numTasks
Argument
is ready
Argument
not ready
if X < size
if X >= size
X = 0
if Y < numTasks
if Y >= numTasks
Y = 0
Figure 5.18: Finite State Machine of the Task Batch Manager
5.4 Asynchronous task management with task batch
capability
Finally, we proposed the combination of the three modifications described:
an asynchronous task batch with offloaded data transfers.
The tasks are packed in a task batch struct, the same way as it is explained
in Section 5.3, then this struct is stored in kernel space and its physical ad-
dress written into a slot of the readyQueue in order to feed the Asynchronous
Task Manager.
40
The Asynchronous Task Manager will read the task batch from kernel space
and feed, in turn, the Task Batch Manager, which will, ultimately, send the
tasks to the accelerators (Figure 5.19).
Figure 5.19: Diagram of the execution of an asynchronous task batch
41
Chapter 6
OmpSs@FPGA toolchain
To accommodate the work of this master thesis, we developed the whole
OmpSs@FPGA ecosystem [2] to fully support the automatic bitstream gen-
eration and transparent handling of CPU/FPGA communication.
Prior to this work, OmpSs@FPGA implementation only supported 32-bit
Xilinx Zynq SoC families and required the programmer to code the FPGA
code by hand and manually use the Xilinx toolchain (Vivado HLS + Vi-
vado) to generate the FPGA bitstream.
Figure 6.1 shows the full OmpSs@FPGA ecosystem with support for au-
tomatic HLS code and bitstream generation. The left branch is what was
already developed of the OmpSs@FPGA ecosystem at the start of the mas-
ter thesis and the right branch is what has been developed in the context
of this work.
The left branch represents the part of Mercurium that generates the host
code of the OmpSs@FPGA application, inserting calls to the Nanos++ run-
time system and the DMA library.
The right branch represents the Mercurium components that handle all
things related to the generation of FPGA code and hardware. A dedicated
Mercurium FPGA phase extracts the code of the accelerated task and gen-
erates an HLS wrapper that manages data communication. An external
tool, autoVivado, works coupled with Mercurium to synthesize the gener-
ated wrappers and design and generate the hardware bitstream.
The autoVivado tool is also used to include some advanced features on
the bitstream, such as support for hardware instrumentation and the ad-
42
dition of the Asynchronous Task Manager and the Task Batch Manager, both
detailed in Chapter 5.
Figure 6.1: OmpSs@FPGA compilation flow
Transparent handling of CPU/FPGA communication required some mod-
ifications in the Nanos++ runtime system.
6.1 Mercurium modifications
Mercurium compiler has been modified to include a phase to handle the
FPGA part of an OmpSs@FPGA application.
This Mercurium phase extracts the code of each FPGA task into its own
source file, including an autogenerated HLS wrapper that will become the
top function of the IP core of the accelerator (Vivado HLS project), which
will be later compiled by Xilinx Vivado HLS.
6.1.1 Wrapper generation
The autogenerated wrapper acts as an interconnection between the accel-
erated task and the CPU/memory. Communication with the CPU is done
through two types of interfaces: two 64-bit AXI-Stream ports (an input
and an output), and a 64-bit AXI port for each data dependence.
43
Code snippet 6.1 shows a simple code with two dependencies with copy in
and copy out clauses. In this code, there are a few HLS directives, in ma-
genta, inserted by the programmer, to optimize the code.
1 #pragma omp target device(fpga,smp) copy deps onto(0,1)
2 #pragma omp task in(v a[0:SIZE-1], v b[0:SIZE-1]) out(v c[0:SIZE-1])
3 void vector mult ( f l o a t ∗v a , f l o a t ∗v b , f l o a t ∗v c ) {
4 i n t i ;
5
6 #pragma HLS array partition variable=v a block factor=SIZE/2
7 #pragma HLS array partition variable=v b block factor=SIZE/2
8
9 #pragma HLS pipeline
10 for ( i = 0 ; i < SIZE ; i ++) {
11 v c [ i ] = v a [ i ] * v b [ i ] ;
12 }
13 }
14
15 i n t main ( ) {
16 f l o a t A[ SIZE ] ;
17 f l o a t B [ SIZE ] ;
18 f l o a t C[ SIZE ] ;
19 i n i t i a l i z e v e c t o r (A, SIZE ) ;
20 i n i t i a l i z e v e c t o r ( B , SIZE ) ;
21 vector mult (A, B , C ) ;
22 #pragma omp taskwait
23 }
Code snippet 6.1: OmpSs@FPGA code with Vivado HLS directives
Code snippet 6.2 shows the interface and protocol automatically generated
by the OmpSs@FPGA ecosystem. This is part of the wrapper generated.
44
1 void vector mult hls automatic mcxx wrapper (
2 hls::stream<axiData> &inStream ,
3 hls::stream<axiData> &outStream ,
4 ap uint<32> accID ,
5 f l o a t * mcxx v a ,
6 f l o a t * mcxx v b ,
7 f l o a t * mcxx v c ) {
8 #pragma HLS INTERFACE ap ctrl none port=return
9 #pragma HLS INTERFACE axis port=inStream
10 #pragma HLS INTERFACE axis port=outStream
11 #pragma HLS INTERFACE m axi port=mcxx v a
12 #pragma HLS INTERFACE m axi port=mcxx v b
13 #pragma HLS INTERFACE m axi port=mcxx v c
Code snippet 6.2: OmpSs@FPGA HLS wrapper header
A minimum of four ports are defined in each wrapper: the two 64-bit
AXI-Stream ports, inStream and outStream, used to receive the memory ad-
dresses of the data dependencies and to inform of the finalization of the
task execution, respectively; a 32-bit unsigned integer accID, which is the
global identifier of the accelerator; and a 64-bit AXI port for each data de-
pendence.
Code snippet 6.3 shows the part of the wrapper function that receives the
header of the task info struct (Figure 5.6), described in Chapter 5.
1 uint64 t accHeader ;
2 uint32 t dest ID , compute ;
3
4 accHeader = inStream . read ( ) . data ;
5 compute = accHeader ;
6 de s t ID = accHeader >>32;
Code snippet 6.3: HLS wrapper reads the task info header
The task info header contains the compute flag that indicates if the com-
putation part of the accelerator has to be executed or not and destID, the
identifier of the destination where to send the finished signal once the ac-
celerator finishes execution.
Once the task info header is read, the wrapper proceeds to read the task
argument information. Figure 6.4 shows the part of the wrapper that per-
forms the reception of that information. A loop with as much iterations
45
as dependencies with copy clauses is automatically generated. Inside the
loop, the information regarding the argument is read from the input AXI-
Stream port.
If the argument represents an input dependence, either in or inout, the
argument is copied right away from memory with a memcpy to the local
variable. On the other hand, if the argument is an output dependence, ei-
ther out or inout, the argument information (cached flag and memory
address) is stored in a temporary buffer to be used once the computation
has taken place.
Code snippet 6.4 shows the code produced when compiling Code snip-
pet 6.1. Accelerator arguments v a and v b represent in dependencies
and, as such, are copied from kernel space through their corresponding
AXI port (mcxx v a and mcxx v b). Argument v c corresponds to an out
dependence and so its information is stored in temporary buffers.
1 uint64 t c a c h e d i d o u t [ 1 ] , argAddr out [ 1 ] ;
2 uint64 t cached id , argAddr ;
3 uint32 t argCached , argID ;
4
5 for ( i = 0 ; i < 3 ; i ++) {
6 c a c h e d i d = inStream . read ( ) . data ;
7 argCached = c a c h e d i d ;
8 argID = c a c h e d i d >> 3 2 ;
9 argAddr = inStream . read ( ) . data ;
10 switch ( argID ) {
11 case 0 :
12 memcpy( v a , ( f l o a t * ) ( mcxx v a + argAddr ) , . . . ) ;
13 break ;
14 case 1 :
15 memcpy( v b , ( f l o a t * ) ( mcxx v b + argAddr ) , . . . ) ;
16 break ;
17 case 2 :
18 c a c h e d i d o u t [ 0 ] = c a c h e d i d ;
19 argAddr out [ 0 ] = argAddr ;
20 break ;
21 default : ;
22 }
23 }
Code snippet 6.4: OmpSs@FPGA HLS wrapper arguments reading
The code of the accelerated task is copied as is in the HLS code. Before
46
calling the function, the wrapper checks if the compute flag is enabled or
not (Code snippet 6.5).
1 i f ( compute )
2 vector mult ( v a , v b , v c ) ;
Code snippet 6.5: OmpSs@FPGA HLS wrapper computation
Once the function has been executed, and with it the computation part of
the task, the wrapper writes back to kernel space the variables represent-
ing an output dependence (out or inout). It reads the cached flag and the
memory address from the temporary buffers where they have been pre-
viously stored and proceeds to copy the data with a memcpy through the
given AXI port.
In Code snippet 6.6, the wrapper is copying back to the CPU memory
the third argument of the task, v c, which is the vector resultant of the
multiplication of the two input vectors.
1 for ( i = 0 ; i < 1 ; i ++) {
2 c a c h e d i d = c a c h e d i d o u t [ i ] ;
3 argCached = c a c h e d i d ;
4 argID = c a c h e d i d >> 3 2 ;
5 argAddr = argAddr out [ i ] ;
6 switch ( a r g i d ) {
7 case 2 :
8 memcpy( mcxx v c + argAddr , ( f l o a t * ) v c , . . . ) ;
9 break ;
10 default : ;
11 }
12 }
Code snippet 6.6: OmpSs@FPGA HLS wrapper writes out
dependencies
To finish execution, the wrapper sends the finished signal, consisting of the
accelerator identifier accID, to the destination specified by destID (Code
snippet 6.7).
47
1 axiData output = {0 , 0 , 0 , 0 , 0 , 0 , 0} ;
2 output . keep = 0xFF ;
3 output . data = accID ;
4 output . dest = de s t ID ;
5 output . l a s t = 1 ;
6 outStream . wri te ( output ) ;
Code snippet 6.7: OmpSs@FPGA HLS wrapper sends finished
signal
The destination signal destID is used to guide the interconnection within
the FPGA and can be used to inform different elements of the completion
of the task. It is usually used to inform the CPU that the task has finished,
but it can also be used to inform the Task Batch Manager that a task within
a batch has finished, or inform the Asynchronous Task Manager, by sending
the signal to the Finished Task Manager, that an asynchronous task has been
executed.
If the hardware support for instrumentation is enabled in the compila-
tion, Mercurium compiler would add additional code to annotate the main
parts of the accelerator.
The accelerator would receive two extra arguments in the task info header,
instrCounterAddr and instrBufferAddr, the addresses of the hardware counter
and the buffer to where store the instrumentation data, respectively.
Additionally, the compiler would include sampling of the hardware counter
before and after each one of the three parts the accelerator is divided in:
the reception of arguments, the computation and the writing of the output
dependencies.
Right after the sampling the hardware counter after the output depen-
dencies writing, the samples are written in the hardware instrumentation
buffer, so the CPU can read them.
Code snippet 6.8 shows a generic example of an instrumented accelera-
tor.
48
1 counter t c o u n t e r r e g [ 4 ] ;
2 uint64 t instrCounterAddr , i n s t r B u f f e r A d d r ;
3
4 instrCounterAddr = inStream . read ( ) . data ;
5 i n s t r B u f f e r A d d r = inStream . read ( ) . data ;
6
7 / / Hardware c o u n t e r sample 0
8 memcpy(& c o u n t e r r e g [ 0 ] , instrCounterAddr , . . . ) ;
9 /*
10 * Read arguments and input dependencies
11 */
12 / / Hardware c o u n t e r sample 1
13 memcpy(& c o u n t e r r e g [ 1 ] , instrCounterAddr , . . . ) ;
14 /*
15 * Compute
16 */
17 / / Hardware c o u n t e r sample 2
18 memcpy(& c o u n t e r r e g [ 2 ] , instrCounterAddr , . . . ) ;
19 /*
20 * Write output dependencies
21 */
22 / / Hardware c o u n t e r sample 3
23 memcpy(& c o u n t e r r e g [ 3 ] , instrCounterAddr , . . . ) ;
24
25 / / Wri te s a m p l e s t o hardware i n s t r u m e n t a t i o n b u f f e r
26 memcpy( ins t rBuf ferAddr , c o u n t e r r e g , . . . ) ;
27 /*
28 * Send finished signal
29 */
Code snippet 6.8: OmpSs@FPGA HLS wrapper hardware
instrumentation
6.1.2 New Compier and Linker Flags
The new fpga Mercurium version, alias fpgacc, includes new compiler
and linker flags to support the FPGA bitstream generation.
The compilation flag --variable=bitstream generation:ON makes
fpgacc compiler generate a Vivado HLS wrapper code for each of the
fpga tasks (accelerators) appearing in a source code.
The linking flags are used to determine how to generate the bitstream with
all the accelerators found among all the source codes with compilation
flag --variable=bitstream generation:ON. Those flags are, among
49
others, the target board of the bitstream generation (e.g. --board=AXIOM),
the accelerator frequency in MHz (e.g. --clock=200), if the accelerators
should have instrumentation support (e.g. --hardware instrumen-
tation), the project name (e.g. --name=MxM), and the directory where
the Vivado HLS and Vivado projects should be generated.
For example, a possible command line to compile and link a program
(program.c) follows:
fpgacc --ompss --variable=bitstream_generation:ON \
-o program program.c \
--Wf,"--board=AXIOM,--name=vivado_project_name" \
--Wf,"--clock=200,--dir=$(VIVADO_WORKSPACE)" \
--Wf,"--hardware_instrumentation,-v"
This will generate a intermediate source code for each FPGA accelerator
and will make Mercurium compiler invoke the autoVivado tool to gener-
ate the bitstream of the accelerators during the linking process. The auto-
Vivado tool details are explained in the following section.
6.2 autoVivado: Automatic bitstream generation
autoVivado is a tool that has been developed with the objective of autom-
atizing the process of generating a bitstream using the Xilinx toolchain.
autoVivado integrates all the synchronization and communication tech-
niques analyzed in this master thesis, that can be activated or not. Indeed,
autoVivado is a modular tool that can be easily updated with new board
supports and features.
It can be integrated with the Mercurium compiler to produce OmpSs@FPGA
compliant bitstreams, or use it as a standalone tool to automatically gen-
erate bitstreams from a set of source files.
Given the significant amount of time that a bitstream generation takes,
the tool divides the generation in 6 steps, so when an error occur in one
of them, the generation can be resumed from the step that failed instead
of starting over. The 6 steps the generation is divided in are: HLS, design,
synthesis, implementation, bitstream and device tree.
50
autoVivado tool receives a configuration file input with bitstream infor-
mation (target board, accelerator frequency, project name, etc.) and other
details about the steps to be done of the bitstream generation.
6.2.1 Configuration file
The autoVivado configuration file can either be generated manually or
through the fpgacc-bit.sh script included in the autoVivado toolchain.
Figure 6.2 shows a screenshot of the help command output of autoVivado,
which details the set of options to configure the behaviour of autoVivado
and the parameters of the bitstream.
Figure 6.2: Screenshot of autoVivado help command output with the
complete set of options to configure the tool
Mercurium uses the latter option and runs the script right after generating
the wrappers for all the FPGA-accelerated tasks, during the linking step
of the Mercurium compilation.
Code snippet 6.9 shows the configuration file used to generate the bit-
stream of the vector multiplication example code. This configuration file
was automatically generated by Mercurium during its compilation to bit-
stream.
The minimum information needed to generate the bitstream is the model
of the board (BOARD PART, line 1), the frequency of the FPGA clock (ex-
pressed as the period, CLOCK PERIOD, line 4) and the list of accelerators
to include (KERNELS, lines 7-13, with 5 accelerators).
Additionally, the file includes parameters to configure the behaviour of au-
toVivado, for example the directory and name of the autoVivado project
(PROJECT DIR and PROJECT NAME, lines 2 and 3), the initial and final
51
generation steps that has to perform the tool (FROM STEP and TO STEP,
lines 5 and 6) and which advanced features are enabled (lines 15-17).
1 BOARD PART=xc7z045ffg900−2 # Board p a r t
2 PROJECT DIR=/path/to/autoVivado/ p r o j e c t / # P r o j e c t d i r
3 PROJECT NAME=vector mult # P r o j e c t name
4 CLOCK PERIOD=10 # C l o c k p e r i o d ( ns )
5 FROM STEP=0 # I n i t i a l g e n e r a t i o n s t e p
6 TO STEP=4 # F i n a l g e n e r a t i o n s t e p
7 KERNELS=”
8 [0]=/ path/to/ a c c e l / 0 : 2 : vector mult h ls automat ic mcxx . cpp
9 [1]=/ path/to/autoVivado/HLS/ s r c /Hardware Counter . cpp
10 [2]=/ path/to/autoVivado/HLS/ s r c /Task Batch Manager . cpp
11 [3]=/ path/to/autoVivado/HLS/ s r c /Ready Task Manager . cpp
12 [4]=/ path/to/autoVivado/HLS/ s r c /Finished Task Manager . cpp
13 ”
14 NUM KERNELS=1 #Number o f k e r n e l s
15 HARDWARE INSTRUMENTATION= t rue
16 ASYNC TASK MANAGER= t rue
17 TASK BATCH MANAGER= t rue
Code snippet 6.9: autoVivado configuration file
autoVivado main script (autoVivado.sh) parses the configuration file
and proceeds to perform the generation steps specified in the file. It can
perform just one step or all of them, but always in a sequential way. If
the programmer tries to run autoVivado from the synthesis step to the bit-
stream step without passing through steps HLS and design, it will most
likely fail unless the programmer has manually performed them and con-
ditions to start the synthesis step are met.
It is possible, however, to perform middle steps as long as a previous au-
toVivado run has completed the initial steps correctly.
6.2.2 Generation step 0: HLS
In the first step of bitstream generation, autoVivado parses the input con-
figuration file and obtains the list of accelerators that need to be included
in the bitstream.
Accelerators file names have to follow a certain pattern in order for autoVi-
vado to correctly handle them. The identifier of the accelerator, as well as
the number of instances of that accelerator, have to be prepended to the file
52
name, separated by a colon (:). For example, the name of the vector multi-
plication accelerator is 0:2:vector mult hls automatic mcxx.cpp,
meaning that the identifier of the accelerator is 0 and the bitstream will
contain 2 instances of that IP core. The suffix hls automatic mcxx is
appended by Mercurium to identify the source files that have been auto-
matically generated.
autoVivado creates a Xilinx Vivado HLS project for each of the accelerators
(hardware managers and instrumentation counter included) and proceeds
to compile them source-to-source from C/C++ to Verilog/VHDL.
The accelerators are compiled using the target clock period and target
FPGA board that are specified in the configuration file.
After the correct source-to-source compilation of each accelerator, auto-
Vivado checks the reports and obtains the FPGA resource utilization esti-
mation that Vivado HLS computes. autoVivado maintains an estimation
of the total resource utilization and aborts the bitstream generation if it
founds that the set of accelerators does not fit into the physical limitations
of the board.
Correctly compiled accelerators are from now on considered IP cores and
are added into a local IP repository that will be used in following genera-
tion steps by Xilinx Vivado.
6.2.3 Generation step 1: Design
On the second step, the actual design of the bitstream takes place. In this
step, autoVivado generates a design with all the IP cores compiled in the
first step and generates the interconnection.
It takes advantage of the Xilinx Vivado ability to run Tool Command Lan-
guage (Tcl) scripts to encapsulate the main parts of the design in Tcl tem-
plates.
A base design template contains the IP core that represents the CPU al-
ready configured, and with all the PS/PL communication interfaces en-
abled. There is a template for each CPU family (currently 32-bit Zynq and
64-bit Zynq Ultrascale+).
53
Another template encapsulates all the elements that surround each accel-
erator (DMA, interconnection, clock and reset signals. . . ) already config-
ured and connected between them. A placeholder dummy accelerator IP
is present in the template, which is swapped by the real IP once the tem-
plate is placed in the design.
Additional templates encapsulate the Asynchronous Task Manager (Figure 5.7),
the Task Batch Manager and the infrastructure to support hardware instru-
mentation.
autoVivado creates a Xilinx Vivado project in the path specified by the
PROJECT DIR configuration option and uses the different templates to in-
stantiate the IPs.
Communication between IPs and the DDR memory is done through AXI
ports. Standard AXI interconnect IPs are placed at each PS/PL interface to
act as multiplexers to allow multiple connections to each port.
On the other hand, AXI-Stream ports are used to send and receive in-
formation between accelerator and manager IPs through standard AXI-
Stream interconnect IPs. In these interconnects is where the destID argu-
ment that every accelerator receives in the task header (Figure 5.6) comes
into play.
AXI-Stream interconnect IPs contain an internal crossbar that can be con-
figured to distribute the data to different destinations with the TDEST sig-
nal. The crossbars are configured by autoVivado to distribute the data the
following way:
• TDEST 0x00 − 0x0F : Data is sent to the accelerator identified with
the same value as the TDEST signal.
• TDEST 0x10: Data is sent to the Task Batch Manager.
• TDEST 0x11: Data is sent to the Finished Task Manager.
• TDEST 0x1E: Data is sent back to the sender.
• TDEST 0x1F : Data is sent to the CPU.
After all the IPs have been placed and interconnected in the design, auto-
Vivado maps the DMAs and BRAMs to the CPU address space and vali-
dates the design generated.
54
6.2.4 Generation step 2: Synthesis
Although being the step that takes most time, it is one of the simplest ones
from the autoVivado point of view.
In this step, autoVivado runs the Xilinx Vivado synthesis command to
transform the design generated in step 2, specified in a HDL language,
either Verilog or VHDL, into a gate-level representation.
It uses some of the parameters provided to autoVivado, such as the tar-
get device and the FPGA clock frequency.
6.2.5 Generation step 3: Implementation
Similarly to the previous one, this step is the second longest but with al-
most no actions taken by autoVivado.
In this case, autoVivado runs the implementation command of Xilinx Vi-
vado to transform the gate-level representation of step 3 into a placed and
routed design, ready for bitstream generation.
The implementation process runs several iterations of placing, routing and
optimization sub-processes until all the requirements are met.
6.2.6 Generation step 4: Bitstream
In step 5, the bitstream is generated from the placed and routed design of
step 4. The Hardware Description File, which contains information of the
hardware design, is also generated.
Bitstream file is copied by autoVivado to the path specified by PROJECT DIR
to make it easier for the programmer to find it, since it is usually buried in
several levels of folders inside the Vivado project.
6.2.7 Generation step 5: Device tree
As the final step, autoVivado generates the device tree using the hsi tool
provided by the Xilinx Vivado SDK. This tool takes the Hardware Descrip-
tion File of the previous step and generates device tree sources with the
characteristics of the bitstream.
55
The device tree sources are modified to include a device used by the DMA
user library to interact with the different DMAs in the bitstream.
The device tree is used by the Linux operating system to build the /dev/
directory, which is where the DMAs are exposed to the operating system
as well as a special device to reprogram the FPGA with a new bitstream.
6.3 Runtime modifications
We wrote external library containing all the functions related to FPGA
tasks, called xTasks, to simplify the implementation of the FPGA plugin
of Nanos++ runtime. The library already implemented the synchronous
data transfer offloading technique. This way, we could easily test the differ-
ent techniques by rewriting some of those functions and just recompile the
library.
6.3.1 Asynchronous submissions of ready tasks
To implement the asynchronous submission of ready tasks, the task cre-
ation, task submit and task wait functions needed to be modified.
The function in charge of the creation of tasks was modified to, besides
creating and initializing the task information struct (Figure 5.6), also ini-
tialize the struct representing an asynchronous ready task (Figure 5.9).
Arguments are added to the task information struct, with no extra action
needed.
At submit time, the asynchronous ready task struct containing the pointer
to the task information struct is stored in the readyQueue BRAM, which has
been previously mapped to memory. The valid field of the asynchronous
ready task struct is set, so the Asynchronous Task Manager will read the
readyQueue element and proceed to submit the task to the corresponding
accelerator.
Finally, to wait for an asynchronous task, the xTasks wait task function
required to be completely rewritten. Instead of waiting for the finished
signal sent through a DMA transfer, the function implements an active
polling looking for a valid element on the finishedQueue BRAM represent-
ing a finished task. When found, the finished task is returned to Nanos++
56
and the finishedQueue slot is freed.
6.3.2 Dynamic creation of task batch and submission
To correctly create and submit a task batch, we need to generate the struc-
ture of Figure 5.14. Thus, we need to create a meta-task, representing the
task batch structure, which contains all the tasks. Inner tasks are created
normally, however they must be stored inside the task batch struct and
their submission to the FPGA must not take place.
The batch region is confined within two xTasks functions. The first func-
tion, which marks the start of the region, allocates the task batch struct
and initializes the elements. The second function, which marks the end
of the batch region, submits the batch to the hardware manager waits for
the finished signal and frees the task batch buffer. Both DMA transfers are
blocking, so there is an implicit taskwait at the end of the batch region.
To create the task batch, we disable the dependencies computation of the
task batch region, to prevent Nanos++ from scheduling tasks that are part
of the batch, and we fool Nanos++ by informing it that the tasks are be-
ing executed when they are actually being inserted into the batch. Once
Nanos++ has created all the tasks of the batch, we submit it to the hard-
ware manager.
Our intention, however, is to actually use the task batch directive and
modify Nanos++ to call the proper task creation function, not execute the
task submit function at every inner task and submit the entire batch at the
end of the region.
57
Chapter 7
Evaluation
In this chapter, we present the results of the experiments performed to
evaluate the different communication and synchronization techniques pro-
posed in Chapter 5.
7.1 Experimental setup
Experiments were performed using a single Xilinx Zynq 706 FPGA board
running a Linaro 14.04 Linux distribution with a 4.6.0 kernel.
The board, already detailed in Section 2.1.1, consists of a System-on-Chip
including a dual-core ARM Cortex-A9 running at 667MHz and a Xilinx
Kintex-7 FPGA running at 100MHz.
Given that the OmpSs programming model ecosystem for FPGA is still in
development, and many of its components were developed concurrently
with this master thesis, still there is not a full set of applications to test our
modifications.
In order to evaluate the benefits of the new synchronization and commu-
nication techniques, matrix multiply, a heavily used kernel in scientific
applications, has been analyzed. Matrix multiply is used in Cholesky de-
composition, k-means clustering, convolutional neural networks, etc.
We used the same blocked matrix multiplication base code in all our exper-
iments, only modifying the code surrounding it. Four different versions
were developed to evaluate the contributions of this master thesis:
• AXI-Stream based data transfers standalone version. The HLS code,
58
written by hand, includes the reading of the input matrices and the
writing of the output ones, using the AXI-Stream protocol. On the
host side, data handling and calls to the communication DMA li-
brary were also coded by hand.
• AXI-Stream based data transfers OmpSs version. Same HLS code
as the previous, no modifications. The host code is generated by
Mercurium, including calls to the Nanos++ runtime and the DMA
library. A couple of functions to be able to send whole data through
DMA transfers have been coded in the DMA user library. And the
Nanos++ runtime has been slightly modified in order to use these
functions when submitting OmpSs tasks to the FPGA.
• Data transfer offload standalone version. Mercurium generated HLS
code and bitstream designed and generated by autoVivado. The host
part is manually coded to include calls to the unmodified DMA li-
brary.
• Data transfer offload OmpSs version. This version utilizes the full
OmpSs@FPGA toolchain. Mercurium generates both the HLS and
the host code and autoVivado automatically designs and generates
the bitstream. The host code takes advantage of the Nanos++ run-
time.
Given the physical limitations of the FPGA board used, we generated three
types of bitstreams defining three different FPGA environments:
• small-size matrix multiply IP core with a block size of 32x32.
• medium-size matrix multiply IP core with a block size of 64x64.
• large-size matrix multiply IP core with a block size of 128x128.
We divided the evaluation experiments in two categories: standalone and
OmpSs.
The experiments of the standalone category were done to evaluate the per-
formance of the new communication and synchronization techniques on
their own, without any OmpSs-related overhead. Therefore the Nanos++
runtime system is not used and the computation is done in a single accel-
erator.
OmpSs experiments were done to evaluate the impact on the performance
59
of the initial implementations of the new techniques.
To test the batching technique, we batched the innermost loop of the set
of nested loops that handles the matrix blocks. This way, we can offload
the computation of an entire block of the result matrix C and reduce the
number of DMA transfers from N3 to N2, being N the number of blocks
of a side of the matrix.
Moreover, we can take advantage of the argCached flag and avoid unnec-
essary data copies of the matrix C block, which is shared among all tasks
of the batch.
7.2 Performance analysis
For each combination of technique, block size and matrix size, we run a
total of 10 executions and computed the arithmetic average. For each exe-
cution, we measured the elapsed time it took to perform the whole matrix
multiplication, from the sending of the input matrices to the receiving of
the result matrix.
The performance results are presented through a series of charts, each
of whom represents an experiment performed with a certain accelerator
block size, stated above the chart, and using the different communication
techniques:
• AXI-Stream are the results when using the AXI-Stream based com-
munication system that was used at the beginning of this master
thesis.
• Offload uses the transfer offloading technique presented in Section 5.1.
• Asynchronous uses the asynchronous task management technique pre-
sented in Section 5.2.
• Task batch uses the batching technique presented in Section 5.3 with-
out exploiting the accelerator internal cache.
• Task batch + Cached also uses the batching technique, but marking the
output matrix of all but the first and last tasks of the batch as cached.
60
7.2.1 Small-sized accelerators
Results from small-sized accelerators show a significant improvement, both
in standalone and in OmpSs. This is due to the overhead of creating and
submitting each of the DMA transfers in AXI-Stream, which, given the
small size of the accelerator, is comparable to the cost of receiving the data
and performing the computation. This leads the accelerator to spend a lot
of time being idle, waiting for each DMA submit.
Offload overcomes this problem performing a single small DMA transfer,
with the information of the copies so the accelerator can copy all the data
right away, without having to wait between copies. Figure 7.1 shows the
time reduction when using this technique in the standalone application.
We can also see that the rest of the techniques, that derive from Offload,
further reduce the execution time.
Figure 7.1: Standalone performance of the communication techniques in
small-sized accelerators
Specially effective reducing the execution time are the batching techniques
Task batch and Task batch + Cache. Having a multitude of small tasks to exe-
cute introduces a lot of idling time in the accelerator while it waits for the
next task to arrive. If we batch a set of tasks and send them in a batch, we
allow the accelerator to execute them one after the other, with no idling
time between them. It is the same idea that led to the development of the
Offload technique at dependency level, but applied at a task level.
61
Figure 7.2 shows the different speedups obtained in this standalone con-
text.
Figure 7.2: Standalone speedup of the communication techniques against
AXI-Stream communication in small-sized accelerators
We can observe that the speedup of the Task batch + Cache technique in-
creases with the matrix size. This is due to the increasing amount of data
we avoid copying back and forth the CPU.
Figure 7.1 summarizes the communication savings of using the cache flag
to exploit the accelerator internal memory. Several MBs of data copies can
be avoided if we do not perform the copies regarding the C matrix in the
middle tasks of the batch.
64x64 128x128 256x256 512x512 1024x1024
Number of tasks 8 64 512 4096 32768
Total number of transfers 24+8 192+64 1536+512 12288+4096 98304+32768
Number of transfers avoided 0+0 32+32 384+384 3584+3584 30720+30720
Communication savings 0 256kB 3MB 28MB 240MB
Table 7.1: Summary table of the savings in communication (input+output)
when the argCached flag is used in small-sized accelerators.
The argCached flag used at the moment has only two meanings: a vari-
able can either be cached or not. However, for inout variables, like the
62
C matrix in our application, we could add two extra meanings to the flag:
caching the input value or the output value. This way, an additional 8MB
of data (4MB input + 4MB output) would not be copied in the 1024x1024
matrix case.
Figure 7.3 shows the execution time of the application when using the
full OmpSs@FPGA ecosystem. The OmpSs execution times are, in av-
erage, higher than in standalone due to the overhead introduced by the
Nanos++ runtime and we are sequentially executing the MxM task in the
accelerator.
Figure 7.3: OmpSs performance of the communication techniques in
small-sized accelerators
However, the techniques proposed help reduce the Nanos++ overhead
which leads to higher speedups, shown in Figure 7.4.
In particular, Task batch and Task batch + Cache benefit from the fact that
the tasks of the batch do not need no be inserted into the dependency
graph. This yields an additional speedup when there are few tasks to ex-
ecute, nevertheless, when the number of tasks increases, the runtime can
overlap the execution of tasks in the accelerator with the execution of run-
time duties, such as inserting tasks into the dependency graph. So, this
gain decreases with the number of tasks.
63
Figure 7.4: OmpSs speedup of the communication techniques against AXI-
Stream communication in small-sized accelerators
Finally, Figure 7.5 shows the speedups of Asynchronous, Task Batch and
Task Batch + Cache (extensions of the Offload) considering Offload the base
technique. There is more than 15% speedup for the Asynchronous extension
and more than 1.5× speedup with the Task batch and Task batch + Cache
extensions.
Figure 7.5: Speedup of the new techniques against Offload in small-sized
accelerators
64
7.2.2 Medium-sized accelerators
For the medium-sized accelerators, we see a reduction of the benefits of
the new techniques, mainly because the weight of the computation part
has increased. Figure 7.6 shows the average execution times of the tech-
niques when dealing with different matrix sizes.
The price we had to pay in small-sized accelerators when submitting DMA
transfers in AXI-Stream has almost disappeared here, because the weight
of the computation and communication has increased, whereas the over-
head of DMA transfer submits remains almost the same.
Figure 7.6: Standalone performance of the communication techniques in
medium-sized accelerators
If we take a look at the speedup of Figure 7.7, we still can see that the Of-
fload technique and its derivatives still yield some degree of speedup com-
pared to AXI-Stream. Offload, Asynchronous and Task batch speeds up the
application between a 5% and a 25%, specially when there are few tasks to
be executed. Task batch + Cache, on the other hand, still maintains a good
speedup of over 40% thanks to the amount of communication avoided.
65
Figure 7.7: Standalone speedup of the communication techniques against
AXI-Stream communication in medium-sized accelerators
The table from Figure 7.2 summarizes the total amount of bytes and mem-
ory transfers avoided thanks to the argCached flag. If we implemented a
argCached flag for inout dependencies to cache either the input or the
output value, we could avoid copying an additional 16MB (8MB input +
8MB output) for the 2048x2048 matrix case.
128x128 256x256 512x512 1024x1024 2048x2048
Number of tasks 8 64 512 4096 32768
Total number of transfers 24+8 192+64 1536+512 12288+4096 98304+32768
Number of transfers avoided 0+0 32+32 384+384 3584+3584 30720+30720
Communication savings 0 1MB 12MB 112MB 960MB
Table 7.2: Summary table of the savings in communication (input+output)
when the argCached flag is used in medium-sized accelerators.
In OmpSs, the reduction in speedup is also significant, but still higher than
in standalone due to the runtime overhead. Figure 7.8 shows the execution
times of the OmpSs application. In this case, the tasks are large-enough
to allow the runtime to submit the DMA transfers before the accelerator
finishes the previous execution and to reduce idle time between transfers.
66
Figure 7.8: OmpSs performance of the communication techniques in
medium-sized accelerators
Figure 7.9 shows the speedups obtained running the OmpSs version of the
application. As in small-sized accelerators, we can see a peak in speedup
in Task batch and Task batch + Cache when there is a small number of tasks
to be executed, attributable to the fact that the runtime does not insert the
tasks into the dependency graph. This speedup is reduced the more tasks
there are, as the runtime overlaps the execution of accelerator tasks with
the handling of the dependency graph.
The speedup gained by exploiting the accelerator internal memory and
reducing the number of data copies is still significant, with over 50% of
improvement over AXI-Stream.
67
Figure 7.9: OmpSs speedup of the communication techniques against AXI-
Stream communication in medium-sized accelerators
Figure 7.10 shows the speedups of the Offload extension techniques com-
pared to the Offload. We can see that the performance benefits of the exten-
sion techniques decreases when the number of tasks to execute are larger.
This is due to the same reason we commented in previous section with
small-sized accelerators: the overhead of creating tasks and offloading
them can be easily overlapped when increasing the number of tasks.
68
Figure 7.10: Speedup of the new techniques against Offload in medium-
sized accelerators
7.2.3 Large-sized accelerators
For the case of the standalone experiments, large-sized accelerators seems
to overcome the performance of the proposed techniques (Figure 7.11) be-
cause the weight of the synchronization with the CPU (DMA submits and
idle times between them) is no longer relevant against pure communica-
tion and computation. In fact, the overhead introduced to support those
techniques, when there is no OmpSs overhead, provokes a light slowdown
in performance.
69
Figure 7.11: Standalone performance of the communication techniques in
large-sized accelerators
The only technique that achieves some degree of speedup (Figure 7.12) is
Task batch + Cached because, apart from reducing synchronization with the
CPU, it reduces the actual amount of data communication done at each
task execution.
Figure 7.12: Standalone speedup of the communication techniques against
AXI-Stream communication in large-sized accelerators
70
The amount of data communication avoided thanks to the argCached flag
is summarized in the table of Figure 7.3. Extending the argCached flag to
support inout dependencies, would reduce data communication by an
additional 128MB (64MB input + 64MB output) in the 4096x4096 matrix
case.
256x256 512x512 1024x1024 2048x2048 4096x4096
Number of tasks 8 64 512 4096 32768
Total number of transfers 24+8 192+64 1536+512 12288+4096 98304+32768
Number transfers avoided 0+0 32+32 384+384 3584+3584 30720+30720
Communication savings 0 4MB 48MB 448MB 3.75GB
Table 7.3: Summary table of the savings in communication (input+output)
when the argCached flag is used in large-sized accelerators.
For the case of OmpSs applications, the Nanos++ overhead, which spe-
cially affects the AXI-Stream technique, allows the other techniques obtain
better average execution times (Figure 7.13).
Figure 7.13: OmpSs performance of the communication techniques in
large-sized accelerators
Figure 7.14 shows the speedup achieved by the proposed techniques com-
pared to the AXI-Stream technique. The trend is the same as with small and
medium-sized accelerators: batching techniques have a peak of speedup
when the number of tasks to execute is small, and then decreases when
the number of tasks increases.
71
In this large-sized accelerators context, we can also see that there is an
increase of the speedup in big matrices, such as 4096x4096, because of an
slowdown of the AXI-Stream performance. This slowdown seems to be
caused by the high number of DMA transfers with heavy data payloads.
Figure 7.14: OmpSs speedup of the communication techniques against
AXI-Stream communication in large-sized accelerators
The previous trend assumption seems to be supported by the numbers of
Figure 7.15, where we can see that the relative performance between the
proposed techniques is maintained compared to small and medium size
accelerators.
72
Figure 7.15: Speedup of the new techniques against Offload in large-sized
accelerators
73
Chapter 8
Conclusions and Future Work
In this master thesis we propose a set of new communication and synchro-
nization techniques for the FPGA ecosystem of the OmpSs programming
model, based on the analysis of the current communication paradigm.
Our proposals include (1) a new communication technique that transfers
the master role in communication from the CPU to the FPGA by offloading
data transfers to the accelerators; and two techniques to reduce CPU/FPGA
synchronization based on (2) asynchronous task management and (3) task
batching, which are built over the first proposal.
We also presented the autoVivado tool, that automatizes the design and
generation of bitstreams. It can be coupled with Mercurium to generate
OmpSs-capable FPGA bitstreams and allows the programmer to include
the hardware support for the different techniques here proposed.
We have evaluated the performance of our proposals with an extensively
used compute kernel: a blocked matrix multiplication. The results ob-
tained show that, under certain conditions, the application performance
improved, specially in small-sized accelerators. Particularly, the new tech-
niques improves exploitation of fine-grain parallelism and open the way
for data re-using.
Moreover, the second proposal, the asynchronous task management, com-
pletely removes explicit synchronization between the CPU and the FPGA
and the need of DMA engines and, by extension, drivers, user libraries or
modifications to the device tree. We have also seen that the use of DMAs is
not an optimal solution for our workloads and might, and at the moment
does, carry software bugs.
74
Part of the contributions of this master thesis have been presented as Proof-
of-Concept implementations to evaluate their benefits, while others have
only been proposed but not implemented.
The next steps that could be done after this master thesis include:
• Implement the task batchOmpSs directive and modify the Nanos++
runtime to accommodate it.
• Explore the capabilities of the cached flag and automatize its use in
task batches.
• Implement and evaluate the combination of asynchronous task man-
agement and task batching techniques.
• Study the necessity of maintaining DMA engines in our hardware
designs and eliminate their use from the xTasks library and autoVi-
vado tool.
• Evaluate the techniques in a multi-accelerator scenario and study
how the hardware frequency affects them.
75
Acronyms
ACP Accelerator Coherency Port.
ASIC Application-Specific Integrated Circuit.
AXI Advanced eXtensible Interface.
DDR Double Data Rate.
DMA Direct Memory Access.
DSP Digital Signal Processing.
FPGA Field Programmable Gate Array.
GP General Purpose.
HDL Hardware Description Language.
HLS High-Level Synthesis.
HP High-Performance.
IP Intellectual Property.
LUT Look-Up Table.
PL Programmable Logic.
PS Processing System.
SoC System-on-Chip.
SRAM Static Random-Access Memory.
Tcl Tool Command Language.
76
References
[1] Eduard Ayguade´, Rosa M. Badia, Daniel Cabrera, Alejandro Du-
ran, Marc Gonzalez, Francisco Igual, Daniel Jime´nez-Gonza´lez, Jesu´s
Labarta, Xavier Martorell, Rafael Mayo, Josep M. Perez, and En-
rique S. Quintana-Ortı´. A Proposal to Extend the OpenMP Tasking
Model for Heterogeneous Architectures. In Lecture Notes in Com-
puter Science (including subseries Lecture Notes in Artificial Intelligence
and Lecture Notes in Bioinformatics), volume 5568 LNCS, pages 154–
167. Springer, Berlin, Heidelberg, 2009.
[2] Antonio Filgueras, Eduard Gil, Daniel Jime´nez-Gonza´lez, Carlos
A´lvarez, Xavier Martorell, Jan Langer, Juanjo Noguera, and Kees Vis-
sers. OmpSs@Zynq all-programmable SoC ecosystem. In Proceedings
of the 2014 ACM/SIGDA international symposium on Field-programmable
gate arrays - FPGA ’14, pages 137–146, New York, New York, USA,
2014. ACM Press.
[3] The OpenCL Specification, version 2.2, 2017.
[4] The OpenACC Application Programming Interface, 2013.
[5] Carlos A´lvarez, Eduard Ayguade´, Jaume Bosch, Javier Bueno, Artem
Cherkashin, Antonio Filgueras, Daniel Jime´nez-Gonza´lez, Xavier
Martorell, Nacho Navarro, Miquel Vidal, Dimitris Theodoropou-
los, Dionisios N. Pnevmatikatos, Davide Catani, David Oro, Carles
Ferna´ndez, Carlos Segura, Javier Rodrı´guez, Javier Hernando, Clau-
dio Scordino, Paolo Gai, Pierluigi Passera, Alberto Pomella, Nicola
Bettin, Antonio Rizzo, and Roberto Giorgi. The AXIOM software lay-
ers. Microprocessors and Microsystems, 47:262–277, 2016.
[6] Xubin Tan, Jaume Bosch, Miquel Vidal, Carlos A´lvarez, Daniel
Jime´nez-Gonza´lez, Eduard Ayguade´, and Mateo Valero. General
Purpose Task-Dependence Management Hardware for Task-Based
Dataflow Programming Models. In 2017 IEEE International Parallel
77
and Distributed Processing Symposium (IPDPS), pages 244–253. IEEE,
may 2017.
[7] Xubin Tan, Jaume Bosch, Miquel Vidal, Carlos Alvarez, Daniel
Jimenez-Gonzalez, Eduard Ayguade, and Mateo Valero. Picos, A
Hardware Task-Dependence Manager for Task-Based Dataflow Pro-
gramming Models. In 2017 International Conference on High Perfor-
mance Computing & Simulation (HPCS), pages 878–880. IEEE, jul 2017.
[8] Ying hao Xu, Miquel Vidal, Ben˜at Arejita, Javier Dı´az, Carlos A´lvarez,
Daniel Jime´nez-Gonza´lez, Xavier Martorell, and Filippo Mantovani.
Implementation of the K-means algorithm on heterogeneous devices:
a use case based on an industrial dataset. In ParaFPGA: Parallel Com-
puting with FPGAs, 2017.
[9] Antonio Filgueras, Miquel Vidal, Marc Mateu, Daniel Jime´nez-
Gonza´lez, Carlos A´lvarez, Xavier Martorell, Eduard Ayguade´, Dim-
itris Theodoropoulos, Dionisios N. Pnevmatikatos, Paolo Gai, Ste-
fano Garzarella, David Oro, Javier Hernando, Nicola Bettin, Alberto
Pomella, Marco Procaccini, and Roberto Giorgi. The AXIOM Project:
IoT on Heterogeneous Embedded Platforms. IEEE Design & Test.
[10] Jaume Bosch, Antonio Filgueras, Miquel Vidal, Daniel Jime´nez-
Gonza´lez, Carlos A´lvarez, and Xavier Martorell. Exploiting Paral-
lelism on GPUs and FPGAs with OmpSs. 1st Workshop on AutotuniNg
and aDaptivity AppRoaches for Energy efficient HPC Systems, 2017.
[11] Jairo Balart, Alejandro Duran, Marc Gonzalez, Xavier Martorell, Ed-
uard Ayguade´, and Jesu´s Labarta. Nanos Mercurium: A research
compiler for OpenMP. European Workshop on OpenMP (EWOMP’04),
pages 103–109, 2004.
[12] Germa´n Llort, Antonio Filgueras, Daniel Jime´nez-Gonza´lez, Har-
ald Servat, Xavier Teruel, Estanislao Mercadal, Carlos A´lvarez, Judit
Gime´nez, Xavier Martorell, Eduard Ayguade´, and Jesu´s Labarta. The
Secrets of the Accelerators Unveiled: Tracing Heterogeneous Execu-
tions Through OMPT. In Naoya Maruyama, Bronis R. de Supinski,
and Mohamed Wahib, editors, OpenMP: Memory, Devices, and Tasks:
12th International Workshop on OpenMP, IWOMP 2016, Nara, Japan,
October 5-7, 2016, Proceedings, pages 217–236. Springer International
Publishing, Cham, 2016.
78
[13] Tomasz S. Czajkowski, Utku Aydonat, Dmitry Denisenko, John Free-
man, Michael Kinsner, David Neto, Jason Wong, Peter Yiannacouras,
and Deshanand P. Singh. From OpenCL to high-performance hard-
ware on FPGAs. In 22nd International Conference on Field Programmable
Logic and Applications (FPL), pages 531–534. IEEE, aug 2012.
[14] Intel FPGA SDK for OpenCL Programming Guide, 2017.
[15] SDAccel Development Environment Help, 2017.
[16] OpenMP Application Programming Interface - OpenMP Standard
4.5, 2015.
[17] Daniel Cabrera, Xavier Martorell, Georgi Gaydadjiev, Eduard
Ayguade´, and Daniel Jime´nez-Gonza´lez. OpenMP extensions for
FPGA accelerators. In 2009 International Symposium on Systems, Ar-
chitectures, Modeling, and Simulation, pages 17–24. IEEE, jul 2009.
[18] Lukas Sommer, Jens Korinth, and Andreas Koch. OpenMP device of-
floading to FPGA accelerators. In 2017 IEEE 28th International Confer-
ence on Application-specific Systems, Architectures and Processors (ASAP),
pages 201–205. IEEE, jul 2017.
[19] Seyong Lee, Jungwon Kim, and Jeffrey S. Vetter. OpenACC to FPGA:
A Framework for Directive-Based High-Performance Reconfigurable
Computing. In 2016 IEEE International Parallel and Distributed Process-
ing Symposium (IPDPS), pages 544–554. IEEE, may 2016.
79
