Automatic Offloading to FPGA Accelerators (Transferência automática para aceleradores FPGA) by Ceissler, Ciro Luiz Araujo, 1986-
Universidade Estadual de Campinas
Instituto de Computação
Ciro Luiz Araujo Ceissler
Automatic Offloading to FPGA Accelerators
Transferência Automática para Aceleradores FPGA
CAMPINAS
2018
Thesis presented to the Institute of Computing
of the University of Campinas in partial
fulfillment of the requirements for the degree of
Master in Computer Science.
Supervisor: Prof. Dr. Guido Costa Souza de Araújo
This copy corresponds to the final version of the dissertation defended by
Ciro Luiz Araujo Ceissler and supervised by Prof. Dr. Guido Costa Souza de Araújo.
Funding agency(ies) and grant number(s): CAPES
Cataloging record
Universidade Estadual de Campinas
Biblioteca do Instituto de Matemática, Estatística e Computação Científica
Ana Regina Machado - CRB 8/5467
Ceissler, Ciro Luiz Araujo, 1986-
C329a Automatic offloading to FPGA accelerators / Ciro Luiz Araujo Ceissler. –
Campinas, SP : [s.n.], 2018.
Supervisor: Guido Costa Souza de Araújo.
Dissertation (master's) – Universidade Estadual de Campinas, Instituto de
Computação.
1. OpenMP (Parallel programming). 2. Compilers (Computers). 3.
FPGA (Field Programmable Gate Array). 4. Intel HARP (Microprocessor). I.
Araújo, Guido Costa Souza de, 1962-. II. Universidade Estadual de Campinas.
Instituto de Computação. III. Title.
Information for the Digital Library
Title in another language: Transferência automática para aceleradores FPGA
Keywords in English:
OpenMP (Parallel programming)
Compiling (Electronic computers)
Field programmable gate array
Intel HARP (Microprocessor)
Area of concentration: Computer Science
Degree: Master in Computer Science
Examining committee:
Guido Costa Souza de Araújo [Supervisor]
Ricardo dos Santos Ferreira
Lucas Francisco Wanner
Defense date: 17-10-2018
Graduate program: Computer Science
Examining Committee:
• Prof. Dr. Guido Costa Souza de Araújo
IC/UNICAMP
• Prof. Dr. Ricardo dos Santos Ferreira
DPI/UFV
• Prof. Dr. Lucas Francisco Wanner
IC/UNICAMP
The defense minutes, with the signatures of the committee members, are on file in
SIGA (Dissertation/Thesis Workflow System) and at the Program Office of the Unit.
Campinas, October 17, 2018
Acknowledgements
First of all, I would like to thank my family, Prof. Guido, my colleagues at the
LSC laboratory and CPqD, and the Institute of Computing staff for all their support
during this journey.
I must also acknowledge the governmental institution CAPES (scholarship number:
1707894) for the financial support.
Abstract
The sheer amount of computing resources required to run modern cloud workloads has
put a lot of pressure on the design of power-efficient cluster nodes. Deep integration
among hardware, software, network, and systems for each application is necessary
to increase computational usage effectiveness, but combining them requires a lot of
development time and in-depth knowledge. To address this problem, many vendors have
proposed integrated CPU-FPGA architectures, e.g., Intel HARP and Microsoft Catapult,
that can deliver power-efficient, high-performance execution as well as flexibility.
Unfortunately, integrating FPGA-accelerated applications with software is a challenging
endeavor that lacks a seamless programming model. This dissertation discusses in detail
HardCloud, an extension of the OpenMP API that eases the task of offloading computation
to FPGA accelerators. The OpenMP 4.0 specification introduces new directives that
enable the transfer of computation to heterogeneous computing devices (e.g., GPUs,
cryptography co-processors, or DSPs), but the specification does not provide all the
information necessary to accomplish the offload to an FPGA. A modification to the
OpenMP runtime code generation inside the Clang/LLVM compiler extends this programming
model to support offloading to the Intel HARP 2 platform by specifying a pre-generated
FPGA application. Further, a hardware interface abstraction provides an easy method to
connect the IP core with the OpenMP runtime and to read or write the shared variables
seamlessly, while managing access to the FPGA interfaces. Since FPGA bitstream
generation is CPU- and memory-intensive and takes many hours, the developer can
select the simulator instead of the platform for a rapid evaluation. Additionally, an
automatic mechanism to verify and validate the hardware, which compares the values of
the hardware and software output variables, is available and reduces the effort of
developing a test environment. Experimental results using the Clang/LLVM compiler and
the Intel HARP 2 architecture show that HardCloud can considerably simplify this task
while producing good speed-ups for a set of well-known applications.
List of Figures
2.1 Schematic of libomptarget Interface
2.2 Intel HARP Platform
2.3 Intel Xeon E5 v4
2.4 ALM Block Diagram for Intel Arria 10 Devices
2.5 OPAE Layered Architecture
3.1 Execution Flow Diagram
3.2 Block Diagram FPGA Architecture
3.3 Read State Machine
3.4 Write State Machine
3.5 Sequence diagram of the offloading to the Intel HARP
3.6 Stream Access Mode
3.7 Streaming Access Mode
3.8 Indexed Access Mode
3.9 Indexed Access Mode
4.1 Logarithmic benchmark speedup
List of Tables
3.1 Shorter Table Caption
4.1 Benchmark runtime
Acronyms
AES Advanced Encryption Standard
AFU Accelerator Function Unit
ALM Adaptive Logic Module
CCI-P Core Cache Interface
CPU Central Processing Unit
CRC Cyclic Redundancy Check
ECC Error-Correcting Code
ELF Executable and Linkable Format
FFT Fast Fourier Transform
FIR Finite Impulse Response
FIU FPGA Interface Unit
FPGA Field Programmable Gate Array
GPU Graphics Processing Unit
GRN Gene Regulatory Network
HARP Hardware Accelerator Research Program
HLS High-Level Synthesis
LLVM Low Level Virtual Machine
MMIO Memory-Mapped I/O
OPAE Open Programmable Accelerator Engine
PCIe Peripheral Component Interconnect Express
QPI QuickPath Interconnect
SHA Secure Hash Algorithm
SIMD Single Instruction, Multiple Data
SoC System on a Chip
SSE2 Streaming SIMD Extensions 2
Contents
1 Introduction
2 Background and Related Works
2.1 OpenMP 4.x Offloading Constructs
2.2 clang-ykt
2.2.1 libomptarget
2.3 Intel HARP Platform
2.3.1 Intel Xeon E5-2680
2.3.2 Intel FPGA Arria10 GX 1150
2.3.3 Intel QuickPath Interconnect
2.3.4 Core-Cache Interface
2.4 Open Programmable Accelerator Engine
2.5 Related Works
3 The HardCloud
3.1 Architecture
3.2 Offloading Flow
3.3 IP Interface
3.3.1 Stream Access Mode
3.3.2 Indexed Access Mode
3.3.3 Integration
3.3.4 Configuration
3.4 Functional Verification and Validation Mode
3.4.1 libomptarget
3.4.2 LLVM IR Generation
3.5 Design Decisions
4 Experimental Results
5 Conclusions and Future Works
Bibliography
Chapter 1
Introduction
Modern data-intensive cloud workloads demand a huge amount of computing and energy
resources, putting a lot of pressure on the design of power efficient cluster nodes. To
address that, companies like Microsoft (Catapult) [19], Intel (HARP) [27] and Amazon [15]
have recently released CPU-FPGA node architectures and made them available through
large-scale clusters.
Programming accelerators in heterogeneous architectures has become a challenging
endeavor that requires programmers to learn complex programming languages like CUDA
and OpenCL. Moreover, designing specialized hardware accelerators requires mastering
languages like VHDL/Verilog and a different set of skills unknown to typical software
programmers.
To help programmers generate hardware, companies like Intel-Altera and Xilinx have
developed toolchains that can synthesize FPGA designs starting from OpenCL code.
Unfortunately, this approach suffers from two fundamental problems: (a) OpenCL is still
a complex programming language for typical programmers; and (b) even expert OpenCL
programmers need a good knowledge of hardware constructs to write OpenCL
code that helps synthesis tools produce quality hardware.
This dissertation proposes an extension to the OpenMP 4.X parallel programming
model, which we call HardCloud, that seeks to ease the task of programming FPGA-based
accelerators. The rationale for extending OpenMP is that it is a well-known
parallelization standard widely used for multicore and GPU architectures. Hence,
programmers could take their original code, add a few annotations, and quickly evaluate
whether an FPGA accelerator is a suitable solution for a particular application.
Consider, for example, the code fragments of Listing 1.1, where the parallel for,
target, device, and map constructs come from standard OpenMP 4.X, while the use,
synthesize, and module clauses are the newly proposed OpenMP extensions. The code in
Listing 1.1(a) uses a standard OpenMP parallel for directive to parallelize the
iterations of a loop across the threads of a multicore architecture. If array x is
very large, a programmer could annotate the code with an additional OpenMP target
directive and device clause to indicate that the loop iterations will be offloaded to
a larger device (e.g., a GPU) to improve performance (Listing 1.1(b)). To offload
the computation, the programmer must also use the map clause to indicate which data
will be sent to and received from the GPU (in this case, vector x is sent and vector
y is received).
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        y[i] += f(x, i);
    }
(a) Parallelize all iterations across a standard multicore

    #pragma omp target device(GPU) map(to: x) map(from: y)
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        y[i] += f(x, i);
    }
(b) Parallelize the iterations and offload them to a GPU

    #pragma omp target device(HARP | Catapult) map(to: x) map(from: y)
    #pragma omp parallel for use(hrw) synthesize(myhrw)
    for (int i = 0; i < n; i++) {
        y[i] += f(x, i);
    }
(c) Use an HLS compiler to generate an FPGA bit-stream from C and offload it to the FPGA

    #pragma omp target device(HARP | Catapult) map(to: x) map(from: y)
    #pragma omp parallel for use(hrw) module(myhrw)
    for (int i = 0; i < n; i++) {
        y[i] += f(x, i);
    }
(d) Substitute C code by the execution of a pre-compiled FPGA bit-stream

Listing 1.1: Four options to accelerate a loop

As a careful reader might notice, the annotations in Listing 1.1(a)-(b) are only possible
on multicore and GPU architectures when the loops are free of loop-carried dependencies,
i.e., when no iteration of the loop depends on a previous iteration. Such loops are called
DOALL and are frequently found in scientific codes. Loops that have loop-carried
dependencies are called DOACROSS and are found in more generic codes. Unfortunately,
parallelizing DOACROSS loops in software is a hard problem in general, and neither
multicore CPUs nor GPUs can do that for complex DOACROSS loops. Hence, a complex
DOACROSS loop might be a good candidate for hardware acceleration.
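To make the distinction concrete, the following minimal sketch contrasts a DOALL loop with a DOACROSS loop in plain C. The function names (doall_scale, doacross_prefix_sum) and the computations are illustrative assumptions, not code from this dissertation.

```c
/* DOALL: iteration i reads only x[i] and writes only y[i], so the
 * iterations are independent and can run in any order or in parallel. */
void doall_scale(const int *x, int *y, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        y[i] = 2 * x[i];
    }
}

/* DOACROSS: iteration i reads y[i - 1], which iteration i - 1 wrote.
 * This loop-carried dependency forces the iterations to run in order,
 * so the loop is deliberately left without a parallel annotation. */
void doacross_prefix_sum(const int *x, int *y, int n) {
    if (n <= 0) {
        return;
    }
    y[0] = x[0];
    for (int i = 1; i < n; i++) {
        y[i] = y[i - 1] + x[i];
    }
}
```

Adding a parallel for annotation to the second loop would let different threads read y[i - 1] before it has been written, which is why such loops are the ones considered here for hardware acceleration.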
Now, assume that function f in Listing 1.1(c)-(d) is a computationally intensive code
that uses some previous value of x. For example, assume that f(x,i) returns x[i-1]. If
the programmer decides to use HardCloud to accelerate such loops on an integrated
CPU-FPGA architecture, he/she must first change the value of the device clause from
GPU to an FPGA-based accelerator architecture (e.g., HARP or Catapult). Doing so
activates the proposed Clang/LLVM OpenMP 4.X libomptarget plugin, which calls our
modified OpenMP runtime to automatically offload the netlist and its required data to
the FPGA. Notice that the proposed OpenMP extensions are architecture-agnostic, i.e.,
other libomptarget plugins can be added to Clang/LLVM to enable other architectures
such as Microsoft Catapult.
There are two approaches that a programmer can use to accelerate the code. In both,
the programmer has to insert a new clause, use(hrw), to signal to the HardCloud
OpenMP extension that he/she wants to accelerate the loop in hardware. The proposed
clause can also be combined with the OpenMP parallel directive to annotate non-loop
code.
In the first approach (Listing 1.1(c)), assume that the programmer does not have a
pre-synthesized hardware module capable of computing the loop. He/she can then use an
HLS toolchain, like the Intel FPGA SDK for OpenCL, to convert the C loop code to an
FPGA programming netlist. In this case, the programmer must add a new OpenMP clause,
which we call synthesize(myhrw), to specify the filename of the FPGA netlist (myhrw)
that will result from the synthesis process. As said before, the quality of the
resulting netlist will depend on the effectiveness of the HLS toolchain and on how
amenable the original C code is to hardware synthesis. This operation mode, based on
the synthesize(myhrw) clause, is not described herein.
In the second approach (Listing 1.1(d)), the programmer has a pre-synthesized netlist
(in-house or third-party) that is optimized for his/her application. In this case, the
programmer just needs to add the module(myhrw) clause to allow the modified OpenMP
runtime to upload and run the myhrw hardware when it reaches the annotation. The
remainder of this work focuses on this operation mode of the use clause.
This work makes the following contributions to the problem of integrating FPGA
acceleration into standard programs:
• It proposes an OpenMP extension that allows for seamless integration of cluster
FPGA accelerators into regular programs;
• It proposes a novel Clang/LLVM libomptarget plugin that enables the usage of the
Intel HARP architecture through OpenMP 4.X;
• It proposes a verification mechanism to check the output of accelerators;
• It validates the proposed extension by improving the performance of some typical
applications.
The work in this dissertation resulted in the scientific publication below:
• Ciro Ceissler, Ramon Nepomuceno, Marcio Pereira and Guido Araujo, "Automatic
Offloading of Cluster Accelerators," in the 26th Annual International Symposium on
Field-Programmable Custom Computing Machines (FCCM), 2018.
This dissertation is organized as follows. Chapter 2 provides background on LLVM/Clang
OpenMP and the Intel HARP platform, and discusses related work. Chapter 3 gives a
detailed description of the implementation of the use(hrw) clause and the design of
the new Clang/LLVM OpenMP (libomptarget) plugin for the Intel HARP architecture.
Chapter 4 evaluates the performance of the use(hrw) clause when using a pre-synthesized
FPGA netlist. Finally, Chapter 5 concludes the work.
Chapter 2
Background and Related Works
This chapter reviews the background and literature on offloading computation to FPGAs
using a standard programming model, and is organized in five sections. Section 2.1
describes the OpenMP support for offloading computation to another device. Section 2.2
details clang-ykt, a branch of the LLVM/Clang project, with emphasis on libomptarget,
a Clang library that manages the offload process independently of the device and
provides an interface for implementing the device-specific code. Section 2.3 presents
the Intel HARP platform, which combines an Intel Xeon processor and an Arria 10 FPGA
through a coherent, low-latency interconnect. Section 2.4 describes the Open
Programmable Accelerator Engine (OPAE) library, an abstraction that simplifies
communication with the FPGA, i.e., programming, initializing, sending/receiving data,
and control. Section 2.5 reviews other frameworks that are relevant to this
dissertation.
2.1 OpenMP 4.x Offloading Constructs
The OpenMP specification [4] defines a parallel programming model for producing and
managing parallel programs, and guarantees portability across a variety of
architectures. The specification contains a collection of compiler directives, library
routines, and environment variables for creating parallel programs in C, C++, or
Fortran. In this programming model, the user directs the actions of the compiler and
the runtime system to meet application requirements such as reducing energy
consumption or improving performance.
Since OpenMP 4.x, the specification defines offloading directives and clauses that use
an accelerator to execute the code enclosed by the target construct. The most
significant of them are described below:
• target directive: this directive maps variables from the host to the device data
environment, i.e., the raw data is shared or copied depending on the coupling between
host and device, and the block of code inside the directive executes on the device.
• device clause: this optional clause takes a scalar integer expression as its
argument. The programmer uses it to select the device that runs the offloaded region
when more than one device is available; the default device can be set with the
function omp_set_default_device.
• map clause: the map-type inside the map clause determines the action the host must
take with the data. The available map-types are to, from, tofrom, and alloc.
The first three specify whether the data should move to and/or from the device, and
the last one only allocates data in the device memory.
Thus, the target directive described above creates a target region and a device data
environment, but it needs more information to execute the offload successfully. The
device and map clauses supply this information and give the programmer control over
the offloading process.
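The interplay of the target directive and the map clause can be sketched with a small standard OpenMP example. This is a minimal sketch under stated assumptions: the function name offload_increment and the kernel itself are hypothetical, and the device clause is omitted so that the default device is used.

```c
/* Hypothetical kernel: stores x[i] + 1 into y[i] for every element. */
void offload_increment(const int *x, int *y, int n) {
    /* map(to: ...) copies x to the device before the region runs, and
     * map(from: ...) copies y back to the host afterwards. The array
     * sections x[0:n] and y[0:n] state the extent of each pointer-based
     * buffer. If no device is available, or the code is built without
     * OpenMP support, the loop simply runs on the host. */
    #pragma omp target map(to: x[0:n]) map(from: y[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        y[i] = x[i] + 1;
    }
}
```

Because y is only written inside the region, map(from:) is sufficient; a buffer that is both read and written on the device would use tofrom instead.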
2.2 clang-ykt
Clang, the compiler front-end project, supports C/C++ and the OpenMP programming
model and generates the LLVM intermediate representation (IR), which serves as input
to the LLVM project acting as the compiler back-end. clang-ykt [12] is an IBM branch
of the LLVM/Clang project that implements the OpenMP 4.5 features and supports the
Clang libomptarget library, which connects to the device plugins.
2.2.1 libomptarget
libomptarget is an OpenMP offloading runtime library for Clang that provides an
offload mechanism with support for multiple targets using OpenMP 4.5. The class
for OpenMP runtime code generation inside the Clang code creates the library calls
when it executes the emitTargetCall function; the compiler then generates a fat binary
that holds the host code and each target's code in the appropriate format, e.g., ELF,
PE32+, or Mach-O.
Figure 2.1: Schematic of libomptarget Interface
At program initialization, the target-agnostic offload library, libomptarget.so,
handles the initial steps of searching for the available device-specific runtime
libraries (RTLs), called plugins, and adding each one to the device list if it
conforms to the RTL interface. The interface that a plugin must implement to be
compatible with the library is described below:
• __tgt_rtl_data_alloc: allocates memory of the size received as an argument on the
device and returns the pointer.
• __tgt_rtl_data_submit: sends the data of a variable to the device; this
function receives the host and target pointers as arguments.
• __tgt_rtl_data_retrieve: receives the data from the device.
• __tgt_rtl_data_delete: deallocates the memory on the device.
• __tgt_rtl_load_binary: extracts the device binary to execute on the device;
the binary information can also be used only as a guide, e.g., to locate the bitstream.
• __tgt_rtl_run_target_region: initiates the execution on the target device.
• __tgt_rtl_run_target_team_region: the previous function calls this one when
there are multiple teams, each of which is a group of threads.
When the program reaches a target directive, the agnostic library checks whether the
plugin is compatible with the target binary, then maps the data environment and
executes the computation remotely. Moreover, if the application works with multiple
targets, loop iterations can run among them in parallel.
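The order in which the agnostic library drives a plugin can be illustrated with a mock device whose memory is ordinary host memory. Everything in this sketch is an assumption made for illustration: the mock_* names and their simplified signatures are not the real libomptarget entry points, and only mirror the roles of the functions listed above.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Role of __tgt_rtl_data_alloc: reserve size bytes of "device" memory. */
void *mock_data_alloc(int64_t size) {
    return malloc((size_t)size);
}

/* Role of __tgt_rtl_data_submit: copy host data to the device. */
int32_t mock_data_submit(void *tgt_ptr, const void *hst_ptr, int64_t size) {
    if (!tgt_ptr || !hst_ptr) return -1;
    memcpy(tgt_ptr, hst_ptr, (size_t)size);
    return 0;
}

/* Role of __tgt_rtl_data_retrieve: copy device data back to the host. */
int32_t mock_data_retrieve(void *hst_ptr, const void *tgt_ptr, int64_t size) {
    if (!tgt_ptr || !hst_ptr) return -1;
    memcpy(hst_ptr, tgt_ptr, (size_t)size);
    return 0;
}

/* Role of __tgt_rtl_data_delete: release the device memory. */
int32_t mock_data_delete(void *tgt_ptr) {
    free(tgt_ptr);
    return 0;
}

/* Role of __tgt_rtl_run_target_region: run the region on device data. */
typedef void (*mock_kernel_fn)(int32_t *data, int64_t count);

int32_t mock_run_target_region(mock_kernel_fn kernel, void *tgt_ptr,
                               int64_t count) {
    if (!kernel || !tgt_ptr) return -1;
    kernel((int32_t *)tgt_ptr, count);
    return 0;
}

/* Stand-in for the offloaded region: doubles every element. */
void mock_double_kernel(int32_t *data, int64_t count) {
    for (int64_t i = 0; i < count; i++) data[i] *= 2;
}

/* Sequence for one target region: alloc, submit, run, retrieve, delete. */
int32_t mock_offload_double(int32_t *host, int64_t count) {
    int64_t bytes = count * (int64_t)sizeof(int32_t);
    void *dev = mock_data_alloc(bytes);
    if (!dev) return -1;
    int32_t rc = mock_data_submit(dev, host, bytes);
    if (rc == 0) rc = mock_run_target_region(mock_double_kernel, dev, count);
    if (rc == 0) rc = mock_data_retrieve(host, dev, bytes);
    mock_data_delete(dev);
    return rc;
}
```

A real plugin would replace the memcpy calls with transfers to the device and would load the device binary (or, for an FPGA, locate the bitstream) before running the region.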
2.3 Intel HARP Platform
Figure 2.2: Intel HARP Platform
The Intel Hardware Accelerator Research Program (HARP) [29] platform is an
experimental cluster platform designed to ease the integration of FPGA acceleration
into Intel architectures. Each HARP cluster node has a multi-chip module (MCM) package
containing a CPU and an FPGA, as shown in Figure 2.2.
The FPGA contains two parts: the user-developed Accelerator Function Unit (AFU)
and the Intel-provided FPGA Interface Unit (FIU). The FIU implements all the essential
features required for the deployment and manageability of FPGAs in a datacenter.
Moreover, it controls the protocols of the physical links between the CPU and the
FPGA, and provides platform capabilities such as Intel VT-d, security, error
monitoring, performance monitoring, power and thermal management, and partial
reconfiguration of AFUs. The AFU accesses the three physical links through virtual
channels on the Core-Cache Interface (CCI-P).
2.3.1 Intel Xeon E5-2680
The CPU is a Broadwell-generation Intel Xeon E5-2600 v4 processor running at 2.40 GHz,
with 14 cores, two QPI 1.1 links, four DDR4 memory channels, 40 PCIe 3.0 lanes, and
35 MB of cache. The processor also contains two ring-like interconnect buses,
illustrated in Figure 2.3, that exchange data with each other through another
interconnect component [3] [5] [36] [44].
Figure 2.3: Intel Xeon E5 v4
The rings allow the cores to exchange information with the uncore elements, i.e.,
the external interfaces and the last-level cache (LLC). Only one ring communicates
with the QPI and PCIe Gen3 interfaces, while each ring exchanges information with the
memory controllers on its own side. Furthermore, an LLC slice, which comprises 2.5 MB,
is not exclusive to one core, so a core can distribute data to any slice through the
ring structure.
2.3.2 Intel FPGA Arria10 GX 1150
The in-package FPGA is an Arria 10 GX 1150 device (10AX115U3F45E2SGE3) running at
400 MHz, which communicates with the CPU through one 9.6 GT/s QPI 1.1 link and two
PCIe 3.0 x8 interfaces [13]. The Arria 10 GX 1150 provides up to 1,150 K logic
elements. The synthesis tool maps the logic elements to Adaptive Logic Modules (ALMs);
the ALM is the basic building block and contains two full adders and four dedicated
registers that help improve timing closure. The traditional adaptive look-up table
(LUT) can also implement two additional registers. Figure 2.4 illustrates one ALM.
Figure 2.4: ALM Block Diagram for Intel Arria 10 Devices
Further, the FPGA provides internal memory blocks of two types: M20K and MLAB.
The first, M20K, is a 20 Kb memory block with hard error-correcting code (ECC)
support, with a total capacity of 54,260 Kb. The second, the memory logic array block
(MLAB), is a 640-bit memory with a maximum total capacity of 12,984 Kb. Another
feature is the variable-precision DSP block, which supports fixed-point and
floating-point arithmetic. The device has 1,518 DSP blocks, each configurable in one
of two modes: a high-precision mode with one 27-bit multiplier, or an 18-bit precision
mode with two multipliers, which doubles the multiplier count to 3,036.
2.3.3 Intel QuickPath Interconnect
The Intel QuickPath Interconnect (QPI) is a point-to-point connection between
processors, I/O hubs, and third-party node controllers [46] [1]. Its main purpose is
to provide high-speed, low-latency, coherent communication within the system.
Information is exchanged through an architecture with four layers, described below:
• Physical: the wires carrying the signal and the circuitry that implements the link.
A connection between devices uses up to 20 lanes in each direction, for a total of 40
lanes. Other configurations use 10 or 5 lanes per direction instead of the maximum.
Because of differential signaling, each lane uses two wires, and each direction also
carries a clock on two wires, so a full connection uses 84 wires (42 per direction).
A Phit (physical unit) transfers 20 bits every clock cycle.
• Link: implements flow control and error correction when transferring data to/from
the physical layer. A Flit (flow unit) transfers 80 bits.
• Routing: contains information about the direction of the packet when the system
has multiple interconnections. The firmware holds the routing table definitions.
• Protocol: the packet, the unit of the protocol layer, is one of six types: home,
snoop, data response, non-data response, non-coherent standard, and non-coherent
bypass. Fundamentally, there are two kinds of packets: coherent packets keep the
system memory consistent, and non-coherent packets carry system-level functions such
as interrupts, memory-mapped I/O, and locks.
Compared with the other available physical links, two PCIe Gen3 x8 interfaces, the
QPI provides a higher theoretical transfer rate: 9.6 GT/s versus 8.0 GT/s for PCIe.
The PCIe specification mandates a 128b/130b encoding scheme that maps a 128-bit
payload to a 130-bit symbol; the two additional bits are overhead used to determine
packet boundaries and recover the clock. The QPI does not need this scheme. PCIe also
adds a 32-bit cyclic redundancy check (CRC), and its packets have a variable size from
64 to 4096 bytes. In contrast, the QPI packet has a fixed size of 72 bits, 8 of which
are reserved for the CRC. Thus, the QPI packet has a total overhead of 11% (8/72).
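The overhead figures quoted in this section can be reproduced with a few lines of arithmetic. The sketch below is only a worked calculation; the function names are illustrative assumptions, and the PCIe figure covers the 128b/130b line coding alone, not the CRC or packet headers.

```c
/* Fraction of a transfer unit spent on overhead rather than payload. */
double overhead_fraction(double overhead_bits, double total_bits) {
    return overhead_bits / total_bits;
}

/* QPI figure used in the text: 8 CRC bits relative to the 72-bit packet,
 * i.e. 8 / 72, which is about 11%. */
double qpi_crc_overhead(void) {
    return overhead_fraction(8.0, 72.0);
}

/* PCIe 128b/130b line coding: 2 framing bits in every 130-bit symbol,
 * i.e. 2 / 130, which is about 1.5%. */
double pcie_encoding_overhead(void) {
    return overhead_fraction(2.0, 130.0);
}

/* Number of 20-bit Phits needed to carry one 80-bit Flit. */
int phits_per_flit(void) {
    return 80 / 20;
}
```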
2.3.4 Core-Cache Interface
The Core-Cache Interface (CCI-P) [11] is an abstract interface that simplifies the use
of the physical links, independently of how many there are. SystemVerilog structures
encapsulate the interface signals; two types are available, one for requests and one
for responses. The code in Listing 2.1 illustrates how to declare a module that uses
them.
    module ccip_standard
    (
        input  logic        pClk,
        input  logic        pck_cp2af_softReset, // CCI-P Soft Reset
        input  logic [1:0]  pck_cp2af_pwrState,  // CCI-P AFU Power State
        input  logic        pck_cp2af_error,     // CCI-P Protocol Error
        input  t_if_ccip_Rx pck_cp2af_sRx,       // CCI-P Rx Port
        output t_if_ccip_Tx pck_af2cp_sTx        // CCI-P Tx Port
    );
        /* ccip_standard implementation */
    endmodule : ccip_standard
Listing 2.1: Core-Cache Interface (CCI-P) Module Example
The CCI-P abstracts the three physical links in the HARP platform and offers a simple
way to load/store data from the memory system when compared with the raw links. All
requests from the AFU uses the t_if_ccip_Tx , and process only one request per clock
cycle, plus the responses from FIU utilizes the t_if_ccip_Rx . The access of them uses
the virtual channel (VC) to select the physical link when executing a request. Below, the
description of the four VC’s:
• VL0: low-latency virtual channel; the request is routed to the QPI.
• VH0: high-latency virtual channel; the request is routed to PCIe 0. This channel
works better for large payloads.
• VH1: high-latency virtual channel; the request is routed to PCIe 1. This channel
works better for large payloads.
• VA: automatic virtual channel; the FIU selects the channel (VL0, VH0, or VH1)
available at the moment to send the request.
The user documentation recommends using the VA because it is difficult to balance the
channels manually. The only exception is polling a cache line: to avoid off-FPGA traffic,
VL0 is the best option in this case. Further, the read and write responses are unordered
because of the multiple physical links.
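The recommendation above amounts to a simple selection policy. The following is a minimal sketch of that policy; the type and function names are hypothetical, and the enumerator values are illustrative rather than the encoding used by the CCI-P.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the four virtual channels described above.
 * The numeric values are illustrative, not the CCI-P encoding. */
typedef enum { VC_VA, VC_VL0, VC_VH0, VC_VH1 } vc_t;

/* Policy from the text: use VA by default, and VL0 only when the
 * request polls a cache line. */
vc_t choose_vc(bool is_polling)
{
    return is_polling ? VC_VL0 : VC_VA;
}

/* True when the request is pinned to one of the two PCIe links. */
bool vc_uses_pcie(vc_t vc)
{
    assert((int)vc >= 0 && (int)vc <= (int)VC_VH1);
    return vc == VC_VH0 || vc == VC_VH1;
}
```

Because responses are unordered across links, code that relies on VA still has to match responses to requests itself or use a reordering layer.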
2.4 Open Programmable Accelerator Engine
Intel introduced the Open Programmable Accelerator Engine (OPAE) library [7] to simplify
accelerator access and management as a substitute for the service-oriented Accelerator
Abstraction Layer (AAL) library. This user-space library exposes the FPGA resources
as a set of features to the host program and abstracts OS- and hardware-specific details
in the driver stack, as Figure 2.5 illustrates. These features consist of functions to load
the bitstream and manage the device, enabling the user application to leverage
FPGA-based acceleration transparently.
The library supports different FPGA integration and deployment models, ranging from
a single-node system with a few FPGA devices to a large-scale deployment in data
centers, through a unified C API. A single-node system uses the API to program the
FPGA with accelerator functions over the PCIe link. A multi-node deployment uses the
same API to manage and orchestrate service resources: the software discovers and
selects the FPGA resources, then allocates them to accelerate workloads.
The OPAE C API is divided primarily into four usage models: query and search for a
resource, acquire and release a resource, shared memory buffer, and memory-mapped
I/O (MMIO). The first model, query and search, uses the function fpgaEnumerate(),
which takes an object that specifies the resource properties (e.g., accelerator type and
identification) and returns the list of all resources that match. The acquire and release
model has only two functions, fpgaOpen() and fpgaClose(), which control the ownership
of the FPGA. The shared memory buffer model manages the creation of buffers through
the functions fpgaPrepareBuffer() and fpgaReleaseBuffer(). The OPAE supports virtual
addressing, and the fpgaGetIOVA() function retrieves the virtual buffer address when
this technique is available. The last model is the MMIO, which maps/unmaps the
register file of an accelerator into the virtual memory space of the application. The
functions fpgaMapMMIO() and fpgaUnmapMMIO() create and release this mapping, and
fpgaWriteMMIO64() updates a register value.
Figure 2.5: OPAE Layered Architecture
2.5 Related Works
This section reviews the literature and briefly describes the most relevant works.
Several works use code annotated with standards such as Pthreads, OpenMP, and
OpenACC as input to HLS tools, which is a seamless way to connect the accelerator
and the software. Additionally, this section includes some works on hardware functional
verification and validation.
The work in [33] proposes a framework to generate hardware from an OpenMP C program;
the result is synthesizable code for FPGA in VHDL or Handel-C, a high-level language
for hardware compilation. A modified version of the C-Breeze compiler [28] with OpenMP
support parses the C code to create an abstract syntax tree (AST), then maps it to
a supported hardware description language, handling each supported directive. The
VHDL version maps each C function to a component with common signals to control the
execution and return the result value, and reduces the control flow to state machine
operations. Our work uses a more complex set of signals to control the FPGA IP. Also,
their solution does not receive data information from these signals and only uses hardware
generated from a high-level synthesis tool inside the OpenMP pragma.
Cabrera et al. [17] implement an extension to OpenMP 3.0 to target heterogeneous
architectures with FPGA-based accelerators. Since OpenMP 3.0 does not support
offloading the computation to another device, this work extends the clauses proposed
in [14]: an implements clause for the target directive, which specifies the bitstream and
indicates that another implementation of a function exists, and a label clause for the
task directive, which executes that alternative version. Their solution is limited to
offloading tasks, whereas our solution works with any block inside a parallel or
parallel for directive. Furthermore, their solution is not transparent to the user, who
has to describe two functions, one with the sequence of operations in the host and
another for the FPGA, instead of only specifying the device in the OpenMP clause inside
the pragma.
Pilato et al. [37] created an automatic framework for seamless integration of hardware
accelerators to narrow the gap between HLS and system-level design. The framework
input is the OpenMP application and an XML file describing the target platform
and its components, i.e., the number of software processors, the interconnection topology,
and information about the memory. The design flow first generates a task graph to
analyze the allocation of the processing elements and synthesize the hardware, next
creates the target-platform-specific parts that transfer the data and manage the runtime
and, finally, produces the FPGA bitstream. Unlike their framework, which requires an
XML file with the system description as input, our solution abstracts the system
complexity.
Choi et al. [20] make available an open-source framework that generates parallel
hardware from software code written with the Pthreads and OpenMP parallelization
paradigms. This framework creates a SoC platform with a MIPS processor and the
hardware accelerators, and also provides the binary to run on the processor. The work
in [21] proposes a heterogeneous hardware/software system generated from
OpenMP-annotated C code; the system is a multiprocessor System-on-Chip (MPSoC) for
Xilinx FPGAs with MicroBlaze microprocessors. The work in [39] improves a framework
that transforms C code annotated with OpenMP tasks into a System-on-Chip (SoC) with
a soft-core processor, the Altera Nios II, and one hardware accelerator that handles all
tasks; the whole system runs in the FPGA. These works propose the creation of a SoC
inside the FPGA with soft-core processors and embedded accelerators, whereas our
solution integrates a multi-core processor, an Intel Xeon, with the FPGA. Our solution
is suitable for datacenter applications because a soft-core processor is at least 10x
slower than a hard-core processor.
Zhang et al. [45] introduce a compilation flow, CMOST, that automates the mapping of
C programs to a full FPGA system and optimizes the integration using the following
methods: task-level dependence analysis, block-based data streaming, and automated
synchronous dataflow generation. The input C code contains regions marked with pragmas,
which designate the blocks of code that execute in the hardware accelerator. Moreover,
the framework replaces each region with OpenCL code to communicate with the platform.
CMOST supports two scenarios: processors connected via PCIe and processors embedded
in the FPGA. In the first scenario, this work depends on OpenCL to communicate with
the FPGA, which adds overhead to the solution. Our work communicates with the FPGA
at a low level to reduce this overhead.
The work in [31] proposes a high-level programming framework for FPGAs and novel
OpenACC pragma extensions to generate more efficient accelerators. The input is a C
program with OpenACC pragmas, on which the framework performs a source-to-source
translation with the OpenARC compiler [32] to create accelerator-specific code, OpenCL
for FPGA in this work. The output code serves as input to the Altera OpenCL compiler
(AOC), which generates the hardware configuration file to program the FPGA. The
OpenACC extensions produce FPGA architectures that exploit parallelization and deep
pipelines to maximize performance. These pragma extensions are mapped to OpenCL
kernel optimizations, namely loop unrolling, kernel vectorization, and compute unit
replication. This work is similar to our project but, like the majority of the other
works, uses an HLS tool and needs user interaction to generate the best hardware for a
specific solution.
Sommer et al. [41] enable the user to offload computation from a host to an FPGA
accelerator using the OpenMP target directives. This work first transforms the kernel
code marked with OpenMP into a hardware component using high-level synthesis, then
programs the FPGA and, finally, provides the infrastructure, added throughout the
compilation process, to handle communication and invocation during the execution. The
initial steps use an automated framework called ThreadPoolComposer (TPC) [30] to
synthesize the accelerators from C/C++ kernel code and assemble them into one
bitstream; the framework invokes Xilinx Vivado to execute the HLS flow and generates
the thread pool mapped to the FPGA. Before the synthesis process, the TPC adds
components to standardize the connectivity between host and target. The TPC API
provides a set of functions to combine hardware and software and to control the
accelerators: programming them, sending and receiving data, and launching or stopping
jobs. Furthermore, the compilation flow uses the LLVM/Clang compiler with the
libomptarget library. A plugin was added to this library to connect with the TPC API
and handle the FPGA communication at runtime. Our solution supports programming the
FPGA with a pre-synthesized bitstream and has a specific implementation for each
CPU-FPGA architecture, which reduces the communication overhead.
OpenMP is a popular standard, but we found no work on extensions that integrate
hardware functional verification or validation. However, many techniques based on code
generation have been proposed to support functional hardware verification [35, 9, 25,
23, 16], for example, directed tests or the creation of specific testbench templates.
Salamanca et al. [40] propose a dynamic loop-carried dependence checker, using the
parallel for check construct, which can help the programmer identify this kind of
dependence.
Our approach differs from the above-mentioned works because: (a) it enables programmers
to access two recently released CPU-FPGA cluster architecture nodes (Intel HARP
and Microsoft Catapult); (b) it gives programmers the flexibility to select between two
operation modes, one based on pre-synthesized VHDL/Verilog modules and the other based
on FPGA netlist synthesis from OpenMP-annotated C code; (c) it connects applications
to FPGA acceleration through a modern compiler toolchain (LLVM/Clang); and (d) it
eases functional verification using the host code.
Chapter 3
The HardCloud
This chapter describes HardCloud in detail and how the OpenMP directive extensions
turn an FPGA into another OpenMP acceleration device that can be used directly from
any user program. The topics discussed are the hardware architecture, the offloading
process, the accelerator core interface, and a mechanism for verification/validation.
[Flow diagram. Blocks: initialize; next step; try to offload; execute host version;
execute check version; compare; data output; end offload. Decisions: has reached a
target directive?; has offload failed?; has check clause?]
Figure 3.1: Execution Flow Diagram.
Figure 3.1 outlines the execution as a flow diagram. The program starts at the
initialize block, which loads the plugin necessary to offload the computation to
the FPGA. The application then runs each step of the binary code until it reaches
a target directive and tries to offload the computation with the Intel HARP plugin. If
the offload fails, e.g., because of an FPGA communication failure or an invalid data
pointer, a fallback mechanism executes the host version, which is available in the ELF
binary file. Next, the program tests for the check clause in the OpenMP pragma; if the
clause is absent, it finishes the offload and continues to the next line of code. If the
developer inserted the clause to enable the verification/validation mode, the program
runs the check version and compares its output with that of the accelerator
implementation.
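The decision logic of this flow can be sketched in a few lines of C. This is an illustrative sketch, not the libomptarget code: the callback type, the result enumeration, and the toy callbacks are hypothetical, and the sketch assumes that the check version runs only after a successful offload.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback: fills `out` (len bytes), returns true on success. */
typedef bool (*region_fn)(void *out, size_t len);

typedef enum { RAN_ON_DEVICE, RAN_ON_HOST, CHECK_MISMATCH } region_result;

/* Runs one target region following the flow of Figure 3.1.
 * `scratch` is a caller-provided buffer of at least len bytes that
 * receives the output of the check version. */
region_result run_target_region(region_fn offload, region_fn host_version,
                                region_fn check_version, bool has_check_clause,
                                void *out, void *scratch, size_t len)
{
    assert(offload && host_version && out);
    if (!offload(out, len)) {
        /* Offload failed: fall back to the host version in the binary. */
        host_version(out, len);
        return RAN_ON_HOST;
    }
    if (has_check_clause) {
        /* Assumption: the check runs only after a successful offload. */
        assert(check_version && scratch);
        check_version(scratch, len);
        if (memcmp(out, scratch, len) != 0)
            return CHECK_MISMATCH;
    }
    return RAN_ON_DEVICE;
}

/* Toy callbacks used only to exercise the flow. */
bool toy_offload_ok(void *out, size_t len)   { memset(out, 1, len); return true; }
bool toy_offload_fail(void *out, size_t len) { (void)out; (void)len; return false; }
bool toy_host(void *out, size_t len)         { memset(out, 1, len); return true; }
bool toy_check_diff(void *out, size_t len)   { memset(out, 2, len); return true; }
```

With the toy callbacks, a failing offload returns RAN_ON_HOST, and a check version that produces different bytes returns CHECK_MISMATCH.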
The software application communicates with a Clang library, libomptarget, which
manages the offload process independently of the device and executes the computation on
the accelerator chosen in the OpenMP pragmas using a device-specific plugin, which
calls the device driver. On the Intel HARP, the communication with the FPGA goes
through OPAE, an interface composed of a library and an API that facilitates data
transfer and control. A hardware abstraction layer, the accelerator core interface,
simplifies sending and receiving transactions over the CCI-P interface.
3.1 Architecture
Figure 3.2 shows the HardCloud extensions of the HARP base hardware architecture.
Highlighted in gray are the sub-blocks developed to enable HardCloud, in blue are Intel
building sub-blocks used to support HardCloud, while in red is the Core sub-block which
is programmed by the acceleration module netlist (myhrw) provided by the programmer
through OpenMP annotation. The FPGA Interface Unit (FIU) sub-block is a static
FPGA fabric that handles all information received from the FPGA external interface by
extending the coherence domain from the CPU through an on-FPGA cache. The QPI 1.1
and 2x PCIe physical links between the FPGA and the memory system are abstracted by
means of a virtual channels sub-block, called Core Cache Interface (CCI-P), that provides
a simple load/store access to the system memory.
Figure 3.2: Block Diagram FPGA Architecture
Inside the AFU there is also an optional clock domain-crossing shim sub-block, called
ASYNC, that enables accelerators that are slower than 400 MHz by connecting a pair
of CCI-P interfaces with different clocks. This sub-block implements two parametrizable
dual-clock FIFOs for each direction of the CCI-P channel, and the soft reset signal
is also mapped to the destination clock domain. The Memory Properties Factory (MPF)
sub-block is a collection of shims that transform the CCI-P to add more complex memory
semantics, because the CCI-P is deliberately kept simple to minimize FPGA area. The
available internal components are a reorder buffer that returns read responses in request
order, a virtual-to-physical translation that allows operations on virtual addresses, a
masked write implemented as a read-modify-write, and a shim that enforces read and write
ordering to the same address.
Table 3.1: AFU Memory Map
Name               Address             Size  Access  Brief Description
HC_DEVICE_HEADER   0x000               64b   RO      Constant: 0x1000000010000000.
HC_AFU_ID_LOW      0x008               64b   RO      Constant: 0xC000C9660D824272.
HC_AFU_ID_HIGH     0x010               64b   RO      Constant: 0x9AEFFE5F84570612.
RESERVED           0x018 to 0x0f8      -     RW      Reserved for future use.
HC_DSM             0x110               64b   RW      HardCloud AFU DSM (Device Status Memory) base address, which is cache aligned, so the six least significant bits are zero.
HC_CONTROL         0x118               32b   RW      Controls the module execution with the following commands: '0x0000' to assert reset; '0x0001' to de-assert reset; '0x0003' to start; '0x0007' to stop.
HC_BUFFER_ADDR[i]  0x120 + 0x10*i      64b   RW      Buffer virtual address, whose size is HC_BUFFER_SIZE[i]. Writes or reads are targeted to this region.
HC_BUFFER_SIZE[i]  0x128 + 0x10*i      32b   RW      Size in number of cache lines, each one is 512b.
RESERVED           0x1000 to MPF_ADDR  -     RW      Reserved for MPF features. The MPF_ADDR is uncertain because it depends on the features enabled.
The HardCloud Control Status Register (CSR) sub-block is a register file that manages
communication with the host. The CSR handles the requests to the addresses in Table 3.1,
except those reserved for MPF features. The read-only addresses hold two types of
information: the AFU id, a unique identifier for HardCloud applications, and the device
header, which specifies the system configuration, e.g., whether the system has
virtual-to-physical translation and how many AFUs are available.
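On the host side, the memory map of Table 3.1 can be captured as a small header. The sketch below takes the offsets and command values from the table; the macro and helper names are hypothetical, and the rounding helper is an added convenience rather than part of HardCloud.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Byte offsets from Table 3.1 (macro names are hypothetical). */
#define HC_REG_DEVICE_HEADER 0x000u
#define HC_REG_AFU_ID_LOW    0x008u
#define HC_REG_AFU_ID_HIGH   0x010u
#define HC_REG_DSM           0x110u
#define HC_REG_CONTROL       0x118u

/* HC_CONTROL command values from Table 3.1. */
enum hc_control_cmd {
    HC_CTL_ASSERT_RESET   = 0x0000,
    HC_CTL_DEASSERT_RESET = 0x0001,
    HC_CTL_START          = 0x0003,
    HC_CTL_STOP           = 0x0007
};

/* Per-buffer registers: 0x120 + 0x10*i and 0x128 + 0x10*i. */
static inline uint32_t hc_buffer_addr_reg(uint32_t i) { return 0x120u + 0x10u * i; }
static inline uint32_t hc_buffer_size_reg(uint32_t i) { return 0x128u + 0x10u * i; }

#define HC_CACHE_LINE_BYTES 64u /* one cache line is 512 bits */

/* The DSM base address must have its six least significant bits zero. */
static inline bool hc_is_cache_aligned(uint64_t addr)
{
    return (addr & 0x3Fu) == 0;
}

/* Converts a size in bytes to cache lines, rounding up. */
static inline uint32_t hc_size_in_cache_lines(uint64_t bytes)
{
    assert(bytes > 0);
    return (uint32_t)((bytes + HC_CACHE_LINE_BYTES - 1) / HC_CACHE_LINE_BYTES);
}
```

For example, buffer 1 uses the registers at 0x130 and 0x138, and an array of 48 32-bit words occupies three cache lines.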
The HardCloud Requestor controls the flow of data to/from the memory shared between
the CPU and the accelerator, and also transforms the AFU requests into the appropriate
structure. This sub-block contains two FIFOs, one that stores read operations and
another for writes, and handles multiple buffers, whose number varies for each
application. For each buffer operation, the proper state machine creates a read/write
CCI-P request.
[State diagram. States: Start (entered on reset), Idle, Process, Stream Mode, and
Indexed Mode; transitions numbered 1 to 10.]
Figure 3.3: Read State Machine.
At the beginning of the read state machine illustrated in Figure 3.3, the Start state
waits for an initial signal from the HC CSR to begin the accelerator computation (1).
When the signal arrives (2), the current state moves to Idle, where it stays while the
read request FIFO is empty or request channel zero of the CCI-P interface is almost
full (3). A read request from the IP populates this FIFO, and the state machine
transitions to Process (4). In this state, the requestor removes one element and
interprets its structure to extract the command field before the transition to the next
state. The command designates the next step among the available modes of operation,
read stream (5) or read indexed (6). Both modes use the buffer identification field of
the element structure to define the base address and the size.
The Stream Mode state also needs to extract the number of cache lines from the element
structure; it then constructs a CCI-P read request with the address (the base address
plus the offset address, an internal register for this mode of operation that increments
automatically on each read operation). The transition (7) happens when the channel is
busy or the offset address increments. The hardware goes back to the Idle state when
the read stream operation finishes (9).
The Indexed Mode state requests only one cache line, and the address is the base address
plus the offset, which is carried in one field of the element structure. The indexed mode
executes in one clock cycle if the channel is available; the transition (8) shows what
occurs when the channel is busy. Finally, the requestor goes to the Idle state (10).
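The transitions described above can be summarized as a next-state function. The sketch below is a software model written for illustration, not the HardCloud RTL: the input names are hypothetical, the reset is not modeled, and transition (7) is interpreted as a self-loop that lasts until every cache line of the stream has been requested.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { RD_START, RD_IDLE, RD_PROCESS, RD_STREAM, RD_INDEXED } rd_state;

/* Inputs sampled in one clock cycle; all field names are hypothetical. */
typedef struct {
    bool start;         /* start signal from the HC CSR                 */
    bool fifo_empty;    /* read request FIFO holds no element           */
    bool channel_busy;  /* CCI-P request channel zero is almost full    */
    bool cmd_is_stream; /* decoded command: stream (otherwise indexed)  */
    bool stream_done;   /* every cache line of the stream was requested */
} rd_inputs;

/* Next-state function; the numbers are the transitions of Figure 3.3. */
rd_state rd_next_state(rd_state s, rd_inputs in)
{
    switch (s) {
    case RD_START:
        return in.start ? RD_IDLE : RD_START;                /* (2) else (1) */
    case RD_IDLE:
        if (in.fifo_empty || in.channel_busy)
            return RD_IDLE;                                  /* (3) */
        return RD_PROCESS;                                   /* (4) */
    case RD_PROCESS:
        return in.cmd_is_stream ? RD_STREAM : RD_INDEXED;    /* (5) else (6) */
    case RD_STREAM:
        if (in.channel_busy || !in.stream_done)
            return RD_STREAM;                                /* (7) */
        return RD_IDLE;                                      /* (9) */
    case RD_INDEXED:
        return in.channel_busy ? RD_INDEXED : RD_IDLE;       /* (8) else (10) */
    }
    assert(!"unknown state");
    return RD_START;
}
```

Calling rd_next_state() once per simulated clock cycle reproduces the path Start, Idle, Process, Stream Mode, Idle for a stream request on a free channel.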
[State diagram. States: Start (entered on reset), Idle, Process, Indexed Mode,
Stream Mode, Send Finish, and Stop; transitions numbered 1 to 13.]
Figure 3.4: Write State Machine.
As can be seen by comparing the write state machine in Figure 3.4 with the read state
machine, they are very similar. A write request FIFO stores the requests from the IP;
next, the Process state identifies the write mode. The indexed mode requests a write,
through the CCI-P interface, of only one cache line to a specific address. The stream
mode, on the other hand, writes a series of cache lines to sequential addresses,
incrementing the last offset value. Unlike the read state machine, the write state
machine contains two more states: Send Finish and Stop. When the accelerator IP
finalizes the computation, it sends a signal to inform the requestor, which moves to the
Send Finish state as soon as the current state is Idle (11). If the channel is busy, it
remains in that state (12); the transition to Stop occurs when the channel is free.
3.2 Offloading Flow
This section describes in detail what happens during the offloading process, which starts
when the program reaches the target device annotation in Listing 1.1(d) and ends after
the annotated code executes.
As mentioned above, the OPAE library provides a set of functions that enable the
LLVM/Clang libomptarget HARP plugin to access the HardCloud registers and buffers
defined before. The protocol between the application, the OPAE library, and the FPGA
is described in the UML sequence diagram of Figure 3.5 and works as follows:
(1) When the program reaches target device, an OPAE function is called to program the
FPGA with the pre-compiled bitstream named in the clause module(myhrw), allocate the
HC_DSM buffer, and write its virtual address to the base register (HC_DSM);
(2) OPAE starts a reset command to the FPGA using the HC_CONTROL register;
(3) OPAE ends the reset command to the FPGA, again using the HC_CONTROL register;
(4) When the host reaches the map(to/from) directive, it calls the OPAE API to set up
the data send/receive operations. OPAE maps the data into a set of input/output
buffers addressed virtually;
(5) The buffers with the offloaded data are then sent to the FPGA. Each buffer i is
located at the virtual address specified in register HC_BUFFER_ADDR[i] and has
the size given by HC_BUFFER_SIZE[i];
(6) When the host reaches the use(hrw) directive, it calls the OPAE API to start the
execution of the accelerator;
(7) OPAE sets the start command in register HC_CONTROL to run the FPGA accelerator;
(8) The HardCloud runtime on the application side keeps polling the HC_DSM to detect
the end of the execution of the FPGA accelerator;
(9) After detecting the end of the accelerator execution, the HardCloud runtime sends
an end execution command to OPAE;
(10) OPAE sends the end execution command to the FPGA through the HC_CONTROL register.
[Sequence diagram. Participants: HARP Plugin, OPAE, FPGA. Messages by step:
(1) target device(HARP); program(myhrw); addr = allocBuffer(pagesize);
    mmioWrite(HC_DSM, addr)
(2) mmioWrite(HC_CONTROL, 0x0000)
(3) mmioWrite(HC_CONTROL, 0x0001)
(4) map(to/from: X[:size]); addr = allocBuffer(HC_BUFFER_SIZE[i])
(5) mmioWrite(HC_BUFFER_ADDR[i], addr); mmioWrite(HC_BUFFER_SIZE[i], size)
    Steps (4) and (5) iterate through all buffers.
(6) use(hrw)
(7) mmioWrite(HC_CONTROL, 0x0003)
(8) memRead(HC_DSM), repeated in the wait execution loop
(9) End Execution
(10) mmioWrite(HC_CONTROL, 0x0007)]
Figure 3.5: Sequence diagram of the offloading to the Intel HARP.
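The register traffic of this protocol can be reproduced against a mock device. The sketch below is an assumption-laden illustration: mock_fpga and mmio_write() are hypothetical stand-ins for the OPAE MMIO calls, bitstream programming and buffer allocation are omitted, and the completion flag (a non-zero first word of the DSM buffer) is an assumed encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register offsets from the AFU memory map (Table 3.1). */
#define HC_DSM            0x110u
#define HC_CONTROL        0x118u
#define HC_BUFFER_ADDR(i) (0x120u + 0x10u * (uint32_t)(i))
#define HC_BUFFER_SIZE(i) (0x128u + 0x10u * (uint32_t)(i))

/* Hypothetical mock device: it records every MMIO write. */
#define MOCK_LOG_MAX 32
typedef struct {
    uint32_t offset[MOCK_LOG_MAX];
    uint64_t value[MOCK_LOG_MAX];
    size_t   count;
    volatile uint64_t dsm; /* models the first word of the DSM buffer */
} mock_fpga;

static void mmio_write(mock_fpga *f, uint32_t offset, uint64_t value)
{
    assert(f->count < MOCK_LOG_MAX);
    f->offset[f->count] = offset;
    f->value[f->count]  = value;
    f->count++;
    if (offset == HC_CONTROL && value == 0x0003)
        f->dsm = 1; /* the mock accelerator finishes immediately */
}

typedef struct { uint64_t addr; uint32_t cache_lines; } hc_buffer;

/* Register writes of steps (1) to (10). */
void hc_offload(mock_fpga *f, uint64_t dsm_addr, const hc_buffer *bufs, size_t n)
{
    mmio_write(f, HC_DSM, dsm_addr);      /* (1) DSM base address   */
    mmio_write(f, HC_CONTROL, 0x0000);    /* (2) assert reset       */
    mmio_write(f, HC_CONTROL, 0x0001);    /* (3) de-assert reset    */
    for (size_t i = 0; i < n; i++) {      /* (4)-(5) one per buffer */
        mmio_write(f, HC_BUFFER_ADDR(i), bufs[i].addr);
        mmio_write(f, HC_BUFFER_SIZE(i), bufs[i].cache_lines);
    }
    mmio_write(f, HC_CONTROL, 0x0003);    /* (6)-(7) start          */
    while (f->dsm == 0) {
        /* (8) poll the DSM until the accelerator reports the end */
    }
    mmio_write(f, HC_CONTROL, 0x0007);    /* (9)-(10) stop          */
}
```

With two buffers, the mock logs nine writes: one for HC_DSM, two for the reset, four for the buffer registers, and one each for start and stop.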
3.3 IP Interface
The HardCloud IP Interface provides a simple method to connect the IP core: it exchanges
information with the OpenMP runtime and reads/writes the shared memory seamlessly
through the CCI-P interface. Two operation modes are available to execute a read/write
operation: stream and indexed. This section describes how to configure and integrate a
hardware module with HardCloud and illustrates, through examples, how the communication
protocol works.
The code below shows a stub implementation of an IP with the HardCloud interface
used to communicate with the memory system, which provides access to the shared buffers
between host and device.
module IP
(
    input  logic  clk,
    input  logic  reset,
    input  logic  start,
    output logic  finish,
    hc_buffers_if buffer
);
    /* IP implementation */
endmodule : IP
Listing 3.1: Example of SystemVerilog HardCloud IP
In Listing 3.1, the start port signals that the hardware configuration is done and the
computation can start. The IP sets the finish port high to signal the end of the
computation to the software. The buffer port is a SystemVerilog interface that comprises
several functions that enable the user to retrieve or send data, in cache lines, with
little effort. This port is composed of two channels that operate in parallel, one for
transmission and another for reception. Each channel has a set of signals to communicate
with the HC Requestor, covering the following functionality: send/receive data, control,
and status.
The send/receive data group contains a data signal of 512 bits, which is the size of one
cache line, and a valid signal. The control group uses the id signal to select the buffer
on which the command, given by the cmd signal, operates. There are five commands: idle,
read stream, read indexed, write stream, and write indexed. Two other signals, size and
offset, complement the command. The status group notifies the IP about the state of the
FIFOs, i.e., full, empty, and the number of requests issued.
3.3.1 Stream Access Mode
The stream access mode provides a way to read from or write to sequential addresses
without specifying them. To illustrate how this mode operates, consider the C/C++ code
below, which uses OpenMP and has two variables, X and Y.
uint32_t X[48]; // 3 cache lines
uint32_t Y[32]; // 2 cache lines
/* some code */
#pragma omp target device(HARP) map(to: X) map(from: Y)
#pragma omp parallel use(hrw) module(bitstream)
{
    // Software version of the hardware module.
    /* some code */
}
Listing 3.2: Example of C/C++ OpenMP Code
From the code in Listing 3.2, the memory contains two buffers, one for reading and
another for writing, specified in the map clauses. Now assume, without loss of
generality, that the variable X, whose size is equal to three cache lines, holds the
values A, B, and C, one in each subsequent cache line address. Further, the variable Y
is an output buffer from the FPGA perspective and has the size of two cache lines.
Figure 3.6 shows the block diagram of the operations in stream mode. The numbered steps
below explain how the data flows through the main components: the IP, the HardCloud
Requestor, and the shared memory.
(1) The IP enqueues three cache line requests, 512 bits each, on the read request FIFO
for buffer_i in the shared memory, starting from the base address (0x000 0000
0010). In this mode, every request uses the address subsequent to the last one
used. For the first access, the HardCloud CSR has already specified the base address,
and the initial offset value is zero;
(2) When the CCI-P interface is not busy, the HardCloud Requestor dequeues the
first element from the read request FIFO and issues the read to the next address of
buffer X;
(3) When the data is available, a valid bit is set to inform the IP. The data is
accessible only in this clock cycle, so the IP must store or consume it;
(4) After the IP processes the information received, it puts the output data in the
second buffer. An enqueue command adds the new data to the write FIFO;
(5) The stream mode is selected to write the other buffer. As with the read, each
write is sequential and uses the address next to that of the last operation on the buffer.
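The address generation in the steps above can be modeled with a base address and an auto-incrementing offset. The sketch below is illustrative only: the structure and function names are hypothetical, and it assumes, as the example addresses suggest, that addresses are counted in cache-line units.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of one buffer accessed in stream mode. Addresses are
 * in cache-line units (an assumption based on the example addresses). */
typedef struct {
    uint64_t base;        /* base address programmed through the CSR */
    uint32_t size;        /* buffer size in cache lines              */
    uint32_t next_offset; /* internal offset register, starts at 0   */
} stream_buffer;

/* Returns the address of the next cache line and advances the offset. */
uint64_t stream_next_address(stream_buffer *b)
{
    assert(b->next_offset < b->size);
    return b->base + b->next_offset++;
}
```

For the buffer X of the example, with base address 0x10 and three cache lines, three consecutive calls return 0x10, 0x11, and 0x12, which hold A, B, and C.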
Figure 3.7 represents the waveform of the flow described. The clock cycles with relevant
events are described hereafter.
The rx and tx channels are capable of operating in parallel but, to improve the
readability of the waveform and without loss of generality, the read requests happen
first, followed by the write operations. Additionally, the IP sees the variable X as
buffer number 0 and Y as buffer number 1.
Figure 3.6: Stream Access Mode
A read request is issued to the buffer with identification 0. The event occurs during
clock cycle 2: the id signal selects the buffer, the cmd signal with value one
announces a read in stream mode, and the size signal with value 3 indicates the
number of cache lines requested by this operation. In the next clock cycle, the count
signal of the rx channel is updated to one because exactly one operation was issued.
After several cycles, the IP receives valid data, A, at clock cycle 7, and the IP must
store or consume the data in that cycle because it is available only then.
At clock cycle 9, the FPGA has requested all three cache lines from the memory
system, and the count signal goes to zero. After more clock cycles, the remaining data,
B and C, arrive at the IP.
The subsequent operation is a write stream of two cache lines. The procedure executes
at clock cycle 16 and first provides the data, D, and the number of cache lines that
this stream operation will provide, i.e., two cache lines, through the size signal.
The HC Requestor uses the size signal to request multiple cache lines from the CCI-P
interface in a single operation when possible. The command value for this operation
must be equal to three. The FIFO counter increments each time it receives valid data
and decrements when a request is transmitted through the CCI-P. Finally, at clock
cycle 21, the IP sends the termination signal to the host application, but only after
every request has been dispatched to the memory.
Figure 3.7: Streaming Access Mode
3.3.2 Indexed Access Mode
This subsection explains the indexed access mode. In this mode, the address offset for
the buffer must be set before any memory access operation is executed. Unlike the
previous C/C++ example, larger variables X and Y are used, as illustrated in
Listing 3.3.
uint32_t X[64]; // 4 cache lines
uint32_t Y[64]; // 4 cache lines
/* some code */
#pragma omp target device(HARP) map(to: X) map(from: Y)
#pragma omp parallel use(hrw) module(bitstream)
{
    /* software version of the hardware module. */
}
Listing 3.3: Example of C/C++ Code
The variable Y is likewise an output buffer, but now has the size of four cache lines.
The variable X is an input buffer that now holds four cache lines, and the values of
this buffer at each subsequent address are A, B, C, and D. The data flow in Figure 3.8
shows how the indexed mode works for the example above; each step is described below.
Figure 3.8: Indexed Access Mode
(1) The IP requests three cache lines, not necessarily in sequential order. The first
request is for the address 0x000 0000 0013, which contains the data D, the second is
for the address immediately before it, and the last is for the address 0x000 0000
0010;
(2) When the CCI-P interface is not busy, the HardCloud Requestor dequeues the
first element from the read request FIFO and issues the read to a specific address of
the buffer X;
(3) When the data is available, a valid bit is set to inform the IP;
(4) After the IP processes the information received, it puts the output data in the
second buffer. An enqueue command adds the new data to the write FIFO;
(5) The indexed mode is selected to write the other buffer. As with the read, each
write command has a specific address. The first write is to the address 0x000
0000 0033 with data H, and the second to the address 0x000 0000 0031 with data F.
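In indexed mode the address is simply the buffer base plus the offset carried by the request. The sketch below checks the addresses used in the steps above; the function name is hypothetical, addresses are assumed to be in cache-line units, and the base address 0x30 for Y is inferred from the write addresses in step (5).

```c
#include <assert.h>
#include <stdint.h>

/* Address of one cache line in indexed mode: base plus offset, both in
 * cache-line units (an assumption). `size` is the buffer size in cache
 * lines and bounds the offset. */
uint64_t indexed_address(uint64_t base, uint32_t size, uint32_t offset)
{
    assert(offset < size);
    return base + offset;
}
```

With X at base 0x10, the offsets 3, 2, and 0 select 0x13, 0x12, and 0x10. With Y at the inferred base 0x30, the offsets 3 and 1 select 0x33 and 0x31.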
Figure 3.9: Indexed Access Mode
Figure 3.9 represents the waveform of the indexed access mode. The description hereafter
highlights the most important events. As in the previous waveform explanation, a few
assumptions simplify the exposition. First, the variable X is the buffer with id
number 0, and the variable Y has id number 1. Second, the channels do not operate in
parallel.
A read request is issued to the buffer with id number 0, the variable X, at three
different clock cycles: 2, 3, and 8. The requests differ only in the offset signal: the
first uses address offset 3, then 2 and, finally, 0. The command signal is equal to 2,
which selects a read indexed transaction. The count signal of the rx channel increments
each time an operation is issued; this occurs during clock cycles 2, 3, and 7. The
FIFO counter decrements for each read when the CCI-P is not busy; the waveform
illustrates this behavior at clock cycles 3, 8, and 10. After a memory access operation
is sent, the IP has to wait several clock cycles to receive the data, which arrives in
the same order as requested. In this example, the IP receives the data (C, D, and B)
when the valid signal is one, which happens at clock cycles 8, 13, and 14.
After the read requests, the next operation is the write indexed of two cache lines.
The write command on the tx channel, whose value is 4, is set together with the
buffer identifier for the variable Y and the address offset. The first write request,
with data H, is to the address 0x0033. Next, the data with value F is written to a lower
address, 0x0031, since the offset equals one. As described before, the FIFO counter
decrements every time an operation is sent to the memory system.
3.3.3 Integration
This section describes how to create a module and integrate it with the Intel and
HardCloud blocks.
The first function of the buffer port is the read request. The IP core module sends a
request for data from memory to the HardCloud Requestor block, which contains a read
request FIFO with a default depth of 8. The functions used to request data are
described below:
• read_idle(): no operation is requested.
• read_stream(id, size): sequential read of size cache lines from buffer id. An internal
register increments the read offset to the next address. The initial value of the
offset is zero.
• read_indexed(id, offset): indexed read of one cache line from buffer id at the buffer
address plus offset.
• read_fifo_count(): returns the number of items in the read request FIFO.
• read_fifo_is_empty(): returns the empty status of the read request FIFO.
• read_fifo_is_full(): returns the full status of the read request FIFO.
...
always_ff @(posedge clk or posedge rst) begin
  if (rst) begin
    buffer.read_idle();
  end
  else begin
    // If request FIFO is not full, read
    // 512 cache lines from buffer 0.
    if (!buffer.read_fifo_is_full()) begin
      // Since the read_stream function is used,
      // the offset is automatically incremented.
      buffer.read_stream(0, 512);
    end
    else begin
      buffer.read_idle();
    end
  end
end
...
Listing 3.4: Example of Read Stream Request
The SystemVerilog code in Listing 3.4 shows how to request a stream read. The buffer is
first set to the idle state while the reset signal is asserted. Next, the code checks
whether the read request FIFO is full. If it is not full, a read operation is issued;
otherwise, the buffer returns to the idle state. The read_stream() operation needs only
the buffer id and the size as arguments, since the address is updated automatically.
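The interplay between the request FIFO and the automatic stream offset can be modeled in
software as in the sketch below. It is a simplified illustration rather than the
HardCloud implementation: the read_port_t type and the functions read_stream_one and
cci_accept are hypothetical, and the model queues one cache-line request per call,
whereas read_stream(id, size) takes the total number of cache lines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define READ_REQUEST_FIFO_DEPTH 8  /* default depth stated in the text */

/* Hypothetical software model of the read port state. */
typedef struct {
    unsigned count;        /* requests currently queued in the FIFO */
    uint64_t next_offset;  /* stream offset, starts at zero */
} read_port_t;

bool read_fifo_is_full(const read_port_t *p)
{
    return p->count == READ_REQUEST_FIFO_DEPTH;
}

bool read_fifo_is_empty(const read_port_t *p)
{
    return p->count == 0;
}

/* Queue one streamed cache-line request. Returns false, and changes
 * nothing, when the FIFO is full (the caller should stay idle). */
bool read_stream_one(read_port_t *p, uint64_t *offset_out)
{
    if (read_fifo_is_full(p))
        return false;
    *offset_out = p->next_offset++;  /* the offset advances automatically */
    p->count++;
    return true;
}

/* The CCI-P accepted one request, so the FIFO counter decrements. */
void cci_accept(read_port_t *p)
{
    assert(!read_fifo_is_empty(p));
    p->count--;
}
```

Starting from an empty port, eight consecutive requests succeed with offsets 0 to 7; the
ninth is refused until cci_accept frees a slot, which mirrors the full-status check in
Listing 3.4.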
The next functionality is the read response. The data is available in the same order
as requested because of the reorder feature of the MPF sub-block. The functions used to
read the data, when valid, are described below:
• valid(): returns whether data is available to be read.
• data(): returns one cache line.
The SystemVerilog code in Listing 3.5 shows how to check whether the data is valid and
store it in an internal register using a simple if statement.
...
t_buffer_data cache_line;
always_ff @(posedge clk or posedge rst) begin
  if (rst) begin
    cache_line <= '0;
  end
  else begin
    if (buffer.valid())
      cache_line <= buffer.data();
  end
end
...
Listing 3.5: Example of Read Response
The next interface functionality is the write request. When executing a write
operation, the IP core module sends data to the HardCloud Requestor block, which
contains a write FIFO with a default depth of 8. The functions used to write to a
buffer are described below:
• write_idle(): no operation is requested.
• write_stream(id, data): sequential write of data to buffer id. An internal register
increments the write offset to the next address. The initial value of the offset is
zero.
• write_indexed(id, offset, data): indexed write of data to buffer id at the buffer
address plus offset.
• write_fifo_count(): returns the number of items in the write FIFO.
• write_fifo_is_empty(): returns the empty status of the write FIFO.
• write_fifo_is_full(): returns the full status of the write FIFO.
Listing 3.6 shows an example of how to write to a specific buffer. On reset, the buffer
write state is set to idle. Afterwards, the write FIFO is checked to see whether it is
full. If it is not full, a write operation in stream mode is issued to buffer id zero
with the register cache_line as data, and the register is incremented each time a write
operation occurs.
...
t_buffer_data cache_line;
always_ff @(posedge clk or posedge rst) begin
  if (rst) begin
    cache_line <= '0;
    buffer.write_idle();
  end
  else begin
    // If write FIFO is not full, write data to buffer 0.
    if (!buffer.write_fifo_is_full()) begin
      cache_line <= cache_line + 1;
      // The offset is automatically incremented.
      buffer.write_stream(0, cache_line);
    end
    else begin
      buffer.write_idle();
    end
  end
end
...
Listing 3.6: Example of Write Request
The last functionality is the buffer information, which holds the size of every buffer
available to the accelerator. This information is written to a register inside the HC
CSR sub-block before the computation starts and matches the exact size of the variable
in the host. It is accessed through a single function:
• buffer_size(id): returns the size, in number of cache lines, of the buffer whose
number equals id.
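Since buffer_size(id) reports sizes in cache lines while host variables are sized in
bytes, the conversion can be sketched as follows. The helper cache_lines_for is
hypothetical, and the sketch assumes the 64-byte (512-bit) cache line of the CCI-P and
rounding up for sizes that are not a multiple of a cache line; the text does not state
how HardCloud handles such sizes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64  /* assumed CCI-P cache line: 512 bits */

/* Number of cache lines needed to hold `bytes` bytes (rounded up). */
size_t cache_lines_for(size_t bytes)
{
    assert(bytes <= SIZE_MAX - (CACHE_LINE_BYTES - 1));  /* no overflow */
    return (bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
}
```

For instance, an array of 8192 32-bit integers occupies 32768 bytes, which corresponds to
512 cache lines under this assumption.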
3.3.4 Configuration
HardCloud provides a SystemVerilog package to configure the system parameters.
These parameters define the structure of the hardware components of the HC Requestor
and also the interface between the HC Requestor and the IP core. Hence, the user needs
to set the number of buffers used to send and receive data, and the depth of the write
and read request FIFOs.
package hc_user_pkg;
  parameter HC_BUFFER_TX_SIZE = 1; // number of variables in map(to:)
  parameter HC_BUFFER_RX_SIZE = 1; // number of variables in map(from:)
  parameter HC_WRITE_FIFO_DEPTH = 8;
  parameter HC_READ_REQUEST_FIFO_DEPTH = 8;
endpackage : hc_user_pkg
Listing 3.7: HardCloud Configuration
Listing 3.7 exemplifies the configuration with one input and one output buffer.
Additionally, the parametrizable FIFO depth gives the flexibility to create IPs with
different requirements, such as area and performance.
3.4 Functional Verification and Validation Mode
This section proposes an OpenMP check clause that enables a verification feature.
The idea is not exclusive to our project, as can be seen in [14]. The check clause
must be used together with the use(hrw) clause to avoid undesirable behavior, as
exemplified by the code in Listing 3.8.
#pragma omp target device(HARPSIM) map(to: X) map(from: Y)
#pragma omp parallel for use(hrw) module(loopback) check
// Software version of the loopback hardware module.
for (int i = 0; i < NI; i++) {
  Y[i] = X[i];
}
Listing 3.8: Example of check Clause.
The implementation of the proposed technique modifies the Clang OpenMP runtime code
generation so that it emits LLVM intermediate representation (IR) to verify the
results. Besides the offloading, the new IR also executes the software version on the
host and writes its results to a temporary variable. After the device finishes the
computation, an inserted match function compares the device output to the temporary
variable. Figure 3.1 illustrates how the check clause works and where in the execution
flow it occurs.
Finally, a message informs the user if any error occurs during this comparison, and
it could provide additional information, such as power consumption and cache misses on
the Intel HARP platform. The target device might work with a different precision than
the host device; for that reason, if the values mismatch, the percentage difference is
shown.
3.4.1 libomptarget
The compare function is implemented inside the libomptarget library, which gives the
flexibility to modify the code. A function called __tgt_check_compare_variable
receives three arguments: host pointer, target pointer, and size.
By default, this function calls the C library function memcmp and prints an error
message if the test fails. However, a user-defined shared library can be used to
compare values that are not exactly equal, e.g., in applications with floating-point
approximations. The shared library must be named __rtl_compare_variable.so.
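The difference between an exact comparison and a tolerant one can be illustrated with
the C sketch below. It is not the libomptarget code: the function names compare_exact,
max_percent_diff, and compare_with_tolerance are hypothetical, the byte-wise variant
assumes a memcmp-style check, and the entry point that a user-defined shared library
must export is not specified in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte-wise comparison: returns 1 when host and device outputs are identical. */
int compare_exact(const void *host, const void *target, size_t size)
{
    return memcmp(host, target, size) == 0;
}

/* Largest relative difference, in percent, between two float arrays. */
double max_percent_diff(const float *host, const float *target, size_t n)
{
    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        double ref = host[i] < 0 ? -(double)host[i] : (double)host[i];
        double diff = (double)host[i] - (double)target[i];
        if (diff < 0)
            diff = -diff;
        /* Avoid dividing by zero: fall back to the absolute difference. */
        double pct = (ref > 0.0) ? 100.0 * diff / ref : 100.0 * diff;
        if (pct > worst)
            worst = pct;
    }
    return worst;
}

/* Tolerant comparison: accepts results within max_pct percent of the host. */
int compare_with_tolerance(const float *host, const float *target,
                           size_t n, double max_pct)
{
    assert(max_pct >= 0.0);
    return max_percent_diff(host, target, n) <= max_pct;
}
```

An accelerator that computes in a different precision would typically fail
compare_exact while still passing compare_with_tolerance with a small threshold, which is
the situation the user-defined library is meant to handle.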
3.4.2 LLVM IR Generation
The OpenMP runtime code generation includes the emitTargetCall method, which emits
the target offloading code. This method receives as arguments the host version of the
code to be offloaded and its arguments. It creates a basic block with the label
omp_offload.failed that calls the host version with the mapped arguments. The check
clause is implemented by modifying this method.
The modified method includes two new basic blocks, one labeled
omp_offload.before_check and the other omp_offload.check. The first basic block
creates a conditional branch that tests whether the check clause is used. If it is,
the second basic block executes; otherwise, code execution continues. The second basic
block emits the code that allocates the temporary output variables, one for each output
specified in the OpenMP pragma, and calls the host version with the temporary variables.
It also emits a runtime call to the function __tgt_check_compare_variable.
...
01. br i1 %16, label %omp_offload.failed, label %omp_offload.before_check
02.
03. omp_offload.failed:
04.   call void @__host_main([8192 x i32]* %B, [8192 x i32]* %A) #5
05.   br label %omp_offload.cont
06.
07. omp_offload.before_check:
08.   br i1 %17, label %omp_offload.check, label %omp_offload.cont
09.
10. omp_offload.check:
11.   %18 = call i8* @malloc(i64 32768)
12.   %19 = bitcast i8* %18 to [8192 x i32]*
13.   %20 = bitcast [8192 x i32]* %B to i8*
14.   call void @__host_main([8192 x i32]* %19, [8192 x i32]* %A)
15.   call void @__tgt_check_compare_variable(i8* %18, i8* %20, i64 32768)
16.   call void @free(i8* %18)
18.   br label %omp_offload.cont
19.
20. omp_offload.cont:
...
Listing 3.9: LLVM IR Example
Listing 3.9 shows an example of the LLVM IR, after the modification, for a loopback
code with one input and one output variable. Line 01 has a branch instruction that
tests whether the offload execution failed. If it failed, execution jumps to the basic
block omp_offload.failed, runs the host version of the code block (__host_main), and
goes to omp_offload.cont. Otherwise, omp_offload.before_check at line 07 reads the
value of %17 to choose the next path. If the check clause is not present in the OpenMP
pragma, the offload finishes and execution moves to line 20. If the user chose to
verify what was executed on the device, the program moves to line 10. There, the
program allocates space for each output variable. The example calls malloc only once
and assigns the pointer to %18. The host version uses the new variable as its output,
as can be seen in line 14, which passes %19 (the pointer %18 converted from a byte
pointer to a fixed-size array pointer) instead of %B (the original output variable). In
the next line, a call to the external function from libomptarget compares the values of
the temporary and original variables. Next, the function free deallocates the memory
block of the temporary variable. Finally, program execution continues to the next step.
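The control flow of Listing 3.9 corresponds roughly to the C sketch below. It is a
hand-written approximation for illustration, not compiler output: host_main stands in
for __host_main, check_compare stands in for __tgt_check_compare_variable (which returns
void in the IR, whereas the stand-in returns a match flag), and after_offload and its
two flag parameters are hypothetical names for the values tested by the two branches.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define N 8192  /* element count of the arrays in Listing 3.9 */

/* Host version of the loopback region (stand-in for __host_main). */
void host_main(int32_t *out, const int32_t *in)
{
    for (size_t i = 0; i < N; i++)
        out[i] = in[i];
}

/* Stand-in for the runtime comparison; returns 1 when the buffers match. */
int check_compare(const void *tmp, const void *device_out, size_t size)
{
    return memcmp(tmp, device_out, size) == 0;
}

/* Mirrors the branches of Listing 3.9; returns 1 if the result is accepted. */
int after_offload(int offload_failed, int check_enabled,
                  int32_t *B, const int32_t *A)
{
    if (offload_failed) {            /* omp_offload.failed */
        host_main(B, A);
        return 1;
    }
    if (!check_enabled)              /* omp_offload.before_check */
        return 1;

    /* omp_offload.check: run the host version into a temporary buffer. */
    int32_t *tmp = malloc(N * sizeof *tmp);  /* 32768 bytes */
    assert(tmp != NULL);
    host_main(tmp, A);
    int ok = check_compare(tmp, B, N * sizeof *tmp);
    free(tmp);
    return ok;                       /* falls through to omp_offload.cont */
}
```

The sketch makes explicit that the host version runs in two distinct situations: as a
fallback when the offload fails, and as a reference when the check clause is present.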
3.5 Design Decisions
In this section, we present the design decisions of HardCloud and why we adopted
them. The FPGA platform, Intel HARP, has shared memory to exchange data with the
host. The OpenMP API, which supports multi-platform shared-memory multiprocessing,
was a natural choice to abstract the communication between software and hardware
because it supports shared-memory data exchange among nodes.
The HardCloud Requestor is required to run at the maximum frequency, 400 MHz, and to
send and receive information through the CCI-P using as little area as possible. The
FPGA interface, CCI-P, is a full-duplex channel but sends or receives only one request
or response per clock cycle. A parametrizable FIFO with a default depth of 8 controls
the data flow when the interface is not available; its size is configurable because it
depends on the application and on the requirements of each IP core. The interface
response connects directly to the IP core, since at most one response arrives per clock
cycle. Furthermore, this reduces the number of logic elements needed and transfers the
responsibility for storing the data to the hardware designer.
The HardCloud CSR stores only the minimum information for each buffer, the base
address and the size, in a memory. The number of slots for buffer information is
defined before synthesis, so we reserve space only for the buffers that map the
software variables in the map clause. The Intel FPGA MPF block is not optional in our
architecture because two of its features are always required: reads and writes through
virtual addresses, which give the flexibility to work with huge variables, and the
reorder buffer, which delivers responses in request order. This decision adds several
cycles of delay to the communication with the host.
These decisions follow from the platform features and cost a communication overhead
of several cycles. However, our preliminary results show that the benefits outweigh
the delay. The components may be reused on other platforms, but adjustments will be
necessary.
Chapter 4
Experimental Results
This chapter describes a series of experiments to evaluate the proposed tool, HardCloud.
A set of well-known applications was executed on the Intel HARP 2 platform at Paderborn
University, with cluster nodes running Ubuntu Linux 14.04 with kernel 3.13.0. The
applications stress the communication between hardware and software and differ in
characteristics such as latency, data demand, and data production. Further, most of
them were chosen from open-source hardware community projects, like OpenCores [8]
and LibreCores [6], to examine the flexibility of the tool in connecting IP from
various developers. Below is a brief description of each application:
• AES-128: the Advanced Encryption Standard (AES) is a symmetric-key algorithm used
worldwide, in which the same key is used to encrypt and decrypt; this implementation
uses a key size of 128 bits.
• SHA-512: a cryptographic hash function that guarantees data integrity with a digest
size of 512 bits; this function is part of SHA-2 (Secure Hash Algorithm 2). The
function processes blocks of 1024 bits and needs 80 rounds of hashing to compute the
output.
• Sobel Filter: an edge enhancement filter used in image processing applications that
performs a gradient measurement. First, the algorithm extracts the luma component
from the RGB image to create an intermediate grayscale image. Then, the image is
convolved with two pre-calculated 3x3 kernels, one representing the horizontal
changes and the other the vertical ones, and finally the results are combined to
calculate the approximation of the gradient.
• Gaussian Blur Filter: blurs the image and removes noise by convolving the image
with a Gaussian function. The mechanism is similar to that of the Sobel Filter, but it
uses only one kernel with different pre-calculated values.
• MD5: a message digest algorithm that ensures data integrity for information exchange
but is not secure for cryptographic purposes. It processes blocks of 512 bits and
produces a message digest of 128 bits. To process the data, MD5 needs four rounds.
• FFT: a fast implementation of the discrete Fourier transform (DFT) and a fundamental
block for DSP systems, which takes a signal sampled in its original domain, like time
or space, and transforms it into frequency components.
• Reed-Solomon Decoder: an error-correcting code (ECC) with nonbinary cyclic codes,
which uses polynomials over a Galois field to simplify the integer operations.
This decoder receives codewords of 204 bytes that contain 188 bytes of message. It
detects up to 16 corrupted bytes and corrects at most 8 corrupted bytes.
• FIR Filter 40th Order: a finite impulse response (FIR) filter computes a discrete
convolution of the input signal and the impulse response at each instant of time; the
40th order defines the number of filter coefficients that are combined with the
previously sampled data. Many applications use the filter for spectral reshaping,
noise reduction, or signal detection.
• Smith-Waterman: sequence alignment in bioinformatics is a daunting and complex
task; this dynamic programming algorithm compares segments of all possible lengths
and optimizes the similarity measure instead of comparing the entire sequence.
• Gene Regulatory Network: the control of a particular cell function depends on the
interaction of a group of genes, or portions of them, called nodes; this organization
is a gene regulatory network (GRN). This application simulates a GRN described in
terms of Boolean models.
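The figures quoted for the Reed-Solomon decoder in the list above follow from the
standard relation between codeword length n, message length k, and correction capability
t = (n - k) / 2. The short C sketch below works through that arithmetic; the function
names are chosen for illustration only and do not come from the benchmark code.

```c
#include <assert.h>

/* Number of parity symbols of an RS(n, k) code. */
unsigned rs_parity_symbols(unsigned n, unsigned k)
{
    assert(n > k);  /* the codeword must be longer than the message */
    return n - k;
}

/* Maximum number of symbol errors that can be corrected: floor((n - k) / 2). */
unsigned rs_correctable_errors(unsigned n, unsigned k)
{
    return rs_parity_symbols(n, k) / 2;
}
```

For the decoder used here, n = 204 and k = 188 give 16 parity bytes and a correction
capability of 8 bytes, matching the values in the description.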
Table 4.1 shows the execution time of the applications on the FPGA and on the CPU, and
Figure 4.1 shows the resulting speedup. The table also shows the accelerator clock
frequency, sometimes lower than the maximum of 400 MHz due to hardware architecture
limitations, and the amount of data offloaded to or retrieved from the FPGA. The
software version of each benchmark program was compiled with LLVM/Clang version 4.0
using the -O3 flag.
ID  Benchmark                CPU      FPGA    Frequency  Offload  Retrieve
1   AES-128                  25.46s   3.71s   200 MHz    1.2 GB   1.2 GB
2   SHA-512                  10.44s   6.16s   200 MHz    1.2 GB   64 B
3   Sobel Filter             0.57s    0.43s   400 MHz    0.8 MB   0.8 MB
4   Gaussian Blur Filter     5.15s    4.86s   400 MHz    3.1 MB   3.1 MB
5   MD5                      2.87s    2.06s   200 MHz    1.2 GB   0.6 GB
6   FFT                      27.47s   3.29s   200 MHz    1.2 GB   1.2 GB
7   Reed-Solomon Decoder     113.45s  61.68s  300 MHz    1.6 GB   1.5 GB
8   FIR Filter 40th Order    162.12s  21.11s  200 MHz    1.2 GB   1.2 GB
9   Smith-Waterman           10.74s   1.31s   200 MHz    64 MB    64 B
10  Gene Regulatory Network  52.05s   0.39s   200 MHz    0 GB     80 MB
Table 4.1: Benchmark runtime.
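The speedup of each benchmark can be derived from Table 4.1 as the ratio between the CPU
and FPGA execution times. A minimal C sketch of that calculation is given below; the
function name speedup is chosen for illustration, and the values it produces from the
rounded table entries may differ slightly from figures computed with unrounded
measurements.

```c
#include <assert.h>

/* Speedup of the FPGA run over the CPU run (both times in seconds). */
double speedup(double cpu_seconds, double fpga_seconds)
{
    assert(fpga_seconds > 0.0);  /* guard against division by zero */
    return cpu_seconds / fpga_seconds;
}
```

For example, AES-128 gives 25.46 / 3.71, approximately 6.9, and the FIR filter gives
162.12 / 21.11, approximately 7.7.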
[Figure 4.1 bar chart: x-axis benchmark ID (1 to 10), y-axis FPGA speedup over CPU on a
logarithmic scale from 10^0 to 10^2. Bar labels as extracted: 6.9, 1.7, 1.1, 1.3, 1.4,
8.4, 1.84, 7.7, 8.2, 135.2.]
Figure 4.1: Logarithmic benchmark speedup
The FFT configuration supports eight complex words and uses radix two. Furthermore,
this application uses a single-precision floating-point data type with a fully
streaming architecture, which allows data to stream in and out of the system
continuously. The architecture has a latency of 200 clock cycles and a throughput of
one transform every eight cycles. The deep pipeline of the FFT was decisive in
achieving high performance.
The SHA-512 application runs at 200 MHz because the size of its combinational logic
prevents it from reaching higher frequencies without violating timing. Each iteration
takes 80 clock cycles, and each round requires the values of the previous round. Like
SHA-512, the Gene Regulatory Network application also runs at 200 MHz. Despite the
frequency, its hardware version produced a good speedup, since it needs only one cycle
to calculate the next state of the Boolean network, while the software version requires
several cycles.
Although they are more amenable to GPUs, for completeness we also tested the Gaussian
and Sobel filters. As anticipated, they resulted in small speedups, due to their small
image and kernel sizes (3x3): the Sobel filter's input is a 512x512 image, and the
Gaussian filter uses a small 1024x1024 image. In addition, the software versions of
these filters take advantage of Intel SSE2 (Streaming SIMD Extensions 2) vectorized
execution.
Chapter 5
Conclusions and Future Works
This dissertation proposes a novel OpenMP 4.X extension that eases the task of
integrating FPGA acceleration into software applications and provides seamless data
movement between the IP inside the FPGA and the host CPU: the HardCloud framework,
based on the LLVM/Clang project. A series of applications widely used in the FPGA
community, available through open-source hardware sites, helped to evaluate the
flexibility of the tool for distinct interfaces and data transfer demands.
The software side uses OpenMP to rapidly offload the computation without exposing the
specific FPGA driver API. The Intel HARP plugin for libomptarget, a Clang library that
extends the compiler to manage a heterogeneous system, was developed to administer
the computation offload. The framework takes advantage of the versatility of the
programming model to change the accelerator device and, also, to debug the system in
the simulator. Additionally, a check clause lets the user compare the accelerator and
host results. Inside the FPGA, two sub-blocks were developed together with a simple
interface that abstracts the CCI-P. The sub-blocks, HC CSR and HC Requestor, manage
the communication with the host and the data requests. Combining these sub-blocks with
the available Intel basic building blocks permits easier system integration.
The results demonstrate the potential of the Intel HARP platform compared to the
CPU-only version for various applications with different data demands and communication
requirements. The measured computation times show a considerable speedup for
applications, e.g., Gene Regulatory Network, FFT, and FIR, that demand architectures
with deep pipelines and a lot of parallelism.
In the future, the open-source framework presented in this work has the potential to
be extended to other FPGA clusters, e.g., Amazon AWS and Huawei Cloud, because of
the demand for systems that include multiple devices and heterogeneous architectures
with power-performance characteristics different from the traditional ones. The OpenMP
standard supports task-parallel computation, which could be extended to create a
dynamic workflow connection with devices like FPGAs and GPUs. The FPGA overlay concept
also has research potential: a pre-defined architecture in the device executes a binary
of a domain-specific language, e.g., P4, to reduce the development time.
Bibliography
[1] An introduction to the Intel® QuickPath Interconnect.
[2] PCI Express® 3.0 base specification revision 3.0.
[3] Product specifications: Intel® Xeon® processor E5-2680. Accessed: 2018-07-16.
[4] OpenMP 4.5 Specifications. http://www.openmp.org/specifications/, November
2015.
[5] Intel® Xeon® processor E5 v4 product family, volume 2, 2016.
[6] Librecores: Free and open source digital hardware. https://www.librecores.org/,
2018. Accessed: 2018-08-16.
[7] Open programmable acceleration engine (opae) c api programming guide, 2018. Ac-
cessed: 2018-01-25.
[8] Opencores. https://www.opencores.org/, 2018. Accessed: 2018-08-16.
[9] Aharon Aharon. Test program generation for functional verification of PowerPC
processors in IBM. In Design Automation, 1995. DAC’95. 32nd Conference on, pages
279–285. IEEE, 1995.
[10] Paul Alcorn. Intel xeon e5-2600 v4 broadwell-ep review. Tom’s Hardware, pages
4514–2, 2016.
[11] Michael Alder. Intel cci: Core cache interface. https://01.org/sites/default/
files/downloads/opae/cci-p-mpf-overview.pdf, 2016. Accessed: 2018-07-16.
[12] Samuel F Antao, Alexey Bataev, Arpith C Jacob, Gheorghe-Teodor Bercea, Alexan-
dre E Eichenberger, Georgios Rokos, Matt Martineau, Tian Jin, Guray Ozen, Zehra
Sura, et al. Offloading support for openmp in clang and llvm. In Proceedings of the
Third Workshop on LLVM Compiler Infrastructure in HPC, pages 1–11. IEEE Press,
2016.
[13] Intel Arria. Device overview, 2017.
[14] Eduard Ayguade, Rosa M Badia, Daniel Cabrera, Alejandro Duran, Marc Gonzalez,
Francisco Igual, Daniel Jimenez, Jesus Labarta, Xavier Martorell, Rafael Mayo, et al.
A proposal to extend the openmp tasking model for heterogeneous architectures. In
International Workshop on OpenMP, pages 154–167. Springer, 2009.
[15] Jeff Barr. Ec2 f1 instances with fpgas - now generally available, 2017. Accessed:
2017-12-16.
[16] Michael Behm, John Ludden, Yossi Lichtenstein, Michal Rimon, and Michael Vinov.
Industrial experience with test generation languages for processor verification. In
Proceedings of the 41st annual Design Automation Conference, pages 36–40. ACM,
2004.
[17] Daniel Cabrera, Xavier Martorell, Georgi Gaydadjiev, Eduard Ayguade, and Daniel
Jiménez-González. OpenMP extensions for FPGA accelerators. In Systems, Archi-
tectures, Modeling, and Simulation, 2009. SAMOS’09. International Symposium on,
pages 17–24. IEEE, 2009.
[18] Andrew Canis, Jongsok Choi, Mark Aldham, Victor Zhang, Ahmed Kammoona,
Jason H Anderson, Stephen Brown, and Tomasz Czajkowski. Legup: high-level
synthesis for fpga-based processor/accelerator systems. In Proceedings of the 19th
ACM/SIGDA international symposium on Field programmable gate arrays, pages
33–36. ACM, 2011.
[19] Adrian M Caulfield, Eric S Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers,
Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim,
et al. A cloud-scale acceleration architecture. In Microarchitecture (MICRO), 2016
49th Annual IEEE/ACM International Symposium on, pages 1–13. IEEE, 2016.
[20] Jongsok Choi, Stephen Brown, and Jason Anderson. From software threads to parallel
hardware in high-level synthesis for fpgas. In Field-Programmable Technology (FPT),
2013 International Conference on, pages 270–277. IEEE, 2013.
[21] Alessandro Cilardo, Luca Gallo, and Nicola Mazzocca. Design space exploration for
high-level synthesis of multi-threaded applications. Journal of Systems Architecture,
59(10):1171–1183, 2013.
[22] Intel Corporation. An introduction to the Intel® QuickPath Interconnect, 2013.
[23] Karina RG Da Silva, Elmar UK Melcher, Guido Araujo, and Valdiney Alves Pi-
menta. An automatic testbench generation tool for a systemc functional verification
methodology. In Proceedings of the 17th symposium on Integrated circuits and system
design, pages 66–70. ACM, 2004.
[24] Johan De Gelas. The intel xeon e5 v4 review: Testing broadwell-ep with demanding
server workloads. Anandtech, June, 13, 2016.
[25] Shai Fine and Avi Ziv. Coverage directed test generation for functional verifica-
tion using bayesian networks. In Proceedings of the 40th annual Design Automation
Conference, pages 286–291. ACM, 2003.
[26] Jeremy Fowers, Greg Brown, Patrick Cooke, and Greg Stitt. A performance and
energy comparison of fpgas, gpus, and multicores for sliding-window applications.
In Proceedings of the ACM/SIGDA international symposium on Field Programmable
Gate Arrays, pages 47–56. ACM, 2012.
[27] PK. Gupta. Accelerating datacenter workloads, 2016. Accessed: 2017-12-16.
[28] Sam Guyer, Daniel A Jiménez, and Calvin Lin. The c-breeze compiler infrastruc-
ture. Technical report, Technical Report UTCS-TR01-43, The University of Texas
at Austin, 2001.
[29] Intel. Hardware accelerator research program. Accessed: 2018-07-16.
[30] Jens Korinth, David de la Chevallerie, and Andreas Koch. Threadpoolcomposer-an
open-source fpga toolchain for software developers. arXiv preprint arXiv:1508.06821,
2015.
[31] Seyong Lee, Jungwon Kim, and Jeffrey S Vetter. Openacc to fpga: A framework for
directive-based high-performance reconfigurable computing. In Parallel and Dis-
tributed Processing Symposium, 2016 IEEE International, pages 544–554. IEEE,
2016.
[32] Seyong Lee and Jeffrey S Vetter. Openarc: open accelerator research compiler for
directive-based, efficient heterogeneous computing. In Proceedings of the 23rd inter-
national symposium on High-performance parallel and distributed computing, pages
115–120. ACM, 2014.
[33] YY Leow, CY Ng, and WF Wong. Generating hardware from openmp programs. In
Field Programmable Technology, 2006. FPT 2006. IEEE International Conference
on, pages 73–80. IEEE, 2006.
[34] Peter Milder, Franz Franchetti, James C Hoe, and Markus Püschel. Computer gener-
ation of hardware for linear digital signal processing transforms. ACM Transactions
on Design Automation of Electronic Systems (TODAES), 17(2):15, 2012.
[35] Jiro Miyake, Gary Brown, Masahiko Ueda, and Tamotsu Nishiyama. Automatic test
generation for functional verification of microprocessors. In Test Symposium, 1994.,
Proceedings of the Third Asian, pages 292–297. IEEE, 1994.
[36] Timothy Prickett Morgan. Broadwell brings xeon e5 a balanced performance bump,
2016. Accessed: 2018-07-16.
[37] Christian Pilato, Andrea Cazzaniga, Gianluca Durelli, Andres Otero, Donatella Sci-
uto, Marco D Santambrogio, et al. On the automatic integration of hardware accel-
erators into fpga-based embedded systems. In FPL, pages 607–610, 2012.
[38] Artur Podobas. Accelerating parallel computations with openmp-driven system-on-
chip generation for fpgas. In Embedded Multicore/Manycore SoCs (MCSoc), 2014
IEEE 8th International Symposium on, pages 149–156. IEEE, 2014.
[39] Artur Podobas and Mats Brorsson. Empowering openmp with automatically gener-
ated hardware. In Embedded Computer Systems: Architectures, Modeling and Simu-
lation (SAMOS), 2016 International Conference on, pages 245–252. IEEE, 2016.
[40] Juan Salamanca, Luis Mattos, and Guido Araujo. Loop-carried dependence verifi-
cation in openmp. In International Workshop on OpenMP, pages 87–102. Springer,
2014.
[41] Lukas Sommer, Jens Korinth, and Andreas Koch. Openmp device offloading to fpga
accelerators. In Application-specific Systems, Architectures and Processors (ASAP),
2017 IEEE 28th International Conference on, pages 201–205. IEEE, 2017.
[42] Lukas Sommer, Julian Oppermann, and Andreas Koch. C-based synthesis of area-
efficient accelerators for openmp worksharing loops. In Second International Workshop
on Heterogeneous High-performance Reconfigurable Computing (H2RC’16), 2016.
[43] Thomas Willhalm. Memory latencies on Intel® Xeon® processor E5-4600 and
E7-4800 product families.
[44] Chris Williams. Intel’s broadwell xeon e5-2600 v4 chips: So what’s in it for you,
smartie-pants coders, 2016. Accessed: 2018-07-16.
[45] Peng Zhang, Muhuan Huang, Bingjun Xiao, Hui Huang, and Jason Cong. Cmost: a
system-level fpga compilation framework. In Design Automation Conference (DAC),
2015 52nd ACM/EDAC/IEEE, pages 1–6. IEEE, 2015.
[46] Dimitrios Ziakas, Allen Baum, Robert A Maddox, and Robert J Safranek. Intel®
QuickPath Interconnect architectural features supporting scalable system
architectures. In High Performance Interconnects (HOTI), 2010 IEEE 18th Annual
Symposium on, pages 1–6. IEEE, 2010.
