Evaluating FPGA Accelerator Performance with a Parameterized OpenCL
  Adaptation of the HPCChallenge Benchmark Suite by Meyer, Marius et al.
Marius Meyer∗, Tobias Kenter† and Christian Plessl‡
Department of Computer Science and Paderborn Center for Parallel Computing (PC2)
Paderborn University, Paderborn, Germany
Email: ∗marius.meyer@uni-paderborn.de, †tobias.kenter@uni-paderborn.de, ‡christian.plessl@uni-paderborn.de
Abstract—FPGAs have found increasing adoption in data
center applications since a new generation of high-level tools
has become available that noticeably reduces development time
for FPGA accelerators and still provides high-quality results.
There is, however, no high-level benchmark suite available
that specifically enables a comparison of FPGA architectures,
programming tools, and libraries for HPC applications.
To fill this gap, we have developed an OpenCL-based open-
source implementation of the HPCC benchmark suite for Xilinx
and Intel FPGAs. This benchmark can serve to analyze the
current capabilities of FPGA devices, cards, and development tool
flows, track progress over time, and point out specific difficulties
for FPGA acceleration in the HPC domain. Additionally, the
benchmark documents proven performance optimization pat-
terns. We will continue optimizing and porting the benchmark
for new generations of FPGAs and design tools and encourage
active participation to create a valuable tool for the community.
Index Terms—FPGA, OpenCL, High-Level Synthesis, HPC
benchmarking
I. INTRODUCTION
In High Performance Computing (HPC), benchmarks are an
important tool for performance comparison across systems.
They are designed to stress important system properties or
generate workloads that are similar to relevant applications
for the user. Especially in acquisition planning they can be
used to define the desired performance of the acquired system
before it is built. Since it is a challenging task to select a set of
benchmarks to cover all relevant device properties, benchmark
suites can help by providing a pre-defined mix of applications
and inputs, for example SPEC CPU [1] and HPC Challenge
(HPCC) [2].
There is an ongoing trend towards heterogeneity in HPC,
complementing CPUs by accelerators, as indicated by the Top
500 list [3]. Of the top 10 systems in the list, seven are
equipped with different types of accelerators. Nevertheless,
to get the best matching accelerator for a new system, a
tool is needed to measure and compare the performance
across accelerators. For well-established accelerator architec-
tures like Graphics Processing Units (GPUs), there are already
standardized benchmarks like SPEC ACCEL [4]. For Field
Programmable Gate Arrays (FPGAs), which are just emerging
as an accelerator architecture for data centers and HPC, existing
benchmarks do not focus on HPC and fail to measure highly
relevant device properties.
Similar to the compiler for CPU applications, the High
Level Synthesis (HLS) framework consisting of Software
Development Kit (SDK) and Board Support Package (BSP)
plays a very important role in achieving performance on an
FPGA. The framework translates the accelerator code (denoted
as kernel), most commonly from Open Computing Language
(OpenCL), to intermediate languages, organizes the commu-
nication with the underlying BSP, performs optimizations and
synthesizes the code to create executable FPGA configurations
(bitstreams). Hence, the HLS framework has a big impact on
the used FPGA resources and the maximum kernel frequency,
which might vary depending on the kernel design. An HPC
benchmark suite for FPGAs should capture this impact and, for
comparisons, must not be limited to a single HLS framework.
One of the core aspects of HPC is communication. Some
FPGA cards offer a new approach for scaling with their
support for direct communication to other FPGA cards without
involving the host CPU. Such technology is already used in
first applications [5], [6] and research has started to explore
the best abstractions and programming models for inter-FPGA
communication [7], [8]. Thus, communication between FPGAs
out of an HLS framework is another essential characteristic
that an FPGA benchmark suite targeting HPC should consider.
In this paper, we propose HPCC FPGA, an FPGA OpenCL
benchmark suite for HPC using the applications of the HPCC
benchmark suite. The motivation for choosing HPCC is that
it is well-established for CPUs and covers a small set of
applications that evaluate important memory access and com-
puting patterns that are frequently used in HPC applications.
Further, the benchmark also characterizes the HPC system’s
network bandwidth, which allows extrapolation to the performance
of parallel applications.
Specifically, we make the following contributions in this
paper:
1) We provide FPGA-adapted OpenCL kernel implementa-
tions along with corresponding host code for setup and
measurements for all HPCC benchmark applications.
2) We provide configuration options for the OpenCL ker-
nels that allow adjustments to resources and architecture
of the target FPGA and board without the need to change
the code manually.
arXiv:2004.11059v3 [cs.DC] 12 Jun 2020
3) We evaluate the execution of these benchmarks on
different FPGA families and boards with Intel and Xilinx
FPGAs and show the benchmarks can capture relevant
device properties.
4) We make all benchmarks and the build system available
as open-source on GitHub to encourage community
contributions.
The remainder of this paper is organized as follows: In
Section II, we give an overview of existing FPGA benchmark
suites. In Section III, we introduce the benchmarks in HPCC
FPGA in more detail and briefly discuss the contained bench-
marks and the configuration options provided for the base runs.
In Section IV we build the benchmarks for different FPGA
architectures and evaluate the results to show the potential of
the proposed configurable base runs. In Section V, we evaluate
the global memory system of the boards in more detail and
give insights into experienced problems and the potential of
the benchmarks to describe the performance of FPGA boards
and the associated frameworks. Finally, in Section VI, we draw
conclusions and outline future work.
II. RELATED WORK
There already exist several benchmark suites for FPGAs and
their HLS frameworks. Most OpenCL benchmark suites like
Rodinia [9], OpenDwarfs [10] or SHOC [11] are originally
designed with GPUs in mind. Although both GPU and FPGA
can be programmed using OpenCL, the design of the compute
kernels has to be changed and optimized specifically for FPGA
to achieve good performance. In the case of Rodinia this was
done [12] for a subset of the benchmark suite with a focus on
different optimization patterns for the Intel FPGA (then Altera)
SDK for OpenCL. In contrast, to port OpenDwarfs to FPGAs,
Feng et al. [10] employed a research OpenCL synthesis
tool that instantiates GPU-like architectures on FPGAs. With
Rosetta [13], there also exists a benchmark suite that was
designed targeting FPGAs using the Xilinx HLS tools from
the start. It focuses on typical FPGA streaming applications
from the video processing and machine learning domains.
The CHO [14] benchmark targets more fundamental FPGA
functionality and includes kernels from media processing
and cryptography and the low-level generation of floating-
point arithmetic through OpenCL, using the Altera SDK for
OpenCL.
The mentioned benchmarks often lack possibilities to ad-
just the benchmarks to the target FPGA architecture easily.
Modifications have to be done manually in the kernel code,
sometimes many different kernel variants are proposed or the
kernels are not optimized at all, making it difficult to compare
results for different FPGAs. A benchmark suite that takes a
different approach is Spector [15]. It makes use of several
optimization parameters for every benchmark, which allows
modification and optimization of the kernels for an FPGA
architecture. The kernel code does not have to be manually
changed, and optimization options are restricted by the defined
parameters. Nevertheless, the focus is more on the research of
the design space than on performance characterization.
To the best of our knowledge, there exists no OpenCL benchmark
suite for FPGAs with a focus on HPC characteristics at the
time of writing. All of the mentioned benchmark suites lack
a way to measure the inter-FPGA communication capability of
recent high-end FPGAs. In some of the benchmarks, the inves-
tigated input sizes are small enough to fit into local memory
resources of a single FPGA. Since actual HPC applications are
highly parallel and require effective communication, an HPC
focused benchmark must also evaluate the characteristics of
the communication network.
III. HPC CHALLENGE BENCHMARKS FOR FPGA
The HPCC benchmark suite [2] consists of seven bench-
marks. Three of them are synthetic benchmarks that measure
the memory performance for successive accesses (STREAM
[16]) and random updates (RandomAccess) as well as the
effective network bandwidth of a system (b_eff). Especially
the latter will give important insights for the use in HPC
because of the inter-FPGA communication support in recent
devices. Moreover, it contains four applications of varying
complexity: Matrix transposition (PTRANS), 1D Fast Fourier
Transformation (FFT), matrix multiplication (GEMM) and
High Performance LINPACK (HPL). All benchmarks except
RandomAccess and b_eff use single-precision floating-point
values for calculation since this is most commonly used in
FPGA designs. RandomAccess uses 64-bit integers because
of its pseudo-random number scheme, and b_eff uses 8-bit
integers. Nevertheless, the build system makes it easy to change
the data type for all benchmarks. A central concept of
the HPCC benchmark suite is to characterize the performance
of different memory access patterns of HPC applications.
By determining the spatial and temporal dependencies of
the memory accesses for another application, the benchmark
results can then be used to estimate the performance of
this application on the system more precisely. The achieved
performance in CPU systems is thus highly dependent on the
memory and cache architecture.
When benchmarking FPGA boards with this concept, it
is crucial to note that even for a given board, the memory
hierarchy is not fixed. While the off-chip memory itself and
typically parts of the memory controllers are fixed hardware
components, local buffers or caches, address generators, pre-
fetchers, and data buses inside the FPGA are provided by
the BSP or generated by the SDK depending on the bench-
mark implementation. Thus, in contrast to the CPU version
of HPCC, HPCC FPGA does not only measure hardware
properties of a given memory interface but rather also the
ability of the tools to optimize the memory hierarchy and
compute pipelines of a kernel for the specific pattern to provide
good performance of the calculation on an FPGA.
Another aspect of the HPCC benchmark suite is the distinc-
tion between two different runs:
• Base runs are done with the provided reference imple-
mentations.
• Optimized runs are done with implementations that use
architecture-specific optimizations.
TABLE I
COMPARISON OF THE HPCC BENCHMARKS AND THEIR MEMORY ACCESS
PATTERNS ON CPU AND FPGA
Benchmark    | CPU                | FPGA Global Memory | FPGA Local Memory
STREAM       | linear             | linear             | linear
RandomAccess | random             | random             | linear
PTRANS       | strided, linear    | blocked, linear    | strided, linear
FFT          | strided, T         | linear             | strided, T
GEMM         | strided, linear, T | blocked, linear    | strided, linear, T
LINPACK      | strided, linear, T | blocked, linear    | strided, linear, T
T = temporal locality
The goal of HPCC FPGA, as presented in this article, is to
provide implementations for the base runs that are reasonably
optimized for Intel and Xilinx FPGAs and their SDKs. With
current FPGA execution models, such optimizations necessar-
ily include adaptations to the available resources, in particular
the number of physical memory interfaces and configurable
local memory. Therefore, the base implementation exposes de-
fined configuration parameters for such customization, without
pre-empting the full flexibility that is reserved for optimized
runs with manual code changes and target architecture- or
SDK-specific designs. The presented adjustable parameters for
the base runs are a subject of discussion and might be changed
with the evolution of the FPGA architectures and toolchains.
In Table I the memory access patterns for the benchmarks
contained in HPCC are given for CPU and the FPGA base im-
plementations proposed in this paper. For FPGA, the memory
is further divided into global memory, which represents the
DDR RAM, and local memory, which represents Block RAM (BRAM)
and registers. Spatial locality is represented by the linear and blocked
access pattern, while temporal locality is separately indicated
with a T. Since it is possible to partially define the memory
hierarchy used for the application, the FPGA designs attempt
to move the strided memory accesses with low spatial locality
into the local memory and increase the temporal locality if
possible using blocked algorithms. The global memory again
is accessed in a blocked fashion to increase spatial locality
and decrease temporal locality.
The benchmark suite is open source and publicly available
on GitHub.¹
A. Common Build Setup
The HPCC FPGA benchmark suite is set up to create one
host binary and FPGA bitstream per benchmark, with source
code structure, build process, and benchmark execution very
similar for all benchmarks. For usability, HPCC FPGA adopts
the following approaches:
• Usage of the same build system for all benchmarks
(CMake) offers a unified user experience during the build
process and for the modification of the configuration
parameters.
¹ https://github.com/pc2/HPCC_FPGA
TABLE II
THE BUILD PARAMETERS TO SELECT THE TARGET DEVICE AT COMPILE
TIME AND DEFAULT VALUES AT EXECUTION TIME
Parameter           | Description
FPGA_BOARD_NAME     | Name of the target board
DEFAULT_DEVICE      | Index of the default device
DEFAULT_PLATFORM    | Index of the default platform
DEFAULT_REPETITIONS | Number of times the kernel will be executed
TABLE III
TOOL-SPECIFIC OPTIMIZATION FLAGS AS EXPOSED THROUGH THE
CMAKE BUILD SYSTEM OF HPCC FPGA
Parameter                     | Description
AOC_FLAGS                     | Additional compiler flags that are used for kernel compilation with the Intel toolchain
XILINX_COMPILE_FLAGS          | Additional compiler flags that are used for kernel compilation with the Xilinx Vitis toolchain
XILINX_COMPILE_SETTINGS       | Path to the settings file that contains additional compile options for the Xilinx Vitis toolchain
XILINX_GENERATE_LINK_SETTINGS | Boolean that can be set to trigger link setting generation using a file template for Xilinx Vitis
XILINX_LINK_SETTINGS          | Path to the settings file that contains the link options for the Xilinx Vitis toolchain
• Integrated tests allow checking functional correctness of
the configuration using emulation before actual synthesis.
A first set of parameters (Table II) is exposed to select
the target FPGA board (FPGA_BOARD_NAME) and provide
default parameters for the execution of the OpenCL host
binary. The latter values can still be changed at execution
time, for example, to test several identical devices in one
server sequentially. Additional compiler arguments can be
used with the Intel and Xilinx compiler to control the synthesis
or trigger special optimizations (Table III). While the Intel
FPGA SDK for OpenCL options are provided as AOC_FLAGS,
the Xilinx Vitis toolchain expects parameters or configuration
files at several stages. They are used, for example, to create
multiple kernel copies for the STREAM and RandomAccess
benchmarks. Finally, for each benchmark, specific parameters
are exposed to make good use of the board’s memory inter-
faces and local memory of the FPGA. These parameters are
summarized in Tables V–XI with the benchmarks described
in the following subsections.
B. STREAM benchmark
The goal of the STREAM benchmark is to measure the
sustainable memory bandwidth of a device using four different
vector operations: Copy, Scale, Add and Triad.
TABLE IV
THE STREAM BENCHMARK KERNELS THAT ARE CALCULATING ON
THREE ARRAYS A, B AND C AND THE PCIE TRANSFERS THAT ARE
MEASURED IN THE ORDER THEY ARE EXECUTED
Name      | Kernel Logic
PCI Write | write arrays to device
Copy      | C[i] = A[i]
Scale     | B[i] = j * C[i]
Add       | C[i] = A[i] + B[i]
Triad     | A[i] = j * C[i] + B[i]
PCI Read  | read arrays from device
All kernels are taken from the STREAM benchmark² v5.10 and thus slightly
differ from the kernels proposed in the HPCC article [2]. The
benchmark will sequentially execute the operations given in
Table IV. PCI Read and PCI Write are not OpenCL kernels
but represent the read and write of the arrays to the device
memory. The benchmark will output the maximum, average,
and minimum times measured for all of these operations. The
minimum time will also be used to calculate the memory
bandwidth for the operation.
The arrays A, B, and C are initialized with a constant value
over the whole array. This allows us to validate the result by
only recalculating the operations with scalar values. The error
is calculated for every value in the arrays and must be below
the machine epsilon $\epsilon$, i.e., $\|d - d'\| < \epsilon$, to pass the validation.
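The scalar recalculation described above can be sketched in a few lines of Python. This is only an illustration, not the host code of the suite: the function names, the initial values, and the use of double precision instead of the benchmark's configured data type are assumptions made for this example.

```python
import sys

def expected_stream_values(a, b, c, j, repetitions=1):
    """Replay Copy, Scale, Add and Triad on scalars in the order of Table IV."""
    for _ in range(repetitions):
        c = a            # Copy:  C[i] = A[i]
        b = j * c        # Scale: B[i] = j * C[i]
        c = a + b        # Add:   C[i] = A[i] + B[i]
        a = j * c + b    # Triad: A[i] = j * C[i] + B[i]
    return a, b, c

def validate(array, expected, eps=sys.float_info.epsilon):
    """Return True if every element deviates from `expected` by less than eps."""
    return all(abs(x - expected) < eps for x in array)
```

Since every array holds a single constant, one scalar per array is enough to predict the content of all elements after any number of repetitions.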
A simplified version of the code is given in Listing 1. The
OpenCL kernel of the benchmark combines all four described
compute kernels in a single function. Since on FPGAs the
source code is translated to spatial structures that take up
resources on the device, a single combined kernel allows for
best reuse of those resources. The computation is split into
blocks of a fixed length, which makes it necessary that the
arrays have the length of a multiple of the block size. In
the first inner loop, the input values of the first input array
in1 are loaded into a buffer located in the local memory of
the FPGA. While they are loaded, they are multiplied with a
scaling factor scalar. This allows the execution of the Copy
and Scale operation. In the case of Copy, the scaling factor is
set to 1.0. In the second loop, the second input in2 is only
added to the buffer, if a flag is set. Together with the first loop,
this makes it possible to recreate the behavior of the Add and
Triad operation. In the third loop, the content of the buffer is
stored in the output array out located in global memory.
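The mapping of the four STREAM operations onto the single combined kernel can be illustrated with a plain Python model of Listing 1. The wrapper functions and their argument order are hypothetical names chosen for this sketch. The scaling factor of 1.0 for Add is inferred from the description and not stated explicitly in the text.

```python
def stream_kernel(in1, in2, out, scalar, second_input, buffer_size):
    """Python model of the combined kernel in Listing 1 (not OpenCL code)."""
    assert len(in1) % buffer_size == 0, "length must be a multiple of the block size"
    for i in range(0, len(in1), buffer_size):
        # Loop 1: load a block of in1 into the local buffer and scale it.
        buffer = [scalar * in1[i + k] for k in range(buffer_size)]
        # Loop 2: optionally add the matching block of in2.
        if second_input:
            for k in range(buffer_size):
                buffer[k] += in2[i + k]
        # Loop 3: write the block back to the output array.
        for k in range(buffer_size):
            out[i + k] = buffer[k]

def copy(a, c, bs):
    stream_kernel(a, None, c, 1.0, False, bs)   # C = A

def scale(c, b, j, bs):
    stream_kernel(c, None, b, j, False, bs)     # B = j * C

def add(a, b, c, bs):
    stream_kernel(a, b, c, 1.0, True, bs)       # C = A + B (scalar 1.0 assumed)

def triad(c, b, a, j, bs):
    stream_kernel(c, b, a, j, True, bs)         # A = j * C + B
```

Only the scaling factor and the flag change between the operations, which is why one hardware pipeline can serve all four of them.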
At build time, the parameters given in Table V can be
modified to generate the base run kernel. DATA_TYPE
and VECTOR_COUNT can be used to define the data type
used within the kernel. If VECTOR_COUNT is greater than 1,
OpenCL vector types of the given length are used. Moreover,
it is possible to adjust the resource usage of the kernel
by specifying the size of the local memory buffer and the
unrolling factor of the three loops of the kernel. The intended
use case of the OpenCL kernel is to replicate the kernel for
every available memory bank. The host code will split and
² https://www.cs.virginia.edu/stream/
Listing 1
SIMPLIFIED KERNEL LOGIC OF THE STREAM BENCHMARK
for (uint i = 0; i < array_size; i += BUFFER_SIZE) {
    float buffer[BUFFER_SIZE];
    // Load a block of the first input into local memory and scale it
    for (uint k = 0; k < BUFFER_SIZE; k++) {
        buffer[k] = scalar * in1[i + k];
    }
    // Add the second input only for Add and Triad
    if (second_input) {
        for (uint k = 0; k < BUFFER_SIZE; k++) {
            buffer[k] += in2[i + k];
        }
    }
    // Store the block to the output array in global memory
    for (uint k = 0; k < BUFFER_SIZE; k++) {
        out[i + k] = buffer[k];
    }
}
TABLE V
THE CONFIGURATION OPTIONS THAT ARE EXPOSED TO THE USER TO
MODIFY THE KERNEL GENERATION OF THE STREAM BENCHMARK
Parameter          | Description
DATA_TYPE          | Data type used for host and device code
VECTOR_COUNT       | If > 1 OpenCL vector types of the given size are used in the device code
GLOBAL_MEM_UNROLL  | Loop unrolling factor for the inner loops in the device code
NUM_REPLICATIONS   | Replicates the kernels the given number of times
DEVICE_BUFFER_SIZE | Number of values that are stored in the local memory in the single kernel approach
distribute the arrays equally to all memory banks, so every
kernel will just have a fraction of the whole arrays to update.
C. RandomAccess benchmark
The RandomAccess benchmark measures the performance
for non-consecutive memory accesses expressed in the performance
metric Giga Updates Per Second (GUPS). It updates
values in a data array $d \in \mathbb{Z}^n$ such that $d_i = d_i \oplus a$, where
$a \in \mathbb{Z}$ is a value from a pseudo-random sequence and $n$ is defined to be
a power of two. Since only operations in a finite field are used
to update the values, the correctness of the updates can easily
be checked by executing a reference implementation on the
host side on the resulting data. The incorrect items are counted,
and the error percentage is calculated as $\frac{error}{n} \cdot 100$. An error
of < 1% has to be achieved to pass the validation. Hence,
update errors caused by concurrent data accesses are tolerated
to some degree. Similar to the STREAM implementation, the
benchmark allows us to replicate the kernel for every memory
bank. It also allows subsequent reads and writes from the
global memory to be stored in a local memory buffer. On
TABLE VI
THE CONFIGURATION OPTIONS THAT ARE EXPOSED TO THE USER TO
MODIFY THE KERNEL GENERATION OF THE RANDOM ACCESS
BENCHMARK
Parameter          | Description
NUM_REPLICATIONS   | Replicates the kernels the given number of times
DEVICE_BUFFER_SIZE | Number of values that are stored in the local memory in the single kernel approach
the one hand, this allows us to partially hide the latency of
a single memory access and increase the performance of the
benchmark. On the other hand, this mechanism will lead to
errors if the same memory address is loaded multiple times
into the buffer. Since the benchmark allows a small number
of errors, changing the buffer size can be used as a trade-off
between benchmark error and performance. The local memory
buffer size and the number of replications can be configured
with build parameters that are given in Table VI.
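The trade-off between buffer size and update errors can be reproduced with a small Python model. It is a sketch under several assumptions that are not taken from the text: the shift-register generator with feedback polynomial 0x7, the seed 1, and the initial table content d_i = i follow the conventions of the HPCC CPU reference code, and the buffer is modeled as "read a whole batch first, write it back afterwards".

```python
MASK64 = (1 << 64) - 1
POLY = 0x7  # feedback polynomial of the generator (assumption, see lead-in)

def next_random(ran):
    """Advance the 64-bit shift-register sequence by one step."""
    msb_set = ran >> 63
    return ((ran << 1) & MASK64) ^ (POLY if msb_set else 0)

def random_access(table, updates, buffer_size=1, seed=1):
    """Apply XOR updates; all reads of one buffer happen before its writes."""
    n = len(table)  # must be a power of two
    ran = seed
    for start in range(0, updates, buffer_size):
        count = min(buffer_size, updates - start)
        pending = []
        for _ in range(count):
            ran = next_random(ran)
            addr = ran & (n - 1)
            pending.append((addr, table[addr] ^ ran))  # read and modify
        for addr, value in pending:                    # delayed write-back
            table[addr] = value
    return table

def error_percentage(table, updates, seed=1):
    """Re-apply the updates sequentially on the host and count wrong items."""
    n = len(table)
    check = list(table)
    random_access(check, updates, buffer_size=1, seed=seed)
    errors = sum(1 for i, value in enumerate(check) if value != i)
    return errors / n * 100
```

Because XOR is its own inverse, replaying the same sequence restores the initial table whenever no update was lost. With a larger buffer, an address that occurs twice in one batch loses an update, and the affected item is counted as an error.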
D. Effective Bandwidth Benchmark (b_eff)
b_eff is a benchmark that measures the effective network
bandwidth. It was originally designed for use with the Message
Passing Interface (MPI), so translating its functionality to
OpenCL and the I/O channels of the FPGA is not straightforward.
For inter-FPGA communication, no such default communication
paradigm exists, and the
network layer is a matter of current research [7], [8]. Thus, the
current implementation of the network benchmark does not
consider routing overheads and different network topologies
as it is done in the original benchmark. Instead, only a single
ring topology containing all FPGAs is used to execute the
benchmark.
The benchmark contains two kernels send and recv that will
alternately send and receive messages over the I/O channels.
The send kernel will first send a message of a given size and
then receive the message, whereas the recv kernel will first
receive and then send a message. This process is repeated for
an adjustable number of iterations to reduce the kernel start
overhead in the measurements. The benchmark execution can
thus be scaled to an arbitrary number of FPGAs.
The benchmark executes the kernels for 21 different message
sizes $L = 2^0, 2^1, \ldots, 2^{20}$. Based on that, the effective
network bandwidth is calculated with Equation 1.
$$b_{eff} = \frac{\sum_L b_L}{21} \qquad (1)$$
with $b_L$ being the measured bandwidth for message size $L$.
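Equation 1 is a plain arithmetic mean over the 21 message sizes. A minimal Python sketch, assuming the measurements are available as a dictionary that maps each message size to its bandwidth (the function name and the data layout are hypothetical):

```python
def effective_bandwidth(bandwidths):
    """Average the per-message-size bandwidths b_L as in Equation 1."""
    sizes = [2 ** e for e in range(21)]  # L = 2^0 ... 2^20
    if set(bandwidths) != set(sizes):
        raise ValueError("expected one measurement for each of the 21 message sizes")
    return sum(bandwidths[L] for L in sizes) / len(sizes)
```

Because every message size has the same weight, the small messages that are dominated by latency lower the result just as strongly as the large messages raise it.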
The modifiable parameters for b_eff are given in Table VII.
So far, only the width of a single channel can be varied.
External channels are currently very specific to the device used
because an abstraction layer comparable to MPI is missing,
so manual modification is allowed for this
kernel file. The host code of the benchmark uses MPI for
TABLE VII
THE CONFIGURATION OPTIONS THAT ARE EXPOSED TO THE USER TO
MODIFY THE KERNEL GENERATION OF THE B_EFF BENCHMARK
Parameter     | Description
CHANNEL_WIDTH | Channel width in Bytes
TABLE VIII
THE CONFIGURATION OPTIONS THAT ARE EXPOSED TO THE USER TO
MODIFY THE KERNEL GENERATION OF THE PTRANS BENCHMARK
Parameter         | Description
BLOCK_SIZE        | Size of the symmetric matrix block that will be stored in local memory. Should have a sufficient size to allow full utilization of global memory read and write bursts.
GLOBAL_MEM_UNROLL | Number of times the loops loading and storing to global memory have to be unrolled to create Load Store Units (LSUs) with the same width as the memory interface
communication and to collect the runtime measurements of
the kernels.
E. Matrix Transposition Benchmark (PTRANS)
The PTRANS benchmark works on square matrices
$A, B, C \in \mathbb{R}^{n \times n}$ and calculates $C = A^T + B$, i.e., it transposes
a matrix $A$ and adds $B$ to the result. The number of Floating
Point Operations (FLOPs) for this calculation is $n^2$.
The kernel execution time is measured with the
host code and used to calculate the final performance metric
Floating Point Operations per Second (FLOPS). The execution
is validated by calculating the residual $\frac{\|C - C'\|}{\epsilon n}$, where $C'$ is the
result of a reference implementation, $\epsilon$ is the machine epsilon,
and $n$ is the dimension of the matrices.
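The access patterns listed for PTRANS in Table I (blocked and linear in global memory, strided only in local memory) can be illustrated with a pure Python model. It is a sketch and not the kernel of the suite: the function names are hypothetical, and the maximum norm used in the residual is an assumption because the text does not name the norm.

```python
import sys

def ptrans_blocked(A, B, block_size):
    """Compute C = A^T + B block by block (plain Python model, not kernel code)."""
    n = len(A)
    assert n % block_size == 0
    C = [[0.0] * n for _ in range(n)]
    for bi in range(0, n, block_size):
        for bj in range(0, n, block_size):
            # Load block (bj, bi) of A row by row: linear reads from global memory.
            block = [A[bj + r][bi:bi + block_size] for r in range(block_size)]
            for r in range(block_size):
                for c in range(block_size):
                    # The transposed, strided access only touches the local block.
                    C[bi + r][bj + c] = block[c][r] + B[bi + r][bj + c]
    return C

def ptrans_residual(C, C_ref, eps=sys.float_info.epsilon):
    """Residual ||C - C'|| / (eps * n) with the maximum norm (assumed)."""
    n = len(C)
    diff = max(abs(C[i][j] - C_ref[i][j]) for i in range(n) for j in range(n))
    return diff / (eps * n)

def ptrans_flops(n, seconds):
    """Performance metric: n^2 floating-point additions per execution time."""
    return n * n / seconds
```

In this model, `block_size` plays the role of the BLOCK_SIZE parameter: a larger block yields longer linear reads from global memory but requires more local memory.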
F. Fast Fourier Transformation Benchmark (FFT)
In the FFT benchmark, a 1D FFT is calculated with a
size of up to $2^{12}$ for single-precision complex numbers. The
benchmark is based on a reference implementation for the Intel
OpenCL FPGA SDK included in version 19.4.0. Since the
calculation of a single FFT would lead to very short execution
times and thus to high measurement errors, batch processing
is used to increase the overall execution time. This also allows
TABLE IX
THE CONFIGURATION OPTIONS THAT ARE EXPOSED TO THE USER TO
MODIFY THE KERNEL GENERATION OF THE FFT BENCHMARK
Parameter    | Description
LOG_FFT_SIZE | Logarithm of the FFT size that should be calculated. It should be as big as possible to make use of as many resources as possible and to utilize the global memory bursts. The size is limited by the implementation to 12.
TABLE X
THE CONFIGURATION OPTIONS THAT ARE EXPOSED TO THE USER TO
MODIFY THE KERNEL GENERATION OF THE DGEMM BENCHMARK
Parameter         | Description
BLOCK_SIZE        | Size of the symmetric matrix block that will be stored in local memory. Should have a sufficient size to allow full utilization of global memory read and write bursts.
GEMM_SIZE         | Size of the symmetric matrix block that will be stored in registers. This will affect the amount of Digital Signal Processors (DSPs) used in the implementation.
GLOBAL_MEM_UNROLL | Number of times the loops loading and storing to global memory have to be unrolled to create LSUs with the same width as the memory interface
TABLE XI
THE CONFIGURATION OPTIONS THAT ARE EXPOSED TO THE USER TO
MODIFY THE KERNEL GENERATION OF THE LINPACK BENCHMARK
Parameter           | Description
LOCAL_MEM_BLOCK_LOG | Size of the symmetric matrix block that will be stored in local memory. Should have a sufficient size to allow full utilization of global memory read and write bursts.
REGISTER_BLOCK_LOG  | Size of the symmetric matrix block that will be stored in registers. This will affect the amount of DSPs used in the implementation.
GLOBAL_MEM_UNROLL   | Number of times the loops loading and storing to global memory have to be unrolled to create LSUs with the same width as the memory interface
better utilization of the kernel pipeline. The number of FLOPs
for this calculation is defined to be $5 \cdot n \cdot \mathrm{ld}(n)$ for an FFT
of dimension $n$. The result of the calculation is checked by
calculating the residual $\frac{\|d - d'\|}{\epsilon \, \mathrm{ld}(n)}$, where $\epsilon$ is the machine epsilon,
$d'$ the result from the reference implementation and $n$ the FFT
size.
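The FLOP count and the residual for a batch of FFTs can be written down directly from these definitions. The following Python sketch uses hypothetical function names, and the maximum norm over the complex output values is an assumption, since the norm is not specified in the text.

```python
import sys

def fft_flops_per_second(log_fft_size, batch_size, seconds):
    """FLOPS for a batch of FFTs, counting 5 * n * ld(n) FLOPs per FFT."""
    n = 1 << log_fft_size
    flop_per_fft = 5 * n * log_fft_size  # ld(n) equals log_fft_size
    return batch_size * flop_per_fft / seconds

def fft_residual(d, d_ref, log_fft_size, eps=sys.float_info.epsilon):
    """Residual ||d - d'|| / (eps * ld(n)) with the maximum norm (assumed)."""
    diff = max(abs(x - y) for x, y in zip(d, d_ref))
    return diff / (eps * log_fft_size)
```

For the largest supported size, `log_fft_size = 12`, a single FFT accounts for 5 * 4096 * 12 = 245760 FLOPs, so a large batch is needed to obtain a measurable kernel runtime.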
G. Matrix Multiplication Benchmark (GEMM)
The GEMM benchmark implements a matrix-matrix multi-
plication similar to the GEMM routines in the BLAS library.
It calculates $C = \alpha \cdot A \cdot B + \beta \cdot C$, where $A, B, C \in \mathbb{R}^{n \times n}$
and $\alpha, \beta \in \mathbb{R}$. The number of FLOPs for the performance
calculation is defined to be $2 \cdot n^3$. The result is verified
by calculating the residual $\frac{\|C - C'\|}{\epsilon n \|C\|_F}$, where $\epsilon$ is the machine
epsilon and $C'$ the result of the reference implementation. The
implementation is based on a matrix multiplication design for
Intel Stratix 10 proposed by Gorlani et al. [17] and simplified
to make it compatible with a broader range of devices.
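A straightforward host-side reference for the calculation and its validation could look as follows. This is a sketch with hypothetical function names and not the blocked FPGA design. The maximum norm in the numerator of the residual is an assumption, whereas the Frobenius norm in the denominator follows the formula above.

```python
import math
import sys

def gemm(alpha, A, B, beta, C):
    """Reference C_new = alpha * A * B + beta * C for square matrices."""
    n = len(A)
    return [[alpha * sum(A[i][k] * B[k][j] for k in range(n)) + beta * C[i][j]
             for j in range(n)] for i in range(n)]

def gemm_flops(n, seconds):
    """Performance metric based on 2 * n^3 FLOPs."""
    return 2 * n ** 3 / seconds

def gemm_residual(C, C_ref, eps=sys.float_info.epsilon):
    """Residual ||C - C'|| / (eps * n * ||C||_F), maximum norm in the numerator."""
    n = len(C)
    diff = max(abs(C[i][j] - C_ref[i][j]) for i in range(n) for j in range(n))
    frobenius = math.sqrt(sum(v * v for row in C for v in row))
    return diff / (eps * n * frobenius)
```

Scaling the residual with $n$ and $\|C\|_F$ keeps the acceptance threshold independent of the matrix size and of the magnitude of the values.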
H. High-Performance Linpack Benchmark (HPL)
The HPL benchmark solves a linear system of equations of
order $n$: $Ax = b$, where $A \in \mathbb{R}^{n \times n}$ and $x, b \in \mathbb{R}^n$. The original
benchmark first calculates the LU factorization with row-wise
partial pivoting of the matrix $A$ such that $P[A, b] = [[L, U], y]$.
In a second step, the linear equation system can be solved by
first solving $Ly = b$ and then $Ux = y$. The FLOPs for the
factorization are $\frac{2}{3}n^3 - \frac{1}{2}n^2$ and for solving the linear equations
$2n^2$.
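The two steps of the original benchmark can be sketched in plain Python. The code below is an unblocked textbook variant with hypothetical function names. It is not the FPGA kernel, which uses a blocked approach with block-wise pivoting. Eliminating on the augmented matrix $[A, b]$ yields $y$ as a by-product, so only the back substitution $Ux = y$ remains.

```python
def lu_solve(A, b):
    """Solve Ax = b by LU factorization with row-wise partial pivoting."""
    n = len(A)
    # Work on the augmented matrix [A, b] so that row swaps also permute b.
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for k in range(n):
        # Partial pivoting: move the largest remaining entry of column k up.
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            factor = M[i][k] / M[k][k]
            M[i][k] = factor                # store L below the diagonal
            for j in range(k + 1, n + 1):   # update U and the right-hand side
                M[i][j] -= factor * M[k][j]
    # M now holds [[L, U], y]; solve Ux = y by back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def hpl_flops(n):
    """FLOPs of factorization (2/3 n^3 - 1/2 n^2) plus solve (2 n^2)."""
    return 2 * n ** 3 / 3 - n ** 2 / 2 + 2 * n ** 2
```

The cubic term of the factorization dominates the FLOP count for large $n$, so the quadratic solve step contributes only a small share of the total work.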
Currently, the benchmark is only partially implemented on
FPGA. The kernel implementation is based on a blocked
approach proposed in [18]. An LU factorization kernel that
corresponds to the LINPACK gefa routine is implemented on
the FPGA. It uses block-wise partial pivoting instead of partial
pivoting over the whole matrix. This design decision was made
to reduce the complexity of the kernel. The equation system is
solved on the CPU and not taken into account for the kernel
performance. The resulting performance metric is FLOPS and
the result is validated by a reference implementation that
checks the residual $\frac{\|Ax - b\|}{\|A\| n}$.
IV. BENCHMARK EXECUTION AND EVALUATION
In the following, we synthesize and execute all benchmarks
of the suite, collect first benchmark results for the proposed
base implementations, and compare the results with performance
models.
A. Evaluation Environment and Configuration
The benchmarks were synthesized and tested on a cluster
containing multiple Nallatech 520N cards and research sys-
tems with Intel PAC D5005 and Xilinx Alveo U280 FPGA
boards. The Nallatech 520N boards are connected to the host
node via x8 PCIe 3.0 and are equipped with Intel Stratix 10
GX2800 with access to four banks of DDR4 SDRAM x 72
bit with 8 GB per bank and a transfer rate of 2400MT/s.
Moreover, up to 32 FPGAs are connected within an inter-
FPGA Circuit Switched Network (CSN). Every FPGA has
four 40 Gbit/s full-duplex links that are connected through
a CALIENT S320 Optical Circuit Switch which allows the
creation of arbitrary network topologies. In the OpenCL code,
these streaming board-to-board connections are exposed as
OpenCL pipes or channels. With the BSP version 19.2.0 and
SDK version 19.4.0 the most recent versions available at the
time of writing are used for synthesis and execution.
The Intel PAC D5005 board hosts a Stratix 10 SX2800
FPGA and is connected to the host with x16 PCIe 3.0. The
board contains a DDR4 memory infrastructure similar to the
other boards, which is, however, not used for these experiments.
Instead, a reference design BSP (18.1.2 svm) was used that
offers direct access to the host’s memory using Shared Virtual
Memory (SVM) by building upon the Intel Acceleration
Stack (IAS) [19] version 1.2. The OpenCL kernel compilation
was performed with the SDK version 19.4.0. The host node
for the Nallatech 520N and Intel PAC D5005 is a two-
socket system equipped with Intel Xeon Gold 6148 CPUs
and 192 GB of DDR4-2666 main memory. In the case of the
Nallatech 520N, two FPGAs are connected to a single node.
Except for the b_eff benchmark, only one of the FPGAs is
used to execute the benchmark. The host code is compiled with
GCC 8.3.0 and CMake 3.15.3 is used for the configuration and
build of the benchmarks.
The Alveo U280 boards are equipped with the XCU280
FPGA. The board is connected to an Intel Xeon Gold 6234
CPU over x8 PCIe 4.0 and equipped with two banks of DDR4
SDRAM x 72 bit with 16 GB per bank and a transfer rate
of 2400 MT/s. Moreover, the FPGA is equipped with 8 GB
of High Bandwidth Memory 2 (HBM2) split over 32 banks.
The Xilinx Vitis SDK is used in version 2019.2 and the shell
version is 2019.2.3. The host code is compiled with GCC 7.4.0
and CMake 3.3.2 is used for the configuration and build of the
benchmarks. The CPU has access to 108 GB of main memory.
These benchmarks contain simple kernels that allow syn-
thesis with the Intel and the Xilinx Vitis toolchain without
changes in the host or kernel code.
The synthesis parameters used for all benchmarks are given
in Table XII. The resulting resource usage of the synthesized
kernels for STREAM and RandomAccess is given in Table XIII
and for the remaining benchmarks of the suite in
Table XV. A straightforward comparison of the resource
usage between the FPGA boards is not possible because the
hardware and software architectures differ. Nevertheless,
the tables give a general overview of the resource usage
in terms of the basic resource elements of an FPGA:
Lookup Tables (LUTs), Flip-Flops (FFs), BRAM and DSPs.
The table only takes into account the resources directly used
for the kernels. Next to the absolute value, the percentage of
the used resources relative to the available resources is given.
Absolute values and ratios of the resource usage are taken
directly from the reports generated by the HLS tools.
B. Performance Evaluation
The synthetic global memory benchmarks STREAM and
RandomAccess are executed on three FPGAs. The measured
performance results are given in Table XIV. The size of the
data arrays is set to $2^{29}$ items, which corresponds to 2 GB of
data per data array. This is the largest power of two that fits
into the 8 GB HBM2 of the Alveo U280 board.
The theoretical peak performance for DDR memory is
19.2 GB/s per bank for both the 520N and the U280 boards.
Relative to this peak, the STREAM benchmark achieves an efficiency
between 87.2% and 90.1% on both devices for all four operations.
The local memory buffer enables the use of memory bursts to load and
store data, which increases the efficiency of memory accesses.
Nevertheless, the local memory buffer has to be chosen small
enough to allow kernel frequencies above the frequency of
the memory controller. The HBM2 of the U280 board offers
a theoretical peak bandwidth of 460 GB/s. So the benchmark
achieves an efficiency of 82.4% on this board. In contrast to
the measurements on the 520N and U280, which use the on-board
DDR or HBM2 memory, the SVM functionality of
the PAC card allows manipulating the arrays directly in the
DDR memory of the host. The full-duplex PCIe connection
makes it possible to choose a very small local buffer, which can
then be implemented in FFs. Read and write bursts
can still be applied because the kernel is implemented as
a single pipeline. For the Copy and Scale operations, the kernel
achieves more than 20 GB/s.
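The efficiency figures above can be retraced with a few lines of Python. The following is a minimal sketch that assumes the per-bank peak of 19.2 GB/s, the bank counts from the board descriptions (four on the 520N, two on the U280), and the measured values from Table XIV. All variable names are illustrative and not part of the benchmark suite. The results agree with the quoted percentages to within rounding.

```python
# Theoretical peak bandwidth per board in GB/s.
DDR_BANK_PEAK_GBS = 19.2  # per DDR4 bank, as stated in the text
peak_gbs = {
    "520N": 4 * DDR_BANK_PEAK_GBS,      # four DDR4 banks
    "U280 DDR": 2 * DDR_BANK_PEAK_GBS,  # two DDR4 banks
    "U280 HBM2": 460.0,                 # HBM2 peak from the text
}

# Measured STREAM results from Table XIV: Copy, Scale, Add, Triad in GB/s.
measured_gbs = {
    "520N": [67.01, 67.24, 68.90, 68.90],
    "U280 DDR": [33.94, 33.92, 34.58, 34.57],
    "U280 HBM2": [377.42, 365.80, 374.03, 378.88],
}


def efficiency_range(board):
    """Return (min, max) efficiency in percent over the four operations."""
    effs = [100.0 * bw / peak_gbs[board] for bw in measured_gbs[board]]
    return min(effs), max(effs)


for board in peak_gbs:
    lo, hi = efficiency_range(board)
    print(f"{board}: {lo:.1f}% to {hi:.1f}% of peak")
```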
For the RandomAccess benchmark, a data array of $2^{29}$
items, which is a total of 4 GB, is equally split across the
available memory banks. All kernels have to calculate all
addresses and only update the value if it is placed within
the range in the data array they are assigned to. This leads
to a compute-bound implementation for a high number of
kernel replications since the pipeline stalls caused by memory
accesses per kernel will decrease. Still, the amount of random
numbers that have to be generated stays the same. So the
maximum achievable updates per second are limited by the
kernel frequency. The measured performance shows a
larger gap to the maximum performance for designs with
only a small number of replications because the random
memory accesses consume a considerable amount of
time. The RandomAccess benchmark shows large performance
differences between the used boards. One reason for that is
the difference in the kernel design that allows the creation
of a single pipeline in case of the 520N board using the
Intel-specific ivdep pragma. This optimization also slightly
decreased the calculation error that is introduced by the buffer.
Nevertheless, for the PAC SVM tests this optimization had to
be disabled because the resulting error of 98% drastically exceeds
the allowed range. For Xilinx, there is, to our knowledge, no
optimization flag similar to ivdep.
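The replication scheme described above, in which every kernel generates the full random number sequence but only updates its own part of the data array, can be sketched in Python as follows. The generator is assumed to follow the shift-register scheme of the HPCC reference code (polynomial 0x7). The seed, the table size, and all function names are illustrative assumptions, and the local update buffer of the FPGA kernels is not modeled.

```python
POLY = 0x7
MASK64 = (1 << 64) - 1


def next_random(ran):
    """Advance the 64-bit shift-register generator by one step."""
    feedback = POLY if ran >> 63 else 0
    return ((ran << 1) & MASK64) ^ feedback


def update_reference(table, n_updates, seed=1):
    """Single-kernel reference: apply every update to the whole table."""
    ran = seed
    for _ in range(n_updates):
        ran = next_random(ran)
        table[ran & (len(table) - 1)] ^= ran


def update_replicated(table, n_updates, replications, seed=1):
    """Each replication generates all addresses but only updates its range."""
    chunk = len(table) // replications
    for r in range(replications):
        lo, hi = r * chunk, (r + 1) * chunk
        ran = seed
        for _ in range(n_updates):
            ran = next_random(ran)  # work that is repeated in every replication
            addr = ran & (len(table) - 1)
            if lo <= addr < hi:
                table[addr] ^= ran


reference = list(range(1024))
replicated = list(range(1024))
update_reference(reference, 4096)
update_replicated(replicated, 4096, replications=4)
print("tables match:", reference == replicated)
```

The inner loop runs `n_updates` times in every replication regardless of how many addresses fall into its range, which is why the update rate is bounded by the kernel frequency rather than by the number of replications.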
Table XVI contains the results for the remaining bench-
marks executed on a subset of the previously used FPGA
boards. The effective bandwidth of the network, which is used
as the performance metric by the b_eff benchmark, combines
the latency and throughput of a channel into a single metric.
The sent messages have sizes of $2^0, 2^1, \ldots, 2^{20}$ bytes.
The optical switch only adds a constant latency
to the transmission. For the performance model, we therefore
use the channel latency of 520 ns that is given in the BSP
documentation for the 520N board and the 19.4 BSP [20].
The channel frequency of 156.25 MHz and the maximum
channel width of 256 bit are also taken from this document. We
use a ring topology in our implementation, which means
every board is connected to two other boards. With a total
of four channels on each FPGA this results in two channels
per connection and a combined width of 512 bit. For the
transfer of a single message of size $m$, the implementation
needs $c_m = \lceil \frac{m}{512\,\text{bit}} \rceil$ clock cycles. This can be used to
calculate the transmission time $t_m = \frac{c_m}{156.25\,\text{MHz}} + 520\,\text{ns}$,
which is the time for the clock cycles plus the latency of the
channel. The effective bandwidth can then be calculated with
$$b_{eff} = \frac{\sum_{m \in L} \frac{m}{t_m}}{21},$$
where $L$ is the set of the 21 message sizes. This results in an effective bandwidth of
8.139 GB/s per FPGA. Since there is no congestion on the
network, the bandwidth is expected to increase linearly when
more FPGAs are added to the network. Thus, for 8 FPGAs the
expected effective bandwidth is 8 × 8.139 GB/s = 65.11 GB/s.
The synthesized kernel executes the send and receive pipeline
sequentially instead of simultaneously for both kernels. This
prevents the utilization of the full-duplex channels and is
TABLE XII
SYNTHESIS CONFIGURATIONS OF ALL BENCHMARKS

| Benchmark | Parameter | 520N | U280 DDR | U280 HBM2 | PAC SVM |
|---|---|---|---|---|---|
| STREAM | DATA_TYPE | float | float | float | float |
| | VECTOR_COUNT | 16 | 16 | 16 | 16 |
| | GLOBAL_MEM_UNROLL | 1 | 1 | 1 | 1 |
| | NUM_REPLICATIONS | 4 | 2 | 32 | 1 |
| | DEVICE_BUFFER_SIZE | 4,096 | 16,384 | 2,048 | 1 |
| RandomAccess | NUM_REPLICATIONS | 4 | 2 | 32 | 1 |
| | DEVICE_BUFFER_SIZE | 1 | 1,024 | 1,024 | 1,024 |
| b_eff | CHANNEL_WIDTH | 32 | – | – | – |
| FFT | LOG_FFT_SIZE | 12 | – | – | 12 |
| PTRANS | BLOCK_SIZE | 512 | 512 | – | 512 |
| | GLOBAL_MEM_UNROLL | 16 | 16 | – | 16 |
| GEMM | BLOCK_SIZE | 256 | 256 | – | 256 |
| | GEMM_SIZE | 8 | 8 | – | 8 |
| | GLOBAL_MEM_UNROLL | 16 | 16 | – | 16 |
| LINPACK | LOCAL_MEM_BLOCK_LOG | 5 | – | – | 5 |
| | REGISTER_BLOCK_LOG | 3 | – | – | 3 |
| | GLOBAL_MEM_UNROLL | 16 | – | – | 16 |
TABLE XIII
RESOURCE USAGE OF THE SYNTHESIZED STREAM AND RANDOMACCESS BENCHMARK KERNELS

| Benchmark | Board | LUTs | FFs | BRAM | DSPs | Frequency [MHz] |
|---|---|---|---|---|---|---|
| STREAM | 520N | 176,396 (25%) | 449,231 (25%) | 4,029 (34%) | 128 (2%) | 316.67 |
| | U280 DDR | 20,832 (1.90%) | 39,002 (1.39%) | 558 (34.19%) | 160 (1.78%) | 300.00 |
| | U280 HBM2 | 331,904 (20.69%) | 574,976 (27.24%) | 1,408 (77.70%) | 2,560 (28.38%) | 370.00 |
| | PAC SVM | 103,628 (14.53%) | 244,354 (7.42%) | 74 (0.66%) | 32 (0.56%) | 346.00 |
| RandomAccess | 520N | 115,743 (18%) | 253,578 (18%) | 489 (4%) | 14 (< 1%) | 329.17 |
| | U280 DDR | 7,256 (0.65%) | 11,716 (0.50%) | 38 (2.23%) | 14 (0.16%) | 446.00 |
| | U280 HBM2 | 116,096 (10.68%) | 187,456 (8.76%) | 608 (33.55%) | 224 (2.48%) | 450.00 |
| | PAC SVM | 103,397 (12%) | 225,293 (12%) | 535 (5%) | 0 (0%) | 322.00 |
TABLE XIV
MEASUREMENT RESULTS FOR STREAM AND RANDOMACCESS ON
DIFFERENT FPGA BOARDS

| Benchmark | 520N | U280 HBM2 | U280 DDR | PAC SVM |
|---|---|---|---|---|
| STREAM Copy [GB/s] | 67.01 | 377.42 | 33.94 | 20.15 |
| STREAM Scale [GB/s] | 67.24 | 365.80 | 33.92 | 20.04 |
| STREAM Add [GB/s] | 68.90 | 374.03 | 34.58 | 15.04 |
| STREAM Triad [GB/s] | 68.90 | 378.88 | 34.57 | 11.66 |
| STREAM PCIe read [GB/s] | 6.41 | 6.66 | 5.68 | – |
| STREAM PCIe write [GB/s] | 6.32 | 6.03 | 5.47 | – |
| RandomAccess Updates [MUOP/s] | 245.0 | 128.1 | 40.3 | 0.5 |
| RandomAccess Error | 0.0099% | 0.0106% | 0.0106% | 0.0106% |
effectively halving the channel bandwidth. So the kernel will at
most achieve an effective bandwidth of 4.07 GB/s per FPGA.
Thus, the kernel achieves 96.2% of its modeled performance
but only 48.1% of the maximum bandwidth of the network.
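The relation between the modeled and the measured effective bandwidth can be retraced with the short Python sketch below. It takes the modeled 8.139 GB/s per FPGA as given and assumes that the 31.32 GB/s reported in Table XVI is the aggregate over the 8 FPGAs of the ring. The variable names are illustrative.

```python
modeled_per_fpga_gbs = 8.139  # full-duplex model value from the text
num_fpgas = 8                 # ring size used in the text
measured_total_gbs = 31.32    # Table XVI, assumed to be the aggregate over all FPGAs

# Expected aggregate bandwidth without congestion.
expected_total_gbs = num_fpgas * modeled_per_fpga_gbs

# Sequential send and receive halves the usable channel bandwidth.
kernel_model_per_fpga_gbs = modeled_per_fpga_gbs / 2

measured_per_fpga_gbs = measured_total_gbs / num_fpgas
eff_vs_kernel_model = measured_per_fpga_gbs / kernel_model_per_fpga_gbs
eff_vs_network = measured_per_fpga_gbs / modeled_per_fpga_gbs

print(f"expected aggregate:  {expected_total_gbs:.2f} GB/s")
print(f"kernel model:        {kernel_model_per_fpga_gbs:.2f} GB/s per FPGA")
print(f"vs. kernel model:    {100 * eff_vs_kernel_model:.1f}%")
print(f"vs. network maximum: {100 * eff_vs_network:.1f}%")
```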
The GEMM benchmark uses 4096 × 4096 matrices for the
calculation. The actual calculation is done on 8 × 8 matrices
defined by the GEMM_BLOCK parameter. The kernel can thus
initiate 1024 floating-point multiplications
and additions per clock cycle. Together with the kernel
frequency of 320.84 MHz, this leads to a theoretical kernel
peak performance of 328.54 GFLOP/s. All other latencies that
might be introduced by the memory or the calculation can
be neglected since the kernel is fully pipelined, and they are
hidden by the pipelined execution. The best execution result
of 10 runs achieves 321.59 GFLOP/s, which corresponds to an
efficiency of 97.9% of the theoretical kernel performance. The
kernel for PAC SVM only clocks with 296 MHz and achieves
241.76 GFLOP/s. This corresponds to 79.8% efficiency of the
theoretical kernel performance. As another efficiency
metric, the table also gives the GEMM performance normalized
to a kernel frequency of 100 MHz. The
efficiency of the theoretical performance for the U280 board
with DDR is similar to the PAC SVM.
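The GEMM peak performance model and the 100 MHz normalization can be written down compactly. The following Python sketch assumes that an 8 × 8 block multiplication corresponds to 2 · 8³ = 1024 floating-point operations per clock cycle and uses the frequencies and results from Tables XV and XVI. The function names are illustrative.

```python
def gemm_peak_gflops(block, freq_mhz):
    """Peak performance if one block x block multiplication starts per cycle."""
    flop_per_cycle = 2 * block ** 3  # multiplications plus additions
    return flop_per_cycle * freq_mhz / 1e3


def normalized_to_100mhz(measured_gflops, freq_mhz):
    """Measured performance scaled to a kernel frequency of 100 MHz."""
    return measured_gflops * 100.0 / freq_mhz


# board: (kernel frequency in MHz, measured GFLOP/s)
gemm_results = {
    "520N": (320.84, 321.59),
    "PAC SVM": (296.00, 241.76),
    "U280 DDR": (250.00, 202.62),
}

for board, (freq, measured) in gemm_results.items():
    peak = gemm_peak_gflops(8, freq)
    print(f"{board}: peak {peak:.2f} GFLOP/s, "
          f"efficiency {100 * measured / peak:.1f}%, "
          f"100 MHz norm {normalized_to_100mhz(measured, freq):.2f} GFLOP/s")
```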
The PTRANS benchmark transposes an 8192×8192 matrix
and adds the result to a matrix of the same size. The global
memory interface can load and store 16 values per clock cycle
resulting in 16 floating-point operations per clock cycle. With
the maximum clock frequency of the global memory interface
of 300 MHz, this results in a theoretical peak performance
of 4.8 GFLOP/s. The benchmark achieves a performance of
3.56 GFLOP/s, which corresponds to 74.2% efficiency. The
large gap between the theoretical and measured performance
is caused by pipeline stalls. For the PAC SVM and U280
DDR, the performance efficiency is also low; this is discussed
in more detail in Section V.
The FFT benchmark calculates the 1D-FFT of 4096 single-
TABLE XV
RESOURCE USAGE OF THE SYNTHESIZED BENCHMARK KERNELS

| Benchmark | Board | LUTs | Registers | BRAM | DSPs | Frequency [MHz] |
|---|---|---|---|---|---|---|
| b_eff | 520N | 114,064 (17%) | 241,619 (17%) | 403 (3%) | 0 (0%) | 286.67 |
| PTRANS | 520N | 118,885 (17%) | 249,516 (17%) | 2,475 (22%) | 19 (< 1%) | 350.00 |
| | PAC SVM | 116,179 (14%) | 249,601 (14%) | 2,649 (23%) | 19 (< 1%) | 302.00 |
| | U280 DDR | 15,655 (1.41%) | 19,886 (0.86%) | 279 (16.54%) | 40 (0.44%) | 300.00 |
| FFT | 520N | 108,107 (17%) | 246,448 (17%) | 1,026 (9%) | 312 (5%) | 366.67 |
| | PAC SVM | 121,535 (14%) | 268,981 (14%) | 1,064 (9%) | 312 (5%) | 327.00 |
| GEMM | 520N | 136,585 (20%) | 353,107 (20%) | 1,469 (13%) | 726 (13%) | 320.84 |
| | PAC SVM | 139,639 (17%) | 351,249 (17%) | 1,629 (14%) | 726 (13%) | 296.00 |
| | U280 DDR | 154,810 (14.07%) | 231,897 (10.07%) | 222 (13.30%) | 2,566 (28.47%) | 250.00 |
| LINPACK | 520N | 203,339 (32%) | 667,965 (32%) | 3,310 (28%) | 786 (14%) | 166.25 |
| | PAC SVM | 208,603 (27%) | 654,653 (27%) | 3,453 (29%) | 786 (14%) | 276.00 |
precision complex floating-point values in a batched manner.
So the benchmark will calculate the FFT for 5000 different
data sets. This allows the kernel to fill the pipeline and
hide the latency of the calculation. Similar to PTRANS, the
performance of the kernel is bound by global memory. Eight
values are loaded from and stored to global memory per clock
cycle. For every value, 12 complex floating-point multiplications
are calculated, each of which corresponds to five
single-precision floating-point operations. This leads to a theoretical
kernel peak performance of 144 GFLOP/s. The measured
performance is 116.67 GFLOP/s and thus, 81.0% of the
theoretical performance. Also, on the PAC D5005 with SVM,
the FFT benchmark achieves a similar efficiency compared to
the STREAM Copy results.
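The FFT performance model can be made explicit with a short Python sketch. It assumes that the global memory interface runs at 300 MHz, as stated for PTRANS, that each complex single-precision value occupies 8 bytes, and that each value is read once and written once. These assumptions and the variable names are not taken from the benchmark code.

```python
values_per_cycle = 8    # complex values loaded and stored per clock cycle
log_fft_size = 12       # log2(4096), LOG_FFT_SIZE in Table XII
flop_per_stage = 5      # single-precision operations per value and stage
mem_freq_mhz = 300.0    # assumed memory interface frequency (from PTRANS)

flop_per_value = log_fft_size * flop_per_stage
peak_gflops = values_per_cycle * flop_per_value * mem_freq_mhz / 1e3


def gflops_to_gbs(gflops):
    """Convert FFT performance to the memory bandwidth it implies."""
    bytes_per_value = 2 * 8  # one 8-byte read plus one 8-byte write
    return gflops * bytes_per_value / flop_per_value


measured_gflops = {"520N": 116.67, "PAC SVM": 60.30}  # Table XVI

print(f"peak: {peak_gflops:.0f} GFLOP/s")
for board, gflops in measured_gflops.items():
    print(f"{board}: {100 * gflops / peak_gflops:.1f}% of peak, "
          f"{gflops_to_gbs(gflops):.2f} GB/s")
```

Under these assumptions, the conversion yields the bandwidth values that Table XVI lists in parentheses for the FFT benchmark.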
The LINPACK benchmark uses a 4096 × 4096 matrix for
calculation. Its synthesized kernel shows a high resource usage
and a low kernel frequency. This indicates a complex kernel
design that increases the difficulty for the compiler to place
and route the kernel components on the FPGA efficiently.
Some more optimization efforts are necessary to simplify the
design and allow higher kernel frequencies which will also
increase the performance of the kernel.
V. FURTHER FINDINGS AND INVESTIGATIONS
In the previous section, we presented benchmark measure-
ments on different FPGA platforms, BSPs and SDKs and
included comparisons with simple performance models. Now,
we go one step further and discuss how the benchmark suite
can capture specific issues in the interplay between FPGA
targets and tools, with a focus on the STREAM benchmark.
A. Impact of Memory Access Patterns and Data Layout
The results of the STREAM benchmark for the 520N and
U280 using DDR show that the design can utilize all
memory banks to a high degree, even across different FPGA
architectures and tools. Nevertheless, the PAC SVM performance
TABLE XVI
MEASUREMENT RESULTS FOR THE REMAINING BENCHMARK
APPLICATIONS

| Benchmark | Board | Result | Error |
|---|---|---|---|
| b_eff | 520N | 31.32 GB/s | – |
| PTRANS | 520N | 3.56 GFLOP/s (42.79 GB/s) | 3.81470e-06 |
| | PAC SVM | 0.28 GFLOP/s (3.36 GB/s) | 2.39471e-07 |
| | U280 DDR | 0.48 GFLOP/s (3.67 GB/s) | 3.81470e-06 |
| FFT | 520N | 116.67 GFLOP/s (31.11 GB/s) | 3.17324e-01 |
| | PAC SVM | 60.30 GFLOP/s (16.08 GB/s) | 3.17324e-01 |
| GEMM | 520N | 321.59 GFLOP/s (100 MHz norm: 100.23 GFLOP/s) | 1.54499e-06 |
| | PAC SVM | 241.76 GFLOP/s (100 MHz norm: 81.68 GFLOP/s) | 2.39471e-07 |
| | U280 DDR | 202.62 GFLOP/s (100 MHz norm: 81.05 GFLOP/s) | 1.43683e-06 |
| LINPACK | 520N | 7.51 GFLOP/s | 5.96278e+02 |
| | PAC SVM | 3.46 GFLOP/s | 6.54650e+04 |
varies considerably among the four operations. A reason for this is
the imbalance in read and write accesses for the Add and
Triad operations. Two arrays are read, but only one array is
written, so only half of the available write bandwidth of the
PCIe connection can be utilized, which leads to a performance
of 15 GB/s. Still, the Triad operation shows a performance
significantly below this value. We were able to reproduce a
similar effect for the Add operation by changing the allocation
order of the three arrays from A, B, C to C, B, A. This
behavior suggests an issue with the banking of the DDR
memory between the arrays that are allocated in second and
third place. Separating the two arrays by allocating a fourth
array between B and C resolved the measured performance
differences. However, since the placement of the arrays
cannot be controlled as precisely for SVM as with
OpenCL buffers, these effects have to be considered when
multiple global memory buffers are used with SVM kernels.
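The effect of the read/write imbalance on a full-duplex link can be illustrated with a simple model. The Python sketch below is not part of the benchmark suite. It assumes that both PCIe directions sustain the same bandwidth, that this per-direction bandwidth can be estimated as half of the measured Copy result (Copy moves the same amount of data in both directions), and that the busier direction limits the throughput.

```python
def stream_bandwidth_gbs(arrays_read, arrays_written, per_direction_gbs):
    """Total bandwidth if the busier link direction is the bottleneck."""
    busiest = max(arrays_read, arrays_written)
    return (arrays_read + arrays_written) * per_direction_gbs / busiest


# Per-direction estimate derived from the PAC SVM Copy result (Table XIV).
per_direction_gbs = 20.15 / 2

copy_model = stream_bandwidth_gbs(1, 1, per_direction_gbs)
add_model = stream_bandwidth_gbs(2, 1, per_direction_gbs)

print(f"Copy/Scale model: {copy_model:.2f} GB/s")
print(f"Add/Triad model:  {add_model:.2f} GB/s")
```

With these assumptions, the model gives about 15.1 GB/s for Add and Triad, close to the 15.04 GB/s measured for Add. The lower Triad result of 11.66 GB/s is not explained by this model, which is consistent with the banking issue discussed above.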
The RandomAccess benchmark showed that non-consecutive
memory accesses do not perform well in the SVM mode. This
is also reflected in the low performance for PTRANS and GEMM,
which use strided memory accesses in global memory
because of their block-wise calculation of the final result. The
strided memory access reduces the length of
the memory bursts and lowers the memory bandwidth. PTRANS
achieves only 16.7% of the peak memory bandwidth
measured with STREAM. A lower memory bandwidth also
affects the overall performance of the compute-bound GEMM
benchmark, since the kernel executes global memory accesses
and calculation sequentially. Because of the heavily decreased
memory bandwidth, the read and write overhead to global
memory increases and harms the overall kernel performance.
A similar effect can also be observed for GEMM executed on
the U280 board, which achieves a comparable performance
efficiency. The performance impact could be reduced by
reordering the strided memory on the CPU before writing it
to the FPGA. This would allow larger memory bursts and
bandwidth similar to the values measured with STREAM and
would also increase the performance of the named benchmarks
on the boards. Nevertheless, this would also mean that the
CPU has to do a considerable amount of pre-processing, which
would exceed the goal of the benchmarks to measure the pure
FPGA performance.
B. Impact of Kernel Frequencies
An additional synthesis for the 520N board with the
local buffer size increased to 16,384 resulted in the resource
usage and performance given in Table XVII. The high BRAM
usage decreased the maximum kernel frequency to 280 MHz,
which is 93.34% of the frequency of the memory controller.
At the same time, the benchmark achieves up to 92.15%
of the performance of the kernel running at
more than 300 MHz, which correlates with the frequency
reduction. This shows that the benchmark is also capable
of measuring the performance of the board and tools, since
the kernel frequency depends not least on the placement and
routing of the kernel components by the compiler. Another
interesting insight in the measurement results is the difference
in the performance of Copy/Scale and Add/Triad. Since the
kernel has a lower frequency than the memory controller, the
kernel itself becomes the performance bottleneck. Execution
overheads introduced by switching the pipelines now directly
affect the kernel performance and are no longer hidden by
latencies of the slower memory controller.
TABLE XVII
STREAM BENCHMARK RESULTS ON THE 520N BOARD WITH 1 MB
LOCAL MEMORY BUFFER

| Synthesis | | Execution | |
|---|---|---|---|
| LUTs | 203,607 (26%) | Copy | 63.48 GB/s |
| FFs | 436,516 (26%) | Scale | 63.49 GB/s |
| BRAM | 7,409 (63%) | Add | 58.96 GB/s |
| DSPs | 128 (2%) | Triad | 59.00 GB/s |
| Freq. | 280.00 MHz | PCIe Write | 6.40 GB/s |
| | | PCIe Read | 6.32 GB/s |
C. Kernel Scheduling on the U280 with HBM2
In earlier measurements, the U280 board showed a very low
efficiency of below 30% for all operations of the STREAM
benchmark, so we further investigated the kernel performance
with additional experiments. The STREAM benchmark mea-
sures the time from starting the first kernel until the last kernel
has finished execution to calculate the bandwidth. This is
similar to using the slowest runtime of all kernels to calculate
the bandwidth. For a more detailed look at the performance
of the benchmark, we used the profiling information provided
by the OpenCL API to collect the execution times separately
for every kernel during the benchmark execution. The
profiling events used are the kernel start and end times. The
resulting kernel execution times for the Copy operation are
given in Figure 1 in groups of 15 kernels according to
the chronological order in which they were enqueued for execution.
For this measurement, the kernels are enqueued sequentially
in two different orders, and the measurements were repeated
20 times. The second and third group show approximately
double and three times the execution time of the first 15
kernels. This leads to the hypothesis that only 15 kernel
executions can be maintained simultaneously and that the
reported kernel execution time also contains the wait time until
the kernel is started. This insight triggered further research into
the kernel scheduler of the Xilinx OpenCL runtime system,
which actually contains three scheduler implementations. The
two scheduler variants ERT and KDS did not support more than 15
concurrent kernels at that time. After disabling both in a
configuration file, execution falls back to a scheduler that
does not have this limitation. This mode was used to
generate the results in Table XIV. This insight illustrates the
importance of a benchmark suite defined on the OpenCL level
that measures not only the raw performance potential of FPGA
hardware, but also the impact of compilation and synthesis
tools and, in this case, runtime environments.
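The hypothesis of a scheduler with 15 concurrent execution slots can be illustrated with a small simulation. The following Python sketch is not the Xilinx scheduler. It assumes that all kernels are enqueued at time zero, that every kernel needs the same runtime (a hypothetical 10 ms), and that the reported time includes the wait for a free slot.

```python
import heapq


def reported_times(num_kernels, slots, runtime_ms):
    """Time from enqueue (t = 0) to completion for each kernel, in order."""
    free_at = [0.0] * slots  # min-heap of times at which a slot becomes free
    heapq.heapify(free_at)
    times = []
    for _ in range(num_kernels):
        start = heapq.heappop(free_at)  # wait for the earliest free slot
        end = start + runtime_ms
        heapq.heappush(free_at, end)
        times.append(end)
    return times


limited = reported_times(num_kernels=32, slots=15, runtime_ms=10.0)
unlimited = reported_times(num_kernels=32, slots=32, runtime_ms=10.0)

print("first 15:", set(limited[:15]))
print("next 15: ", set(limited[15:30]))
print("last 2:  ", set(limited[30:]))
print("no slot limit:", set(unlimited))
```

With 15 slots, the three groups report one, two, and three times the kernel runtime, which matches the pattern described for Figure 1. Without the slot limit, all 32 kernels report the same time.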
D. Power Measurements of the STREAM benchmark on FPGA
and CPU
Another, increasingly important metric in HPC is power
efficiency. By default, the presented HPCC FPGA benchmarks
do not measure the power consumption since there is no
standardized way to retrieve this data. Nevertheless, we created
measurement scripts for each FPGA to measure the power
consumption of STREAM to investigate the power efficiency
Fig. 1. Kernel execution times for the Copy operation on the U280 board
with HBM2 memory using 32 kernel replications. The times are combined in
blocks of 15 kernels in the order they were enqueued for execution.
[Plot: time in ms (axis from 0 to 30) for the groups "First 15 kernels", "Next 15 kernels", and "Last 2 kernels".]
TABLE XVIII
POWER CONSUMPTION VS BENCHMARK RESULTS FOR THE STREAM
BENCHMARK

| | 520N | U280 HBM2 | U280 DDR | PAC SVM | 2x Xeon |
|---|---|---|---|---|---|
| Average [W] | 68.82 | 41.44 | 36.15 | 54.43∗ | 323.29 |
| Peak [W] | 76.78 | 59.98 | 47.09 | 55.21∗ | 344.91 |
| Performance per Watt [GB/(sW)] | 0.91 | 6.31 | 0.74 | 0.37∗ | 0.53 |
| Copy [GB/s] | 69.09 | 372.27 | 34.06 | 20.43 | 167.15 |
| Scale [GB/s] | 69.10 | 374.82 | 34.05 | 20.43 | 173.44 |
| Add [GB/s] | 70.08 | 378.51 | 34.67 | 15.10 | 183.99 |
| Triad [GB/s] | 70.02 | 378.55 | 34.66 | 15.12 | 183.40 |
| PCIe read [GB/s] | 6.40 | 7.46 | 5.44 | – | – |
| PCIe write [GB/s] | 6.30 | 7.18 | 5.47 | – | – |

∗Additional power consumed by using host DDR memory not included
of recent FPGA boards during execution. The scripts measured
the power consumption during the benchmark execution in
intervals of 100 ms, and we calculated the average over the
whole execution time. To reduce the FPGA idle time during
the measurement, we increased the array sizes to the largest
power of two that fits the device and increased the number of
repetitions to 100.
As a comparison, we executed STREAM v5.10 on the two-
socket CPU system equipped with Intel Xeon Gold 6148 that is
also used as host for the Nallatech 520N and PAC D5005. We
measured the CPU + DDR power consumption using PCM
Tools (https://github.com/opcm/pcm, version 202005) in a 1 s interval during execution.
Table XVIII presents the power consumption along with the
benchmark performance of the FPGA targets and the CPU reference.
The power consumption of the benchmark execution is given
in the upper part of the table. The average power consumption
is calculated for the whole benchmark execution time. It needs
to be noted that for all FPGA executions except for the
PAC SVM, this also includes buffer transfer times for every
benchmark repetition. During this time, the FPGA is nearly
idle and consumes less power. Therefore, the performance
per Watt is calculated using the peak measured bandwidth
divided by the peak power consumption. This metric can be
used to compare the power efficiency of the devices during
the execution of STREAM. FPGA platforms using on-board
DDR memory are up to 1.7x more power efficient than the CPU
system during this memory-intensive benchmark, and the FPGA
with HBM2 is 11.9x more efficient.
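The performance per Watt metric and the resulting efficiency ratios can be retraced from Table XVIII. The following Python sketch divides the highest measured bandwidth of the four operations by the peak power. The ratios relative to the CPU are computed from the per-Watt values rounded to two decimals, as they appear in the table. The variable names are illustrative.

```python
# Peak power in W and Copy, Scale, Add, Triad bandwidth in GB/s (Table XVIII).
peak_power_w = {
    "520N": 76.78, "U280 HBM2": 59.98, "U280 DDR": 47.09,
    "PAC SVM": 55.21, "2x Xeon": 344.91,
}
bandwidth_gbs = {
    "520N": [69.09, 69.10, 70.08, 70.02],
    "U280 HBM2": [372.27, 374.82, 378.51, 378.55],
    "U280 DDR": [34.06, 34.05, 34.67, 34.66],
    "PAC SVM": [20.43, 20.43, 15.10, 15.12],
    "2x Xeon": [167.15, 173.44, 183.99, 183.40],
}

# Peak measured bandwidth divided by peak power, rounded as in the table.
perf_per_watt = {
    device: round(max(bandwidth_gbs[device]) / peak_power_w[device], 2)
    for device in peak_power_w
}

cpu = perf_per_watt["2x Xeon"]
for device, value in perf_per_watt.items():
    print(f"{device}: {value:.2f} GB/(sW), {value / cpu:.1f}x relative to CPU")
```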
E. Limitations and Future Work
FFT is a memory-bound application that uses consecutive
memory accesses to read and write the data in global memory.
This allows a direct comparison of the performance results to
the synthetic STREAM benchmark. Only two memory banks
are used by the kernel, so only half of the total bandwidth can
be utilized. As can be seen in Table XVI, the FFT kernel achieves
31.11 GB/s on the 520N, which corresponds
to 90.3% of the 34.45 GB/s that are given as upper bound
by the STREAM benchmark. Considering this bandwidth
reduction, the FFT benchmark already achieves a high memory
utilization. Nevertheless, the application only uses half of the
total available global memory banks. This could partially be
resolved by using memory interleaving or kernel replication.
This also holds for GEMM, where the relatively low resource
usage is limiting the kernel performance. An additional config-
uration option for replications can be a way to allow matching
configurations of the base implementation on different FPGAs
and better utilization of the available resources.
The benchmark execution for LINPACK on two different
devices showed that the kernel design could not meet the
expected performance of this algorithm. Especially the design
of a kernel for the gefa routine is a challenging problem
because of the complexity of the algorithm. Previous works
show that it is possible to achieve well-performing designs for
specific FPGAs [12], [18], [21]. Still, these designs are limited
in the matrix size or the accuracy because of missing pivoting.
The existing implementations show that the development of
a broadly usable base implementation for LINPACK is a
challenging task. Until then, it is also possible and explicitly
allowed to execute the benchmark with customized kernels to
collect performance results.
VI. CONCLUSION
In this paper, we proposed HPCC FPGA, a novel FPGA
OpenCL benchmark suite for HPC. To this end, we provide
configurable OpenCL base implementations and host codes
for all benchmarks of the well-established HPCC benchmark
suite. We showed that the configuration options allow the
generation of efficient benchmark kernels for Xilinx and
Intel FPGAs using the same source code without manual
modification. We executed the benchmarks on up to three
FPGAs with four different memory setups and compared the
results with simple performance models. Most benchmarks
showed a high performance efficiency when compared to the
models. Nevertheless, the evaluation showed that the base
implementations are often unable to fully utilize the available
resources on an FPGA board. Hence, it is important to
discuss the base implementations and configuration options
with the community to create a valuable and widely accepted
FPGA performance characterization tool for HPC. We made
the code open-source and publicly available to simplify and
encourage contributions to future versions of the benchmark
suite.
ACKNOWLEDGEMENTS
The authors gratefully acknowledge the support of this
project by computing time provided by the Paderborn Center
for Parallel Computing (PC2). We also thank Xilinx for the
donation of an Alveo U280 card, Intel for providing a PAC
D5005 loaner board and access to the reference design BSP
with SVM support, and the Systems Group at ETH Zurich
as well as the Xilinx Adaptive Compute Clusters (XACC)
program for access to their Xilinx FPGA evaluation system.
REFERENCES
[1] J. L. Henning, “SPEC CPU2006 benchmark descriptions,” ACM
SIGARCH Computer Architecture News, vol. 34, no. 4, pp. 1–
17, September 2006. [Online]. Available: http://doi.acm.org/10.1145/
1186736.1186737
[2] J. J. Dongarra and P. Luszczek, “Introduction to the HPCChallenge
Benchmark Suite,” Defense Technical Information Center, Fort
Belvoir, VA, Tech. Rep., Dec. 2004. [Online]. Available: http:
//www.dtic.mil/docs/citations/ADA439315
[3] TOP500.org, “TOP500 Supercomputer Sites,” https://www.top500.org/
lists/2019/11/, accessed: 2020-03-31.
[4] G. Juckeland, W. Brantley, S. Chandrasekaran, B. Chapman, S. Che,
M. Colgrove, H. Feng, A. Grund, R. Henschel, W.-M. W. Hwu, H. Li,
M. S. Müller, W. E. Nagel, M. Perminov, P. Shelepugin, K. Skadron,
J. Stratton, A. Titov, K. Wang, M. van Waveren, B. Whitney, S. Wienke,
R. Xu, and K. Kumaran, “Spec accel: A standard application suite
for measuring hardware accelerator performance,” in High Performance
Computing Systems. Performance Modeling, Benchmarking, and Simu-
lation, S. A. Jarvis, S. A. Wright, and S. D. Hammond, Eds. Cham:
Springer International Publishing, 2015, pp. 46–67.
[5] M. Owaida and G. Alonso, “Application Partitioning on FPGA Clusters:
Inference over Decision Tree Ensembles,” in 2018 28th International
Conference on Field Programmable Logic and Applications (FPL).
IEEE, 2018, pp. 295–2955.
[6] K. Sano, Y. Hatsuda, and S. Yamamoto, “Multi-FPGA accelerator for
scalable stencil computation with constant memory bandwidth,” IEEE
Transactions on Parallel and Distributed Systems (TPDS), vol. 25, no. 3,
pp. 695–705, March 2014.
[7] T. De Matteis, J. de Fine Licht, J. Beránek, and T. Hoefler, “Streaming
message interface: High-performance distributed memory programming
on reconfigurable hardware,” in Proc. Int. Conf. for High Performance
Computing, Networking, Storage and Analysis, ser. SC ’19. New York,
NY, USA: Association for Computing Machinery, 2019. [Online].
Available: https://doi.org/10.1145/3295500.3356201
[8] R. Kobayashi, Y. Oobata, N. Fujita, Y. Yamaguchi, and T. Boku,
“OpenCL-ready high speed FPGA network for reconfigurable high
performance computing,” in Proceedings of the International Conference
on High Performance Computing in Asia-Pacific Region, ser. HPC Asia
2018. New York, NY, USA: Association for Computing Machinery,
2018, pp. 192–201. [Online]. Available: https://doi.org/10.1145/3149457.
3149479
[9] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee,
and K. Skadron, “Rodinia: A Benchmark Suite for Heterogeneous
Computing,” in 2009 IEEE Int. Symp. on Workload Characterization
(IISWC). IEEE, 2009, pp. 44–54.
[10] W.-c. Feng, H. Lin, T. Scogland, and J. Zhang, “OpenCL and the
13 dwarfs: A work in progress,” in Proc. Int. Conf. on Performance
Engineering (ICPE), ser. ICPE ’12. New York, NY, USA: Association
for Computing Machinery, 2012, pp. 291–294.
[11] A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spaf-
ford, V. Tipparaju, and J. S. Vetter, “The scalable heterogeneous comput-
ing (shoc) benchmark suite,” in Proc. of the 3rd Workshop on General-
Purpose Computation on Graphics Processing Units, ser. GPGPU-3.
New York, NY, USA: Association for Computing Machinery, 2010, pp.
63–74.
[12] H. R. Zohouri, N. Maruyama, A. Smith, M. Matsuda, and S. Matsuoka,
“Evaluating and optimizing OpenCL kernels for high performance
computing with FPGAs,” in SC’16: Proceedings of the International
Conference for High Performance Computing, Networking, Storage and
Analysis. IEEE, 2016, pp. 409–420.
[13] Y. Zhou, U. Gupta, S. Dai, R. Zhao, N. Srivastava, H. Jin, J. Featherston,
Y.-H. Lai, G. Liu, G. A. Velasquez, W. Wang, and Z. Zhang, “Rosetta:
A Realistic High-Level Synthesis Benchmark Suite for Software-
Programmable FPGAs,” Int. Symp. on Field-Programmable Gate Arrays
(FPGA), Feb 2018.
[14] G. Ndu, J. Navaridas, and M. Luján, “Cho: Towards a benchmark
suite for OpenCL FPGA accelerators,” in Proceedings of the 3rd
International Workshop on OpenCL, ser. IWOCL ’15. New York, NY,
USA: Association for Computing Machinery, 2015. [Online]. Available:
https://doi.org/10.1145/2791321.2791331
[15] Q. Gautier, A. Althoff, P. Meng, and R. Kastner, “Spector: An OpenCL
FPGA Benchmark Suite,” in 2016 International Conference on Field-
Programmable Technology (FPT), Dec 2016, pp. 141–148.
[16] J. D. McCalpin, “Memory bandwidth and machine balance in current
high performance computers,” 1995.
[17] P. Gorlani, T. Kenter, and C. Plessl, “OpenCL Implementation of
Cannon’s Matrix Multiplication Algorithm on Intel Stratix 10 FPGAs,”
in 2019 International Conference on Field-Programmable Technology
(ICFPT), 2019, pp. 99–107.
[18] W. Zhang, V. Betz, and J. Rose, “Portable and Scalable FPGA-
based Acceleration of a Direct Linear System Solver,” ACM Trans.
Reconfigurable Technol. Syst., vol. 5, no. 1, pp. 6:1–6:26, Mar. 2012.
[Online]. Available: http://doi.acm.org/10.1145/2133352.2133358
[19] E. Luebbers, S. Liu, and M. Chu, “Simplify software integration
for FPGA accelerators with OPAE,” White Paper, Intel, 2019.
[Online]. Available: https://01.org/sites/default/files/downloads/opae/
open-programmable-acceleration-engine-paper.pdf
[20] BittWare OpenCL S10 BSP Reference Guide, February 2020, rev. 1.3.
[21] K. Turkington, K. Masselos, G. A. Constantinides, and P. Leong, “FPGA
Based Acceleration of the Linpack Benchmark: A High Level Code
Transformation Approach,” in 2006 International Conference on Field
Programmable Logic and Applications, Aug. 2006, pp. 1–6.
