FBLAS: Streaming Linear Algebra on FPGA by De Matteis, Tiziano et al.
FBLAS: Streaming Linear Algebra on FPGA
Tiziano De Matteis, Johannes de Fine Licht and Torsten Hoefler
Department of Computer Science, ETH Zurich, Switzerland
Emails: {tdematt, definelj, htor}@inf.ethz.ch
Abstract—Spatial computing architectures pose an attractive
alternative to mitigate control and data movement overheads
typical of load-store architectures. In practice, these devices
are rarely considered in the HPC community due to the steep
learning curve, low productivity, and the lack of available
libraries for fundamental operations. High-level synthesis (HLS)
tools are facilitating hardware programming, but optimizing for
these architectures requires factoring in new transformations and
resources/performance trade-offs. We present FBLAS, an open-
source HLS implementation of BLAS for FPGAs, that enables
reusability, portability and easy integration with existing software
and hardware codes. FBLAS’ implementation allows scaling
hardware modules to exploit on-chip resources, and module
interfaces are designed to natively support streaming on-chip
communications, allowing them to be composed to reduce off-
chip communication. With FBLAS, we set a precedent for FPGA
library design, and contribute to the toolbox of customizable
hardware components necessary for HPC codes to start produc-
tively targeting reconfigurable platforms.
Index Terms—Spatial architectures, high level synthesis, hard-
ware library
I. INTRODUCTION
The end of Dennard scaling [1] and Moore’s law [2] has
exhibited the limitations of traditional load-store architectures
(LSA), where data movement has come to dominate both
energy and performance. In these systems, more than 90%
of the energy consumed by a floating point instruction is
spent on register files, cache, and control logic [3]. These
inherent inefficiencies of LSAs breathe life into the research
field of computer architecture [4]. On the one hand, domain
specific architectures (DSAs) are often proposed to accelerate
specific part of computations or application domains, such as
Google TPU [5] and NVIDIA Tensor Cores [6]). The hardware
is tailored to a specific class of computations, and domain
specific languages are offered to program it. On the other
hand, spatial computing devices, such as field-programmable
gate arrays (FPGAs), gained attention as a way to allow rapid
prototyping of application specific circuits that are driven by
data movement itself.
The reconfigurability of spatial computing devices have
traditionally come with a trade-off in computing performance.
However, modern high-performance FPGAs ship with native
floating point units (e.g., Intel Stratix 10), high-bandwidth
memory (e.g., Xilinx Alveo U280), and high-speed net-
work connections, rendering them competitive also on HPC
workloads [7], [8]. In addition to this, high-level synthesis
(HLS) tools have ushered in a new wave of interest from
the community, where frameworks such as the Intel FPGA
SDK for OpenCL [9] and Xilinx Vivado HLS [10] enable
Fig. 1: Overview of the FBLAS 1library.
the programmers to use high-level languages when targeting
FPGAs, using familiar languages such as C, C++, or OpenCL
to synthesize hardware circuits.
Although HLS tools reduces the development cycle of
FPGA programs, optimizing for FPGAs is more challenging
than for load-store architectures, as the addition of space
utilization as a metric for code transformations and optimiza-
tions leads to a (chip) space/time trade-off, which must be
considered by the programmer.
Designing and optimizing libraries of reusable hardware
components with HLS, poses a series of challenges that
are similar with DSAs: How to efficiently exploit massive
hardware parallelism? How to properly dimension hardware
modules? How to compose hardware components to exploit
inter-module parallelism? What are the right abstractions
that must be offered for programming the hardware? With
respect to traditional hardware development workflow, high-
level synthesis allows the designer to focus on these crucial
questions, removing the burden of dealing with low level
technicalities (e.g, register transfer optimization, management
of external interfaces), and it produces register transfer logic
(RTL), that is still amenable to be realized on silicon.
We present FBLAS, a flexible and customizable implemen-
tation of the Basic Linear Algebra Subroutines (BLAS) for
FPGAs. Our objective is twofold. First, we want to enable
the rapid development of numerical computations that target
spatial architectures, giving the HLS programmer access to a
set of customizable routines that can be re-used, composed,
and integrated with other program logic. Second, we want to
investigate the aforementioned questions, providing guidelines
and good practices for the development of purpose-built plat-
forms using HLS tools. The contributions of this paper are:
• FBLAS, the first portable and open source BLAS imple-
mentation on FPGA, realized entirely with state-of-the-art
1FBLAS is publicly available at: https://github.com/spcl/fblas
Preprint version.
ar
X
iv
:1
90
7.
07
92
9v
5 
 [c
s.D
C]
  1
 Se
p 2
02
0
HLS tools. It promotes productivity, reusability, and easy
integration with existing hardware and software code;
• a characterization of HLS routines by the key characteristics
that affect the performance, resource usage, and memory
bandwidth consumption of the resulting design;
• models to investigate space/time trade-offs of hardware
modules and to enable the user to choose desirable combina-
tions of parameters to optimize performance and/or resource
usage of her circuit design; and
• guidelines for composing modules that communicate
through on-chip resources, avoiding costly reads from off-
chip memory, providing significant performance improve-
ments for I/O bound computations, and reducing the total
communication volume.
We believe that the insights obtained from developing FBLAS
can be exploited in the development of other hardware li-
braries. Although in this work we focus on FPGA as a
platform for the implementation of specialized hardware, we
conjecture that our library-development workflow can be used
to implement execution units for DSAs that offer high-level
domain-specific functions similar to tensor units.
II. THE FBLAS LIBRARY
FBLAS currently targets Intel devices, and exposes two
layers of functionality to the programmer (see Fig. 1): HLS
modules, produced by a provided code generator, which can
be integrated into existing hardware designs; and a high-
level host API conforming to the classical BLAS interface.
This distinction is a key contrast to software libraries, as it
facilitates two ways of interacting with the library, depending
on the use-case and level of abstraction required.
A. HLS modules
HLS modules are independent computational entities that
implement a library function (i.e., a routine), with fixed seman-
tics and interface. They could be seen as specialized hardware
operators. Thanks to the massive spatial parallelism available,
different modules can execute in parallel; and we enable mod-
ules to exchange data using direct on-chip resources, rather
than resorting to DRAM. In FBLAS, modules implement
BLAS routines (DOT, GEMV, GEMM, etc.). To express on-
chip communication, HLS tools provide the FIFO abstraction
(also referred to as channels or streams) to represent typed
and bounded single-producer/single-consumer queues imple-
mented in hardware. HLS Modules have a streaming interface:
data is received and produced through FIFO buffers. Users
of the library can invoke independent functions, or integrate
FBLAS modules directly in a streaming setting.
Modules have been designed with compute performance in
mind, exploiting the spatial parallelism and fast on-chip mem-
ory on FPGAs. In addition, HLS modules can be characterized
by a precise performance and space/time model (see Sec. IV),
allowing the user to optimize them for performance and/or
resource consumption.
B. Host API
The Host API allows the user to invoke routines directly
from the host program. The API is written in C++, and
provides a set of library calls that match the classical BLAS
calls in terms of signature and behavior. Following the stan-
dard OpenCL programming flow, the host programmer is
responsible to transferring data to and from the device, can
invoke the desired FBLAS routines working on the FPGA
memory, then copy back the result from the device. Library
calls can be synchronous (return when the computation is
done) or asynchronous (return immediately).
C. Code Generator
HLS modules in isolation work on streaming interfaces,
which must be integrated with consumers and producers.
Often these must be connected to DRAM, requiring dedicated
modules to interact with off-chip memory. To produce the
HLS modules required by both the high-level and low-level
API, FBLAS provides a template-based code generator, that
produces synthesizable OpenCL kernels. If the data is stored
in DRAM, helper kernels must be created to read and inject
data to the modules, and to write results back to memory. The
code generator accepts a routines specification file, which is a
JSON file provided by the programmer, specifying the routines
that she wants to invoke. The programmer can customize
functional and non-functional parameters of the routine. Func-
tional parameters affect the logic of the routine, by specifying,
for example, if a Level 2 routine accepts a transposed or
non-transposed matrix. Non-functional parameters regulate
vectorization widths and tile sizes, and can be tailored by the
user according to her performance and/or resource occupation
needs. The code generator will produce a set of OpenCL files
that can be used as input to the HLS compiler to synthesize
the bitstream for reprogramming the FPGA.
III. MODULE DESIGN
FBLAS modules come pre-optimized with key HLS trans-
formations, but are configurable to allow the user to spe-
cialize them according to desired performance or utilization
requirements (see Sec. IV). This is facilitated by tweaking
the parameters given to the employed HLS transformations,
described below. This can in turn affect the streaming behavior
of kernels depending on the tiling strategy used, as detailed in
Sec. III-B.
A. Applied HLS Optimizations
To utilize the massive spatial parallelism offered by FPGAs
we must exploit pipelining of computations. FPGAs primarily
rely on deep pipelines to achieve pipeline parallelism in a
“multiple instruction, single data”-fashion. This is true both
within single compute kernels, such as individual BLAS
routines, and between kernels, where data can be streamed to
achieve more reuse and increase pipeline parallelism. While
HLS tools reduce the complexity of FPGA programming,
writing optimized code that results in efficient hardware design
is still a challenging task. Traditional program optimizations
are insufficient, as they do not consider pipelining and spatial
replication of hardware circuits [11]. To optimize FBLAS
circuits, we employ a set of FPGA-targeted optimizations,
divided into three classes.
1) Pipeline-enabling Transformations: Achieving perfect
pipelining is crucial for efficient hardware design. For a loop,
this implies an Initiation Interval (“II”, or just I) of 1, meaning
that a new loop iteration is started every clock cycle. To allow
this, the programmer must resolve loop-carried dependencies
and hardware resource contention (usually to memory blocks),
which can prevent the tool from scheduling the loop with an
II of 1. To overcome these issues, we apply iteration space
transposition, loop strip-mining, and accumulation interleav-
ing [11]. This is particularly relevant when targeting double
precision, as the target FPGA used in this work does not
support double precision accumulation natively, requiring a
two-stage circuit to fully pipeline.
2) Replication: Parallelism on FPGA is achieved by repli-
cating compute units, either “horizontally” (SIMD-style vec-
torization) or “vertically” (parallelism through data reuse).
We achieve this by unrolling loop nests. Loop unrolling is
applied by strip-mining the computations to expose unrolling
opportunities for parallelism. We define the vectorization width
W the unrolling factor for inner loops. As this directly affects
the generated hardware, it must be a compile-time constant.
When the input data size is small and known a priori, the
routine loops can be fully unrolled. This enable the modules
to start a new routine computation on each clock cycle, at the
expense of a high resource utilization (see Sect. IV)
3) Tiling: Loop tiling is a well-known code optimization
used to increase both spatial and temporal locality of compu-
tations. In HLS, we use the transformation to organize loop
schedules such that reused data fits into fast on-chip memory,
saving costly accesses to DRAM. In FBLAS, tiling is used for
Level 2 and Level 3 routines, where there is opportunity for
data reuse. Tiling is implemented by strip-mining loops and
reordering them to the desired reuse pattern, and explicitly
instantiating buffers for the reused data using C-arrays. Tile
sizes must be defined at compile-time, as they affect the
number of memory blocks instantiated to hold the data. Fully
unrolled routines do not need tiling.
Because FBLAS modules are implemented with streaming
interfaces, tiling routines has implications for how data must
be sent to, from, and between modules. To avoid mismatches,
this must be treated explicitly by the framework.
B. Impact of Tiling on Streaming Kernels
The module streaming interface specifies how input data is
received and how output data is produced. BLAS routines ac-
cept three classes of input data: scalars; vectors; and matrices.
Scalars are passed once when invoking a module, while vectors
and matrices are streamed between modules. As vectors are
only tiled along a single dimension, the tile size, and optionally
the number of repetitions, are the only interface parameters.
Matrices, on the other hand, are tiled in 2D, where both the tile
elements and the order of tiles can be scheduled by rows or by
3
21
4
1
4
2
3
1
2
3
4
N
M
TM
TN
A x y
TNTM 3
2
1
4
1
4
2
3
1
2
3
4
N
M
TM
TN
A x y
TNTM
Fig. 2: Different implementation of GEMV: A received in tiles by rows
(left), and A received in tiles by columns (right). Numbers indicates
arrival order, green circles are reads, orange squares are reads/writes.
columns. This results in 4 possible modes of streaming across
a matrix interface. We choose to adopt a 2D tiling schema
also for Level 2 routines as this i) open the possibility to
have different I/O complexities for the same routine, ii) favors
module composition both between routines in the same BLAS
Level (see Sec. V) and across different Levels.
FBLAS routines must take into account that data may be
streamed in different ways. Consider the GEMV routine that
computes y = αAx+ βy, where A is an N ×M matrix and
x and y are M and N elements vectors, and A is optionally
transposed. If A is not transposed, the routine receives A by
tiles (by rows or by columns), x and y, and pushes results to
the output stream. This can be implemented in two possible
ways: in the first case (see Fig. 2, left part), A is received in
tiles of size TN×TM , in a row-major fashion. For each tile of
A, a range of x (of size TM ) is used to update a block of y (of
size TN ). An entire row of tiles (i.e., MTN elements) is needed
to compute a block of y. This tiling scheme achieves reuse over
y, but requires receiving the x-vector dN/TNe times. We say
that this implementation requires that vector x is replayed.
The number of I/O operations for this computation is NM +
MN/TN + 2N , where only the vertical tile size contributes
to reducing the overall I/O, as it reduces the number of times
x must be replayed.
Another possibility could be to stream the tiles of A by
columns (see Fig. 2, right part). With this solution, we use an
entire column of tiles (i.e., NTM elements) and a block of
x to update all the blocks of y. The resulting y is produced
only at the end of the computation. In this case, y must be
replayed: since each block is updated multiple times, we need
to output it and re-read it dM/TMe times. The number of I/O
operations for this configuration is NM +M + 2NM/TM ,
where TM is now the primary factor affecting overall I/O.
This example shows how different ways of streaming input
data result in different way of computing the result. Handling
them in a single implementation would lead to the generation
of large designs due to the presence of multiples branch in the
control flow. Therefore, while we offer the traditional high-
level BLAS interface, it is clear that a spatial BLAS library
must offer configurable specialized streaming modules that
can be integrated with existing designs. Each version handles
a specific scheme affecting the order of input and output
elements, and varies in I/O complexity. Being BLAS routines
characterized by a set of restricted input data types and
parameters, this results in a limited number of variants. While
the specialization with lowest I/O can be easily determined for
a single module, additional constraints on feasible specializa-
tions can be introduced when integrating with existing work,
or composing with other kernels (see Sec. V).
While we offer the traditional high-level BLAS interface,
C. Systolic Implementation of GEMM
For the GEMM routine, naively unrolling loops (see
Sec. III-A) could result in an architecture with high fan-in/fan-
out, which prevents its scalability. Instead, we organize the
computation using a 2D systolic array [12], shown in Fig. 3.
FEED-A1
FEED-A0
FEE -A1
FEED-APR-1
FEED-B0 FEED-B1 FEED-BPC-1
DRAIN-C0 DRAIN-C1 DRAIN-CPC-1
READ A
READ B
STORE C
PE
(0,0)
PE
(PR-1,PC-1)
PE
(0,1)
PE
(0,PC-1)
PE
(1,0)
PE
(1,1)
PE
(1,PC-1)
PE
(PR-1,0)
PE
(PR-1,1)
DR
AM
Fig. 3: GEMM systolic implementation.
A grid of PR × PC processing elements (PEs) is in charge
of computing a tile of size TR × TC of the result matrix C.
We say that TR and TC represent the size of the compute tile,
and they are multiples of PR and PC respectively (the memory
tile). Each PE is this responsible for evaluating TRTC/PRPC
elements of the C-tile. Input elements are read from DRAM
(by Read A and Read B helper kernels) and forwarded by a set
of feeders, according to the used tiling schema, to the first row
and column of PEs. Then, each column of PEs forwards the
elements of A, and each row of PEs forwards the elements of
B, in a pipelined fashion. On each clock cycle, a PE receives
one element of A and one element of B, multiplies them and
accumulates over the resulting element of C. The PE will
accumulate on the same elements of C after TRTC/PRPC
clock cycles. When the result is complete, each PE sends its
contributions to a set of drainers. Final results are eventually
collected by the helper kernel Store C and written in memory.
This systolic architecture is characterized by the fact that each
PE has a constant fan-out (6 data connections).
How to implement the systolic architecture depends on the
target architecture. For Intel FPGAs, which are considered in
this first release of FBLAS, a systolic array can be described
by using a single kernel, using a function for the processing
element (PE), and a fully-unrolled nested loop to represent
the systolic array. Data distribution and draining is achieved
by using shift registers [13].
IV. SPACE AND TIME TRADE-OFFS
Performance and space utilization are the two main metrics
that must be considered while optimizing code for spatial hard-
ware. This space/time trade-off, i.e, the compromise between
resource consumption and performance, must be understood
to optimize the performance and/or resource usage of the
resulting design. In FBLAS, HLS modules implement hard-
ware numerical routines by means of fully pipelined nested
loops. Inner loops perform the actual processing: they are
unrolled and synthesized as circuits that implement the desired
computation. Outer loops are derived from loop unrolling and
tiling, and they can be considered as the schedule in which
computations are performed (i.e., to order in which operands
are sent to the circuit). Vectorization widths and tile sizes
represent the two knobs on which a programmer can act to
change module performance and used resources.
To capture the space/time trade-off, we introduce models
to analyze the interplay between parallelism, resource con-
sumption, and performance of an FPGA design. At first, we
consider a single module in isolation, assuming that input data
is always available in its input channels. Then, we show how
the presence of other modules can affect circuit dimensioning.
Code is shown as written using a generic HLS tool, in which
pragmas apply to the following loop, and FIFO buffers are
channels accessible through pop and push operations.
A. Modeling the computational circuit
We model the computational resource consumption and per-
formance of a circuit by using the work and depth model [14].
The cost of an algorithm is determined by considering the
work, i.e., the total number of operations that are performed
in a computation, and the depth, i.e., the length of the longest
shortest path from any input to any output. In the following,
we will refer to these quantities as the application work (AW )
and application depth (AD). The application depth is defined
as the minimum time needed to perform the computation
with a sufficient number of resources. We introduce the
additional concepts of circuit work (CW ) and circuit depth
(CD), to analyze the circuit implementing the inner loop of
the module (where the computation is actually performed).
The circuit work is linked to the number of resources required
to implement the computation in hardware, while the circuit
depth represents the latency of the circuit.
In FBLAS HLS modules, computations performed in the
inner loops can be viewed either as a map (loop iterations work
on different data) or as a map-reduce (intermediate results are
accumulated) computation. Modules that implement routines
such as SCAL, AXPY, GER, or SYR fall in the first category;
modules that implement DOT, GEMV, TRSV, or GEMM belong
to the second. In the following, we focus the analysis on SCAL
and DOT, as representative for each of the two cases.
The SCAL routine takes an input vector of N elements
and scales them by using a user-defined factor. Since each
operation is independent from the others, the application work
and depth are AW = N and AD = LM , where LM is the
latency of a multiplication for the given data type. An excerpt
of the HLS implementation code is shown in Fig. 4, where
alpha is the scaling factor. The computation loop has been
strip-mined with a factor W , which is the vectorization width.
For the circuit work and depth analysis, we consider the
body of the outer loop (Lines 3-7). At each iteration, it per-
forms the scaling of W distinct input elements, implemented
1 void scal(float alpha, int N, chan ch_x, chan ch_out) {
2 for (int it = 0; it < N / W; it++) {
3 #pragma unroll
4 for (int i = 0; i < W; i++) {
5 x[i] = alpha ∗ pop(ch_x);
6 push(ch_out, x[i]);
7 } } }
Fig. 4: (Top) SCAL implementation: data is received/sent using
channels ch_x/ch_out. (Bottom) SCAL: circuit work and depth
analysis with W = 4.
through the unrolled inner loop. Fig. 4 (bottom part) shows
the computation performed with W = 4. Nodes represent
operations. The circuit work CW is equal to the vectorization
width W , while the circuit depth CD is equal to LM .
In general, the number of cycles C required to execute a
pipeline with latency L, initiation interval I , and taking M
elements as input, is:
C = L+ IM
In FBLAS all routines have been implemented using pipeline-
enabling transformations (see Sect. III-A), resulting in mod-
ules with I = 1. Given that the latency is the circuit depth,
the number of cycles required to execute the pipeline becomes
C = CD+M , where M is the number of iterations of the inner
loop. Therefore, for SCAL, we will have C = LM +N/W : if
we increase the vectorization width we will linearly reduce the
number of loop iterations, and consequently reduce the time
to completion.
The DOT routine performs the dot product between two N -
element vectors. A high throughput implementation can be
implemented using a binary tree structure: the leaves represent
input data, the first level consists of multiplications, and the
remaining levels are additions. This results in an application
work and depth of AW = 2N − 1 and AD = log2(N)LA +
LM , where LA and LM are the latencies for addition and
multiplication, respectively. The HLS implementation code is
shown in Fig. 5, where x and y are the two vectors received
from two channels, and W is the vectorization width.
If synthesized for Intel Arria 10 or Stratix 10 FPGAs, the
two loops will have I = 1, due to native support for single
precision floating point accumulation in these architectures.
Fig. 5 (bottom part) shows the computation performed by the
body of the outer loop (Line 6-10) with W = 4. Edges are data
dependencies between operations. In this case, both circuit
work and depth depend on the vectorization width, CW = 2W
and CD = log2(W )LA + LM . This results in a computation
times of C = log2(W )LA+LM +N/W cycles. In this case,
if we increase the vectorization width, we will linearly reduce
the number of loop iterations, but only increase the depth CD
logarithmically.
To show how the circuit work and depth capture the
characteristics of the circuit, we synthesize the DOT and SCAL
1 void dot(int N, chan ch_x, chan ch_y, chan ch_res) {
2 float res = 0;
3 for (int it = 0; it < N / W; it++) {
4 float acc = 0;
5 #pragma unroll
6 for (int i = 0; i < W; i++) {
7 x[i] = pop(ch_x); y[i] = pop(ch_y);
8 acc += x[i] ∗ y[i];
9 }
10 res += acc;
11 }
12 push(ch_res, res);
13 }
*
x[i] y[i]
*
x[i+1] y[i+1]
*
x[i+2] y[i+2]
*
x[i+3] y[i+3]
+ +
+ DEPTH
WIDTH
+ acc
Fig. 5: (Top) DOT implementation: data is read from channels ch_y
and ch_y, result is sent to channel ch_res. (Bottom) Circuit
work/depth analysis of DOT inner loop with W = 4.
routines discussed above. Table I reports empirical compu-
tational resource consumption and circuit latency obtained
by varying the vectorization width, considering modules that
operate in single precision. These figures are obtained from
the Intel FPGA Offline Compiler (v19.1) targeting an Intel
Stratix 10 GX 2800 FPGA. The DSPs of this FPGA are
able to start one addition and one multiplication per clock
cycle, and the latency for both addition and multiplication is
6 clock cycles. For SCAL, computational resources (i.e. look-
up tables, LUTs, registers, FFs, and hardened floating point
units, DSPs) linearly increase with respect to the vectorization
width, while latency remains constant. In particular we have
that LUT = 49CW , FF = 96CW , and DSP = CW . For DOT,
resource consumption also grows proportionally with respect
to the width, and the latency increases linearly when the
vectorization width is doubled. The compiler introduced some
optimization in laying down the circuit design, but we can still
see a linear relation between circuit work and computational
resources, resulting in LUT ' 18CW , FF ' 40CW , and
DSP = CW /2. Even though the specific linear factors and
constant terms are tool and device-specific (and so they are
not incorporated in the model), this shows how the work and
depth analysis can be used to qualitatively correlate circuit
characteristics and resources.
SCAL DOT
W LUTs FFs DSPs Lat LUTs FFs DSPs Lat
2 98 192 2 50 174 192 2 82
4 196 384 4 50 242 320 4 85
8 392 768 8 50 378 640 8 89
16 784 1,536 16 50 650 1,280 16 93
32 1,568 3,072 32 50 1,194 2,560 32 97
64 3,136 6,144 64 50 2,474 5,120 64 105
TABLE I: Resource consumption and latency.
B. Optimal circuit dimensioning
From the work and depth model we know that to increase
the performance of the hardware module, the user can increase
the vectorization width until the desired performance require-
ments are met (e.g., complete the computation within a time
budget) and/or has enough computational resources. In the
extreme case, modules can be fully unrolled (see Sect. III-A),
so C becomes equal to the circuit depth.
In practice, multiple hardware modules are combined to
implement a given computation, and/or must receive data from
off-chip data sources such as DRAM. Therefore, the user can
be interested in the optimal vectorization width: i.e., how to
properly dimension the circuit so that the module is not a
bottleneck, while not overprovisioning resources.
A hardware module is able to accept a certain number of
data elements into the pipeline every clock cycle. This number
depends on the inputs of the operation, and is proportional to
the vectorization width. For example, SCAL takes in input W
operands per clock cycle, while DOT takes 2W operands per
clock cycle. If more data arrives to the pipeline per cycle than
what the pipeline can accept, the module is a bottleneck in the
global computation, causing upstream operations to slow down
due to backpressure. Conversely, if the arrival rate is lower
than the module service rate, the module is underutilized,
resulting in resources being underused and thus wasted.
When the arrival rate is known, the user can take it into
account to compute the optimal vectorization width. For
example, DOT requires 2W operands to be input per clock
cycle. If the module is fed with data arriving at B (bytes/sec),
the design runs at a frequency F (Hz) and operands have size
S (bytes), then we can derive the optimal vectorization width
as W = dB/(2SF )e. In this way, the module is dimensioned
so that it uses the minimal required number of resources to
sustain the arrival data rate.
Modules that implement Level 2 and Level 3 routines
use tiling to reduce the communication volume. Tiling is
expressed by outer loops of the hardware module. Using
tiling, we can lower the memory bandwidth requirements and
therefore it must be taken into consideration when the circuit is
dimensioned. Consider the case of a non-tiled implementation
of GEMV, shown in Listing 1. In this case, the circuit serves
1 void gemv(int N, int M, float alpha, float beta,
2 chan ch_x, chan ch_y, chan ch_A, chan ch_out){
3 for (int i = 0; i < N; i++) {
4 float y = beta ∗ pop(ch_y);
5 float acc = 0;
6 for(int j=0; j < M/W; j++){
7 #pragma unroll
8 for (int w = 0; w < W; w++)
9 acc += pop(ch_A) ∗ pop(ch_x);
10 y += alpha ∗ acc;
11 }
12 push (ch_out, y);
13 }
Listing 1: Non-tiled implementation of GEMV. ch_x and ch_y
are used for vectors, ch_A for the input matrix.
W elements each clock cycle coming from A and W elements
from x. Assuming that matrix and vector data arrives with the
same rate, the optimal vectorization width can be computed
with the same formula used for DOT.
By applying tiling, such requirements can change. For
example, considering a tiled implementation of GEMV that
works on A received in tiles by row (Sec. III-B), new elements
for x are required every TNTM/W clock cycles (i.e. every
time that we start computing on a new tile of the matrix
A). Therefore, on average, the circuit needs W elements
from A and W/TNTM elements from x per clock cycle.
The optimal vectorization width can now be computed as:
W = d BTNTMFS(1+TNTM )e. For large value of tile sizes, this can
be approximated as W = d BFS e: double the non-vectorized
implementation. This shows how tiling reduces the memory
bandwidth requirements of a module and enables a higher
performance implementation.
V. STREAMING COMPOSITION
Numerical computations may involve two or more hardware
modules that share or reuse data. The input required for
one module may be produced from another one, or two
modules may accept the same input data. When the order in
which such data is produced and consumed is the same, the
streaming interface introduced in Sec. II-A enables modules
to communicate through on-chip memory, rather than through
off-chip DRAM. This has two key advantages: 1) it reduces
costly off-chip memory accesses, as data is streamed directly
by the producer module to the consumer module, rather than
storing and loading it to DRAM; and 2) it allows pipeline
parallel execution of different modules that are configured si-
multaneously on the FPGA. Avoiding off-chip communication
is key for I/O bound computations, such as BLAS Level 1
and Level 2 routines. In this section we analyze the benefit of
streaming linear algebra using FBLAS routines.
We model a computation as a module directed acyclic graph
(MDAG), in which vertices are hardware modules, and edges
represent data streamed between modules. Source and sink
vertices (represented as circles in the following figures) are
interface modules, that are responsible for off-chip memory
accesses. Other nodes (rectangles) are computational modules,
e.g., FBLAS routines. Edges are implemented with FIFO
buffers of a finite size. The number of elements consumed and
produced at the inputs and outputs of a node is defined by the
FBLAS routine configuration (e.g., GEMV in Sec. III-B). Stalls
occur when a module is blocked because its output channel is
full or an input channel is empty. We consider an MDAG to
be valid if it expresses a composition that will terminate, i.e.,
it does not stall forever. Additionally, an edge in the MDAG
between module A and B is valid if:
1) the number of elements produced is identical to the number
of elements consumed; and
2) the order in which elements are consumed corresponds to
order in which they are produced.
For example, if B is the GEMV module discussed in Sec. III-B,
and A produces the input vector x, we can compose them only
if B operates on a matrix received in tiles by columns, as it
will otherwise need to replay the reading of x, thus violating
(1) above. If present, tiling schemes must be compatible, i.e.,
tiles must have the same size and must be streamed in the
same way between consecutive modules.
In the following, we will evaluate different module com-
positions patterns that can be found in real computations.
A full general case analysis of MDAGs, that could help
user in deriving valid FBLAS compositions is left as future
work. Here, we target the updated set of BLAS subprograms
introduced by Blackford et al. [15]. These routines can be
implemented by using two or more BLAS calls, and are
utilized in various numerical applications. We will study the
feasibility and benefits of a streaming implementation com-
pared to executing the composed BLAS-functions sequentially
based on these examples. We distinguish between two cases:
1) the MDAG is a multitree: that is, there is at most one path
between any pair of vertices, and 2) all other MDAGs.
A. Composition of multitrees
Generally, a multi-tree module composition, with valid
edges, is always valid. The simplest streaming FBLAS com-
position is a linear sequence of compute modules, where each
module receives data from one or more interface modules,
and (at most) one other computational module. Consider,
for example, AXPYDOT, which computes z = w − αv and
β = zTu, where w, v, and u are vectors of length N . To
implement this computation with BLAS, we need a COPY (to
not overwrite input vector y), an AXPY, and a DOT routine.
The number of memory I/O operations (reads/write from
memory) necessary to compute the result is then equal to
2N + 3N + 2N = 7N . We can exploit module composition
by chaining the AXPY and the DOT modules: the output of
AXPY (z), will be directly streamed to the DOT module (see
Fig. 6). This also allows omitting the first copy of w. The
axpy dot
z
v
w u
𝜷
Fig. 6: AXPYDOT streaming implementation.
number of I/O operations is then equal to 3N+1, the minimal
number required to perform the computation. In addition, the
AXPY and DOT modules are executed in parallel, reducing the
number of cycles to completion from
Csequential = (Lcopy +N) + (Ldot +N) + (Laxpy +N)
to just Lcopy + Laxpy + Ldot +N , under the assumption that
the optimal vectorization width is used so that all memory
interfaces are fully saturated during execution (N is adjusted
for the vectorization width W ). If N is sufficiently large,
the computation time is reduced from 3N to just N . Such
a composition will always be valid, assuming all edges are
independently valid.
In many cases, the output of a computational or interface
module is shared between two (or more) computational mod-
ules. Consider the BICG computation, used in the biconjugate
gradient stabilized method. Given matrix A of size N ×M ,
BICG computes q = Ap and s = AT r, where p and s
are vectors of size M , and q and r are vectors of size
N . The computation is implemented with two independent
GEMV routines, that can be executed in parallel. Both routines
read A, but with different access patterns. Using a streaming
composition we can read A only once (see Fig. 7), assuming
that the two GEMV modules accept data streamed in the same
way. Although one GEMV expects AT , we can achieve the
same access pattern by setting their schedule accordingly
through tiling patterns. The two modules compute in parallel
and the results are sent to interface modules that write them in
memory. In this case, we reduce the number of I/O operations
related to the matrix A from 2NM to NM , but do not affect
the number of cycles to completion NM .
gemv gemvT
Ap
q s
r
Fig. 7: BICG streaming implementation
B. Composition of non-multitrees
If the MDAG is not a multitree (i.e., there is more than
one path between two nodes in the graph), invalid graphs can
occur. Consider the case of ATAX, that computes y = ATAx,
where A is an M×N matrix, x and y are vectors of size N . A
streaming implementation similar to the previous example is
shown in Fig. 8. In this case, the two computational modules
gemv gemvT
A
x y
Fig. 8: ATAX: invalid streaming implementation.
share one interface module and the first GEMV module streams
its results to the second one. Given that replaying data is
not allowed between two computational modules (condition
1. of being a valid edge), the right GEMV, must receive A
in tiles by rows. However, the left GEMV produces a block
of results only after it receives an entire row of tiles of A,
i.e., NTN elements. Therefore, the composition would stall
forever, unless the channel between the A interface module
and the second GEMV has a size ≥NTN . If N is not fixed
and known a priori, the composition is invalid. In general,
a similar situation occurs in all the cases in which, given
two vertices of the MDAG vi and vj , there are at least
two vertex-disjoint paths (except vi and vj) from vi to vj .
Invalid stream compositions can be handled by the user by
either a) setting the channel size appropriately (according
to the size of input data) or b) breaking the MDAG into
multiple valid components, communicating through DRAM in
between. In the latter case, this leads to a higher number of I/O
operations with respect to the fully streamed implementation,
less or equal to the one of the non-streamed version of the
program. For ATAX, we could let the two GEMV receive the
matrix elements independently. In this way, we have the same
number of I/O operations of the non-streamed version, but the
completion time can still benefit from a reduction given the
pipelined execution of the two matrix-vector multiplications.
C. Complex compositions
In complex computations, we can compose modules in
different ways. Choosing the most suitable implementation
is critical for both validity and performance. For example,
GEMVER computes B = A + u1vT1 + u2v
T
2 , x = βB
T y + z,
and w = αBx, where α and β are two scalars, A and
B are N × N matrices, and u1, u2, v1, v2, x, y, z, and w
are vectors of length N . With classic BLAS, this requires
calling two GER, two GEMV and two copies. In a streaming
implementation, the computation of B can be realized using
a linear sequence of two GER calls. Then B is used for
the computation of x and w. This leads to a non-multitree
composition similar to the one of ATAX, which we know to
be an invalid configuration for dynamic values of N . Although
this prevents a full streaming implementation, we can resort to
multiple sequential multitree streaming composition (Fig. 9).
The first component ( 1 ) streams between the two GER calls
ger ger gemvTA
u1 v1 u2 v2 B y z
x
1
gemv
B x
w
2
Fig. 9: GEMVER: a possible streaming implementation.
and one GEMV call, producing B and x and storing them in
DRAM. After this component has terminated, B and x are
present in DRAM, and the final GEMV can be executed ( 2 ).
For this composition, the number of I/O operations is reduced
from 8N2 + 10N ≈ 8N2 to 3N2 + 9N ≈ 3N2, and number
of cycles to completion is reduced from 5N2 + N to 2N2;
a significant improvement, despite resorting to sequentializing
the two components.
VI. EVALUATION
FBLAS currently targets Intel devices and offers all level-1
routines (ROTG, ROTMG, ROT, ROTM, SWAP, SCAL, COPY,
AXPY, DOT, SDSOT, NRM2, ASUM, IAMAX), and all generic
level-2/-3 routines (GEMV, TRSV, GER, SYR, SYR2, GEMM,
SYRK, SYR2K, and TRSM), for a total of 22 routines, in single
and double precision. Specialized matrix routines (triangular
and symmetric matrices) must currently be implemented in
terms of the generic routines. While vectorization widths and
tile sizes can be tuned according to user requirements and
available on-chip resources (see Sec. IV), FBLAS routines
accept input data of arbitrary size, assuming the inputs fit in
the off-chip DRAM of the device. As future work, we intend
to extend FBLAS to cover Xilinx FPGAs. Maintaining the
same module interface, we can reuse the core functionality of
the host code and code generation, but we need to adapt to
the SDAccel kernel model.
To evaluate FBLAS, we show the scaling of a representative
set of single HLS modules, and the behavior and benefits of
streaming composition. We also include comparison to a state-
of-the-art BLAS implementation on CPU.
A. Experimental Setup
We performed experiments on two Bittware boards equipped
with two different FPGAs, described in Tab. II. The device
resources can be divided in: Adaptive Logic Modules (ALM),
flip-flops (FFs), memory blocks (M20K), and Digital Signal
Processing units (DSPs). Part of these resources are reserved
by the Board Support Package (approximately 25% for the
Stratix 10), therefore for both devices we indicate also the
resources that are effectively available for a design. In both
cases, the card is connected to the host machine via a PCIe
bus in x8 mode. The host has a 10 cores Intel Xeon E5-2630
v4, operating at 2.2GHz (no Hyper-Threading), and 64 GB
of 4-channel DDR4 memory. The peak bandwidth of a single
DDR module is 17 GB/s.
FPGA ALM FF M20K DSP DRAM
Arria 10 GX 1150 Total 427 K 1.7 M 2.7 K 1518 2×8GBAvail. 392 K 1.5 M 2.4 K 1518
Stratix 10 GX 2800 Total 933 K 3.7 M 11.7 K 5760 4×8GBAvail. 692 K 2.8 M 8.9 K 4468
TABLE II: FPGA boards used for evaluation.
For synthesizing FPGA kernels, we use the Intel FPGA
SDK for OpenCL v19.1. In the Stratix FPGA, automatic
memory interleaving is disabled (per advice of the vendor),
data must be manually allocated to one of the DDR banks.
The peak bandwidth of a single bank is 19.2GB/s. All designs
are compiled with the -no-interleaving=default,
-fp-relaxed, and -fpc flags. CPU code is compiled using
gcc 7.2, with optimization flag -O3, and we use Intel MKL
2019 (update 2). For measuring power on the FPGA, we use
the aocl utility provided by Intel, that reports the power
drain of the entire board (not only the FPGA chip). For CPU
measurements, we use Mammut [16], and consider the power
consumed by the processor and by the DRAM only.
B. Individual Module Evaluation
In this section, we evaluate the impact of vectorization and
tiling on the performance of individual FBLAS modules. To
capture different computational and communication complex-
ities, we show modules that implement the DOT, GEMV and
GEMM routines, as representative samples of BLAS Level 1,
2, and 3, respectively. Input data is generated directly on the
FPGA, to test the scaling behavior of the memory bound
applications DOT and GEMV, considering vectorization width
that can exploit memory interfaces faster than the one offered
by the testbed (e.g., HBM).
Performance is reported in floating point operations per
second (Ops/s) based on the averaged execution time. In
all cases the 99% confidence interval is within 5% of the
measured mean. Expected performance is computed by taking
the number of used DSPs and multiplying by the frequency of
the synthesized designed. This assumes that all DSPs are fully
utilized, and therefore can be used to evaluate the efficiency of
 1
 10
 100
16 32 64 128 256
G
O
ps
/s
Vectorization width
Arria - SDOT
Arria - DDOT
Stratix - SDOT
Stratix - DDOT
 1
 10
 100
16 32 64 128 256
G
O
ps
/s
Vectorization width
Arria - SGEMV
Arria - DGEMV
Stratix - SGEMV
Stratix - DGEMV
 0
 300
 600
 900
 1200
 1500
 1800
3 6 9 12
G
O
ps
/s
Compute/Memory Tile Ratio
Arria - SGEMM 32x32
Arria - DGEMM 16x8
Stratix - SGEMM 40x80
Stratix - DGEMM 16x16
Fig. 10: Performance of modules implementing DOT, GEMV, and GEMM. Horizontal bars indicate expected performance.
the implemented routines (e.g., loops are correctly pipelined
with II equal to 1, on-chip memory is accessed efficiently).
Fig. 10 (left) shows the evaluation for the DOT modules
that operate on single and double precision. The vectorization
width spans from 16 to 256, and the input data size is fixed
at 100M elements. For double precision routine, the compiler
is able to place and route designs with a maximum width
of 128, due to an higher number of resources required to
implement 64 bits operations. For both testbeds, synthesized
designs are able to achieve the expected performance, implying
that the instantiated compute is running at full throughput.
The evaluation for GEMV is shown in Fig. 10 (middle), for
square tiles of size 1024×1024. The running frequencies differ
slightly between designs with the same vectorization width,
but different precision. Also in this case, the resulting designs
achieved the expected performance. Finally, Fig. 10 (right)
shows the results obtained for GEMM, realized with a systolic
implementation (see Sec. III-C). Due to the different number
of available resources, in the evaluation, we used a systolic
array of size 32 × 32 (single precision) and 16 × 8 (double
precision) for the Arria testbed, and a width of 40×80 (single
precision) and 16×16 (double precision) for the Stratix FPGA.
These are the highest values for which the compiler is able
to generate the design without failing placement or routing.
Fixed the compute tile size (i.e., the size of the systolic
array), for each design, we vary the memory tile sizes, to
evaluate various compute/memory tile size ratios, and their
impact on the computation. For each combination, squared
matrices have been used as input data, with size equal to 5
times the memory tile size. Smaller systolic arrays achieved
expected performance already with small tile sizes. With larger
designs, by increasing the tiles size we were able to approach
the expected performance given by the number of processing
elements instantiated, obtaining a peak performance of 1.28
TFlop/s single precision on the Stratix FPGA. Table III shows
the used resources, frequency and power consumed of the syn-
thesized designs for modules with highest performance, that
is width 256 (128) for SDOT (DDOT) and SGEMV (DGEMV),
biggest tiles for GEMM (the systolic arrays have different size
for the two platforms)
In this section, we evaluated the modules in isolation, and
DSP usage directly correlates with the achieved performance
(on both testbeds, a DSP can initiate a multiply-and-add on
each clock cycle). For Stratix 10, DOT and GEMV modules
can take advantage of HyperFlex optimization, a register tech-
nology introduced by Intel to increase design frequency [17].
Compared to the Arria testbed (for which such technology is
not available), Stratix designs can achieve higher frequency
but also a higher logic and BRAM utilization. Depending on
the user needs (e.g., limit the used resources), the HyperFlex
optimization can be disabled at the synthesis stage, resulting in
lower frequency and resource utilization. With the used version
of the compiler, HyperFlex is not enabled for all routines due
to some striped memory accesses inferred as being unaligned.
Successive versions of the tool2 should improve this aspect,
allowing HyperFlex in all the cases.
Both evaluation testbeds do not have hardened double
precision units. Therefore, modules that operate on double
precision use more DSPs (4 per operation), as well as more
logic (one order of magnitude higher) to implement arithmetic
operations while guaranteeing a loop initiation interval of one.
A
R
R
IA
ALMs [K] FFs [K] M20Ks DSPs F P
SDOT 9.756 (2.5%) 15.62 (1%) 1 (0%) 331 (21.8%) 150 47.3
DDOT 121.4 (31%) 208.3 (13.3%) 3 (0.1%) 512 (33.7%) 150 47.9
SGEMV 21.56 (5.5%) 40 (2.6%) 210 (8.6%) 284 (18.7%) 145 48.1
DGEMV 135.9 (34.7%) 286.7 (18.3%) 216 (9.9%) 520 (34.3 %) 132 48.6
SGEMM 102.4 (26.1%) 263.6 (16.8%) 1970 (81.1%) 1086 (71.5%) 197 52.1
DGEMM 135.8 (34.7%) 280 (17.9%) 658 (27 %) 622 (41%) 222 49.1
ST
R
A
T
IX
SDOT 123.1 (17.8%) 386.3 (13.9%) 1028 (11.4%) 328 (7.34%) 358H 68.7
DDOT 235.1 (33.9%) 682.7 (24.6%) 773 (8.6%) 512 (11.5%) 366H 68.8
SGEMV 123.4 (17.8%) 352.6 (12.7%) 1246 (13.9%) 274 (6.1%) 347H 68
DGEMV 275.7 (39.8%) 831.9 (30%) 999 (11.2%) 520 (11.6%) 347H 69.7
SGEMM 328.5 (47.4%) 1031 (37.2%) 7767 (86%) 3270 (73.2%) 216 70.5
DGEMM 450.9 (65%) 1054 (38 %) 2077 (23.2%) 1166 (26.1%) 260 67.5
TABLE III: Resource consumption for the different modules. The
percentages of used resources are in brackets. Frequency (F) is
reported in MHz, Power (P) in Watts. H indicates designs synthesized
with HyperFlex optimization enabled.
C. Streaming Composition Evaluation
We used the applications discussed in Sec. V to evaluate
the performance gains achieved by module composition. The
streaming compositions are compared to calling the modules
one-by-one via the host layer. Due to BSP limitation, designs
have no automatic memory interleaving. For all the used
modules we fixed the vectorization width to 16, sufficient to
saturate the memory bandwidth of the single DDR module,
and, when relevant, tiles of size 1024× 1024. Fig. 11 reports
the speedups with different input data sizes on the Stratix
FPGA, computed as the ratio between the execution time of
the host layer version over the execution time of the streaming
composition. Similar results hold for the Arria testbed.
According to the analysis done in Sec. V-A, for AXPYDOT
we expected a speedup of 3. However, given the limitations of
the BSP, the vector z used by the AXPY routine is read/written
in the same memory module. This results in a slow-down,
2Preliminary experiments have shown this is solved with Quartus v19.3.
 0
 1
 2
 3
 4
 5
2M 4M 8M 16M
Sp
ee
du
p
Input data size
AXPYDOT
 0
 0.5
 1
 1.5
 2
1Kx
1K
2Kx
2K
4Kx
4K
8Kx
8K
Input data size
BICG
 0
 1
 2
 3
 4
1Kx
1K
2Kx
2K
4Kx
4K
8Kx
8K
Input data size
GEMVER
Fig. 11: Streaming composition: speedup over individual kernels.
that does not affect the streaming version. This increases the
achieved speedup to 4. For BICG, the expected gain is due to
the reduced number of memory accesses that are performed
by the streaming version. Considering the frequency of the
synthesized design (260MHz), the interface module of the
streaming version is able to saturate the 87% of the memory
bandwidth of the module. This gives an expected speedup
of 1.7. The measured speedup is at most 1.45. Speedups for
GEMVER confirm the analysis performed in Sec. V-B.
Overall, these experiments validate our performance anal-
ysis and highlight the performance benefits of pipelining
computational modules exploiting on-chip FIFO buffers, in
particular in cases where the chained kernels can execute
completely in parallel, yielding maximum pipeline parallelism.
Additionally, thanks to the reduction of interface modules,
module composition uses a lower amount of resources (up
to −40%) with respect to the non-streamed implementation.
D. Comparison with CPU
In this section we compare the performance of the Host CPU
and of the Stratix FPGA, using the single precision format. In
CPU tests, we considered the best parallel execution time.
Tab. IV reports the execution time for the individual routines
working with single and double precision. For single precision,
FPGA designs have been compiled using a vectorization width
of 32 for DOT, width 64 and squared tile of size 2048 for
GEMV, and a systolic array of 40 × 80, squared memory tile
of size 960 for GEMM. For double precision, the vectorization
width is set to 16 (DOT) and 32 (GEMV, tile size 2048), while
the systolic array has size 16× 16 (GEMM, tile size 384). Data
is interleaved across the different DDR modules.
CPU FPGA
Rout. P N Time [usec] P [W] Time [usec] F [MHz] P [W]
DOT
S 16M 2,050 76.5 1,866 370H 59.1S 256M 35,131 77.8 28,272
D 16M 4,079 77.4 3,627 370H 60.3D 128M 35,124 77.6 28,250
GEMV
S 8Kx8K 5,402 77.8 4,091 366H 61.4S 64Kx64K 323,795 81 241,038
D 8Kx8K 9,810 83.8 7,831 354H 63.2D 32Kx32K 163,510 83.05 120,357
GEMM
S 8Kx8K 1.56 (sec) 79.9 1.01 (sec) 192.5 70.5S 48Kx48K 300.7 (sec) 81.6 181 (sec)
D 8Kx8K 3.14 (sec) 85 8.43 (sec) 260 67.5D 24Kx24K 75.78 (sec) 88.4 203 (sec)
TABLE IV: Comparison to CPU for single routines using single
(P=S) and double (P=D) precision. H indicates designs synthesized
with HyperFlex optimization enabled.
Considering memory bound routines (DOT and GEMV),
FBLAS execution time is up to 25% lower compared to the
CPU, in both single and double precision, despite the Stratix
testbed has a memory bandwidth only 13% higher than the one
offered by the host. For GEMM, FBLAS outperforms the CPU
in single precision, but it is penalized by the lack of hardened
double precision units. This prevents the use of larger systolic
arrays when computing with 64 bits floating point numbers,
due to the high demand of resources required to implement
arithmetic operations. Thanks to its interface, FBLAS can be
easily extended, favoring the explorations of different variants
of the same routine. Therefore, optimized version of particular
routines can be easily added to the codebase.
FBLAS modules can be fully unrolled to realize circuits
similar to tensor units (Sect. III-A). Given the high resource
and bandwidth requirements implied, such solutions could
be adopted for small sized data. This is common in various
computations, where problems are decomposed into batches
containing thousands of computation working on smaller input
data (e.g., neural networks, factorization algorithms). Table V
reports the execution time of fully unrolled GEMM and TRSM
of size 4 (enough to saturate DRAM bandwidth), comparing
them against the batched version of the same routine offered
by MKL, for different number of invocations (N ). The results
show how this kind of computation could be a good fit for
FPGAs, provided that enough memory bandwidth is available.
Finally, Table VI reports the execution time for the stream-
ing applications, with different input sizes. For the FPGA,
single precision designs have vectorization width 32 and tiles
size 2048 × 2048, with BICG as the only exception, being
compiled with a width of 64, to allows it to exploit the memory
bandwidth of the 4 DDR modules. For double precision
designs, the vectorization width is half the value of the single
precision cases. Being memory intensive computations, thanks
to streaming composition FBLAS can obtain execution times
lower or comparable to the CPU version, both in single and
double precision. Regarding power consumption, FPGA board
uses up to ∼30% less power for the measured workloads with
respect to the considered CPU (we note that the reported power
drain for FPGA consider the full board).
In general, we believe that the most viable and interesting
use-case for linear algebra on FPGA in an HPC context
are applications in which the device is the main execution
CPU FPGA
Rout. P N Time [usec] P [W] Time [usec] F [MHz] P [W]
GEMM
S 8K 128.2 62.2 144.7 297.5H 58.6S 32K 457.4 62 275.3
D 8K 108.3 59.2 187.52 297.5H 60.1D 32K 404.9 61.4 461
TRSM
S 8K 248.4 59.8 144 335H 58.8S 32K 749.9 61.8 341.6
D 8K 248.4 62 184.1 350H 63.1D 32K 731.6 60.8 589.2
TABLE V: Comparison to batched CPU routines, single (P=S)
and double (P=D) precision. H indicates designs synthesized with
HyperFlex optimization enabled.
CPU FPGA
Appl. P N Time [usec] P [W] Time [usec] F [MHz] P [W]
AXPYDOT
S 4M 1,376 81.9 1,101 370H 59.1S 16M 8,556 81.7 3,783
D 4M 4,295 77.2 2,023 370H 60.6D 16M 17,130 82.4 7,297
BICG
S 2Kx2K 218 69 550 220 59.9S 8Kx8K 5,796 71 5,879
D 2Kx2K 467.8 67.4 795.7 238 62.8D 8Kx8K 11,724 70.5 9,939
GEMVER
S 2Kx2K 895 76.8 2,407 236 61.5S 8Kx8K 43,291 78.5 37,094
D 2Kx2K 4,728 80.8 4,425 275 64.4D 8Kx8K 88,160 78.2 64,115
TABLE VI: Comparison to CPU for composed kernels, single (P=S)
and double (P=D) precision. H indicates designs synthesized with
HyperFlex optimization enabled.
platform, so vectors and matrices mainly reside in FPGA
memory. Due to the lack of proper hardened units, compute-
intensive routines that operate with double precision numbers
do not benefit from execution on the FPGA. We thus rec-
ommend to use FBLAS in scenarios that are a good fit to
the FPGA architecture: namely deep pipelines with a high
degree of spatial parallelism, using FBLAS routines and/or
compositions as stages of the pipeline.
VII. RELATED WORK
Usually, hardware description languages have been the
preferred choice for implementing dense numerical routines
for reconfigurable hardware [18], [19], [20]. Moss et al. [8]
present a matrix multiplication SystemVerilog template, able
to achieve close to peak performance for GEMM on the target
FPGA (800 GOPs/s on an Arria 10 1150). More recently, there
has been a wider adoption of HLS tools for implementing
linear algebra. de Fine Licht et al. [21] present a matrix
multiplication accelerator based on a performance model that
optimizes for maximum performance and minimum off-chip
communication, achieving 409 GOP/s for single precision on
a Xilinx VCU1525 UltraScale+ FPGA. In [22], Gorlani et al.
implemented Cannon’s matrix multiplication algorithm for
squared matrices, reaching 1.32 TOP/s on a Stratix 10. Our
proposed implementation performs similarly, and supports full
GEMM for arbitrary sized matrices. In general, all the afore-
mentioned works address one or a few numerical routines,
and does not treat composition between routines. In contrast,
FBLAS is extendable, offers the full set of BLAS routines,
and exposes native streaming composition to the user.
Systolic arrays are often used for the implementation of
high throughput dense linear algebra on FPGA [8], [21], [23]:
Processing Elements (PEs) are organized in a 1D/2D grid,
and each PE locally computes a partition of the final result.
They communicate with each other only by passing along
input (feeding) and output (draining) data. Other application
domains use a different systolic architecture. The Smith-
Waterman genomics algorithm for sequence alignment can
be implemented using systolic arrays on FPGA [24], [25],
[26]. In these implementations, a linear sequence of PEs is
used to exploit the wavefront parallelism of the algorithm, by
letting them work in parallel for computing the anti-diagonal
of the alignment matrix. In these cases, the result of the local
computation performed by a PE on each clock cycle can
contribute to the inputs of other PEs in the following clock
cycle. This is different from the dense linear algebra systolic
arrays, as PEs exchange not only feeding/draining data, but
also collaborate on intermediate results.
Some previous work focuses on design space exploration
(DSE), where HLS programmers are assisted to find good
combinations of pipelining, vectorization, initiation interval
and memory usage, to achieve a given resource/performance
goal. DSE tools are based on analytical models coupled with
static analysis [27], estimated provided by HLS tools [28],
or machine learning methods [29]. Usually these tools require
code instrumentation and output hints to drive the programmer
optimizations. Similar to this work, Zhuo and Prasanna [18]
analyze the design trade-off between used resources and
performances. Gautier et al. propose Spector [30], a bench-
mark suite targeted at design space exploration for HLS,
that includes matrix-matrix multiplication. They show how
designs can be affected not only by individual parameters
such as unrolling factor and blocking, but also by complex
interactions between these parameters that are difficult to
model mathematically. In FBLAS, HLS modules can be tuned
by acting on only two aspects: vectorization and tiling. To
guide this, we propose models to analyze the space/time trade-
offs that arise when selecting vectorization widths and tile size.
Works in different application domains have benefited from
pipeline parallelism exploiting on-chip resources for streaming
between modules [31], [32], [33]. In FBLAS, all routines
communicate via streaming interfaces, enabling the benefits
of composing HLS modules.
VIII. CONCLUSION
In this paper we presented FBLAS, the first publicly avail-
able BLAS implementation for FPGA, implemented purely in
HLS. We propose a method of spatial library design, where
configurable components with flexible interfaces are offered to
the user, which can be plugged into existing designs to exploit
streaming computations. In FBLAS, modules are designed
such that resource usage and performance is tunable according
to a model of their space/time trade-off, allowing them to be
specialized to the user’s application, and expose streaming
interfaces, enabling pipelined composition favoring on-chip
data movement. While in this paper we focused on dense linear
algebra computations, we believe that these aspects can be key
considerations in the design of other spatial hardware libraries,
and execution units for Domain Specific Architectures.
Future research directions could involve a full general case
analysis of MDAGs, deriving valid FBLAS compositions for
a dense linear algebra program, as well as the study of more
complex numerical computations such as factorization algo-
rithms, that could re-use FBLAS routines as building blocks.
By releasing the code as open source, we hope to involve the
community in the continued development of FBLAS, targeting
both current and future OpenCL-compatible devices.
ACKNOWLEDGMENTS
This work has been supported from the European Re-
search Council (ERC) under the European Union’s Horizon
2020 programme, Grant Agreement No. 678880 (DAPP), and
Grant Agreement No. 801039 (EPiGRAM-HS). We gratefully
acknowledge support from ETH Zurich. We would like to
thank the Swiss National Supercomputing Center (CSCS) for
providing the computing resources and for their excellent
technical support.
REFERENCES
[1] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and
D. Burger, “Dark silicon and the end of multicore scaling,” IEEE Micro,
vol. 32, no. 3, pp. 122–134, May 2012.
[2] M. M. Waldrop, “The chips are down for moore’s law,” Nature News,
vol. 530, no. 7589, p. 144, 2016.
[3] M. Horowitz, “1.1 computing’s energy problem (and what we can do
about it),” in 2014 IEEE International Solid-State Circuits Conference
Digest of Technical Papers (ISSCC), Feb 2014, pp. 10–14.
[4] J. L. Hennessy and D. A. Patterson, “A new golden age for computer
architecture,” Commun. ACM, vol. 62, no. 2, p. 48âA˘S¸60, Jan. 2019.
[Online]. Available: https://doi.org/10.1145/3282307
[5] N. P. Jouppi, C. Young, N. Patil, and D. Patterson, “A domain-
specific architecture for deep neural networks,” Commun. ACM,
vol. 61, no. 9, p. 50âA˘S¸59, Aug. 2018. [Online]. Available:
https://doi.org/10.1145/3154484
[6] J. Choquette, O. Giroux, and D. Foley, “Volta: Performance and pro-
grammability,” IEEE Micro, vol. 38, no. 2, pp. 42–52, 2018.
[7] E. Nurvitadhi, G. Venkatesh, J. Sim, D. Marr, R. Huang, J. Ong
Gee Hock, Y. T. Liew, K. Srivatsan, D. Moss, S. Subhaschandra, and
G. Boudoukh, “Can fpgas beat gpus in accelerating next-generation
deep neural networks?” in Proceedings of the 2017 ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays, ser.
FPGA ’17. New York, NY, USA: ACM, 2017, pp. 5–14. [Online].
Available: http://doi.acm.org/10.1145/3020078.3021740
[8] D. J. Moss, S. Krishnan, E. Nurvitadhi, P. Ratuszniak, C. Johnson,
J. Sim, A. Mishra, D. Marr, S. Subhaschandra, and P. H.
Leong, “A customizable matrix multiplication framework for the
intel harpv2 xeon+fpga platform: A deep learning case study,”
in Proceedings of the 2018 ACM/SIGDA International Symposium
on Field-Programmable Gate Arrays, ser. FPGA ’18. New York,
NY, USA: ACM, 2018, pp. 107–116. [Online]. Available: http:
//doi.acm.org/10.1145/3174243.3174258
[9] Intel, “Intel fpga sdk for opencl,” https://www.intel.com/content/www/
us/en/software/programmable/sdk-for-opencl/overview.html, 2018.
[10] Xilinx, “Vivado hls,” https://www.xilinx.com/products/design-tools/
vivado/integration/esl-design.html, 2018.
[11] J. de Fine Licht, S. Meierhans, and T. Hoefler, “Transformations of
High-Level Synthesis Codes for High-Performance Computing,” CoRR,
vol. abs/1805.08288, May 2018.
[12] H. Kung, C. Leiserson, C.-M. U. P. P. D. of COMPUTER SCIENCE.,
and C.-M. U. C. S. Department, Systolic Arrays for (VLSI), ser. CMU-
CS. Carnegie-Mellon University, Department of Computer Science,
1978.
[13] Intel, Intel FPGA SDK for OpenCL Pro Edition - Best Practices Guide,
Intel.
[14] G. E. Blelloch and B. M. Maggs, “Algorithms and theory of computation
handbook,” M. J. Atallah and M. Blanton, Eds. Chapman & Hall/CRC,
2010, ch. Parallel Algorithms, pp. 25–25.
[15] S. Blackford, J. Demmel, J. Dongarra, I. Duff, S. Hammarling, G. Henry,
M. Heroux, L. Kaufman, A. Lumsdaine, A. Petitet, R. Pozo, K. Reming-
ton, and C. Whaley, “An updated set of basic linear algebra subprograms
(BLAS),” ACM Trans. Math. Softw., vol. 28, no. 2, pp. 135–151, Jun.
2002.
[16] D. D. Sensi, M. Torquati, and M. Danelutto, “Mammut: High-level
management of system knobs and sensors,” SoftwareX, vol. 6, pp. 150
– 154, 2017. [Online]. Available: http://www.sciencedirect.com/science/
article/pii/S2352711017300225
[17] Understanding How the New Intel HyperFlex FPGA Architecture En-
ables Next-Generation High-Performance Systems. Intel Whitepaper.
Accessed Jun 2, 2020.
[18] L. Zhuo and V. K. Prasanna, “High-performance designs for linear
algebra operations on reconfigurable hardware,” IEEE Transactions on
Computers, vol. 57, no. 8, pp. 1057–1071, Aug 2008.
[19] S. Kestur, J. D. Davis, and O. Williams, “Blas comparison on
fpga, cpu and gpu,” in Proceedings of the 2010 IEEE Annual
Symposium on VLSI, ser. ISVLSI ’10. Washington, DC, USA:
IEEE Computer Society, 2010, pp. 288–293. [Online]. Available:
http://dx.doi.org/10.1109/ISVLSI.2010.84
[20] Z. Jovanovic and V. Milutinovic, “Fpga accelerator for floating-point
matrix multiplication,” IET Computers Digital Techniques, vol. 6, no. 4,
pp. 249–256, July 2012.
[21] J. de Fine Licht, G. Kwasniewski, and T. Hoefler, “Flexible Communica-
tion Avoiding Matrix Multiplication on FPGA with High-Level Synthe-
sis,” Feb. 2020, in Proceedings of the 28th ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays.
[22] P. Gorlani, T. Kenter, and C. Plessl, “Opencl implementation of can-
nonâA˘Z´s matrix multiplication algorithm on intel stratix 10 fpgas,”
in 2019 International Conference on Field-Programmable Technology
(ICFPT), 2019, pp. 99–107.
[23] N. Srivastava, H. Rong, P. Barua, G. Feng, H. Cao, Z. Zhang, D. Al-
bonesi, V. Sarkar, W. Chen, P. Petersen, G. Lowney, A. Herr, C. Hughes,
T. Mattson, and P. Dubey, “T2s-tensor: Productively generating high-
performance spatial hardware for dense tensor computations,” in 2019
IEEE 27th Annual International Symposium on Field-Programmable
Custom Computing Machines (FCCM), 2019, pp. 181–189.
[24] E. Houtgast, V. Sima, and Z. Al-Ars, “High performance streaming
smith-waterman implementation with implicit synchronization on intel
fpga using opencl,” in 2017 IEEE 17th International Conference on
Bioinformatics and Bioengineering (BIBE), 2017, pp. 492–496.
[25] L. Di Tucci, K. O’Brien, M. Blott, and M. D. Santambrogio, “Archi-
tectural optimizations for high performance and energy efficient smith-
waterman implementation on fpgas using opencl,” in Design, Automation
Test in Europe Conference Exhibition (DATE), 2017, 2017, pp. 716–721.
[26] R. Ben Abdelhamid and Y. Yamaguchi, “A block-based systolic array on
an hbm2 fpga for dna sequence alignment,” in Applied Reconfigurable
Computing. Architectures, Tools, and Applications, F. Rincón, J. Barba,
H. K. H. So, P. Diniz, and J. Caba, Eds. Cham: Springer International
Publishing, 2020, pp. 298–313.
[27] G. Zhong, A. Prakash, Y. Liang, T. Mitra, and S. Niar, “Lin-analyzer:
A high-level performance analysis tool for fpga-based accelerators,” in
2016 53nd ACM/EDAC/IEEE Design Automation Conference (DAC),
June 2016, pp. 1–6.
[28] Y. Choi, P. Zhang, P. Li, and J. Cong, “Hlscope+,: Fast and accurate
performance estimation for fpga hls,” in 2017 IEEE/ACM International
Conference on Computer-Aided Design (ICCAD), Nov 2017, pp. 691–
698.
[29] K. O’Neal, M. Liu, H. Tang, A. Kalantar, K. DeRenard, and P. Brisk,
“Hlspredict: Cross platform performance prediction for fpga high-level
synthesis,” in 2018 IEEE/ACM International Conference on Computer-
Aided Design (ICCAD), Nov 2018, pp. 1–8.
[30] Q. Gautier, A. Althoff, P. Meng, and R. Kastner, “Spector: An opencl
fpga benchmark suite,” in 2016 International Conference on Field-
Programmable Technology (FPT), Dec 2016, pp. 141–148.
[31] T. Kenter, J. Förstner, and C. Plessl, “Flexible fpga design for fdtd using
opencl,” in 2017 27th International Conference on Field Programmable
Logic and Applications (FPL), Sep. 2017, pp. 1–7.
[32] Z. Wang, J. Paul, B. He, and W. Zhang, “Multikernel data partitioning
with channel on opencl-based fpgas,” IEEE Transactions on Very Large
Scale Integration (VLSI) Systems, vol. 25, no. 6, pp. 1906–1918, June
2017.
[33] H. R. Zohouri, A. Podobas, and S. Matsuoka, “Combined spatial and
temporal blocking for high-performance stencil computation on fpgas
using opencl,” in Proceedings of the 2018 ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays, ser. FPGA ’18.
New York, NY, USA: ACM, 2018, pp. 153–162. [Online]. Available:
http://doi.acm.org/10.1145/3174243.3174248
