Flexible Communication Avoiding Matrix Multiplication on FPGA with
  High-Level Synthesis by Licht, Johannes de Fine et al.
Flexible Communication Avoiding Matrix Multiplication
on FPGA with High-Level Synthesis
Johannes de Fine Licht
ETH Zurich
definelicht@inf.ethz.ch
Grzegorz Kwasniewski
ETH Zurich
gkwasnie@inf.ethz.ch
Torsten Hoefler
ETH Zurich
htor@inf.ethz.ch
ABSTRACT
Data movement is the dominating factor affecting performance
and energy in modern computing systems. Consequently, many
algorithms have been developed to minimize the number of I/O op-
erations for common computing patterns. Matrix multiplication is
no exception, and lower bounds have been proven and implemented
both for shared and distributed memory systems. Reconfigurable
hardware platforms are a lucrative target for I/O minimizing algo-
rithms, as they offer full control of memory accesses to the program-
mer. While bounds developed in the context of fixed architectures
still apply to these platforms, the spatially distributed nature of
their computational and memory resources requires a decentralized
approach to optimize algorithms for maximum hardware utilization.
We present a model to optimize matrix multiplication for FPGA
platforms, simultaneously targeting maximum performance and
minimum off-chip data movement, within constraints set by the
hardware. We map the model to a concrete architecture using a
high-level synthesis tool, maintaining a high level of abstraction,
allowing us to support arbitrary data types, and enables maintain-
ability and portability across FPGA devices. Kernels generated from
our architecture are shown to offer competitive performance in
practice, scaling with both compute and memory resources. We
offer our design as an open source project1 to encourage the open
development of linear algebra and I/O minimizing algorithms on
reconfigurable hardware platforms.
1 INTRODUCTION
The increasing technological gap between computation and mem-
ory speeds [1] pushes both academia [2–5] and industry [6, 7] to
develop algorithms and techniques tominimize datamovement. The
first works proving I/O lower bounds for specific algorithms, e.g.,
Matrix Matrix Multiplication (MMM), date back to the 80s [8]. The
results were later extended to parallel and distributed machines [9].
Since then, many I/O minimizing algorithms were developed for lin-
ear algebra [2, 10, 11], neural networks [12], and general programs
that access arrays [13]. Minimizing I/O impacts not only perfor-
mance, but also reduces bandwidth usage in a shared system. MMM
is typically used as a component of larger applications [14, 15],
where it co-exists with other algorithms, e.g., memory bound linear
algebra operations such as matrix-vector/vector-vector operations,
which benefit from a larger share of the bandwidth, but do not
require large amounts of compute resources.
FPGAs are an excellent platform for accurately modeling per-
formance and I/O to guide algorithm implementations. In contrast
to software implementations, the replacement of cache with ex-
plicit on-chip memory, and isolation of the instantiated architecture,
1https://github.com/spcl/gemm_hls (10.5281/zenodo.3559536)
matrix A
matrix B
loads required! data reuse memory �le
(a) MMM CDAG (b) Memory �le, dominator & reuse sets c.f. Figure 2
b
t t t
t t t
t t t
t t t
p
p
p
p
p
p
p
p
p
p
p
p
c
c
c
c
c
c
c
c
c
c
c
cno store requiredof par�al products
Figure 1: (a) MMM CDAG, and (b) subcomputation Vi .
yields fully deterministic behavior in the circuit: accessing memory,
both on-chip and off-chip, is always done explicitly, rather than by
a cache replacement scheme fixed by the hardware. The models es-
tablished so far, however, pose a challenge for their applicability on
FPGAs. They often rely on abstracting away many hardware details,
assuming several idealized processing units with local memory and
all-to-all communication [2, 5, 8, 9]. Those assumptions do not
hold for FPGAs, where the physical area size of custom-designed
processing elements (PEs) and their layout are among most impor-
tant concerns in designing efficient FPGA implementations [16].
Therefore, performance modeling for reconfigurable architectures
requires taking constraints like logic resources, fan-out, routing,
and on-chip memory characteristics into account.
With an ever-increasing diversity in available hardware plat-
forms, and as low-precision arithmetic and exotic data types are
becoming key in modern DNN [17] and linear solver [18] appli-
cations, extensibility and flexibility of hardware architectures will
be crucial to stay competitive. Existing high-performance FPGA
implementations [19, 20] are implemented in hardware description
languages (HDLs), which drastically constrains their maintenance,
reuse, generalizability, and portability. Furthermore, the source
code is not disclosed, such that third-party users cannot benefit
from the kernel or build on the architecture.
In this work, we address all the above issues. We present a high-
performance, communication avoiding MMM algorithm, which is
based on both computational complexity theory [8] (Section 3.2),
and on detailed knowledge of FPGA-specific features (Section 4).
Our architecture is implemented in pure C++ with a small and
readable code base, and to the best of our knowledge, is the first open
source, high-performance MMM FPGA code. We do not assume the
target hardware, and allow easy configuration of platform, degree
of parallelism, buffering, data types, and matrix sizes, allowing
kernels to be specialized to the desired scenario. The contributions
of this paper are:
• We model a decomposition for matrix multiplication that si-
multaneously targets maximum performance and minimum
off-chip data movement, in terms of hardware constants.
ar
X
iv
:1
91
2.
06
52
6v
1 
 [c
s.D
C]
  1
3 D
ec
 20
19
memory tile (M)
b b b
b b b
b b b
b b b
block tile (b)
MM
t t t
t t t
t t t
t t t
C COMPUTATION OPTIMALITYI/O OPTIMALITY
M
M M
Mk sequential 
executions
k
n
A
k Bm
compute tile (t)
p
p
p
p
p
p
p
p
p
p
p
p
c
c
c
c
c
c
c
c
c
c
c
c
processing element (p)
co
m
p
u
te
u
n
it
 (
c)
Figure 2: Decomposition of MMM achieving both performance and I/O optimality in terms of hardware resources.
• We design a mapping that allows the proposed scheme to be
implemented in hardware, using the model parameters to
lay out the architecture.
• We provide a plug-and-play, open source implementation of
the hardware architecture in pure HLS C++, enabling porta-
bility across FPGA and demonstrating low code complexity.
• We include benchmarks for awide range of floating point and
integer types, and show the effect of adjusting parallelism
and buffer space in the design, demonstrating the design’s
flexibility and scalability.
2 OPTIMIZATION GOALS
In this section we introduce what we optimize. In Sections 3.2, 3.3,
and 4 we describe how this is achieved. We consider optimizing the
schedule of a classical MMM algorithm, that is, given a problem
of finding C , where C = AB,A ∈ Rm×k ,B ∈ Rk×n ,C ∈ Rm×n ,
an algorithm performs F = mnk multiplications and additions
(pseudocode shown in Lst. 1). We therefore exclude Strassen-like
routines [21] from our analysis, as the classical algorithms often
perform better on practical problems and hardware [22]. We require
that the optimal schedule: (1) achieves highest performance (takes
least time-to-solution time), while (2) performing the least number
of I/O operations, by (3) making most efficient use of resources.
1 for (i = 0; i < M; i++)
2 for (j = 0; j < N; j++)
3 for (k = 0; k < K; k++)
4 C[i, j] = C[i, j] + A[i, k]*B[k, j];
Listing 1: Classical MMM algorithm.
Computation On general purpose CPUs, as well as on modern
GPUs, optimizing for computation is often a straightforward task.
Exploiting techniques like vectorization or data alignment [23]
can mostly be offloaded to compilers. Thus, most of the effort is
spent on I/O minimization. When targeting FPGAs, designing an
architecture that can optimally exploit available logic, even for
computationally simple algorithms such as matrix multiplication,
requires significant engineering effort. We must thus maintain a
decomposition that is efficiently implementable in hardware, while
achieving the desired theoretical properties.
I/O Schedule optimization on a parallel machine determines both
the domain decomposition (which computations are assigned to
which compute units), and sequential execution (the order in which
each compute unit executes its tasks). The former impacts com-
munication between compute units (a.k.a. horizontal I/O), and the
latter is responsible for the communication between a compute
unit and a main memory (a.k.a. a vertical I/O). Both aspects of
the parallel schedule are constrained by available resources and
their interdependencies (e.g., NUMA domains or limited fan-out on
FPGAs).
Resources When targeting a high utilization design on FPGA, it
is critical to maintain characteristics that aid the routing process.
Routing reacts poorly to large fan-in or fan-out, which typically
occurs when these are dependent on the degree of parallelism: that
is, if N determines the degree of parallelism in the program, 1-to-
N and N -to-1 connections in the architecture should be avoided.
This is true both on the granularity of individual logic units, and
on the granularity of coarse-grained modules instantiated by the
programmer. To accommodate this, we can regulate the size of PEs,
and favor PE topologies that are easily mapped to a plane, such as
grids or chains. Furthermore, mapping of a hardware architecture
to the chip logic and interconnect (placement and routing) may
reduce the clock frequency due to long routing paths. Due to the
intractable size of the configuration space, this cannot be efficiently
modeled and requires empirical evaluation of designs. The routing
challenges are exasperated in FPGA chips that consist of multiple
“chiplets”, such as the Xilinx UltraScale+ VU9P chip used in this
paper, which hosts three “super-logical regions” (SLRs). Crossing
the chiplets consumes highly limited routing resources and carries
a higher timing penalty. Limiting these crossings is thus key to
scaling up resource utilization.
Throughout this paper, we use the two-level notation for naming
parameters (Table 1). Most of the parameter names are in the form
of αβ , where α refers to some quantity, such as the total number of
objects, and β determines what is the object of interest. E.g., Nc ,Nb ,
sb are: total number of (N ) compute units (c), memory blocks (b),
and a size of each memory block (s), respectively.
The target hardware contains d types of different logic resources.
This typically consists of general purpose logic, such as lookup tables
(LUTs), and more specialized arithmetic units, such as digital signal
processing units (DSPs). We represent a quantity of these resources
as a vector rmax = [r1,max, ..., rd,max]. As a basic logical entity, we
consider a “compute unit”, which is a basic circuit able to perform
a single multiply-addition operation in a single cycle. Each unit is
implemented on the target hardware using some combination of
logic resources rc . Depending on the numerical precision, different
M
M
M i, j, k Unit vectors spanning 3D iteration space.m, n, k Matrix sizes in i, j, and k dimensions, respectively.
A, B Input matrices A ∈ Rm×k and B ∈ Rk×n .
C = AB Output matrix C ∈ Rm×n .
na
m
in
g αβ Parameter naming convention. α is some quantity
(i.e., size or number of objects), β is an object that
α refers to.
αβ ,max Hardware limit on a parameter αβ .
αtot =
∏
β αβ The product of all tile sizes.
α
N Total number of objects.
x, y Number of objects in i (or j) dimension in the
enveloping tile.
s Intrinsic size of an object.
w Bit width of object (e.g., of port or data type).
r Vector of logic resources.
β
c Compute units in a processing element.
p Processing elements in a compute tile.
t Compute tiles in a block tile.
b Block tiles in a memory tile.
op
ti
m
iz
at
io
n S = Nb · sb Total size of available on-chip memory.
Nc ≤ Nc,max Total number of compute units.
Q Total number of off-chip memory transfers.
F = n ·m · k Total number of multiply-addition operations
required to perform MMM.
f ≤ fmax Achieved and maximum design frequency.
T = Ff ·Nc Design total execution time.
Table 1: The most important symbols used in this paper.
number of computation resources rc are needed to form a single
compute unit, so the maximum number of compute units that can
be instantiated, Nc , may vary. The compute units are organized into
Np processing elements (PEs), which encapsulate a closed logical
task (e.g., a vector operation) of xc · yc compute units. Each PE
additionally consumes rp logic resources as orchestration overhead.
This gives us the following constraint, which enforces that the
total amount of resources consumed by compute units and their
encompassing PEs should not exceed the total resources available:
∀1≤i≤dNcri,c + Npri,p ≤ ri,max,
or equivalently ∀1≤i≤dNp (ri,p + ri,c · xcyc ) ≤ ri,max (1)
where d is the dimensionality of the resource vector (illustrated in
the top-right side of Fig. 2).
3 OPTIMIZATION MODELS
3.1 Computation Model
To optimize computational performance we minimize the total
execution runtime, which is a function of achieved parallelism
(total number of compute units Nc ) and the design clock frequency
f . The computational logic is organized intoNp PEs, and we assume
that every PE holds xc ·yc compute units in dimensions x and y (see
Tab. 1 for an overview of all symbols used). We model the factor
Nc directly in the design, and rely on empirically fixing f , which
is limited by the maximum size of data buses between PEs (i.e.,
xcwc ≤ wp,max and ycwc ≤ wp,max, wherewp,max depends on the
architecture, and typically takes values up to 512 bit). Formally, we
can write the computational optimization problem as follows:
minimize T = F
f · Nc =
mnk
f · Np · xcyc
subject to:
∀1≤i≤dNp (ri,p + ri,c · xc · yc ) ≤ ri,max
xcwc ≤ wp,max
ycwc ≤ wp,max
f ≤ fmax
(2)
That is, the time to completion T is minimized when f ·Nc is max-
imized, where the number of parallel compute units Nc is con-
strained by the available logic resources rmax of the design (this can
be the full hardware chip, or any desired subset resource budget).
We respect routing constraints by observing a maximum bus width
wp,max, and must stay within the target frequency fmax.
3.2 I/O Model
3.2.1 State-of-the-art of Modeling I/O for MMM.
Minimizing the I/O cost is essential for achieving high performance
on modern architectures, even for traditionally compute-bound
kernels like MMM [24]. In this section, we sketch a theoretical
background from previous works which lays foundation for our
FPGA model. Following the state-of-the-art I/O models [2, 8, 25]
we assume that a parallel machine consists of p processors, each
equipped with a fast private memory of size S words. To perform an
arithmetic operation, processor pi is required to have all operands
in its fast memory. The principal idea behind the I/O optimal sched-
ule is to maximize the computational intensity, i.e., the number of
arithmetic operations performed per one I/O operation. This natu-
rally expresses the notion of data reuse, which reduces both vertical
(through memory hierarchy) and horizontal (between compute
units) I/O (Section 2).
Algorithm as a Graph We represent an entire execution of an
algorithm as a computation directed acyclic graph (CDAG) [8, 24,
26] G = (V ,E), where every vertex v ∈ V corresponds to some
unique value during the execution, and edges e ∈ E represent
data dependencies between them. Vertices without incoming edges
are inputs, and the ones without outgoing edges are outputs. The
remaining vertices are intermediate results. In the context of MMM,
matrices A and B formm × k and k × n input vertices, respectively,
and partial sums of C formmnk intermediate vertices, with inputs
both from corresponding vertices of A and B, and previous partial
sums of C . The output vertices are formed by them × n vertices of
C which represent the last of k partial sums. The MMM CDAG is
shown in Fig. 1a.
I/O as Graph Pebbling Hong and Kung [8] introduced the red-
blue pebble game abstraction to model the I/O cost of a sequential
schedule on a CDAG. We refer a reader to the original paper for
the formal definition and details of this game: here we just draw its
simplistic sketch. The key idea is to play a pebbling game with a
limited number of red pebbles (corresponding to the small-but-fast
memory) and an unlimited number of blue pebbles (large, slow
memory). The rules are that one can put a red pebble on a vertex
only if all its direct predecessors also have red pebbles (which
represent computing a value, while all operands are in the fast
memory). Placing a blue pebble on a red one corresponds to a store
operation, and putting a red pebble on a blue corresponds to a load.
Initially, only input vertices have blue pebbles on them. The goal is
to put blue pebbles on the output vertices. In this model, the I/O
optimal schedule is a sequence of pebbling moves which minimizes
the load and store operations.
Schedule as a Graph Partition In our work, we extend the
methodology and results of COSMA [24], which is based on the
red-blue pebble game. We partition the CDAG G = (V ,E) into h
disjoint subsets (a.k.a. subcomputations)Vi , i = 1 . . .h,Vi ⊂ V , such
that each Vi has a constant number X of input and output vertices
(which form the Dominator set andMinimum set, respectively). The
collection of all {Vi },⋃Vi = V is called an X -partition of G. The
authors show that an I/O optimal scheme can be derived from find-
ing an X -partition {Vi } of a smallest cardinality, for some value of
X . We will use this result to build our model.
Optimal Graph Partition as Maximizing Computational In-
tensity In COSMA [24], it is shown that the I/O optimal MMM
schedule maximizes the computational intensity of each subcom-
putation Vi , that is, the number of arithmetic operations per I/O
operation. Formally:
maximize |Vi ||Dom(Vi )| − |VR,i | + |WB,i |
subject to: |VR,i | ≤ S ,
(3)
where |Vi | is a number of values updated in subcomputation Vi ,
|Dom(Vi )| is the number of inputs of Vi (the Dominator set), |VR,i |
is the number of inputs that are already in the on-chip memory
(data reuse), and |WB,i | is the number of partial results that have to
be stored back to off-chip memory. Therefore, we aim to maximize
utilization of all available on-chip memory, to increase the reuse
of data |VR,i | that is already loaded. The sets Vi , Dom(Vi ), and VR,i
are depicted in Fig. 1b.
3.2.2 Extending to the FPGA Scenario.
I/O Optimal Schedule vs. FPGA Constraints State-of-the-art
I/O models [2, 8, 25] assume that a parallel machine consists of p
processors, each equipped with a small private memory of constant
size S words (two-level memory model). Under these assumptions,
COSMA establishes that the I/O optimal MMM schedule is com-
posed of h =mnk/S subcomputations, each performing S multiply-
addition operations while loading 2
√
S elements from matrices A
and B. However, in the FPGA setting these assumptions do not
hold, as the number of compute units Nc is a variable depending on
both hardware resources (which is constant), and on the process-
ing element design. Furthermore, the mapping between processing
elements and available BRAM blocks is also constrained by ports
and limited fan-out. We impose the additional requirement that
compute and memory resources must be equally distributed among
PEs, posing additional restrictions on a number of available re-
sources and their distribution for each subcomputationVi to secure
maximum arithmetic throughput and routing feasibility:
(1) The number of parallel compute units Nc is maximized.
(2) The work is load balanced, such that each compute unit
performs the same number of computations.
(3) Each memory block is routed to only one compute unit (i.e.,
they are not shared between compute units).
(4) Each processing element p performs the same logical task,
and consumes the same amount of computational and mem-
ory block resources.
Memory resources To model the memory resources of the FPGA,
we consider the bit-length ofwc , depending on the target precision.
The machine contains Nb on-chip memory blocks, each capable of
holding sb words of the target data type, yielding a maximum of
S = Nb · sb
words that can be held in on-chip memory. sb takes different values
depending on wc (e.g., 16 bit for half precision floating point, or
a 64 bit long unsigned integer). Each memory block supports one
read and one write of up towb bits in a single cycle in a pipelined
fashion.
FPGA-constrained I/O Minimization We denote each Vi as a
memory tile M , as its size in i and j dimensions determines the
memory reuse. To support a hierarchical hardware design, eachM is
further decomposed into several levels of tiling. This decomposition
encapsulates hardware features of the chip, and imposes several
restrictions on the final shape ofM . The tiling scheme is illustrated
in Fig. 2. We will cover the purpose and definition of each layer in
the hierarchy shortly in Sec. 3.3, but for now use that the dimensions
of the full memory tileM are:
xtot = xc · xp · xt · xb
ytot = yc · yp · yt · yb , (4)
and we set |Vi | = xtotytot. Following the result from [24], a schedule
that minimizes the number I/O operations, loads xtot elements of
one column of matrix A, ytot elements of one row of matrix B and
reuses xtotytot previous partial results ofC , thus computing an outer
product of the loaded row and column. We now rewrite Eq. 3 as:
maximize xtotytot
xtot + ytot
subject to: xtot + ytot ≤ S
xtotytot ≤ S ,
(5)
and the total number of I/O operations as:
Q =mn
(
1 + k
(
1
xtot
+
1
ytot
))
. (6)
This expression is minimized when:
xtot = ytot =
√
S (7)
That is, a memory tile is a square of size S . Eq. 6 therefore gives
us a theoretical lower bound on Q ≤ 2mnk/√S , assuming that all
availablememory can be used effectively. However, the assumptions
stated in Sec. 3.2 constrain the perfect distribution of hardware
resources, which we model in Sec. 3.3.
3.3 Resource Model
Based on the I/O model and the FPGA constraints, we create a
logical hierarchy which encapsulates various hardware resources,
which will guide the implementation to maximize I/O and per-
formance. We assume a chip contains rmax = {r1,max, . . . , rt,max}
different hardware resources (see Sec. 2). The dimensionality and
1 // Memory tiles m
2 for (im = 1; im ≤ n; im = im + xtot)
3 for (jm = 1; jm ≤ m; jm = jm + ytot)
4 for (k = 1; k ≤ k ; k = k + 1) // Full dimension k
5 // [Sequential] Block tiles b in memory tile
6 for (ib = im ; ib ≤ im + xtot; im = im + xt xc xp )
7 for (jb = jm ; jb ≤ jm + ytot; jm = jm + ytypyc )
8 // [Sequential] Compute tiles t in block tile
9 for (it = ib ; it ≤ ib + xt xpxc ; ib = ib + xc xp )
10 for (jt = yb ; jt ≤ jb + ytypyc ; jt = jt + ycyp )
11 // [Parallel] Processing elements p in compute tile
12 forall (ip = it ; ip ≤ it + xpxc ; it = it + xc )
13 forall (jp = jt ; jp ≤ jt + ypyc ; jp = jp + yc )
14 // [Parallel] Compute units c in processing element
15 forall (ic = ip ; ic ≤ ip + xc ; ic = ic + 1)
16 forall (jc = jp ; jc ≤ jp + yc ; jc = jc+1)
17 C (ic , jc ) = C (ic , jc ) +A(ic , k ) · B(k, jc )
Listing 2: Pseudocode of the tiled MMM algorithm.
length of a vector depends on the target hardware – e.g., Intel Ar-
ria 10 and Stratix 10 devices expose native floating point DSPs,
each implementing a single operation, whereas a Xilinx UltraScale+
device requires a combination of logic resources. We model fast
memory resources separately as memory blocks (e.g., M20K blocks
on Intel Stratix 10, or Xilinx BRAM units). We consider a chip con-
tains Nb memory blocks, where each unit can store sb elements of
the target data type and has a read/write port width ofwb bits. The
scheme is organized as follows (shown in Fig. 2):
(1) A compute unit c consumes rc hardware resources, and can
throughput a single multiplication and addition per cycle.
Their maximal number
Nc,max ≤ min1≤i≤t
( ri,max
ri,c
)
for a given numerical precision is a hardware constant, given
by the available resources rmax.
(2) A processing element p encapsulates xc · yc compute units.
Each processing element requires additional rp resources for
overhead logic.
(3) A compute tile t encapsulates xp · yp processing elements.
One compute tile contains all available compute units
xc · yc · xp · yp = Nc .
(4) A block tile b encapsulates xt · yt = sb compute tiles, filling
the entire internal capacity sb of currently allocated memory
blocks.
(5) A memory tileM encapsulates xb ·yb =
⌊
Nb
Nb,min
⌋
block tiles
(discussed below), using all available Nb memory blocks.
Pseudocode showing the iteration space of this decomposition is
shown in Lst. 2, consisting of 11 nested loops. Each loop is either a
sequential for-loop, meaning that no iterations will overlap, and
will thus correspond to pipelined loops in the HLS code; or a parallel
forall-loop, meaning that every iteration is executed every cycle,
corresponding to unrolled loops in the HLS code. We require that
the sequential loops are coalesced into a single pipeline, such that
no overhead is paid at iterations of the outer loops.
3.4 Parallelism and Memory Resources
The available degree of parallelism, counted as a number of simul-
taneous computations of line 17 in Listing 2, is determined by the
number of compute units Nc . Every one of these compute units
must read and write an element of C from fast memory every cycle.
This implies a minimum number of parallel fast memory accesses
0 512 1024 1536 2048
Available memory blocks
0
512
1024
1536
2048
M
em
or
y
bl
o
ck
s
us
ed
fo
r
F
P
32
N
b,
m
a
x
Nb = 0.6 ·Nb,max = 1152 blocks
jcjp = 8, icip = 16
jcjp = 8, icip = 32
jcjp = 8, icip = 64
jcjp = 8, icip = 144
Figure 3: Utilization of memory blocks with memory tile
size. For ic jc = 8 and ip jp = 144, we can utilize 60.4% · Nb,max.
that must be supported in the architecture. Memory blocks expose
a limited access widthwb (measured in bits), which constrains how
much data can be read from/written to them in a single cycle. We
can thus infer a minimum number of memory blocks necessary to
serve all compute units in parallel, given by:
Nb,min = xpyp ·
⌈
wc · xcyc
wb
⌉
, (8)
where wc is the width of the data type in bits, and xcyc denotes
the granularity of a processing element. Because all xcyc accesses
within a processing element happen in parallel, accesses to fast
memory can be coalesced into long words of size wc · xcyc bits.
For cases where wb is not a multiple of wc , the ceiling in Eq. 8
may be significant for the resulting Nb,min. When instantiating
fast memory to implement the tiling strategy, Eq. 8 defines the
minimum “step size” we can take when increasing the tile sizes.
Within a full memory tile, each updated value C[i, j] is reused
after all xtot · ytot elements in a single memory tile are evaluated,
and computation proceeds to the next iteration of the k-loop (line 4
in Listing 2). Given the intrinsic size of each memory block sb , we
can thus perform sb iterations of the compute tile before a single
batch of Nb,min allocated memory blocks has been filled up. If the
total number of memory blocks Nb,max ≥ 2Nb,min, i.e., the number
of blocks required to support the parallel access requirements is less
than the total number of blocks available, we can perform additional⌊
Nb
Nb,min
⌋
iterations of the block tile, using all available memory
blocks (up to the additive factor of Nb mod Nb,min). However,
for large available parallelism Nc , this additive factor may play a
significant role, resulting in a part of available on-chip memory not
being used. This effect is depicted in Fig. 3 for different values of
Nc for the case of single precision floating point (FP32) in Xilinx
BRAM blocks, where sb = 1024 andwb = 36 bit. The total number
of memory blocks that can be efficiently used, without sacrificing
the compute performance and load balancing constraints, is then:
Nb =
⌊
Nb,max
Nb,min
⌋
Nb,min. (9)
In the worst case, this implies that onlyNb,max/2+1memory blocks
are used. In the best case, Nb,max is a multiple of Nb,min, and all
memory block resources can be utilized. When Nc > Nb,max/2, the
memory tile collapses to a single block tile, and the total memory
block usage is equal to Eq. 8.
4 HARDWARE MAPPING
With the goals for compute performance and I/O optimality set
by the model, we now describe a mapping to a concrete hardware
implementation.
4.1 Layout of Processing Elements
For Eq. 2 to hold, all Np PEs must run at full throughout execution
of the kernel, computing distinct contributions to the output tile.
In terms of the proposed tiling scheme, we must evaluate a full
compute tile t (second layer in Fig. 2) every cycle, which consists
of xp · yp PE tiles (first layer in Fig. 2), each performing xc · yc
calculations in parallel, contributing a total of Nc multiplications
and additions towards the outer product currently being computed.
Assuming that Np elements of A and a full row of ytot elements of
B have been prefetched, we must – for each of the xp rows of the
first layer in Fig. 2 – propagate xc values to all yp horizontal PEs,
and equivalently for columns of B. If this was broadcasted directly,
it would lead to a total fan-out of xp · yp for both inputs.
Rather than broadcasting, we can exploit the regular grid struc-
ture, letting each column forward values ofA, and each row forward
values of B, in a pipelined fashion. Such an architecture is some-
times referred to as a systolic array, and is illustrated in Fig. 4. In this
setup, each processing element has three inputs and three outputs
(for A, B, andC), and dedicated Feed A and Feed B modules send
prefetched contributions to the outer product at the left and top
edges of the grid, while Store C consumes the output values ofC
written back by the PEs. The number of inter-module connections
for this design is 3xpyp , but more importantly, the fan-out of all
modules is now constant, with 6 data buses per PE. Each PE is
responsible for fully evaluating xtotytot/Np elements of the output
tile ofC . The elements of each PE tile in Fig. 2 are stored contigu-
ously (the first layer), but all subsequent layers are not – only the
compute tile as a whole in contiguous inC . Final results must thus
be written back in an interleaved manner to achieve contiguous
writes back toC .
Collapsing to a 1D array. Although the 2D array of PEs is intu-
itive for performing matrix multiplication, it requires a grid-like
structure to be routed on the chip. While this solves the issue of
individual fan-out – and may indeed be sufficient for monolithic de-
vices with all logic arranged in a rectangular structure – we wish to
map efficiently onto general interconnects, including non-uniform
and hierarchical structures, as well as multiple-chiplet FPGAs (or,
potentially, multiple FPGAs). To achieve this, we can optionally
collapse the 2D array of PEs into a 1D array by fixing yp = 1, re-
sulting in Np = xp PEs connected in sequence. Since this results
in a long, narrow compute tile, we additionally fix xc = 1, relying
on yc to regulate the PE granularity. Forming narrow compute
tiles is possible without violating Eq. 7, as long as xtot and ytot are
kept identical (or as similar as possible), which we can achieve by
regulating the outer block and tiling layers (the memory and block
tile layers in Fig. 2).
Double buffering. Since each PE in the 1D array now computes
one or more full rows of the compute tile, we can buffer values of
A in internal registers, rather than from external modules. These
can be propagated through the array from the first element to the
last, then kept in local registers and applied to values of B that
Off-chip 
memory
.........
...
...
...
...
...
...
Figure 4: Compute arranged in a 2D grid.
Of
f-c
hi
p 
m
em
or
y
Transpose
Feed B
Re
ad
 B
Re
ad
 A
...
Write C
Figure 5: Module layout of final kernel architecture.
are streamed through the array from a buffer before the first PE.
Since the number of PEs in the final design is large, we overlap the
propagation of new values of A with the computation of the outer
product contribution using the previous values ofA, by using double
buffering, requiring two registers per PE, i.e., 2Np total registers
across the design.
By absorbing the buffering of A into the PEs, we have reduced
the architecture to a simple chain of width 1, reducing the total
number of inter-module connections for the compute to 3Np , with
3 buses connecting each PE transition. When crossing intercon-
nects with long timing delays or limited width, such as connections
between chiplets, this means that only 3 buses must cross the gap,
instead of a number proportional to the circumference of the num-
ber of compute units within a single chiplet, as was the case for
the 2D design. As a situational downside, this increases the number
of pipeline stages in the architecture when maximizing compute,
which means that the number of compute tiles must be larger than
the number of PEs, i.e., ytxt ≥ Np . This extra constraint is easily
met when minimizing I/O, as the block tile size is set to a multiple
of sb (see Sec. 3.3), which in practice is higher than the number of
PEs, assuming that extreme cases like xc = yc = 1 are avoided for
large Nc .
4.2 Handling Loop-carried Dependencies
Floating point accumulation is often not a native operation on
FPGAs, which can introduce loop-carried dependencies on the
accumulation variable. This issue is circumvented with our decom-
position. Each outer product consists of xpxm ·ypym inner memory
tiles. Because each tile reduces into a distinct location in fast mem-
ory, collisions are separated by xpxm · ypym cycles, and thus do
not obstruct pipelining for practical memory tile sizes (i.e., where
xpxm · ypym is bigger than the accumulation latency).
For data types such as integers or fixed point numbers, or archi-
tectures that support (and benefit from) pipelined accumulation of
floating point types, it is possible to make k the innermost loop,
optionally tiling n andm further to improve efficiency of reads from
off-chip memory. The hardware architecture for such a setup is
largely the same as the architecture proposed here, but changes the
memory access pattern.
4.3 Optimizing Column-wise Reads
In the outer product formulation, the A-matrix must be read in a
column-wise fashion. For memory stored as row-major, this results
in slow and wasteful reads from DDR memory (in a column-major
setting, the same argument applies, but for B instead). For DDR4
memory, a minimum of 512 bits must be transferred to make up
for the I/O clock multiplier, and much longer bursts are required to
saturate DDR bandwidth in practice. To make up for this, we can
perform on-the-fly transposition ofA as part of the hardware design
in an additional module, by reading wide vectors and pushing them
to separate FIFOs of depth ≥xbxm , which are popped in transposed
order when sent to the kernel (this module can be omitted in the
implementation at configuration time if A is pre-transposed, or an
additional such module is added if B is passed in transposed form).
4.4 Writing Back Results
Each final tile ofC is stored across the chain of processing in a way
that requires interleaving of results from different PEs when writ-
ing it back to memory. Values are propagated backwards through
the PEs, and are written back to memory at the head of the chain,
ensuring that only the first PE must be close to the memory mod-
ules accessing off-chip memory. In previous work, double buffering
is often employed for draining results, at the significant cost of
reducing the available fast memory from S to S/2 in Eq. 6, resulting
in a reduction in the arithmetic intensity of
√
2. To achieve optimal
fast memory usage, we can leave writing out results as a sequen-
tial stage performed after computing each memory tile. It takes
nm/yc cycles to write back values of C throughout kernel execu-
tion, compared to nmk/Nc cycles taken to perform the compute.
When k/Nc ≫ 1, i.e., the matrix is large compared to the degree of
parallelism, this effect of draining memory tiles becomes negligible.
4.5 Final Module Layout
With the constraints and considerations accumulated above, we
fix the final hardware architecture. The module layout is shown in
Fig. 5, and consists of 4 + Np modules. The Feed B module buffers
the outer product row of B, whereas Np values of A are kept in PE
registers. The vast majority of fast memory is spent in buffering the
output tile of C (see Sec. 3.2), which is partitioned across the PEs,
with xtot ·ytotNp elements stored in each. The Read A and Transpose
modules are connected with a series of FIFOs, the number of which
is determined by the desired memory efficiency in reading A from
DRAM. In our provided implementation, PEs are connected in a
1D sequence, and can thus be routed across the FPGA in a “snake-
like” fashion [16] to maximize resource utilization with minimum
routing constraints introduced by the module interconnect.
PEi
+×PE i-1 PE i+1
I)
III)
IV)
II)
Figure 6: Architecture of a single PE.
The PE architecture is shown in Fig. 6. I) Each PE is respon-
sible for storing a single double-buffered value of A. Values are
loaded from memory and passed through the array, while the previ-
ous outer product is being computed. II) Values of B are streamed
through the chain to be used at every PE. III) Every cycle accumu-
lates into a different address of the output C until it repeats after
xtxb · ytyb cycles. IV) When the outer tile has been computed,
it is sent back through the PEs and written back at the memory
interface.
5 EVALUATION
5.1 Parameter Selection
Using the performance model and hardware mapping consider-
ations, parameters for kernel builds used to produce results are
chosen in the following way, in order to maximize performance and
minimize I/O based on available compute and memory resources,
respectively:
(1) The PE granularity is fixed at xc = 1, and yc is set as high as
possible without impairing routing (determined empirically).
(2) f Nc is maximized by scaling up parallelism Nc = Np ·yc (we
fixed xc = 1) when the benefit is not eliminated by reduction
in frequency, according to Eq. 2.
(3) Memory tile sizes are maximized according to Eq. 9 to satu-
rate on-chip memory resources.
For a given set of parameters, we build kernels in a fully automated
end-to-end fashion, leveraging the abstractions provided by the
high-level toolflow.
5.2 Code Complexity
The MMM kernel architecture used to produce the result in this
work is implemented in Xilinx’ Vivado HLS tool with hlslib [27]
extensions, and as of writing this paper, consists of 624 and 178
SLOC of C++ for kernel and header files, respectively. This is a
generalized implementation, and includes variations to support
transposed/non-transposed input matrices, variable/fixed matrix
sizes, and different configurations of memory bus widths. Addition-
ally, the operations performed by compute units can be specified,
e.g., to compute the distance product by replacing multiply and add
with add and minimum. The full source code is available on github
under an open source license (see footnote on first page).
5.3 Experimental Setup
We evaluate our implementation on a Xilinx VCU1525 accelerator
board, which hosts an Virtex UltraScale+ XCVU9P FPGA. The board
has four DDR4 DIMMs, but due to the minimal amount of I/O
required by our design, a single DIMM is sufficient to saturate
the kernel. The chip is partitioned into three chiplets, that have a
total of 1,033,608 LUTs, 2,174,048 flip-flops (FFs), 6834 DSPs, and
1906 BRAMs available to our kernels. This corresponds to 87%,
92%, 99.9%, and 90% of data sheet numbers, respectively, where the
remaining space is occupied by the provided shell.
Our kernels are written in Vivado HLS targeting the
xilinx:vcu1525:dynamic:5.1 platform of the SDAccel 2018.2
framework, and -O3 is used for compilation. We target 200MHz in
Vivado HLS and SDAccel, although this is often reduced by the tool
in practice due to congestion in the routed design for large designs,
in particular paths that cross between chiplets on the FPGA (see
Sec. 2). Because of the high resource utilization, each kernel build
takes between 8 and 24 hours to finish successfully, or between 4
and 24 hours to fail placement or routing.
On the Virtex UltraScale+ architecture, floating point operations
are not supported natively, and must be implemented using a combi-
nation of DSPs and general purpose logic provided by the toolflow.
The resource vector r thus has the dimensions LUTs, FFs, and DSPs.
The Vivado HLS toolflow allows choosing from multiple floating
point implementations, that provide different trade-offs between
LUT/FF and DSP usage. In general, we found that choosing im-
plementations of floating point addition that does not use DSPs
yielded better results, as DSPs replace little general purpose logic
for this operation, and are thus better spent on instantiating more
multiplications.
Memory blocks are implemented in terms of BRAM, where each
block has a maximum port width of 36 bit of simultaneous read and
write access to 18 kbit of storage. For wider data types, multiple
BRAMs are coalesced. Each BRAM can store sb,36 bit = 1024 ele-
ments in 36 bit configuration (e.g., FP32), sb,18 bit = 2048 elements
in 18 bit configuration (e.g., FP16), and sb,72 bit = 512 elements in
72 bit configuration (e.g., FP64). For this work, we do not consider
UltraRAM, which is a different class of memory blocks on the Ul-
traScale+ architecture, but note that these can be exploited with
the same arguments as for BRAM (according to the principles in
Sec. 3.3). For benchmarked kernels we report the compute and
memory utilization in terms of the hardware constraints, with the
primary bottleneck for I/O being BRAM, and the bottleneck for
performance varying between LUTs and DSPs, depending on the
data type.
5.4 Results
We evaluate the computational performance and communication
behavior of our approach by constructing kernels within varying
logic and storage budgets, based on our C++ reference implementa-
tion. To explore the scaling behavior with increased parallelism, we
measure strong scaling when increasing the number of PEs, shown
in Fig. 7, by increasing Nc for 16384×16384×16384 matrices. We
report the median across 20 runs, and omit confidence intervals,
as all kernels behaved deterministically, making errors negligible.
To measure power efficiency, we sample the direct current power
draw of the PSU in the host machine, then determine the FPGA
power consumption by computing the difference between the ma-
chine at idle with no FPGA plugged in, and the FPGA plugged in
while running the kernel. This method includes power drawn by
256 384 512 640 768 896 1024 1152
0
100
200
300
400
500
P
er
fo
rm
an
ce
[G
O
p/
s]
Kernel fits on
single chiplet
Min. two chiplets
utilized
All three chiplets
utilized
f from kernel
build unstable
Theoretical performance at 200 MHz
Measured performance
256 384 512 640 768 896 1024 1152
Parallel adders/multipliers (Nc)
0
2
4
6
8
P
ow
er
eff
.
[G
O
p/
J]
3.1
4.1
5.0
5.9 5.3
7.6
6.5
8.3
Figure 7: Strong scaling for single precision floating point,
n=m=k=16384matrices.
the full VCU1525 evaluation board, including the integrated fan. The
kernels compile to maximum performance given by each config-
urations at 200MHz until the first chiplet/SLR crossing, at which
point the clock frequency starts degrading. This indicates that the
chiplet crossings are the main contributor to long timing paths in
the design that bottleneck the frequency.
Tab. 2 shows the configuration parameters and measured results
for the highest performing kernel built using our architecture for
half, single and double precision floating point types, as well as
8-bit, 16-bit, and 32-bit unsigned integer types. Timing issues from
placement and routing are the main bottleneck for all kernels, as the
frequency for the final routed designs start to be unstable beyond
33% resource usage, when the number of chiplet crossings becomes
significant (shown in Fig. 7). When resource usage exceeds 80−90%,
kernels fail to route or meet timing entirely. Due to the large step
size in BRAM consumption for large compute tiles when targeting
peak performance (see Sec. 3.3), some kernels consume less BRAM
than what would otherwise be feasible to route, as increasing the
memory tile by another stage of Nb,min would exceed Nb,max.
In contrast to previous implementations, we achieve optimal us-
age of the on-chip memory by separating the drain phase of writing
out results from the compute phase. This requires the number of
computations performed per memory tile to be significantly larger
than the number of cycles taken to write the tile out to memory
(see Sec. 4.4). This effect is shown in Fig. 8 for small Nc (left) and
large Nc (right). For large Nc , the time spent in draining the result
is significant for small matrices. In either scenario, optimal compu-
tational efficiency is approached for large matrices, when there is
sufficient work to do between draining each result tile.
Fig. 9 demonstrates the reduction in communication volumewith
increasing values of the outer I/O tiles (i.e., xtxb · ytyb ). We plot
the arithmetic intensity, corresponding to 2× the computational
intensity in Eq. 3 (1 addition and 1 multiplication), and verify that
the communication volume reported by the runtime is verified to
match the analytical value computed with Eq. 6. We also report the
Data type xp yc xbxm ybym Frequency Performance Power eff. Arithm. int. LUTs FFs DSPs BRAM
fp16 112 16 1904 1920 171.3MHz 606GOp/s 15.1GOp/J 956Op/Byte 53% 24% 70% 90%
fp32 192 8 960 1632 145.7MHz 409GOp/s 10.9GOp/J 302Op/Byte 81% 46% 48% 80%
fp64 96 4 864 908 169.3MHz 122GOp/s 3.5GOp/J 112Op/Byte 38% 28% 80% 82%
uint8 132 32 1980 2176 186.5MHz 1544GOp/s 48.0GOp/J 2073Op/Byte 15% 8% 83% 51%
uint16 210 16 1680 2048 190.0MHz 1217GOp/s 33.1GOp/J 923Op/Byte 20% 11% 69% 88%
uint32 202 8 1212 1360 160.6MHz 505GOp/s 13.8GOp/J 320Op/Byte 58% 11% 84% 86%
Table 2: Highest performing kernels built for each data type.
0 2048 4096 6144 8192 10240 12288 14336 16384
Size of common dimension (k)
70
80
90
100
P
er
fo
rm
an
ce
[G
O
p/
s]
0.6
0.7
0.8
0.9
1.0
P
er
f.
E
ffi
ci
en
cy
fp32, Nc = 128
xtot=1024, ytot=1024
16384×k×16384
0 2048 4096 6144 8192 10240 12288 14336 16384
Size of common dimension (k)
400
500
600
P
er
fo
rm
an
ce
[G
O
p/
s]
0.6
0.7
0.8
0.9
1.0
P
er
f.
E
ffi
ci
en
cy
fp16, Nc = 1792
xtot=1904, ytot=1920
17136×k×17280
Figure 8: Fraction of maximum compute throughput for
varying matrix size.
1282 2562 3842 5122 6402 7682 8962 10242
Memory tile size (xtot · ytot)
0
50
100
150
200
250
A
ri
th
m
et
ic
in
te
ns
it
y
[O
p/
B
yt
e]
32
64
96
128
160
192
224
256Nc = 128, all kernels perform at ∼100 GOP/s.
0.0
0.5
1.0
1.5
2.0
2.5
B
an
dw
id
th
re
qu
ir
ed
[G
B
/s
]
2.50
1.33
0.91
0.69
0.56 0.46 0.40 0.35
Figure 9: FP32 arithmetic intensity with memory tile size.
average bandwidth requirement needed to run each kernel (in prac-
tice, the bandwidth consumption is not constant during runtime, as
memory accesses are done as bursts each time the row and column
for a new outer product is loaded). There is a slight performance
benefit from increasing memory tile size, as larger tiles increase the
ratio of cycles spent in the compute phase to cycles spent writing
back results, approaching perfect compute/DSP efficiency for large
matrices. For the largest tile size, the kernel consumes 350MB/s
at 100GOp/s, which corresponds to 35019200 = 1.8% of the maximum
bandwidth of a single DDR4 module. Even at the highest measured
single precision performance (Tab. 2) of 409GOp/s, the kernel re-
quires 1.35GB/s. This brings the I/O of matrix multiplication down
to a level where nearly the full bandwidth is left for other kernels
to utilize.
6 RELATEDWORK
Much of previous work focuses on the low level implementation
for performance [20], explores high-level optimizations [28], or
implements MMM in the context of neural networks [19, 29]. To
the best of our knowledge, this is the first work to minimize I/O of
matrix multiplication on FPGA in terms of hardware constants, and
the first work to open source our implementation to benefit of the
community. We relate this paper to the most relevant works below.
Tab. 3 shows a hybrid qualitative/quantitative comparison to
previously published MMM implementations on FPGA. Cells are
left empty when numbers are not reported by the authors, or when
the given operation is not supported. As our work is the only open
source implementation, we are unable to execute kernels from other
works on the same FPGA, and resort to comparing the performance
reported in papers for the respective benchmarked FPGA. These
FPGAs thus vary widely in vendor, technology and architecture.
Zhuo and Prasanna [30] discuss two matrix multiplication imple-
mentations on FPGA, and include routing in their considerations,
and support multiple floating point precisions. The authors suggest
two algorithms, where both require a number of PEs proportional
to the matrix size. While these only require loading each matrix
once, they do not support matrices of arbitrary size, and thus do
not scale without additional CPU orchestration.
Dou et al. [31] design a linear array of processing elements, im-
plementing 64-bit floating point matrix multiplication – no support
is offered for other data types, as the work emphasizes the low-level
implementation of the floating point units. The authors derive the
required off-chip bandwidth and buffer space required to achieve
peak performance on the target device, but do not model or opti-
mize I/O in terms of their buffer space usage, and do not report
their tile sizes or how they were chosen. Furthermore, the authors
double-buffer the output tile, reducing the maximum achievable
computational intensity by a factor
√
2 (see Sec. 4.4).
A customizable matrix multiplication implementation for deep
neural network applications on the Intel HARPv2 hybrid CPU/F-
PGA platform is presented by Moss et al. [19], targeting single
precision floating point (FP32), and fixed point/integer types. The
authors exploit native floating point DSPs on an Arria 10 device
to perform accumulation, and do not consider data types that can-
not be natively accumulated on their chip, such as half or double
precision. The I/O characteristics of the approach is not reported
quantitatively. Wu et al. [33] present a highly specialized archi-
tecture for maximizing DSP usage and frequency of 16 bit integer
Year Device % Logicutil.
Freq.
[MHz]
Perf. FP16
[GOp/s]
Perf. FP32
[GOp/s]
Perf. FP64
[GOp/s]
Energy eff.
FP32 [GOp/J]
Multiple
data types Lang. (Portable)
Open
source
I/O
model
Zhuo [30] 2004 Virtex-II Pro 98 128 - 2 2 -  HDL ( )
Dou [31] 2005 Virtex-II Pro 99 177 - - 39 - HDL ( )
Kumar [32] 2009 Virtex-5 61 373† - - 30† - HDL ( )
Jovanović [20] 2012 Virtex-6 100 403 - 203 - - HDL ( )
D’Hollander [28] 2016 Zynq-7000 99 100 - 5 - - HLS ( )
Guan [29] 2017 Stratix V 95 150 - 100 - 2.92 HDL/HLS ( )
Moss [19] 2018 HARPv2 99 313 - 800 - 22.0 HDL ( )
This work 2019 VCU1525 69−90 146−190 606 409 122 10.9 HLS ( )
Table 3: Comparison to previous FPGA implementations. †Simulation only.
matrix multiplication for DNN acceleration on two Xilinx Ultra-
Scale chips, showing how peak DSP utilization and frequency can
be reached, at the expense of generality, as the approach relies on
low-level details of the chips’ architecture, and as no other data
types are supported.
Kumar et al. [32] provide an analysis of the trade-off between
I/O bandwidth and on-chip memory for their implementation of
64-bit matrix multiplication. The authors arrive at a square output
tile when deriving the constraints for overlapping I/O, although the
derived computational intensity is reduced by a factor
√
2 as above
from double buffering. In our model, the fast memory utilization is
captured explicitly, and is maximized in terms of on-chip memory
characteristics of the target FPGA, allowing tile sizes that optimize
both computational performance and computational intensity to
be derived directly. Lin and Leong [34] model sparse MMM, with
dense MMM as a special case, and project that even dense matrix
multiplication may become I/O bound in future FPGA generations.
Our model guides how to maximize utilization in terms of available
on-chip memory to mitigate this, by capturing their characteristics
in the tiling hierarchy.
Finally, the works above were implemented in hardware descrip-
tion languages, and do not disclose the source code allowing their
findings to be reproduced or ported to other FPGAs. For the results pre-
sented here, we provide a high-level open source implementation,
to encourage reusability and portability of FPGA codes.
Designing I/O minimizing algorithms has been an active field
of research for more than 40 years. Starting with register alloca-
tion problems [26], through single processor, two-level memory
system [8], distributed systems with fixed [9] and variable memory
size [2]. Although most of the work focus on linear algebra [2, 9, 10]
due to its regular access pattern and powerful techniques like poly-
hedral modeling, the implication of these optimizations far exceeds
this domain. Gholami et al. [35] studied model and data parallelism
of DNN in the context of minimizing I/O for matrix multiplica-
tion routines. Demmel and Dinh [12] analyzed I/O optimal tiling
strategies for convolutional layers of NN.
7 CONCLUSION
We present a high-performance, open source, flexible, portable, and
scalable matrix-matrix multiplication implementation on FPGA,
which simultaneously maximizes performance and minimizes off-
chip data movement. By starting from a general model for com-
putation, I/O, and resource consumption, we create a hardware
architecture that is optimized to the resources available on a tar-
get device, and is thus not tied to specific hardware. We evaluate
our implementation on a wide variety of data types and configura-
tions, showing 409GOp/s 32-bit floating point performance, and
1.5 TOp/s 8-bit integer performance, utilizing >80% of hardware
resources. We show that our model-driven I/O optimal design is ro-
bust and high-performant in practice, yielding better or comparable
performance to HDL-based implementations, and conserving band-
width to off-chip memory, while being easy to configure, maintain
and modify through the high-level HLS source code.
ACKNOWLEDGMENTS
We thank Xilinx for generous donations of software and hard-
ware. This project has received funding from the European Re-
search Council (ERC) under the European Union’s Horizon 2020
programme (grant agreement DAPP, No. 678880).
REFERENCES
[1] W. A. Wulf and S. A. McKee, “Hitting the memory wall: Implications of the
obvious,” SIGARCH Comput. Archit. News, vol. 23, no. 1, pp. 20–24, Mar. 1995.
[2] E. Solomonik and J. Demmel, “Communication-optimal parallel 2.5D matrix mul-
tiplication and LU factorization algorithms,” in Euro-Par 2011 Parallel Processing,
E. Jeannot, R. Namyst, and J. Roman, Eds. Berlin, Heidelberg: Springer Berlin
Heidelberg, 2011, pp. 90–109.
[3] M. Scquizzato and F. Silvestri, “Communication lower bounds for distributed-
memory computations,” arXiv preprint arXiv:1307.1805, 2013.
[4] J. Demmel, D. Eliahu, A. Fox, S. Kamil, B. Lipshitz, O. Schwartz, and O. Spillinger,
“Communication-optimal parallel recursive rectangular matrix multiplication,”
in Proceedings of the 2013 IEEE 27th International Symposium on Parallel and
Distributed Processing, ser. IPDPS ’13, 2013, pp. 261–272.
[5] A. Aggarwal and S. Vitter, Jeffrey, “The input/output complexity of sorting and
related problems,” Commun. ACM, vol. 31, no. 9, Sep. 1988.
[6] Intel. (2007) Intel Math Kernel Library (MKL). [Online]. Available: https:
//software.intel.com/en-us/mkl
[7] E. Anderson, Z. Bai, C. Bischof, L. S. Blackford, J. Demmel, J. Dongarra, J. Du Croz,
A. Greenbaum, S. Hammarling, A. McKenney et al., LAPACK Users’ guide. SIAM,
1999.
[8] H. Jia-Wei and H.-T. Kung, “I/O complexity: The red-blue pebble game,” in Pro-
ceedings of the thirteenth annual ACM symposium on Theory of computing. ACM,
1981, pp. 326–333.
[9] D. Irony, S. Toledo, and A. Tiskin, “Communication lower bounds for distributed-
memory matrix multiplication,” Journal of Parallel and Distributed Computing,
vol. 64, no. 9, pp. 1017–1026, 2004.
[10] G. A. Geist and E. Ng, “Task scheduling for parallel sparse cholesky factorization,”
Int. J. Parallel Program., vol. 18, no. 4, pp. 291–314, Jul. 1990.
[11] M. Anderson, G. Ballard, J. Demmel, and K. Keutzer, “Communication-avoiding
QR decomposition for GPUs,” in Parallel & Distributed Processing Symposium
(IPDPS), 2011 IEEE International. IEEE, 2011, pp. 48–58.
[12] J. Demmel and G. Dinh, “Communication-optimal convolutional neural nets,”
arXiv preprint arXiv:1802.06905, 2018.
[13] M. Christ, J. Demmel, N. Knight, T. Scanlon, and K. Yelick, “Communication lower
bounds and optimal algorithms for programs that reference arrays–part 1,” arXiv
preprint arXiv:1308.0068, 2013.
[14] V. Vanhoucke, A. Senior, and M. Z. Mao, “Improving the speed of neural net-
works on CPUs,” in Proc. Deep Learning and Unsupervised Feature Learning NIPS
Workshop, vol. 1. Citeseer, 2011, p. 4.
[15] M. Del Ben et al., “Enabling simulation at the fifth rung of DFT: Large scale RPA
calculations with excellent time to solution,” Comp. Phys. Comm., 2015.
[16] C. Lavin and A. Kaviani, “RapidWright: Enabling custom crafted implementa-
tions for FPGAs,” in 2018 IEEE 26th Annual International Symposium on Field-
Programmable Custom Computing Machines (FCCM). IEEE, 2018, pp. 133–140.
[17] T. Ben-Nun and T. Hoefler, “Demystifying parallel and distributed deep learning:
An in-depth concurrency analysis,” ACM Computing Surveys (CSUR), vol. 52, no. 4,
p. 65, 2019.
[18] A. Haidar, S. Tomov, J. Dongarra, and N. J. Higham, “Harnessing GPU tensor cores
for fast FP16 arithmetic to speed up mixed-precision iterative refinement solvers,”
in Proceedings of the International Conference for High Performance Computing,
Networking, Storage, and Analysis. IEEE Press, 2018, p. 47.
[19] D. J. Moss, S. Krishnan, E. Nurvitadhi, P. Ratuszniak, C. Johnson, J. Sim, A. Mishra,
D.Marr, S. Subhaschandra, and P. H. Leong, “A customizablematrixmultiplication
framework for the Intel HARPv2 Xeon+FPGA platform: A deep learning case
study,” in Proceedings of the 2018 ACM/SIGDA International Symposium on Field-
Programmable Gate Arrays, ser. FPGA ’18. New York, NY, USA: ACM, 2018, pp.
107–116.
[20] Z. Jovanović and V. Milutinovic, “FPGA accelerator for floating-point matrix
multiplication,” IET Computers & Digital Techniques, vol. 6, no. 4, pp. 249–256,
2012.
[21] V. Strassen, “Gaussian elimination is not optimal,” Numer. Math., vol. 13, no. 4,
pp. 354–356, Aug. 1969.
[22] P. D’Alberto and A. Nicolau, “Using recursion to boost ATLAS’s performance,”
in High-Performance Computing. Springer, 2008, pp. 142–151.
[23] H. Zhou and J. Xue, “A compiler approach for exploiting partial SIMD parallelism,”
ACM Trans. Archit. Code Optim., vol. 13, no. 1, pp. 11:1–11:26, Mar. 2016.
[24] G. Kwasniewski, M. Kabić, M. Besta, J. VandeVondele, R. Solcà, and T. Hoefler,
“Red-blue pebbling revisited: Near optimal parallel matrix-matrix multiplication,”
in Proceedings of the International Conference for High Performance Computing,
Networking, Storage, and Analysis, ser. SC ’19, 2019.
[25] D. Irony et al., “Communication lower bounds for distributed-memory matrix
multiplication,” JPDC, 2004.
[26] G. J. Chaitin, M. A. Auslander, A. K. Chandra, J. Cocke, M. E. Hopkins, and P. W.
Markstein, “Register allocation via coloring,” Computer Languages, vol. 6, no. 1,
pp. 47 – 57, 1981.
[27] J. de Fine Licht and T. Hoefler, “hlslib: Software engineering for hardware design,”
arXiv preprint arXiv:1910.04436, 2019.
[28] E. H. D’Hollander, “High-level synthesis optimization for blocked floating-point
matrix multiplication,” SIGARCH Comput. Archit. News, vol. 44, no. 4, pp. 74–79,
Jan. 2017.
[29] Y. Guan, H. Liang, N. Xu, W.Wang, S. Shi, X. Chen, G. Sun, W. Zhang, and J. Cong,
“FP-DNN: An automated framework for mapping deep neural networks onto
FPGAs with RTL-HLS hybrid templates,” in 2017 IEEE 25th Annual International
Symposium on Field-Programmable Custom Computing Machines (FCCM), April
2017, pp. 152–159.
[30] L. Zhuo and V. K. Prasanna, “Scalable and modular algorithms for floating-point
matrix multiplication on FPGAs,” in 18th International Parallel and Distributed
Processing Symposium, 2004. Proceedings., April 2004, pp. 92–.
[31] Y. Dou, S. Vassiliadis, G. K. Kuzmanov, and G. N. Gaydadjiev, “64-bit floating-
point FPGA matrix multiplication,” in Proceedings of the 2005 ACM/SIGDA 13th
International Symposium on Field-programmable Gate Arrays, ser. FPGA ’05. New
York, NY, USA: ACM, 2005, pp. 86–95.
[32] V. B. Y. Kumar, S. Joshi, S. B. Patkar, and H. Narayanan, “FPGA based high
performance double-precision matrix multiplication,” in 2009 22nd International
Conference on VLSI Design, Jan 2009, pp. 341–346.
[33] E. Wu, X. Zhang, D. Berman, and I. Cho, “A high-throughput reconfigurable
processing array for neural networks,” in 2017 27th International Conference on
Field Programmable Logic and Applications (FPL). IEEE, 2017, pp. 1–4.
[34] C. Y. Lin, H. K.-H. So, and P. H. Leong, “A model for matrix multiplication perfor-
mance on FPGAs,” in 2011 21st International Conference on Field Programmable
Logic and Applications. IEEE, 2011, pp. 305–310.
[35] A. Gholami, A. Azad, P. Jin, K. Keutzer, and A. Buluc, “Integrated model, batch and
domain parallelism in training neural networks,” arXiv preprint arXiv:1712.04432,
2017.
