HIGHLIGHTS
• A methodology to study the impact of overheads on runtime performance is proposed.
• Three types of efficiency are introduced - area efficiency, frequency efficiency and cycle efficiency - and combined to define a global efficiency.
• Analytical formulas are presented to measure and to compute the respective efficiencies.
Efficiency Analysis Methodology of FPGAs
based on Lost Frequencies, Area and Cycles.
Jan Lemeire (a,b,c,*), Bruno da Silva (a,b), An Braeken (a), Jan G. Cornelis (b,c), Abdellah Touhafi (a)
(a) Dept. of Industrial Sciences (INDI), Vrije Universiteit Brussel (VUB), Pleinlaan 2, B-1050 Brussels, Belgium
(b) Dept. of Electronics and Informatics (ETRO), Vrije Universiteit Brussel (VUB), Pleinlaan 2, B-1050 Brussels, Belgium
(c) Data Science Dept., iMinds, Technologiepark 19, 9052 Zwijnaarde, Belgium
Abstract
We propose a methodology to study and to quantify efficiency and the impact of overheads on runtime performance. Most work on High-Performance Computing (HPC) for FPGAs only studies runtime performance or cost, while we are interested in how far we are from peak performance and, more importantly, why. The efficiency of runtime performance is defined with respect to the ideal computational runtime in the absence of inefficiencies. The analysis of the difference between actual and ideal runtime reveals the overheads and bottlenecks. A formal approach is proposed to decompose the efficiency into three components: frequency, area and cycles. After quantification of the efficiencies, a detailed analysis has to reveal the reasons for the lost frequencies, lost area and lost cycles. We propose a taxonomy of possible causes and practical methods to identify and quantify the overheads. The methodology is applied to a number of use cases for illustration. We show the interaction between the three components of efficiency and how bottlenecks are revealed.
Keywords: FPGA, lost cycle analysis, performance efficiency, High-Performance
Computing, High-Level Synthesis, Vivado HLS
∗Corresponding author
Email address: jlemeire@etrovub.be (Jan Lemeire)
1. Introduction

Field-Programmable Gate Arrays (FPGAs) can be used as accelerators that provide high computational performance combined with power efficiency. In this context, insight into all performance aspects is crucial. Traditionally, however, a performance analysis is mostly limited to measuring or estimating performance (e.g. throughput), comparing performance with that of CPUs or GPUs, identifying the overheads and measuring the performance per watt. See for instance the HPC applications reported in [1]. Insight into how good the performance of a proposed design is, is often lacking.

The goal of our endeavor is to put forward a formal methodology to analyze the efficiency of FPGA implementations. The methodology intends to explain and quantify why peak performance is not obtained, i.e. why the efficiency is lower than 100%. As we want to get the maximal performance out of an FPGA for a certain algorithm, we want to know how far we are from the peak performance, why the peak performance is not reached and whether improvement is possible. The factors that cause inefficiencies are the overheads. Insight into efficiency will help developers improve FPGA implementations and compare different implementations of a given algorithm.

The proposed methodology starts with defining the peak performance and the efficiency of an FPGA implementation. The global efficiency is then decomposed into 3 components (frequency, area and cycles), which can be used to quantify the efficiency losses and steer the identification of the different reasons for these losses. We show that lost frequencies, area or cycles can be identified and analyzed separately, although they are not independent when optimizing the performance: changing one component might affect another component.

The main scenario discussed in this paper is achieving maximal computational performance. We focus on maximum computational throughput regardless of cost, power or other considerations. Nevertheless, our methodology can also be used for alternative scenarios, such as striving to deliver the maximum possible performance within a space and/or power budget (performance per watt). Because we focus on computational performance, we limit ourselves to implementations that are compute bound rather than memory bound.

We start by discussing related work. Then, in Section 3, we propose our methodology, followed by a discussion of its practical usage in Section 4. Section 5 analyzes the overheads responsible for inefficiencies. The application to FPGA implementations is demonstrated in Section 6.
2. Related work

FPGAs are used for HPC in several domains, such as bioinformatics [2], linear algebra [3], stock market analysis [4] and image processing. It has been shown that FPGAs can provide significant speedups in these domains [1]. The researchers report runtime performance and speedups compared to CPUs. Efficiency and the corresponding bottlenecks (limiting factors) are often not addressed, as the performance analysis is in most cases only devoted to the exploration of the design space. Skalicky et al. [5], for instance, provide an analytical model of the performance of several pipelined linear algebra designs which intends to identify design bottlenecks and improve performance. They focus on estimating execution time. They state that the 'performance of a computation depends on the implementation's efficient use of available resources', but are unable to give figures on the efficiency [6]. Another methodology, called the Reconfigurable computer Amenability Test (RAT), is intended to model the critical set of algorithm and platform attributes in order to estimate the performance of a specific design, not a generic algorithm [7]. By measuring the resource consumption, RAT seeks to determine the scalability of an application design. In summary, RAT provides a methodology for rapidly analyzing an application design's compatibility with a specific FPGA platform. It estimates the throughput of an accelerator based on parameters like the interconnect's speed, the amount of data to be transferred, the number of operations performed per data element and the clock frequency of the accelerator. Again, they do not estimate efficiency, although their work could easily be extended with the efficiency analysis proposed in this paper. On the other hand, they also consider bandwidth, while we concentrate on computational performance.
Interesting work in efficiency analysis is [8]. They designed a framework for runtime performance analysis of High-Level Language applications for FPGAs, including an automated tool for performance analysis. It is able to determine the main bottlenecks and to recognize common performance problems, such as potentially slow communication functions or idle hardware processes, through instrumentation, measurement, analysis and visualization. Indeed, instrumentation enables access to application data at runtime. Through tracing, occurrences of events are logged together with any associated data. In this way they are able to count the number of cycles spent waiting for a transfer to complete. It enables the tracking of overheads due to control hardware employed to maintain program order, pipelines and communication channels. Our methodology provides the overarching metrics which allow the overheads to be put in context and their impact on the global performance to be quantified.

The work of Koehler et al. [9, 10] on bottleneck detection is related to the search for the causes of the bottlenecks, which belongs to the second part of our methodology. They define a bottleneck as some portion of the application that reduces performance for the application as a whole. It is the most important work on enumerating and categorizing bottlenecks for FPGA applications. In our methodology we link bottlenecks to the decrease in efficiency and as such quantify the impact of overheads on the runtime performance. Koehler et al. [9, 10] estimate the possible speedup of removing bottlenecks by a kind of simulation based on the traced execution profile. The same approach is possible in our methodology, as will become clear when analyzing the lost cycles.
The work on design space exploration [18, 19, 23] also lacks estimates of the efficiency. Sirowy and Forin [18] show the impact of optimization strategies on the global runtime but do not discuss the effect on area and cycle efficiencies. Zhong et al. [23] propose two definitions of area efficiency: one as a sum, over all component types, of the ratio of used components to the number of components (Eq. 18); a second definition takes the maximum over all these ratios. We show in Sec. 3.7.1 that the impact of these ratios on the global efficiency is more complex than a sum or a maximum (Eq. 25). Our work adds a quantified efficiency analysis in a rigorous and 'complete' way.
As we will discuss, our analysis is influenced by concepts of performance analysis in parallel computing. The work of Beltran et al. [11] defines performance and efficiency metrics for FPGA-based multiprocessor systems. As base reference they consider uniprocessor performance. Efficiency is defined as speedup divided by the number of processors. This is one of the fundamental definitions in parallel computing [12]. Our approach differs: we define efficiency with respect to the performance that the FPGA's computing elements could deliver in the ideal situation.

Our methodology is also based on the philosophy of Crovella and LeBlanc's lost cycle analysis [13]. The idea is that a number of instructions have to be executed (called the useful work). The reference ideal situation is based on these useful instructions: a number of cycles (L_opt) are needed to execute them, assuming no overhead. This corresponds to the sequential implementation of the algorithm running on a single processor. This ideal situation is compared with the actual number of cycles (L_imp) of the implementation under study. The overhead is then L_imp - L_opt/p, with p the number of processors. This overhead is expressed in lost cycles: all p processors consume L_imp cycles, of which L_opt/p are necessary to execute the useful instructions. The other cycles could have been used; consequently, these cycles represent the overhead of the implementation. A performance analysis should study these lost cycles and try to identify their causes.
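As a small illustration of this accounting, the sketch below computes the lost cycles and the resulting efficiency for hypothetical values of L_opt, L_imp and p (the names and numbers are ours, not from [13]):

```cpp
// Minimal sketch of Crovella and LeBlanc's lost-cycle accounting [13].
#include <cstdio>

int main() {
    const double L_opt = 1.0e6;  // cycles of useful work (sequential reference)
    const double L_imp = 3.0e5;  // measured cycles of the parallel run
    const double p     = 4.0;    // number of processors

    // Each processor consumes L_imp cycles, of which L_opt/p are useful.
    double lost_per_proc = L_imp - L_opt / p;    // lost cycles per processor
    double efficiency    = (L_opt / p) / L_imp;  // fraction of cycles used usefully

    std::printf("lost cycles per processor: %.0f\n", lost_per_proc);
    std::printf("efficiency: %.2f\n", efficiency);
}
```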
3. Formal definitions of the efficiency analysis methodology
Before defining and decomposing performance efficiency, we start by defining a model of an FPGA and of the implementation under study. Table 1 summarizes all concepts of the methodology.
3.1. FPGA model

An FPGA consists of R^j components of type j. The maximum frequency at which the FPGA can run is f_peak. An FPGA requires a certain number of components to implement an operation. As there are several types of components and several possible configurations, each type of operation i is mapped onto a vector of resources, also referred to as the cost vector of the operation. For the sake of simplicity, we assume that only one type of component is needed to execute an operation instead of considering complex cost vectors. We denote with r^i_j the number of components of type j that are needed to execute an operation of type i. The parameter r^i_j is set to infinity if the component is unable to execute the operation. In reality, components of multiple types might be needed to execute an operation. Another simplification is the assumption that cost vectors are constant, while in practice they sometimes depend on factors such as the target frequency. Each r^i_j comes with a certain issue latency λ^{i,j}. The issue latency is defined as the number of cycles after which the execution of the next operation can be initiated. Note that the issue latency is different from the completion or end-to-end latency, which equals the total number of cycles required to terminate the complete execution.

As we will see, for analyzing the efficiency it is sufficient to concentrate on the components and operations that limit the performance. The others can be disregarded in the analysis. In many cases this greatly simplifies the performance analysis.

FPGA
  R^j            - number of components of type j
  f_peak         - maximum frequency
  r^i_j          - number of components of type j needed to execute instruction type i
  λ^{i,j}        - latency of issuing an instruction of type i on component type j
  λ^{i,j}_op     - operational latency (= r^i_j · λ^{i,j})

Implementation
  N^i_op         - useful operations of type i
  U              - fraction of the FPGA area that is used
  R^{i,j}_imp    - number of components of type j used for the useful computations of type i

Optimal performance
  T_opt          - optimal runtime based on the total FPGA
  T'_opt         - optimal runtime when using the fraction U of the FPGA area
  R^i_opt        - number of components that each instruction type uses in the optimal configuration

Execution
  T_run          - actual runtime of the implementation
  f_imp          - actual frequency
  L_imp          - number of cycles used to execute the implementation

Efficiency
  E              - total FPGA efficiency
  E'             - occupied FPGA efficiency
  E_freq         - frequency efficiency
  E^j_area       - area efficiency
  E'^j_area      - used area efficiency
  E^{i,j}_cycle  - cycle efficiency

Table 1: Overview of the different parameters of the efficiency analysis. Index i refers to the instruction type and j to the component type. The index is dropped when representing aggregated values or when only one type is considered.
3.2. Useful work of an implementation
For the implementation under study, we first have to identify how many operations have to be executed for each operation type i (e.g. additions, multiplications, ...). This is denoted by N^i_op. We focus on the operations inherently present in the algorithm while discarding all overheads that are caused by the implementation. We call them the useful operations. The choice of which instructions are useful is somewhat arbitrary. For instance, counting loop control operations as useful is a matter of choice. Operations that are not counted as useful contribute to the overhead and decrease the overall efficiency. Sometimes it is interesting to know the loop control overhead. In parallel computing we regard the sequential algorithm as the reference algorithm, of which all instructions are necessary and 'useful' [13]. Additional instructions introduced by a parallel implementation are considered overhead and not useful. Here we can likewise consider a sequential C implementation of the algorithm as the reference. The useful operations are independent of the implementation: they are 'inevitable', the minimal number of operations required to execute the algorithm.

To explain our methodology and its philosophy, we start by assuming that the implementation under study consists of one operation type (e.g. floating-point) which can be executed by just one component type. Later we will extend it to heterogeneous operations and components.
3.3. Peak performance
The efficiency of an implementation will be defined with respect to the optimal performance that would be reached in the ideal case. The implementation under study has to execute N_op useful operations. If there are R components available and r components are needed to execute the operation under study with an issue latency of λ (as we focus on one type of operation and component, we drop the indices i and j), the ideal runtime would be:

$$T_{opt} = \frac{N_{op} \cdot \lambda}{f_{peak} \cdot \lfloor R/r \rfloor} \approx \frac{N_{op} \cdot \lambda \cdot r}{f_{peak} \cdot R} = \frac{N_{op} \cdot \lambda_{op}}{f_{peak} \cdot R} \qquad (1)$$
where ⌊R/r⌋ gives the number of computational units that can be made. Since r is often small and R large, the approximation error remains small. The approximation eases the further elaboration. λ_op is defined as the product of the issue latency and the number of components required to execute the operation (λ_op ≜ λ · r). We call it the operational latency. For the analysis it is equivalent whether two components are needed to execute an operation with an issue latency of one cycle, or one component can do it with an issue latency of two cycles.
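As a quick numerical illustration of Eq. 1, the sketch below evaluates both the exact expression with the floor and its approximation; all numbers are hypothetical (the device values merely echo the Virtex-6 used later):

```cpp
// Sketch of Eq. (1): ideal runtime with and without the floor approximation.
#include <cmath>
#include <cstdio>

int main() {
    const double N_op   = 1.0e6;   // useful operations (hypothetical)
    const double lambda = 2.0;     // issue latency (cycles)
    const double r      = 5.0;     // components per operation (chosen so the floor matters)
    const double R      = 768.0;   // components on the FPGA (e.g. DSPs)
    const double f_peak = 484e6;   // peak frequency (Hz)

    double T_exact   = N_op * lambda / (f_peak * std::floor(R / r));
    double lambda_op = lambda * r;                        // operational latency
    double T_approx  = N_op * lambda_op / (f_peak * R);   // approximated Eq. (1)

    std::printf("T_opt exact:  %.4f ms\n", 1e3 * T_exact);
    std::printf("T_opt approx: %.4f ms\n", 1e3 * T_approx);  // slightly smaller
}
```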
Despite the fact that attaining the theoretical peak performance is not realistic [21], it offers a reference or yardstick. In the ideal case, each component can start a useful operation every λ cycles. At first we do not consider any practical issues that would prevent a component from doing so. This is done in the next step, when we analyze the efficiency, i.e. the reasons why the ideal case is not attainable. Comparing the actual execution with the ideal case gives us the necessary insight into all issues to consider when using FPGAs for HPC. The values obtained for peak performance using this method should not be used to represent the achievable performance of a given device. We do not want to put forward this peak performance as a realisable peak performance, but as a yardstick against which to compare the performance of actual implementations and to be used in the discussion of inefficiencies. Another advantage of this definition is that the proposed approach offers a methodological way of performing a qualitative and quantitative efficiency analysis.
3.4. Total FPGA Efficiency
The total FPGA efficiency of an implementation is defined as

$$E \triangleq T_{opt}/T_{run} \qquad (2)$$

where T_run is the actual runtime of the implementation. Alternatively, the efficiency can be calculated based on the performance (expressed in operations per second):

$$Performance_{peak} = f_{peak} \cdot R/\lambda_{op} \qquad (3)$$

$$Performance_{imp} = N_{op}/T_{run} \qquad (4)$$

$$E = \frac{Performance_{imp}}{Performance_{peak}} = \frac{N_{op} \cdot \lambda_{op}}{T_{run} \cdot f_{peak} \cdot R} \qquad (5)$$
3.5. Occupied FPGA Efficiency
The performance efficiency targets an effective use of the whole FPGA. Often, however, one wants to optimize the used area instead, for instance to limit area consumption for energy reasons. With U we denote the fraction of the FPGA area that is used for the implementation. If we define R' = U · R, then the ideal runtime of Eq. 1 becomes

$$T'_{opt} = \frac{N_{op} \cdot \lambda_{op}}{f_{peak} \cdot R'} \qquad (6)$$

The occupied FPGA efficiency is then defined as

$$E' \triangleq T'_{opt}/T_{run} \qquad (7)$$

It follows that E = U · E'. Depending on the main optimization goal, one of the two efficiencies should be considered. This consideration is further discussed in Sec. 6.1.
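A one-line numerical check of the identity E = U · E' (all values hypothetical):

```cpp
// Check of E = U * E' (Eqs. 2, 6 and 7) on hypothetical values.
#include <cstdio>

int main() {
    const double N_op = 1.0e6, lam_op = 4.0, f_peak = 484e6, R = 768.0;
    const double U     = 0.25;    // fraction of the FPGA area occupied
    const double T_run = 1.0e-4;  // measured runtime (s)

    double T_opt  = N_op * lam_op / (f_peak * R);      // Eq. (1)
    double T_opt2 = N_op * lam_op / (f_peak * U * R);  // Eq. (6), R' = U*R
    double E  = T_opt  / T_run;                        // Eq. (2)
    double Ep = T_opt2 / T_run;                        // Eq. (7)
    std::printf("E = %.4f, U*E' = %.4f\n", E, U * Ep); // identical by construction
}
```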
3.6. Efficiency Decomposition
The efficiency is decomposed into three basic components (frequency, area and cycles). Define R_imp (≤ R') as the number of components that are used for the useful computations. Define f_imp as the actual frequency at which the FPGA executes the implementation, and L_imp the number of cycles used to execute it. The relation with the runtime is given by:

$$T_{run} = L_{imp}/f_{imp} \qquad (8)$$

It follows that the performance efficiency can be decomposed as

$$E = \frac{N_{op} \cdot \lambda_{op}/(f_{peak} \cdot R)}{L_{imp}/f_{imp}} \qquad (9)$$

$$= \frac{N_{op} \cdot \lambda_{op}}{L_{imp} \cdot R_{imp}} \cdot \frac{R_{imp}}{R} \cdot \frac{f_{imp}}{f_{peak}} \qquad (10)$$

$$= E_{freq} \cdot E_{area} \cdot E_{cycle} \qquad (11)$$

These components allow us to analyse overheads in detail. The frequency efficiency is known at design time:

$$E_{freq} \triangleq f_{imp}/f_{peak} \qquad (12)$$

The area efficiency is defined as the number of FPGA components that participate in doing the useful computations, divided by the total number of FPGA components:

$$E_{area} \triangleq R_{imp}/R \qquad (13)$$
If the occupied FPGA efficiency is considered, R' should be substituted in the definitions and the area efficiency becomes:

$$E'_{area} \triangleq R_{imp}/R' \qquad (14)$$

In the following subsections one has to substitute R with R' to retrieve the definitions for the used area efficiency. The cycle efficiency is defined as

$$E_{cycle} \triangleq \frac{N_{op} \cdot \lambda_{op}}{L_{imp} \cdot R_{imp}} = \frac{\lambda_{op}}{\lambda_{imp}} \qquad (15)$$

with λ_imp the average operational latency for the useful computations on the R_imp components:

$$\lambda_{imp} = \frac{L_{imp} \cdot R_{imp}}{N_{op}} \qquad (16)$$
Note that N_op · λ_op/R_imp equals L_opt, the number of cycles needed at peak performance, given that only R_imp components are used. It follows that E_cycle = L_opt/L_imp.

So, besides measuring the global efficiency by comparing the runtimes (Eq. 2), the three basic components of efficiency can be obtained separately with Eqs. 12, 13 and 15. Multiplying the three efficiency components should give the same value for the global efficiency. Each loss in efficiency is due to some overheads. The second phase of the analysis consists of identifying the causes for the losses in frequency, area and cycles. This is tackled in Sec. 5.
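The sketch below assembles the decomposition for one operation and component type: it computes the three components and checks that their product equals T_opt/T_run (the equality holds by construction; all input values are hypothetical):

```cpp
// Decomposition check: E_freq * E_area * E_cycle == T_opt / T_run.
#include <cstdio>

int main() {
    // FPGA and implementation parameters (hypothetical)
    const double f_peak = 484e6, f_imp = 104e6;
    const double R = 768.0, R_imp = 7.0;
    const double N_op = 6400.0, lambda_op = 2.0;  // aggregated useful work
    const double L_imp = 2601.0;

    double E_freq  = f_imp / f_peak;                      // Eq. (12)
    double E_area  = R_imp / R;                           // Eq. (13)
    double E_cycle = N_op * lambda_op / (L_imp * R_imp);  // Eq. (15)

    double T_opt = N_op * lambda_op / (f_peak * R);       // Eq. (1)
    double T_run = L_imp / f_imp;                         // Eq. (8)

    std::printf("E components: %.4f * %.4f * %.4f = %.6f\n",
                E_freq, E_area, E_cycle, E_freq * E_area * E_cycle);
    std::printf("E direct (Eq. 2): %.6f\n", T_opt / T_run);
}
```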
3.7. Extensions to multiple operations and component types
Our methodology supports multiple operations and multiple component types. Index i is used to denote the operation type and index j the component type. For the general heterogeneous case, the efficiency components are defined as:

$$E_{freq} \triangleq f_{imp}/f_{peak} \qquad (17)$$

$$E^j_{area} \triangleq \sum_i R^{i,j}_{imp}/R^j \qquad (18)$$

$$E'^j_{area} \triangleq \sum_i R^{i,j}_{imp}/(U \cdot R^j) \qquad (19)$$

$$E^{i,j}_{cycle} \triangleq \frac{N^{i,j}_{op} \cdot \lambda^{i,j}_{op}}{L_{imp} \cdot R^{i,j}_{imp}} \qquad (20)$$

where N^{i,j}_op is the number of instructions of type i executed on components of type j, such that N^i_op = Σ_j N^{i,j}_op. The same computational unit can be constructed out of different components. Frequency efficiency is an overall efficiency, while area efficiency is per component type and cycle efficiency per component and operation type.
In the following we derive the peak performance and the equations that relate the
efficiency components with the global efficiency (defined by Eq. 2).
3.7.1. One operation type and multiple component types
Given the component types, the ideal runtime becomes (superscript j denotes the component type; we drop index i):

$$T_{opt} = \frac{N_{op}}{\left(\sum_j R^j/\lambda^j_{op}\right) \cdot f_{peak}} \qquad (21)$$
The global efficiency turns out to be a kind of weighted average of area and cycle efficiency, where N^j_op/N_op is the weight:

$$E = \frac{T_{opt}}{T_{run}} \qquad (22)$$

$$= \frac{N_{op}/\left(\left(\sum_j R^j/\lambda^j_{op}\right) \cdot f_{peak}\right)}{L_{imp}/f_{imp}} \qquad (23)$$

$$= \frac{1}{\sum_j \frac{R^j \cdot L_{imp}}{\lambda^j_{op} \cdot N_{op}}} \cdot \frac{f_{imp}}{f_{peak}} \qquad (24)$$

$$= \frac{1}{\sum_j \frac{N^j_{op}}{N_{op}} \cdot \frac{1}{E^j_{cycle} \cdot E^j_{area}}} \cdot E_{freq} \qquad (25)$$
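Eq. 25 can be checked numerically: the sketch below computes the global efficiency both via the weighted combination of per-type efficiencies and directly via Eq. 2 with the runtime of Eq. 21 (hypothetical numbers; the two results coincide by construction):

```cpp
// Numerical check of Eq. (25) for two component types.
#include <cstdio>

int main() {
    const double f_imp = 200e6, f_peak = 484e6;
    const double L_imp = 5000.0, N_op = 8.0e5;

    // Per component type j: available components R^j, operational latency
    // lambda_op^j, useful operations N^j executed there, components used.
    const double R[2]      = {768.0, 300.0};
    const double lam_op[2] = {2.0, 3.0};
    const double N[2]      = {6.0e5, 2.0e5};
    const double R_imp[2]  = {400.0, 100.0};

    double denom = 0.0;
    for (int j = 0; j < 2; ++j) {
        double E_area  = R_imp[j] / R[j];                        // Eq. (18)
        double E_cycle = N[j] * lam_op[j] / (L_imp * R_imp[j]);  // Eq. (20)
        denom += (N[j] / N_op) / (E_cycle * E_area);             // weights of Eq. (25)
    }
    double E = (1.0 / denom) * (f_imp / f_peak);                 // Eq. (25)

    // Cross-check via Eq. (2) with the ideal runtime of Eq. (21):
    double T_opt = N_op / ((R[0] / lam_op[0] + R[1] / lam_op[1]) * f_peak);
    double T_run = L_imp / f_imp;
    std::printf("E via Eq. 25: %.4f, via Eq. 2: %.4f\n", E, T_opt / T_run);
}
```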
3.7.2. Multiple operation types and one component type
Let N^i_op denote the useful operations per instruction type i. Since we focus on only one component type, we take the cost vector with the minimal λ^i_op if multiple implementations are possible. For optimal execution, the components are divided among the different instruction types according to N^i_op · λ^i_op:

$$T_{opt} = \frac{\sum_i N^i_{op} \cdot \lambda^i_{op}}{R \cdot f_{peak}} \qquad (26)$$

Let R^i_opt be the number of components that each instruction type uses. To reach peak performance, we should divide the available components according to the following rates:

$$R^i_{opt} = \frac{N^i_{op} \cdot \lambda^i_{op}}{\sum_i N^i_{op} \cdot \lambda^i_{op}} \qquad (27)$$
In the implementation under study, R^i_imp components are used to execute instructions of type i. The impact of the local efficiencies on the global efficiency is given by the following equations:

$$E = \frac{\left(\sum_i N^i_{op} \cdot \lambda^i_{op}\right)/(R \cdot f_{peak})}{L_{imp}/f_{imp}} \qquad (28)$$

$$= \frac{\sum_i N^i_{op} \cdot \lambda^i_{op}/R}{L_{imp}} \cdot \frac{f_{imp}}{f_{peak}} \qquad (29)$$

$$= \sum_i \left(E^i_{cycle} \cdot \frac{R^i_{imp}}{R}\right) \cdot E_{freq} \qquad (30)$$

$$= \sum_i \left(E^i_{cycle} \cdot \frac{R^i_{imp}}{\sum_i R^i_{imp}}\right) \cdot E_{area} \cdot E_{freq} \qquad (31)$$
The factors R^i_imp/Σ_i R^i_imp are based on the actual 'distribution' of the components across the total number of components. This rate might differ from the optimal distribution given by Eq. 27. When more components are devoted to an operation than the ideal balance prescribes, its cycle efficiency will be lower than that of the other operations. Either there is more overhead, or the reason is an unequal distribution: the other operations are executed at maximal performance, but the components executing this operation cannot be kept busy because too many resources were reserved for it.

Eq. 29 shows that it makes sense to define an aggregate cycle efficiency as

$$E_{cycle} = \sum_i \frac{N^i_{op} \cdot \lambda^i_{op}}{R \cdot L_{imp}} \qquad (32)$$

The global efficiency remains a multiplication of the 3 efficiency components.
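A small numerical instance of Eq. 27 (hypothetical operation counts, loosely echoing the HOG example later):

```cpp
// Sketch of Eq. (27): the optimal division of R components over operation
// types is proportional to N_op^i * lambda_op^i.
#include <cstdio>

int main() {
    const double R = 768.0;
    const double N[2]      = {3072.0, 1024.0};  // e.g. additions, multiplications
    const double lam_op[2] = {2.0, 3.0};

    double total = 0.0;
    for (int i = 0; i < 2; ++i) total += N[i] * lam_op[i];
    for (int i = 0; i < 2; ++i) {
        double frac = N[i] * lam_op[i] / total;  // rate of Eq. (27)
        std::printf("type %d: fraction %.3f -> %.0f components\n",
                    i, frac, frac * R);          // 2/3 -> 512, 1/3 -> 256
    }
}
```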
3.7.3. Multiple instruction types and multiple component types
Finally, we consider the most general case, in which each component type can execute several instruction types. The optimal configuration (the mapping of instructions onto components) determines the peak performance. Some of the component types will be fully used; they determine the peak performance. Peak performance is reached with a configuration that minimizes the runtime:

$$T_{opt} = \min_{conf}\left(\max_{i,j}\left(\frac{N^{i,j}_{op} \cdot \lambda^{i,j}_{op}}{R^{i,j} \cdot f_{peak}}\right)\right) \qquad (33)$$

with conf iterating over all possible configurations, resulting in different values for N^{i,j}_op and R^{i,j}. Also, Σ_j N^{i,j}_op = N^i_op and Σ_i R^{i,j} = R^j.
Denote the component type that bounds peak performance by j = J and the operation type by i = I; then the efficiency becomes:

$$E = \frac{\left(N^{I,J}_{op} \cdot \lambda^{I,J}_{op}/R^{I,J}_{opt}\right)/f_{peak}}{L_{imp}/f_{imp}} \qquad (34)$$

$$= \frac{N^{I,J}_{op} \cdot \lambda^{I,J}_{op}/R^{I,J}_{opt}}{L_{imp}} \cdot \frac{f_{imp}}{f_{peak}} \qquad (35)$$

$$= E^{I,J}_{cycle} \cdot \frac{R^{I,J}_{imp}}{R^{I,J}_{opt}} \cdot E_{freq} \qquad (36)$$

$$= E^{I,J}_{cycle} \cdot \frac{R^{I,J}_{imp}}{\sum_i R^{i,J}_{imp}} \cdot \frac{R^J}{R^{I,J}_{opt}} \cdot E^J_{area} \cdot E_{freq} \qquad (37)$$
The weight factors in the equation have an explicit meaning: the actual rate of used components times the inverse of the rate of the optimal configuration. The product is 1 if they are equal. An actual implementation will typically be based on a different configuration. If more components are used (R^{I,J}_imp/Σ_i R^{i,J}_imp > R^{I,J}_opt/R^J), then the cycle efficiency is increased. The total efficiency can, however, not be greater than one, since such a non-optimal configuration will induce additional overheads of another type.

As discussed in Sec. 5.3, the instruction type I and component type J that determine the optimal runtime are not necessarily the bounding instruction and component types of the actual implementation. Other instruction and component types might have to be considered as well.
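A brute-force reading of Eq. 33 over a toy set of two candidate configurations (the work and resource assignments are hypothetical); the runtime of each configuration is set by its slowest (i, j) pair:

```cpp
// Sketch of Eq. (33): minimize, over configurations, the maximum per-pair time.
#include <algorithm>
#include <cstdio>

int main() {
    const double f_peak = 484e6;
    // Two hypothetical configurations; each maps two (i,j) pairs to
    // {N_op^{i,j} * lambda_op^{i,j}, R^{i,j}} (work in cycles, components).
    const double work[2][2] = {{6.0e6, 3.0e6}, {6.0e6, 3.0e6}};
    const double res [2][2] = {{600.0, 168.0}, {512.0, 256.0}};

    double best = 1e30;
    for (int c = 0; c < 2; ++c) {
        double t = 0.0;  // runtime of configuration c: its slowest (i,j) pair
        for (int p = 0; p < 2; ++p)
            t = std::max(t, work[c][p] / (res[c][p] * f_peak));
        std::printf("config %d: %.3e s\n", c, t);
        best = std::min(best, t);
    }
    std::printf("T_opt = %.3e s\n", best);  // configuration 1 wins here
}
```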
4. Practical usage

Our efficiency analysis is not linked to any particular step in the design flow. However, the accuracy of the performance values is determined by the stage at which the data is collected. Figure 1 shows the impact of optimizations on the throughput depending on the stage of Xilinx's design flow [15]. Modifications at high level have the highest impact on the final performance. This motivates a deeper analysis of the potential designs at an early stage, before going through the whole design flow. Additionally, High-Level Synthesis (HLS) tools offer fast design-space exploration, which facilitates such an analysis. The examples used to introduce our analysis consider traditional design tools such as Xilinx ISE and the Xilinx HLS tool Vivado HLS. This HLS tool accepts C-based languages such as C, C++ and SystemC as input and converts the source code into a synthesizable RTL module.

Vivado HLS generates reports with estimations of the FPGA resource utilization, latency and throughput of the resulting RTL module. The HLS design can go through the next stages (RTL, place and route), which produce more accurate statistics on the design. Together with the Analysis viewer (discussed later), this provides enough information to obtain the parameters of our analysis:
Figure 1: Design flow and level of the impact of the design decisions on performance (HLS (C, C++): ~1000x; RTL synthesis: ~10x; placement: ~1.2x; routing: ~1.1x).

• R^j, f_peak, λ^{i,j}_op: these hardware parameters are usually specified in technical reports.
• N^i_op: the number of useful operations is retrieved from the Vivado HLS report, based on the number of instances, and from the Analysis viewer, by measuring how many times a particular operation i is executed per iteration. The number of iterations is known (or estimated for variable loops). Both values define the total number of useful operations of type i. One can separate useful operations from overheads (such as control operations) in two ways. First, one can label the source code parts containing the useful operations (such as the computations of a loop body) in Vivado HLS, such that the computational units are labeled in the generated report. An alternative is to count the useful operations in the C code; the difference with the numbers from the HLS tool represents the number of overhead operations.
• R^i_imp: the resource consumption per operation is reported in the Analysis viewer.
• f_imp, L_imp and T_run: the cycles and frequency are reported by Vivado HLS; the runtime can be calculated from them.
From these parameters, the optimal runtime can be calculated (Eq. 1, Eq. 26, Eq. 21 or Eq. 33), as well as the global efficiency (Eq. 2), the three efficiency components (Eq. 17, Eq. 18, Eq. 32) and, for a detailed lost cycle analysis, Eq. 20.
5. Overhead analysis

Now that our methodology has clearly defined what deteriorates the global performance - namely lost frequencies, lost area and lost cycles - we want to dive deeper into the analysis. The question arises what causes these losses. A causal explanation has a counterfactual interpretation: if the cause could be eliminated, we expect the lost performance to disappear. That is, if 100 lost cycles are due to bottleneck C, then by removing C, the 100 lost cycles would disappear, the cycle efficiency would increase accordingly, and hence so would the global efficiency. The counterfactual increase of performance is called speedup by Koehler et al. [9]. For instance, non-overlapped communication will block some computational components and hence induce lost cycles. In this case we have to identify the reason for the non-overlapped communication. This might be non-ideal communication (below the potential bandwidth), which causes long transfer periods.

In this section we establish a classification of possible causes for lost frequencies, lost area and lost cycles. We want to know the reasons for a frequency drop and the role of the used components that are not used for useful computations. We want to label each lost cycle with the actual reason for being idle. In this sense we will analyze each component in turn.
5.1. Lost frequencies

An FPGA is forced to function at a lower frequency because the design takes up a large percentage of the available logic resources, increasing the critical path, which determines the clock rate.
5.2. Lost area
In the equations we considered the area (R_imp) that contributes to the execution of useful instructions. The rest of the components can either be used for other, non-useful operations or not be used at all. As shown in Fig. 2, we call the former overhead area and the latter unused area.

Figure 2: The area of an FPGA is partly used for doing useful computations or overhead operations.

Classification of lost area:
1. Unused area
(a) No more replication is possible due to depletion of other area (which is necessary for more replication). Here, another resource type is (at least partially) bounding the performance. It is therefore useful to keep track of the usage of all resources. This resource might be part of the cost vector or needed for support functionality such as memory or control logic.
(b) Area left unused to prevent a frequency drop.
2. Overhead area: instead of doing useful operations, components are being used for
(a) routing
(b) control
(c) memory

When considering performance efficiency, area that is not used is considered 'lost' because it could have been used to increase the performance. When optimizing used area efficiency, the unused area is not considered overhead.

Figure 3: Vivado HLS Analysis viewer reflecting the used and lost cycles.
5.3. Lost cycles
The concept of lost cycles comes from the overhead analysis in parallel computing established by Crovella and LeBlanc [13]. On CPUs, the frequency is constant and one does not consider area. So the only inefficiencies that remain are cycles of the processors that are not used to execute useful instructions. The rationale of a lost cycle is that ideally the processor could issue an operation during that cycle, but it did not because of 'this or that'. The 'this or that' is what we are interested in: the reasons for not reaching the peak performance. In parallel computing, when the parallel runtime takes x cycles, each processor can exploit these x cycles usefully. An efficiency analysis is devoted to analyzing the portion of the x cycles that are not used usefully. For FPGAs we want to apply the same idea: R_imp components are devoted to the useful operations and the execution takes L_imp cycles. Thus, R_imp · L_imp is the total number of cycles in which an operation could be executed on the useful area. In reality, when executing the N_op operations, only λ · r · N_op cycles were effectively used. The ratio of the latter to the former defines the cycle efficiency (expressed by Eq. 15). The lost cycle analysis is therefore performed on the execution profile, checking the cycle consumption of each of the R_imp components. Note that the other components should not be considered, since they are already counted in the area efficiency.
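The counting itself is mechanical; below is a sketch over a toy execution profile (the profile encoding is ours, not the Analysis viewer's):

```cpp
// Lost-cycle counting over a toy execution profile: for each of the R_imp
// components we walk the L_imp cycles; only issue slots count as useful.
#include <cstdio>

int main() {
    const int R_imp = 2, L_imp = 8;
    const int lambda = 1;  // issue latency in cycles
    // issue[c][t] == 1 when component c starts a useful operation at cycle t
    const int issue[R_imp][L_imp] = {
        {1, 0, 1, 0, 0, 0, 1, 0},
        {0, 1, 0, 0, 1, 0, 0, 0},
    };
    int useful = 0;
    for (int c = 0; c < R_imp; ++c)
        for (int t = 0; t < L_imp; ++t)
            if (issue[c][t]) useful += lambda;
    int total = R_imp * L_imp;
    std::printf("useful %d / total %d -> E_cycle = %.2f, lost = %d\n",
                useful, total, (double)useful / total, total - useful);
}
```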
This is illustrated by Fig. 3, showing the execution profile provided by the Vivado HLS Analysis viewer. The viewer details the schedule of the design's execution. It shows how the resources are consumed, the I/O accesses and what is executed at every clock cycle. The top bar reflects the control states of the execution, where each state corresponds to one clock cycle. Four different floating-point operations are used to compute an L2 normalization of HOG (which is discussed next). Each floating-point operation consumes dedicated resources, in particular specific IP-cores which are mostly implemented with DSPs. Every control step (at the top bar) represents one state of the Finite State Machine (FSM) and consumes one clock cycle. Cycles are colored gray when the components are executing an operation. However, only the issue latency (λ = 1 cycle) has to be counted as a useful cycle, since operations can be pipelined. This happens for instance at clock cycle 46, when a new addition is started and overlaps with the previous addition initiated at cycle 44. For the lost cycle analysis we have to consider the R_imp components that perform the useful operations and have L_imp cycles at their disposal. Of these R_imp · L_imp cycles, only 5 cycles were devoted to issuing new operations. All the other cycles are lost cycles, because they could have been used.

Figure 4: Execution profile with 2 components having dependent operations. Component 2 is bounding the performance. Cycles in gray are useful, in white are idle.

Figure 5: Execution profile with 2 components having dependent operations. The idle periods can be filled with an overlapping iteration through pipelining. Cycles in gray are useful, in white are idle.
Classification of lost cycles:
1. Imperfect execution of operations: happens with longer issue latencies than normal, e.g. with memory accesses.
2. Idle cycles: a component cannot proceed with the next operation due to a dependency on another component. A dependency arises when the data to be processed is not ready yet. It can be static or dynamic (control, synchronization, ...). There are two possibilities: either the other component is bounding the performance, or there is not enough parallelism to overlap operations.
A. Bounded by another instruction on a certain component, as shown in Fig. 4. The bounding instructions could be useful instructions as well as data movement or overhead operations.
I. due to non-optimal execution (e.g. non-optimal data movement);
II. due to an imbalance: another, better configuration is possible. Moving operations to other components will lead to fewer idle cycles and a better global performance. This was discussed in Subsection 3.7.2.
B. Not enough parallelism, as shown in Fig. 5. Since the involved components all exhibit idle periods, the idle periods could be filled with overlapping iterations through pipelining.
I. no independent iterations in the algorithm: the algorithm is inherently sequential. Then the lost cycles are due to non-overlapping useful or overhead instructions (data movement, loop control, ...);
II. insufficient concurrent execution possible due to resource limitations;
III. imperfect overlap: although there is sufficient parallelism, the concurrent execution cannot prevent the idle cycles.

To analyze idle cycles and differentiate between cases of type A and type B, the lost cycle analysis must be performed per component and not per component type. This will be shown in the first example of the next section.
Note that the same analysis applies to the performance of GPUs: GPU programs are either compute bound, memory bound or latency bound [14]. The latter happens when there are not enough concurrent hardware threads to hide all latencies, which is caused by resource limitations.
Figure 6: Graphical description of HOG. Example of how the sliding windows can be processed in parallel, since each individually processes a block of 2 × 2 cells.

6. Applying the methodology in practice

Our methodology is evaluated for several designs on a Xilinx Virtex6-LX240T. The tool chain is composed of the Xilinx HLS tool Vivado HLS 2014.4 and Xilinx ISE 14.7, in order to support our target FPGA. The resources available on this FPGA are 768 DSPs, 150k LUTs, 301k FFs and 832 Block RAMs. For our Virtex6, the maximum measured frequency at which the DSPs can operate is 484 MHz.

Although our methodology is independent of the development tool, we use Vivado HLS for some of our examples for the sake of simplicity. Such tools accelerate design-space exploration thanks to the high-level representation of the algorithms and the large set of available optimizations. The optimizations considered here are pipelining, partial loop unrolling and optimizing the I/O interface.
6.1. Efficiency analysis of a real-world HPC FPGA implementation

Our methodology is first applied to a real-world algorithm, the Histogram of Oriented Gradients (HOG) descriptor.

Algorithm description
The HOG algorithm is one of the most popular algorithms for object detection. Its first step, the HOG descriptor, is a sliding-window algorithm that processes the gradient orientations and magnitudes obtained for each pixel from a pre-processed image (Figure 6). The output generated by the HOG descriptor is used by a classifier algorithm such as a Support-Vector Machine (SVM) to assign a matching score to the descriptor. The HOG algorithm is an application with high computational demands that needs to be executed on a hardware accelerator such as a GPU [16] or an FPGA [17] to achieve real-time object detection.

Although the whole object detection requires several steps, we only target the descriptor calculations, since this is the performance-limiting step. The HOG descriptor processes gradients obtained by calculating the orientation and magnitude gradients for each pixel of the original image. The image is divided into square cells of 8 by 8 pixels, and a sliding window of 2 by 2 cells is slid over the image (Figure 6). The sliding window is moved one cell at a time and the four histograms corresponding to the cells contained by this window are computed. Two loops, L0 and L1, are defined to traverse all the gradients of the 2 by 2 cells. The histograms are generated based on the orientation and magnitude values of each gradient. For each cell, the histogram bin is computed and the contribution in terms of the gradient orientation and magnitude is added to the histogram bin. Each combination of orientation and magnitude not only determines which histograms must be incremented but is also used to determine the value of the increment. Since there is an overlap of one cell between sliding windows, each cell contributes to the histograms of four sliding windows. These calculations demand several floating-point operations, which dominate the overall execution time. The HOG implementation analyzed in this paper is the floating-point version of the one presented in [17].
FPGA implementations
The C/C++ code describing HOG is compiled and synthesized using the Vivado HLS 2014.4 tool. The first implementation is one without optimizations. In the second, the inner loop L1 is pipelined and, in the third, the outer loop L0 is pipelined. Here we discuss the optimization of a single core. As shown in our previous work [17], this core can then be replicated. For this reason we will try to optimize the occupied FPGA efficiency E'.
Efficiency analysis
The reports from Vivado HLS are used to create a table such as Table 2, where the parameters of the efficiency analysis, derived and calculated as explained in Sec. 4, are summarized. The upper table shows the detailed resource consumption and cycle efficiencies (calculated with Eq. 20). Since the DSPs are the limiting resource, we focus on this component. Two types of floating-point instructions have to be considered: additions and multiplications. The used area (U) is calculated based on the consumed DSPs. All consumed DSPs participate in the additions or multiplications; E'_a is therefore 100% for the three implementations. The lower table shows the aggregate results. The aggregated cycle efficiency E_c is based on Eq. 32 and is a weighted average of the detailed cycle efficiencies (see Sec. 3.7.2). Note that the occupied FPGA efficiency E' can be calculated in two ways: as T'_opt/T_run or as the product of the three efficiency components E_f, E'_a and E_c.

Because iterations do not overlap, the first version has a very low efficiency. The first optimization results in the highest efficiency. The second optimization consumes more area; it leaves less area for replication. This is reflected in a lower E', while the runtime is nearly the same.
Accuracy validation
Vivado HLS cannot guarantee that the reported numbers of the HLS design are correct before proceeding with the next steps towards the actual implementation (e.g. it does not know what the actual routing delays will be). To validate the results, we proceeded with the RTL design and the place and route stage of the standalone core generated by Vivado HLS. The I/Os are assigned to the available pins of the FPGA and the target clock frequency is determined by the maximum frequency reported by Vivado HLS. The recalculated efficiency values are shown in Table 3. The only difference in terms of efficiency is a decrease in frequency of about 10% for the first and third implementations. The resource consumption is slightly different, since the estimation of the area consumption is based on a component library. For this example, E'_a remains the same because the consumed DSPs remain the same. Nevertheless, the estimations provided by the Vivado HLS tool are a good reference in terms of efficiency.
Impl        | i       | N^i_op | λ^i_op | DSP | LUT  | FF   | L_imp [cycles] | E^i_c [%]
------------|---------|--------|--------|-----|------|------|----------------|----------
No optim    | Add/Sub | 3072   | 2      | 2   | 212  | 227  |                | 13.5
            | Mul     | 1024   | 3      | 3   | 135  | 128  |                | 4.5
            | Mul     | 512    | 3      | 3   | 135  | 128  |                | 2.3
            | Mul     | 512    | 3      | 3   | 135  | 128  |                | 2.3
            | Mul     | 512    | 3      | 3   | 135  | 128  | 22306          | 2.3
            | Sub     | 256    | 2      | 2   | 212  | 227  |                | 1.1
            | Sub     | 256    | 2      | 2   | 212  | 227  |                | 1.1
            | Sub     | 256    | 2      | 2   | 212  | 227  |                | 1.1
            | Total   | -      | -      | 20  | 1388 | 1420 |                |
Pipeline L1 | Add/Sub | 1280   | 2      | 2   | 212  | 227  |                | 49
            | Add/Sub | 2560   | 2      | 2   | 212  | 227  | 2601           | 98
            | Mul     | 2560   | 3      | 3   | 135  | 128  |                | 98
            | Total   | -      | -      | 7   | 1101 | 553  |                |
Pipeline L0 | Add     | 16     | 2      | 2   | 212  | 227  |                | 0.6
            | Add/Sub | 1632   | 2      | 2   | 212  | 227  |                | 63.0
            | Add/Sub | 1456   | 2      | 2   | 212  | 227  |                | 56.1
            | Add/Sub | 384    | 2      | 2   | 212  | 227  |                | 14.8
            | Add/Sub | 272    | 2      | 2   | 212  | 227  | 2591           | 10.5
            | Add/Sub | 64     | 2      | 2   | 212  | 227  |                | 2.5
            | Mul     | 1312   | 3      | 3   | 135  | 128  |                | 50.6
            | Mul     | 1264   | 3      | 3   | 135  | 128  |                | 48.8
            | Total   | -      | -      | 18  | 1542 | 1618 |                |

Impl        | U [%] | T'_opt [ms] | T_run [ms] | f_imp [MHz] | E_f [%] | E'_a [%] | E_c [%] | E' [%]
------------|-------|-------------|------------|-------------|---------|----------|---------|-------
No optim    | 2.60  | 1.58e-3     | 0.1887     | 118.2       | 24.42   | 100      | 3.44    | 0.84
Pipeline L1 | 0.91  | 4.53e-3     | 0.02494    | 104.28      | 21.54   | 100      | 84.36   | 18.2
Pipeline L0 | 2.34  | 1.76e-3     | 0.02485    | 104.28      | 21.54   | 100      | 32.97   | 7.1

Table 2: The reports from Vivado HLS are used to analyze the efficiency of three HOG implementations: useful operations and resource consumption (top) and efficiency components (bottom). The peak frequency is 484 MHz and the FPGA contains 768 DSPs.
Impl        | U [%] | T'_opt [ms] | T_run [ms] | f_imp [MHz] | E_f [%] | E'_a [%] | E_c [%] | E' [%]
------------|-------|-------------|------------|-------------|---------|----------|---------|-------
No optim    | 2.60  | 1.58e-3     | 0.205      | 109.46      | 22.62   | 100      | 3.44    | 0.77
Pipeline L1 | 0.91  | 4.53e-3     | 0.02494    | 104.21      | 21.53   | 100      | 84.36   | 18.2
Pipeline L0 | 2.34  | 1.76e-3     | 0.0292     | 91.32       | 18.87   | 100      | 32.97   | 6.02

Table 3: Efficiencies obtained after the placement and routing of the design solutions presented in Table 2.
Figure 7: Analysis viewer of the HOG descriptor without any optimization.
Lost cycle analysis
The next step in our methodology is to identify the overheads: the reasons for the efficiency drops. The low area efficiency indicates that more pipelining or more replication of the HOG blocks might be possible. The frequency drop is caused by the length of the critical path.

Regarding the cycle efficiency, we identify the lost cycles based on the classification in Sec. 5.3. The poor cycle efficiency of the non-optimized first implementation is due to the lack of overlapping iterations (type 2.B.I). This also becomes apparent in Fig. 7, which depicts part of the Analysis viewer. In this part there are only 3 useful cycles.

Cycle efficiency is much better for the second implementation. The 98% cycle efficiency of the second and third units shows almost complete utilization. Only the first unit is underutilized (49%). This is due to an imbalance in the distribution of the instructions among the components (type 2.A.II). A better distribution could in principle increase the performance. Imbalances also clearly appear in the third version, 'Pipeline L0': 4 of the units have a utilization between roughly 50% and 63%, while 4 units lie between 0.6% and 14.8%. As none of the units attains an efficiency close to 100%, more overlap seems to be possible (type 2.B.I). However, close inspection of the execution profile in the Analysis viewer reveals that the lost computational cycles are due to the memory components, which are fully busy during those cycles (type 2.A).

Impl                | U [%] | T'_opt [ms] | T_run [ms] | f_imp [MHz] | E_f [%] | E'_a [%] | E_c [%] | E' [%]
--------------------|-------|-------------|------------|-------------|---------|----------|---------|-------
Pipeline L0         | 2.34  | 1.76e-3     | 0.0292     | 91.3        | 18.9    | 100      | 32.97   | 6.02
Limiting adders     | 2.60  | 1.58e-3     | 0.0324     | 91.3        | 18.9    | 100      | 26.0    | 4.87
Partitioning memory | 3.39  | 1.21e-3     | 0.031      | 49.9        | 10.3    | 100      | 37.9    | 3.90

Table 4: Efficiencies obtained after optimizing version 'Pipeline L0' (first row) by limiting the number of adders and by partitioning the memory.
Optimization
Finally, we try to optimize the design based on the aforementioned bottlenecks. Two optimizations were tried on the 'Pipeline L0' version. First, to remove the imbalance of the DSPs doing additions, we forced Vivado HLS to limit the DSPs doing additions to 4 computational units. Secondly, we let Vivado HLS partition the memory to overcome the memory bottleneck. The results of both are reported in Table 4, with the results for the original 'Pipeline L0' version copied into the first row. However, neither optimization results in better performance. Limiting the adders resulted in more multipliers: 4 instead of the original 2. This decreased the cycle efficiency instead of increasing it. Memory partitioning increased the area consumption (7 adders and 4 multipliers), which was better used (higher cycle efficiency). But, conversely, the obtained frequency had to be almost halved.
6.2. Interaction of Efficiency Components
As clearly demonstrated by the HOG use case, the three efficiency components, E_c, E_a and E_f, are not independent. Changing or optimizing one component might affect another. We discuss the three possible interactions.
6.2.1. Area and Frequency

The dependency between area and frequency efficiency is demonstrated by constructing a benchmark to attain the theoretical peak performance. A cascade of single-precision floating-point adders is built as proposed in [21].

Figure 8: Our benchmark consists of a cascade of a variable number of single precision floating-point adders.

Figure 9: Evolution of the floating-point performance by generating single precision adders with DSPs or with logic resources. Values obtained after placement and routing.
Fig. 8 shows the benchmark used to measure the attainable floating-point performance on a particular FPGA. The IPs are implemented as single-precision floating-point adders configured to consume DSPs or logic resources (Fig. 8). In order to maximize the floating-point performance, the best choice is the add/subtract operation. This floating-point operation can be implemented on an FPGA using DSPs and/or logic resources. The FPGA vendor's floating-point intellectual property (IP) user guide already offers an estimation of the logic consumption and maximum frequency.

The performance is obtained after the placement and routing of the handmade VHDL benchmark, showing the real attainable performance and reflecting not only the routing congestion but also its impact on the frequency. Such effects can hardly be estimated at high level, since the resource consumption is based on component estimations. As a result, effects like routing congestion cannot be accurately estimated. Nevertheless, our model is designed to be applied at any level of the design flow: although the results are more accurate when using values obtained after placement and routing, the overall effort and implementation time are then higher.

Figure 10: Evolution of the efficiencies when increasing the number of single precision adders with DSPs. Values obtained after placement and routing.
The theoretical way to estimate the peak performance is to consume all the available DSPs operating at their theoretical maximum frequency. However, benchmarking shows that the floating-point peak performance is not achieved when all DSPs are consumed. Fig. 9 shows how the floating-point performance increases linearly up to a certain point where the frequency starts to decrease due to routing congestion. The peak is achieved just before the point where all DSPs are consumed and the maximum frequency starts to drop drastically.

Our Virtex6 LX240T offers 768 DSPs, or 384 single-precision adders, able to operate at 484 MHz. This represents a theoretical floating-point peak performance of 185.8 GFLOPS. Our measurements show that the peak is close to 158 GFLOPS when using only DSPs, while using only logic resources it approximates 133 GFLOPS. Consequently, only 85% or 71% of the peak performance is achieved using DSPs or logic, respectively.
Figure 11: Trade-off between consuming DSPs and logic resources to obtain the single precision floating-point peak performance. By reducing the DSP consumption, more logic is available for routing and additional adders. While up to 484 adders can be implemented, the peak performance is dominated by the maximum frequency. Values obtained after placement and routing.

Fig. 10 shows how our benchmark exploits the available resources in order to reach the highest efficiency when using DSPs to build the adders. Increasing the area consumed by useful operations leads to a higher area efficiency, but not necessarily to the highest performance. E_c remains constant at the highest value, since there are no lost cycles thanks to the pipelining of the adders: pipelining allows a new addition to start after the issue latency of the previous operation, so a floating-point operation is executed every λ cycles. The global efficiency is determined by E_a and E_f. E_a increases linearly with the number of DSPs used to implement single-precision floating-point adders. E_f starts to decrease just before E_a reaches its highest value, as shown in Fig. 10. The increased number of adders due to a higher consumption of DSPs leads to routing congestion, which enlarges the critical path and decreases the maximum frequency. Consequently, it is not possible to reach 100% efficiency. Due to the dependency between area and frequency, the highest performance is only achievable through a trade-off between those parameters.
The strategy to reach the peak performance proposed in [21] is to consume all available DSPs by building as many adders as possible, while dedicating the remaining logic resources to building more adders. An adder can be built with either 2 DSPs and about 212 LUTs, or with 385 LUTs. We neglect the number of flip-flops because they do not constrain the implementations. The equations for a mix of component types were discussed in Sec. 3.7.1, although not for the case where cost vectors have to be considered. With cost vectors, the calculation of the optimal runtime is in general less straightforward than Eq. 21, since one first has to find the optimal usage of the different components. In our case it is fairly simple: one first consumes all DSPs, with which 374 adders are built, and then uses the remaining LUTs for additional adders. This results in theoretically 185 additional adders and a peak performance of 272 GFLOPS. In practice it was only possible to synthesize 110 additional adders after consuming all DSPs. Figure 11 shows that the peak performance is around 184.5 GFLOPS when using only 752 out of the 768 DSPs, combined with the remaining logic.

Next, we apply our methodology to acquire insight into the obtained efficiency. Table 5 summarizes the efficiency values for 7 points of Fig. 11. After filling the adder pipeline, each adder executes 1 operation each cycle; cycle efficiency is then 100%. Perf_imp is the product of the frequency and the number of adders. The efficiency is calculated by comparing the attained performance with the theoretical peak performance of 272 GFLOPS. E^DSP_a and E^LUT_a represent the fraction of components used (Eq. 18), with R^DSP_imp = 2 · N^{D&L}_op and the given R^LUT_imp. Note that the total efficiency can also be calculated from its components with Eq. 25.

  | N^{D&L}_op | N^LUT_op | f_imp [MHz] | Perf_imp [GFLOPS] | E [%] | E_f [%] | E^DSP_a [%] | R^LUT_imp | E^LUT_a [%] | E_c [%]
--|------------|----------|-------------|-------------------|-------|---------|-------------|-----------|-------------|--------
a | 374        | 90       | 222         | 103               | 37.7  | 45.8    | 97          | 118636    | 79          | 100
b | 374        | 110      | 305         | 148               | 54.2  | 63.1    | 97          | 123751    | 82          | 100
c | 376        | 95       | 392         | 184               | 67.7  | 80.9    | 98          | 119615    | 79          | 100
d | 376        | 100      | 315         | 150               | 54.9  | 65.1    | 98          | 122691    | 81          | 100
e | 384        | 85       | 343         | 161               | 58.9  | 70.8    | 100         | 118624    | 79          | 100
f | 384        | 95       | 270         | 129               | 47.3  | 55.8    | 100         | 120182    | 80          | 100
g | 384        | 100      | 202         | 98                | 35.8  | 41.7    | 100         | 122000    | 81          | 100

Table 5: Efficiencies obtained by constructing N^{D&L}_op adders with DSPs and logic, and N^LUT_op adders with only logic (LUTs). Each row corresponds to a point in the graph of Fig. 11. Peak performance is 272 GFLOPS.
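The theoretical mix is a one-line resource calculation. The sketch below reproduces the '185 additional adders' figure from the cost estimates quoted above (the LUT total of 150,720 is our assumption for the reported '150k LUTs'):

```cpp
// Theoretical adder mix: adders from 2 DSPs + ~212 LUTs first, then from
// 385 LUTs alone (costs as quoted in the text; LUT total is our assumption).
#include <cstdio>

int main() {
    const int luts = 150720;
    const int lut_with_dsp = 212, lut_only = 385;

    const int dsp_adders = 374;                // DSP-based adders (from the text)
    int luts_left  = luts - dsp_adders * lut_with_dsp;
    int lut_adders = luts_left / lut_only;     // additional LUT-only adders

    std::printf("LUT-only adders: %d\n", lut_adders);  // prints 185
}
```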
Figure 12: Evolution of E_cycle when increasing E_freq for a matrix size of 32 × 32 (left: E_c and E_f versus frequency in MHz; right: E_c, E_a and E versus frequency in MHz).
6.2.2. Cycles and Frequency
Changing the frequency affects the total number of cycles and the cycle efficiency. This
is demonstrated with a fundamental linear-algebra operation, a matrix multiplication.
We consider a matrix multiplication of a matrix A of m rows by k columns and a
matrix B of k rows by n columns, resulting in a matrix C of m rows by n columns. We
assume all elements of matrix C to be equal to 0 before the computation begins. The
implementation we consider consists of three nested loops: L0, L1 and L2. L0 iterates
over all rows of matrix C, while L1 iterates over all elements (columns) of the row
corresponding to the current iteration of L0. An element at a given row and column
is determined by computing the scalar product of the corresponding row of matrix A
and the corresponding column of matrix B. For this purpose, loop L2 adds the products
of corresponding elements of said row and column to the target element of matrix C.
Given that no intermediate values are used, each iteration of L2 accesses an element of
matrices A, B and C. For the sake of simplicity, we will only consider square matrices.
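The description above maps directly onto the following kernel, written here as a minimal
Vivado HLS-flavoured C++ sketch (our rendering, not the authors' verbatim source). The
loop labels match L0, L1 and L2 of the text, and C is assumed to be zero-initialized.

    #define N 32  // square matrices, matching the 32 x 32 case studied below

    void matmul(const float A[N][N], const float B[N][N], float C[N][N]) {
    L0: for (int i = 0; i < N; i++) {         // iterate over the rows of C
    L1:     for (int j = 0; j < N; j++) {     // iterate over the columns of row i
    L2:         for (int k = 0; k < N; k++) {
                    // Scalar product of row i of A and column j of B. No
                    // intermediate accumulator is used, so every iteration of
                    // L2 accesses one element of A, of B and of C.
                    C[i][j] += A[i][k] * B[k][j];
                }
            }
        }
    }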
Fig. 12 shows the evolution of Ec when increasing Ef, with and without optimiza-
tions. Notice how the increment of Ef reduces Ec. The increment of Ef is obtained
by raising the target frequency of the Vivado HLS design, which forces the tool to
achieve a shorter maximum clock period for every compilation. The frequency defines
the clock period, which is the basic time unit. Therefore, a higher number of clock
cycles is needed to execute the same code when the frequency increases.
Ec decreases for two main reasons:
• The impact of the lost cycles increases, since their number grows due to the
shorter clock period (see the sketch after this list).
• A higher frequency demands additional logic, mainly registers, to shorten the
critical clock path. This logic introduces overhead, resulting in additional lost cycles.
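The first effect can be illustrated in isolation: an operation with a fixed combinational
delay must be spread over more pipeline stages as the clock period shrinks, so the same
work costs more cycles at a higher frequency. The 12 ns delay used in this sketch is a
hypothetical value chosen purely for illustration.

    #include <cmath>
    #include <cstdio>

    int main() {
        const double delay_ns = 12.0;  // assumed combinational delay of one operation
        const double freqs_mhz[] = {100.0, 200.0, 300.0, 400.0};
        for (double f : freqs_mhz) {
            const double period_ns = 1000.0 / f;  // clock period, the basic time unit
            const int cycles = static_cast<int>(std::ceil(delay_ns / period_ns));
            std::printf("f = %3.0f MHz -> period = %5.2f ns -> %d cycles\n",
                        f, period_ns, cycles);
        }
    }

At 100 MHz the operation fits in 2 cycles; at 400 MHz the same operation needs 5,
before even counting the extra registers the tool inserts to meet timing.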
6.2.3. Area and Cycles
As detailed before, E reflects the quality of an implementation, while the area,
latency and frequency indicate where the design needs to be improved. The analysis
of the peak performance for a floating-point matrix multiplication reveals that the
limiting efficiency is Ea: the original design does not fully exploit the available
area.
Our efficiency analysis shows that pipelining loops is the most effective optimiza-
tion for this algorithm; Ec achieves the highest value of the three efficiencies. The
overall efficiency, however, can be improved by increasing Ea or Ef. Increasing the
resource consumption raises Ea, because the additional resources are dedicated to
computing useful operations in parallel. Ef, on the other hand, can only be increased
up to a certain limit, as has been shown previously.
Optimizations at the memory level are needed to increase Ea. In the previous
design the memory was accessed serially, but the proper optimization removes this
bottleneck: by partitioning the memory blocks, the memory can be accessed in parallel.
Partitioning thus allows operations to execute in parallel, consuming more resources
for useful operations and reducing the overall number of execution cycles. However,
only the proper level of partitioning leads to the peak performance; a sketch of this
optimization follows below.
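In Vivado HLS this kind of partitioning is expressed with the ARRAY_PARTITION directive.
The sketch below shows how it could be combined with pipelining of L1 for the kernel of
Sec. 6.2.2; the partition factor of 4 is an arbitrary illustrative choice, not the
optimal level discussed in the text.

    void matmul_partitioned(const float A[N][N], const float B[N][N], float C[N][N]) {
        // Split the second dimension of A and the first dimension of B into 4
        // banks, so that several operand pairs of the scalar product can be
        // read in the same cycle. The factor 4 is an illustrative choice.
    #pragma HLS ARRAY_PARTITION variable=A cyclic factor=4 dim=2
    #pragma HLS ARRAY_PARTITION variable=B cyclic factor=4 dim=1
    L0: for (int i = 0; i < N; i++) {
    L1:     for (int j = 0; j < N; j++) {
                // Pipelining L1 makes the tool fully unroll L2; the partition
                // factor then bounds how many multiply-adds can start per cycle.
    #pragma HLS PIPELINE
    L2:         for (int k = 0; k < N; k++) {
                    C[i][j] += A[i][k] * B[k][j];
                }
            }
        }
    }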
Table 6 shows how the overall efficiency increases when the memory is partitioned.
The greater parallelism increases the resources dedicated to executing useful operations,
improving Ea. However, Ec slightly decreases: since most of the lost cycles are con-
stant when pipelining this algorithm [22], their relative impact increases when the ex-
ecution time decreases. Thus, while increasing the parallelism, more area is used but
the impact of the control overhead grows. This is in fact an instance of Amdahl's
law: the loop control overhead is constant and does not decrease with more parallelism,
so its impact on the overall efficiency increases with the degree of parallelism.
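This behaviour admits a simple first-order model (our simplification, with W and
c_ovh as assumed parameters): if W is the number of useful cycles at partition level 1
and c_ovh the constant control overhead in cycles, then at partition level p

\[ E_c(p) \;\approx\; \frac{W/p}{W/p + c_{\mathrm{ovh}}} \;=\; \frac{1}{1 + p\,c_{\mathrm{ovh}}/W} \]

which decreases monotonically with p, consistent with the Ec column of Table 6.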
Partition Level   Ef [%]   Ea [%]   Ec [%]   E [%]
       1          59.20     0.52    99.28    0.31
       2          59.20     1.04    98.47    0.61
       4          59.20     2.08    96.89    1.19
       8          59.20     4.17    93.97    2.32
      16          59.20     8.33    88.62    4.37
      32          59.20    16.67    79.56    7.85

Table 6: Efficiency analysis of a 32 × 32 matrix multiplication when pipelining L1 and with
different levels of memory partitioning.
Figure 13: Evolution of the efficiencies E, Ec, Ea and Ef as a function of the matrix size
(rows = columns). The values have been obtained from the most efficient design for each
matrix size. The optimizations applied are pipelining of loop L1, several target frequencies
and different levels of memory partitioning.
6.3. Bottleneck identification
Our efficiency analysis helps in identifying bottlenecks. The previous example
only considers the multiplication of 32 × 32 matrices; although the maximum parallelism
has been reached, Ea only reaches 16.7%. Here we analyze what happens when processing
larger matrices. Our analysis targets the optimizations of the middle loop L1 because,
as shown in the previous section, the highest performance is obtained through loop
pipelining and memory partitioning. Consequently, these are the optimizations
considered for this analysis.
Fig. 13 shows the evolution of the efficiencies as a function of the matrix size for
the matrix multiplication implementation.
Figure 14: Resource consumption E_j^u of FFs, LUTs and DSP48Es as a function of the matrix
size, at the peak-performance design point. The optimizations applied are pipelining of
loop L1, several target frequencies and different levels of memory partitioning.
As depicted in Table 6, Ec slightly decreases due
to the latency overhead introduced by the extra control logic required for the memory
partitioning. When the matrix size increases, the impact of this overhead is reduced
and Ec converges to 100%. Regarding Ea, the level of memory partitioning is determined
by the matrix size or by the area consumption. By default, 5 DSPs are consumed to
execute the single floating-point additions and multiplications. Raising the level of
memory partitioning increases the DSP consumption and, consequently, Ea. The maximum
level of memory partitioning is close to 150; this follows from the DSP consumption of
each operation and the number of DSPs available on the target FPGA. Consequently, the
highest Ea is expected to be achieved by completely partitioning a matrix of 150 × 150
floating-point elements.
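The order of magnitude of this limit follows directly from the figures above: with
768 DSPs on the device and roughly 5 DSPs per floating-point addition/multiplication
pair, the partition level cannot exceed

\[ p_{\max} \;\approx\; \left\lfloor \frac{768\ \mathrm{DSPs}}{5\ \mathrm{DSPs\ per\ operation\ pair}} \right\rfloor \;=\; 153 \;\approx\; 150 \]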
Fig. 14 shows the limiting resource as a function of the matrix size. Matrices smaller
than 150 × 150 elements are limited by the number of available DSPs, while larger
matrices are limited by LUTs: the high level of memory partitioning needed for large
matrices causes the LUTs to become the limiting factor. Matrices larger than 150 × 150
elements consume all the available DSPs plus a certain number of LUTs. For instance,
matrices larger than 256 × 256 do not support the same level of memory partitioning as
matrices of 150 × 150 due to the LUT consumption. This results in a reduction of Ea
caused by a lower level of memory partitioning, which decreases the number of parallel
operations and, therefore, the number of consumed DSPs. Ef is also affected by this
turning point: the achievable design frequency decreases with increasing matrix size
due to the additional resource consumption. The resource overhead is slightly reduced
when the target frequency is lowered, allowing higher levels of memory partitioning
until the LUT limit is reached again. Only the proper combination of both parameters
leads to the highest efficiency and, therefore, to the peak performance for this
algorithm. Notice, however, that the decrease of Ef is not as abrupt as that of Ea.
7. Conclusions
When using FPGAs for HPC, one should compare the obtained performance with
the peak performance and identify what blocks optimal execution. To this end, we pro-
posed a formal methodology to study the efficiency of an FPGA implementation. Our
work provides a formal umbrella that complements existing work by extending perfor-
mance analysis with a quantification and decomposition of the efficiency. The value of
the methodology is demonstrated with several case studies: we were able to identify
bottlenecks in different types of implementations, and we compared different alternatives
in order to find the best compromise. It is also shown how the interrelations between
area, frequency and performance can be better understood thanks to our methodology.
Nevertheless, the practical utility of this methodology will ultimately depend on the
extent to which it can be automated and integrated into a tool.
References
[1] W. Vanderbauwhede and K. Benkrid, Eds., "High-Performance Computing Using
FPGAs", Springer, 2013.
[2] S. Aluru and N. Jammula, "A review of hardware acceleration for computational
genomics", IEEE Design & Test, 31(1): 19-30, 2014.
[3] S. Skalicky, S. Lopez, M. Lukowiak and C. Wood, "Mission control: A perfor-
mance metric and analysis of control logic for pipelined architectures on FPGAs",
in ReConFigurable Computing and FPGAs (ReConFig), 2014 International Confer-
ence on, pp. 1-6, IEEE, 2014.
[4] E. Gerlein, T. M. McGinnity, A. Belatreche, S. Coleman and Y. Li, "Multi-agent
pre-trade analysis acceleration in FPGA", in Computational Intelligence for Fi-
nancial Engineering & Economics (CIFEr), IEEE Conference on, pp. 262-269,
2014.
[5] S. Skalicky, S. Lopez, M. Lukowiak, J. Letendre and M. Ryan, "Performance
modeling of pipelined linear algebra architectures on FPGAs", 9th International
Symposium, ARC 2013.
[6] S. Skalicky, S. Lopez and M. Lukowiak, "Performance modeling of pipelined lin-
ear algebra architectures on FPGAs", Computers & Electrical Engineering, Else-
vier, 40(4), 2014.
[7] B. Holland, K. Nagarajan, C. Conger, A. Jacobs and A. D. George, "RAT: a method-
ology for predicting performance in application design migration to FPGAs", in
Proceedings of the 1st International Workshop on High-Performance Reconfigurable
Computing Technology and Applications (held in conjunction with SC07), pp. 1-10,
ACM, 2007.
[8] J. Curreri, S. Koehler, B. Holland and A. D. George, "Performance analysis with
high-level languages for high-performance reconfigurable computing", in Field-
Programmable Custom Computing Machines (FCCM), 2008.
[9] S. Koehler and A. D. George, "Performance visualization and exploration for re-
configurable computing applications", ERSA, 2010.
[10] S. Koehler, G. Stitt and A. D. George, "Platform-aware bottleneck detection for re-
configurable computing applications", ACM Transactions on Reconfigurable Tech-
nology and Systems, 4(3), 2011.
[11] M. Beltran, A. Guzman and F. Sevillano, "High level performance metrics for
FPGA-based multiprocessor systems", Performance Evaluation, 67(6): 417-431,
2010.
[12] A. Grama, A. Gupta, G. Karypis and V. Kumar, "Introduction to Parallel Com-
puting", Benjamin-Cummings, 2003.
[13] M. Crovella and T. J. LeBlanc, "Parallel performance using lost cycles analysis",
Proceedings of the Conference on Supercomputing, 1994.
[14] S. Hong and H. Kim, "An analytical model for a GPU architecture with memory-
level and thread-level parallelism awareness", ACM SIGARCH Computer Archi-
tecture News, 2009.
[15] "UltraFast Design Methodology Guide for the Vivado Design Suite", available
online: http://www.xilinx.com/support/documentation/sw_manuals/ug949-vivado-
design-methodology.pdf, Xilinx Inc., November 2015.
[16] V. Prisacariu and I. Reid, "fastHOG - a real-time GPU implementation of HOG",
Department of Engineering Science, Technical Report 2310/09, Oxford University,
2009.
[17] B. da Silva, A. Braeken, E. H. D'Hollander, A. Touhafi, J. G. Cornelis and J.
Lemeire, "Comparing and combining GPU and FPGA accelerators in an image
processing context", 23rd International Conference on Field Programmable Logic
and Applications, IEEE, 2013.
[18] S. Sirowy and A. Forin, "Where's the Beef? Why FPGAs Are So Fast", Technical
Report MSR-TR-2008-130, Microsoft, 2008.
[19] B. So, M. W. Hall and P. C. Diniz, "A compiler approach to fast hardware
design space exploration in FPGA-based systems", Proceedings of the ACM SIG-
PLAN 2002 Conference on Programming Language Design and Implementation,
New York, NY, USA, 2002.
[20] "Vivado Design Suite, High-Level Synthesis User Guide", available online:
http://www.xilinx.com/support/documentation/sw_manuals/xilinx2014_2/ug902-
vivado-high-level-synthesis.pdf, Xilinx Inc., 2014.
[21] "Technical White Paper: Understanding Peak Floating-Point Performance
Claims", available online: http://www.altera.com/literature/wp/wp-01222-
understanding-peak-floating-point-performance-claims.pdf, Altera Corporation,
2014.
[22] B. da Silva, J. Lemeire, A. Braeken and A. Touhafi, "A Lost Cycles Analysis for
Performance Prediction using High-Level Synthesis", International Symposium on
Applied Reconfigurable Computing, pp. 334-342, Springer, 2016.
[23] G. Zhong, V. Venkataramani, Y. Liang, T. Mitra and S. Niar, "Design space
exploration of multiple loops on FPGAs using high level synthesis", 32nd IEEE
International Conference on Computer Design (ICCD), 2014.
*Author Biography & Photograph

Bruno T. Da Silva Gomes is a Ph.D. student at the Vrije Universiteit Brussel (Belgium)
under the supervision of Prof. Abdellah Touhafi. He obtained his degree in
Telecommunications at the University of Vigo (Spain), specialized in Electronics and
Telematics, and completed his master thesis at iMEC (Belgium). For more than two years
he worked as a researcher in the Signal department of the University of Vigo (Spain),
implementing different architectures for DVB receivers on an FPGA. He is currently
finishing his Ph.D. on the topic of High-Level Synthesis for High-Performance Computing
applications.
Abdellah Touhafi obtained his bachelor's degree in Electronics, option Computer
systems, at IHAM Antwerp. He has a Masters degree in Electronics from the VUB Brussels
and a PhD in Applied Sciences (Scalable Run-Time Reconfigurable Computing Systems),
also at the VUB. He currently is a full-time professor at the VUB and the leader of the
Rapptor Lab: Reconfigurable Architectures, Parallel Processing and Telecommunications
Oriented Research.

An Braeken obtained her MSc degree in Mathematics from the University of Gent in 2002.
In 2006, she received her PhD in engineering sciences from the KU Leuven at the research
group COSIC (Computer Security and Industrial Cryptography). In 2007, she became
professor at the Erasmushogeschool Brussel in the Industrial Sciences Department. Prior
to joining the Erasmushogeschool Brussel, she worked for almost 2 years at the management
consulting company BCG. Her current interests include cryptography, security protocols
for sensor networks, secure and private localization techniques, and FPGA
implementations.

Jan Lemeire is professor at the Department of Electronics and Informatics (ETRO) and
the Department of Industrial Sciences (INDI) at the Vrije Universiteit Brussel (VUB).
He is responsible for the bachelor courses on informatics and electronics, as well as
for the master courses on computer architecture and parallel systems. He is the author
of several publications in top journals, as well as book chapters, on parallel computing
and probabilistic graphical models. Within the field of parallel computing, his areas of
expertise are MPI, GPU computing, performance analysis and performance models. The
results of this research can be found at www.gpuperformance.org. Jan Lemeire is
co-founder of the Personal SuperComputing Competence Center, which aims at lowering the
thresholds for researchers and companies for exploiting GPUs. His interests in the field
of probabilistic graphical models include causal analysis and learning algorithms in the
context of modeling static and dynamic systems.

Jan G. Cornelis graduated as Master of Science in Engineering in 1996 at the University
of Leuven. He worked for more than 10 years in the private sector and returned to the
academic world in 2011, after graduating as Master of Science in 2010 at the Vrije
Universiteit Brussel. He currently works at the ETRO Department of this university,
working on projects and performing research that both revolve around GPU programming
and performance.