A software framework for pipelined arithmetic algorithms in field
  programmable gate arrays by Kim, J. B. & Won, E.
arXiv:1710.09235v3 [cs.OH] 14 Dec 2017
A software framework for pipelined arithmetic algorithms in field programmable gate
arrays
J. B. Kim and E. Won∗
Physics department, Korea University, Anam-ro 145, Seongbuk-gu, 02841 Seoul, Korea
Abstract
Pipelined algorithms implemented in field programmable gate arrays are extensively used for hardware triggers in
modern experimental high energy physics, and the complexity of such algorithms is increasing rapidly. To develop
such hardware triggers, algorithms are developed in C++, ported to a hardware description language for synthesizing
firmware, and then ported back to C++ for simulating the firmware response down to the single bit level. We present a
C++ software framework which automatically simulates and generates hardware description language code for pipelined
arithmetic algorithms.
Keywords: Software framework, FPGA, Pipelined arithmetic algorithms, VHDL, C++, code generation
1. Introduction
In modern experimental high energy physics, detectors with a massive number of channels are used to
identify the physical processes that occur when particles collide. Because the collision rate,
including uninteresting background, is on the scale of MHz [1, 2, 3] and the data read out from the
detectors is on the scale of megabytes [4, 5, 6], it is currently impossible to record all the
collision data, which would be produced at a rate on the scale of terabytes per second. Therefore a
hardware trigger, which determines whether the data should be recorded or not, is required. The
trigger should filter the detector data in such a way that only the physics processes of interest are
written to permanent storage at an acceptable rate. The trigger response also needs to be prompt,
because hardware limits allow each sub-detector to hold its data for only a limited amount of time,
which is on the scale of microseconds [1, 7, 8]. The trigger should complete all of its logic before
this time has elapsed.
Field programmable gate arrays (FPGAs) are integrated circuits that are programmed using a hardware
description language. Due to their programmable and parallel nature, they have been used extensively
for event triggers in modern experimental high energy physics [9, 10, 11]. FPGA based trigger
algorithms generally use integer based calculations [10, 12, 13, 14]. Although floating-point
calculations can be implemented in FPGAs, their latency and FPGA resource usage are significantly
higher, as discussed in Refs. [15, 16].
∗Corresponding author
Email address: eunil@hep.korea.ac.kr (E. Won)
On the other hand, physics related data are generally handled using floating-point calculations on
general purpose computers, and physics analysis software is built on floating-point calculations for
precise results. One needs to use this software to study the performance of trigger algorithms. The
implemented trigger should also be simulated in this software in such a way that the effects of the
trigger on the recorded physics of interest can be studied, and so that the simulation can be compared
with the output from the real hardware down to the single bit level.
For these reasons, trigger algorithms are usually developed in two software versions: one that uses
floating-point calculations and one that uses integer calculations [10, 12, 13, 14]. The
floating-point version shows the pure algorithm performance, while the integer version shows the
degradation of performance due to the constraints of integer calculation and the performance of the
FPGA algorithms. Because the two versions coexist, one constantly needs to synchronize them whenever
the algorithm is modified in one of them. To make matters worse, FPGAs are programmed using a hardware
description language, so there can even be three versions of the same algorithm, which are not
necessarily developed by one individual. These various versions make the maintenance of the level one
trigger software extremely difficult.
A solution to these problems could be to use high level synthesis (HLS) packages such as Vivado HLS
[17]. One can write code in C++ and let HLS convert it into a hardware description language. If the
latency or resource usage of an FPGA is not an issue, this would be the best solution. However, as
discussed earlier, floating-point calculations implemented in FPGAs require much more latency and
resources, which is also a concern for Vivado HLS [18]. Integer or fixed-point algorithms can be
written in C++ using the classes provided by Vivado HLS, but then the precision and bit widths for
each calculation need to be considered additionally. There would also still be the issue of
maintaining two versions, floating-point and fixed-point, of the same algorithm.
Preprint submitted to Nuclear Instruments and Methods in Physics Research A December 15, 2017
[Figure 1 diagram: Develop algorithm in C++ → Port algorithm to VHDL → Port VHDL algorithm to C++]
Figure 1: A development procedure for firmware algorithms. An
algorithm is developed with floating-point based calculations using
simulated data as input. It is then ported to VHDL with integer
based calculations to synthesize firmware for a given FPGA. To study
the performance of the synthesized firmware, it is ported back to
software which simulates the firmware response exactly bit-by-bit.
We have developed a framework that solves the multiple version problem and uses integer based
calculations for the hardware description language. Once an algorithm is implemented in the framework,
one can obtain the floating-point calculation result, the integer calculation result, and very high
speed integrated circuit hardware description language (VHDL) code simultaneously.
In this work, a framework for pipelined arithmetic algorithms in FPGAs is reported. The goals and
design of the framework are explained. The three C++ classes that were developed for the framework are
described. Algorithms that were developed using this framework are discussed as examples. We also
compare our framework with Vivado HLS for a linear regression algorithm.
2. Goals
A typical procedure of firmware algorithm development is shown in Fig. 1. Algorithms are developed and
tested in C++ first. After the algorithms are validated, they are ported to a hardware description
language such as VHDL. There are several issues that should be considered when porting to VHDL.
Floating-point numbers should be converted to integers. The bit width of all variables should be
determined. Division and non-linear operators such as trigonometric operators should be implemented
using look-up tables (LUTs). The inputs to an operator should be properly buffered so that their clock
cycles are synchronized. Overflow and underflow should be prevented when doing addition, subtraction,
and multiplication. In order to limit the FPGA resource usage, the bit widths of the inputs to the
multiplication operator should be small enough for the multiplication to be implemented in a digital
signal processing (DSP) slice [19]. After porting to VHDL, the resources used by the algorithm should
be small enough to fit into the chosen target FPGA. One way to reduce the resource usage is to control
the bit widths of the LUTs.
Due to the loss of calculation precision when porting, the VHDL code needs to be simulated a priori to
confirm whether it can achieve its goals. A floating-point calculation of the algorithm should be
performed to quantify the loss of precision due to this integer conversion. The VHDL code should be
simulated in C++ in such a way that the results can be used in studying other algorithms. Simulation
in C++ also helps in debugging the firmware algorithm.
A framework is developed to simplify the entire process
of pipelined arithmetic firmware algorithm development.
The framework can execute the algorithm using floating-
point calculations, simulate the integer-valued version of
the algorithm, automatically generate VHDL code, and
deal with all the issues described previously on arithmetic
algorithms. After an algorithm is developed, the frame-
work will handle the rest of the development process most
efficiently.
3. Design
Three classes have been developed in total. The first one simulates VHDL signals and the second one is
for LUTs that use block random access memories (BRAMs) [20]. The third class stores the information
related to the generated VHDL code. Clock cycles are taken into consideration so that the signals are
properly buffered for the pipelined algorithms in the VHDL code.
3.1. Signal class
The signal class has been implemented to simulate the signed and unsigned VHDL types. Since algorithms
generally use floating-point variables but VHDL signals are integer variables, a conversion from
floating-point values to integer values is performed once the range of the floating-point variable and
the bit width are given. For signed variables, the conversion is done by the following equations
$$ \text{symmetric max} = \max\left(\text{maximum float value},\ |\text{minimum float value}|\right) \tag{1} $$
$$ \text{conversion constant} = \frac{2^{(n-1)} - 0.5}{\text{symmetric max}} \tag{2} $$
$$ \text{integer variable} = \lfloor \text{float variable} \times \text{conversion constant} \rceil, \tag{3} $$
where n is the given bit width, max is the maximum function, ⌊⌉ is a round-off function, and float
refers to a floating-point value. For unsigned variables, the conversion is done by the following
equations
$$ \text{conversion constant} = \frac{2^{n} - 0.5}{\text{maximum float value}} \tag{4} $$
$$ \text{integer variable} = \lfloor \text{float variable} \times \text{conversion constant} \rceil, \tag{5} $$
where n is the given bit width. The real value which the integer value represents can be calculated
using the following equation
$$ \text{real value} = \frac{\text{integer value}}{\text{conversion constant}}. \tag{6} $$
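As an illustration of Eqs. 1 to 6, the conversion can be sketched in a few lines of C++. This is a minimal sketch and not the framework's own code: the function names are hypothetical, and rounding ties away from zero (std::llround) is an assumption, since the text only specifies a round-off function.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper names; they are not taken from the framework.

// Eqs. 1-2 (signed) and Eq. 4 (unsigned): scale factor from float to integer.
double conversionConstant(int bitWidth, double minFloat, double maxFloat, bool isSigned) {
    assert(bitWidth > 0 && bitWidth < 63);
    if (isSigned) {
        const double symmetricMax = std::fmax(maxFloat, std::fabs(minFloat));  // Eq. 1
        assert(symmetricMax > 0.0);
        return (std::ldexp(1.0, bitWidth - 1) - 0.5) / symmetricMax;           // Eq. 2
    }
    assert(maxFloat > 0.0);
    return (std::ldexp(1.0, bitWidth) - 0.5) / maxFloat;                       // Eq. 4
}

// Eqs. 3 and 5: round to the nearest integer (tie handling is an assumption).
std::int64_t toInteger(double floatValue, double constant) {
    return std::llround(floatValue * constant);
}

// Eq. 6: the real value that an integer value represents.
double toReal(std::int64_t integerValue, double constant) {
    return static_cast<double>(integerValue) / constant;
}
```

For a signed 10 bit signal with a range from −3.14 to 3.14, the conversion constant is 511.5/3.14, so 1.57 maps to the integer 256 and −0.785 maps to −128, which are the values listed in Table 1.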
Addition, subtraction and multiplication operators have been implemented as class methods. The maximum
and minimum values are calculated and stored in the class so that bit widths can be reduced to a
minimum for each operator. Before adding or subtracting, the conversion constants of the inputs should
be matched. They are approximately matched by multiplying by a power of two, which is done by bit
shifting. The multiplication method is implemented so that only one DSP slice is used, in order to
reduce FPGA resources. One DSP slice can perform 25 bit × 18 bit multiplications, so the bit width of
each input is constrained to 25 bits or 18 bits by applying bit shifts. An if-else method is also
implemented to control the flow of the algorithm. It consists of a comparing component and an
assigning component. Two signals can be compared with a compare method, which receives ==, !=, >=, >,
<=, <, &&, or || as an argument and returns a Boolean type signal. Depending on the comparison,
different arithmetic operations can be performed by setting the assigning component.
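The idea of storing minimum and maximum values to choose the smallest bit width can be sketched as follows. This is a simplified illustration under stated assumptions, not the framework's implementation: IntRange and the function names are hypothetical, and only signed two's-complement signals are considered.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical type: the integer range that a signed signal can take.
struct IntRange {
    std::int64_t min;
    std::int64_t max;
};

// Range of a sum, a difference, and a product of two signals.
IntRange addRanges(IntRange a, IntRange b) { return {a.min + b.min, a.max + b.max}; }
IntRange subRanges(IntRange a, IntRange b) { return {a.min - b.max, a.max - b.min}; }
IntRange mulRanges(IntRange a, IntRange b) {
    const std::int64_t p[4] = {a.min * b.min, a.min * b.max, a.max * b.min, a.max * b.max};
    return {*std::min_element(p, p + 4), *std::max_element(p, p + 4)};
}

// Smallest two's-complement width n with -2^(n-1) <= min and max <= 2^(n-1) - 1.
int signedBitWidth(IntRange r) {
    assert(r.min <= r.max);
    int n = 1;
    while (n < 63 && (r.min < -(std::int64_t{1} << (n - 1)) ||
                      r.max > (std::int64_t{1} << (n - 1)) - 1)) {
        ++n;
    }
    return n;
}
```

Assuming a 10 bit signed input spans −511 to 511, the sum of two such inputs needs 11 bits and adding a third input needs 12 bits, which agrees with the widths of phiAdd and phiAdd2 in Listing 2.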
Each method for this class has logic which can generate
VHDL code. To reduce calculation overhead, a flag is
used to turn it on and off. All the methods also perform
floating-point calculations where the results are stored in
the class so that it can be compared with the integer-
valued calculations.
The <= operator is overloaded to represent that the logic should be performed in one clock cycle, as
in VHDL. When this operator is used, the signal on the left-hand side is assigned a clock cycle one
greater than that of the right-hand side.
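The clock cycle bookkeeping described above can be sketched as follows. The sketch models only the clock cycle of a signal; ClockedSignal, registeredOp, and bufferDepth are hypothetical names and do not come from the framework. It assumes that the result of a registered operation becomes valid one cycle after its later input, and that the earlier input is delayed by the difference in clock cycles.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-in for a pipelined signal: only the clock cycle is modeled.
struct ClockedSignal {
    int clock;  // clock cycle at which the value becomes valid
};

// Number of register stages needed to delay `input` until `other` is also valid.
int bufferDepth(const ClockedSignal& input, const ClockedSignal& other) {
    return std::max(input.clock, other.clock) - input.clock;
}

// Registered two-input operation: the result is valid one cycle after the later input.
ClockedSignal registeredOp(const ClockedSignal& a, const ClockedSignal& b) {
    assert(a.clock >= 0 && b.clock >= 0);
    return {std::max(a.clock, b.clock) + 1};
}
```

With three inputs at clock cycle 0, adding the first two gives a result at cycle 1, so the third input needs a buffer of depth 1 before it can be added to that result at cycle 2. This corresponds to the single-element buffer phi_2_b in Listing 2.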
Example C++ code for a pipelined addition using the framework is shown in Listing 1. After two φ
values are added together in one clock cycle, another φ value is added to the sum in the next clock
cycle, as shown in the bottom part of Listing 1. The automatically generated VHDL code is shown in
Listing 2. The signals are defined according to the logic in the implemented C++ code. The buffers
required for pipelining the logic are also defined. Sequential VHDL statements are written according
to the C++ code. The results simulated by the framework are shown in Table 1, which lists the
floating-point value, integer value and real value of each signal. The real values agree with the
floating-point values to within 0.002, which is smaller than one integer step of about 0.006.
Listing 1: Example C++ code for a pipelined addition. The <= operator is overloaded to represent that
the logic should be performed in one clock cycle. phi_0, phi_1, and phi_2 are defined as signed
signals with 10 bits and have a range from −3.14 to 3.14. The current values of phi_0, phi_1, and
phi_2 are 1.57, −0.785, and 0.785. phi_0 and phi_1 are added during one clock cycle to obtain phiAdd.
This is added to phi_2 to obtain phiAdd2 on the next clock cycle.
// Define signals
JSignal phi_0 <= JSignal(10,  1.57 , -3.14, 3.14, 0, storage);
JSignal phi_1 <= JSignal(10, -0.785, -3.14, 3.14, 0, storage);
JSignal phi_2 <= JSignal(10,  0.785, -3.14, 3.14, 0, storage);
// Addition
JSignal phiAdd <= phi_0 + phi_1;
// Pipelined addition
JSignal phiAdd2 <= phiAdd + phi_2;
Listing 2: Automatically generated VHDL for the pipelined addition example. The framework defines the
signals according to the JSignal properties in the C++ code. There is also a buffer for phi_2 to
synchronize the clock cycle between phiAdd and phi_2 when calculating phiAdd2. The framework writes
VHDL according to the logic defined in C++.
-- Define signals
signal phi_0   : signed( 9 downto 0) := (others=>'0');
signal phi_1   : signed( 9 downto 0) := (others=>'0');
signal phiAdd  : signed(10 downto 0) := (others=>'0');
signal phiAdd2 : signed(11 downto 0) := (others=>'0');
type S10D1Array is array(0 downto 0) of signed(9 downto 0);
signal phi_2_b : S10D1Array := (others=>(others=>'0'));
-- Sequential logic
phiAdd <= resize(phi_0 ,11)+phi_1;
phiAdd2 <= resize(phiAdd,12)+phi_2_b(0);
phi_2_b(0) <= phi_2;
Name      Float value   Integer value   Real value
phi_0      1.570         256             1.57153
phi_1     -0.785        -128            -0.78577
phi_2      0.785         128             0.78577
phiAdd     0.785         128             0.78577
phiAdd2    1.570         256             1.57153
Table 1: Results of the pipelined addition example, where float value is the floating-point value,
integer value is the integer value converted from the floating-point value, and real value is the
value that the integer value represents. Differences between the floating-point values and the real
values are due to the conversion from floating-point values to integer values.
3.2. LUT class
The LUT class generates LUTs with signal instances as input and output, which can be used for
operations that are not directly available in VHDL. Division and trigonometric operators can be
implemented using this class. The LUTs are implemented using BRAMs. After the LUT class is properly
configured, it can generate a text file which has all the values to be stored in the BRAM. This text
file¹ is then used with a commercial synthesis tool [21]. The input is transformed so that its minimum
value is zero, which reduces the BRAM size in certain cases. The output of the BRAM is stored in the
same way, and a constant value is added to recover the proper output. Although this process uses a few
clock cycles, it can drastically reduce the BRAM size in certain cases. This class generates VHDL code
which should be used with a Block Memory Generator IPCORE [21] and the generated text file.
¹ This text file is called a COE (coefficient) file within the Xilinx tools.
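The offset scheme and the generated text file can be sketched as follows. This is a minimal sketch under stated assumptions rather than the LUT class itself: OffsetLut, buildLut, lookUp, and toCoe are hypothetical names, the table is filled from an arbitrary integer function, and the two COE keywords are the ones that appear in Fig. 5.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for a LUT whose address and stored values start at zero.
struct OffsetLut {
    std::int64_t inputOffset;          // subtracted from the input to form the address
    std::int64_t outputOffset;         // added back to the stored value after the read
    std::vector<std::int64_t> stored;  // non-negative values written to the BRAM
};

// Tabulate f over [inMin, inMax] and shift the outputs so that their minimum is zero.
OffsetLut buildLut(std::int64_t inMin, std::int64_t inMax,
                   const std::function<std::int64_t(std::int64_t)>& f) {
    assert(inMin <= inMax);
    std::vector<std::int64_t> raw;
    for (std::int64_t x = inMin; x <= inMax; ++x) raw.push_back(f(x));
    const std::int64_t outMin = *std::min_element(raw.begin(), raw.end());
    for (auto& v : raw) v -= outMin;
    return {inMin, outMin, raw};
}

// Emulate the read: shift the input to an address, read, then add the constant back.
std::int64_t lookUp(const OffsetLut& lut, std::int64_t input) {
    return lut.stored.at(static_cast<std::size_t>(input - lut.inputOffset)) + lut.outputOffset;
}

// Write the stored values as a decimal COE initialization file.
std::string toCoe(const OffsetLut& lut) {
    std::ostringstream out;
    out << "memory_initialization_radix=10;\n";
    out << "memory_initialization_vector=";
    for (std::size_t i = 0; i < lut.stored.size(); ++i) {
        out << lut.stored[i] << (i + 1 < lut.stored.size() ? "," : ";");
    }
    out << "\n";
    return out.str();
}
```

For example, tabulating the integer division 100/x for x from 4 to 7 gives the outputs 25, 20, 16, and 14. Only the values 11, 6, 2, and 0 are stored, and the constant 14 is added back after the read, so the stored values need fewer bits than the outputs themselves.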
3.3. VHDL code storage class
This class stores all the VHDL code generated by the signal class and LUT class, so that the pipelined
arithmetic algorithm can be written to a VHDL file. VHDL syntax for design entities, signal
declarations, and buffers can optionally be added when generating the VHDL file.
4. Implementation examples
The Belle II experiment [1] aims to precisely study charge-conjugation and parity violation in the B
and D meson systems and to search for new physics at the SuperKEKB accelerator [22]. Due to the high
beam current and the small cross sections of the physics processes of interest, a fast and highly
efficient trigger is required.
The level one trigger is implemented using FPGAs to achieve the above goals. It consists of several
systems, and one of the major systems is the level one central drift chamber (CDC) trigger. The level
one CDC trigger algorithms are implemented on merger boards [23] and third generation universal
trigger boards (UT3), which are 6U VME [24] boards with optical cables. The UT3 board was developed by
the high energy accelerator research organization (KEK) for the Belle II level one trigger. The
structure and connections between the CDC front-end, merger, track segment finder, 2D tracker, 3D
tracker and neural network tracker boards can be seen in Fig. 2, where all connections between the
boards are made with optical cables. The UT3 boards, which are used for most of the algorithms in the
level one CDC trigger, each have a Virtex 6 HXT FPGA with 40 GTX and 24 GTH gigabit optical
transceivers, as can be seen in Fig. 3.
The level one trigger uses pipelined algorithms to find patterns that potentially originate from
physics of interest. The pipelined algorithms find track parameters from the CDC hit information and
minimize χ² for the track parameter fits. They also include logic for combining CDC track parameters
with electromagnetic calorimeter (ECL) cluster parameters. The framework described above has been used
to automatically generate the firmware and the C++ simulation code for these algorithms. All the
algorithms in the following sections are implemented on the UT3 boards.
4.1. χ² minimization fitters
There are two fitters that have been developed using our framework. One fitter minimizes χ² defined as
$$ \chi^2 = \sum_{i}^{5} \frac{\left[2\left(a\cos\phi_i + b\sin\phi_i\right) - r_i\right]^2}{\sigma_i^2}, \tag{7} $$
where a and b are fit parameters and φ_i, r_i and σ_i are input variables related to charged tracks in
the CDC. The second fitter transforms the wire hit information into a geometric representation and
minimizes a χ² to obtain track parameters. The transformation equations are
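$$ \phi_{\mathrm{fineSt}} = \phi_{\mathrm{st}} \pm \mathrm{LUT}\left(\mathrm{TDC} - t_0\right) \tag{8} $$
$$ \phi_{\mathrm{ax}} = \pm\cos^{-1}\left(\frac{r\rho}{2}\right) + \phi_{\mathrm{incident}} \mp \frac{\pi}{2} \tag{9} $$
$$ z = \frac{z_{\mathrm{endplate}} - 2r\sin\left(\frac{\phi_{\mathrm{fineSt}} - \phi_{\mathrm{ax}}}{2}\right)}{\tan\theta_{\mathrm{st}}} \tag{10} $$
$$ s = \sin^{-1}\left(\frac{r\rho}{2}\right), \tag{11} $$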
where φ_fineSt is the fine φ position of a hit stereo wire (a wire that has a finite φ shift at the
end plates), φ_st is the φ position of a hit stereo wire, TDC is the wire hit time relative to the
revolution of the beam, t_0 is the event time relative to the revolution of the beam, LUT is a look-up
table that holds the x-t curve of the CDC, φ_ax is the φ position that a stereo wire would have if it
were an axial wire (a wire that is parallel to the beam), r is the radius of a stereo wire layer, ρ is
the curvature of a track, φ_incident is the incident angle of a track, z is the geometric hit
position, z_endplate is the distance from the interaction point (IP) to the end plate of the CDC, and
s is the arc length of the track in a two dimensional plane for a stereo wire layer [25]. The χ² is
defined as
$$ \chi^2 = \sum_{i}^{4} \frac{\left[\left(\cot\theta \times s_i + z_0\right) - z_i\right]^2}{\sigma_i^2}, \tag{12} $$
where cot θ and z_0 are fit parameters, z_i and s_i are the z and s for hit stereo wires, and σ_i is
the resolution of z_i. These χ² minimizations have analytical solutions, which have been used to
calculate the fit parameters. They are
a =
5∑
i
sin2 φi
σ2
i
5∑
i
ri cosφi
σ2
i
−
5∑
i
sinφi cosφi
σ2
i
5∑
i
ri sin φi
σ2
i
2
[
5
∑
i
cos2 φi
σ2
i
5∑
i
sin2 φi
σ2
i
−
(
5∑
i
sinφi cosφi
σ2
i
)2] (13)
b =
5∑
i
cos2 φi
σ2
i
5∑
i
ri sin φi
σ2
i
−
5∑
i
sinφi cosφi
σ2
i
5∑
i
ri cosφi
σ2
i
2
[
5∑
i
cos2 φi
σ2
i
5∑
i
sin2 φi
σ2
i
−
(
5∑
i
sinφi cosφi
σ2
i
)2] (14)
and
cot θ =
4∑
i
1
σ2
i
4∑
i
sizi
σ2
i
−
4∑
i
si
σ2
i
4∑
i
zi
σ2
i
4∑
i
1
σ2
i
4∑
i
s2
i
σ2
i
−
(
4∑
i
si
σ2
i
)2 (15)
z0 =
−
4∑
i
si
σ2
i
4∑
i
sizi
σ2
i
+
4∑
i
s2
i
σ2
i
4∑
i
zi
σ2
i
4∑
i
1
σ2
i
4∑
i
s2
i
σ2
i
−
(
4∑
i
si
σ2
i
)2 . (16)
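As a floating-point reference, the closed-form solutions for cot θ and z0 in Eqs. 15 and 16 can be sketched in C++ as follows. This is only an illustration of the formulas: LineFit and fitLine are hypothetical names, the sums run over however many hits are passed in, and none of the integer conversion, LUT-based division, or pipelining of the firmware version is modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type for the weighted straight-line fit z = cotTheta * s + z0.
struct LineFit {
    double cotTheta;
    double z0;
};

// Weighted least-squares solution of Eq. 12, written out as in Eqs. 15 and 16.
LineFit fitLine(const std::vector<double>& s, const std::vector<double>& z,
                const std::vector<double>& sigma) {
    assert(s.size() == z.size() && s.size() == sigma.size());
    double sum1 = 0.0, sumS = 0.0, sumZ = 0.0, sumSS = 0.0, sumSZ = 0.0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const double w = 1.0 / (sigma[i] * sigma[i]);  // weight 1 / sigma_i^2
        sum1 += w;
        sumS += w * s[i];
        sumZ += w * z[i];
        sumSS += w * s[i] * s[i];
        sumSZ += w * s[i] * z[i];
    }
    const double denominator = sum1 * sumSS - sumS * sumS;  // shared by Eqs. 15 and 16
    assert(denominator != 0.0);
    return {(sum1 * sumSZ - sumS * sumZ) / denominator,       // Eq. 15
            (-sumS * sumSZ + sumSS * sumZ) / denominator};    // Eq. 16
}
```

For four hits that lie exactly on the line z = 0.5 s + 2 with equal resolutions, the function returns cot θ = 0.5 and z0 = 2.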
These solutions consist of addition, subtraction, multiplication, division and trigonometric
operations. Division and trigonometric operations are implemented using the LUT class.
[Figure 2 diagram. Boards: CDC Front-End (x292); Merger (x73); Track Segment Finder (x9, UT3 board);
2D tracker (x4, UT3 board); Event time finder (UT3 board); 3D tracker (x4, UT3 board); Neural network
tracker (x4, UT3 board); later stages of the level one trigger. Link labels: 2.4 Tbps max., 1.2 Tbps
max., 686 Gbps max., 163 Gbps max., 43 Gbps max., 40 Gbps max., 40 Gbps max.]
Figure 2: Structure of the Belle II level one CDC trigger. The level one CDC trigger consists (from the left) of CDC front-end boards, merger
boards, track segment finder boards, 2D tracker boards, event time finder board, 3D tracker boards and neural network boards. Data is
transferred using gigabit optical transceivers between boards. The CDC front-end boards receive the CDC detector response. The merger
boards combine the CDC front-end data. The track segment finder finds partial tracks. The 2D tracker finds tracks in a two dimensional
plane. The event time finder finds the initial timing of the event. The 3D tracker and neural network tracker find three dimensional tracks.
Figure 3: A picture of the UT3 board. The UT3 board is a 6U VME
board with two daughter boards that extend the number of channels
of communication. It has a Virtex 6 HXT FPGA with 40 GTX and
24 GTH gigabit optical transceivers. It was developed by the high
energy accelerator research organization (KEK) for the Belle II level
one trigger.
For the second fitter, we used the VHDL code from our framework to generate firmware. Xilinx ISE [26]
was used to generate the firmware for the FPGA on the UT3 board, where the clock frequency timing
constraint was set to 127 MHz. Xilinx ISE reported that 2,365 slice registers, 2,919 slice LUTs, 18
RAMB36E1s, 5 RAMB18E1s, and 52 DSP48E1s were used for the design and that all timing constraints were
met. A firmware based test bench was developed to record the results from the firmware module that
contains the generated VHDL code. We compare the firmware results with the simulated results from our
framework.
The structure of the firmware based test bench can be seen in Fig. 4. The test bench has LUTs which
contain values that are acquired from a trigger simulation (TSIM). The values are given to the
firmware module clock by clock. The output of the firmware module is connected to Chipscope [27] to
record the firmware response, which is shown in Fig. 5. The recorded firmware results and the
simulation results from the framework are found to be identical down to the single bit level. Fig. 6
shows that, for a large sample, the integer output of the framework and the recorded firmware results
for z0 are exactly the same. We were also able to confirm that the firmware latency is 19 clock
cycles, which was the value expected from the framework. The recorded firmware results and the
floating-point calculation results from the framework for z0 are strongly correlated, as shown in
Fig. 7. To determine the precision of the recorded firmware results, they are multiplied by a
conversion constant and the floating-point calculation results from the framework are subtracted from
them; a histogram of the resulting z0 precision is shown in Fig. 8. The root mean square (RMS) of the
histogram was found to be 0.2 cm, which is below the expected resolution of the z0 fitter algorithm of
O(1) cm [25]. These results show that our framework works well and satisfies the level one trigger
requirements of Belle II.
[Figure 4 diagram: on the UT3 board, three LUTs holding simulated TSF data, simulated 2DT data, and
simulated ETF data are connected to the 3D tracker's fitter (module generated from our framework),
which is connected to Chipscope.]
Figure 4: Firmware based test bench structure for testing the 3D
tracker's fitter firmware module which was generated from the framework.
The firmware module is tested with data from a TSIM which
are held in LUTs. The LUTs contain track segment finder (TSF),
2D tracker (2DT), and event time finder (ETF) data. Chipscope is
connected to record the output of the firmware.
(a)
z0      225  -435  184   79  630  -621  -109  -528
cot θ   244    94   23  110  -22   405   347   350
(b)
memory_initialization_radix=10;
memory_initialization_vector=225,-435,184, 79,630,-621,-109,-528,
(c)
memory_initialization_radix=10;
memory_initialization_vector=244, 94, 23,110,-22, 405, 347, 350,
Figure 5: Recorded results of the firmware using Chipscope and inte-
ger based simulation results from the framework for an automatically
generated VHDL code. In (a), recorded firmware results for z0 and
cot θ are shown. In (b), integer based simulation results from the
framework for z0 are shown. In (c), integer based simulation results
from the framework for cot θ are shown. The firmware results and
integer based simulation results from the framework match perfectly.
[Figure 6 plot: "Framework sim. vs Firmware"; x axis: firmware data (integer), −800 to 800; y axis:
framework sim. data (integer), −800 to 800]
Figure 6: Comparison of z0 between the recorded firmware results
and the integer based simulation results from the framework (frame-
work sim.) for an automatically generated VHDL code. The results
are exactly equal.
[Figure 7 plot: "Floating-point vs Firmware"; x axis: firmware data (integer), −800 to 800; y axis:
floating-point data (cm), −30 to 30]
Figure 7: Correlation of z0 between the recorded firmware results and
the floating-point calculation results. They have a strong correlation
with each other. The outliers are due to the accumulation of lost
precision when processing the pipelined integer based algorithm.
[Figure 8 histogram: "Firmware - Floating-point"; x axis: z0 (cm), −2 to 2; statistics box
"precision": Entries 1024, Mean −0.02473, RMS 0.248]
Figure 8: Histogram of the firmware precision for z0. The recorded firmware results were multiplied by
a conversion constant and the floating-point calculated results were subtracted from them to calculate
the precision. The RMS is 0.2 cm, which is below the expected resolution of the z0 fitter algorithm of
O(1) cm [25].
4.2. CDC geometry calculation
The position of a track at a specific layer of the CDC needs to be calculated from the track
parameters. The position is calculated in two steps. In the first step, our algorithm calculates the
φ_ax position of the track for the layer of the CDC using Eq. 9. The second step converts the φ_ax
position to the corresponding wire position at a given layer of interest. These calculations consist
of addition, subtraction, multiplication and trigonometric operations. The trigonometric operations
were implemented using the LUT class. A firmware test bench confirmed that the synthesized firmware
and the algorithm simulated using the framework return the same results.
4.3. Combining CDC track and ECL cluster parameters
The framework has been used in algorithms that combine CDC trigger and ECL trigger information. The
level one CDC trigger outputs track momentum parameters, while the level one ECL trigger outputs the
positions of clusters created by the energy deposited by the tracks. The two pieces of information can
be combined using the position of the track, which can increase the performance of the trigger. Using
the track momentum parameters from the CDC trigger, the expected position of the track in the ECL
detector is calculated, and it is used to calculate the distance between the expected position and the
actual cluster position. This distance can be used to relate the CDC tracks and ECL clusters. The
ratio between the energy and momentum of the track is also calculated, which can help to identify the
particle. All of these calculations are implemented using the developed framework.
Linear regression     float HLS   fixed HLS   framework
LUT                   8061        1104        723
FF                    8064        501         283
BRAM                  0           5           2.5
DSP                   126         40          29
Latency (clocks)      33          5           5
z0 precision (cm)                 0.12        0.13
Table 2: Comparison between HLS and the framework for a linear regression algorithm. A floating-point
calculation (float HLS) version and a fixed-point calculation (fixed HLS) version were developed. The
resource usage, latency, and precision are compared, where LUT is the number of slice look-up tables,
FF is the number of flip-flops, BRAM is the number of block RAMs, DSP is the number of DSP slices,
latency is the number of clock cycles needed to perform the algorithm, and z0 precision is the RMS of
a histogram of the results after the floating-point calculation results are subtracted.
5. Comparison with Vivado HLS
Our framework was compared with Vivado HLS for a linear regression algorithm (Eq. 15 and Eq. 16).
Floating-point and fixed-point versions were developed using Vivado HLS. For a fair comparison in
terms of latency, a LUT that replaces the division operator was also developed for the Vivado HLS
fixed-point version. All firmware was synthesized and implemented using the Vivado design suite [28].
The target FPGA and clock frequency timing constraint were set to xcvu080-ffvb2104-2-e and 127 MHz.
Simulation results from Vivado HLS and our framework were used to measure the precision of the
firmware, where the input data is from TSIM. The precision is measured by filling a histogram with the
simulated firmware results minus the floating-point calculated results. We define the precision to be
the RMS of the histogram. The resource usage, latency and precision can be found in Table 2. We find
that, while staying within the expected resolution of the z0 fitter algorithm of O(1) cm, our
framework uses the fewest resources and matches the lowest latency, as shown in Table 2.
6. Conclusions
A framework that allows automatic generation of pipelined algorithms in VHDL and also simulates the
algorithms has been implemented. It was validated with χ² minimization fitters, a sub-detector
geometry calculation, and algorithms that combine sub-detectors. It was compared with Vivado HLS and
found to use fewer resources, with a latency no larger than that of Vivado HLS. The framework has been
applied to the development and maintenance of pipelined arithmetic firmware algorithms in a variety of
situations and has been shown to handle these tasks efficiently. Our framework can be used for future
trigger development.
Acknowledgment
We acknowledge support from the National Research Foundation of Korea, Grant No.
NRF-2017R1A2B3001968.
References
[1] T. Abe, et al., Belle II Technical Design Report
arXiv:1011.0352v1.
[2] Pamela Klabbers, Operation and Performance of the CMS
Level-1 Trigger during 7 TeV Collisions, Physics Procedia
37 (Supplement C) (2012) 1908 – 1916, proceedings of the 2nd
International Conference on Technology and Instrumentation in
Particle Physics (TIPP 2011).
[3] Imma Riu and the ATLAS Collaboration, Performance of the
ATLAS Trigger with Proton Collisions at the LHC, Journal of
Physics: Conference Series 331 (3) (2011) 032027.
[4] R. Itoh, T. Higuchi, M. Nakao, S. Y. Suzuki, and S. Lee, Data
Flow and High Level Trigger of Belle II DAQ System, IEEE
Transactions on Nuclear Science 60 (5) (2013) 3720–3724.
[5] G Bauer, et al., The data-acquisition system of the CMS experi-
ment at the LHC, Journal of Physics: Conference Series 331 (2)
(2011) 022021.
[6] The ATLAS TDAQ Collaboration, The ATLAS Data Acquisi-
tion and High Level Trigger system, Journal of Instrumentation
11 (06) (2016) P06008.
[7] C Foudas, The CMS Level-1 Trigger at LHC and Super-LHC.
[8] P. B. Amaral, et al., The ATLAS Level-1 trigger timing setup,
in: 14th IEEE-NPSS Real Time Conference, 2005., 2005, pp. 4
pp.–.
[9] Y. Iwasaki, B. Cheon, E. Won, and G. Varner, Level 1 trigger
system for the Belle II experiment, in: 2010 17th IEEE-NPSS
Real Time Conference, 2010, pp. 1–9.
[10] J Chaves, Implementation of FPGA-based level-1 tracking at
CMS for the HL-LHC, Journal of Instrumentation 9 (10) (2014)
C10038.
[11] R. Caputo, et al., Upgrade of the ATLAS Level-1 trigger with
an FPGA based Topological Processor, in: 2013 IEEE Nuclear
Science Symposium and Medical Imaging Conference (2013
NSS/MIC), 2013, pp. 1–5.
[12] E. Won, A hardware implementation of artificial neural net-
works using field programmable gate arrays, Nuclear Instru-
ments and Methods in Physics Research Section A: Accel-
erators, Spectrometers, Detectors and Associated Equipment
581 (3) (2007) 816 – 820.
[13] E. Bartz, et al., Fpga-based real-time charged particle trajec-
tory reconstruction at the large hadron collider, in: 2017 IEEE
25th Annual International Symposium on Field-Programmable
Custom Computing Machines (FCCM), 2017, pp. 64–71.
[14] J. Wu, M. Wang, E. Gottschalk, and Z. Shi, Fpga curved track
fitters and a multiplierless fitter scheme, IEEE Transactions on
Nuclear Science 55 (3) (2008) 1791–1797.
[15] N. Shirazi, A. Walters, and P. Athanas, Quantitative analysis of
floating point arithmetic on fpga based custom computing ma-
chines, in: Proceedings IEEE Symposium on FPGAs for Cus-
tom Computing Machines, 1995, pp. 155–162.
[16] W. B. Ligon, et al., A re-evaluation of the practicality of
floating-point operations on fpgas, in: Proceedings. IEEE Sym-
posium on FPGAs for Custom Computing Machines (Cat.
No.98TB100251), 1998, pp. 206–215.
[17] Xilinx, Vivado Design Suite User Guide: High Level Syn-
thesis, available at https://www.xilinx.com/support/
documentation/sw_manuals/xilinx2014_1/ug902-vivado-
high-level-synthesis.pdf.
[18] Xilinx, Reduce Power and Cost by Converting from Floating
Point to Fixed Point, available at https://www.xilinx.com/
support/documentation/white_papers/wp491-floating-to-
fixed-point.pdf.
[19] Xilinx, Virtex-6 FPGA DSP48E1 Slice, available at https://
www.xilinx.com/support/documentation/user_guides/ug369.
pdf.
[20] Xilinx, Virtex-6 FPGA Memory Resources, available at
https://www.xilinx.com/support/documentation/user_
guides/ug363.pdf.
[21] Xilinx, LogiCORE IP Block Memory Generator, avail-
able at https://www.xilinx.com/support/documentation/ip_
documentation/blk_mem_gen/v7_3/pg058-blk-mem-gen.pdf.
[22] Yukiyoshi Ohnishi, et al., Accelerator design at SuperKEKB,
Progress of Theoretical and Experimental Physics 2013 (3)
(2013) 03A011.
[23] Y. S. Teng, et al., The status of high-speed trigger mul-
tiplexer module with aurora protocol implemented on ar-
ria ii fpga for the belle ii cylindrical drift chamber detec-
tor, in: 2013 IEEE Nuclear Science Symposium and Med-
ical Imaging Conference (2013 NSS/MIC), 2013, pp. 1–3.
doi:10.1109/NSSMIC.2013.6829748 .
[24] IEEE Standard for a Versatile Backplane Bus: VMEbus, AN-
SI/IEEE Std 1014-1987 doi:10.1109/IEEESTD.1987.101857 .
[25] E. Won, J. B. Kim, and B. R. Ko, Three dimensional fast tracker
for central drift chamber based level 1 trigger system in the Belle
II experiment arXiv:1711.02800v1.
[26] Xilinx, ISE Design Suite 14, available at https://www.xilinx.
com/support/documentation/sw_manuals/xilinx14_7/irn.
pdf.
[27] Xilinx, Chipscope Pro Software and Cores, available
at https://www.xilinx.com/support/documentation/sw_
manuals/xilinx14_7/chipscope_pro_sw_cores_ug029.pdf.
[28] Xilinx, Vivado Design Suite User Guide, available at https://
www.xilinx.com/support/documentation/sw_manuals/
xilinx2016_4/ug893-vivado-ide.pdf.
