Investigating the Dirac operator evaluation with FPGAs by Korcyl, G. & Korcyl, P.
Investigating the Dirac Operator Evaluation with FPGAs
Grzegorz Korcyl1, Piotr Korcyl2,3
c© The Authors 2019. This paper is published with open access at SuperFri.org
In recent years the computational capacity of single Field Programmable Gate Array (FPGA)
devices as well as their versatility have increased significantly. Adding to that fact the High Level
Synthesis frameworks allowing to program such processors in a high level language like C++, makes
modern FPGA devices a serious candidate as building blocks of a general purpose High Performance
Computing solution. In this contribution we describe benchmarks which we performed using a kernel
from the Lattice QCD code, a highly compute-demanding HPC academic code for elementary particle
simulations on the newest device from Xilinx, the U250 accelerator card. We describe the architecture
of our solution and benchmark its performance on a single FPGA device running in two modes: using
the external or embedded memory. We discuss both approaches in detail and provide estimates for
the necessary memory throughput and the minimal amount of resources needed to deliver optimal
performance depending on the available hardware. Our considerations can be used as guidelines for
estimating the performance of larger, many-node systems.
Keywords: high performance computing, FPGA, lattice QCD, Dirac operator evaluation.
Introduction
Quantum Chromodynamics is the theory describing the interactions of quarks and gluons,
explaining why the latter form bound states such as protons and neutrons. One of the characteristic
features of this theory is that in the low energy regime quarks and gluons form a strongly coupled
system. As a consequence it is difficult to extract predictions for the properties of such a system
from First Principles of Physics. Up to now the only available computational tool allowing for such
calculations are numerical simulations (Monte Carlo simulations) of a discretized version of the
theory, called Lattice Quantum Chromodynamics (LQCD). Traditionally, physicists working in the
field of LQCD searched for the most performant, vector machines consisting of a large number of
compute nodes, and have designed many new HPC solutions: QCDOC [3], APE [1], QPACE [2],
just to name a few. Currently GPU and ARM processors are considered for the next generation of
supercomputing machines and it is an open question whether FPGA devices could be used as an
alternative.
In the discretized version of Quantum Chromodynamics the basic degrees of freedom are as-
sociated to each point of a four-dimensional grid representing a finite volume of four-dimensional
space-time. The sizes of such volumes vary from V = 106 up to V = 108 points. The most com-
pute intensive part of any such simulation is the inversion of the Dirac matrix, which is of the size
(24V )× (24V ). The matrix has a sparse structure because it describes the nearest-neighbour inter-
actions. The Dirac matrix D(n,m)ABαβ acting on the vector ψ(n) can be written down as follows [6]
1Department of Information Technologies, Faculty of Physics, Astronomy and Applied Computer Science, Jagiellonian
University, Krako´w, Poland
2Institut fu¨r Theoretische Physik, Universita¨t Regensburg, Regensburg, Germany
3M. Smoluchowski Institute of Physics, Jagiellonian University, Krako´w Poland
ar
X
iv
:1
90
4.
08
61
6v
2 
 [c
s.D
C]
  1
8 J
ul 
20
19
D(n,m)ABαβ ψ
B
β (m) = (mq + 4)ψ
A
α (n)+
+
1
2
3∑
µ=0
[
UABµ (n)P
−µ
αβ ψ
B
β (n+ µˆ) + U
†,AB(n− µˆ)P+µαβ ψBβ (n− µˆ)
]
(1)
The most elementary computational block is the evaluation of the single stencil, i.e. evaluation
of the right hand side of (1) for a given value of the index n. Note, that the coefficients of the
D(n,m)ABαβ matrix differ for each m, i.e. the U complex-valued 3×3 matrices and ψ complex-valued
3-element vectors depend on the position m. Therefore, each stencil involves loading of eight U(n)
matrices and nine spinor fields from the neighboring lattice sites, which in total corresponds to 360
input words. In the case of double precision this amounts to 2880 input bytes. One can exploit the
structure of the SU(3) matrices and parametrize them in terms of 10 input words each, instead
of 18 in the naive formulation (9 real and 9 imaginary entries). We return to this point in Section
2.2. The U × ψ matrix-vector multiplications require 1464 floating point operations for complex
additions and multiplications. P± are real-valued 4 × 4 constant matrices, mq is a real parameter
corresponding to the quark mass, µ labels directions in the four-dimensional space-time. Repeated
indices are summed within the ranges: α, β = 1, . . . , 4, A,B = 1, 2, 3. For unexplained notation
please see [6] or [8]. One of the simplest algorithms allowing to invert such a matrix is an iterative
conjugate gradient algorithm. The relevance of this algorithm is demonstrated by the fact that
the HPCG benchmark was introduced since November 2017 as a new ranking of supercomputers
published by the TOP500 organization. Such benchmark differs from the traditionally used Linpack
benchmark where the employed matrix was dense. The argument behind the HPCG benchmark
is that in many cases sparse matrix computations are more representative of the variety of HPC
applications which run on a supercomputer. Indeed, the iterative solver of the type of conjugate
gradient is, for instance, at the heart of Monte Carlo simulation of QCD.
The rest of this article is organized as follows. In the next section we specify the details of the
implemented algorithm as well as summarize the description of the kernel which is being hardware
accelerated. Subsequently in the following section we propose two implementations on the FPGA
devices which differ by the location where the main data is stored, either these are registers in the
programmable logic, or an external DDR memory bank attached to the programmable logic. In
section 3 we compare and discuss the achieved performances using both approaches. Eventually we
conclude and point to future research directions.
1. Kernel Description
In this work we consider an improved version of the conjugate gradient algorithm which allows
us to test different floating and fixed point precision without a deterioration of the ultimate solution.
Similar considerations for GPU were presented in [5]. The algorithm intertwines iterations in low
and high precision, working mainly in low precision and correcting a possible systematic error by a
high precision iteration. Our algorithm follows the one suggested in [9] and is shown in Algorithm
1. We provide an exact form of the mixed precision conjugate gradient algorithm implemented in
this work to show which parts have been hardware accelerated and what is the interplay between
parts of the algorithm requiring implementations in different precision. In both cases the most time
consuming part are matrix multiplications in lines 2, 14 and 24 of Algorithm 1.
Algorithm 1 Residual Guided CG algorithm
1: ψhigh ← ψhigh0
2: rhigh0 ← ηhigh −
(
D†D
)high
ψhigh
3: shigh0 ← ||rhigh0 ||
4: r0 ← r
high
0
shigh0
5: l← 0
6: while shigh ≥ rhighmin do
7: n← 0
8: ψ0 ← 0
9: r0 ← r
high
l+1
shighl+1
10: p0 ← pk −
(
r0 · pk
)
r0
11: α0 ← 0
12: β0 ← s
high
l+1
shighl ρk
13: while n < k do
14: qn ← D†Dpn
15: αn ← ρnpn·qn
16: ψn+1 ← ψn + αnpn
17: rn+1 ← rn − αnqn
18: ρn+1 ← rn+1 · rn+1
19: βn ← ρn+1ρn
20: pn+1 ← rn+1 + βnpn
21: n← n+ 1
22: end while
23: ψhighl+1 ← ψhighl + shighl
(
ψk + αkpk
)
24: rhighl+1 ← bhigh −
(
D†D
)high
ψhighl+1
25: shighl+1 ← ||rhighl+1 ||
26: l← l + 1
27: end while
We wish to hardware accelerate them and briefly summarize the FPGA implementation of
these kernel functions. We follow what was presented in [8]. In particular that Reference contains
a description of C++ data structures used for the implementation as well as relevant details of the
memory allocation which allows for a fully pipelined execution of the kernel. Fragments of C++
and HLS directives codes are provided and discussed in that Reference.
For both, high and low, precisions of the kernel the implementation is similar: a single function
involves a loop over a subvolume and an evaluation of the stencil for each site of the lattice. The
evaluation of a single stencil is fully parallelized as far as the data dependencies allow and all stencils
are pipelined.
Figure 1. Computation sequence of the stencil solver
All operations involved in the estimation of a single stencil are graphically shown in Fig. 1.
The evaluation naturally splits into 4 stages. The clock cycles provide an estimate of the amount
of parallelization and correspond to the number of clock cycles required to finish the computation
at a given stage in double precision. In the first stage all the necessary data is copied from the
BRAM memory blocks to local registers which requires only one clock cycle. In stage 2 linear
combinations of input data, 8 additions and 8 subtractions of vector type, are evaluated. They are
all performed in parallel, taking 14 clock cycles, which corresponds to a single addition of double
numbers in programmable logic. The most compute intensive stage 3 involves SU(3) matrix by
vector multiplications. In total 1152 operations are performed. Complete parallelization allows to
execute them in a 5-layer operation cascade taking in total 5 ∗ 14 = 70 cycles. Finally, at stage 4
all contributions are added up to the final result. Because of the dependencies between consecutive
partial results this creates a 4-layer operation cascade, which in total takes 57 = (4 ∗ 14) + 1 clock
cycles, 4 additions plus one data copy. Overall, the kernel requires 142 clock cycles and a total of
1464 basic operations to compute the final result since the reception of the input data. The kernel
is fully pipelined: i.e. it can accept new input data at each clock cycle and produce the results with
latency of 142 cycles.
2. Two Approaches
There are two approaches one can follow in order to provide to the kernel the required data.
One can divide the entire problem into small parts such that the entire set of data for a single part
fits into the BRAM memory of the device. Alternatively, one can store the entire set of data in the
DDR die attached to the programmable logic and stream the data through the link. We discuss
below the performances of both solutions.
Figure 2. Memory usage as a function of the initiation interval
2.1. Smaller Lattice Stored in BRAM Memory
This is the approach we followed in [8]. We showed that lattices up to the size of 12× 83 data
points in each direction in double precision can fit into the internal memory of the programmable
logic of the FPGA devices available currently on the market. In Fig. 2 we show the required number
of URAM blocks for a given size of the lattice for single and double precision. As can be seen in
that figure the storage requirements are not linear because in order to allow the compiler to take
advantage of the natural parallelism of FPGA devices it is crucial to store data in PL in as many
separate PL local registers blocks as possible. This is due to the fact that in a single PL clock cycle
only one memory element can be read from the BRAM block. In the computation of a single stencil
one needs eight different U matrices and we impose that they are stored separately. Although this
requires duplicating the amount of stored data, the matrices U(n) and U †(n) are stored separately,
the gain is considerable. The HLS directives ensuring such memory allocation were described in [8].
Thanks to that the stencil evaluation can be fully pipelined, i.e. the hardware block can accept new
input data at each clock cycle. The resulting performance simulated in software is 812 GFLOPs for
single precision and 406 GFLOPs for double precision with the PL running at 300 MHz.
2.2. Larger Lattice Streamed from the DDR Memory
The way to operate on larger data sets is to keep the data in the DDR die attached to the
programmable logic and process the data in a streaming mode. This was investigated on the Maxeler
system in [7]. The U matrices and ψ spinors are prepared beforehand into sets corresponding to
consecutive stencils and are streamed continuously from the DDR into the logic. The limitation of
this solution is the throughput of the memory link between the DDR and the logic. Using SDAccel
and an openCL implementation of the CG algorithm we verified that for the Xilinx U250 device one
can send 256B from the DDR memory to the PL part per clock cycle working at the frequency of
Figure 3. Transmission rates as a function of the initiation interval
Figure 4. Performance as a function of the initiation interval
Figure 5. Resources consumption as a function of the initiation interval
300 MHz. Four channels are available aggregating to 77 GBps throughput. In order to decrease the
amount of data transferred we change the representation of the U matrices and following [4] we use
a 10 parameter parametrization. We trade two more parameters and avoid computing trigonometric
functions in the programmable logic. The reduced set of data translates to an initiation interval
of 5 and 9 clock cycles for the compute kernel for single and double precision respectively, i.e the
programmable logic has to wait 5/9 clock cycles to gather enough data to start a new computation.
The performance in that case would approximately be equal to 86 and 46 GFLOPs respectively,
which is comparable to the one quoted in [7] on the Maxeler system. However, if we also count
the additional operations needed to recover the U matrices from their reduced form, the achieved
sustained performance reaches 194 GFLOPs for single precision. In Fig. 3 we show how the required
throughput depends on the initiation interval. The calculation assumes the reduced form of the U
matrices. The smaller the initiation interval is, the shorter is the time in which the data has to
be transferred. We show the throughput estimates for single and double floating point precision.
Knowing the throughput between the DDR and the programmable logic on a given device one can
easily read the corresponding minimal initiation interval and henceforth the resulting performance,
which is shown in Fig. 4 for both the single and double precision case.
Finally, in Fig. 5 we show how the hardware resource consumption depends on the initiation
interval for single and double floating point precision but also for a more FPGA friendly 32 bit fixed
point data format. In this streaming scenario one can relax the initiation interval of one clock cycle
imposed in the first approach. The memory throughput being the bottleneck, one can implement the
kernel with a lower initiation interval because in any case several clock cycles are needed to collect
all the necessary data for a single stencil computation. The figure shows an indicative percentage
of all available resources counting together all DSP, LUTs and BRAM blocks. We see that in the
described case where the memory throughput imposes an initiation interval of 5 clock cycles the
compute kernel uses only 20% of the available resources for double precision.
3. Discussion
The presented results allow to understand various constraints limiting the performance of the
investigated kernel on FPGA devices. Starting from the embedded memory scenario, the practical
problems that are being analyzed are larger by a factor of the order of 4096. One would probably use
that amount of FPGA devices running in parallel and exchanging boundary data directly from and
to the programmable logic through the embedded transceivers. On the other hand, in the external
memory scenario in principle the entire set of data could be stored in the DDR. However, the wall
clock time to the solution on a single FPGA device would be impractically long. In that case one
would also resort to a many-node system where the computations could be speed up by running
them in parallel. In principle neither of the two scenarios is obviously superior. The number of
required nodes can be different in both solutions and the details would depend essentially on the
memory throughput of the FPGA device used. With the numbers provided above such estimations
can be put on a solid ground.
Conclusions
In this contribution we discussed the applicability of FPGA devices to High Performance Com-
puting solutions. As a benchmark we used the academic code for Monte Carlo simulations of Quan-
tum Chromodynamics. In traditional computer architectures this code is memory bound due to the
unfavorable ratio of the amount of data to be loaded to the amount of floating point operations to
be executed be the most elementary kernel function. On the available programmable logic hardware
the problem turns out to be memory bound in the scenario where the data is streamed from the
DDR die, which will be considerably improved with the arrival of Xilinx Alveo U280 cards with a
480 GB/s memory bandwidth between DDR and programmable logic. In the scenario where the
data is stored in the embedded memory, the problem’s limitation is the available size of the in-
ternal memory. Both cases seem to be scalable and thus offer a viable proposal for a larger scale
infrastructure.
Acknowledgments
This work was in part supported by Deutsche Forschungsgemeinschaft under Grant No.
SFB/TRR 55 and by the polish NCN grant No. UMO-2016/21/B/ ST2/01492, by the Foundation
for Polish Science grant no. TEAM/2017-4/39 and by the Polish Ministry for Science and Higher
Education grant no. 7150/E-338/M/2018. The project could be realized thanks to the support
from Xilinx University Program and their donations. P.K. acknowledges support from the NAWA
Bekker fellowship and thanks Universita` degli Studi di Roma Tor Vergata for hospitality during
which this work was finalized.
This paper is distributed under the terms of the Creative Commons Attribution-Non Commercial
3.0 License which permits non-commercial use, reproduction and distribution of the work without
further permission provided the original work is properly cited.
References
1. APE collaboration: APE project in Rome. http://apegate.roma1.infn.it, accessed: 2019-
04-19
2. Baier, H., et al.: QPACE: A QCD parallel computer based on Cell processors. PoS LAT2009,
001 (2009), DOI: 10.22323/1.091.0001
3. Boyle, P., Chen, D., Christ, N., Clark, M., Cohen, S., Cristian, C., Dong, Z., Gara, A., Jo, B.,
Jung, C., Kim, C., Levkova, L., Liao, X., Liu, G., Mawhinney, R., Ohta, S., Petrov, K., Wettig,
T., Yamaguchi, A.: Hardware and software status of qcdoc. Nuclear Physics B - Proceedings
Supplements 129-130, 838 – 843 (2004), DOI: 10.1016/S0920-5632(03)02729-4, lattice 2003
4. Bunk, B., Sommer, R.: An 8 parameter representation of su(3) matrices and its applica-
tion for simulating lattice qcd. Computer Physics Communications 40(2), 229 – 232 (1986),
DOI: 10.1016/0010-4655(86)90111-6
5. Clark, M.A., Babich, R., Barros, K., Brower, R.C., Rebbi, C.: Solving Lattice QCD systems
of equations using mixed precision solvers on GPUs. Comput. Phys. Commun. 181, 1517–1528
(2010), DOI: 10.1016/j.cpc.2010.05.002
6. Gattringer, C., Lang, C.B.: Quantum chromodynamics on the lattice. Lect. Notes Phys. 788,
1–343 (2010), DOI: 10.1007/978-3-642-01850-3
7. Janson, T., Kebschull, U.: Highly Parallel Lattice QCD Wilson Dirac Operator with FPGAs.
Parallel Computing is Everywhere 32, 664 – 672, DOI: 10.3233/978-1-61499-843-3-664
8. Korcyl, G., Korcyl, P.: Towards Lattice Quantum Chromodynamics on FPGA devices (2018)
9. Strzodka, R., Goddeke, D.: Pipelined mixed precision algorithms on fpgas for fast and accurate
pde solvers from low precision components. In: Proceedings of the 14th Annual IEEE Symposium
on Field-Programmable Custom Computing Machines. pp. 259–270. FCCM ’06, IEEE Computer
Society, Washington, DC, USA (2006), DOI: 10.1109/FCCM.2006.57
