Simulation of the Lattice QCD and Technological Trends in Computation by Ibrahim, K. et al.
ar
X
iv
:0
80
8.
03
91
v3
  [
he
p-
lat
]  
13
 A
ug
 20
08
Simulation of the Lattie QCD and Tehnologial Trends in
Computation
K. Ibrahim
1
, J. Jaeger
2
, Z. Liu
6,7
, L.N. Pouhet
3,4
, P. Lesniki
3,4
, L. Djoudi
2
, D. Barthou
2,3
, F. Bodin
1
, C.
Eisenbeis
3
, G. Grosdidier
5
, O. Pène
7
, and P. Roudeau
5
1
Irisa/Inria, Rennes, {kibrahim, bodin}irisa.fr
2
Université de Versailles St Quentin
3
Alhemy team, Inria Salay
4
Laboratoire de Reherhe en Informatique, Université de Paris XI
5
Laboratoire de l'Aélérateur Linéaire, Université de Paris XI and IN2P3/CNRS
6
LPSC, Université Joseph Fourier Grenoble 1, IN2P3/CNRS, INPG, 53 Av. des Martyrs, 38026 Grenoble
7
Laboratoire de Physique Théorique
⋆⋆
Université de Paris XI, MPPI/CNRS
Abstrat. Simulation of Lattie QCD is a hallenging omputational problem. Currently, tehnologial
trends in omputation show multiple divergent models of omputation. We are witnessing homogeneous
multi-ore arhitetures, the use of aelerator on-hip or o-hip, in addition to the traditional arhi-
tetural models.
On the verge of this tehnologial abundane, assessing the performane trade-os of omputing nodes
based on these tehnologies is of ruial importane to many sienti omputing appliations.
In this study, we fous on assessing the eieny and the performane expeted for the Lattie QCD
problem on representative arhitetures and we projet the expeted improvement on these arhite-
tures and their impat on performane for Lattie QCD. We additionally try to pinpoint the limiting
fators for performane on these arhitetures.
1 Introdution
Quantum hromodynami (QCD) is the theory of strong interation in the domain of subnulear physis.
Lattie QCD (LQCD) is a numerial method based on QCD rst priniples, the only one able to ompute
reliably many quantities of high sienti relevane. It is based on a disretization of spae time and a Monte-
Carlo method. The system being an extremely omplex one and the number of degrees of freedom being of
the order of a billion today, a number promised to inrease in the future, LQCD needs heavy, eient and
heap enough omputing tools (hardware and software).
The goal of the alulation is to produe, aording to a given probability law resulting from the theory,
a wide statistial sample of gauge ongurations, eah of whih being a large le of omplex numbers.
Although using eient algorithms whih will be desribed in the next setion, it requires a very large
amount of omputing power.
Simulation of this theory is one of the grand hallenge problems in part beause of the small perentage
of the omputation load usually being observed on most omputing arhitetures. Assessing the performane
and eieny on new arhitetures, and based on dierent algorithmi representation of this problem, is
important to get loser to the omputational power needed for this problem. This omputation tends to
have low utilization and eieny on most general-purpose omputing failities, leading to ineient power
onsumption and unrealisti demands on the number of required omputational nodes. So far building a
speial mahine for simulating the Lattie QCD problem has been a widely used approah [1,2,3℄. The
motivation to build speialized omputing failities, despite all the assoiated overheads, is the enormous
omputational power needed in addition to the speial harateristis of the omputation of the Lattie
QCD.
⋆⋆
Unité Mixte de Reherhe 8627 du Centre National de la Reherhe Sientique
For this problem, an optimal omputing node should oer support for omplex arithmeti instrutions,
large register le, SIMD instrutions, software ontrolled ahe management and balaned memory band-
width and ommuniation bandwidth to the omputational power. Multiple suessful balaned designs were
historially build for Lattie QCD [1,2,3℄, but the overhead of design and maintenane is usually high. Build-
ing out of ommodity omponents that best t the problem harateristis is a very attrative alternative, but
it requires a areful analysis of the problem, together with the analysis of the large spetrum of arhitetural
alternatives.
Currently, the driving fores for omputer arhiteture push multiple tehnologies with no lear onver-
gene to performane/power overall winner. We intend to explore these tehnologies, guided by the om-
putational requirements of the Lattie QCD. As it is extremely diult to exhaustively experiment all the
emerging tehnologies, we hoose to fous on two groups of tehnologies:
 General-purpose homogeneous nodes: we desribe our eorts to assess the trade-os of implementing
the Lattie QCD omputation on dierent arhitetures. We mainly foused on Itanium and Pentium
arhitetures. These two arhitetures represent two major design alternatives in general purpose om-
puting. The EPIC arhiteture for Itanium proessor relies on the ompiler in managing instrution
parallelism, while the supersalar arhiteture for the Pentium proessor relies on hardware management
of instrution parallelism. Both arhitetures are deployed suessfully to build highly salable mahines.
 Heterogeneous omputing nodes: we investigate the use of speialized aelerators to improve the perfor-
mane of a omputing node. The rst alternative we explored is the use of Intel Xeon proessor assisted
by a G80 NVIDIA graphi ard as an aelerator, a heterogeneous multi-hip alternative. The seond
alternative we explored is based on integrating aelerators with the main proessor on hip, providing
a heterogeneous system-on-hip kind of arhitetures. A good representative of this arhiteture is the
IBM Cell broadband engine.
The objetive of this study is to ompare future tehnologies prospet for the simulation of Lattie QCD.
We do not seek a generalized omparative study of all future arhitetural trends; we target the omparison
based on the requirements for the simulation of the Lattie QCD omputations.
Our study reveals that the performane of the Lattie QCD omputation an be greatly improved using
speialized aelerators. More importantly, we predit that the imbalane of the omputational power to
ommuniation bandwidth for the Lattie QCD will remain an obstale for all the studied arhiteture.
Eient usage of the omputational power will rely heavily on the level of expliit resoure management
that a partiular hardware will oer.
The rest of this paper is organized as follows: Setion 2 introdues the Lattie QCD problem and its
physial formulation. Setion 3 introdues analysis of the performane of single node based on various ar-
hitetural alternatives. Setion 4 disusses the needed improvements in the performane of the disussed
arhitetures and ontrast it with their expeted or planned evolution. Setion 5 disusses the performane
impat of the ommuniation arhiteture for a large sale system. Setion 6 onludes this work.
2 The Problem of LQCD
In Lattie QCD, the four-dimensional spae-time ontinuum is simulated by a four-dimensional lattie, of
length respetively X,Y, Z, T in the four diretions, with quark quantum elds on eah lattie site and gluon
quantum elds represented by SU(3) matries on eah link between these sites. SU(3) refers to 3× 3 unitary
matries of omplex numbers of unit determinant. The 3-dimensional spae in whih these matries at is
referred to as the spae of the three quark olours". The spinors are represented by four SU(3) vetors,
eah omposed of three omplex variables. The alulation aims at omputing the average values of physial
quantities, whih are funtions of these elds, aording to a probability distribution also depending on the
elds, and derived by a disretization proedure from the basi QCD Lagrangian. This average is taken over
the full spae of all the possible values of the elds. This spae is known as the eld onguration spae.
The integration of quark elds is done formally, leading to a ompliated non-loal probability distribution:
the determinant of the very large Dira operator matrix, whih depends only on the gluon elds. The
2
probability law is desribed by a ompliated expression depending only on the gluon elds i.e. the SU(3)
matries. We all gauge onguration a set of SU(3) matrie matries dened on all links. The probability
law is thus dened in the spae of gauge ongurations.
For large latties the spae of gauge ongurations is a variety with dimensionality of the order of
billions. Only a Monte-Carlo method allows suh a huge alulation. To estimate the average values of the
physial quantities we need representative samples of gauge ongurations (say about 5000) for every set of
parameters, generated aording to the above-mentioned probability law. The Hybrid Monte-Carlo (HMC)
algorithm [4℄, or variants of it suh as the Polynomial HMC (PHMC), the Rational HMC (RHMC), is used
to generate these samples. This is a very heavy alulation. In the following disussion, we will onsider an
HMC implementation ahieved by the ETMC ollaboration [5,6℄.
The run is deomposed into trajetories, whih are indeed trajetories of a omplex dynamial system
depending on the SU(3) matries. Eah trajetory leads from one gauge onguration to the next one of our
sample. The Hamiltonian of this system is devised in suh a way as to generate gauge ongurations with
a large enough probability. At the end of the trajetory a Metropolis test ensures the orret probability
law. The trajetory is divided into steps. After every step the gauge onguration is updated. During the
step, the gauge onguration stays unhanged. The algorithm manipulates objets named Wilson spinors.
One Wilson spinor is omposed by a spinor (12 dimensional omplex vetor) on every lattie site. During
the step, whihever variant of the algorithm being used, there is an iteration of the multipliation of a large
Wilson spinor by the large matrix named Dira operator leading to an output Wilson spinor. This part
is linear algebra and it is the most time-onsuming part of the algorithm.
The multipliation of the Wilson spinor by the Dira Operator is mainly performed in the ETMC ode
by a routine named Hopping_Matrix, whih is ontributing about 90% of the total exeution time [7℄.
Wilson spinors as well as the gauge ongurations are very large arrays. As we shall see the major problem
to produe eient omputations is to ensure fast enough data transfer to and from the omputing units. It is
worth notiing that the stability of the gauge onguration, during very many iterations of the multipliation
of Wilson spinors by the Dira operator, allows a signiant redution of the data to be transferred if one
manages to keep the SU(3) matries in some kind of fast aess memory lose to the omputing units. This
is not easy in general beause the gauge matries onstitute rather large arrays.
The multipliation of the Wilson spinor by the Dira operator is expressed in formula (1): the ations
of the Dira operator involves a sum over quark spinors (ψ(i)) multiplied by a gluon eld (Uµ(i)) through
the spin projetor.
χ(i) =
∑
µ=x,y,z,t
{
κµUµ(i) (I − γµ)ψ(i + µˆ) + κ
∗
µ U
†
µ(i− µˆ) (I + γµ)ψ(i− µˆ)
}
(1)
where κx = κy = κz = κ and κt = κ exp ipi/T , κ is the "hopping parameter" and the phase exp ipi/T
expresses the anti-periodi boundary onditions in the time diretion. The gluon eld SU(3) matries are
labelled by their starting site and the spae-time diretion of the link on whih it is dened.
The ode we onsider ontains two variants of the algorithm. The rst one named full-spinor orresponds
to the diret appliation of equation (1), while the seond named half-spinor proesses via two phases whih
an be expressed by the following set of equations: First phase (K series):
φµ(i,+) = κµUµ(i) (I − γµ)ψ(i+ µˆ) φµ(i,−) = (I + γµ)ψ(i− µˆ) (2)
seond phase (L series):
χ(i) =
∑
µ=x,y,z,t
φµ(i,+) + κ
∗
µ U
†
µ(i− µˆ)φµ(i,−). (3)
The pros and ons of both variants depend on the arhiteture and will be disussed later.
In the next setions we will present studies on numerous arhitetures and lattie sizes. Our general goal
is the Petaop as justied in setion 5. Working on one node we have in mind dierent sublatties aording
to dierent possible deompositions of the full lattie. We also varied the lattie size in order to highlight
the role of the dierent arhitetural omponents (e.g., ahe size, et).
3
3 The Performane of a Single Computing Node
The performane of lattie QCD on a multiproessor mahine relies heavily on the performane of the indi-
vidual omputing nodes. In this setion, we will start by outlining the arhiteturally independent attributes
of the Hopping_Matrix routine that will interat with the arhitetures under investigation. We will then
disuss the performane based on these individual nodes in separate setions.
Inputs
Spinors,  Gauge links, U
Compute an 
output spinor
1608 FP ops
(Independent 
iterations)
P
a
ra
ll
e
l 
lo
o
p
A) Full-spinor 
Computation
Compute 
½ spinor
768 FP ops
(Independent 
iterations)P
a
ra
ll
e
l 
lo
o
p
B) Half-spinor 
computation
P
a
ra
ll
e
l 
lo
o
p Complete  
spinor compute
840 FP ops
(Independent 
iterations)
Spinors, !
Compute
T-direction 
P
a
ra
ll
e
l 
lo
o
p
Compute
 X-direction 
Compute
Y-direction
Compute
Z-direction 
C) Decomposed Full-spinor computation
½ Compute 
T-direction 
P
a
ra
ll
e
l 
lo
o
p
½ Compute 
X-direction
½ Compute 
Y-direction
½ Compute 
Z-direction 
D) Decomposed Half-spinor computation
½ Compute 
T-direction 
½ Compute 
X-direction
½ Compute 
Y-direction
½ Compute 
Z-direction 
P
a
ra
ll
e
l 
lo
o
p
Outputs
Intermediate,
Intermediate,
Fig. 1. Computation shemes of the Hopping_Matrix routine based on the full-spinor and the half-spinor
versions.
In the Hopping_Matrix routine, the omputation of the spinor involves 1608
8
oating-point operations
per lattie site touhing 360 oating-point variables. We fous on the two implementations already mentioned,
the full-spinor and the half-spinor ones. The full-spinor version pulls all the data needed to ompute an
output spinor from all surrounding sites. These data inlude the gauge eld links and the spinors. In eah
all of the Hopping_Matrix routine, the gauge links show non-redundant regular aess, while reads of the
surrounding spinors usually arry redundany and irregularity of aess beause eah input spinor appears
in the omputation of eight dierent output spinors. To solve this problem in aessing data, the half spinor
version arries the omputation in two phases. In the rst phase, eah input spinor is visited one and the
omputations related to all the surrounding spinors are pushed to the surrounding spinors in intermediate
half-spinor strutures. Writing of the output half-spinors is aligned to optimize the aess pattern in the
seond phase. In the seond phase, the results of the rst phase are used to ompute the output spinors. The
aess of the half-spinors intermediate struture is more regular. The advantage of the half-spinor version is
that irregular pattern of aess is assoiated with the writes of the rst phase and not with data reads. In
most general-purpose arhitetures, memory reads are more ritial to performane. On the other hand, the
aessed data are inreased by about 7% for the half-spinor version ompared with the full-spinor version.
Figure 1 shows four ode variants of the ode explored in this study. The omputation an be deomposed
into two phases of omputation, in the half-spinor version. Additionally, the omputation an be further
8
This number is larger than the ommonly quoted 1320 ops per site, the dierene being due to the multipliation
by the fator κµ in eq. (1).
4
deomposed based on the number of spae diretions. Figure 2 depits the two main phases of the half-
spinor omputation that allow friendlier ahe behavior on proessors with ahe hierarhy.
 !"#$"!%&
'("&)*+'#$,-.*+!&/#*0&1&"2"+&!$'"!&3,4-"5&
'*&*66&!$'"!&3#"65&$!&)*07-'"6&8+6&
!'*#"6&$+'*&(84/9!7$+*#&68'8&
:!"#$"!%&
'("&)*+'#$,-.*+!&!'*#"6&
$+'*&1&(84/9!7$+*#!&
8#"&#"-!"6&
8+6&0"#;"6&
'*&,-$46&
<+84&!7$+*#&68'8&&
Fig. 2. Hopping_Matrix is splitted into two phases, Kseries (phase one), then Lseries (phase two) with
almost balaned ode, data, CPU time.
For proessors with normal ahes, e.g. Itanium and Pentium, we will fous on the half-spinor version
beause of its performane advantage, for omputing nodes using aelerators we will explore both tehniques.
The most dominant attribute of Hopping_Matrix omputation that aets performane is the low ompu-
tation to memory aess ratio, as shown in Figure 3. This analysis assumes no temporal loality in inter-alls
to the Hopping_Matrix routine. This assumption is valid taking into aount the large footprint assoiated
with reasonable lattie size, in addition to the alternations in the omputation between multiple input data
on onseutive alls to the routine.
This omputation to memory aess ratio does not exeed 1.05 double preision oating point operations
per byte. This ratio is usually as low as 0.56 FP/byte if the lattie data is not ahed. In ontrast, reuse
of data is related to the data size for dense matrix-matrix multipliation, though it is partly exploited
using bloking due to limited ahe sizes. Cahing the lattie data is diult to ahieve beause we tend
to hoose bigger lattie to mitigate the ost of ommuniation between nodes. The Lattie QCD problem
requires dividing the lattie among many ooperating nodes, whih need to ommuniate results. To overome
the disparity between the ommuniation bandwidth and the omputational power of the proessing node,
we tend to inrease the sublattie size per node. The omputation grows linearly with volume while the
ommuniation grows linearly with the surfae. Figure 3 shows that improving (inreasing) the omputation
to ommuniation ratio linearly requires exponential growth in the required memory spae.
Another worth noting attribute is that the gauge eld onstitutes about 75% of the data aessed (stati
memory footprint) ompared with 12.5% for input spinors. At runtime for the full-spinor version, the input
spinors vetor represents 55% of the dynamially aesses data with the least regular aess pattern beause
of the spin-projetion operator. For the half-spinor version, most of the aessed data at runtime belongs to
the intermediate data strutures arrying the half-spinor data.
In the following subsetions, we present our study rst on the use of homogenous omputing nodes
based on Pentium and Itanium proessors, then, on the use of heterogeneous nodes where a general purpose
proessor is assisted by a speial aelerator to speedup oating point omputations. We speially present
our study on the use of Nvidia GPU assisting an Intel Xeon proessor and the use of the IBM Cell BE.
3.1 Baseline Code - Pentium4
In order to have a base of omparison, the performane of the HMC/ETMC ode is measured on an Intel
Xeon Presott proessor at 3.2GHz with 16KB L1 and 1MB L2 (using one ore).
5
Fig. 3. Summary of the attributes of the Hopping_Matrix routine in terms of memory requirements, density
of omputation to aess and aess pattern.
The HMC/ETMC ode omes with an optimized version for SSE SIMD instrutions. These are vetor
instrutions able to perform 2 double preision operations in one single instrution issue. As a matter of
fat the use of SIMD SSE Pentium instrutions is expliit in the ode, by the use of dediated intrinsis. It
basially addresses the omputations on the spinor vetors (omplex vetors of size 3).
Performane have been measured on two input data sets. The rst one is a lattie of size 44. The seond
one is a lattie of size 163 × 32. Performane results are desribed in Figure 4 for two versions, with and
without SSE. First, based on the number of oating point operations at eah lattie site (see Setion 3) -
1608 operations per site -, the lok yle ounter and the frequeny of the Pentium (3.2GHz), we found
that the speed of the original ode is about 2.3 GFlops for the 44 lattie and 1.5 GFlops for the 163 × 32
lattie when using the SSE instrutions. This means that the original ode Pentium version is already highly
optimized (peak performane is 6.4 GFlops in double preision). The better performane for the smaller
lattie is probably due to data reuse that annot be exploited so well with the large size.
 0
 0.5
 1
 1.5
 2
 2.5
 3
 3.5
4
4
16
3
x32
G
F
lo
p
s
Sublattice sizes
no SSE with SSE
Fig. 4. Performane in GFlops of Hopping_Matrix
on 3.2GHz Pentium4 Presott, with and without
SSE and for dierent lattie sizes.
 0
 0.5
 1
 1.5
 2
 2.5
 3
 3.5
4
4
8
3
x16 16
3
x32
G
F
lo
p
s
Sublattice sizes
Original code Tiled version
Fig. 5. Performane in GFlops of original Hop-
ping_Matrix and tiled version on 1.6GHz Montvale
Itanium2 for dierent lattie sizes.
3.2 Itanium Arhiteture
We evaluate performane of the original HMC ode on an Intel Itanium Montvale proessor at 1.67 GHz,
with 256KB L2 and 12MB L3 (on a single ore).
6
In the original version, Hopping_Matrix runs one loop for eah of the two half-spinor omputation, one
loop for all diretions on the odd sites, then one loop for all diretions on the even sites (Figure 1.B). The
original ode suers from two main deienies on Itanium arhiteture:
 Analysis of the ompiler generated assembly ode [8℄ shows that the ompiler has diulty optimizing
the whole basi blok of the loop. Too many instrutions and a too high register pressure prevent the
ompiler for instane to software pipeline the loop. This has a high impat on Itanium arhiteture.
 While some data (in partiular the gauge elds) are reused through the omputation, the size of the
volume prevents data from staying in ahe between two uses.
Note that there is no SIMD ode for Itanium, unlike for Pentium, the ode onsidered is plain C ode.
This leads to two transformations: eah loop of the two phases is tiled so that data within a tile stays in
ahe, and the tiled loop is split for all diretions of omputation in order to enhane the quality of ompiled
ode. Figure 1.D shows the struture of the resulting ode: eah of the two parallel loops are tiled and the
half-omputations orresponding to eah diretion within eah of these loops are exeuted sequentially.
Figure 5, on the left, shows performane of the two versions w.r.t. lattie sizes. For a small lattie of
size 44, all the data t in L2 ahe, tiling only introdues overhead and performane reahes 3.2 GFlops
(peak performane is at 6.4 GFlops in double preision). For a medium size lattie of size 83 ∗ 16, the data
is still in L3 but no longer in L2. There is a light performane improvement of the tiled version, that nearly
reahes 3 GFlops. Finally, when the lattie is too large even for L3 ahe, the tiled version outperforms the
original ode by 60%. The eet of tiling redues the impat of L3 ahe misses but as the reuse fator is low
(between 2 and 3) and does not grow with the lattie volume, memory aesses still drive the performane
for large enough latties.
The overall performane gain for the whole HMC ode reahes 40% speed up for a 163∗32 lattie. The best
tile size for the arhiteture has 128 iterations and orresponds to a maximum usage of the ahe hierarhy.
3.3 Computing Node Based on CPU Assisted by a GPU Aelerator
The use of GPUs for sienti omputing is urrently under investigation by many researh groups and gives
very promising results for many sienti appliations [9℄. We investigated the use of NVIDIA G80 GTX
graphi ard GPU board hosted on a server of quad-proessor Intel Xeon proessor at 1.86GHz, 4MB L2
ahe. The NVIDIA G80 GTX GPU is omposed of 16 multiproessors that are interonneted to banked
DRAMs through an impressive bandwidth of 78.6 GB/s. Figure 6 depits the layout of onneting a GPU
aelerator to a CPU through a PCI Express bus.
Fig. 6. Computing node based on a CPU assisted by a GPU aelerator.
The parallelization proess for GPUs is traditionally done based on vendor ompiler. Expressing problem
is failitated with the advent of general-purpose programming tehnology, suh as CUDA [10℄ by NVIDIA [11℄.
7
Even though parallelization is done through the ompiler, the programmer arries the responsibility of
transforming the ode in a way that enables eient parallelism. A must-do transformation is to remove
ontrol-ow instrutions whenever possible. For ontrol-ow variables with limited outomes, lookup tables
an be used or redundant omputation in onjuntion with masking to introdue ontrol-ow free ode.
These tehniques an prove eetive in assisting the generation of SIMD operations.
In our implementation, we explored the eet of work granularity on performane. The granularity one
thread impats on the resoures alloated to this thread, in partiular onerning the number of registers
(8192 for NVIDIA 8800 GTX). Apparently, the Cuda ompiler tries not to redue the amount of parallelism
below 64 threads, assigning at most 128 physial registers per thread. Among alternatives explored, we tried
the half-spinor version and the full spinor version. For both versions, either eah thread arries the responsi-
bility of the whole omputation of an output spinor (oarse-grained implementation) or the omputation is
divided among 16 threads of omputation, based on the number of the dimension of the spae. Eah thread
iterates through multiple sites of the output spinor array. The oarser the thread omputation the more the
stress on the resoures beause more resoures are needed to redue the pressure on memory. Given that the
memory aess lateny on GPU in the range of 400-600 yles, it is neessary to redue the memory aess
frequeny, espeially sine the ahing within the GPU is severely limited in size.
For Hopping_Matrix omputation, we notied the less the granularity of the work assigned to the thread
the better the performane ahieved. Figure 7 shows the eet of granularity hoie on performane. Number
of threads per multiproessor is set to 64 (higher number fail to launh for oarse-grained tasks beause of
the exessive resoures required). We experimented the two versions of the omputation half-spinor and
full-spinor, disussed in Setion 3.
For oarse-grained versions, even with the inreased memory pressure, the half-spinor version (two threads
per spinor omputation) provides a better performane ompared with the full-spinor (one thread per spinor
omputation). When the spinor omputation are split among 16 threads (of ne-granularity) for both the
half-spinor and the full-spinor tehniques, the full-spinor version beome better than the half-spinor version
beause the former has less frequeny of aessing the memory.
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Full-spinor Half-spinor Fine-grained
full-spinor
Fine-grained
half-spinor
N
o
rm
a
li
z
e
d
 E
x
e
c
u
ti
o
n
 T
im
e
GPU compute Memory exchange
Fig. 7. The eet of granularity on performane.
Number of threads per multiproessor is xed to
64 threads. Fine-grained versions use 16 threads to
arry out the omputation of a spinor.
 0
 1
 2
 3
 4
 5
 6
 7
 8
8
4
16
4
32
4
32
4
 (Multistep)
G
F
lo
p
s
 (
s
in
g
le
 p
re
c
is
io
n
)
Sublattice sizes
GPU only GPU with host comm.
Fig. 8. Performane saling (in single preision) of
the omputation for multiple sublattie sizes.
The bandwidth of the host CPU memory to the GPU memory has ritial impat on performane.
Although, the gauge eld is onstant aross iterations (need not to be exhanged), the input and the output
spinors onstitute 25% of the data aessed in the omputation. Moving these spinors data bak and forth
between the GPU and the host proessor is of signiant ost on performane. The low density of omputation
8
ompared with exhanged data auses the ommuniation overhead between the GPU and CPU to mount
to 40% of the total exeution time even for a large lattie size of 323 × 32.
To improve the ratio of oating point operations to data exhanged, we alloated some of the arrays that
are used to hold the intermediate spinor omputation to the GPU memory. Using this tehnique, we redued
the data exhange frequeny to one fourth and the ontribution of data ommuniation to total exeution
time is lowered to 11%. This tehnique requires ommuniating the omputed spinors less frequently in a
multi-node implementation.
Figure 8 shows that inreasing the sublattie size improves the performane up to a ertain extent. Using
intermediate spinor arrays to do multiple step of the Hopping_Matrix redues the spinor memory exhange
overhead between the CPU memory and the GPU memory. The performane ahieved is about 6.2 Gops
in single preision. The eieny of Lattie QCD omputation on GPUs is in the range of 3-4% of the peak
performane beause of the low reuse of data and the omplexity of the data aess pattern that inreases
oniting aesses of the GPU memory banks.
3.4 Computing Node Based on the Cell BE
On the Cell BE, we explored again both the half-spinor and the full-spinor implementations of the Wilson-
Dira operator. Two data layouts are onsidered: small latties that an t in the loal store and large latties
that are stored in the main memory and DMAed to the loal store in hunks.
Cell broadband engine (BE) is a unique arhiteture in integrating speialized aelerator proessors,
alled synergeti proessing element SPE, to the main PowerPC based proessor. Eah SPE has a limited
memory, alled loal store, large register le of 128 16-byte registers and a speialized SIMD proessing
element. The hip integrates XDR memory ontroller in addition to FlexIO ontroller. This integration leads
to a bandwidth to the memory up to 25.6 GB/s. Figure 9 outlines the main omponents of the Cell BE.
The Cell BE is known for being diult to program partly beause of the detailed ontrol it gives to
the programmer over memory management of the dierent address spaes of SPEs and the main memory.
Speial DMA alls are usually required to ontrol data transfers. Transforming ode to perform eiently in
SIMD mode is an additional traditional obstale to exploit SPE proessors. Relying on ompiler is an option
that is yet to mature for this kind of arhiteture. The limited loal store size and its separate address spae
add an additional dimension to the omplexity of aessing the data. As the data assigned to a omputing
node will not t in the loal store, subset of the data needs to be brought to the loal store for proessing.
Fig. 9. Computing Node based on aelerator on hip, represented by IBM Cell BE.
Two options exist in bringing data: The rst divides the sublattie assigned to the Cell BE into further
smaller sublatties with the possible data exhanges between SPEs. The seond option divides the ompu-
tation into frames of data, where SPEs do not need to ommuniate data. The rst alternative, whih will
9
be presented with the half-spinor implementation in this work, have the potential of reduing the pressure
on the memory bandwidth. On the other hand, it requires frequent ommuniation between the SPEs and
more synhronization of points. The seond alternative, whih will be presented with the full-spinor imple-
mentation, requires less synhronization and data exhange between SPEs, but may suer from some lost
opportunities in reusing data aessed by the neighboring SPEs. The little reuse of the data in the Lattie
QCD omputation enourages making this trade-o.
Implementation based on Half-spinor version All SU3 objets (vetors, matries) have been trans-
formed into 4-way omplex DP vetors and 4x4 omplex matries to allow for an easy SIMDization using
SPU intrinsis. This is ostly in additional ops (2656 instead of 1608) but allows, beyond diret measure-
ments, a simple grasp on dierent senarios depending on the size of the loal store. It is assumed that 4K
sites are loated on eah SPU and 128K sites on eah CELL. Double buering is always used aross these
senarios. They are :
S1 : very small urrent loal store size, all of Wilson spinor, gauge matrix and half spinors have to be moved
in or out to/from main memory for eah site between both phases ;
S2 : half spinor will be kept in the loal store memory (or very lose to it) ;
S3 : the gauge matrix will be kept into LS or around ;
S4 : both the half spinor and gauge matrix an be alloated into loal store (the 'Golden Cell').
The outome is that senario S1 demands a bandwith value well above the available loal store to main
memory bandwidth (3.2 GB/se/spu), leading to a degraded SPU performane (1.8 GFlop/se/spu instead
of 2.4 for other senarios). Senario S3 is very interesting, even if it does require an extra eort about loal
store size inrease : it is worth reminding that the Gauge Matries remain onstant over many alls of the
Hopping_Matrix routine.
Fig. 10. Site data ows inside of Hopping_Matrix within the 2 phases. The input (Kseries) and output
(Lseries) spinor indexes are dierent (dierent sites), hene the relevant Gauge Matries (site dependent)
are also dierent.
Implementation based on the full-spinor version SIMDizing the ode requires aligning the data in a
way that an be aessed with the least number of data shues. Eah spinor is aessed in eight dierent
ontexts (due to the spin projetion operator in Equation (1) depending on the spae diretion. Eah aess
involves dierent operations and memory aess pattern for the real and imaginary part of every omplex
variable. Unfortunately, SPEs do not support omplex arithmeti instrution set. Dynami memory aesses
of the input spinors onstitute 55% of the data aessed as shown in Figure 3, while it represents only 12.5%
10
of the stati data aessed. Unfortunately, we annot fuse these data statially beause the same spinor is
aessed in eight ontexts with dierent surrounding spinors in eah ase.
To overome this diulty, we devise a tehnique, alled runtime fusion, that fuses the input data used
for the omputation of multiple onseutive spinors. The real parts of these input data are fused into single
register, and similarly for the imaginary part. For instane a 16 byte register requires fusing the omputation
of two output spinors of double preision or four single preision output spinors. Figure 11 shows the layout
of the runtime data fusion tehnique. Runtime fusion merges the omputation of unrolled loop, thus grouping
the data of similar aess pattern into 16 byte words. The result of the omputation is then sattered bak
into multiple spinors results. Cell BE allows suh tehnique beause of the large register le. Almost 6 KB
of data are touhed during the omputation of a group of two output spinors in double preision.
Fig. 11. Runtime data fusion tehnique for the full-
spinor version on Cell BE.
Fig. 12. Performane vs. used SPE for Hop-
ping_Matrix in single and double preision.
This tehnique leads to performing the Hopping_Matrix routine with 80 Gops of single preision om-
putation and 8.7 Gops of double preision omputation. Double preision is not optimized in the urrent
generation of Cell BE, but PowerXCell 8i with eDP arries an optimized engine for the double preision that
is apable of 50 Gops for the Hopping_Matrix routine.
Realisti lattie size needs to be stored in the main memory and be retrieved in piees for proessing.
The omputational power for single preision Hopping_Matrix would require 48 GB/s of the memory, far
beyond the 25.6 GB/s bandwidth available. The input spinor is redundantly aessed 8 times during the
omputation of the input spinor array. Bandwidth an be saved, if non-redundant data are brought to the
loal store memory from the external memory, then the redundant part is onstruted inside the SPE loal
store memory. The saving in bandwidth an be 25% for a sublattie size of 163 × 16.
We exploited above attributes, in the pattern of spinor aess, to ahieve omputation performane for
the Hopping_Matrix of 31.2 Gops for single preision and 8.6 Gops for double preision. Figure 12 shows
the performane ahieved while hanging the number of the SPE used. For single preision omputation,
four SPE are able to deliver the maximum the hip an aord. In fat the performane will slow down by
5% if all the SPEs are used. The demand of bandwidth of these 4 SPEs mount to 23.5 GB/s, i.e., almost
saturating the bandwidth to the external memory.
The same behavior is expeted for the double preision with the new PowerXCell 8i with enhaned double
preision.
4 Antiipated Future Evolutions and Comparisons
In this setion, we will try to disuss the expeted performane evolution for the studied arhitetures in
the future, for both homogeneous general-purpose ore and based on aelerators. Our study shows that the
11
use of aelerators an greatly help to boost the omputational performane of the main kernel routine for
Lattie QCD. We will try to disuss the most important riterion that will inuene the hoie between the
studied aelerator arhitetures, for Lattie QCD.
Expeted advanes for Pentium/Itanium For the end of the year 2008, the next generation of Itanium
arhiteture proessor, Tukwila, and Xeon proessor are expeted to integrate a new memory ontroller,
named Common System Interfae. This ontroller will oer fast point-to-point proessor ommuniation and
will have a peak inter-proessor bandwidth of (up to) 96 GB/s and a peak memory bandwidth of 34 GB/s
(rst proessor are expeted to have only a bandwidth of around 24 GB/s). This would then be omparable
to the urrent memory bandwidth of Cell BE and would improve performane for out-of-ahe latties.
The best performane obtained for Hopping_Matrix is when all the lattie ts in ahe (L2 and L3). For
Monvale proessor, this orresponds to latties up to the size of 83 × 16. Tukwila is planned to have a 30MB
shared L3, for 4 ores. Without any hange in the miro-arhiteture, a sustained 3 GFlops/ore would then
be obtained for latties of 83 × 32. Any inrease in the future of ahe sizes would help to maintain a high
level of performane.
Experimental results for a whole multi-ore proessor, taking advantage of multi-ore interations, are
still to be obtained. Eieny for Pentium/Itanium ode on one ore is as high as 50% for smaller latties
(only onsidering Hopping_Matrix) but for the whole ode, it is 18%. The eieny on a multiore node of
BlueGene/P is by omparison of 16%, i.e. a sustained Gops performane 2.2 Gops/node (4 ore/node).
Future prospets of the Cell BE The double preision omputation is improved on the new generation
Cell EDP engine (PowerXCell 8i). Simulation experiments show that the Cell EDP is expeted to deliver 16
Gops of double preision omputation. Three to four SPE will also be able to saturate the bandwidth for
double preision beause no improvement to the bandwidth to the memory is introdued.
An inrease in the loal store size an redue the pressure on the bandwidth by improving reuse of
the data brought to the SPE. The unused SPEs an be turned o thus saving power. The performane
of Lattie QCD odes on the Cell BE would improve if the bandwidth to the memory is improved in the
future generations of the Cell. The kernel routine implementation an saturate up to double the bandwidth
for single preision omputation on urrent generation Cell BE. For double preision implementation on
Cell EDP, the Hopping_matrix routine an saturate more than triple the urrent memory bandwidth (89
GB/s are needed to observe 50 Gops of double preision omputation on Cell EDP). If at one point of
time these bandwidths are ahieved, then omplex arithmeti instrutions would be needed to ahieve more
performane.
Expeted advanes on the GPU So far, most GPUs lak eient support for double preision ompu-
tation. This is to be retied in the near future. Exeption handling for oating point is also not supported.
The bandwidth of data exhange between the GPU and the memory is in the verge of doubling. Beause
of the dependeny of performane on this sare bandwidth, we do not expet that having multiple GPU
onneted to the same CPU northbridge will be an eetive solution. The eieny of Lattie QCD ompu-
tation on GPUs is in the range of 3-4% of the peak performane beause of the low reuse of data and the
omplexity of the data aess pattern that inreases oniting aesses of the GPU memory banks. These
issues may require further investigations for better data alignment.
Reliability of the results obtained by GPUs is a major onern. GPUs historially served graphi appli-
ations that require high performane but also an tolerate some errors at runtime. Certainly, for sienti
omputing this unreliability is diult to retify at the algorithmi level. Software solution to unreliability
usually results in loss of performane.
Cell vs. GPU performane omparison Among the fators dominating the performane that an be
ahieved by any omputing node for Lattie QCD are the bandwidths to the memory system, and the pro-
grammability of the omputing node. The best bandwidth observed is urrently assoiated with integrating
the memory ontroller on the die with the miroproessor. The low omputation to memory aess ratio
12
makes the performane heavily reliant on the memory bandwidth, espeially for miroproessor ores with
SIMD instrution set. The urrent bandwidth to memory winner is the Cell BE; that is why it delivers
promising performane numbers.
The GPU performane is bounded by the low ratio of omputation to data transfer: a large volume of
ommuniated data has to pass through the bounded bandwidth between the host memory and the GPU
memory. Another hallenge is that the irregular pattern of aessing spinors annot be handled eiently
when the job of SIMDization is handed to the ompiler. The ompute kernel performane for Lattie QCD
usually relies on hand-oded optimizations to ahieve the most out of the experimented arhiteture. Ex-
pressing the problem in a way that allows eient ompiler SIMDization requires more study.
The low eieny of omputation on GPUs makes the Watt/Gops ratio as high as 28. In the PowerXCell
8i, Lattie QCD requires 3 Watt/Gops for single preision and about 6 Watt/Gops for double preision,
assuming none of the SPEs is turned o.
Expeted performane evolution and the Lattie QCD problem The performane of a single node
an inrease in the future generation arhiteture beause of the hane of having higher integration on a
single hip. For instane, the future generation Cell BE is expeted to have more SPEs per Cell hip and
more multiproessor on the GPUs, and more ores per hip for multi-ore systems.
Our study leads us to believe that the eieny of utilizing the omputational resoures on any of these
future arhitetures will ontinue to be sub-optimal. The Lattie QCD reuse of data is less than average
appliations that most manufaturers balane their design for.
Using/designing a omputing faility based on ommodity omputing omponents an be used with
Lattie QCD given that enough resoure management is expliitly allowed. Expliit management an allow
using resoures based on the balane needed for Lattie QCD, for instane by swith-o omputing resoures
not used beause of the memory bandwidth bottlenek.
Balaning the resoures for a omputing node, bandwidth to memory, and ommuniation between nodes,
an be ahieved based on resoure management rather than speial system designs.
Currently, for Lattie QCD, our study shows that we annot ahieve less than 3-6 Watt/Gops, meaning
multi mega watts for Petaops apable mahine. The needed performane for Lattie QCD requires general
tehnologial improvement in performane and power onsumption as well as to failitate miro-tuning to
inrease the eieny of handling the speis assoiated with the Lattie QCD omputation.
5 Multi-node Systems
The goal of lattie QCD in the oming years is to ompute real QCD i.e. with light quarks possessing the
mass they have in nature. This means typially a pion twie lighter than usual present omputations whih
implies a length twie larger in physial units. To inrease the auray of the ontinuum limit and to allow
alulations with heavy quarks a typial redution of the lattie spaing by a fator 2 will be welome. This
leads to a multipliation by 4 of the lengths in lattie units, i.e. a saling fator of 256. Starting from a
lattie of 323 × 64 we end up with a 1283 × 256. This is of ourse only a rough estimate. We need to gain
more than two orders of magnitude whih amounts indeed to a Petaops sustained performane. The present
state of the art, on Bluegene/P, with the baseline ode studied in this paper, reahes about 2.2 Gops per
quadri-ore node, i.e. 22 Teraops for ten raks (10000 nodes).
5.1 The Eet of Communiation Arhiteture on Performane
Simulating Lattie QCD with physially meaningful size requires the use of a large number of omputing
nodes. For instane simulating a 1283 × 256 lattie requires 8192 nodes eah solving a 163 × 16 sublattie.
The ommuniation between the 8192 nodes is of ritial importane to the performane, espeially when
the omputing node performane is improved signiantly. Many mahines built for Lattie QCD used 3D
torus network for onneting the omputing nodes [1,3℄. A urrent projet for QCD speialized mahine,
QPACE [12℄, ontinues adopting this network topology with omputing nodes based on the PowerXCell 8i.
13
The ommuniation of the Hopping_Matrix follows the nearest-neighbors ommuniation pattern. With
the large volume of ontiguous data ommuniated, this ommuniation pattern relies mostly on the link
bandwidth to determine the ommuniation lateny. Assuming a simple model for ommuniation lateny
given by the equation communication latency = setup time + data size/bandwidth.Then, the ommuniation
lateny an be omputed easily ompared with the omputation time. The ommuniation lateny depends
on the bandwidth, as a large setup time of 1 µs will ontribute less than 1% of the ommuniation lateny. In
Figure 13, we present the ommuniation as a perentage of the omputation time for the Hopping_Matrix
routine. We did the omputation assuming multiple performane estimates for the omputing nodes range
between 1Gops to 16Gops. For simpliity, we assumed that the omputation power will not vary greatly
with the set of sublattie volumes experimented (arrying 8K to 1M spinors). The sustained link bandwidth
is 250MB/s per link, whih is about the expeted sustained bandwidth from Blue gene/P interonnetion
network (at 55% of the peak bandwidth).
We have three sublattie volumes eah with two strutures. For instane, the sublattie 43 × 128 is of
the same volume as 83 × 8, similarly for sublatties 83 × 128 and 163 × 16, and sublatties 163 × 256 and
323 × 32. The omputation to ommuniation ratio is proportional to the volume to surfae ratio. The
equal edge sublatties are favored by the bigger omputation to ommuniation, but would require a 4D
interonnetion network (16 unidiretional links of 250 MB/s sustained). Sublatties with dierent link size
are what we usually have to embed the four-dimensional lattie into nodes interonneted with 3D topology.
Assuming 3D interonnetion network for a sublattie 43×128, Figure 13 shows that having a node of 16
Gops will lead to a ommuniation that is 1.9 times the ompute time. Inreasing the sublattie volume is
one solution that leads to inrease the requirement of the memory substantially as shown earlier in Figure 3.
For instane to derease the ommuniation to 50% of the ompute time, we need to inrease the sublattie
to 163 × 256 (requiring to aess to 805 MB in one Hopping_Matrix all). We pratially try to math the
physial memory to the data aessed in a massively parallel mahine beause having virtual memory muh
larger than the physial memory is penalized by the expensive IO aess, espeially for an appliation like
QCD that streams the data from the memory most of the time, with little reuse.
Most of the high performane node, like Cell BE and Power6, embed a memory ontrolled on the hip and
an be onneted to a limited physial memory (usually in the range 0.5 to 2 GBytes). The ommuniation an
be ut to half if we adopt 4-dimensional interonnetion network, assuming preserving the link bandwidth,
similar to that of the QCDSP [2,13℄, requiring 16 unidiretional links per omputing node.
Fig. 13. Communiation as a perentage of the ompute time for dierent sublattie shapes and dierent
omputation power of nodes. Sustained link bandwidth is assumed to be 250MB/s per diretion.
14
6 Conlusion
In this study, we presented the attributes haraterizing the main kernel routine for the Lattie QCD om-
putation. We additionally studied optimizations and ode transformations needed for Lattie QCD on a
representative set of arhitetures inluding general-purpose proessors, like Itanium, and the use of om-
modity oating-point aelerators, suh as GPUs and the Cell BE.
Most of the optimizations presented in this work target better use of memory bandwidth, friendlier ahe
behavior and eient use of vetor instrutions, espeially on aelerators. The performane ranges varied
widely, but the use of aelerators provided an appealing potential espeially with the Cell BE. There is also
a promising potential with GPU aelerators if the above mentioned improvements are introdued. Neither
should one underestimate the potentiality of homogeneous multi-ore arhitetures with more ores and large
ahes. The prospets are open and the foreseeable evolution will be very fast.
The omputation to memory aess ratio for the Lattie QCD omputation is lower than what is aorded
by all the studied arhitetures and this trend is expeted to ontinue in the future. Arhitetures with expliit
resoure management an allow more eient use of the resoures.
We show that the inreased performane of omputing nodes will inrease the need for having higher
performane interonnetion network whih was traditionally easily ahievable for the Lattie QCD omputa-
tion, but will be the ritial issue in the future. Algorithmi improvements suh as domain deomposition [14℄,
whih inrease the omputation to remote data aess ratio, will be welome.
Aknowledgements
We would like to aknowledge the support of the Agene Nationale de la Reherhe: the projet ANR-
05-CICG-001 named PARA, and the projet NT05-3 43577 named QCDNEXT. We also aknowledge the
Barelona Superomputing Center for allowing us aess to their Cell lusters.
Referenes
1. Boyle, P., Chen, D., Christ, N., Cristian, C., Dong, Z., Gara, A., Joó, B., Kim, C., Levkova, L., Liao, X., Liu, G.,
Mawhinney, R., Ohta, S., Wettig, T., Yamaguhi, A.: Status of the QCDOC projet. arXiv:hep-lat/0110124v1
(2001)
2. Boyle, P.A., Chen, D., Christ, N.H., Clark, M.A., Cohen, S.D., Cristian, C., Dong, Z., Gara, A., Joo, B., Jung, C.,
Kim, C., Levkova, L.A., Liao, X., Mawhinney, R.D., Ohta, S., Petrov, K., Wettig, T., Yamaguhi, A.: Overview
of the QCDSP and QCDOC Computers. IBM Journal of Researh and Development 49(2-3) (2005) 351366
3. Belletti, F., Shifano, S.F., Tripiione, R., Bodin, F., Bouaud, P., Miheli, J., Pène, O., Cabibbo, N., de Lua, S.,
Lonardo, A., Rossetti, D., Viini, P., Lukyanov, M., Morin, L., Pashedag, N., Simma, H., Morenas, V., Pleiter,
D., Rapuano, F.: Computing for LQCD: apeNEXT. Computing in Siene and Engineering 8(1) (2006) 1829
4. Duane, S., Kennedy, A.D., Pendleton, B.J., Roweth, D.: Hybrid Monte Carlo. Phys. Lett. B195 (1987) 216222
5. Urbah, C., Jansen, K., Shindler, A., Wenger, U.: HMC Algorithm with Multiple Time Sale Integration and
Mass Preonditioning. Computer Physis Communiations 174 (2006) 87
6. Urbah, C.: Lattie QCD with Two Light Wilson Quarks and Maximal Twist. The XXV International Symposium
on Lattie Field Theory (2007)
7. Vranas, P., Blumrih, M.A., Chen, D., Gara, A., Giampapa, M.E., Heidelberger, P., Salapura, V., Sexton, J.C.,
Soltz, R., Bhanot, G.: Massively Parallel Quantum Chromodynamis. IBM journal of researh and development
52(1/2) (2008)
8. Djoudi, L., Barthou, D., Carribault, P., Lemuet, C., Aquaviva, J.T., Jalby, W.: Exploring appliation perfor-
mane: a new tool for a stati/dynami approah. In: Los Alamos Computer Siene Institute Symp., Santa Fe,
NM (Otober 2005)
9. Egri, G.I., Fodor, Z., Hoelbling, C., Katz, S.D., Nogradi, D., Szabo, K.K.: Lattie QCD as a Video Game.
arXiv:hep-lat/0611022v2 (2007)
10. NVIDIA Cuda 1.1: http://developer.nvidia.om/objet/uda.html (2008)
11. NVIDIA Corporation: http://www.nvidia.om/
12. QPACE (QCD PArallel omputing on the CEll B.E.): http://globe-meta.ifh.de:8080/lenya/hp/live/index.html
13. QCDSP (QCD on Digital Signal Proessors): http://phys.olumbia.edu/qft/qdsp/qdsp_hdw.htm
14. Lusher, M.: Solution of the Dira equation in lattie QCD using a domain deomposition method. Comput.
Phys. Commun. 156 (2004) 209220
15
