Benchmarking computer platforms for lattice QCD applications by Hasenbusch, M. et al.
arXiv:hep-lat/0309149v1 22 Sep 2003
Benchmarking computer platforms for lattice QCD applications
M. Hasenbusch (a), K. Jansen (a), D. Pleiter (a), H. Stüben (b), P. Wegner (c), T. Wettig (d), H. Wittig (e)
(a) NIC/DESY-Zeuthen, 15738 Zeuthen, Germany
(b) Konrad-Zuse-Zentrum für Informationstechnik Berlin, 14195 Berlin, Germany
(c) DESY-Zeuthen, 15738 Zeuthen, Germany
(d) Department of Physics, Yale University, New Haven, CT 06520–8120, USA
(e) DESY, 22603 Hamburg, Germany
We define a benchmark suite for lattice QCD and report on benchmark results from several computer platforms.
The platforms considered are apeNEXT, CRAY T3E, Hitachi SR8000, IBM p690, PC-Clusters, and QCDOC.
1. INTRODUCTION
Simulations of lattice QCD require powerful
computers. We have benchmarked computers
that are under consideration by the German Lat-
tice Forum (LATFOR) [1] to realize the future
physics program. The machines we have tested
fall into three categories: (1) machines that are
custom-designed for lattice QCD (apeNEXT and
QCDOC), (2) PC-clusters, and (3) commercial
supercomputers (CRAY, Hitachi, IBM).
2. COMPUTER PLATFORMS
2.1. apeNEXT
The apeNEXT project [2] was initiated with
the goal of building custom-designed computers with
a peak performance of more than 5 TFlops and a
sustained efficiency of about 50% for key lattice
gauge theory kernels. apeNEXT machines should
be suitable for both large-scale simulations with
dynamical fermions and quenched calculations on
very large lattices. The apeNEXT processor is a
64-bit architecture with an arithmetic unit that
can at every clock cycle perform the APE normal
operation a×b+c, where a, b, and c are IEEE 128-
bit complex numbers. The apeNEXT processors
have a very large register file of 256 (64+64)-bit
registers. On-chip network devices connect the
nodes by a three-dimensional network.
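
As a rough illustration of the normal operation described above, the following Python sketch evaluates a×b+c on complex numbers and relates a flop count per cycle to a peak rate. The count of 8 flops per normal operation (6 for the complex multiplication, 2 for the complex addition) is a common convention and an assumption here; the peak value of 1600 MFlops per CPU is the one quoted in Table 1, and the function names are hypothetical.

```python
# Sketch only: flop counting for the APE "normal operation" a*b + c.
# Assumption: a complex multiply costs 6 flops and a complex add 2 flops.
FLOPS_COMPLEX_MUL = 6
FLOPS_COMPLEX_ADD = 2
FLOPS_NORMAL_OP = FLOPS_COMPLEX_MUL + FLOPS_COMPLEX_ADD  # 8 flops


def normal_op(a: complex, b: complex, c: complex) -> complex:
    """Return a*b + c for complex operands."""
    return a * b + c


def implied_clock_mhz(peak_mflops: float, flops_per_cycle: int) -> float:
    """Clock rate implied by a peak rate if one operation completes per cycle."""
    return peak_mflops / flops_per_cycle


if __name__ == "__main__":
    print(normal_op(1 + 2j, 3 - 1j, 0.5j))
    print(implied_clock_mhz(1600, FLOPS_NORMAL_OP), "MHz")
```

Under this counting, the peak rate in Table 1 corresponds to one normal operation per cycle at 200 MHz.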
2.2. QCDOC
QCDOC (“QCD on a Chip”) [3] is a massively
parallel computer optimized for lattice QCD, de-
veloped by a collaboration of Columbia Univer-
sity, UKQCD, the RIKEN-BNL Research Cen-
ter, and IBM. Individual nodes are based on
an application-specific integrated circuit (ASIC)
which combines IBM’s system-on-a-chip technol-
ogy (including a PowerPC 440 CPU core, a 64-bit
FPU, and 4 MB on-chip memory) with custom-
designed communications hardware. The nodes
communicate via nearest-neighbor connections in
six dimensions. The low network latency and
built-in hardware assistance for global sums en-
able QCDOC to concentrate computing power in
the TFlops range on a single QCD problem.
2.3. PC-cluster
In recent years, commodity-off-the-shelf
(COTS) Linux cluster computers have become
cost-efficient, general-purpose, high-performance
computing devices. QCD simulations on clus-
ter computers can be boosted considerably by
exploiting the SIMD and data prefetch function-
ality of Intel Pentium processors via SSE/SSE2
instructions by means of assembler coding. The
benchmarked PC-cluster has 1.7 GHz Xeon Pen-
tium 4 CPUs with 1 GB of Rambus memory.
The nodes communicate via a Myrinet2000 inter-
connect.
2.4. CRAY T3E-900
The CRAY T3E is a classic massively parallel
computer. It has single CPU nodes and a three-
dimensional torus network. The T3E architec-
ture is rather well balanced. Therefore, the over-
all performance of parallel applications scales to
much higher numbers of CPUs than on machines
that were built later. The peak performance of a
T3E-900 is 900 MFlops per CPU, the network la-
tency is 1 µs, and the bidirectional network band-
width is 350 MByte/s.
2.5. Hitachi SR8000-F1
The Hitachi SR8000 is a parallel computer with
shared memory nodes. Each node has 8 CPUs.
The key features of the CPUs are the high memory
bandwidth and the availability of 160 floating-point
registers. These features are accompanied by
pseudo-vectorization, an intelligent prefetch
mechanism that allows computation to overlap with
fetching data from memory. Pseudo-vectorization
is done by the compiler. The peak
performance of an SR8000 CPU is 1500 MFlops,
the network latency is 19 µs, and the bidirectional
bandwidth between nodes is 950 MByte/s.
2.6. IBM p690-Turbo
The IBM p690 is a cluster of shared memory
nodes. Its CPUs (and nodes) have the highest
peak performance of the machines considered but
only a relatively slow network. To increase the
bandwidth of the interconnect, the 32-CPU nodes
are commonly divided into 8-CPU nodes, which
increases the bandwidth per CPU by a factor of
4. The performance depends to a large extent
on the configuration of the machine. When
benchmarking this architecture, one must also take
into account that the performance drops by a factor
of 3–5 when using all CPUs instead of only
one. The peak performance of a Power4 CPU
is 5400 MFlops, the network latency (of the so-
called colony network) is 20 µs, and the bidirec-
tional bandwidth between nodes is 450 MByte/s.
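
To get a feeling for the latency and bandwidth figures quoted for the three commercial machines, one can use a simple linear cost model, time = latency + size / bandwidth. This model is an assumption for illustration, not something the benchmark measures; it also treats the quoted bidirectional bandwidth as the effective transfer rate and takes 1 MByte = 10^6 bytes. The function names are hypothetical.

```python
# Sketch only: linear latency/bandwidth model for a single message.
# (latency in microseconds, bandwidth in MByte/s) as quoted in Sections 2.4-2.6.
NETWORKS = {
    "CRAY T3E-900": (1.0, 350.0),
    "Hitachi SR8000-F1": (19.0, 950.0),
    "IBM p690-Turbo": (20.0, 450.0),
}


def transfer_time_us(nbytes: float, latency_us: float, bandwidth_mbyte_s: float) -> float:
    """Estimated time in microseconds; 1 MByte/s equals 1 byte per microsecond."""
    return latency_us + nbytes / bandwidth_mbyte_s


def break_even_bytes(net_a, net_b) -> float:
    """Message size at which both networks take the same time in this model."""
    (lat_a, bw_a), (lat_b, bw_b) = net_a, net_b
    return (lat_b - lat_a) / (1.0 / bw_a - 1.0 / bw_b)


if __name__ == "__main__":
    for name, (lat, bw) in NETWORKS.items():
        print(name, transfer_time_us(4096, lat, bw), "us for 4096 bytes")
    print(break_even_bytes(NETWORKS["CRAY T3E-900"], NETWORKS["Hitachi SR8000-F1"]))
```

In this model, low latency dominates for short messages, which are typical for boundary exchanges on small local lattices, while high bandwidth only pays off for messages of several kilobytes or more.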
3. BENCHMARK SUITE
In this contribution we concentrate on one par-
ticular application: large-scale simulations of dy-
namical Wilson fermions with O(a)-improvement
on lattices of size V = 32³ × 64. We assume
that these simulations are performed using the
Hybrid Monte Carlo algorithm [4] or the Polyno-
mial Hybrid Monte Carlo algorithm [5], as was
done in simulations with dynamical fermions in
recent years [6].
The most time-consuming operation is the
fermion matrix multiplication. We denote the
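fermion matrix by M[U] = T[U] − H[U], where
H is the Wilson hopping term
$$ H[U]_{xy} = \kappa \sum_{\mu} \left\{ (1-\gamma_\mu)\, U_\mu(x)\, \delta_{x+\hat\mu,\,y} + (1+\gamma_\mu)\, U^\dagger_\mu(x-\hat\mu)\, \delta_{x-\hat\mu,\,y} \right\} \tag{1} $$
and T is the clover term
$$ T[U] = 1 - \frac{i}{2}\, \kappa\, c_{\mathrm{sw}}\, F_{\mu\nu}\, \sigma_{\mu\nu} . $$
Here we only consider the even-odd preconditioned
version ψ = H_eo φ.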
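
The action of the hopping term on a spinor field can be sketched in plain Python as follows. This is a minimal, unoptimized illustration under several assumptions that are not taken from the benchmark code: a tiny 2⁴ lattice with periodic boundary conditions, one particular Hermitian chiral basis for the gamma matrices, fields stored in dictionaries, and no even-odd ordering. All names are hypothetical.

```python
import itertools

DIMS = (2, 2, 2, 2)  # assumed toy lattice; the benchmark uses 32^3 x 64
I = 1j
# Hermitian Euclidean gamma matrices in a chiral basis (one common convention).
GAMMA = [
    [[0, 0, 0, -I], [0, 0, -I, 0], [0, I, 0, 0], [I, 0, 0, 0]],
    [[0, 0, 0, -1], [0, 0, 1, 0], [0, 1, 0, 0], [-1, 0, 0, 0]],
    [[0, 0, -I, 0], [0, 0, 0, I], [I, 0, 0, 0], [0, -I, 0, 0]],
    [[0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]],
]
IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]


def sites():
    """Iterate over all lattice sites as 4-tuples."""
    return itertools.product(*(range(n) for n in DIMS))


def shift(x, mu, step):
    """Neighbor of x in direction mu with periodic boundary conditions."""
    y = list(x)
    y[mu] = (y[mu] + step) % DIMS[mu]
    return tuple(y)


def color_mul(U, v, dagger=False):
    """Multiply a color 3-vector by U, or by U^dagger if dagger is True."""
    if dagger:
        return [sum(U[j][i].conjugate() * v[j] for j in range(3)) for i in range(3)]
    return [sum(U[i][j] * v[j] for j in range(3)) for i in range(3)]


def spin_project(psi, mu, sign):
    """Return (1 + sign * gamma_mu) psi for a spinor psi[spin][color]."""
    return [[psi[a][c] + sign * sum(GAMMA[mu][a][b] * psi[b][c] for b in range(4))
             for c in range(3)] for a in range(4)]


def hopping(U, phi, kappa):
    """Apply Eq. (1): U[(x, mu)] are 3x3 links, phi[x] are 4x3 spinors."""
    out = {}
    for x in sites():
        acc = [[0j] * 3 for _ in range(4)]
        for mu in range(4):
            xp, xm = shift(x, mu, +1), shift(x, mu, -1)
            fwd = spin_project(phi[xp], mu, -1)  # (1 - gamma_mu) phi(x + mu)
            bwd = spin_project(phi[xm], mu, +1)  # (1 + gamma_mu) phi(x - mu)
            for a in range(4):
                f = color_mul(U[x, mu], fwd[a])
                b = color_mul(U[xm, mu], bwd[a], dagger=True)
                for c in range(3):
                    acc[a][c] += kappa * (f[c] + b[c])
        out[x] = acc
    return out


def unit_gauge():
    """Free-field configuration: every link is the identity matrix."""
    return {(x, mu): IDENTITY for x in sites() for mu in range(4)}
```

A simple consistency check: for unit links and a spinor field that is constant over the lattice, the gamma-matrix contributions of the forward and backward terms cancel, so the result is 8κ times the input field.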
Basic operations of linear algebra are needed in
iterative solvers. We have considered the scalar
product
$$ (\psi, \phi) = \sum_{x=1}^{V} \sum_{i=1}^{3} \sum_{\alpha=1}^{4} \psi^*_{i,\alpha}(x)\, \phi_{i,\alpha}(x) , \tag{2} $$
the vector norm
$$ \|\psi\|^2 = \sum_{x=1}^{V} \sum_{i=1}^{3} \sum_{\alpha=1}^{4} |\psi_{i,\alpha}(x)|^2 , \tag{3} $$
the zaxpy operation
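$$ \psi_{i,\alpha}(x) \leftarrow \psi_{i,\alpha}(x) + c\, \phi_{i,\alpha}(x), \quad c \in \mathbb{C} , \tag{4} $$
and the daxpy operation
$$ \psi_{i,\alpha}(x) \leftarrow \psi_{i,\alpha}(x) + r\, \phi_{i,\alpha}(x), \quad r \in \mathbb{R} . \tag{5} $$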
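
The four linear algebra kernels of Eqs. (2)–(5) can be written down compactly. The sketch below stores a spinor field as a flat Python list of 12·V complex numbers, which is an assumed layout, and the flop counts per lattice site follow a common counting convention (6 flops per complex multiplication, 2 per complex addition) that the text does not state explicitly. Function names are hypothetical.

```python
# Sketch only: reference versions of the kernels in Eqs. (2)-(5).
COMPONENTS_PER_SITE = 12  # 3 colors x 4 spins

# Assumed flop counts per lattice site under the usual convention.
FLOPS_PER_SITE = {
    "scalar_product": 8 * COMPONENTS_PER_SITE,  # complex multiply-add
    "norm_squared": 4 * COMPONENTS_PER_SITE,    # 2 multiplies + 2 adds
    "zaxpy": 8 * COMPONENTS_PER_SITE,           # complex multiply-add
    "daxpy": 4 * COMPONENTS_PER_SITE,           # 2 multiplies + 2 adds
}


def scalar_product(psi, phi):
    """Eq. (2): sum of conj(psi) * phi over all components."""
    return sum(a.conjugate() * b for a, b in zip(psi, phi))


def norm_squared(psi):
    """Eq. (3): sum of |psi|^2 over all components."""
    return sum(a.real * a.real + a.imag * a.imag for a in psi)


def zaxpy(psi, c, phi):
    """Eq. (4): psi <- psi + c * phi in place, with complex c."""
    for k, b in enumerate(phi):
        psi[k] += c * b


def daxpy(psi, r, phi):
    """Eq. (5): psi <- psi + r * phi in place, with real r."""
    for k, b in enumerate(phi):
        psi[k] += r * b
```

Under this counting, daxpy performs half as many flops as zaxpy while reading and writing the same amount of data, which is one plausible reason why kernels of this kind tend to be limited by memory bandwidth rather than by arithmetic.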
Two basic operations involving link variables
are part of our benchmark, the multiplication of
an SU(3) matrix by a vector
$$ \psi = U * \phi ; \quad \psi_i = \sum_{j=1}^{3} U_{ij}\, \phi_j \tag{6} $$
and the multiplication of two SU(3) matrices
$$ W = U * V ; \quad W_{ij} = \sum_{k=1}^{3} U_{ik}\, V_{kj} . \tag{7} $$
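
Eqs. (6) and (7) translate directly into code. The following sketch uses nested Python lists for 3 × 3 complex matrices, which is an assumed representation, and the flop counts (66 for the matrix-vector product, 198 for the matrix-matrix product) follow from counting 6 flops per complex multiplication and 2 per complex addition; that convention is an assumption, as are the function names.

```python
# Sketch only: SU(3) matrix-vector and matrix-matrix products, Eqs. (6) and (7).
FLOPS_MAT_VEC = 9 * 6 + 6 * 2        # 9 complex multiplies, 6 complex adds = 66
FLOPS_MAT_MAT = 3 * FLOPS_MAT_VEC    # three matrix-vector products = 198


def su3_mat_vec(U, phi):
    """Eq. (6): psi_i = sum_j U_ij phi_j."""
    return [sum(U[i][j] * phi[j] for j in range(3)) for i in range(3)]


def su3_mat_mat(U, V):
    """Eq. (7): W_ij = sum_k U_ik V_kj."""
    return [[sum(U[i][k] * V[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def dagger(U):
    """Hermitian conjugate of a 3x3 matrix."""
    return [[U[j][i].conjugate() for j in range(3)] for i in range(3)]


if __name__ == "__main__":
    # A cyclic permutation matrix is unitary with determinant 1, hence in SU(3).
    U = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
    print(su3_mat_vec(U, [1 + 1j, 2, 3j]))
    print(su3_mat_mat(U, dagger(U)))  # should be the identity matrix
```

Because an SU(3) matrix satisfies U U† = 1, multiplying a matrix by its Hermitian conjugate is a convenient sanity check for any implementation of Eq. (7).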
Finally, the benchmark contains the basic operations
involving the clover term, ψ = Tφ and
ψ = T⁻¹φ. These were implemented with 6 × 6
block matrices,
$$ \psi = \frac{1}{2} \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} {} & 0 \\ 0 & {} \end{pmatrix} \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix} \phi . \tag{8} $$

Table 1
Benchmark results in MFlops per CPU. All numbers refer to 64-bit floating point arithmetic. Italic
numbers indicate that communications overhead has been included. Further details are given in the text.

                 apeNEXT  QCDOC  PC-Cluster  CRAY  Hitachi   IBM
Peak [MFlops]       1600   1000        3400   900     1500  5200
H_eo φ               894    535         930   101      632   299
(ψ, φ)               656    450         530   148      680   303
||ψ||²               592    384         510    98      789   187
zaxpy                464    450         358   114      479   234
daxpy                116    190         183    57      241   115
U*φ                 1264    780         307   104      811   261
U*V                 1040    800         763   118     1182   413
Tφ                  1136    790         800   111     1137   608

4. BENCHMARK RESULTS
Our benchmark results are listed in Table 1.
The values for apeNEXT and QCDOC were ob-
tained from cycle-accurate simulations of the
forthcoming hardware. All the other performance
numbers were measured on existing machines.
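
One way to read Table 1 is to express each sustained rate as a fraction of the peak rate of the same platform. The short sketch below does this for the hopping term H_eo φ, using only the numbers from Table 1; the dictionary and function names are hypothetical. The resulting ratios are not directly comparable across platforms, since some entries include communications overhead and others are single-node results.

```python
# Sketch only: sustained/peak ratio for the H_eo phi row of Table 1.
PEAK_MFLOPS = {
    "apeNEXT": 1600, "QCDOC": 1000, "PC-Cluster": 3400,
    "CRAY": 900, "Hitachi": 1500, "IBM": 5200,
}
HEO_MFLOPS = {
    "apeNEXT": 894, "QCDOC": 535, "PC-Cluster": 930,
    "CRAY": 101, "Hitachi": 632, "IBM": 299,
}


def efficiency(sustained, peak):
    """Fraction of peak performance reached on each platform."""
    return {name: sustained[name] / peak[name] for name in peak}


if __name__ == "__main__":
    for name, frac in efficiency(HEO_MFLOPS, PEAK_MFLOPS).items():
        print(f"{name:10s} {100 * frac:5.1f} % of peak")
```

For example, the apeNEXT entry corresponds to about 56% of peak, in line with the design goal of roughly 50% sustained efficiency stated in Section 2.1.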
On apeNEXT and QCDOC the hopping term
was benchmarked by distributing the problem
over the maximum number of nodes for the given
problem size. For the PC-cluster, where a C
code with SSE/SSE2 instructions based on the
benchmark program of M. Lüscher [7] has been
used, only single-node numbers (for V = 16⁴)
are quoted because there is still some debate over
which network to use (e.g., Myrinet, Infiniband,
Gbit-Ethernet). The commercial machines have
been benchmarked using 256, 64 and 64 CPUs on
CRAY, Hitachi and IBM, respectively. We used
the Fortran90 production code of the QCDSF
collaboration that is parallelized with MPI and
OpenMP. For the linear algebra routines we used
Fortran loops on the Hitachi and the vendors’
high-performance libraries on CRAY and IBM.
In the case of the scalar product and the vector
norm we quote only the single-processor perfor-
mance, since the performance including the global
sum depends on the number of nodes. We esti-
mated the overhead for computing the global sum
on some platforms, since it will affect scalability
of the considered application when going to a very
large number of nodes:
apeNEXT     5.2 µs        on 4 × 8 × 8 = 256 CPUs
QCDOC       10 (15) µs    on 4096 (32768) CPUs
PC-cluster  138 (166) µs  on 256 (1024) CPUs
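
To illustrate why the cost of a global sum grows with the number of nodes, the following sketch simulates a recursive-doubling all-reduce, in which every node ends up with the full sum after log2(P) exchange steps. This particular algorithm is an assumption chosen for illustration; the text does not describe how the global sum is implemented on each platform, and QCDOC in particular has hardware assistance for it. The function name is hypothetical.

```python
# Sketch only: simulated recursive-doubling all-reduce on P = 2^k nodes.
def allreduce_recursive_doubling(local_values):
    """Return (values held by each node after the reduction, number of steps)."""
    p = len(local_values)
    if p == 0 or p & (p - 1):
        raise ValueError("node count must be a power of two")
    values = list(local_values)
    steps = 0
    distance = 1
    while distance < p:
        # Each node adds the partial sum of the partner whose rank differs
        # in exactly one bit (rank XOR distance).
        values = [values[rank] + values[rank ^ distance] for rank in range(p)]
        distance *= 2
        steps += 1
    return values, steps


if __name__ == "__main__":
    result, steps = allreduce_recursive_doubling([float(r) for r in range(256)])
    print(result[0], steps)
```

In such a scheme, going from 256 to 1024 nodes adds only two exchange steps, so the overhead grows logarithmically rather than linearly with the machine size.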
5. CONCLUSIONS
We presented a selection of benchmarks rel-
evant for doing large-scale simulations of QCD
with dynamical fermions and provided initial
benchmark results for a range of platforms. A
more detailed comparison of these platforms in
terms of price/performance ratio, hardware relia-
bility, software support, etc. is beyond the scope
of this contribution. These questions will be ad-
dressed in a future publication.
REFERENCES
1. R. Alkofer et al., http://www-zeuthen.desy.de/latfor
2. F. Bodin et al., hep-lat/0306018.
3. P.A. Boyle et al., hep-lat/0306023 and these proceedings.
4. S. Duane et al., Phys. Lett. B 195 (1987) 216.
5. P. de Forcrand and T. Takaishi, Nucl. Phys. (Proc. Suppl.) 53 (1997) 968; R. Frezzotti and K. Jansen, Phys. Lett. B 402 (1997) 328.
6. E.g. (previous) proceedings of this conference.
7. M. Lüscher, Nucl. Phys. (Proc. Suppl.) 106 & 107 (2002) 21 [hep-lat/0110007].
