Status of and performance estimates for QCDOC by Boyle, P. et al.
arXiv:hep-lat/0210034v1 17 Oct 2002
Status of and performance estimates for QCDOC ∗
P.A. Boyle (a,b), D. Chen (c), N.H. Christ (b), C. Cristian (b), Z. Dong (b), A. Gara (c), B. Joó (b), C. Jung (b), C. Kim (b),
L. Levkova (b), X. Liao (b), G. Liu (b), R.D. Mawhinney (b), S. Ohta (d,e), K. Petrov (b), T. Wettig (e,f), A. Yamaguchi (b)
(a) School of Physics, University of Edinburgh, Edinburgh EH9 3JZ, Scotland
(b) Department of Physics, Columbia University, New York, NY, 10027
(c) IBM T.J. Watson Research Center, Yorktown Heights, NY, 10598
(d) Institute for Particle and Nuclear Studies, KEK, Tsukuba, Ibaraki, 305-0801, Japan
(e) RIKEN-BNL Research Center, Brookhaven National Laboratory, Upton, NY, 11973
(f) Department of Physics, Yale University, New Haven, CT, 06520-8120
QCDOC is a supercomputer designed for high scalability at a low cost per node. We discuss the status of
the project and provide performance estimates for large machines obtained from cycle-accurate simulation of the
QCDOC ASIC.
Introduction
QCDOC[1,2] was designed for a cost-effective balance between memory bandwidth, floating
point performance and communication performance when running massively parallelised double
precision QCD codes. As shall be seen, our simulations suggest QCDOC should scale to
exceptionally large machine sizes on a fixed, scientifically interesting lattice volume or,
equivalently, that QCDOC will operate efficiently on very small local volumes. This gives the
focussed compute power necessary to simulate much lighter quark masses than are accessible
today, at a cost of 1 US dollar per sustained Megaflop.
Hardware Overview
The QCDOC design is based on an application specific integrated circuit, or ASIC. We use
IBM System-On-a-Chip technology to integrate all the node logic on a single silicon chip. The
CPU core is a PowerPC 440[3] whose FPU[4] can execute one operate instruction per clock cycle.
Peak performance of two flops per clock is obtained by using fused multiply-add instructions.
The on-chip features include 4 MByte of embedded DRAM memory and its controller, a memory
controller for off-chip DDR memory, an Ethernet interface, and a custom serial communications
unit (SCU), linked by a number of on-chip busses.
∗Presented by PAB at Lattice 2002, Boston.
The prefetching eDRAM controller (PEC) has been custom designed by our colleagues at
IBM Research, and provides ample bandwidth for the FPU. It automatically prefetches two
read streams, and is multi-ported, allowing remote communication to overlap local computation
without contention.
The SCU hardware implements a six-dimensional hyper-torus with bi-directional, low latency
links. The aggregate 1.5 GByte/s inter-node bandwidth easily supports QCD with 1 Gflop of
local floating point all the way down to 2^4 local volumes. The DMA engines in the SCU
transfer complex patterns between nodes while the local floating point calculation proceeds
uninterrupted.
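As an illustration of the torus topology described above, the following minimal sketch lists the nearest neighbours of a node on a periodic six-dimensional grid. The machine shape, the coordinate labelling and the function name are hypothetical and chosen only for the example; they are not taken from the QCDOC software.

```python
# Sketch only: nearest neighbours on a periodic 6-dimensional grid.
# MACHINE_SHAPE and the coordinate convention are assumptions for illustration.

MACHINE_SHAPE = (4, 4, 4, 4, 4, 4)  # hypothetical 4096-node partition


def torus_neighbours(coord, shape):
    """Return the neighbours of `coord` one step forward and one step back
    along every axis, wrapping around at the edges (periodic boundaries)."""
    neighbours = []
    for axis, extent in enumerate(shape):
        for step in (+1, -1):
            n = list(coord)
            n[axis] = (n[axis] + step) % extent  # wrap-around makes it a torus
            neighbours.append(tuple(n))
    return neighbours


nbrs = torus_neighbours((0, 0, 0, 0, 0, 3), MACHINE_SHAPE)
for n in nbrs:
    print(n)
```

With every extent larger than two, each node has twelve distinct neighbours, two per dimension.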
The high dimensionality of the network both increases the communication bandwidth of the
machine, and gives very symmetric local volumes when distributing the problem over very many
nodes in four, or even five, dimensions. This yields the most favorable surface to volume
ratio.
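The surface to volume argument can be made concrete with a small calculation. The sketch below assumes a nearest-neighbour operator and assumes that every dimension of the lattice is split across nodes, so each local volume sends one face of sites in each direction of each dimension; the function name is illustrative.

```python
# Sketch only: face sites communicated per local site for a local volume
# l1 x l2 x l3 x l4, assuming every dimension is distributed across nodes.

def surface_to_volume(local_dims):
    """Two faces per dimension, each of size V / l, divided by the volume V."""
    return 2 * sum(1 / l for l in local_dims)


# Three ways of arranging the same 256 local sites.
for dims in [(4, 4, 4, 4), (2, 4, 4, 8), (2, 2, 8, 8)]:
    print(dims, surface_to_volume(dims))
```

Under these assumptions the symmetric 4^4 arrangement gives a ratio of 2.0, against 2.25 and 2.5 for the two less symmetric arrangements of the same 256 sites.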
Table 1
Performance of double precision assembler kernels produced using the speaker’s code
generator. Very high fractions of the 1 Gflop raw peak, and even higher fractions of the
theoretical peak, are obtained. Some compiled code figures are included to show that
reasonable performance can be obtained without undue pain.
Operation         Mflops/node
SU3-SU3           800
SU3-2spinor       780
DAXPY             190
ZAXPY             450
DAXPY-Norm        350
CloverTerm/asm    790
CloverTerm/gcc    150
CloverTerm/xlc    300
Hardware Status
The QCDOC ASIC design is functionally complete, and the preliminary design release is
undergoing timing driven physical layout at IBM Raleigh. Verification is ongoing, with release
to manufacture expected by mid fall and first silicon at the end of this year. Large prototype
machines are expected in spring 2003.
Software Environment
Two widespread and standard compilers are being used, namely the IBM xlc compiler and the
GNU gcc C and C++ compilers. Debug facilities are provided by a port of the full featured
RISCWatch remote debug tool.
The standard runtime support library is a popular GNU Public License libc released by
Cygnus (footnote 2). Runtime libraries are available for performing communication over the
physics network via the SciDAC QMP interface, which is in turn implemented on top of a native
SCU interface. Assembler optimised QCD kernels will be freely available.
Node kernels
The use of the PowerPC means that standard operating systems could in principle be run on
the nodes. However, to scale to very small local volumes it is essential that, in addition to
very low latency hardware, we avoid unnecessary software latency.
2 Popularised in the Cygwin software package.
Table 2
Sample performance of red-black fermion operators in cycle accurate simulation. The Wilson
and Clover kernels were generated using the speaker’s C++ code scheduler, while the Staggered
operator was hand coded by Calin Cristian. The codes are very efficient even when all sites
are on multiple boundaries.
Operation         Local Vol.    Mflops/node
Wilson D_eo       2^4           470
Wilson D_eo       4^4           535
Clover D_eo       2^4           560
Clover D_eo       4^4           590
Staggered D_eo    2^4           370
Staggered D_eo    2^2 · 4^2     430
We will use a lean node kernel to avoid scheduler overhead and use the PowerPC virtual
memory (VM) hardware to protect, but not translate, memory pages. This gives both the
benefits of robust and graceful error recovery, and the benefits of zero-copy SCU transfers
without VM gymnastics.
Host operating system
The host for QCDOC will be an SMP Unix server with multiple Gigabit links to the
Boot/Diagnostic/IO Ethernet network. A multi-threaded qdaemon will boot and manage the
machine partitions, and service socket-based connections to these partitions from a number of
client programs. Sufficient functionality has already been implemented to boot and download
code to PowerPC boards in parallel.
Performance in simulation
As part of the design verification programme a number of benchmarks and stress tests have
been written, both to check functional correctness and to confirm that design goals are met.
We shall present performance figures for some of these tests, where we have taken a “nominal”
500 MHz CPU and made reasonable assumptions about the interconnect wire length.
Figure 1. Estimated performance per node for assembly coded Wilson dslash operators as a
function of local volume. Roughly two orders of magnitude better scalability is observed for
QCDOC.
[Plot: Mflops/node (axis ticks 100 to 600) against Local Volume (lx * ly * lz * lt) (axis
ticks 100, 1000, 10000). Curves: QCDOC Clover double; QCDOC Wilson double; Alpha Clover
Myrinet/MPI single.]
Table 1 shows a sample of common vector operations performed in Lattice QCD codes. Most
of these figures were obtained using a C++ assembler code generator written by the speaker to
automate loop unrolling, prefetching and detailed scheduling.
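The "raw" peak quoted with Table 1 follows from the nominal 500 MHz clock and the two flops per clock of a fused multiply-add. The short sketch below reproduces that arithmetic and expresses each Table 1 entry as a fraction of the raw peak; the variable names are illustrative.

```python
# Sketch only: Table 1 figures as fractions of the raw peak, taking the
# nominal 500 MHz clock and two flops per clock (fused multiply-add).

CLOCK_MHZ = 500
FLOPS_PER_CLOCK = 2
peak_mflops = CLOCK_MHZ * FLOPS_PER_CLOCK  # 1000 Mflops, i.e. 1 Gflop

table1 = {
    "SU3-SU3": 800,
    "SU3-2spinor": 780,
    "DAXPY": 190,
    "ZAXPY": 450,
    "DAXPY-Norm": 350,
    "CloverTerm/asm": 790,
    "CloverTerm/gcc": 150,
    "CloverTerm/xlc": 300,
}

fractions = {name: mflops / peak_mflops for name, mflops in table1.items()}

for name, mflops in table1.items():
    print(f"{name:16s} {mflops:4d} Mflops  {100 * fractions[name]:5.1f}% of raw peak")
```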
Table 2 shows performance measurements of optimised implementations of various
preconditioned Fermion operators in double precision. In these operators unpaired adds and
multiplies set an upper limit (e.g. 780 Mflops for the Wilson kernel) somewhat below the “raw”
peak. The communication overhead is entirely taken into account, and the performance is
excellent despite having to communicate each site over four wires in the 2^4 case.
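The remark about four wires per site can be checked by counting, for each site of a local volume, how many of its nearest neighbours live on another node. The sketch below assumes a nearest-neighbour stencil in four dimensions with every dimension split across nodes; the function names are illustrative.

```python
# Sketch only: count off-node nearest neighbours per site of a local volume,
# assuming all four dimensions are distributed across nodes.
from itertools import product


def off_node_links(site, local_dims):
    """Number of the 2*d nearest neighbours of `site` outside the local volume."""
    count = 0
    for axis, extent in enumerate(local_dims):
        for step in (+1, -1):
            if not 0 <= site[axis] + step < extent:
                count += 1
    return count


def link_histogram(local_dims):
    """Map 'number of off-node links' to 'number of local sites with that many'."""
    hist = {}
    for site in product(*(range(l) for l in local_dims)):
        k = off_node_links(site, local_dims)
        hist[k] = hist.get(k, 0) + 1
    return hist


print(link_histogram((2, 2, 2, 2)))
print(link_histogram((4, 4, 4, 4)))
```

Under these assumptions all 16 sites of a 2^4 local volume have exactly four off-node neighbours, whereas in a 4^4 local volume the count ranges from zero to four.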
Scalability
Performance characteristics at small local volumes are critical to scaling. Figure 1 shows the
estimated performance of QCDOC on double precision Wilson dslash code (omitting global
summation time) distributed in four dimensions. At the smallest volumes the entire operation
takes roughly 20 µs, with eight communications in this time. For comparison we overlay a
Myrinet Alpha 21264 cluster running single precision assembler code distributed in three
dimensions.
Table 3
Estimates for Wilson CG performance on a 32^3 × 64 lattice. Both Clover and Domain Wall
simulations would be even more scalable.
Nodes     M†M        Gsum       Sust. Tflops
4096      2620 µs    10 µs      2.15
8192      1310 µs    11.5 µs    4.2
16384     680 µs     13 µs      8.1
32768     340 µs     15 µs      15.6
Given the fast linear algebra and the scalable matrix multiply, the remaining hurdle to be
overcome by an iterative solver is global summation. The SCU has “pass-thru” hardware assist
for global sums and broadcasts. The global sum has been benchmarked in simulation, and we
use the mixture of matrix multiplies, linear algebra and global sums in the CG algorithm to
predict the performance of Wilson HMC on large machines in Table 3.
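The Table 3 estimates can be unpacked into per-node quantities with a few lines of arithmetic. The sketch below uses only the numbers in Table 3 and the 32^3 × 64 lattice size; it compares the time of a single global sum with a single M†M application and makes no assumption about how many of each a CG iteration contains.

```python
# Sketch only: derive per-node figures from the Table 3 estimates.

LATTICE = (32, 32, 32, 64)

# (nodes, M†M time in µs, global sum time in µs, sustained Tflops) from Table 3
table3 = [
    (4096, 2620.0, 10.0, 2.15),
    (8192, 1310.0, 11.5, 4.2),
    (16384, 680.0, 13.0, 8.1),
    (32768, 340.0, 15.0, 15.6),
]

total_sites = 1
for extent in LATTICE:
    total_sites *= extent

rows = []
for nodes, mdagm_us, gsum_us, tflops in table3:
    local_sites = total_sites // nodes        # lattice sites held by each node
    mflops_per_node = tflops * 1e6 / nodes    # 1 Tflops = 1e6 Mflops
    gsum_ratio = gsum_us / mdagm_us           # one global sum relative to one M†M
    rows.append((nodes, local_sites, mflops_per_node, gsum_ratio))
    print(f"{nodes:6d} nodes  {local_sites:4d} sites/node  "
          f"{mflops_per_node:6.1f} Mflops/node  gsum/M†M = {100 * gsum_ratio:4.1f}%")
```

On these figures the local volume shrinks from 512 to 64 sites per node while the sustained rate per node falls only from roughly 525 to roughly 476 Mflops, and a single global sum grows from under 1% to about 4% of the M†M time.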
Conclusions
QCDOC is progressing well, with large machines due in 2003. Simulations indicate that QCD
will run at high efficiency even on the largest machines, with a low cost per sustained
Megaflop and low power consumption.
Acknowledgements
This research was supported in part by the U.S. Department of Energy, the Institute of
Physical and Chemical Research (RIKEN) of Japan, and the U.K. Particle Physics and Astronomy
Research Council.
REFERENCES
1. D. Chen et al., Nucl. Phys. B (Proc. Suppl.) 94 (2001) 825.
2. P.A. Boyle et al., Nucl. Phys. B (Proc. Suppl.) 106 (2002) 177.
3. www.ibm.com/chips/techlib/techlib.nsf/products/PowerPC 440 Embedded Core
4. www.chips.ibm.com/products/powerpc/newsletter/aug2001/new-prod3.html
