APENet: LQCD clusters à la APE, by R. Ammendola (INFN RM2) et al.
arXiv:hep-lat/0409071v1 14 Sep 2004
APENet: LQCD clusters à la APE∗
R. Ammendola(a), M. Guagnelli(a,b), G. Mazza(a), F. Palombi(a,c), R. Petronzio(a,b), D. Rossetti(d), A. Salamon(a),
and P. Vicini(d).
(a) Istituto Nazionale di Fisica Nucleare, Sezione Roma II
Via della Ricerca Scientifica 1, I-00133 Rome, Italy
(b) Dipartimento di Fisica, Università di Roma Tor Vergata
Via della Ricerca Scientifica 1, I-00133 Rome, Italy
(c) E. Fermi Research Center, c/o Compendio Viminale, pal. F, I-00184 Rome, Italy
(d) Istituto Nazionale di Fisica Nucleare, Sezione Roma I
Piazzale Aldo Moro 2, I-00186 Rome, Italy
Developed by the APE group, APENet is a new high-speed, low-latency, 3-dimensional interconnect architecture
optimized for PC clusters running LQCD-like numerical applications. The hardware implementation is based
on a single PCI-X 133 MHz network interface card hosting six independent bi-directional channels with a peak
bandwidth of 676 MB/s in each direction. We discuss preliminary benchmark results showing performance
comparable to or better than that of high-end commercial network systems.
1. Overview
The APE research group [1] has traditionally focused on the design
and development of custom silicon, electronics and software
optimized for Lattice QCD (LQCD).
Recent work in the area of LQCD numerical applications [2,3,4] has
shown an increasing interest in clusters of commodity PCs. This is
mainly due to two facts: the good sustained performance of commodity
processors on numerical applications and the slow emergence of
low-latency, high-bandwidth network interconnects.
This paper describes APENet, a 3D network of point-to-point,
low-latency, high-bandwidth links well suited for medium-sized
clusters running numerical applications.
2. APENet
APENet is a 3D network of point-to-point links with toroidal
boundary conditions. Each Processing Element (PE), in our case a
cluster node, has 6 full-duplex communication channels (X+, X−,
Y+, Y−, Z+, Z−).
(∗ Talk given at Lattice 2004 by R.A.)
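The toroidal boundary conditions can be illustrated with a minimal sketch. The function name, the coordinate convention and the use of Python are assumptions made for illustration only; they are not part of the APENet software.

```python
from typing import Dict, Tuple

Coord = Tuple[int, int, int]


def torus_neighbours(pe: Coord, dims: Coord) -> Dict[str, Coord]:
    """Return the six neighbours of `pe` on a 3D torus of size `dims`.

    Hypothetical helper: coordinates wrap around in each direction,
    which is what "toroidal boundary conditions" means.
    """
    x, y, z = pe
    nx, ny, nz = dims
    return {
        "X+": ((x + 1) % nx, y, z),
        "X-": ((x - 1) % nx, y, z),
        "Y+": (x, (y + 1) % ny, z),
        "Y-": (x, (y - 1) % ny, z),
        "Z+": (x, y, (z + 1) % nz),
        "Z-": (x, y, (z - 1) % nz),
    }
```

On a 4x4x4 torus, for instance, the X− neighbour of PE (0, 0, 0) is PE (3, 0, 0): the link wraps around the edge of the lattice.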
Data are transmitted in packets, which are routed to the destination
PE according to simple, software-overridable rules. Packet delivery
is always guaranteed: transmission is delayed until the receiver has
enough room in its receive buffers. No external routing device is
necessary: next-neighbour and longer-distance communications are
obtained efficiently by hopping until the destination PE is reached,
without penalties for in-between PEs.
Latency is kept to a minimum thanks to a lightweight low-level
protocol (just two 64-bit words for the header and the footer) and
to the cut-through architecture of the switching device. Within 10
clock cycles of the arrival of the header, the receiving channel
starts forwarding the packet along its path, either toward local
buffers (for packets intended for that very PE) or toward the proper
transmitting channel (for packets which hop away).
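The routing rules themselves are not spelled out here. As a sketch only, the following assumes dimension-order routing (X first, then Y, then Z) along the shorter way around each ring; this is a common choice on torus networks, not necessarily the rule implemented in APENet, and all names are hypothetical.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]


def signed_step(src: int, dst: int, size: int) -> int:
    """Shortest signed displacement from src to dst on a ring of `size` nodes."""
    d = (dst - src) % size
    return d if d <= size // 2 else d - size


def route(src: Coord, dst: Coord, dims: Coord) -> List[str]:
    """List the channels a packet crosses under the assumed X, Y, Z ordering."""
    hops: List[str] = []
    for axis, name in enumerate("XYZ"):
        step = signed_step(src[axis], dst[axis], dims[axis])
        # One hop per unit of displacement, in the direction of its sign.
        hops += [name + ("+" if step > 0 else "-")] * abs(step)
    return hops
```

Under these assumptions, a packet travelling from (0, 0, 0) to (3, 1, 0) on a 4x4x4 torus takes one X− hop through the wrap-around link and one Y+ hop.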
Figure 1. The APELink card.
2.1. The APELink Card
The building block of the APENet implementation is the APELink card,
shown in Fig. 1. The APELink is a PCI-X 133 MHz, 64-bit card which
uses an Altera Stratix device, a latest-generation FPGA, as the
network device controller, and six pairs of serializers/deserializers
from National Semiconductor as physical link interfaces. The APELink
card (see Fig. 2) is composed of three major functional blocks. Each
block has its own clock domain, and all data communications between
these blocks are based on dual-clock FIFOs, which guarantee the
robustness of the hardware itself. The first block is the PCI-X
interface, which handles the communication with the host PCI-X bus;
the second block, called the crossbar switch, controls the data flow
between the PCI-X channel and the remote communication links; the
third block implements the six bi-directional remote communication
links.
Figure 2. The APELink functional blocks diagram.
[Diagram labels: PCI-X bus (64b*133MHz); PCI-X core; crossbar switch with router; FIFOs (2K*64b and 8K*64b); internal data paths (48bit*133MHz and 64b*133MHz); EP1S30 device; six Tx/Rx channel pairs (X+, X-, Y+, Y-, Z+, Z-); links 8bit(diff.)*800MHz, 6.4(x2) Gb/s.]
2.2. The APENet Software
The main programming interface is a simple proprietary library
(apelib) of C functions, including synchronous, asynchronous and
basic collective functions. It relies on the APELink driver, a Linux
device driver that is fully multiprocessor (SMP) aware and supports
versions 2.4 and 2.6 of the Linux kernel. We are developing both an
MPI implementation based on LAM-MPI and a network device driver,
which allows simple IP protocol traffic to be routed on the APENet.
The apelib is targeted at numerical application code and includes
basic primitives such as ape_send(), ape_recv(), ape_sndrcv(), and
some collective functions (ape_broadcast(), ape_global_sum()). The
ape_sndrcv() primitive squeezes the best performance from our
architecture, as it asymptotically exercises two channels at once,
increasing the aggregate bandwidth.
3. Benchmarks
In this section we report some preliminary low-level benchmark
results, obtained on early APELink prototypes.
Benchmarks were performed on dual Intel Xeon PCs, with both
ServerWorks GC-LE and Intel E7501 chipsets. The PCs are connected in
a small ring topology. The APELink channel speed is currently kept
at 100 MHz, with a peak performance of 508 MB/s per link, while the
PCI-X interface runs at 133 MHz.
The benchmark performs a "ping-pong" data transfer (unidirectional
and bi-directional) between two adjacent PEs. In the unidirectional
test, one PE sends a message to a remote PE, then blocks on receiving
a response. The second PE receives the full message and sends back
the same amount of data. Half the round-trip time, averaged over a
number of iterations, is defined as the latency, i.e. the message
transfer time.
Figure 3. Latency as measured in a ping-pong test for small packet sizes.
[Plot: time (µs), 4 to 128, versus message size (bytes), 16 to 16K; curve: unidirectional ping-pong.]
From the same test we have estimated the sustained bandwidth. The
bi-directional test differs from the unidirectional one in that both
PEs send data simultaneously using the ape_sndrcv() function.
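The latency definition used in the unidirectional test can be written down as a short sketch. The function names are hypothetical, and the bandwidth formula (message size divided by the half round-trip time, with 1 MB taken as 10^6 bytes) is an assumption about how the sustained bandwidth is estimated, not a statement of the actual benchmark code.

```python
from statistics import mean
from typing import Sequence


def half_round_trip_us(round_trips_us: Sequence[float]) -> float:
    """Latency as defined in the text: half the averaged round-trip time."""
    return mean(round_trips_us) / 2.0


def bandwidth_mb_per_s(message_bytes: int, latency_us: float) -> float:
    """Assumed estimate: message size over transfer time.

    Bytes per microsecond is numerically equal to 10^6 bytes per second,
    so no further unit conversion is needed.
    """
    return message_bytes / latency_us
```

With illustrative (not measured) round-trip times of 38, 40 and 42 µs, the latency comes out at 20 µs; a 4096-byte message transferred in that time would correspond to about 205 MB/s under the assumed formula.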
Figure 4. Bandwidth measured for both unidirectional and bidirectional tests. The zero-copy label refers to the use of optimized memory buffers.
[Plot: bandwidth (MB/s), 0 to 700, versus message size (bytes), 16 to 1M; curves: bidir sndrcv zero-copy, bidir sndrcv, unidir zero-copy, unidir.]
In Fig. 3 we plot the latency for message sizes ranging from 16
bytes to 16 KB. The smallest message size is 16 bytes, as the minimum
packet payload is a 128-bit word. The estimated latency is ∼6 µs
and is constant for message sizes up to 256 bytes. For 4096-byte
messages we measure 20 µs, which is comparable to commercial
interconnects [5].
Fig. 4 shows the bandwidth plot with message sizes ranging from 16
bytes to 1 MB. The bi-directional zero-copy bandwidth saturates at
677 MB/s. At 1 MB message size the uni-directional bandwidth is
470 MB/s, roughly 90% of the channel peak performance at 100 MHz.
The plot shows two pairs of curves: those marked zero-copy refer to
the use of pinned-down memory, suitable for PCI DMA transfers. In
this way the overhead of expensive memory copy operations to and
from DMA memory buffers is avoided. Non-zero-copy data are reported
only to simplify the discussion.
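The figures quoted above can be cross-checked with a few lines of arithmetic. This is a sketch under the assumption that the link peak bandwidth scales linearly with the channel clock; the helper names are hypothetical.

```python
def efficiency(measured_mb_s: float, peak_mb_s: float) -> float:
    """Fraction of the peak bandwidth that is actually sustained."""
    return measured_mb_s / peak_mb_s


def scaled_peak(peak_mb_s: float, clock_mhz: float, new_clock_mhz: float) -> float:
    """Peak bandwidth at a different link clock, assuming linear scaling."""
    return peak_mb_s * new_clock_mhz / clock_mhz


# 470 MB/s sustained against the 508 MB/s peak at 100 MHz.
unidir_efficiency = efficiency(470.0, 508.0)

# The same link extrapolated from 100 MHz to 133 MHz.
peak_at_133 = scaled_peak(508.0, 100.0, 133.0)
```

The first ratio is about 0.925, consistent with the "roughly 90%" quoted in the text. The second value is about 676 MB/s, which matches the per-direction peak bandwidth given in the abstract for the 133 MHz card.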
4. Conclusions
The hardware design of the APELink card is complete, and we are
running tests on the final release of the board, whose link channels
run at full speed (133 MHz). Preliminary benchmarks have shown
encouraging results, comparable with commercial network
interconnects. The APELink software is progressing rapidly: current
activities focus on a better low-level driver and an MPI
implementation.
The INFN prototype APENet PC cluster, composed of 16 PCs equipped
with APELink boards, is ready to be used on LQCD test codes. We plan
to expand it to 64 PCs (4^3 topology) in the near future.
REFERENCES
1. The APE group, Istituto Nazionale di Fisica
Nucleare
http://apegate.roma1.infn.it/APE
2. M. Lüscher, Nucl. Phys. Proc. Suppl. 106, 21
(2002) [arXiv:hep-lat/0110007].
3. Z. Fodor and S. D. Katz, JHEP 0203, 014
(2002) [arXiv:hep-lat/0106002].
4. T. Lippert, Nucl. Phys. Proc. Suppl. 129, 88
(2004) [arXiv:hep-lat/0311011].
5. J. Liu et al., Performance Comparison of MPI
Implementations over InfiniBand, Myrinet
and Quadrics, SuperComputing Conference,
November 2003.
