APEnet+: high bandwidth 3D torus direct network for petaflops scale
  commodity clusters by Ammendola, Roberto et al.
R Ammendola1, A Biagioni2, O Frezza2, F Lo Cicero2, A Lonardo2,
P S Paolucci2, D Rossetti2, A Salamon1, G Salina1, F Simula3,
L Tosoratto2 and P Vicini2
1 INFN Tor Vergata, Roma
2 INFN Roma, Roma
3 Sapienza Università di Roma, Roma
E-mail: francesca.locicero@roma1.uniroma1.it, piero.vicini@roma1.infn.it
Abstract.
We describe herein the APElink+ board, a PCIe interconnect adapter featuring the
latest advances in wire speed and interface technology plus hardware support for an RDMA
programming model and experimental acceleration of GPU networking; this design allows us to
build a low latency, high bandwidth PC cluster, the APEnet+ network, the new generation of
our cost-effective cluster network architecture, scalable to tens of thousands of nodes.
Some test results and a characterization of data transmission on a complete testbench, based
on a commercial development card mounting an Altera® FPGA, are provided.
1. Introduction
The Array Processor Experiment (APE) is a custom design for HPC targeting the field of Lattice
QCD, started by the Istituto Nazionale di Fisica Nucleare in partnership with a number of physics
institutions all over the world; since its start in 1984 it has developed four generations of
custom machines [1, 2, 3, 4]. Leveraging the acquired know-how in networking and reusing
the insights gained, a spin-off project called APEnet [5, 6] developed an interconnect
board that allows assembling a PC cluster à la APE with off-the-shelf components.
Following further developments funded by EU projects (FP6 SHAPES [7, 8] and FP7
EURETILE), the APEnet project evolved into APEnet+ [9]; its achievement is the design of the
APElink+ host adapter, which integrates both a network interface and a switching component,
bringing in state-of-the-art wire speeds for the links and a PCIe X8 gen2 host interface. With
this latest push towards higher bandwidth, low power and low cost of the data transmission system,
we address not only a broader range of intensive numerical algorithms (Lattice QCD
is our primary but not exclusive concern), but also the field of acquisition systems for modern
particle and astroparticle experiments (sLHC, ILC, CLIC, NA62, ...).
The outline of this article is as follows: the first section explains the global network
architecture; the second one gives the details of the host board; the third one outlines the
software stack provided by the programming environment; the fourth one sketches the current
deployment of APEnet+ hardware in the framework of our QUOnG HPC initiative; the fifth
and final one gives conclusions and outlines future work.
arXiv:1102.3796v1 [physics.comp-ph] 18 Feb 2011
2. The APEnet+ hardware
The APEnet+ interconnect is our low latency, high bandwidth packet-based direct network,
supporting state-of-the-art link wire speeds and a PCIe X8 gen2 host connection. On this
network, the computing host (e.g. a multi-core CPU optionally paired with a GPU) is
equipped with one APElink+ board and made into a node of the cluster. The nodes are
connected by point-to-point links to form a 3D torus in a cubic mesh; each node communicates
with each of its 6 neighbours along the X+, X−, Y+, Y−, Z+ and Z− directions through
bi-directional full-duplex communication channels.
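The neighbour relation on the torus can be made concrete with a few lines of Python. This is only an illustrative sketch: the coordinate convention, the function name `torus_neighbours` and the `dims` parameter are assumptions introduced here and are not taken from the APEnet+ firmware or software.

```python
from typing import Dict, Tuple

Coord = Tuple[int, int, int]

# Direction labels as used in the text, mapped to (axis index, step).
DIRECTIONS = {
    "X+": (0, +1), "X-": (0, -1),
    "Y+": (1, +1), "Y-": (1, -1),
    "Z+": (2, +1), "Z-": (2, -1),
}


def torus_neighbours(node: Coord, dims: Coord) -> Dict[str, Coord]:
    """Return the six neighbours of `node` in a 3D torus of size `dims`."""
    result = {}
    for label, (axis, step) in DIRECTIONS.items():
        coord = list(node)
        # Modulo arithmetic implements the wrap-around links of the torus.
        coord[axis] = (coord[axis] + step) % dims[axis]
        result[label] = tuple(coord)
    return result
```

For instance, in a hypothetical 4x4x4 torus, `torus_neighbours((0, 0, 3), (4, 4, 4))` places the X− neighbour at `(3, 0, 3)` and the Z+ neighbour at `(0, 0, 0)`, which is the wrap-around that distinguishes a torus from a plain mesh.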
The size of the packet envelope (header plus footer) is fixed and hard-coded, while the payload
size is variable; packets are auto-routed to their final destinations according to wormhole
dimension-ordered static routing, with the system taking care of deadlock avoidance.
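Dimension-ordered routing can be sketched as follows. The sketch assumes that the X offset is resolved first, then Y, then Z, and that the shorter way round each ring is chosen (ties go to the "+" direction); the text does not state either choice, and the names `ring_step`, `next_hop` and `route` are hypothetical. Wormhole switching and the virtual channels used for deadlock avoidance are not modelled.

```python
def ring_step(src, dst, size):
    """Return +1, -1 or 0: the direction of the shorter way round one ring."""
    forward = (dst - src) % size
    if forward == 0:
        return 0
    backward = size - forward
    return +1 if forward <= backward else -1


def next_hop(current, dest, dims):
    """Return the output direction (e.g. 'X+'), or None if already at dest."""
    for axis, name in enumerate("XYZ"):
        step = ring_step(current[axis], dest[axis], dims[axis])
        if step != 0:
            # The first dimension with a non-zero offset is fully resolved
            # before any later dimension is touched.
            return name + ("+" if step > 0 else "-")
    return None


def route(src, dest, dims):
    """Return the list of directions a packet takes from src to dest."""
    node = list(src)
    path = []
    while True:
        hop = next_hop(node, dest, dims)
        if hop is None:
            return path
        axis = "XYZ".index(hop[0])
        node[axis] = (node[axis] + (1 if hop[1] == "+" else -1)) % dims[axis]
        path.append(hop)
```

Because the path depends only on source and destination, the routing is static: two packets between the same pair of nodes always follow the same sequence of links.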
The hardware block structure, depicted in Figure 1, is split into a so-called network interface
(the packet injection and processing logic comprising PCIe, TX/RX logic, etc.), a router
component and multiple torus links.
 
Figure 1. Internal FPGA block architecture. [Block diagram; labels: FPGA; Network Interface
block (PCIe X8 Gen2 core, TX/RX FIFOs & Logic, NIOS II processor, collective communication
block, memory controller); Router block (7x7 ports switch, routing logic, arbiter); six torus
links (X+, X-, Y+, Y-, Z+, Z-); DDR3 Module; 128@250MHz bus; PCIe X8 Gen2 @5 Gbps;
100/1000 Gbps port.]
The APElink+ network interface has two main tasks:
• On the transmit data path, it gathers data coming in from the PCI Express port and fragments
the data stream into packets, which are forwarded to the relevant destination ports
depending on the requested operation.
• On the receive side, it provides hardware support for the RDMA programming model,
implementing the basic RDMA capabilities (PUT and GET semantics) at the firmware
level.
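The fragmentation performed on the transmit path can be illustrated with a small model. This is a sketch under stated assumptions: the maximum payload size of 4096 bytes, the `Packet` fields and the function names are hypothetical and chosen only for illustration; the text specifies only that the payload size is variable.

```python
from dataclasses import dataclass
from typing import List, Tuple

MAX_PAYLOAD = 4096  # bytes; hypothetical value, not stated in the text


@dataclass
class Packet:
    dest: Tuple[int, int, int]  # destination node coordinates
    offset: int                 # position of this fragment in the source buffer
    payload: bytes              # variable-size payload


def fragment(data: bytes, dest, max_payload: int = MAX_PAYLOAD) -> List[Packet]:
    """Split a data stream into packets of at most `max_payload` bytes."""
    return [
        Packet(dest, off, data[off:off + max_payload])
        for off in range(0, len(data), max_payload)
    ]


def reassemble(packets: List[Packet]) -> bytes:
    """Rebuild the original buffer, whatever the packet arrival order."""
    ordered = sorted(packets, key=lambda p: p.offset)
    return b"".join(p.payload for p in ordered)
```

Carrying the offset in each packet is one simple way to let the receiving side place every fragment directly at its final address, which is what PUT semantics require.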
Within this block, the addition of a NIOS II 32-bit embedded micro-controller (a standard
Altera® Intellectual Property core) simplifies some tasks along the path of the received packets.
The routing block takes care of examining the packet header and resolving the destination
address into a proper path across the switch according to the chosen routing algorithm.
The torus link block manages the data flow by encapsulating the APEnet+ packets into a
light, low-level, word-stuffing protocol able to detect transmission errors via CRC. It implements
two virtual channels [10] and proper flow-control logic on each RX link block to guarantee
deadlock-free operations.
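Flow control between a transmitter and a finite receive buffer is commonly realized with credits. The toy model below assumes such a scheme, with one credit per free buffer slot; the class names, the slot granularity and the buffer depth are all hypothetical and do not describe the actual link logic.

```python
from collections import deque


class RxBuffer:
    """Receive buffer with a fixed number of slots."""

    def __init__(self, slots):
        self.slots = slots
        self.queue = deque()

    def accept(self, word):
        # A correct sender never overruns the buffer.
        assert len(self.queue) < self.slots, "overflow: sender ignored credits"
        self.queue.append(word)

    def drain(self):
        """Consume one word and return one credit for the sender."""
        self.queue.popleft()
        return 1


class TxPort:
    """Transmitter that sends only while it holds credits."""

    def __init__(self, rx):
        self.rx = rx
        self.credits = rx.slots  # initially every remote slot is free

    def try_send(self, word):
        if self.credits == 0:
            return False  # stall instead of overrunning the receiver
        self.credits -= 1
        self.rx.accept(word)
        return True

    def add_credits(self, n):
        self.credits += n
```

With a two-slot buffer, a third `try_send` returns `False` until the receiver drains a word and the returned credit is handed back through `add_credits`.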
3. The APElink+ card
For the design of the building block of the APEnet+ infrastructure we leveraged the most
recent advances in host interface technology, physical link speed and connector mechanics; the
result is the latest generation of our hardware, the APElink+ card (see Table 1).
The APElink+ card is a single FPGA-based PCI Express board; the employed FPGA device
is the EP4SGX290, which is part of the Altera® 40 nm Stratix IV device family and comes
equipped with 36 full-duplex CDR-based transceivers, supporting data rates up to 8.5 Gbps
each. It also provides a PCIe X8 gen2 interface, which is complemented by a commercial PCIe
core to allow communication between the host processor and the network.
Moreover, an Ethernet port is planned in order to build an additional, secondary network
with an offload engine for collective communication tasks.
3.1. APElink+ physical links
In the global network structure, each card stands as a vertex of a 3D torus mesh network with
6 independent point-to-point multi-lane channels (i.e. the links between mesh sites). Each
link is made up of 4 bi-directional lanes bonded together; the automatic alignment logic is our
original addition.
Four links out of six are hosted on the main board; two more, say Z+ and Z−, are located
on a small, detachable daughter-card on the upper level. In this way, the complete card takes
up two standard PCI slots in a PC chassis, while retaining the option, if four links are enough,
of using it in a single-slot-wide configuration.
The torus links are 6 independent blocks with 2 virtual channel receive buffers each, added
for deadlock prevention. Proper flow control is maintained via a credit handshake between
a local RX block and the remote TX block; this handshake is embedded in the link protocol
data layer. The torus link is able to autonomously re-transmit the header and the footer in
case of transmission errors. The protocol therefore assures the delivery of the packet, avoiding
unrecoverable situations where badly corrupted packets (with errors in the header or footer)
pose a threat to the global routing. Packets with payload errors (signaled by the footer) must
instead be handled at the software level. The chosen CRC generator polynomial is the
industry-standard, well-known CRC-32.
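The distinction between header errors and payload errors can be illustrated with Python's `zlib.crc32`, which implements the standard CRC-32. The frame layout below (header, payload, then a footer carrying one checksum for each) is an assumption made for this sketch; the actual APElink+ header and footer formats are not described in the text.

```python
import struct
import zlib

FOOTER_LEN = 8  # two 32-bit checksums; hypothetical layout


def build_frame(header: bytes, payload: bytes) -> bytes:
    """Append a footer holding separate CRC-32 values for header and payload."""
    footer = struct.pack("<II", zlib.crc32(header), zlib.crc32(payload))
    return header + payload + footer


def check_frame(frame: bytes, header_len: int):
    """Return (header_ok, payload_ok) for a received frame."""
    header = frame[:header_len]
    payload = frame[header_len:-FOOTER_LEN]
    header_crc, payload_crc = struct.unpack("<II", frame[-FOOTER_LEN:])
    return zlib.crc32(header) == header_crc, zlib.crc32(payload) == payload_crc
```

In this toy layout a result of `(True, False)` corresponds to the case the text leaves to software: the packet can still be routed safely because its header is intact, but its payload is flagged as corrupted.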
3.2. APElink+ routing capabilities
The router comprises a fully connected, 7-ports-in/7-ports-out switch, plus routing and
arbitration blocks. The routing block examines the header of each packet and translates its
destination address to a proper path across the switch; the routing is dimension-ordered, with
a measured latency of 60 ns.
Table 1. Evolution of the APElink cards.
                   APElink                     APElink+
FPGA component     Altera Stratix S30          Altera Stratix IV GX 290
# links            6                           4/6
link technology    external National ser/des   embedded Altera transceivers
link cables        LVDS                        QSFP+ standard
raw link speed     6 Gbps                      34 Gbps
host interface     PCI-X 133 MHz               PCIe X8 Gen2
peak host BW       1 GB/s                      4+4 GB/s
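The APElink+ figures in Table 1 follow from quantities given elsewhere in the text, as the short calculation below shows. The lane count and lane rate come from Section 3; the 5 GT/s signalling rate and 8b/10b line coding are properties of PCIe Gen2 rather than of this paper, and the variable names are introduced here for illustration.

```python
# Raw torus link speed: 4 bonded lanes at the transceiver maximum rate.
LANES_PER_LINK = 4
LANE_RATE_GBPS = 8.5
raw_link_gbps = LANES_PER_LINK * LANE_RATE_GBPS

# Peak host bandwidth per direction for PCIe X8 Gen2.
PCIE_LANES = 8
PCIE_GEN2_GTPS = 5        # giga-transfers per second per lane
DATA_BITS = 8             # 8b/10b coding: 8 data bits ...
ENCODED_BITS = 10         # ... are sent as 10 bits on the wire
pcie_gbps = PCIE_LANES * PCIE_GEN2_GTPS * DATA_BITS / ENCODED_BITS
pcie_gbytes_per_s = pcie_gbps / 8

print(f"raw link speed: {raw_link_gbps:g} Gbps")
print(f"peak host bandwidth: {pcie_gbytes_per_s:g} GB/s per direction")
```

The first result matches the 34 Gbps raw link speed in the table, and the second matches the 4+4 GB/s entry, where the two terms are the two directions of the full-duplex PCIe connection.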
3.3. Test bed and preliminary results
A schematic view of the complete APEnet+ board is shown in Figure 2. The prototypes will
be available in February 2011.
Figure 2. APElink+ board. [Schematic; labels: Main PCB Layer, Secondary PCB Layer,
Board-to-Board Connector, QSFP Connector, ALTERA, SO-DIMM, I/O Slot 1, I/O Slot 2.]
Figure 3. Assembled APElink+ test system.
A test system has been built in order to develop the FPGA firmware, the PCIe interface and
the physical layer interconnection technology [11]. We used a commercial Altera® development
kit (equipped with a smaller Altera® Stratix IV GX 230) and a custom-designed daughter-card
(an HSMC mezzanine designed at LABE in INFN-Roma) hosting 3 QSFP+ connectors and
some SMA test points. This setup allows us to test the complete communication chain up to a
bitrate of 24 Gbps for each link. Signal integrity was checked by connecting to dedicated SMA
test points, directly at the output of the FPGA transceivers (see Figure 4) and on the mezzanine
card (see Figure 5) after one Samtec® connector, two QSFP+ connectors and 1 m of copper
QSFP+ cable.
The link was successfully tested up to 3 Gbps data rate (compared to 8.5 Gbps achievable
with the Stratix IV embedded transceivers). Above this limit we found some signal degradation,
probably caused by the tower connector between the Altera® development kit and the test
mezzanine. Investigation is in progress; a likely culprit is the reduced bandwidth (below 5
GHz) of the 19 mm QTH Samtec® connector, which will in any case be replaced by higher
bandwidth connectors in the production release of the communication card [12].
Characterization of signal integrity (and maximum achievable bandwidth) versus serial
transceiver pre-emphasis and equalization is still in progress.
Figure 4. Eye diagram at 5 Gbps on the development kit.
Figure 5. Eye diagram at 3 Gbps on the mezzanine card.
Recovered clock stability was checked by transmitting a pseudorandom data stream organized
in 128-bit-wide words over a 1 m copper QSFP+ cable and checking the relative phase between
the input and the output clocks (see Figure 6). The recovered clock was found to be stable and
in phase with the input clock up to 400 MHz.
Latency was checked by transmitting a pseudorandom sequence over a 1 m copper QSFP+ cable
and raising a flag every time a fixed test word is transmitted by the serializer and received by
the deserializer (see Figure 7). The transmission system latency was found to be stable up to
a 160 MHz transmit clock.
Figure 6. Recovered clock stability check at 40 MHz.
Figure 7. Latency measurement at 40 MHz.
4. The APEnet+ software stack
All APEnet+ software is available under the GNU GPL licence and spans four major
areas: the firmware running on the FPGA embedded processor, the Linux kernel driver,
the application-level RDMA library and an MPI implementation, the latter three developed
and tested under Red Hat Enterprise Linux 5.
Communication primitives (rdma_put(), rdma_get(), rdma_send()), buffer registration
primitives (register_buffer(), unregister_buffer()) and synchronization primitives
(wait_event()), covering a custom subset of the low-level RDMA APIs, are made available to
the application programmer as a highly optimized C language library. On top of these, we built
a native APEnet+ BTL module for OpenMPI 1.X.
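The text names the primitives but does not give their signatures, so the following is not the APEnet+ C library. It is a toy in-memory model, written in Python, of the general pattern such primitives suggest: a buffer is registered on the target, a remote PUT writes into it, and the target learns of the transfer through an event. Every parameter list, the event tuple and the `ToyNode` class are assumptions made for illustration.

```python
from collections import deque


class ToyNode:
    """In-memory stand-in for one cluster node (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.registered = {}   # buffer id -> bytearray exposed to RDMA
        self.events = deque()  # completion events, oldest first

    def register_buffer(self, buf_id, size):
        self.registered[buf_id] = bytearray(size)

    def unregister_buffer(self, buf_id):
        del self.registered[buf_id]

    def rdma_put(self, remote, buf_id, offset, data):
        """Write `data` into a buffer registered on the `remote` node."""
        if buf_id not in remote.registered:
            raise KeyError("target buffer is not registered")
        target = remote.registered[buf_id]
        if offset + len(data) > len(target):
            raise ValueError("PUT would overrun the registered buffer")
        target[offset:offset + len(data)] = data
        remote.events.append(("PUT_RECEIVED", buf_id, offset, len(data)))

    def wait_event(self):
        """Return the oldest pending event (the toy model never blocks)."""
        return self.events.popleft()
```

A typical exchange in this model is `dst.register_buffer("rx", 16)` on the receiver, `src.rdma_put(dst, "rx", 4, b"ping")` on the sender, and `dst.wait_event()` on the receiver to learn that four bytes landed at offset 4.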
Work is underway [13] on the hardware and software features needed for GPU-initiated
communications, e.g. providing, by means of so-called PCIe peer-to-peer transactions, a
CUDA-enabled [14] version of the rdma_put() primitive, in order to avoid intermediate copies
onto CPU memory buffers. To further reduce overhead, another development concerns the delivery
of RDMA events by the APElink+ hardware into CPU memory in a way that is accessible from
within CUDA kernels.
Another research topic is exposing GPU memory areas as RDMA buffers, so that they
can be targets of RDMA PUT and GET operations, further reducing the latency of network
operations. To this end, discussions are ongoing with some GPU vendors.
The firmware software running on the FPGA embedded processor is currently in charge of
managing the RDMA virtual-to-physical address translation table, but we are exploring new
ways to exploit it for higher-level tasks.
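A virtual-to-physical translation table of the kind mentioned above can be modelled as a page-granular lookup. The 4096-byte page size, the class name and its methods are assumptions for this sketch; the text does not describe how the firmware organizes its table.

```python
PAGE_SIZE = 4096  # bytes; assumed page size, not stated in the text


class TranslationTable:
    """Maps virtual page numbers to physical page numbers."""

    def __init__(self):
        self.pages = {}

    def map_page(self, vaddr, paddr):
        """Record the mapping of one page; both addresses must be page-aligned."""
        if vaddr % PAGE_SIZE or paddr % PAGE_SIZE:
            raise ValueError("addresses must be page-aligned")
        self.pages[vaddr // PAGE_SIZE] = paddr // PAGE_SIZE

    def translate(self, vaddr):
        """Return the physical address for `vaddr`; KeyError if unmapped."""
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        return self.pages[vpn] * PAGE_SIZE + offset
```

A lookup failure in such a table would correspond to a transfer that targets memory which was never registered, which is why buffer registration has to precede any RDMA operation on that buffer.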
5. The deployment initiative
We are currently exploring the interconnection of GPU-equipped systems by means of APEnet+
(the QUOnG project) to reach the PetaFLOPs range in aggregate computing power, and we are
working on some GPU-related driver optimizations. For 2011, our road-map foresees the
integration of a QUOnG rack, a mesh of rack-mounted 1U computing nodes, based on a
commodity Intel Xeon 5650 CPU, accelerated via high-end GPUs (Nvidia Tesla C1060/M2050)
and interconnected with the APEnet+ hardware. This system, housed in a single 42U rack,
will show a peak performance exceeding 60 TeraFLOPs and a power consumption of less than
26 kW. Leveraging the APEnet+ network, multiple QUOnG racks can be assembled to push
the complete system up to the PetaFLOPs scale.
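A back-of-the-envelope estimate shows what PetaFLOPs scale implies in terms of racks. The calculation uses only the two per-rack bounds quoted above and assumes ideal linear scaling of peak performance, with no allowance for additional switching, cooling or storage equipment.

```python
import math

RACK_PEAK_TFLOPS = 60    # lower bound on peak performance per rack
RACK_POWER_KW = 26       # upper bound on power consumption per rack
TARGET_TFLOPS = 1000     # 1 PetaFLOPs

# Since each rack delivers more than 60 TeraFLOPs, this many racks suffice.
racks_needed = math.ceil(TARGET_TFLOPS / RACK_PEAK_TFLOPS)

# Since each rack draws less than 26 kW, this bounds the total power.
total_power_kw = racks_needed * RACK_POWER_KW

print(f"racks needed: at most {racks_needed}")
print(f"total power: less than {total_power_kw} kW")
```

Under these assumptions, no more than 17 racks are needed to reach 1 PetaFLOPs of peak performance, with a total power draw below 442 kW.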
6. Conclusions and future developments
A first mini-cluster is being assembled with GPUs and the 3-link version of the APElink+
card, for final validation of the firmware, the interconnection and the complete software stack
on a small network (2-8 nodes). Synthetic tests, as well as real-life simulations, will be
performed, so as to be ready for the 6-link prototype release and, eventually, a bigger cluster
deployment.
The presence on the APEnet+ card of a programmable component of considerable power will
allow us to explore reconfigurable computing, e.g. accelerating some tasks directly in hardware.
The needs of a large scale deployment make it necessary for APEnet+ to employ fault-
tolerance features; we will be adding support for links’ self-diagnosis and the capability of routing
around faulty nodes [15, 16].
7. Acknowledgments
The authors would like to thank the Electronics Laboratory at INFN Sezione di Roma [17] for
technical support with the design, production and assembly of the test board used in this work.
This work was partially supported by the EU Framework Programme 7 project EURETILE
under grant number 247846.
References
[1] http://apegate.roma1.infn.it/APE
[2] Bodin F et al., The APENEXT project, Proceedings of Lattice2001 conference (2002), Nucl.Phys.Proc.Suppl.
106 173-176
[3] Ammendola R et al., Status of the apeNEXT project (2003), Proceedings of Lattice2002 conference,
Nucl.Phys.Proc.Suppl. 119 1038-1040.
[4] Belletti F et al., Computing for LQCD: apeNEXT (2006), Computing in Science & Engineering, 8 18-29
[doi:10.1109/MCSE.2006.4].
[5] Ammendola R, Guagnelli M, Mazza G, Palombi F, Petronzio R, Rossetti D, Rossetti A, Salamon A, Vicini
P, APENet: LQCD clusters a la APE (2005), Proceedings of Lattice2004, Nucl.Phys.B - Proc.Suppl. 140
826-828 [arXiv:hep-lat/0409071v1].
[6] Ammendola R, Petronzio R, Rossetti D, Salamon A, Tantalo N and Vicini P, Status of the APENet project,
Proceedings of Lattice2005 conference.
[7] Paolucci P S, Jerraya A A, Leupers R, Thiele L and Vicini P, Shapes: a tiled scalable software hardware
architecture platform for embedded systems (2006), Proceedings of the 4th international Conference on
Hardware/Software Codesign and System Synthesis CODES+ISSS ’06 (Seoul, Korea), ACM Press, 167-72.
[8] Paolucci P S, Vicini P et al., Introduction to the Tiled HW Architecture of SHAPES (2007), Proc. Design,
Automation and Test in Europe (DATE’07) 1, 77-82.
[9] Ammendola R et al., APENet+: a 3D toroidal network enabling petaFLOPS scale Lattice QCD simulations
on commodity clusters, to appear in proceedings of XXVIIIth International Symposium on Lattice Field
Theory, 14-19 June 2010 Villasimius, Sardinia, Italy
[10] Dally W J and Seitz C L, Deadlock-Free Message Routing in Multiprocessor Interconnection Networks
(1987), IEEE Trans. Comput. 36, 547-53
[11] Ammendola R et al., High Speed Data Transfer with FPGAs and QSFP+ modules, Journal of
Instrumentation (JINST)
[12] http://www.samtec.com/ProductInformation/TechnicalSpecifications/Overview.aspx?series=SEAF
[13] Ammendola R et al., Mastering Multi-GPU Computing on a Torus Network, poster at NVidia GPU
Technology Conference 2010, San Jose (CA)
[14] NVIDIA Corporation, NVIDIA CUDA Programming Guide, 2010;
http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf
[15] Boppana Rajendra V, Fault-Tolerant Communication with Partitioned Dimension-Order Routers (1999),
IEEE Trans. Parallel Distrib. Syst. 10, 1026-39
[16] Chalasani Suresh and Boppana Rajendra V, Fault-Tolerant Wormhole Routing in Tori (1994), Proceedings
of 8th international conference on Supercomputing, 146-55
[17] http://maclabe.roma1.infn.it
