To reach Petaflops-scale performance, interconnect technology will need to be improved considerably over what it is today. In this report, we explore one possible design for such an interconnect. The guiding principle in this design is the optimization of all components for the finiteness of the speed of light.
To achieve a linear speedup in time over well-tested supercomputers of today's designs will require scaling up of processor power and bandwidth and scaling down of latency. Latency scaling is the most challenging: it requires a 100 ns user-to-user latency for messages traveling the full diameter of the machine. Meeting this constraint requires simultaneously minimizing wire length through 3D packaging, new low-latency electrical signaling mechanisms, extremely fast routers, and new network interfaces. In this report, we outline approaches and implementations that will meet these requirements when implemented as a system. No technology breakthroughs are required.
Introduction
This paper proposes an architecture for a supercomputer interconnection network that will run nearly as fast as the speed of light will permit. As background, Moore's law has increased transistor speed exponentially over the last several decades, yet the speed of light does not change. While speed-of-light delay contributes only around 1% of the latency in the main interconnect of today's supercomputers, the exponential growth of this proportion will make it a critical issue in the next decade or two. Theoreticians [Preparata, Vitanyi] have explored supercomputer design in this realm (specifically networks), developing a theoretical foundation and concluding most sharply that a 3D mesh interconnect is the only way to go.

This is a broader solution space than usual. Most work in this area looks at incremental changes to existing products. For example, serial communications links were designed into supercomputer networks years ago, and most work in supercomputer interconnect now seeks to upgrade serial links to the latest technology, such as Infiniband. Unfortunately, this approach rules out the possibility that serial links may not be the right choice if one were to start from scratch. In this paper, we will consider changing any component that would bound performance away from the limits dictated by the speed of light.

This paper focuses on the network portion of a Petaflops supercomputer that might go into service in 2010. Sandia and the DOE community tend to build supercomputers from a large number of commodity components and at most a few custom ones. While small in number, the custom components have the longest lead times and involve the most Government intervention. This paper studies the interconnection network seven years in advance of a proposed deployment, with the expectation that the remaining components could be filled in later from commodity parts with shorter lead times.
This report specifically excludes considering the computational engines in a supercomputer (the issue is too controversial for now), but the network proposed could be used in the two principal designs under consideration:
1. The left side of figure 1 shows a System On Chip (SOC) Processor In Memory (PIM) design. In this approach, a supercomputer would be constructed of one custom chip containing a processor and network interface. The supercomputer "node" would comprise this chip and some additional memory. To meet conventional balance requirements, the bandwidth to the external memory should be about the same as the bandwidth to each network interface. (The diagram shows six network interfaces for illustration.) Since I/O bandwidth is a limiting factor for chips in this technology node, the surface of the chip is shown divided among the interfaces according to bandwidth.
2. The right side of figure 1 shows the more conventional discrete implementation of a Massively Parallel Processor (MPP) node. In this approach, a conventional microprocessor chip interfaces to a network chip and DRAM. To meet conventional balance requirements, the bandwidth of the network interface chip's interface to the microprocessor (and DRAM) should be about the same as the bandwidth to each network interface. This creates the same bandwidth and pin allocation as the SOC PIM approach.
In our view, both the SOC PIM and discrete approaches are candidates for the 2010 time frame (the SOC PIM probably yielding a more efficient design but with higher development costs). We will "hedge our bets" by proceeding in this report in a manner largely compatible with both approaches. However, we will use the SOC PIM for illustration.
Other issues addressed in this report include:
• A compact but serviceable 3D physical design compatible with both water and air cooling
• Dynamic phase alignment as a low-latency alternative to clock synchronization
• A DC balancing method with zero latency
• Deadlock-free routing
• A router optimized for the most common route: straight-ahead routing
• A router design with three lanes but the performance of a full crossbar
• Virtual channels and a generalized "protocol engine" for message processing in a couple of clock cycles
• Support for both shared memory and message passing communications semantics
• Error detection and recovery capable of mitigating a failing router chip
The assumed parameters for the Petaflops supercomputer are given in table 1.
Primary Chip Design
Each node will consist of a main Application Specific Integrated Circuit (ASIC) and extra memory. For a SOC PIM implementation, the approximate floor plan of the ASIC is shown in figure 2: a network interface around the periphery, with a central section containing some number of microprocessor cores (μP) and their associated memory.
The choice of 16 microprocessor cores is for illustration purposes only; a discrete implementation would not have on-chip microprocessors or RAM at all. Figure 3 shows the proposed placement of the interconnections to and from the ASIC. As justified below, the network topology will be a 3D mesh and will therefore have six bidirectional links. Since the bandwidth of each interconnection link will be approximately equal to the memory bandwidth according to balance factors, the chip area is divided into seven more-or-less equal sections.
According to ITRS [ITRS] projections, the maximum pin count in 2010 will be 4009. Reserving 33% for power and ground leaves about 2,690 signal pins, or roughly 383 pins in each of the seven groups. This design requires 360 pins per group.
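The pin arithmetic, restated explicitly as a check:

$$4009 \times (1 - 0.33) \approx 2686 \ \text{signal pins}, \qquad 2686 / 7 \approx 383 \ge 360 \ \text{required}.$$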
Strategies for Low Latency
There are just a few degrees of freedom available to minimize latency: faster signals, shorter distances, less "hop" delay, fewer hops, and low overhead in the interface to user programs. We will address each of these in turn, starting with signal speed and propagation distance.
Physical Data Transmission
We will use electrical signal transmission. Free-space optics is the technology closest to maturity that can get within a few percent of the speed of light (c). However, our judgment is that free-space optics will be quite expensive in the 2010 timeframe, if it is available at all. The next best options are electrical waveguides (wires) and optical fibers, both with a propagation speed of around 0.7c. Since 0.7c meets the design goals and the technology is readily available, we propose to use electrical transmission over wires.
Node Design and 3D Packaging
The obvious way to reduce wire length is to use a physically compact design where signals travel a path approximating a straight line between source and destination. Three-dimensional mesh networks meet these requirements and are proposed here. Three-dimensional mesh networks have been studied in the research community for some time, including the M-Machine [Filo], Blue Gene [Denneau], and others. While these designs worked in their time, logic was then still much slower than signal propagation, so there was little reason to develop them further. Since technology has progressed in the interim, and the object of this paper is to explore low-latency designs, we will investigate three-dimensional packaging.

Figure 4 shows the layout of a two-node circuit board. Each of the primary ASICs is paired with a group of DRAM DIMMs mounted underneath the circuit board. Each circuit board comprises two ASIC/memory pairs and an associated power conversion module. Circuit boards can be strung end-to-end to create linear structures (similar to the shish-kebab packaging of Blue Gene).

Figure 5 illustrates the proposed 3D packaging method. The linear board structures of figure 4 are connected along their sides into a three-dimensional mesh. The circuit boards will require connectors capable of connecting along their edges, similar to Intercon shuttle connectors (used in the Cray T3E and X1). The interconnection network will be a 3D mesh mapped directly onto this 3D structure. Figure 6 shows how the mesh interconnection wires flow between the primary ASICs to form the interconnection network. The key concept is to provide a structure where all interconnection wires are of fixed length, thereby assuring fixed speed-of-light delays irrespective of the size of the machine.
Physical Size and Overall Latency
Figure 7a illustrates our view of a potential machine geometry that offers the best performance with workable technology. The structure of figure 5 would be implemented as a 27 x 16 x 24 mesh (the same node count as Red Storm, but at 100 GFLOPS/node) in a compact package of 5' x 7' x 12'. The linear structures of figure 4 would be 5' long and oriented along that dimension. The volume budget for each node would be 70 cubic inches, corresponding to a board size in figure 4 of 3.5" x 10" with about 3.5" spacing between boards. Removing heat from such a structure would require water cooling in the channels of the structure in figure 5.
A target of 100 ns cross-machine latency should be feasible for the structure in figure 7a , with the latency budget illustrated in the figure.
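As an illustrative decomposition of that budget (our arithmetic, not taken from the figure): the farthest route covers a Manhattan distance of about 5 + 7 + 12 = 24 feet of wire and roughly 27 + 16 + 24 ≈ 67 router hops, so

$$t_{wire} \approx \frac{24\ \mathrm{ft}}{0.7 \times 0.984\ \mathrm{ft/ns}} \approx 35\ \mathrm{ns}, \qquad t_{hop} \lesssim \frac{100\ \mathrm{ns} - 35\ \mathrm{ns}}{67} \approx 1\ \mathrm{ns},$$

leaving about a nanosecond of budget per router hop, consistent with the fast cut-through routers proposed below.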
An air-cooled configuration is possible as well, as shown in figure 7b. The physical design would be quite different because air has much less heat capacity than water and because air cooling presumes that people will be working in the coolant air. The candidate design shown has a room pressurized by cooling air, with the structure of figure 5 protruding through the room's wall. People working on the "cold" side would have access to the machine for servicing. With a full-machine power dissipation of 1.5 MW, the configuration in figures 7b and 7c, moving air at 20 fps, would heat the air from 70° F to 97° F (a rise of 15° C).
Serviceability
To enable the computer to be serviced, the system will consist of field replaceable units that can be withdrawn from the system from one side, as illustrated in figure 8 .
Electrical Issues
A review of existing supercomputer interconnects indicates that a major source of latency is repeated shifting of data between clock domains. In previous and existing designs, data is sent serially with the clock embedded in the data. To achieve higher bandwidth, many serial lines are run in parallel, but each serial line still carries its own clock. As far as we can tell, this is an artifact of the architectural history of supercomputers and not necessarily a good choice at this point in the technology. We propose instead a timing approach called "dynamic phase alignment," illustrated in figure 9. The entire Petaflops supercomputer will have a single global clock, distributed to all boards by an especially engineered low-jitter distribution network. We will assume this clock is distributed at 10 GHz. Thus, every chip is guaranteed to have a clock of exactly the same frequency as every other chip, but of random phase relationship and with various sources of superimposed jitter and drift.

All communications links associated with the network will pass through a programmable delay line before being clocked into a standard flip-flop. To a first approximation, the delay lines will be set statically to the phase difference between the clock entering the chip and the external data signal. This could be accomplished with a delay line tapped at a couple dozen locations and selected with a multiplexer.
The proposed adjustment range of the delay line is given by the equation below. Two factors contribute: (a) the adjustment range must be at least one clock period, to accommodate the random phase relationship between the data signal and the global clock as it appears on the chip; and (b) additional range is needed to cover wire-length differences, jitter, and margin:

$$T_{range} = T_{clk} + T_{\Delta wire} + T_{jitter} + T_{margin}$$

where $T_{range}$ is the adjustment range, $T_{clk}$ is the system clock period (100 ps), $T_{\Delta wire}$ is the transit time difference between data wires, $T_{jitter}$ covers other sources of jitter, and $T_{margin}$ is a timing margin. This form of clock latency adjustment adds at most one clock period to whatever delay is in the wire.
Completely static settings of the delay lines are unlikely to work at the highest frequencies due to jitter. Specifically, the relative phase between the clock and data at any chip will vary due to factors like:
1. Jitter caused by power supply noise
2. Drift caused by cables expanding and contracting with temperature, or by changes in the dielectric constant of transmission lines that change the propagation velocity
3. Intersymbol jitter on data lines (clock distribution does not have this jitter because there are no symbols)
Jitter and drift from sources 1 and 2 can be accommodated by logic that adjusts the delay lines by monitoring the placement of transitions at a relatively slow rate (<1 kHz). Jitter source 3 will need to be minimized, and may define the limits of this method.
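As a rough illustration of the slow tracking loop (a minimal sketch under assumed parameters; the tap count and edge-monitoring scheme are ours, not a specification):

```python
# Minimal sketch of dynamic phase alignment tracking (assumed parameters).
# Each link has a tapped delay line; we nudge the selected tap so the
# sampling clock stays centered in the data eye. The loop runs at a slow
# rate (< 1 kHz), so it tracks drift and low-frequency jitter only.

NUM_TAPS = 24          # "a couple dozen" taps spanning about one clock period

def update_tap(tap: int, early_edges: int, late_edges: int) -> int:
    """Move the tap one step toward the side with fewer observed edges.

    early_edges / late_edges are counts of data transitions detected in
    guard windows just before / after the sampling point, accumulated by
    hardware monitors since the last update.
    """
    if early_edges > late_edges:
        tap = min(tap + 1, NUM_TAPS - 1)   # eye center is later: delay more
    elif late_edges > early_edges:
        tap = max(tap - 1, 0)              # eye center is earlier: delay less
    return tap                             # balanced: already centered
```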
Bit Encoding
The conventional method, shown in figure 11 (Conventional Bit Encoding for DC Balance), has unnecessary latency. Specifically, a 10-bit DC-balanced code word is transmitted serially over each data line; the code words are shifted into a 10-bit register over a period of 10 clocks, after which a 10-to-8 converter translates the code word in parallel into a non-DC-balanced 8-bit byte.
The entire process has a latency of 11 clock cycles between the time the first bit arrives on the wire and when the logic can make its first decision based on its contents.
We propose a method with significantly lower latency. The idea behind bit encoding for DC balance is to pick code words with the same number of 1s and 0s. There is no particular reason why the first, second, or any other particular bit needs to be constrained to achieve this balance. We therefore propose a code word in which the first bit of each serial stream is a plain data bit, shifting the DC balance constraint to the second and later bits.
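As an illustrative check that such codes exist (the word length is chosen to match the conventional 10-bit scheme; the enumeration is ours), one can count balanced 10-bit words for each value of a free first bit:

```python
from itertools import product

# Count 10-bit words that are DC balanced (five 1s, five 0s) for each
# value of the first bit. The first bit can carry plain data because the
# remaining nine bits can always restore the balance.
def balanced_words(first_bit: int, length: int = 10):
    ones_needed = length // 2 - first_bit   # 1s still needed after bit 0
    return [(first_bit,) + rest
            for rest in product((0, 1), repeat=length - 1)
            if sum(rest) == ones_needed]

print(len(balanced_words(0)), len(balanced_words(1)))  # 126 and 126
```

Either value of the first bit leaves 126 balanced completions, so the first bit costs nothing in code space.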
We also propose a method by which the router can alter data in the message while still retaining DC balance. To avoid routing tables, each message will contain the routing path (source-based routing). However, each router will need to know where it is in the route. While there are several ways to do this, we propose that each message carry a 7-bit field containing the "state" of the message along the route. As the message flows through each router, this field gets altered, which changes the DC balance of every bit line that is switched. To compensate, we propose that the state field be immediately followed by its complement: the router is free to modify the state field as it chooses, as long as it takes responsibility for injecting the complement in the subsequent bits.
The proposed packet format is shown below. The interconnect consists of 80-bit parallel busses, which defines the packet width. The 80 bits comprising the first bit on each signal line contain the information necessary to route the packet, as well as some data. The route is defined by up to six segments, each of up to 16 "hops"; Dir n and Dist n give the direction and distance of the n'th segment. To enable the router to know how far through the route a message has gotten, a state field indicates which of the route segments (Route<0:2>) is currently being traversed and how many hops down that segment the message had already traveled as it entered the router (HowFar<3:6>). The state field is followed by its complement.
The intent of this format is to enable logic to make a fast routing decision after looking at only the first bit of a packet.
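A sketch of this header layout (the Route and HowFar widths come from the text; the per-segment direction/distance encoding shown is our assumption):

```python
from dataclasses import dataclass

# Sketch of the routing header carried in the first bit of each of the
# 80 signal lines. Route<0:2> and HowFar<3:6> widths come from the text;
# the per-segment encoding (direction, distance < 16) is assumed.
@dataclass
class RouteHeader:
    route: int      # 3 bits: which of up to 6 segments is being traversed
    how_far: int    # 4 bits: hops already traveled within that segment
    segments: list  # up to 6 (direction, distance) pairs, distance < 16

    def state_bits(self) -> int:
        return (self.route << 4) | self.how_far        # 7-bit state field

    def state_with_complement(self) -> int:
        s = self.state_bits()
        return (s << 7) | (~s & 0x7F)  # state followed by its complement
```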
Routing
We propose to use a routing schema based on the "turn model" [Glass], very similar to that of the ASCI Red system at Sandia [Mattson]. We propose non-adaptive routes in three dimensions based on a three-dimensional turn model, but compatible with XYZ dimension-ordered routing.

Experience indicates that dimension-ordered routing works very well in terms of minimal path and load balance. We therefore propose to use XYZ dimension-ordered routing wherever this route is available.

Faults will make dimension-ordered routing infeasible for messages "near" the fault. To permit the machine to keep operating with only localized performance degradation, some other static route compatible with a three-dimensional turn model strategy will be used instead. For this approach to work, XYZ dimension-ordered routing must be a legitimate subset of the turn model strategy, as sketched below.
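A minimal sketch of a turn check of this kind, assuming plain XYZ dimension order (a fault-tolerant turn set per [Glass] would extend the table while preserving the no-cycle property):

```python
# Directions: +x/-x, +y/-y, +z/-z. XYZ dimension order permits turns only
# from a lower dimension to a higher one (x->y, x->z, y->z), so routes
# never turn "backward" in dimension order and cannot form cycles.
DIM = {'+x': 0, '-x': 0, '+y': 1, '-y': 1, '+z': 2, '-z': 2}

def legal_turn(incoming: str, outgoing: str) -> bool:
    """True if the turn is allowed under XYZ dimension order."""
    if incoming == outgoing:                  # straight ahead: always legal
        return True
    return DIM[outgoing] > DIM[incoming]      # may only ascend in dimension
```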
Router Chip Architecture
We propose to lay out the router to provide the best performance for the most common case, filling in the rest of the design in a second stage. Most of the time, messages go through the router in a straight line, changing direction only a few times in an entire route. The router design in figure 13 therefore has a fast and direct pathway for messages continuing straight ahead.

The router chip will have physically compact cut-through logic to handle data flows in each of the six directions (only four are shown in figure 13). For example, the highlighted cut-through block, optimized for messages flowing right with no turns, handles messages arriving from the left and continuing to the right. The purpose of each such cut-through block is to make a one-cycle decision as to whether an incoming message can be "cut through" to the output link right away or must be routed to the turn and queuing logic elsewhere on the chip. This should be feasible in a short time because the only information needed is the first address field and a flag indicating whether the output buffer is currently available.

Figure 14 illustrates the cut-through logic in detail. In one setting, the incoming network connection is "cut through" to the network output connection. In the other setting, the turn and queuing logic supplies data to the network output while the incoming network data is routed elsewhere for storage. This calls for a 2 × 2 switching element.
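In pseudocode terms, the one-cycle decision reduces to a two-input test (the signal names are ours; in hardware this would be a handful of gates driving the 2 × 2 switch):

```python
# One-cycle cut-through decision for the "continue straight" block.
# Inputs available in the first cycle: the leading address bits of the
# packet and one status flag. Everything else goes to turn/queue logic.
def switch_setting(continuing_straight: bool, output_free: bool) -> str:
    # 'cut'  : input port -> output port directly (one-cycle path)
    # 'turn' : input port -> turn/queuing logic; output fed from queues
    return 'cut' if (continuing_straight and output_free) else 'turn'
```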
We propose an internal router design that improves on the common designs widely studied in the literature. It is widely assumed that a wormhole router must include an internal crossbar to switch data between input and output ports, and that the logical complexity and time cost of this crossbar are substantial [Chien]. However, figure 15 shows a better way. This diagram shows the data paths necessary to implement dimension-ordered routing as an abstract 2D layout. One can see all the connections needed, and they are far fewer than a crossbar. The diagram shows the six external connections and connections to the local processor at the center. The circular arrows at the outer boundary are the cut-through logic described in figure 13. The interconnections within the hexagonal shapes implement the "turns" used in XYZ dimension-ordered routing, which requires only the turns X→Y, X→Z, and Y→Z, plus any dimension to and from the local processor.

Figure 16 illustrates the routing logic corresponding to one channel of the circular placement illustrated above. According to figure 15, three data pathways circling the chip will be sufficient if they can carry data in either direction and can be "broken" at various points. The three routing lanes in the diagram have these properties, based on the settings of the switches and other configuration parameters:

• Incoming data goes to an input queue and subsequently to the output queue, any of the routing lanes, or the local processor.
• Data from the local processor, any routing lane, or an input queue can go to the output queue.
• Data is stored in the recovery RAM just before it is put on an output wire.
To demonstrate that the wiring overhead is manageable, figure 17 illustrates the wiring channels and chip size to scale. According to the ITRS 2002 update, global wiring will pack on a 205 nanometer (nm) pitch at the 45 nm technology node. If the 80-bit data pathways running externally at 4× the clock are passed across the chip at a 1× clock rate, there will be 320 conductors occupying a channel 205 × 320 = 65,600 nm wide. No more than three channels will be needed around the circumference of the chip, or just short of 200,000 nm (200 μm, or 0.2 mm). The scale diagram shows a 1 cm × 1 cm chip with the three channels totaling 0.2 mm around the edge. Also according to the 2002 ITRS, repeaters should be installed every 54 μm for optimal propagation speed. This corresponds to about 200 repeaters across each chip edge, for a delay of about 400τ (τ is the characteristic RC time delay for a minimum-size gate), or 0.6 ns. As can be seen in the figure, there is plenty of space left for the processor.
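Restating the channel arithmetic as a check:

$$320 \times 205\ \mathrm{nm} = 65{,}600\ \mathrm{nm} \approx 66\ \mu\mathrm{m}, \qquad 3 \times 66\ \mu\mathrm{m} \approx 0.2\ \mathrm{mm}, \qquad \frac{10^4\ \mu\mathrm{m}}{54\ \mu\mathrm{m}} \approx 185 \approx 200\ \text{repeaters per edge}.$$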
XYZ routing is not sufficient in the presence of faults. We therefore propose to create linkages between the pathways shown above to permit arbitrary routing patterns, but with a speed penalty due to the sharing of internal busses.
Figure 17: Scale diagram of a 1 cm × 1 cm die with three 320-conductor routing channels around the edge; 200 repeaters at 54 μm intervals give about 0.6 ns of delay.
Virtual Channels
Virtual channels are an esoteric but well-studied concept in networking. We will first explain the concept and then propose some general principles for their use in this machine. Figure 18 illustrates the virtual channel concept. The "wormhole" data communications in a supercomputer include flow control that makes each channel behave like a garden hose: data available to flow through a channel can become blocked when the receiver is unable to handle it, much as a garden hose backs up when the nozzle is shut off. In this case, the data waits until the receiver turns the flow back on.
However, a physical channel can contain two or more virtual channels, each with independent flow control. This means one channel can be blocked while another is not, and so forth. Figure 19 is a simplified illustration of how virtual channels support this design, including the concepts of deadlock and protocol handling. Figure 19 illustrates the flow of messages between three nodes, shown as horizontal regions designated 0, 1, and 2. Messages flow between nodes via the diagonal communications channels and buffers in the direction of the arrows. Deadlock could occur if there were a circular dependency in the data flows between buffers, such as the one illustrated by the wavy line (including the symbol on the illegal pathway that would be required to complete the cycle). We have drawn the diagram with all diagonal buffers pointing rightward. This rightward motion makes figure 19 an acyclic graph; as a result, a machine built this way would never incur deadlock.
The proposed network for this machine will not have deadlock-enabling cycles due to the nature of the turns permitted in table 2. Thus one could in principle draw a diagram like figure 19 for this machine. However, the diagram would have over a quarter million buffers and would not be practical to draw.
To assure proper operation at speed, we propose special support in the network for the protocols that underlie the operation of supercomputers, such as shared memory and message passing. Both protocols have a step where a message is received (either a shared memory address or a message data packet) and is followed immediately by a response (a memory data return or a data acknowledgement). Without virtual channels, the immediate response creates a leftward-flowing buffer dependency, a cycle, and the possibility of deadlock.
The virtual channels in figure 19 permit efficient protocol handling while avoiding deadlock. The strategy is to put the initial message of each pair on one virtual channel and the response message on the other. As one can see from figure 19, the entire flow of a request through the first virtual network, the generation of the response, and the flow of the response through the second virtual network create only rightward-flowing buffer dependencies through the entire diagram.
We are therefore proposing two sets of virtual channels for requests and responses in a request-response protocol. This design is well known for distributed shared memory systems and there is extensive information on its performance. However, this approach is not widely used in message-based systems.
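A minimal sketch of this channel discipline (the message types named here are illustrative; the two-virtual-network idea is from the text):

```python
# Two virtual channels per physical link: requests ride VC0, responses
# ride VC1. Because a response never generates further traffic, each
# virtual network's buffer dependencies remain acyclic.
REQUESTS  = {'mem_read', 'mem_write', 'msg_send'}   # initial messages
RESPONSES = {'mem_data', 'write_ack', 'msg_ack'}    # immediate replies

def virtual_channel(msg_type: str) -> int:
    if msg_type in REQUESTS:
        return 0
    if msg_type in RESPONSES:
        return 1
    raise ValueError(f"unknown message type: {msg_type}")
```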
Protocol Engine and NIC
To meet speed requirements while maintaining sufficient functionality, the network interface will need substantial improvement over those in common use today. Today's distributed shared memory systems (such as X1) are engineered for low latency and may serve as a model in this respect. However, today's software stacks for MPI, Portals, and other message passing protocols have functionality enhancements that permit them to scale to enormous sizes. We therefore propose to merge these two approaches into a protocol engine that can achieve both low latency and broad functionality.
The architecture of the protocol engine [DeBenedictis] is shown in figure 20, followed by table 3 showing how the interface can be applied to various communications protocols.
The protocol engine is a hardware device operating at the clock rate of the main CPU chip. Unlike a CPU, the protocol engine operates on incoming messages, updating protocol state according to a protocol state transition table and transmitting output messages. The protocol engine can also read and write main memory (for successive bytes of long messages, etc.) and interface to user programs in other ways.
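To make the table-driven operation concrete, here is a minimal software sketch (the states, message types, and actions are invented for illustration; table 3 defines the actual protocols):

```python
# Table-driven protocol engine sketch: each (state, incoming message type)
# pair maps to (next state, list of actions). Actions might write memory,
# emit a response message, or notify a user program.
TRANSITIONS = {
    ('idle',      'mem_read'): ('idle',      ['read_memory', 'send:mem_data']),
    ('idle',      'msg_send'): ('receiving', ['alloc_buffer', 'send:msg_ack']),
    ('receiving', 'msg_body'): ('receiving', ['store_bytes']),
    ('receiving', 'msg_end'):  ('idle',      ['notify_user']),
}

def step(state: str, msg_type: str):
    """One protocol-engine cycle: look up the transition and fire actions."""
    next_state, actions = TRANSITIONS[(state, msg_type)]
    return next_state, actions
```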
The protocol engine can be applied to many common communications paradigms by setting up the protocol state and transition tables appropriately, as illustrated in table 3.

Error Detection and Recovery

We propose a two-hop error detection and recovery method, as illustrated in figure 21. In this method, each router chip (labeled "self") retains a copy of all transmitted messages (messages are green; copies are red) until the message has been successfully received two hops away, by its neighbor's neighbor. Figure 21 illustrates the Eastbound output of a router. Outgoing traffic from this output goes to the neighbor on the right, after which it is either consumed by that neighbor or relayed to one of five neighbor's neighbors (North, Up, East, South, or Down, but not back to the West via a 180° turn).
Copies of all Eastbound outgoing traffic are first stored in the East recovery memory until the message has been successfully received by one of the neighbor's neighbors. The message deletion mechanism involves two acknowledgement messages: a regular "ack" indicating that the message has been successfully received by the neighbor, and an "ack2" routed one additional hop upstream indicating that the message has been successfully received by the neighbor's neighbor.
However, data consumed by the neighbor itself is retained only until the neighbor acknowledges receipt (not illustrated). While the recovery memory substantially increases storage requirements in the router, it saves memory elsewhere. Without the recovery memory, fault recovery would have to come about through a different mechanism; typically, the sending processor keeps a copy of the data in its own memory until it is assured the data has arrived at the final destination. These memories are often enormous, to accommodate odd data patterns and network congestion. With the recovery memory, no memory is needed for end-to-end data acknowledgement and retransmission.
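A minimal sketch of the retention rule described above (message IDs and the dictionary representation are our illustrative choices):

```python
# Recovery-memory bookkeeping sketch: keep a copy of each forwarded
# message until the appropriate acknowledgement arrives. Messages the
# neighbor consumed locally need only a one-hop 'ack'; messages the
# neighbor relayed onward need the two-hop 'ack2'.
recovery_ram: dict[int, bytes] = {}

def on_transmit(msg_id: int, payload: bytes) -> None:
    recovery_ram[msg_id] = payload          # retain until acknowledged

def on_ack(msg_id: int, consumed_by_neighbor: bool) -> None:
    if consumed_by_neighbor:
        recovery_ram.pop(msg_id, None)      # one-hop ack suffices

def on_ack2(msg_id: int) -> None:
    recovery_ram.pop(msg_id, None)          # neighbor's neighbor has it
```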
We anticipate that the recovery memory would be used to mitigate transient data loss, but this is beyond the scope of this document.
Deadlock Avoidance in Event of Faults
To achieve the lowest latency, we propose a method where data packets are routed immediately upon receipt, before the error detection codes have been validated. If an error occurs in an address or control field, the packet may be sent to the wrong destination. To prevent deadlock, we propose that each router have a changeable list of allowable "turns" derived from the "turn model" of routing. A message requesting a disallowed turn will be discarded, which prevents deadlock.
The "turn model" is a mathematical treatment of message routing. In this treatment, the routing designer creates a set of allowable turns (such as east-tonorth, north-to-west, etc.) that obey a set of mathematical properties. The turn model then guarantees no deadlocks as long as all messages follow allowable turns.
We propose to use this method for fault mitigation. While the system will generate routes that obey the allowable turns (and also follow the shortest path and avoid hotspots), the routing information could become corrupted and a message could attempt an illegal turn. To prevent this, the hardware performs a second, on-the-fly routing check and discards any message before it can make a wrong turn.
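In software terms, this check is the turn test sketched earlier applied per packet, with discard as the failure action (again a sketch; real hardware would be a small table lookup):

```python
# On-the-fly safety check: even if a packet's route field is corrupted,
# it is discarded rather than allowed to make a turn outside the table.
# 'allowed' is the router's changeable set of (incoming, outgoing) turns.
def forward_or_discard(incoming: str, requested: str, allowed: set) -> bool:
    """Return True to forward; False means the packet is dropped
    (recovery then happens via the recovery-memory mechanism above)."""
    return incoming == requested or (incoming, requested) in allowed
```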
Error recovery method:
1. System detects a malfunction.
2. System goes into a diagnostic mode where user programs are halted and the real time clock stops incrementing (so user programs really and truly will not know they were halted).
3. System reconfigures, aborting jobs that can't continue.
4. In-transit messages move from the recovery RAM to output queues. Duplicated messages may need to be identified.
Reliability Standard
For hard failure of one chip, the system will continue to run, although an application with state on the failing chip will abort. Failed chips will be deconfigured from the system for hot swap replacement in FRUs (that may contain multiple nodes). A system with failed nodes is permitted to have degraded communications performance.
For hard failure of up to 10 chips: same as above for 99% of failure patterns.
For hard failure of one link: System and all applications continue running. Communications performance may degrade.
For hard failure of up to 10 links: same as above for 99% of failure patterns.
Soft errors: The system can be engineered to tolerate all single-bit soft errors in memory and flip-flops (but not logic). The probability of unmitigated multiple-bit soft errors is to be sufficiently low that there will be less than one undetected error in the 5-year lifespan of the machine. The soft error rate is to be computed based on the altitude of Los Alamos.
Conclusions
We have outlined a strategy to achieve 100 ns user-to-user latency at Petaflops scale, with sufficient bandwidth to balance CPU rates. The strategy uses a three-dimensional packaging structure that maps the signal flow within the network onto the three-dimensional structure of the machine room. As a consequence, latency in this network is within a constant factor of optimal as determined by the speed of light. We also proposed a packaging and cooling structure for the network and associated computational elements that would be sufficient at very large scales.
