ABSTRACT
INTRODUCTION
Advances in semiconductor technology have allowed microprocessors to integrate more than a hundred million transistors on a single chip. The Alpha 21364 microprocessor uses 152 million transistors to integrate an aggressive Alpha 21264 processor core, a 1.75 megabyte second-level cache, cache coherence hardware, two memory controllers, and a multiprocessor router on a single die (Figure 1a). In the 0.18 µm bulk CMOS process, the 21364 will run at 1.2 GHz and provide 12.8 gigabytes/second of local memory bandwidth and 22.4 gigabytes/second of router bandwidth (Figure 2). This paper describes the Alpha 21364 network and router architectures. The Alpha 21364's tightly-coupled multiprocessor network connects up to 128 such processors¹ in a two-dimensional torus network (Figure 1b). Such a fully-configured 128-processor shared-memory system can support up to four terabytes of Rambus memory and hundreds of terabytes of disk storage.
Such an aggressive multiprocessor configuration allows us to support the massive computation and communication requirements of a variety of application domains, such as high-performance technical computing, database servers, web servers, and telecommunication applications. We designed the Alpha 21364 network architecture to meet the communication demands of these memory- and I/O-intensive applications. The novelty of the Alpha 21364's router architecture lies in its extremely low latency, enormous bandwidth, and support for directory-based cache coherence. The router offers extremely low latency because it operates at 1.2 GHz, which is the same clock speed as the processor core. The pin-to-pin latency within the router is 13 cycles or 10.8 nanoseconds. In comparison, the ASIC-based SGI Spider router runs at 100 MHz and offers a 40 nanosecond pin-to-pin latency [2]. Similarly, the Alpha 21364 offers an enormous amount of peak and sustained bandwidth. The 21364 router can sustain between 70% and 90% of the peak bandwidth of 22.4 gigabytes/second.
The 21364's router can offer such enormous bandwidth because of aggressive routing algorithms, carefully crafted distributed arbitration schemes, a large amount of on-chip buffering, and a fully-pipelined router implementation that allows an aggressive operational clock rate of 1.2 GHz. Finally, the network and router architectures have explicit support for directory-based cache coherence, such as separate virtual channels for different coherence protocol packet classes. This helps to avoid deadlocks and improves the performance of the 21364's coherence protocol. The rest of the paper is organized as follows. Section 2 briefly describes the 21364 network's packet classes. Section 3 and Section 4 describe the 21364's network and router architectures, respectively.

¹ The Alpha 21364 can be easily redesigned to support a much larger configuration.
NETWORK PACKET CLASSES
Network packets and flits are the basic units of data transfer in the 21364's network. A packet is a message transported across the network from one router to another and is composed of one or more flits. A flit is a portion of a packet transported in parallel on a single clock edge. The size of a flit is 39 bits, of which 32 bits are payload and 7 bits are per-flit ECC. Thus, each of the incoming and outgoing interprocessor ports (Figure 1b) ...

... Class (three flits). A processor generates Read IO packets to load data from I/O space.

Special Class (one or three flits). These are packets used by the network and coherence protocol. The special class includes Noop packets, which can carry buffer deallocation information between routers.

The packet header (one to three flits long) identifies the packet's class and function. The header also contains routing information for the packet (see Section 4.1), optionally the physical address of the cache block or of the data block in I/O space corresponding to this packet, and flow control information between neighboring routers. Besides the header, the Block Response and Write IO packets also contain 16 flits or 64 bytes of data. The 21364's coherence protocol and I/O devices use these packet classes to communicate between processors, memory, and I/O devices. A description of the 21364's directory-based coherence protocol can be found in Bannon et al. [6].
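The flit and packet sizes quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the 21364 design: the function name packet_bits and the example of a three-flit header followed by 16 data flits are assumptions made for illustration.

```python
# Flit geometry taken from the text: 32 payload bits plus 7 ECC bits per flit.
FLIT_PAYLOAD_BITS = 32
FLIT_ECC_BITS = 7
FLIT_BITS = FLIT_PAYLOAD_BITS + FLIT_ECC_BITS

# Data-carrying packets hold 16 data flits besides the header.
DATA_FLITS = 16
DATA_BYTES = DATA_FLITS * FLIT_PAYLOAD_BITS // 8


def packet_bits(header_flits, data_flits=0):
    """Return (payload_bits, wire_bits) for a packet (hypothetical helper)."""
    if not 1 <= header_flits <= 3:
        raise ValueError("the header is one to three flits long")
    flits = header_flits + data_flits
    return flits * FLIT_PAYLOAD_BITS, flits * FLIT_BITS


# Hypothetical example: a three-flit header followed by 16 data flits.
payload_bits, wire_bits = packet_bits(3, DATA_FLITS)
```

The difference between wire_bits and payload_bits is the per-flit ECC overhead, which is 7 bits for every 32 bits of payload.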
NETWORK ARCHITECTURE
The 21364 network is a two-dimensional torus (Figure 1b). In addition, the network can support limited configurations of imperfect tori, which can map out faulty routers in the network. This section discusses the 21364's virtual cut-through routing, adaptive routing algorithm, and deadlock avoidance techniques.
Virtual Cut-Through Routing
The 21364 uses virtual cut-through routing, in which flits of a packet proceed through multiple routers until the header flit gets blocked at a router. Then, all flits of the packet are buffered at the blocking router until the congestion clears. Subsequently, the packet is scheduled for delivery through the router to the next router and the same pattern repeats. To support virtual cut-through routing, the 21364's router provides buffer space for 316 packets (see Section 4.3).
Adaptive Routing
The 21364's network uses adaptive routing to maximize the sustained bandwidth. However, the adaptive routing algorithm is very simple, which enables a simpler implementation of the arbitration scheme compared to more elaborate fully-adaptive routing algorithms. In the 21364 scheme, packets adaptively route within the minimum rectangle. Given two points in a torus (in this case, the current router and the destination processor), one can draw four rectangles that contain these two points as their diagonally opposite vertices (Figure 3a). The minimum rectangle is the one with the minimum diagonal distance between the current router and the destination.
The adaptive routing algorithm picks one output port among a maximum of two output ports that a packet can route in at any router. Thus, each packet at the current router's input port and destined for a network output port has only two choices: either it can continue in the same dimension (e.g., North Input to South Output) or it can turn (e.g., North Input to East Output). This is because with every hop a packet reduces its Manhattan distance to its destination, which shrinks the minimum rectangle the packet is routing in. If the adaptive algorithm has a choice between both the available network output ports (i.e., neither of the output ports is congested), then it gives preference to the route that continues in the same dimension. This allows a source and destination pair of processors to maximize the bandwidth between them by allowing multiple packets to route on separate routes (Figure 3b and Figure 3c).
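The port selection described above can be sketched in a few lines. This is an illustrative model only, under stated assumptions: the port names E/W/N/S, the convention that East and North are the increasing coordinate directions, the tie-break toward the increasing direction when both ways around a ring are equally long, and the function names are all hypothetical and are not taken from the 21364 implementation.

```python
# Dimension served by each network output port (naming is an assumption).
DIMENSION = {"E": "x", "W": "x", "N": "y", "S": "y"}


def ring_step(cur, dst, size):
    """Return +1, -1, or 0: the shorter way around a ring of `size` nodes."""
    if cur == dst:
        return 0
    forward = (dst - cur) % size          # hops in the increasing direction
    return 1 if forward <= size - forward else -1


def candidate_ports(cur, dst, dims):
    """Output ports (at most two) that stay inside the minimum rectangle."""
    ports = []
    step_x = ring_step(cur[0], dst[0], dims[0])
    step_y = ring_step(cur[1], dst[1], dims[1])
    if step_x:
        ports.append("E" if step_x > 0 else "W")
    if step_y:
        ports.append("N" if step_y > 0 else "S")
    return ports


def choose_port(cur, dst, dims, travel_dim, congested=frozenset()):
    """Pick an uncongested port, preferring the dimension already in use."""
    free = [p for p in candidate_ports(cur, dst, dims) if p not in congested]
    if not free:
        return None                       # blocked, or already at destination
    for port in free:
        if DIMENSION[port] == travel_dim:
            return port                   # continue in the same dimension
    return free[0]                        # otherwise turn
```

For example, choose_port((0, 0), (3, 1), (8, 8), "y") keeps a packet that is travelling in the y dimension on the North port, and passing congested={"N"} makes the same packet turn East instead.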
Deadlock Avoidance Rules
Both coherence and adaptive routing protocols can introduce deadlocks in a network because of cyclic dependences created by these protocols. This section describes the 21364's deadlock avoidance rules.
Avoiding Deadlocks in the Coherence Protocol
The coherence protocol can introduce deadlocks due to cyclic dependence between different packet classes. For example, request packets can fill up a network and prevent block response packets from ever reaching their destinations. 
Avoiding Deadlocks in Adaptive Routing
Adaptive routing can generate two types of deadlocks, namely, intra-dimension and inter-dimension. Figure 4a and Figure 4b show examples of these two kinds of deadlocks. The 21364 breaks these two deadlocks using Jose Duato's theory [4], which states that adaptive routing will not deadlock a network as long as packets can drain via a deadlock-free path. The 21364 creates logically distinct adaptive and deadlock-free networks using virtual channels. Each of the virtual channels corresponding to a packet class, except the special class (described in Section 2), is further subdivided into three sets of virtual channels, namely, adaptive, VC0, and VC1. Thus, the 21364 has a total of 19 virtual channels (three each for the six non-special classes and one for the special class). The adaptive virtual channels form the adaptive network and have the bulk of a router's buffers associated with them. The VC0 and VC1 combination creates the deadlock-free network, which provides a guaranteed deadlock-free path from any source to any destination within the network. Thus, packets blocked in the adaptive channel can drain via VC0 and VC1.

The VC0 and VC1 virtual channels must be mapped carefully onto the physical links to create the deadlock-free network. The 21364 has separate rules to break deadlocks within a dimension and across dimensions. Within a dimension, the 21364 maps the VC0s and VC1s in such a way that there is at least one processor that dependence chains formed by VC0s do not cross. The same applies to VC1 mappings. This ensures that there is no cyclic dependence in a virtual channel within a dimension. The 21364 can choose among a variety of such virtual channel mappings because the virtual channel assignments are programmable at boot time. Perhaps the simplest scheme that satisfies the property stated above was proposed by Dally [5], in which all processors in a dimension are numbered incrementally.
Then, for all source and destination processors, we can make the following virtual channel assignments: if the source is less than the destination, that source-destination pair is assigned VC0. If the source is greater than the destination, then that pair is assigned VC1.
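The assignment rule just stated is small enough to write down directly. The sketch below is illustrative; the function name dally_vc and the four-processor example are assumptions, and the processor numbers are simply the incremental positions within one dimension.

```python
def dally_vc(source, destination):
    """Virtual channel for a source-destination pair within one dimension."""
    if source == destination:
        raise ValueError("no network hop is needed within this dimension")
    return "VC0" if source < destination else "VC1"


# Assignment table for a hypothetical dimension of four processors.
SIZE = 4
table = {
    (src, dst): dally_vc(src, dst)
    for src in range(SIZE)
    for dst in range(SIZE)
    if src != dst
}
```

Every pair lands on exactly one of the two deadlock-free channels, so the table splits the ordered pairs evenly between VC0 and VC1; how evenly the resulting traffic spreads over physical links is a separate question that depends on the routes taken.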
Unfortunately, in this scheme, the virtual channel to physical link assignments are not well-balanced (Figure 5), which can cause under-utilization of network link bandwidth under heavy load. In the 21364, we search for an optimal virtual channel to physical link assignment using a hill-climbing algorithm. This scheme does not incur any run-time overhead because we run the algorithm off-line and only once for a dimension with a specific size (ranging from two to 16 processors). This works for the 21364 because virtual cut-through routing buffers entire packets at a router, even though packets that are not blocked can span multiple routers at the same time. The 21364's ability to buffer an entire packet at a router removes dependence between consecutive routers, which allows packets to move from the deadlock-free VC0 and VC1 channels to the adaptive channel. Additionally, a 21364's choice of direction and virtual channel is independent of a packet's prior route in the network, which helps remove cyclic dependences among routers.
ROUTER ARCHITECTURE
The 21364's router has nine pipeline types based on the input and output ports. An input or an output port can be of three types: local port (cache and memory controller), interprocessor port (off-chip network), and I/O. Any type of input port can route packets to any type of output port, which leads to nine types of pipelines. Figure 6 shows two such pipeline types. In addition to the pipeline latency, there are a total of six cycles of synchronization delay, pad receiver and driver delay, and transport delay from the pins to the router and from the router back to the pins. Thus, the on-chip pin-to-pin latency from a network input to a network output is 13 cycles. At 1.2 GHz, this leads to a pin-to-pin latency of 10.8 nanoseconds.
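The latency figures above follow from simple arithmetic, sketched below. The variable and function names are illustrative, and the seven-cycle pipeline figure is an inference from the 13-cycle total minus the six cycles of synchronization, pad, and transport delay rather than a number stated in the text.

```python
ROUTER_CLOCK_HZ = 1.2e9        # router clock, from the text
PIN_TO_PIN_CYCLES = 13         # network input to network output, from the text
NON_PIPELINE_CYCLES = 6        # synchronization, pad, and transport delay


def cycles_to_ns(cycles, clock_hz=ROUTER_CLOCK_HZ):
    """Convert a cycle count at the given clock rate to nanoseconds."""
    return cycles / clock_hz * 1e9


# Inferred, not stated: cycles spent in the router pipeline proper.
pipeline_cycles = PIN_TO_PIN_CYCLES - NON_PIPELINE_CYCLES

# About 10.83 ns, which the text quotes as 10.8 ns.
pin_to_pin_ns = cycles_to_ns(PIN_TO_PIN_CYCLES)
```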
The network links that connect the different 21364 chips run at 0.8 GHz, which is 33% slower than the internal router clock. The 21364 chip runs synchronously with the outgoing links, but asynchronously with the incoming links. The 21364 sends its clock with the packet along the outgoing links. Such clock forwarding provides rapid transport of bits between connected 21364 chips and minimizes synchronization time between them.

The configuration table is programmable by software at boot time, which gives software the flexibility to optimize the desired routing for maximal performance and to map out faulty nodes in the network. The first flit of a packet entering from a local or I/O port accesses the configuration table and sets up most of the 16-bit routing information in a packet's header. These 16 bits are:
two bits for the east-west and north-south directions (positive or negative), eight bits for the destination coordinates along the two dimensions, one bit (used by incomplete torus networks) to indicate if the packet can route in the adaptive channel, one bit to indicate if the packet is an I/O packet, two bits to encode the virtual channel number (Adaptive, VC0, and VC1), and two reserved bits.

Each configuration table entry contains 24 bits that include the header's routing information (except the two bits that encode the virtual channel number), six access control bits, three bits to encode routing information for incomplete torus networks (with mapped-out nodes), and one bit of parity. The decode stage identifies the packet class, determines the virtual channel (by accessing the virtual channel table), computes the output port, and determines the deadlock-free direction. The decode phase also prepares the packet for subsequent operations in the pipeline.
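The field widths listed above add up to the stated 16 and 24 bits, which the following sketch checks while packing and unpacking a header word. The field names, the order of the fields within the word, and the pack_header and unpack_header helpers are assumptions made for illustration; the text gives only the widths.

```python
# (name, width) pairs; widths come from the text, names and order do not.
HEADER_FIELDS = [
    ("direction", 2),        # east-west and north-south sign bits
    ("destination", 8),      # destination coordinates in the two dimensions
    ("adaptive_ok", 1),      # may route in the adaptive channel
    ("is_io", 1),            # packet is an I/O packet
    ("virtual_channel", 2),  # Adaptive, VC0, or VC1
    ("reserved", 2),
]
HEADER_BITS = sum(width for _, width in HEADER_FIELDS)

# Table entry: header bits minus the virtual channel field, plus access
# control, incomplete-torus routing bits, and parity.
CONFIG_ENTRY_BITS = (HEADER_BITS - 2) + 6 + 3 + 1


def pack_header(values):
    """Pack a dict of field values into one integer (first field highest)."""
    word = 0
    for name, width in HEADER_FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack_header(word):
    """Inverse of pack_header: split an integer back into named fields."""
    values = {}
    for name, width in reversed(HEADER_FIELDS):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values
```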
Error Correction Code Manipulation
Each 32-bit flit of a 21364 network packet is protected by 7-bit ECC. The router checks ECC for every flit of a packet arriving through an interprocessor or an I/O port. The router regenerates ECC for every flit of a packet leaving through an interprocessor or an I/O output port (Figure 6). ECC regeneration is necessary particularly for the first flit of the packet because the router pipeline can modify the header before forwarding the packet. If the router pipeline detects a single-bit error, it corrects the error and reports it back to the operating system via an interrupt. However, it does not correct double-bit errors. Instead, if it detects a double-bit error, the 21364 alerts every reachable 21364 of the occurrence of such an error and enters into an error recovery mode.
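The behaviour described above (correct single-bit errors, detect but not correct double-bit errors) matches what a single-error-correcting, double-error-detecting code provides with 7 check bits over 32 data bits. The text does not say which code the 21364 uses, so the sketch below is an assumption: a textbook extended Hamming construction with six Hamming check bits at positions 1, 2, 4, 8, 16, and 32 and one overall parity bit at position 0. All names are hypothetical.

```python
N = 38                                                   # Hamming positions 1..38
DATA_POS = [p for p in range(1, N + 1) if p & (p - 1)]   # not powers of two
CHECK_POS = [1 << i for i in range(6)]                   # 1, 2, 4, 8, 16, 32


def _syndrome(code):
    """XOR of the positions (1..38) whose bit is set."""
    s = 0
    for p in range(1, N + 1):
        if code[p]:
            s ^= p
    return s


def encode(word):
    """Encode a 32-bit payload into a list of 39 bits."""
    code = [0] * (N + 1)
    for i, p in enumerate(DATA_POS):
        code[p] = (word >> i) & 1
    s = _syndrome(code)
    for i, p in enumerate(CHECK_POS):
        code[p] = (s >> i) & 1            # forces the syndrome to zero
    code[0] = sum(code) & 1               # overall parity: total becomes even
    return code


def decode(code):
    """Return (payload, status); payload is None when uncorrectable."""
    code = list(code)
    s = _syndrome(code)
    odd = sum(code) & 1
    if s == 0 and not odd:
        status = "ok"
    elif odd:
        if s > N:
            return None, "uncorrectable"
        code[s] ^= 1                      # s == 0 means the parity bit itself
        status = "corrected"
    else:
        return None, "double-bit error"
    word = sum(code[p] << i for i, p in enumerate(DATA_POS))
    return word, status
```

In this construction a single flipped bit leaves the overall parity odd and the syndrome points at the flipped position, while two flipped bits leave the parity even with a non-zero syndrome, which is how the two cases are told apart.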
Input Buffering
The 21364 router provides buffering only at each of the input ports. Table 1 shows the distribution of input buffers at each input port. Each input buffer can hold a complete packet of the specific packet class, except for local ports for which packet payloads reside in the 21364's internal buffers. The interprocessor ports are subdivided into adaptive and deadlock-free channels, whereas the local ports have a single monolithic buffer space. The router has a total of 316 packet buffers. Each input port has an entry table that holds the in-flight status for each packet and input buffers that hold the packets. The first flit of a packet writes the corresponding entry table entry in the DW stage (Figure 6). An entry table entry contains a valid bit, bits for the target output ports (Figure 2), bits to indicate if the packet can adapt and/or route in the adaptive channels, bits supporting the anti-starvation algorithm², and other miscellaneous information. This information is used by the readiness tests in the LA phase of arbitration and read in the RE phase to decide the routing path of each packet (Section 4.4). Flits are written to and read from the packet buffers in the WrQ (Write Input Queue) and RQ (Read Input Queue) stages, respectively, after the scheduling pipeline has made the routing decision for the first flit. Either the previous 21364 router in a packet's path or the cache, memory controller, or I/O chip where the packet originated controls the allocation of input buffers at a router. Thus, each resource delivering a packet to the router knows the number of occupied packet buffers in the next hop. When a router deallocates a packet buffer, it sends the deallocation information to the previous router or I/O chip via Noop packets (Section 2) or by piggybacking the deallocation information on packets routed for the previous router or I/O port.
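The sender-side buffer accounting described above resembles credit-based flow control, and can be modelled as follows. This is a minimal sketch under assumptions: the class name NextHopCredits, its methods, and the per-channel buffer counts in the example are hypothetical and are not the values from Table 1.

```python
class NextHopCredits:
    """Sender-side view of the free input buffers in the next hop."""

    def __init__(self, buffers_per_channel):
        self.capacity = dict(buffers_per_channel)
        self.free = dict(buffers_per_channel)

    def can_send(self, channel):
        """True if the next hop has a free buffer in this virtual channel."""
        return self.free[channel] > 0

    def send(self, channel):
        """Allocate one next-hop buffer when a packet is dispatched."""
        if not self.can_send(channel):
            raise RuntimeError(f"no free buffer in {channel}")
        self.free[channel] -= 1

    def deallocate(self, channel):
        """Apply deallocation information received from the next hop."""
        if self.free[channel] >= self.capacity[channel]:
            raise RuntimeError(f"spurious deallocation for {channel}")
        self.free[channel] += 1


# Hypothetical buffer counts for one packet class on one link.
credits = NextHopCredits({"adaptive": 2, "VC0": 1, "VC1": 1})
credits.send("adaptive")
credits.send("adaptive")
blocked = not credits.can_send("adaptive")   # both buffers now occupied
credits.deallocate("adaptive")               # e.g. carried by a Noop packet
```

Because the sender never dispatches a packet without a free buffer downstream, the next hop can always accept a complete packet, which is what virtual cut-through routing requires.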
Arbitration
The most challenging component of the 21364 router is the arbitration mechanism that schedules the dispatch of packets arriving at its input ports. To avoid making the arbitration mechanism a central bottleneck, the 21364 breaks the arbitration logic into local and global arbitration (Figure 7). There are 16 local arbiters, two for each input port. There are seven global arbiters, one for each output port. In each cycle, a local arbiter may speculatively schedule a packet for dispatch to an output port. Two cycles following the local arbitration (Figure 6), each global arbiter selects one out of up to seven packets speculatively scheduled for dispatch through the output port. Once such a selection is made, all flits in the X (Crossbar) stage (Figure 6) follow the input port to the output port connection, as shown in Figure 7. The local arbiters perform a variety of readiness tests to determine if a packet can be speculatively scheduled for dispatch via the router. These tests ensure that: the nominated packet is valid at the input buffer and has not been dispatched yet; the necessary dispatch path from the input buffer to the output port is free; the packet is dispatched in only one of the routes allowed; the target router, I/O chip, or local resource (in the next hop) has a free input buffer in the specific virtual channel.

² Because of the distributed and speculative nature of the 21364's arbitration mechanism, packets residing at the input buffers can be starved. The 21364 provides a sophisticated anti-starvation mechanism, which detects starved packets and drains them via the output ports.
A global arbiter selects packets speculatively scheduled for dispatch through its output port. The local arbiters speculatively schedule packets not selected by any global arbiter again in subsequent cycles.
To ensure fairness, the local and global arbiters use a least-recently selected (LRS) scheme to select a packet. Each local arbiter uses the LRS scheme to select both a class (among the several packet classes) and a virtual channel (among VC0, VC1, and Adaptive) within the class. Similarly, the global arbiter uses the LRS policy to select an input port (among the several input ports that each output port sees). Additionally, the 21364 provides two special modes, the Rotary Rule mode and the CDP Rule mode, in which packets are prioritized according to the input port they arrive from and the packet class they belong to. The Rotary Rule gives priority to packets arriving from an interprocessor port to allow older packets residing in the network to move sooner than younger packets generated from the local or I/O ports. The CDP (Coherence Dependence Priority) Rule prioritizes the packets according to their class ordering (Section 3.4). Thus, the CDP Rule prioritizes Block Response packets over Request packets.
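A least-recently selected policy can be modelled with an ordered list in which the winner moves to the back. The text does not describe how the 21364 implements LRS in hardware, so the sketch below is only one plausible reading; the class name LRSArbiter, the port names, and the initial ordering are assumptions.

```python
class LRSArbiter:
    """Grant the ready candidate that was selected least recently."""

    def __init__(self, candidates):
        # Front of the list is the least recently selected candidate.
        self.order = list(candidates)

    def select(self, ready):
        """Return the winning candidate among `ready`, or None if none is ready."""
        for candidate in self.order:
            if candidate in ready:
                self.order.remove(candidate)
                self.order.append(candidate)   # now the most recently selected
                return candidate
        return None


# Hypothetical global arbiter choosing among four input ports.
arbiter = LRSArbiter(["N", "S", "E", "W"])
grants = [arbiter.select({"S", "E"}) for _ in range(3)]
```

With two ports persistently ready, the grants alternate between them, which is the fairness property the policy is meant to provide. The same structure could serve a local arbiter choosing among packet classes or among the virtual channels within a class.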
