The Quadrics interconnection network (QsNet) 
QsNet
QsNet consists of two hardware building blocks: a programmable network interface called Elan [9] and a high-bandwidth, low-latency communication switch called Elite [10]. With respect to software, QsNet provides several layers of communication libraries that trade off between performance and ease of use. These hardware and software components combine to enable QsNet to provide the following: (1) efficient and protected access to a global virtual memory via remote DMA operations and (2) enhanced network fault tolerance via link-level and end-to-end protocols that can detect faults and automatically re-transmit packets.
Elan Network Interface
The Elan network interface connects the high-performance, multi-stage Quadrics network to a processing node containing one or more CPUs. In addition to generating and accepting packets to and from the network, the Elan provides substantial local processing power to implement high-level, message-passing protocols such as MPI. The internal functional structure of the Elan, shown in Figure 1, centers around two primary processing engines: the microcode processor and the thread processor.
Figure 1. Elan Functional Units
The 32-bit microcode processor supports four threads of execution, where each thread can independently issue pipelined memory requests to the memory system. Up to eight requests can be outstanding at any given time. The scheduling for the microcode processor enables a thread to wake up, schedule a new memory access on the result of a previous memory access, and go back to sleep in as few as two system-clock cycles.
The four microcode threads are described below:
(1) inputter thread: Handles input transactions from the network.
(2) DMA thread: Generates DMA packets to be written to the network, prioritizes outstanding DMAs, and time-slices large DMAs so that small DMAs are not adversely blocked.
(3) processor-scheduling thread: Prioritizes and controls the scheduling and descheduling of the thread processor.
(4) command-processor thread: Handles operations requested by the host (i.e., "command") processor at user level.
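The time-slicing performed by the DMA thread can be illustrated with a minimal sketch. It assumes a plain round-robin policy and an arbitrary slice size; neither is specified in the text, and the prioritization of outstanding DMAs is not modeled. The function name `time_slice_dmas` and its parameters are hypothetical.

```python
from collections import deque


def time_slice_dmas(dma_sizes, slice_bytes):
    """Round-robin over pending DMAs, sending at most slice_bytes per turn.

    dma_sizes maps a DMA name to its size in bytes.
    Returns the names in the order in which the DMAs complete.
    """
    if slice_bytes <= 0:
        raise ValueError("slice_bytes must be positive")
    queue = deque(dma_sizes.items())
    completed = []
    while queue:
        name, remaining = queue.popleft()
        remaining -= min(remaining, slice_bytes)  # send one slice
        if remaining > 0:
            queue.append((name, remaining))       # requeue the unsent rest
        else:
            completed.append(name)
    return completed


# Assumed sizes: a 64 KB DMA queued ahead of a 256-byte DMA.
order = time_slice_dmas({"large": 64 * 1024, "small": 256}, slice_bytes=1024)
print(order)  # ['small', 'large']
```

Although the large DMA is queued first, the small one completes after a single slice of the large one instead of waiting for the whole transfer.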
The thread processor is a 32-bit RISC processor that aids in the implementation of higher-level messaging libraries without explicit intervention from the main CPU. To better support such libraries, its instruction set was augmented with extra instructions to construct network packets, manipulate events, efficiently schedule threads, and block save and restore a thread's state when scheduling.
The MMU translates 32-bit virtual addresses into either 28-bit local SDRAM physical addresses or 48-bit PCI physical addresses. To translate these addresses, the MMU contains a 16-entry, fully associative translation lookaside buffer (TLB) and a small data path and state machine used to perform table walks to fill the TLB and save trap information when the MMU faults.
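The interplay between the TLB and the table walk can be sketched as follows. This is only a software model under stated assumptions: the page size, the least-recently-used replacement policy, and the flat dictionary standing in for the page tables are all hypothetical, since the text specifies only the 16-entry, fully associative organization.

```python
from collections import OrderedDict

PAGE_SIZE = 8192  # hypothetical page size; the text does not state one


class TranslationFault(Exception):
    """Raised when no mapping exists (stands in for an MMU fault)."""


class TinyTLB:
    def __init__(self, page_table, entries=16):
        self.page_table = page_table  # virtual page -> physical page
        self.entries = entries
        self.cache = OrderedDict()    # fully associative: any page, any slot
        self.misses = 0

    def translate(self, vaddr):
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        if vpage in self.cache:
            self.cache.move_to_end(vpage)           # mark as recently used
        else:
            self.misses += 1
            if vpage not in self.page_table:        # table walk finds nothing
                raise TranslationFault(hex(vaddr))
            if len(self.cache) >= self.entries:
                self.cache.popitem(last=False)      # evict (assumed LRU)
            self.cache[vpage] = self.page_table[vpage]
        return self.cache[vpage] * PAGE_SIZE + offset


table = {vpage: vpage + 100 for vpage in range(32)}  # made-up mappings
tlb = TinyTLB(table)
first = tlb.translate(3 * PAGE_SIZE + 7)   # miss: filled by the table walk
again = tlb.translate(3 * PAGE_SIZE + 7)   # hit: served from the TLB
print(hex(first), tlb.misses)
```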
Elite Switch
The Elite provides (1) eight bidirectional links supporting two virtual channels in each direction, (2) an internal 16 x 8 full crossbar switch,¹ (3) a nominal transmission bandwidth of 400 MB/s in each link direction and a flow-through latency of 35 ns, (4) packet error detection and recovery with routing and data transactions CRC-protected, (5) two priority levels combined with an aging mechanism to ensure fair delivery of packets in the same priority level, (6) hardware support for broadcasts, and (7) adaptive routing. The switches are interconnected in a quaternary fat-tree topology, which belongs to the more general class of k-ary n-trees [7, 6]. Elite networks are source-routed, and the transmission of each packet is pipelined into the network using wormhole flow control. At the link level, each packet is partitioned into smaller units called flits (flow control digits) [3] of 16 bits. Every packet is closed by an End-Of-Packet (EOP) token, but this is normally only sent after receipt of a packet acknowledge token. This implies that every packet transmission creates a virtual circuit between source and destination.
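To make the topology concrete, the size of a k-ary n-tree can be computed from its two parameters: it connects k^n processing nodes through n levels of k^(n-1) switches, each switch having k down links and k up links. The sketch below applies these standard formulas; the function name is hypothetical, and the dimension n = 2 is chosen only as an example.

```python
def kary_ntree_size(k, n):
    """Return (nodes, switches, ports_per_switch) for a k-ary n-tree."""
    if k < 1 or n < 1:
        raise ValueError("k and n must be positive")
    nodes = k ** n                 # processing nodes at the leaves
    switches = n * k ** (n - 1)    # n levels of k^(n-1) switches each
    ports_per_switch = 2 * k       # k links down, k links up
    return nodes, switches, ports_per_switch


# Quaternary fat-tree (k = 4) of dimension two.
print(kary_ntree_size(4, 2))  # (16, 8, 8)
```

For k = 4 the formula gives eight ports per switch, which matches the eight links of the Elite.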
Packets can be sent to multiple destinations using the broadcast capability of the network. For a broadcast packet to be successfully delivered, a positive acknowledgment must be received from all the recipients in the broadcast group. All Elans connected to the network are capable of receiving the broadcast packet but, if desired, the broadcast set can be limited to a subset of physically contiguous Elans.
Global Virtual Memory
The Elan can transfer information directly between the address spaces of groups of cooperating processes while maintaining hardware protection between the process groups. This capability is a sophisticated extension to the conventional virtual memory mechanism and is known as virtual operation. Virtual operation is based on two concepts: (1) the Elan virtual memory and (2) the Elan context.
¹ The crossbar has two input ports for each input link, to accommodate two virtual channels.
Elan Virtual Memory
The Elan contains an MMU to translate the virtual memory addresses issued by the various on-chip functional units (Thread Processor, DMA Engine, and so on) into physical addresses. These physical memory addresses may refer to either Elan local memory (SDRAM) or the node's main memory. To support main memory accesses, the configuration tables for the Elan MMU are synchronized with the main processor's MMU tables so that the virtual address space can be accessed by the Elan. The synchronization of the MMU tables is the responsibility of the system code and is invisible to the user programmer.
The MMU in the Elan can translate between virtual addresses written in the format of the main processor (e.g., a 64-bit word, big Endian architecture as the AlphaServer) and virtual addresses written in the Elan format (a 32-bit word, little Endian architecture). For a processor with a 32-bit architecture (e.g., an Intel Pentium), a one-to-one mapping is all that is required.
In Figure 2, the mapping for a 64-bit processor is shown.
The 64-bit addresses starting at 0x1FF0C808000 are mapped to the Elan's 32-bit addresses starting at 0xC808000. This means that virtual addresses in the range 0x1FF0C808000 to 0x1FFFFFFFFFF can be accessed directly by the main processor while the Elan can access the same memory by using addresses in the range 0xC808000 to 0xFFFFFFFF. In our example, the user may allocate main memory using malloc, and the process heap may grow outside the region directly accessible by the Elan, which is delimited by 0x1FFFFFFFFFF. To avoid this problem, both main and Elan memory can be allocated using a consistent memory-allocation mechanism. As shown in Figure 2, the MMU tables can be set up to map a common region of virtual memory called the memory-allocator heap. The allocator maps physical pages of either main memory or Elan memory into this virtual address range on demand. Thus, using allocation functions provided by the Elan library, portions of virtual memory can be allocated either from main or Elan memory, and the MMUs of both the main processor and Elan can be kept consistent.
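The two address ranges in this example (0x1FF0C808000 to 0x1FFFFFFFFFF on the main processor, 0xC808000 to 0xFFFFFFFF on the Elan) have the same length, so they differ by a constant offset. The sketch below works through that arithmetic. It is an illustration of the example mapping only: the actual translation is performed by the MMU tables, and the function and constant names are hypothetical.

```python
MAIN_BASE = 0x1FF0C808000   # first 64-bit address visible to the Elan
MAIN_TOP = 0x1FFFFFFFFFF    # last 64-bit address visible to the Elan
ELAN_BASE = 0x0C808000      # corresponding first 32-bit Elan address
ELAN_TOP = 0xFFFFFFFF       # last 32-bit Elan address
OFFSET = MAIN_BASE - ELAN_BASE


def main_to_elan(addr):
    """Map a main-processor virtual address to its Elan virtual address."""
    if not MAIN_BASE <= addr <= MAIN_TOP:
        raise ValueError(f"{addr:#x} is outside the Elan-accessible region")
    return addr - OFFSET


def elan_to_main(addr):
    """Map an Elan virtual address back to the main-processor address."""
    if not ELAN_BASE <= addr <= ELAN_TOP:
        raise ValueError(f"{addr:#x} is outside the mapped Elan range")
    return addr + OFFSET


print(hex(OFFSET))                  # 0x1ff00000000
print(hex(main_to_elan(MAIN_TOP)))  # 0xffffffff
```

An address above 0x1FFFFFFFFFF, such as one reached by a heap that has grown past the mapped region, is rejected by `main_to_elan`, which mirrors the problem that the memory-allocator heap is meant to avoid.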
For reasons of efficiency, some objects can be located on the Elan, e.g., communication buffers or DMA descriptors which the Elan can process independently of the main processor.
Elan Context
In a conventional virtual-memory system, each user process is assigned a process identification number (PID) which selects the set of MMU tables used and, therefore, the physical address spaces accessible to it. QsNet extends this concept so that the user address spaces in a parallel program can intersect. The Elan replaces the PID value with a context value. User processes can directly access an exported segment of remote memory by using a combination of a context value and a virtual address. Furthermore, the context value also determines which remote processes can access the address space via the Elan network and where those processes reside. If the user process is multithreaded, the threads will share the same context just as they share the same main memory address space. If the node has multiple physical CPUs, then the individual threads may actually be executed by different CPUs. However, they will still share the same context.
Network Fault Detection & Fault Tolerance
QsNet implements network fault detection and tolerance in hardware.² Under normal operation, the source Elan transmits a packet (i.e., route information for source routing, followed by one or more transactions). When the receiver in the destination Elan receives a transaction with an "ACK Now" flag, it means that it is the last transaction for the packet. The destination Elan then sends a packet acknowledgment (PA) token back to the source Elan. Only when the source Elan receives the PA token is it allowed to send an EOP acknowledgment token to the destination to indicate the completion of the packet transfer. In short, the fundamental rule of Elan network operation is that, for every packet that is sent down a link, a single PA token will be sent back. The link will not be re-used until the PA token has been sent.
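The error-free exchange described above can be written out as an ordered trace of tokens. The sketch below is a software illustration only, since the protocol itself runs in hardware; the function name and the token spellings are hypothetical, and error handling is not modeled.

```python
def packet_exchange(transactions):
    """Return the (sender, token) trace for one error-free packet."""
    if not transactions:
        raise ValueError("a packet carries one or more transactions")
    trace = [("source", "ROUTE")]                # source-routing information
    last = len(transactions) - 1
    for i, name in enumerate(transactions):
        flag = " [ACK Now]" if i == last else ""  # marks the last transaction
        trace.append(("source", f"TRANSACTION {name}{flag}"))
    trace.append(("destination", "PA"))          # exactly one PA per packet
    trace.append(("source", "EOP"))              # sent only after the PA
    return trace


trace = packet_exchange(["t0", "t1"])
for sender, token in trace:
    print(f"{sender:>11}: {token}")
```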
If an Elan detects an error during the transmission of a packet over QsNet, it immediately sends out an error message without waiting for a PA token to be received. If an
² It is important to note that this fault detection and tolerance occurs be-
Each process in a parallel job is allocated a virtual process id (VPID) and can map a portion of its address space into the Elan. These address spaces, taken in combination, constitute a distributed virtual shared memory. Remote memory (i.e., memory on another processing node) can be addressed by a combination of a VPID and a virtual address. Since the Elan has its own MMU, a process can select which part of its address space should be visible across the network, determine specific access rights (e.g., write- or read-only) and select the set of potential communication partners.
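The three choices a process makes when exporting memory (which region is visible, with which access rights, and to which partners) can be sketched as a simple permission check. This is a conceptual model, not the Elan's MMU mechanism; the `Export` record, its fields, and `check_access` are hypothetical names, and the example assumes read-only versus writable exports as the only access rights.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Export:
    start: int            # first exported virtual address
    length: int           # size of the exported region in bytes
    writable: bool        # False models a read-only export
    partners: frozenset   # VPIDs allowed to access the region


def check_access(exports, target_vpid, vaddr, requester_vpid, write):
    """Decide whether requester_vpid may access (target_vpid, vaddr)."""
    exp = exports.get(target_vpid)
    if exp is None:
        return False                                  # nothing exported
    in_range = exp.start <= vaddr < exp.start + exp.length
    is_partner = requester_vpid in exp.partners
    rights_ok = exp.writable or not write
    return in_range and is_partner and rights_ok


# Assumed setup: VPID 0 exports 4 KB read-only to VPIDs 1 and 2.
exports = {0: Export(0x10000, 4096, False, frozenset({1, 2}))}
print(check_access(exports, 0, 0x10010, 1, write=False))  # True
print(check_access(exports, 0, 0x10010, 1, write=True))   # False
```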
two 64-bit/66-MHz PCI slots (one of which is used by the Elan3 PCI card QM-400). The interconnection network is a quaternary fat-tree of dimension two, composed of eight 8-port Elite switches integrated in the same board. The operating system used during the evaluation is Linux 2.4.0-test7.
To expose the basic performance of QsNet, we wrote our benchmarks at the Elan3lib level. We also briefly analyze the overhead introduced by Elanlib and an implementation of MPI-2 [4] (based on a port of MPI-CH onto Elanlib).
To identify different bottlenecks, the communication buffers for our unidirectional ping, bidirectional ping, and hotspot tests are placed either in main or in Elan memory. The communication alternatives include main memory to main memory, Elan memory to Elan memory, Elan memory to main memory, and main memory to Elan memory.
Unidirectional Ping

User Applications
The graphs in Figure 4 (a) can be logically organized into three groups: those relative to Elan3lib with the source buffer in Elan memory, Elan3lib with the source buffer in main memory, and Tports and MPI. In the first group, the latency is low for small and medium-sized messages. This basic latency is increased in the second group by the extra delay to start the remote DMA over the PCI bus. Finally, both Tports 
Experiments
We tested the main features of our QsNet on an experi- 
Hotspot
In this experiment, we read from and write to the same memory location from an increasing number of processors 
Conclusion
In this paper, we presented two innovations of QsNet: (1) the integration of the virtual-address spaces of individual nodes into a single, global, virtual-address space and (2) network fault tolerance that can detect faults and automatically re-transmit packets. Next, we briefly presented the results of benchmark tests on QsNet, targeting essential performance characteristics. At the lowest level of the communication hierarchy, the unidirectional latency is as low as 2 µs and the bandwidth as high as 335 MB/s. Bidirectional measurements indicate a degradation in performance, which we analyzed and explained in the paper. At higher levels in the communication hierarchy, Tports still exhibit excellent performance figures, comparable to the ones at the Elan3lib level.
In summary, our analysis shows that in all the components of the performance space we analyzed, the network and its libraries deliver excellent performance to the end user. Future work includes scalability analysis for larger configurations, performance of a larger subset of collective communication patterns, and performance analysis of scientific applications.
