Computational scientists who depend on parallel computing to let them run larger models in less time will be disappointed unless the processors can pass information back and forth quickly. The interconnection networks through which processors communicate in tightly coupled parallel machines thus remain a vital research topic for computer architects.
Over the last two decades we have seen an evolution in the demands placed on these networks. Early SIMD machines required the simultaneous transfer of data from each network input to each output for a relatively small set of communication configurations or permutations, whereas the SIMD and MIMD machines of today need to support varied patterns of synchronous and asynchronous traffic, respectively. Interconnection networks can be categorized according to a number of criteria such as topology, routing strategy, and switching technique. They vary widely in their cost, their fault tolerance, their simplicity, their amenability to partitioning, and their aggregate bandwidth for local and nonlocal traffic patterns.1-3
Topologies
Interconnection networks are built up of switching elements (switches), which are devices that contain multiple input and output ports with a crossbar interconnection between them. Topology is the pattern in which the individual switches of the network are connected to other switches and to processors and memories (nodes). Connecting all nodes via a single nonblocking crossbar switch provides optimal performance. However, a crossbar to connect k nodes requires k I/O ports and has complexity that grows with $k^2$, making it impractical for large-scale parallel systems. Thus larger systems employ many smaller switches in direct or indirect networks. Direct topologies connect each switch directly to a node, while in indirect topologies at least some of the switches connect only to other switches. Figure 1 illustrates several direct networks, with each circle representing a node-switch combination. The 2D mesh of the 1970s-era Illiac IV is similar to the 1990s-era 2D mesh of the Intel Paragon and the toroidal 3D mesh of the Cray T3D and T3E. Direct networks often excel at routing local traffic patterns such as the passing of boundary data in grids. Higher-dimensional direct networks such as hypercubes are also adept at certain nonlocal traffic patterns and permutations.
Indirect networks, such as k-node multistage interconnection networks (MINs), can provide a variety of global communication paths by passing through multiple stages of switches. They typically have a communication delay proportional to $\log_2 k$. This is in contrast to a direct network such as a 2D mesh, which has a worst-case delay proportional to $\sqrt{k}$. Indirect networks can be constructed from a fixed-size building block for a range of values of k. This contrasts with a direct network such as a hypercube, whose optimal switch size is $\log_2 k \times \log_2 k$ (that is, a function of the machine size). Although unidirectional MINs are not optimized for local traffic, bidirectional MINs (BMINs) profit from the shorter routes between nodes within the same subtree.
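To make these scaling claims concrete, the following minimal sketch tabulates worst-case hop counts and crossbar size for a given machine size k. The function names are illustrative, the mesh is assumed to be square and without wraparound links, and the MIN is assumed to be built from identical radix-by-radix switches; none of these details come from a specific machine.

```python
import math

def crossbar_crosspoints(k: int) -> int:
    """Crosspoints in a single k x k crossbar (grows as k^2)."""
    return k * k

def mesh2d_diameter(k: int) -> int:
    """Worst-case hops in a sqrt(k) x sqrt(k) mesh with no wraparound."""
    side = math.isqrt(k)
    if side * side != k:
        raise ValueError("k must be a perfect square")
    return 2 * (side - 1)

def hypercube_diameter(k: int) -> int:
    """Worst-case hops in a hypercube of k = 2^n nodes (equals log2 k)."""
    if k < 1 or k & (k - 1):
        raise ValueError("k must be a power of two")
    return k.bit_length() - 1

def min_stages(k: int, radix: int = 2) -> int:
    """Switch stages a packet crosses in a unidirectional MIN of k nodes."""
    stages, reach = 0, 1
    while reach < k:
        reach *= radix
        stages += 1
    return stages

if __name__ == "__main__":
    for k in (64, 1024):
        print(k, crossbar_crosspoints(k), mesh2d_diameter(k),
              hypercube_diameter(k), min_stages(k, radix=4))
```

For k = 1024 the mesh needs up to 62 hops while the hypercube and a radix-2 MIN need 10, which is the $\sqrt{k}$ versus $\log_2 k$ contrast described above.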
Switching techniques
One attribute of networks that has evolved significantly is switching technique. Many early designs used circuit-switching, in which an entire circuit (path) through the network is reserved before a message is transferred along that path. This technique is efficient for large data transfers but it may reserve network links for a long time, blocking the formation of other circuits and increasing the variance in message latency. For small messages, circuit setup and release cause a disproportionately large overhead.
Other designs use packet-switching, in which large messages are broken into smaller self-routing units called packets. The first implementations of packet-switching were of the store-and-forward variety, in which an entire packet is received into a switch buffer before being forwarded to the next switch in the path. A switch output port forwards the packet when a buffer for the packet is available in the next switch. This method is very efficient when a full packet can be passed in one switch cycle. To minimize latency for large packets, virtual cut-through (VCT) was proposed.12 In VCT, each switch can begin forwarding the packet immediately after it determines an appropriate switch output, but blocked packets are completely buffered as in store-and-forward. To reduce switch cost, wormhole routing relaxes the constraint of buffering entire blocked packets in a single switch, and instead blocks packets in place in multiple switches along the currently allocated path. Wormhole routing is currently the most common switching technique for networks in commercial parallel machines.
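The latency difference between these techniques can be illustrated with a simplified, contention-free model. This is a sketch under stated assumptions rather than a description of any real switch: a packet is divided into fixed-size units (called flits here), each link moves one flit per cycle, and each switch spends `routing_cycles` choosing an output. All names and parameters are hypothetical.

```python
def store_and_forward_latency(hops: int, flits: int, routing_cycles: int = 1) -> int:
    """Every switch receives the whole packet before forwarding it."""
    return hops * (routing_cycles + flits)

def cut_through_latency(hops: int, flits: int, routing_cycles: int = 1) -> int:
    """Unblocked VCT or wormhole: the header advances one hop at a time
    and the remaining flits follow it in a pipeline."""
    return hops * (routing_cycles + 1) + (flits - 1)

if __name__ == "__main__":
    for flits in (1, 32):
        print(flits,
              store_and_forward_latency(10, flits),
              cut_through_latency(10, flits))
```

In this model a 32-flit packet crossing 10 hops takes 330 cycles with store-and-forward and 51 cycles with cut-through, while a one-flit packet takes 20 cycles either way, which matches the observation that store-and-forward is efficient when a full packet passes in one switch cycle.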
Queuing techniques
The queuing technique within each switch has an enormous impact on the potential aggregate bandwidth of the network. The simplest technique to implement is input queuing, in which packets queue at the switch inputs, awaiting the availability of the desired switch output. A packet blocked at the head of a queue also blocks the packets behind it, even if some of these packets are destined for idle switch outputs. This is often referred to as the "head-of-the-line blocking" problem. A higher-performing but more complex technique is output queuing. Output queuing is difficult to implement because, to maintain bandwidth through a switch, each switch output queue must be able to accept packet data from all switch inputs concurrently.
A strategy for gaining performance is to form multiple queues at each switch input (for instance, one queue for each possible destination switch output). Another strategy is to create a central buffer shared among all switch inputs and outputs.10 This latter technique allows an alternative implementation of output queuing: only one entity, the central buffer, needs to receive and send packet data from each switch input and to each switch output concurrently. In addition, the space within this central buffer can be shared dynamically among the inputs and outputs on a demand basis, further improving bandwidth through the switch. Most recent commercial parallel network implementations employ input queues or central buffering.
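Head-of-the-line blocking and the benefit of multiple input queues can be seen in a toy single-switch simulation. The sketch below assumes one packet per input and per output each cycle, with fixed-priority arbitration in which the lower-numbered input wins; both function names and the arbitration policy are illustrative choices, not taken from any particular switch design.

```python
from collections import deque

def drain_fifo(inputs):
    """Cycles needed to empty single-FIFO input queues.
    Each inner list holds destination outputs in arrival order."""
    queues = [deque(q) for q in inputs]
    cycles = 0
    while any(queues):
        busy = set()                      # outputs already claimed this cycle
        for q in queues:
            if q and q[0] not in busy:    # only the head packet may leave
                busy.add(q.popleft())
        cycles += 1
    return cycles

def drain_multiqueue(inputs):
    """Same switch, but each input keeps one queue per output, so it can
    send its oldest packet whose output is still free this cycle."""
    queues = [list(q) for q in inputs]
    cycles = 0
    while any(queues):
        busy = set()
        for q in queues:
            for i, dest in enumerate(q):
                if dest not in busy:
                    busy.add(dest)
                    del q[i]
                    break                 # at most one packet per input per cycle
        cycles += 1
    return cycles

if __name__ == "__main__":
    traffic = [[0, 1], [0, 2]]            # both inputs start with a packet for output 0
    print(drain_fifo(traffic), drain_multiqueue(traffic))
```

With this traffic the FIFO switch needs three cycles because input 1's packet for idle output 2 waits behind a blocked head, while the multiple-queue switch finishes in two.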
Routing techniques
Network routing techniques can be categorized according to path length, adaptivity, and mechanisms for deadlock avoidance and recovery. Packet routing can adhere to minimal or nonminimal path length. Individual switches can adaptively react to network congestion or faults by choosing among more than one available switch output, or packet routing can be nonadaptive (also called deterministic or oblivious). Some adaptive schemes are nonminimal.
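As one illustration of minimal, nonadaptive routing, the sketch below uses dimension-order routing on a 2D mesh: a packet first corrects its x coordinate and then its y coordinate. This scheme is chosen here only as a familiar example and is not named in the discussion above; the function name and coordinate convention are assumptions.

```python
def xy_route(src, dst):
    """Return the nodes visited from src to dst on a 2D mesh,
    moving along x until aligned, then along y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
```

The route is deterministic because the same source and destination always yield the same path, and minimal because its length equals the Manhattan distance between the two nodes. An adaptive router would instead choose among several such outputs based on congestion or faults.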
Deadlock occurs in networks when cyclic dependencies form among packets, and can be avoided by methods such as restricted route paths. Another strategy is to allow deadlock to occur, but to detect it and recover from it. To date, commercial parallel systems have exclusively used deadlock avoidance as opposed to deadlock recovery, or have employed networks such as MINs that create no cyclic dependencies for minimal routing.
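The notion of a cyclic dependency can be made concrete by modeling which network channel waits for which other channel and checking that graph for a cycle. The sketch below is a generic depth-first search, not a mechanism from any specific machine; the channel names and the four-channel ring are hypothetical.

```python
def has_cycle(waits_for):
    """True if the directed graph {node: [nodes it waits for]} has a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited, on current path, finished
    color = {}

    def visit(node):
        color[node] = GRAY
        for nxt in waits_for.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY or (state == WHITE and visit(nxt)):
                return True               # reached a node on the current path
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in list(waits_for))

if __name__ == "__main__":
    # Packets in a unidirectional ring, each holding one channel and
    # requesting the next one.
    ring = {"c0": ["c1"], "c1": ["c2"], "c2": ["c3"], "c3": ["c0"]}
    # The same ring with one dependency forbidden by a routing restriction.
    restricted = {"c0": ["c1"], "c1": ["c2"], "c2": ["c3"], "c3": []}
    print(has_cycle(ring), has_cycle(restricted))
```

The unrestricted ring contains a cycle and can therefore deadlock, while the restricted version cannot. Avoidance schemes aim to guarantee the second situation by construction, whereas recovery schemes must detect the first at run time.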
Developments to watch
Comparing networks to determine which is "best" is extremely difficult owing to differences in performance criteria, application domains, operating environments, cost constraints, and implementation technologies.
No clear consensus exists on optimal topologies. However, meshes, hypercubes, MINs, and BMINs have been the most widely used in recent implementations. Circuit-switching and store-and-forward switching have yielded to wormhole routing, and it is possible that, as improvements in VLSI increase the buffer capacity per switch, virtual cut-through or hybrid wormhole-VCT solutions will dominate in the future. Likewise, these VLSI improvements will favor the increasing use of multiple switch input queues or central buffering to attack head-of-the-line blocking. Other developments to watch include the influence of ATM (asynchronous transfer mode) architecture and technology, the emergence of low-cost optical interconnections, and optical switching.
A major issue inseparable from parallel networks is the component of message latency due to sending and receiving overhead at the nodes. This overhead dominates the latency for traversing a lightly loaded network in many current parallel machines. Thus, the improvement of communication protocols and the processor-network interface is fundamental to the successful evolution of parallel networks.
