I. INTRODUCTION
One very important approach to the design of supercomputers is the use of large-scale parallel processing systems, i.e., computer systems with $2^6$ to $2^{16}$ processors working together to solve a problem. The problems that require the use of supercomputers are those that have a "need for speed" due to some combination of the complexity of the algorithms needed to compute the solution, the size of the data set to be processed, and the time constraint on when a solution must be attained. Examples of such problem domains are aerodynamic simulations, air traffic control, ballistic missile defense, biomedical signal processing, chemical reaction simulations, map making, missile guidance, robot vision, satellite-collected imagery analysis, seismic data processing, speech understanding, and weather forecasting.
A critical component of any large-scale parallel processing system is the interconnection network that provides a means for communication among the system's processors and memories. Attributes of the multistage cube topology that have made it an effective basis for interconnection networks and the subject of much ongoing research are overviewed in this paper. The goal is to survey a variety of features of the multistage cube topology that make it attractive for use in supercomputers employing large-scale parallel processing. Appropriate references are listed to allow the interested reader to probe in more depth.
A. The Problem
Designing a vehicle for providing communications among N processors and N memory modules in a large-scale parallel processing system, where N may be in the range $2^6$ to $2^{16}$, is a difficult task. The interconnection scheme must provide the needed communications and be of reasonable cost. One extreme, a single shared bus, would become a bottleneck for N in this range if the processors communicate frequently. The other extreme is to link each processor directly to every other processor so that the system is completely connected, requiring N - 1 unidirectional lines for each processor, for a total of N(N - 1) links. This is unreasonable for large N. The crossbar network can emulate a completely connected system; however, it uses $N^2$ crosspoint switches [1], and so, given current technology, it is also infeasible for large N.
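To make the scale of these extremes concrete, the following minimal sketch tabulates the component counts quoted above. The function name and dictionary keys are illustrative assumptions, and the multistage cube figure of (N/2) log2 N two-input interchange boxes is taken from the network structure described in section III rather than from this paragraph.

```python
def interconnect_costs(n_ports: int) -> dict:
    """Component counts for N processors under three interconnection schemes."""
    stages = n_ports.bit_length() - 1          # log2(N) when N is a power of two
    assert 1 << stages == n_ports, "N is assumed to be a power of two"
    return {
        # completely connected: N - 1 unidirectional lines per processor
        "complete_links": n_ports * (n_ports - 1),
        # crossbar: one crosspoint switch per input/output pair
        "crossbar_crosspoints": n_ports ** 2,
        # multistage cube: log2(N) stages of N/2 two-input boxes each
        "cube_interchange_boxes": (n_ports // 2) * stages,
    }


if __name__ == "__main__":
    for exponent in (6, 10, 16):
        print(1 << exponent, interconnect_costs(1 << exponent))
```

The gap between the quadratic and the N log N counts widens quickly with N, which is the cost argument behind the multistage approach.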
A great number of networks between these extremes have been proposed in the literature. The cost-effectiveness of a particular network design depends on such factors as the computational tasks for which it will be used (which impacts the frequency of communication, interprocessor communication patterns, and message lengths), the desired speed of interprocessor data transfers, the actual implementation of the network (both hardware and associated software protocols), the number and characteristics of the processors and memories in the system, and constraints on the development and construction costs. Because of the wide range of values these factors can have, there is no one network that is considered "best" for all possible situations. A variety of network designs are described in [2]-[9].
0018-9219/89/1200-1932$01.00 © 1989 IEEE
B. Appeal of the Multistage Cube Network Topology
Some of the approaches to parallel processing system network design that have been discussed in the literature are based on the multistage cube network topology. The multistage cube network topology has been used or proposed for use in systems such as the Ballistic Missile Defense Agency test bed [10], [11], [...]
A. SIMD Machines
A model of a "single instruction stream-multiple data stream (SIMD) machine" consists of a control unit, N processors, N memory modules, and an interconnection network [51]. The control unit broadcasts instructions to the processors, and all active processors execute the same instruction at the same time. Thus there is a single instruction stream. Each active processor executes the instruction on data in its own associated memory module. Thus there are multiple data streams. The interconnection network, sometimes referred to as an alignment or permutation network, provides for communications among the processors and memory modules. Examples of SIMD machines that have been constructed include STARAN [52], MPP [53], [54], and the Connection Machine [51], [56].
B. Multiple-SIMD Machines
A variation on the SIMD model that may permit more efficient use of the system processors and memories is the "multiple-SIMD machine," a parallel processing system that can be dynamically reconfigured to operate as one or more SIMD submachines of various sizes. Examples of multiple-SIMD machines are the proposed MAP [57], [58] and the existing Connection Machine 2 [56].
[...] The Generalized Cube topology is used in this paper to represent this class of topologically equivalent networks. The advantages of the multistage cube network approach include the number of components being proportional to $N \log_2 N$, efficient distributed control schemes, partitionability, availability of multiple simultaneous paths, and the ability to employ a variety of different implementation techniques. It is these advantages and good overall network performance in both the SIMD and MIMD modes of parallelism that make the multistage cube topology appealing.
C. Overview
Section II presents different architectural models of parallel processing systems and provides some basic terminology. The multistage cube topology is defined and properties such as routing tag control and partitionability are described in section III. In section IV, the different modes of operation that the network switches can use are explored. The ways in which the network can be used to support SIMD (synchronous) parallelism and MIMD (asynchronous) parallelism are examined in sections V and VI, respectively. Various existing machines employing the multistage cube are overviewed in section VII. Section VIII considers using the multistage cube topology as a single-stage network, where each network switch is associated with a processor and memory. Section IX discusses the optimality of the multistage cube network and contains some concluding remarks.
II. ARCHITECTURAL MODELS OF PARALLEL MACHINES
There is a variety of ways to organize the processors, memories, and interconnection network in a large-scale parallel processing system. In this section, models of a few of the basic structures are briefly introduced. More information is available in textbooks such as [6] and [44]-[50].
C. MIMD Machines
In contrast to the SIMD machine, where all processors follow a single instruction stream, each processor in a parallel machine may follow its own instruction stream, forming a "multiple instruction stream-multiple data stream (MIMD) machine" [51]. One organization for an MIMD machine consists of N processors, N memory modules, and an interconnection network, where each of the processors executes its own program on its own data. Thus there are multiple instruction streams and multiple data streams. The interconnection network provides communications among the processors and memory modules. While in an SIMD system all active processors use the interconnection network at the same time (i.e., synchronously), in an MIMD system, because each processor is executing its own program, inputs to the network arrive independently (i.e., asynchronously). Examples of large MIMD systems that have been constructed are Cm* [59]-[61], the BBN Butterfly [29], the BBN GP 1000 [13], the BBN TC2000 [14], the Intel iPSC cube [62], and the NCube [63].
D. Partitionable SIMD/MIMD Machines
A fourth model of system organization combines the features of the previous three. A "partitionable SIMD/MIMD machine" is a parallel processing system that can be dynamically reconfigured to operate as one or more independent SIMD and/or MIMD submachines of various sizes.
The N processors, N memory modules, interconnection network, and C control units of a partitionable SIMD/MIMD system can be partitioned to form independent submachines as with multiple-SIMD machines. Furthermore, each processor can follow its own instructions (MIMD operation) in addition to being capable of accepting an instruction stream from a control unit (SIMD operation).
E. System Configurations
With any of these four models, there are two basic system configurations [6]. One is the "PE-to-PE configuration," in which each processing element, or PE (formed by pairing a processor with a local memory), is attached to both an input port and an output port of an interconnection network (i.e., PE j is connected to input port j and output port j). This is also referred to as a distributed memory system, or private memory system. In contrast, in the "processor-to-memory configuration," processors are attached to one side of an interconnection network and memories are attached to the other side. Processors communicate through shared memories. This is also referred to as a global memory system, or the "dance-hall" model (boys on one side of the room, girls on the other). Hybrids of the two approaches are also possible, such as using a local cache in a processor-to-memory machine (e.g., Ultracomputer [24]). Deciding which of the configurations or hybrid of them is "best" for a particular machine design involves the consideration of many factors, such as the types of computational tasks for which the machine is intended (e.g., are most data and/or programs shared by all processors or local to each processor), the operating system philosophy (e.g., will multitasking be done within each processor to hide any latency time for network transfer delays when fetching data [64]), and the characteristics of the processors and memories to be used (e.g., clock speed, availability of cache).
Beware of the term shared memory as applied to parallel machines. Some researchers use this term to refer to the way in which a machine is physically constructed (i.e., a processor-to-memory configuration) and others use it to refer to the logical addressing method. A logically shared address space paradigm can be supported by either of the two basic physical configurations (PE-to-PE or processor-to-memory), or hybrids of them.
F. Remark
The multistage cube can support all four models of parallel computation and both types of system configurations. To simplify discussions, the PE-to-PE system configuration will be assumed in sections III-VI unless stated otherwise. However, the information contained in those sections is also relevant to the processor-to-memory configuration and hybrids.
III. MULTISTAGE CUBE NETWORK TOPOLOGY PROPERTIES
Various inherent properties of the multistage cube topology are presented in this section. These include path establishment, distributed routing tag control, and partitionability. These general properties are not unique to the multistage cube. The single-stage (e.g., hypercube) and other multistage (e.g., ADM) networks have analogous but not identical properties [6], [65], [66].
A. Network Structure
[...] of multistage cube networks [26]. It has $n = \log_2 N$ stages, numbered from 0 to n - 1, where each stage consists of a set of N lines ("links"), numbered from 0 to N - 1, connected to N/2 interchange boxes. Each "interchange box" is a two-input, two-output device that can be set as shown in Fig. 1. The labels of the links entering the upper and lower inputs of an interchange box are used as the labels for the upper and lower outputs, respectively. At stage i, links whose numbers differ only in the ith bit position are paired at interchange boxes. PE j is attached to network input j and output j.
The name "multistage cube network" will be used to refer to the network consisting of the Generalized Cube topology and interchange boxes with the capabilities shown in Fig. 1, where each interchange box is controlled independently. The term multistage cube has its origin in the multidimensional cube network. In a "multidimensional cube" network, the vertices can be labeled in binary so that vertices whose labels differ only in bit position i are connected across dimension i. This pairing of vertices that differ in bit position i along dimension i corresponds to pairing links whose labels differ only in bit position i at an interchange box in stage i of the multistage cube network. This demonstrates the relationship of the multistage cube to the hypercube (multidimensional cube) networks used in such systems as the Connection Machine [56], Intel iPSC cube [62], and NCube [63].
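The pairing rule for links at stage i can be sketched in a few lines. This is only an illustration: the function name is hypothetical, and treating the label with a 0 in bit position i as the upper link of each box is an assumption made here for concreteness.

```python
def boxes_at_stage(n: int, i: int) -> list[tuple[int, int]]:
    """(upper, lower) link labels of each interchange box at stage i, for N = 2**n.

    Two links share a box exactly when their labels differ only in bit i.
    Assumption: the label with a 0 in bit i is taken as the upper link.
    """
    assert 0 <= i < n
    bit = 1 << i
    return [(link, link | bit) for link in range(1 << n) if not link & bit]


if __name__ == "__main__":
    for stage in reversed(range(3)):
        print("stage", stage, boxes_at_stage(3, stage))
```

The same pairing, read as edges between vertices, gives dimension i of a hypercube, which is the correspondence noted above.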
B. Establishing Paths
"One-to-one connections" use the straight and exchange interchange box settings. To go from a source S = s, -. . . only one path from a given source to a given destination. This unique path property limits the fault tolerance of the Generalized Cube topology in that a single network fault will prevent some source/destination pairs from being able to communicate. Design techniques for fault-tolerant variations of the Generalized Cube are surveyed i n [38].
A "conflict" occurs in a multistage cube network when the messages on the two input links of an interchange box want to go out the same output link. Typically, when a situation like this arises, one message is blocked and must wait until the other has completed its transmission. Both requests cannot be accommodated simultaneously. This is discussed further in Section IV.
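Path establishment and conflicts can be illustrated with a short sketch. It assumes the usual rule for this topology, which is not fully legible in this copy: the stage i box on a path is set to exchange exactly when bit i of the source and destination differ, and a message traverses stages n - 1 down to 0. The function names are hypothetical.

```python
def box_settings(source: int, dest: int, n: int) -> dict[int, str]:
    """Setting of the stage-i box on the unique path, keyed by stage number."""
    diff = source ^ dest
    return {i: "exchange" if (diff >> i) & 1 else "straight"
            for i in reversed(range(n))}


def paths_conflict(s1: int, d1: int, s2: int, d2: int, n: int) -> bool:
    """True if two messages request the same output link of some interchange box.

    Each message is tracked by the label of the link it occupies; passing
    stage i replaces bit i of that label with bit i of the destination.
    """
    a, b = s1, s2
    for i in reversed(range(n)):      # assumed traversal order: stage n-1 first
        bit = 1 << i
        a = (a & ~bit) | (d1 & bit)
        b = (b & ~bit) | (d2 & bit)
        if a == b:                    # same output link requested at stage i
            return True
    return False


if __name__ == "__main__":
    print(box_settings(0b010, 0b110, 3))
    print(paths_conflict(0, 0, 4, 1, 3))
```

In the second call, inputs 0 and 4 share a box at stage 2 and both destinations have a 0 in bit 2, so both messages request that box's upper output.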
A "broadcast (one-to-many) connection" is performed when the lower and/or upper broadcast states of interchange boxes are used in a path. For example, in [...]
C. Distributed Routing Tag Control
Network control for the multistage cube is distributed among the PEs by using a routing tag as the header (first item) of each message to be transmitted through the network. For one-to-one (nonbroadcast) connections, an n-bit routing tag can be computed by the source PE from its number [...] The advantages of the destination tag scheme over the XOR method are that it is easier to compute and a destination PE can compare the destination tag that arrives against its own address to determine if the message arrived at the correct network output (if it did not, the network must be faulty). The destination tag scheme has the disadvantage that it cannot be used to determine the source; the XOR scheme can. Sending the source address along with the destination tag or sending the destination address along with the XOR tag are methods that provide the capability to determine both the source of the data and if the data arrived at the proper destination. The destination tag scheme is more practical, while the XOR scheme is more mathematically pleasing (and is used in section VIII for the single-stage variant). A broadcast routing tag scheme that consists of an n-bit broadcast mask along with either type of n-bit routing tag can be used to specify a variety of broadcast [...]
D. Partitionability
The "partitionability" of the multistage cube network is the ability to divide the network into independent subnetworks of different sizes so that each subnetwork of size $N' \le N$ has all of the interconnection capabilities of a multistage cube network built to be of size $N'$. The methods for partitioning the multistage cube network assume a PE-to-PE configuration, where PE i is connected to both network input i and network output i, so both input i and output i must belong to the same partition [6], [26], [66], [68].
These methods can also be used to partition the processor-to-memory configuration, but there are some partitionings that will support the processor-to-memory configuration and not the PE-to-PE.
The ability to partition an interconnection network into independent subnetworks implies that the network can support the partitioning of the system for which it is providing communications. Partitioning is necessary to support multiple-SIMD and partitionable SIMD/MIMD systems. It can be used to partition MIMD systems and, in some cases, improve the efficiency of SIMD machines [6, pg. 111]. The advantages of partitionable systems [20] include fault tolerance (if a processor fails, only those partitions that include the failed processor are affected), fault detection (for situations where high reliability is needed, more than one partition can run the same program on the same data and compare results), multiple simultaneous users (because there can be multiple independent partitions, there can be multiple simultaneous users of the system, each executing a different parallel program), program development (rather than trying to debug a parallel program on, for example, 1024 PEs, a user can debug the program on a smaller size partition of 32 PEs and then expand it to 1024 PEs), efficiency (if a task requires only N/2 of N available processors, the other N/2 can be used for another task), and subtask parallelism (two or more independent parallel subtasks that are part of the same job can be executed in parallel, sharing results if necessary).
Consider partitioning a multistage cube of size N into two independent subnetworks, each of size N/2. There are n choices for doing this, each one based on a different bit position of the input/output port addresses. One choice is to set all interchange boxes in stage n - 1 to the straight state. This forms two subnetworks, one consisting of input/output ports 0 to (N/2) - 1 (those with a 0 in the high-order bit position of their addresses) and the other consisting of ports N/2 to N - 1 (those with a 1 in the high-order bit position). These two disjoint sets of input/output ports could communicate with each other only by using the exchange in stage n - 1. By setting this stage to straight, the subnetworks are independent and have full use of the rest of the network (stages n - 2 to 0). This is shown in Fig. 4.
Because each subnetwork has the properties of a multistage cube, it can be further subdivided. Assume the size N/2 subnetworks were created by setting stage j to straight, $0 \le j < n$. A size N/2 subnetwork can be divided into two size N/4 subnetworks by setting all the stage i interchange boxes in the size N/2 subnetwork to straight, for any i, $0 \le i < n$, $i \ne j$. Partitioning one subnetwork into halves based on bit position n - 2 is shown in Fig. 5. This process of dividing subnetworks into independent halves can be repeated on any existing subnetworks (independently) to create any size subnetwork from one to N/2. The sizes of the subnetworks may differ. The only constraints are that the size of each subnetwork must be a power of two, each input/output port can belong to at most one subnetwork, the physical addresses of the input/output ports of a subnetwork of size $2^s$ must all agree in any fixed set of n - s bit [...]
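The effect of forcing a stage to straight can be checked with a small sketch. It relies on the observation that a message can change bit i of its link label only by using the exchange setting in stage i; the function name is hypothetical.

```python
def reachable_outputs(source: int, n: int, straight_stages: set[int]) -> set[int]:
    """Outputs reachable from `source` when the listed stages are forced to straight."""
    outputs = {source}
    for i in range(n):
        if i not in straight_stages:
            # A free stage may either keep or flip bit i of the port address.
            outputs |= {port ^ (1 << i) for port in outputs}
    return outputs


if __name__ == "__main__":
    # N = 8 with stage 2 straight: ports 0-3 and ports 4-7 form two halves.
    print(sorted(reachable_outputs(1, 3, {2})))
    print(sorted(reachable_outputs(6, 3, {2})))
```

Forcing a second stage to straight within one half divides that half again, which is the repeated subdivision described in the text.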
27. E can be generated by evaluating the following logical expression:
where · is bitwise logical AND and + is bitwise logical OR.
Using the preceding example, but with destination tag routing, the effective tag is $E = s_9 s_8 d_7 \ldots d_1 d_0$, again forcing the stage 9 and 8 interchange boxes used to be set to straight.
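The logical expression itself is not legible in this copy. As a hedged illustration only, one form consistent with the example $E = s_9 s_8 d_7 \ldots d_1 d_0$ is $E = (S \cdot M) + (D \cdot \bar{M})$, where M is a hypothetical n-bit mask with a 1 in each bit position whose stage is forced to straight. The XOR-tag variant below additionally assumes that an XOR tag bit of 0 means straight.

```python
def effective_destination_tag(source: int, dest: int, straight_mask: int) -> int:
    """Source bits where the mask is 1, destination bits elsewhere (assumed form)."""
    return (source & straight_mask) | (dest & ~straight_mask)


def effective_xor_tag(source: int, dest: int, straight_mask: int) -> int:
    """XOR tag with the masked positions cleared (assumes tag bit 0 = straight)."""
    return (source ^ dest) & ~straight_mask


if __name__ == "__main__":
    mask = 0b11 << 8                      # stages 9 and 8 forced to straight
    s, d = 0b1000000001, 0b0111111110
    print(format(effective_destination_tag(s, d, mask), "010b"))
    print(format(effective_xor_tag(s, d, mask), "010b"))
```

With either tag, a PE cannot route a message out of its own partition, whatever destination the program requests.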
E. Interchange Box Size
Throughout this section it has been assumed that the network is constructed from 2 x 2 interchange boxes. Consider, for N a power of b, using b x b crossbars as the interchange boxes. All of the properties described in this section for the 2 x 2 case can be adapted for the b x b case by using base b arithmetic as the basis instead of binary. For example, the b links that differ in the ith digit of their base b representation will enter the same interchange box at stage i.
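The base b grouping can be sketched as follows; the function name is hypothetical and the sketch assumes link labels are ordinary nonnegative integers read in base b.

```python
def box_members(link: int, i: int, b: int) -> list[int]:
    """Links that enter the same b x b box as `link` at stage i.

    These are the b labels that differ from `link` only in base-b digit i.
    """
    weight = b ** i
    digit = (link // weight) % b
    base = link - digit * weight          # label with digit i cleared
    return [base + d * weight for d in range(b)]


if __name__ == "__main__":
    print(box_members(5, 1, 4))   # 5 is "11" in base 4
    print(box_members(6, 1, 2))   # the 2 x 2 case reduces to flipping bit i
```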
IV. MODES OF OPERATION FOR INTERCHANGE BOXES
The paths through the multistage cube and routing tags for determining these paths were discussed in section III. In this section, different operational modes that can be used by the interchange boxes to establish paths and transmit data over these paths are described.
A. Circuit and Packet Switching
In the "circuit-switched mode," once a path is established by the routing tag, the interchange boxes in the path remain in their specified state until the path is released. Thus there is a complete circuit established, from input port to output port, for that path. Data is sent directly from the source to the destination over this circuit. Circuit switching must be used in networks constructed from combinational logic where there are no buffers in the interchange boxes for storing data.
Different strategies can be employed in a circuit-switched network when a conflict occurs during path establishment [69]. Under the "hold" algorithm the blocked path request remains pending (holds) at the blocking interchange box until the path that is blocking its progress is relinquished. Then the previously blocked request (which is holding) can proceed through the network. Using the "drop" algorithm, a path request is immediately dropped when a conflict is encountered. A new attempt can be made at a later time to establish the path. The tradeoff is between a dropped request having to re-request the blocked subpath already established versus a held request possibly blocking other paths with that blocked subpath. A hybrid of the two is the "modified hold" algorithm. A blocked request is held for a predetermined length of time before it is dropped under the modified hold algorithm. One advantage of the modified hold algorithm is that paths blocked for short periods of time are not dropped and do not incur the added delay associated with requesting the same path again.
In the "packet-switched mode," the routing tag and data to be transmitted are collected together into a "packet." A packet consists of one or more words and can be of fixed or variable size. Packet switching uses data buffers in each interchange box to store packets as they move through the network. As in the circuit-switched case, the routing tag sets the state of the interchange box. However, in contrast to the circuit-switched case, a complete path (circuit) from the source to the destination is not established. Instead, the packet makes its way from stage to stage, releasing links and interchange boxes immediately after using them. In this way, only one interchange box is used at a time for each message. This differs from circuit switching, where n interchange boxes, one from each stage, are used simultaneously for the entire duration of the message transmission.
One way to handle conflicts in a packet-switched network is to make one packet wait until the other is transmitted. The "wait" is implemented by storing the packet in the interchange box's packet buffer. When an interchange box's packet buffer is full, it will not accept new packets from the previous stage. Networks that employ packet switching and can store multiple packets at each interchange box are often referred to as "buffered networks."
Wormhole routing [70] and virtual cut-through routing [71], [72] are hybrids of circuit and packet switching. "Wormhole routing" differs from packet switching in that an entire packet is not stored in an interchange box before being forwarded to the next interchange box. Instead, an interchange box forwards a word of a packet to the appropriate interchange box in the next stage of the network immediately after the word is latched at one of its input ports. The contents of each packet are thus pipelined through the network [68]. As packet words are forwarded through the network, the packet becomes spread across many interchange boxes. When the header of a packet is blocked, all the words of that packet stop advancing. Any other packets requiring the use of an interchange box output port for which any blocked packet word is waiting are also blocked, analogous to the hold circuit-switching protocol. "Virtual cut-through" routing is similar to wormhole routing except that when the header gets blocked, the other words from the packet continue to advance and are buffered in the interchange box that contains the blocked packet header.
Thus there are a variety of choices for switching techniques (e.g., circuit switching, packet switching, or their variations of cut-through and wormhole routing) and routing strategies (e.g., drop, hold, and modified hold algorithms). The routing tags described in section III-C are appropriate for operating the network in any of these modes of operation. Details of communication protocols and interchange box physical implementations are discussed in the literature (e.g., [11], [70]-[75]).
There are a large number of factors that can affect the relative performance of the different switching techniques. These factors include implementation technology, communication protocols, size of the network (i.e., number of I/O ports), size of interchange box (e.g., 2 x 2 versus 4 x 4), whether the message size is fixed or variable (and if variable, the distribution of the sizes), width of each path through the network, network load (the amount of data input into the network at a given time, which is affected by the processor speed, memory access time, actual application program being executed, etc.), the extent to which paths being established through the network conflict, the modes of parallelism (SIMD, MIMD, or both) in which the network is being used, cost constraints, and transfer time requirements. Because of the many variations in switching techniques and the large number of parameters that can affect network performance, determining which switching technique is "best" (even with a subset of the parameters fixed) is extremely difficult and is a problem currently under study by many researchers (e.g., [70], [76]). An interchange box implementation that allows both packet- and circuit-switching capabilities is discussed in [4], [77]. Interchange boxes designed to support fiber optics communication are described in [78].
B. Hot Spots and Combining
This subsection assumes that the processor-to-memory configuration (described in section II-E) uses a bidirectional multistage cube (as opposed to two unidirectional networks). The techniques can be adapted to hybrid configurations. Also, packet switching, as discussed in the preceding, is assumed.
The strategies just reviewed work well assuming that requests are randomly directed at the outputs, which is usually a reasonable assumption. In fact, the hardware or software can hash (map) logical to physical memory locations, thereby guaranteeing that requests look random and are spread across memory modules. There is, however, one exception: multiple memory requests can be directed at the same memory word. Such requests will at the very least have to queue up at the associated memory module, which then becomes a bottleneck. Furthermore, these requests to the same memory word will start to back up into the network, not only causing requests to that module to take a long time to be serviced, but also blocking and therefore slowing down requests to other modules. This phenomenon has been termed "hot spots" [79] and has also been studied by various researchers (e.g., [80]-[82]).
To avoid hot spots, one can use a technique called "combining." When two load requests to the same memory location meet at an interchange box, they can be combined without introducing extra delay by forwarding just one of the two (identical) loads and then satisfying both requests with the value returned from memory to that interchange box. This does assume that the load value will return to the processor via the same route through the multistage cube network (but in the reverse direction) that the load request used to get to memory. This gives a machine the ability to satisfy a large number of loads to the same location in about the same time it takes to satisfy one such load. The combining idea can be generalized to handle stores and combinations of loads and stores [24] . When a load meets a store, forward the store and return its value to satisfy the load. When a store meets a store, forward either store and ignore the other.
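The combining rules for a pair of requests meeting at an interchange box can be written down directly. The sketch below is illustrative only: the `Request` type, its field names, and the return convention are assumptions, and a real interchange box would also have to record which requests were combined so that the returning value can be fanned back out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Request:
    kind: str                     # "load" or "store"
    address: int
    value: Optional[int] = None   # only meaningful for stores


def combine(a: Request, b: Request) -> tuple[list[Request], Optional[int]]:
    """Return (requests forwarded toward memory, value that satisfies a load now).

    load + load   : forward one load; the reply later satisfies both.
    load + store  : forward the store; its value satisfies the load.
    store + store : forward either store and ignore the other.
    """
    if a.address != b.address:
        return [a, b], None                   # nothing to combine
    kinds = {a.kind, b.kind}
    if kinds == {"load"} or kinds == {"store"}:
        return [a], None
    store = a if a.kind == "store" else b
    return [store], store.value


if __name__ == "__main__":
    print(combine(Request("load", 40), Request("load", 40)))
    print(combine(Request("load", 40), Request("store", 40, 7)))
```

Because each combining step halves the number of requests to a hot location, N simultaneous loads can collapse to one request by the time they reach the memory module.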
Asynchronous parallel machines typically have synchronization primitives such as test-and-set [49]. A more powerful synchronization primitive is the "fetch-and-add" [...]
V. PERFORMANCE IN SIMD MODE
This section considers the ability of the multistage cube to permute data in an SIMD environment. Recalling that the input and output ports of a network are numbered from 0 to N - 1, a "permutation" of data among the PEs is mathematically representable as a bijection from the set {0, 1, [...]
This occurs for all enabled PEs simultaneously.
In general, there are N! ways to construct such permutations. However, not every permutation can be realized by a network in a single network transfer. A permutation is said to be "passable" by an interconnection network when all N inputs can send data to the appropriate N outputs simultaneously without any conflicts occurring; i.e., PE i can send data to PE f(i), for all i, simultaneously, without any conflicts. The routing tags described in section III-C can be used to establish the paths of any passable permutation. An example of the multistage cube network set to the permutation input j to output j - 1 modulo N, $0 \le j < N$, is shown in Fig. 6.
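Passability can be tested mechanically by following all N messages through the stages at once. The sketch below is an illustration under stated assumptions: messages traverse stages n - 1 down to 0, passing stage i replaces bit i of a message's link label with bit i of its destination, and a conflict is two messages holding the same label after a stage. The function name is hypothetical.

```python
def is_passable(perm: list[int]) -> bool:
    """True if perm (perm[i] = destination of input i) routes with no conflicts."""
    n_ports = len(perm)
    n = n_ports.bit_length() - 1
    assert 1 << n == n_ports, "N is assumed to be a power of two"
    assert sorted(perm) == list(range(n_ports)), "perm must be a bijection"
    labels = list(range(n_ports))             # link label held by each message
    for i in reversed(range(n)):
        bit = 1 << i
        labels = [(lab & ~bit) | (dst & bit) for lab, dst in zip(labels, perm)]
        if len(set(labels)) < n_ports:        # two messages on one output link
            return False
    return True


if __name__ == "__main__":
    shift = [(j - 1) % 8 for j in range(8)]   # input j to output j - 1 modulo N
    bit_reversal = [0, 4, 2, 6, 1, 5, 3, 7]
    print(is_passable(shift), is_passable(bit_reversal))
```

For N = 8 the shift permutation passes, while bit reversal fails at stage 2, where inputs 0 and 4 both request the upper output of the same box.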
In the following, the permuting power of the multistage cube network with respect to the permutations that are passable is discussed. Performing arbitrary permutations with the multistage cube is also overviewed.
Consider the permuting power of the multistage cube network. When routing a permutation, each of the Nn/2 interchange boxes is either in the straight state or in the exchange state. Because each of the Nn/2 individual interchange boxes can assume two separate states (and no two distinct settings of all the interchange boxes yield the same permutation), the total number of unique switch settings, and thus permutations, possible with the multistage cube is $2^{Nn/2} = N^{N/2}$. In general, $N^{N/2} \ll N!$; e.g., for N = 8, [...]
To demonstrate the types of permutations passable by the multistage cube, some particular types will be defined. Groups of useful permutations that arise from patterns of communication seen in parallel implementations of algorithms such as FFT, matrix operations, and divide-and-conquer are characterized in [85]. These "frequently used permutations [...]
These basic permutation groups were used in [85] to represent families of permutations frequently used in SIMD processing.
The first of these families is the $\lambda_{j,k}^{(n)}$ family that connects input X to output port $jX + k \bmod 2^n$ (j odd), $0 \le X < N$.
These permutations are used to access rows, columns, diagonals, and blocks of matrices where the matrices are stored across the memories of a parallel processing system. The ArL permutations are the same as ~r '.
The $\delta_{j,k}^{(n)}$ permutation family dictates the cyclic shift of data within segments of size $2^j$. Let $X_1$ and $X_0$ be bit substrings of $X = x_{n-1} x_{n-2} \ldots x_1 x_0$, where $X_1 = x_{n-1} x_{n-2} \ldots x_j$ and $X_0 = x_{j-1} x_{j-2} \ldots x_0$. Then $\delta_{j,k}^{(n)}$ represents the permutation input port $(X_1, X_0) \rightarrow$ output port $(X_1, X_0 + k \bmod 2^j)$.
These permutations can perform cyclic shifts within halves of the network, quarters of the network, etc. "Divide and conquer" algorithms utilize these permutations frequently.
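The two families can be generated as destination lists with a few lines of code. This is a hedged sketch: the function names are hypothetical, and the second function assumes the segment shift is an addition of k modulo the segment size, which is the reading suggested by "cyclic shift within segments" but is not fully legible in this copy.

```python
def affine_perm(n: int, j: int, k: int) -> list[int]:
    """Input X goes to output (j*X + k) mod 2**n; j must be odd to be a bijection."""
    assert j % 2 == 1
    size = 1 << n
    return [(j * x + k) % size for x in range(size)]


def segment_shift_perm(n: int, j: int, k: int) -> list[int]:
    """Cyclic shift by k within each segment of 2**j consecutive ports (assumed form)."""
    size, seg = 1 << n, 1 << j
    # High-order n - j bits are kept; the low-order j bits are shifted modulo 2**j.
    return [(x & ~(seg - 1)) | ((x + k) % seg) for x in range(size)]


if __name__ == "__main__":
    print(affine_perm(3, 3, 1))
    print(segment_shift_perm(3, 2, 1))   # shift by one within each half of N = 8
```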
The remainder of these frequently used bijections from [85] are generalizations of the bit reversal permutations within segments and the perfect shuffle permutations within segments. These permutations are useful in FFT computations, for example.
The multistage cube network can perform the $\lambda_{j,k}^{(n)}$ and $\delta_{j,k}^{(n)}$ permutations in one pass, but cannot route the perfect shuffle or the bit reversal permutations in one pass [85]. Based on the results in [32] it can be shown that the multistage cube can perform the bit reversal permutation in two passes. The routing tags needed to do this can be precomputed and stored. It is possible for the multistage cube network to pass the perfect shuffle (i.e., $\sigma^{(n)}(x_{n-1} x_{n-2} \ldots x_1 x_0)$) in two passes if, in the first pass, only those input ports numbered $X = 0 x_{n-2} \ldots x_1 x_0$ send data into the network. Then, in the second pass, those ports numbered $X = 1 x_{n-2} \ldots x_1 x_0$ send data into the network. The routing tags to do this can be precomputed or computed dynamically, using the methods given in section III-C. As described in section III-D, the multistage cube network can be partitioned into independent subnetworks, each with the connectivity of a full multistage cube network of that smaller size [66]. This property allows the multistage cube to perform the bit reversal and perfect shuffle permutations within segments in two passes.
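The two-pass perfect shuffle can be checked with a small sketch. It assumes the perfect shuffle is a one-position left rotation of the port address bits and that messages traverse stages n - 1 down to 0, each stage replacing one bit of a message's link label with the corresponding destination bit. All function names are hypothetical.

```python
def conflict_free(n: int, routes: dict[int, int]) -> bool:
    """True if the given source -> destination routes cause no link conflicts."""
    labels = list(routes)                     # current link label of each message
    dests = list(routes.values())
    for i in reversed(range(n)):
        bit = 1 << i
        labels = [(lab & ~bit) | (dst & bit) for lab, dst in zip(labels, dests)]
        if len(set(labels)) < len(labels):
            return False
    return True


def perfect_shuffle(x: int, n: int) -> int:
    """Rotate the n-bit address left by one position (assumed definition)."""
    top = (x >> (n - 1)) & 1
    return ((x << 1) & ((1 << n) - 1)) | top


def shuffle_in_two_passes(n: int) -> tuple[bool, bool]:
    """Route sources with high-order bit 0 first, then those with high-order bit 1."""
    half = 1 << (n - 1)
    first = {x: perfect_shuffle(x, n) for x in range(half)}
    second = {x: perfect_shuffle(x, n) for x in range(half, 2 * half)}
    return conflict_free(n, first), conflict_free(n, second)


if __name__ == "__main__":
    everything = {x: perfect_shuffle(x, 3) for x in range(8)}
    print(conflict_free(3, everything), shuffle_in_two_passes(3))
```

Under these assumptions the full shuffle conflicts in a single pass, while each half on its own is conflict free.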
The "universality" of a network is the ability of that network to route an arbitrary permutation [86]. From the results in [87], it can be shown that the multistage cube network can perform any permutation in three passes for all $N = 2^n$.
Two passes are necessary, three passes are sufficient. Showing that the lower bound of two passes is both a necessary and sufficient number of passes to route any permutation is an open problem for $N > 8$. Control schemes for routing an arbitrary permutation in two or three passes are not known; i.e., a technique for computing, without centralized control, the routing tags for each pass is not known.
VI. PERFORMANCE IN MIMD MODE
The MIMD performance of multistage cube networks is analyzed under the following two assumptions:
1) at each cycle, packets are generated at each source independently with probability $p$;
2) each packet is sent with equal probability to any destination.
The packet "rate," i.e., the number of packets issued per cycle by a PE, is $p$, where $p \le 1$. The "traffic intensity per processor" is loosely defined as the packet rate $p$ times the average packet length. The "bandwidth per processor" is the largest traffic intensity per processor that the network can support. The "bandwidth of a network" is $N$ times the bandwidth per processor. The "delay" is the average time for a packet to reach its destination. One of the most important performance measurements of an interconnection network is the delay for a given traffic intensity. Ideally, one desires a network that handles heavy traffic with small delay. However, these two quantities must generally be traded off: to obtain small delay, the traffic has to be relatively light, and to handle heavy traffic, the delay has to be relatively large.
In this section it is assumed that a multistage cube network with N input and output ports is composed of k 
A. Dilated and Replicated Networks
There are many possible enhancements to the basic multistage cube network defined in section III. Two such enhancements are dilation and replication [88]. The "d-dilation" of a network G is the network obtained by replacing each connection by d distinct connections. This is shown in Fig. 7. A request entering an interchange box may exit

First, consider packet-switching networks built of k x k unbuffered interchange boxes, where only one word can be stored at an input link to the interchange box. Recall that for purposes of analysis, at each cycle, each PE generates a packet with probability $p$. To simplify the analysis it is assumed that when multiple packets entering an interchange box (on different interchange box input ports) are routed to the same interchange box output, one randomly chosen packet is transferred out and the others are dropped (i.e., they are deleted from the network and would have to be retransmitted).
The relevant figure of merit for such networks is the probability $p_i$ that there is some packet on any particular input at the $i$th stage of the network. This quantity will immediately yield the bandwidth; it will be used later to approximate the delay. In [30] the following recurrence is derived:

$$p_i = 1 - \left(1 - \frac{p_{i-1}}{k}\right)^k \qquad (1)$$

with the boundary condition $p_0 = p$ (where $p$ is the probability of packet creation at a source node). To see this, note that $p_{i-1}/k$ is the probability that a packet exists at a particular input of stage $i-1$ and is directed to a particular output, so $1 - p_{i-1}/k$ is the probability that from a particular input a packet is not directed to that particular output, so $(1 - p_{i-1}/k)^k$ is the probability that no packet is directed to that particular output from any input, so $1 - (1 - p_{i-1}/k)^k$ is the probability that some packet is directed to that particular output from some input. This recurrence allows one to compute numerically the value of $p_i$ (for any initial $p_0$).
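The numerical evaluation mentioned above can be sketched as follows. The code iterates the recurrence exactly as it is derived in words in the preceding paragraph, $p_i = 1 - (1 - p_{i-1}/k)^k$; the function names are illustrative and the network is assumed to have $N = k^{\eta}$ ports for $\eta$ stages.

```python
def packet_probabilities(p0, k, stages):
    """Return [p_0, p_1, ..., p_stages] for k x k unbuffered boxes."""
    probs = [p0]
    for _ in range(stages):
        previous = probs[-1]
        # Probability that at least one of the k inputs sends a packet
        # to a given output of the interchange box.
        probs.append(1.0 - (1.0 - previous / k) ** k)
    return probs


def network_bandwidth(p0, k, stages):
    """Expected number of packets delivered per cycle (N times p at the last stage)."""
    n_ports = k ** stages
    return n_ports * packet_probabilities(p0, k, stages)[-1]


if __name__ == "__main__":
    for stages in (2, 4, 6, 8, 10):
        per_port = packet_probabilities(1.0, 2, stages)[-1]
        print(stages, round(per_port, 4), round(network_bandwidth(1.0, 2, stages), 2))
```

Even with every source issuing a packet on every cycle ($p_0 = 1$), the per-port delivery probability falls as stages are added, which is the behavior the asymptotic results quantify.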
In [88], asymptotic formulas for $p_i$ are derived and it is shown that $p_i$ can be approximated as

$$p_i \approx \frac{2k}{(k-1)\,i + 2k/p_0}.$$

This closed-form approximation provides insight as to how $p_i$ behaves. It is shown in [89] that this is actually an upper bound for $p_i$, and a (much more complicated) lower bound is presented. The bandwidth of a network is $N p_\eta$ (where $\eta = \log_k N$). In [88] it is also shown that the bandwidth of a network is asymptotically

$$\frac{2kN}{(k-1)\,\eta}$$

as $N \rightarrow \infty$, for any fixed $p_0$.
1 (a worst-case situation that is unlikely to occur) and using (1).) The performance graph in Fig. 9 [90] (removing the simplifying assumption of [89]). This is also generalized to k x k interchange boxes in [89].
The performance of a d-replicated network is easy to approximate: it consists of d copies of a multistage cube network, so (1) (and related asymptotic results) apply to each copy independently. Assume that every source issues one packet with probability $p$ at each cycle and randomly sends it to one of the d copies. Making the simplifying assumption that the d copies are independent, the probability that a packet exists after stage $\eta$ is $1 - (1 - p_\eta)^d$, where $p_\eta$ is obtained from (1) with $p_0 = p/d$.
Using these equations, one can compare the bandwidth of dilated and replicated networks using comparable hardware [88]. It

For the system to be in equilibrium, new packets must enter the system at the same rate that successful packets exit the system, so

$$p \approx \frac{2k}{(k-1)\,\eta + 2k/q}$$

where $q$ is the probability that a packet (new or reissued) enters the system at a source in a given cycle. Solving for $q$ gives

$$q \approx \frac{2kp}{2k - (k-1)\,\eta\,p}.$$

The probability $s$ that an issued packet survives, whether new or previously dropped, is the probability that a packet exits the system divided by the probability that a packet enters the system (as a new packet or reissued previously dropped packet). Thus, using this fact and the approximation for $q$ in the preceding,

$$s = \frac{p}{q} \approx 1 - \frac{(k-1)\,\eta\,p}{2k}.$$

The expected number of times $r$ that a packet is issued or reissued is one over the probability that a packet survives:

$$r = \frac{1}{s} \approx \frac{1}{1 - \dfrac{(k-1)\,\eta\,p}{2k}}.$$
A packet will always require at least one attempt to traverse the system (its final and successful attempt). Note that $p$ must be less than $2k/((k-1)\eta)$, which is the bandwidth per processor, otherwise the network will be unstable (i.e., will saturate). To fully interpret these equations, one must make further assumptions about the hardware. As long as $p$ is somewhat less than $2k/((k-1)\eta)$, the expected number of resubmissions will be bounded by a constant. The time between resubmissions depends on how soon a dropped packet can be reissued, but anywhere between a few cycles and a few $\eta$ cycles seems reasonable. Even with the latter conservative assumption, the delay for a message is $\Theta(\eta)$.
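The equilibrium argument can also be explored numerically without any closed-form approximation. The sketch below assumes only the equilibrium condition stated above (the rate of new packets equals the rate of packets leaving the last stage) and the stage-by-stage drop model for k x k unbuffered boxes; it searches for the total issue rate by bisection. The function names and the choice of bisection are illustrative, not taken from the cited analyses.

```python
def exit_probability(issue_rate, k, stages):
    """Probability that a packet leaves a given output after the last stage."""
    p = issue_rate
    for _ in range(stages):
        p = 1.0 - (1.0 - p / k) ** k
    return p


def equilibrium_issue_rate(new_rate, k, stages, tol=1e-12):
    """Total issue rate (new plus reissued) that delivers new_rate per cycle.

    Returns None when even an issue rate of 1 cannot deliver new_rate,
    i.e., when the network saturates.
    """
    if exit_probability(1.0, k, stages) < new_rate:
        return None
    low, high = new_rate, 1.0  # delivered rate never exceeds issue rate
    while high - low > tol:
        mid = (low + high) / 2.0
        if exit_probability(mid, k, stages) < new_rate:
            low = mid
        else:
            high = mid
    return high


def expected_issues(new_rate, k, stages):
    """Expected number of times a packet is issued or reissued (1 / survival)."""
    issue_rate = equilibrium_issue_rate(new_rate, k, stages)
    if issue_rate is None:
        return None
    survival = new_rate / issue_rate
    return 1.0 / survival


if __name__ == "__main__":
    for new_rate in (0.05, 0.1, 0.2, 0.3):
        print(new_rate, expected_issues(new_rate, 2, 6))
```

A result of `None` corresponds to the unstable regime described in the text, where the offered load exceeds what the network can deliver.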
C. Circuit-Switching Networks
As discussed in section IV-A, there are several variants of circuit-switching networks. In a synchronous network in which each source attempts simultaneously with probability $p$ (at the beginning of each cycle) to establish a communication path with a randomly chosen destination, the preceding analysis of unbuffered packet-switching networks applies. For asynchronous networks (where messages are not necessarily issued simultaneously), the hold and drop schemes are extremely difficult to analyze in closed form. However, they can be compared qualitatively: the advantage of the hold scheme compared to the drop scheme is that messages do not waste time entering and backing out of the network. The disadvantage is that while a message is being blocked, it blocks other potential messages. In [69], the performance of circuit-switching networks is numerically approximated and their analyses are supported with simulations. It is found that for heavy traffic, the drop scheme is better than the hold scheme when the time to transfer a message is longer than about ten cycles.
D. Buffered Packet-Switching Networks
This subsection discusses the performance of buffered networks, where there are queues at the output ports of each interchange box to store blocked packets (see section IV-A). The "waiting time" of a packet at a stage is how long it spends in the buffer at the stage; the "delay" of a packet at a stage is the waiting time at the stage plus the length of the packet. The waiting time and delay of a packet through the entire network are similarly distinguished. Initially, only the waiting time is discussed; later the delay is analyzed by adding in the packet length.
The performance of a buffered interconnection network is very sensitive to the size of the buffers. Networks with small buffers (i.e., buffers that can store just a few packets) are difficult to analyze in closed form. Networks with moderate and large buffers have approximately the same behavior as networks with infinite buffers [24], which is advantageous because these latter networks are, mathematically, much more tractable. The following analyses assume infinite buffers.
Assume that at each cycle, a processor issues a packet with probability $p$, and each packet is $c$ cycles long (i.e., it takes $c$ cycles for the whole packet to pass through an interchange box, which is called the "service time"). Queuing theoretic analysis shows that the expected waiting time of a packet in the buffer at the first stage [88], [91] is

$$\frac{(c - 1/k)\,cp}{2(1 - cp)}. \qquad (3)$$

The waiting time at later stages is not the same because the packets do not leave the buffers in the first stage or later stages independently (i.e., there is temporal dependence because, for example, if a packet has just finished exiting a buffer, it is more likely that a new packet will start exiting the buffer). The waiting time at later stages can be approximated [91] by

$$\left(1 + \frac{4cp}{5k}\right) \frac{(c - 1/k)\,cp}{2(1 - cp)}.$$

There is an interesting and important point that can be learned from these equations (this point has been made in [24], [88], [91] but does not seem to be widely appreciated).
When packets are large (i.e., take many cycles to service), unless a network is very lightly loaded it is desirable to split packets into several smaller packets, each directed to the same location, despite the fact the same routing information will have to be duplicated on all of the smaller packets. Similarly, it is not desirable to bunch together several packets directed to the same location in order to preclude each packet carrying the same routing information. This follows immediately from the preceding two formulas on waiting time at an interchange box: for the same traffic intensity (cp), the delay at each interchange box grows linearly in the packet service time (c).
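The linear growth in $c$ at fixed traffic intensity can be illustrated with a short calculation. This is a sketch that assumes the first-stage waiting time has the form $(c - 1/k)\,cp / (2(1 - cp))$, which is the form consistent with the discussion of equation (3); the function name is illustrative.

```python
def first_stage_wait(c, p, k):
    """Expected first-stage waiting time, assuming the form (c - 1/k) cp / (2 (1 - cp))."""
    intensity = c * p
    if not 0.0 <= intensity < 1.0:
        raise ValueError("traffic intensity c*p must lie in [0, 1)")
    return (c - 1.0 / k) * intensity / (2.0 * (1.0 - intensity))


if __name__ == "__main__":
    intensity = 0.5  # held fixed while the packet length changes
    for c in (1, 2, 4, 8):
        p = intensity / c
        print(c, first_stage_wait(c, p, 2))
```

With the intensity held at 0.5, splitting a packet with service time 4 into four packets with service time 1 (each issued four times as often) reduces the per-stage waiting time by a factor of $(4 - 1/k)/(1 - 1/k)$ under this formula.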
In [91] there are also general formulas for the expected value and variance of the waiting time at a buffer in the first stage, and approximations for later stages. In particular, there are formulas for various combinations of nonsquare ($a \times b$, $a \ne b$) interchange boxes, bulk arrivals, nonuniform traffic, and multiple service times.
As noted at the beginning of this subsection, the delay of a packet at an interchange box is its waiting time plus its service time, and the delay of a packet through the entire network is the sum of its waiting times through all of its interchange boxes plus its total service time. If a network does not pipeline packets, the total service time is $\eta c$, so the total delay is the sum of the waiting times at each stage plus $\eta c$. If the network pipelines packets using cut-through, the total service time is $\eta + c - 1$, so the total delay is the sum of the waiting times at each stage plus only $\eta + c - 1$. Related ideas were discussed in [68].
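The two total-service-time cases can be compared with a small helper. This is only a sketch: the per-stage waiting times are supplied by the caller (they are placeholders here, not values from the cited analyses), and the function name is illustrative.

```python
def total_delay(stage_waits, c, cut_through):
    """Total network delay: waiting time at every stage plus total service time.

    stage_waits -- waiting time at each stage, one entry per stage
    c           -- packet service time in cycles
    cut_through -- True if packets are pipelined through the stages
    """
    stages = len(stage_waits)
    if cut_through:
        service = stages + c - 1  # head advances one stage per cycle
    else:
        service = stages * c      # whole packet is received at each stage
    return sum(stage_waits) + service


if __name__ == "__main__":
    waits = [0.25] * 4  # placeholder waiting times for a four-stage network
    print(total_delay(waits, 3, cut_through=False))
    print(total_delay(waits, 3, cut_through=True))
```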
The bandwidth per processor of a buffered multistage cube network with packets with service time $c$ is $1/c$, independent of the interchange box size. For comparison, a buffered crossbar network is simply a buffered network consisting of one stage of one N x N interchange box. The delay is obtained from (3) by substituting $N$ for $k$. Equation (3) shows that each stage of a buffered network has, up to about a factor of two, the same delay as the one stage in a buffered crossbar network. So the main cause of delay in a buffered network is the number of stages. Using the preceding formulas it is easy to analyze the delay in a replicated buffered network. When comparing replicated networks based on different interchange box sizes but containing comparable total hardware, for very light traffic 2 x 2 interchange boxes perform best and, as the traffic gets heavier, larger interchange boxes are better [88]. Dilated networks are much harder to analyze.
E. Network Comparisons
From a performance point of view, the actual choice of network depends on many factors, including machine size, traffic intensity, traffic pattern, message sizes, and implementation technology. The performance of like circuit-switching and unbuffered packet-switching networks differing only in their use of either dilation or replication has been studied. Also, the relative merits of the use of the hold and drop algorithms in circuit-switching networks have been established to some extent.
Circuit-switching and unbuffered packet-switching networks are easier and cheaper to build than buffered networks, while only buffered networks provide bandwidth proportional to the network size $N$. Thus, in a general sense, it appears that buffered networks are better for very large machines and the other networks are better for small machines. This qualitative analysis is supported quantitatively in [92], where the performance of interconnection networks is approximated using Markov chains. They compare unbuffered networks, networks with buffers of size one, and networks with infinite buffers, and conclude that buffering significantly improves performance.
As mentioned in section I-A, what network is "best" depends on a great many factors. This is also true when selecting an implementation for a multistage cube network. Therefore, when one reads a study of network performance, one must take great care to understand all of the assumptions made about these factors in that study. For someone designing a new system, these assumptions must be compared to the values for these factors expected in the new system. For someone conducting interconnection network research, these assumptions must be compared to one's own view of what is reasonable.
Because of all of the different factors involved, no easy way currently exists to predict the "best" implementation for any given set of actual operating conditions or range of conditions (although some special cases may be known). The development of a general prediction methodology such as this is a difficult and open problem.
VII. CASE STUDIES
In this section some existing systems using multistage cube networks are briefly overviewed. These systems are ASPRO, Cedar, GP1000 (and TC2000), PASM, RP3, and Ultracomputer. These case studies demonstrate some of the different ways the multistage cube network is being employed. They also provide examples of operational speeds that can be attained. Information about these systems is from the references cited and from personal communications with principal people involved in each project.
A. The Goodyear Aerospace Corporation
The flip network, located between the processors and the memory, is needed to implement MDA memory. The flip network is a multistage cube network with individual stage control, i.e., interchange boxes in a given stage are all set to exchange or all set to straight. Control of the flip network is centralized; routing tags are not used. Instead, for a flip network with $2^n = 32$ inputs, $n = 5$ control lines from the central controller (one control line for each stage) set the network to any of $2^n$ permutations.
In the ASPRO, 32 processors along with a 32-line flip network are packaged on a VLSI chip. Each 32-processor VLSI chip is connected to two custom memory chips containing the memory for the 32 processors. The MDA is permitted among groups of 32 processors. A read access to the MDA memory takes 600 ns, including the time to access the memory and go through the flip network. All VLSI processor chips are connected to a common I/O bus. Unlike the STARAN systems, where the processors could use the flip network to pass data among themselves, processor connectivity in the ASPRO is limited to communicating through the MDA memory.
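The effect of individual stage control can be sketched in a few lines. The code assumes that stage $i$ of the network pairs lines whose addresses differ only in bit $i$ (the cube connection), so setting a whole stage to exchange complements bit $i$ of every address; under that assumption the $n$ control lines select the permutation "input $x$ goes to output $x \oplus c$". The function name is illustrative.

```python
def flip_permutation(n, control):
    """Output port reached by each input of a 2**n-line flip network.

    Bit i of `control` is 1 when every interchange box in stage i is set
    to exchange, and 0 when the whole stage is set to straight.
    """
    return [source ^ control for source in range(1 << n)]


if __name__ == "__main__":
    n = 5
    settings = {tuple(flip_permutation(n, control)) for control in range(1 << n)}
    print("distinct permutations:", len(settings))
    print("stages 0 and 4 exchanged:", flip_permutation(n, 0b10001)[:4])
```

Each of the $2^n$ control settings yields a different permutation, which matches the count given in the text for $n = 5$.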
B. The Cedar System
Cedar [16] is a hybrid processor-to-memory configuration multiprocessor system under construction at the University of Illinois. In Cedar, multiple computing clusters are connected through a multistage cube network to a global shared memory. Each Cedar cluster is a slightly modified Alliant FX/8 minisupercomputer with eight 64-bit floating-point microprocessors. The global shared memory is intended to hold data that must be shared among clusters. A Cedar cluster has memory modules that provide local memory for the processors of the cluster. The cluster memory modules may be shared among the processors of that cluster. The prototype under construction will have 32 processors, while the architecture is designed to support 1024 processors.
Cedar's prototype multistage cube interconnection network is based on a unidirectional 8 x 8 interchange box (implemented as an 8 x 8 crossbar). Thus two unidirectional networks are required to have a path from the processors to the global shared memory and a path back from the global shared memory to the processors. Each unidirectional network has two stages of 8 x 8 interchange boxes. Two stages of 8 x 8 interchange boxes form a network with 64 inputs and 64 outputs. For the planned 32-processor prototype, either a subset of the links and/or interchange boxes that compose a full 64 x 64 network will be used, or the full network will be used and path redundancy will be exploited for improved performance and fault tolerance [94]. The number of memory modules to be included in the shared global memory is yet to be finalized.
Packet switching is used to route packets through the interconnection networks. Packets are one, two, or three words long for read, write, and synchronization primitives, respectively. Each packet word is 64 bits wide. The remaining lines of each 80-bit network link are used for parity and control.
It takes one 85-ns cycle for a packet word to traverse an 8 x 8 interchange box. The access speed of the global memory is eight cycles but is pipelined to service a read/write request every four cycles. Thus a global memory read access takes 12 cycles (1020 ns): two cycles to traverse the two-stage network from the processor to global memory, eight cycles to retrieve the data, and two cycles to traverse the other two-stage network from global memory back to the processor.

Each PE is a processor/memory pair based on the Motorola MC68020 microprocessor with four M-bytes of main memory. Although each PE processor has only a portion of the machine's total memory physically located in the PE, it "sees" the rest of the system's memory (i.e., other PEs' memories) in its address space. That is, it can access both local and nonlocal memory by placing a valid memory address on the PE's internal bus. If the address maps into the PE's own locally held portion of the system's memory, the appropriate memory transaction is carried out with no use of the network. For accesses that map to a portion of memory held in another PE, use of the network is required. Each PE has a processor node controller (PNC) that acts as the PE's intelligent interface to the butterfly switch. When a nonlocal access is made, the PNC captures the address placed on the local PE bus by the processor and determines which PE contains the target memory location. The PNC then forms a packet and sends it through the network to the remote PE. The PNC of the remote PE receives the packet and, in the case of a write access, updates the addressed location, or in the case of a read access, reads the addressed location. It sends a reply packet back to the originating PE's PNC. For the case of the read access, the data item contained in the reply packet is placed on the PE's internal bus and is read by the processor.
The GP1000 butterfly switch is unidirectional and uses 4 x 4 interchange boxes. Each interchange box is implemented on a single VLSI chip. A form of wormhole routing (section IV-A) routes packets through the butterfly switch by using the destination tag routing scheme (section III-C). However, blocked packets are immediately dropped from the network rather than kept waiting for an open channel. If a packet is blocked at any interchange box on the way to the destination PE, a reject signal is asserted back along the established portion of the path to the originating PE. This reject signal relinquishes the portions of the path already established and informs the originating PE that the packet has been rejected. All packets are long enough (or are padded so that they are long enough) such that the head of the packet reaches the destination PE before the last portion of the packet leaves the PE originating the packet. This is necessary to ensure that the reject signal reaches the originating PE before the end of the packet leaves its network interface. A blocked (rejected) packet is retransmitted after a random time delay.
Each path through the network is four bits wide and all interchange boxes are clocked at eight MHz (i.e., four bits of a packet get through each 4 x 4 interchange box in 125 ns). However, the potential bandwidth for a network path of 32 Mbits/s is not realizable due to the overhead of routing bits, control bits, checksums, etc. For example, with N = 128, the effective bandwidth of a network path ranges from 5.3 Mbits/s for packets containing 16 data bits to 24.4 Mbits/s for packets containing 256 data bits.
BBN has recently announced the TC2000 system [14], which is similar to the GP1000. Some of the differences germane to this discussion are as follows: the design can support over 500 PEs; each PE is a Motorola 88000 microprocessor with four or 16 M-bytes of memory; the network is bidirectional; 8 x 8 interchange boxes are used; each interchange box is implemented by gate array chips; each path through the network is eight bits wide; and all interchange boxes are clocked at 38 MHz (i.e., eight bits of a packet get through each 8 x 8 interchange box in 26 ns); the potential bandwidth for a network path is 304 Mbits/s; and the effective bandwidth of a network path for N = 128 ranges from 86.9 Mbits/s for packets containing 32 data bits to 187.1 Mbits/s for packets containing 128 data bits.
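The quoted potential bandwidths follow directly from path width times clock rate, and the effective figures can be expressed as a fraction of that peak. The sketch below only redoes this arithmetic with the numbers given in the text; the function names are illustrative.

```python
def peak_path_bandwidth_mbits(path_width_bits, clock_mhz):
    """Potential bandwidth of one network path in Mbits/s."""
    return path_width_bits * clock_mhz


def path_efficiency(effective_mbits, peak_mbits):
    """Fraction of the potential bandwidth that carries useful data."""
    return effective_mbits / peak_mbits


gp1000_peak = peak_path_bandwidth_mbits(4, 8)    # four-bit path at 8 MHz
tc2000_peak = peak_path_bandwidth_mbits(8, 38)   # eight-bit path at 38 MHz

if __name__ == "__main__":
    print("GP1000 peak:", gp1000_peak, "Mbits/s")
    print("GP1000 efficiency range:",
          path_efficiency(5.3, gp1000_peak), path_efficiency(24.4, gp1000_peak))
    print("TC2000 peak:", tc2000_peak, "Mbits/s")
    print("TC2000 efficiency range:",
          path_efficiency(86.9, tc2000_peak), path_efficiency(187.1, tc2000_peak))
```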
D. The PASM Parallel Processing System
PASM is a partitionable SIMD/MIMD parallel processing system design intended to support as many as 1024 PEs [20], [21]. PASM can be partitioned dynamically to work as one or more independent or cooperating submachines of various sizes. Each submachine can independently switch between SIMD and MIMD modes of operation at instruction-level granularity with negligible overhead. These features, in conjunction with the multistage cube type of interconnection network used, make PASM a highly reconfigurable architecture. A 30-processor prototype, with 16 PEs in the computation unit, has been constructed at Purdue University [21].
The PASM prototype inter-PE network is a fault-tolerant variation of the multistage cube called the extra-stage cube [37] . The network is circuit switched and uses the destination tag routing scheme. Each interchange box is a 2 x 2 crossbar implemented with standard TTL components. The path width of each network connection is 16 data bits plus two parity bits. Other features of the network hardware include correct destination verification on each path established and parity checking on all transmitted data.
The PE Motorola MC68000 CPUs communicate through the network with other PEs via their network interfaces. The network interface is a set of data and control registers used by the PE CPU to monitor and control its connection to the network (e.g., establish paths, transmit, and receive messages). The operating speed of the PE CPU is therefore the limiting factor in determining how fast the network must transfer data. The use of standard TTL components in the prototype network yields an inexpensive network that does not limit throughput on established paths. Assuming no conflicts, the time to establish a path from PE A to PE B in the PASM prototype is approximately 2 μs. (The goal of the prototype construction was to implement a tool for studying the attributes of such a reconfigurable architecture, not to maximize the raw speed of the prototype hardware.) Once a path is established, the PASM prototype network hardware can support a transfer rate of 24 Mbits/s, but is limited to a sustained transfer rate of 3.8 Mbits/s due to the processing rate of the PE CPU.
E. IBM RP3
IBM has built an experimental PE-to-PE logically shared memory parallel machine, called RP3 (Research Parallel Processor Prototype) [19]. Each processor is paired with a memory module, along with a memory map unit, a cache, and a network interface, to form, in RP3 terminology, a processor/memory element (PME). The memory module is partitioned into local and global address space. This partition is programmable and can be all global, all local, or any proportion of local and global; it can be selected independently on a PME by PME basis. PME virtual memory references are first translated by the memory map unit [95]. Memory references to locations already in cache are handled by the cache. Otherwise, the network interface passes local memory references directly to the memory module and routes global memory references through the network.
In the RP3 architectural design, which has 512 PMEs, there are two buffered packet-switched multistage cube networks: one combining and one noncombining (see sections IV-A and IV-B). The noncombining network consists of four stages of 4 x 4 interchange boxes. In principle, this would provide 256 ports, but the number of ports is reduced to 128 in exchange for providing dual paths between the ports. Furthermore, each port is connected to four PMEs through a multiplexer and a demultiplexer. The network is actually composed of two unidirectional networks, one for requests and one for replies. A single unidirectional network handling both requests and replies can deadlock if requests start accumulating in the network and thereby not allowing replies to make progress. Two networks also improve performance.
In the architectural design, the combining network uses 2 x 2 interchange boxes and can combine the following operations: load, store, fetch-and-store, fetch-and-store-if-=-zero, fetch-and-add, fetch-and-min, fetch-and-max, fetch-and-OR, and fetch-and-AND. Only like operations can be combined, but loads are treated as fetch-and-adds of zero, allowing loads and fetch-and-adds to combine with each other.
A prototype has been operational since October 1988. The network has 128 ports, as described in the preceding, but there are only 64 PMEs; each PME has its own port and half the ports are unused. The noncombining network, which is based on ECL components, has the buffers on the input ports of an interchange box rather than on the output ports as described in section VI-D. This has the advantage of being easier to build, but the disadvantage of increased congestion. The input buffers can hold 32 bytes, which is nominally eight words. Each memory request is for two or three words. The network can operate with a cycle time of 20 ns, but has been slowed to 70 ns to conform to the processor cycle time. An interchange box has a latency of two cycles, so a request takes eight cycles to traverse (in one direction) this four-stage network. After the first byte of a message arrives (in eight cycles assuming no conflicts), a subsequent byte arrives each cycle. There is no combining network, but the fetch-and-OP operations (listed in the preceding) are still handled indivisibly at the memory modules.
An important feature of RP3 is that it can measure performance statistics without degrading machine performance [96]. For example, the machine can determine the rate and delay of memory requests.
F. NYU
[24]. The most novel feature of this machine is that it supports the combining of concurrent memory operations, including concurrent fetch-and-adds. This is done using a buffered packet-switched combining multistage cube network (see section IV-B). The network mode of operation is cut-through routing (section IV-A). Each processor is connected to the network via a processor-network interface, which supports caching, and each memory module is connected to the network via a memory-network interface.
A full prototype has yet to be built. Several bus-based prototypes based on the Motorola 68010 have been constructed. The largest has eight processors and 16 Mbytes of memory. Combining NMOS and noncombining NMOS and CMOS interchange boxes have been built. The CMOS (noncombining) interchange box has been used to construct a four-processor machine.
The project is now proposing the construction of a 256-processor machine based on the AMD 29000 processor family. Each memory module will have four Mbytes of memory for a total of one gigabyte in the entire machine. The network will use 2 x 2 interchange boxes. The network path width will be 32 bits; memory requests will use two or four network words. The cycle time of an interchange box is expected to be between 30 and 50 ns. The machine will support the following operations that combine in the network: load, load double word, store, store double word, fetch-and-store, fetch-and-store-if-=-zero, fetch-and-store-if-≥-zero, fetch-and-add, fetch-and-OR, and an operation that is essentially equivalent to fetch-and-AND. It will also support reflection [97] (which allows a process to communicate with other processes without having to know on which processors the other processes are executing) and partial word store operations (e.g., store byte and store half-word operations), but these will not be combined in the network.
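The combining of concurrent fetch-and-adds mentioned in this subsection can be illustrated with a toy model. This is a minimal sketch of the commonly described combining idea, not the hardware design of section IV-B: when two fetch-and-add requests for the same location meet at an interchange box, the box forwards one request carrying the sum of the increments, remembers the first increment, and later splits the single reply into two. The class and function names are hypothetical.

```python
class CombiningBox:
    """Toy model of one interchange box combining two fetch-and-adds."""

    def __init__(self):
        self.saved_increment = None

    def combine(self, increment_a, increment_b):
        """Merge two requests to the same address into one forwarded request."""
        self.saved_increment = increment_a
        return increment_a + increment_b

    def decombine(self, old_value):
        """Split the memory reply into the replies for the two requesters."""
        reply_a = old_value                         # as if request A ran first
        reply_b = old_value + self.saved_increment  # as if request B ran second
        return reply_a, reply_b


def fetch_and_add(memory, address, increment):
    """Indivisible fetch-and-add performed at the memory module."""
    old_value = memory[address]
    memory[address] = old_value + increment
    return old_value


memory = {"X": 10}
box = CombiningBox()
forwarded = box.combine(3, 5)                  # two requests become one
reply = fetch_and_add(memory, "X", forwarded)  # a single memory access
reply_a, reply_b = box.decombine(reply)

if __name__ == "__main__":
    print(reply_a, reply_b, memory["X"])
```

The two requesters receive the same values they would have received had their operations been executed one after the other, while the memory module services only one request.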
VIII. SINGLE-STAGE VARIANTS OF THE MULTISTAGE CUBE NETWORK
Several problems occur as multistage cube networks get larger. First, the number of components in the network (network interchange boxes, network links) grows as $N \log_2 N$, where $N$ is the number of PEs. For a computer with a very large number of PEs, the network cost can dominate the machine cost. Second, the "distance" (in this case meaning conflict-free communication time) between all arbitrary pairs of PEs is the same and is proportional to $\log_2 N$. If some pairs of PEs communicate more frequently than others, then these PEs cannot be placed "closer together" (in a network distance sense) to reduce the time required to transfer information. Converting the multistage cube network to a single-stage network addresses both of these problems. This section will describe how to do this conversion and will discuss properties of such single-stage networks.
A. Single-Stage Network Structure
To convert an N input/output Generalized Cube topology into a single-stage network, simply connect network output port $i$ directly to network input port $i$, for $0 \le i < N$. For this class of networks, each interchange box includes a connection to one (or more) PEs and will be referred to as a "routing node." Because each node is directly connected to other nodes, to move data from one PE to an adjacent one, only a single link must be traversed. Hence this is referred to as a "single-stage" network [6]. This is in contrast to a multistage network where, for any PE to communicate with any other, the traversal of multiple stages of links is required.
For this discussion, it will be assumed that each node in the single-stage network is associated with one or two PEs; actually, there may be a cluster of multiple PEs at each node and the processors of these PEs may or may not share their memory modules. The growth in the number of nodes and links is thus proportional to the number of PEs. A one-PE-per-node single-stage version of the multistage cube network having $m2^{m-1}$ routing nodes has $m2^{m-1}$ PEs. Sending data from one PE to another may require transferring the data through intermediate nodes associated with other PEs. Assuming one PE per node, the shortest distance is the distance between an adjacent pair of nodes or a "single stage." Thus, while in a multistage network there is a fixed distance between any pair of communicating PEs, in a single-stage network there is a range of distances between pairs of PEs. The transmitted information is packet switched between the sending PE and the receiving PE. Both unidirectional and bidirectional information flow approaches have been proposed for this type of network.
The hypercube network topology (mentioned in section III-A) is, in a sense, also a single-stage version of the multistage cube. One of the major differences between the hypercube and the networks described in this section is that the networks here use a fixed number of input/output ports (independent of N) for each node, while a hypercube requires $\log_2 N$. A detailed comparison of hypercube networks to the type presented in this section appears in [41].
Routing in this network is performed in two phases. A transmitted packet first moves between loops until it is on one of the two loops going to the destination PE. Then it follows that loop to the stage of the destination PE.
For the first phase, the packet's destination address is broken into two parts, the loop number and the stage number. If the destination PE is in stage i, then the loop number is expanded to a full loop label by inserting a "don't care" into the ith bit position of the loop number. When a packet enters a node in stage j, the packet is routed out of the upper node output if bit j in the destination loop address is 0 and out of the lower node output if bit j in the destination loop address is 1. This is analogous to the destination tag routing scheme described in section III-C. Once the packet is on one of the two loops running through the destination PE's routing node (i.e., the first phase is complete), the packet is routed along that loop until it enters a stage whose stage number is equal to that of its destination PE.
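The two-phase routing just described can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not the routing logic of any particular machine: a routing node is written as a (stage, loop label) pair whose bit at position `stage` is the "don't care," packets are assumed to travel from stage j to stage (j + 1) mod m, and the names `all_nodes`, `route_steps`, and `max_distance` are hypothetical.

```python
def all_nodes(m):
    """Routing nodes of the assumed model: (stage, loop label) pairs.

    The label bit at position `stage` is the "don't care" position, so it is
    fixed to 0 here, giving m * 2**(m - 1) nodes in total.
    """
    return [(stage, label)
            for stage in range(m)
            for label in range(2 ** m)
            if not (label >> stage) & 1]


def route_steps(m, src, dst):
    """Count link traversals from node src to node dst (one PE per node)."""
    (stage, loop), (dst_stage, dst_loop) = src, dst
    dont_care = 1 << dst_stage
    steps = 0
    # Arrived when in the destination stage on either loop through its node.
    while not (stage == dst_stage and ((loop ^ dst_loop) & ~dont_care) == 0):
        if stage != dst_stage:
            bit = 1 << stage
            # Upper output if destination bit j is 0, lower output if it is 1:
            # either way the loop label takes the destination's bit j.
            loop = (loop & ~bit) | (dst_loop & bit)
        stage = (stage + 1) % m  # assumed direction of travel between stages
        steps += 1
    return steps


def max_distance(m):
    """Longest route over all source/destination pairs, by exhaustive search."""
    nodes = all_nodes(m)
    return max(route_steps(m, s, d) for s in nodes for d in nodes)


if __name__ == "__main__":
    for m in (3, 4):
        print(m, len(all_nodes(m)), max_distance(m))
```

Once every bit of the loop label matches the destination, the same rule keeps the packet on its loop, so the second phase needs no separate code. Under this model, adjacent nodes are one step apart while the longest route grows with m, which illustrates the range of distances in a single-stage network.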
Because there are m bit positions in the loop labels, the packet may have to go through $m - 1$ routing nodes to get on one of the two loops entering the destination node. Counting the packet movement between adjacent routing nodes as a step, this can require $m - 1$ steps. Once on the desired loop, then the packet is at most $m - 1$ stages (or steps) from its destination. Therefore the maximum distance in the network with $m2^{m-1}$ PEs is $2m - 2$ steps. However, in contrast to the multistage cube network, there is a range of distances between pairs of PEs. Fig. 12 shows a
The ith bit in the loop label is now relevant for a PE attached to a routing node in stage i; the address of a PE is the concatenation of its stage number with the complete loop label of the loop to which it is attached. Routing is performed similarly to the one-PE-per-node networks except the loop label address has a single bit value in bit position i instead of a "don't care."
The performance difference between the single-stage LSSN and the multistage cube network family (specifically, the baseline network) was analyzed in [42]. The simulations used request interval as the varying parameter, where request interval was defined as the mean time between the last successful packet transmission from a PE and the next packet generation in that PE. Network performance was characterized in terms of throughput (the mean number of packets moving through the network per unit time) and mean packet delay (the mean time between generation of a packet at the sending PE and reception of the packet at the destination PE). Both 16- and 64-PE multistage cube networks and a 64-PE single-stage LSSN (that has the same number of interchange boxes as the 16-PE multistage cube) were modeled. It was found that when the request interval was very short the LSSN had about the same throughput as the 16-PE multistage cube (which is about one-fourth the throughput of the 64-PE multistage cube), but about three times the mean delay of a 16-PE multistage cube. However, if the request interval was moderate or long, the throughput and delay of the LSSN approached that of the 64-PE multistage cube network.
Because the 64-PE LSSN has one-sixth the number of components (routing nodes and links) of the 64-PE multistage cube network, a substantial network hardware savings is possible without performance degradation for all but very short request intervals.
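The hardware comparison above can be checked with a little arithmetic. The sketch below counts only interchange boxes and routing nodes (links are not counted), uses the usual figure of N/2 two-input two-output boxes in each of $\log_2 N$ stages for the multistage cube, and assumes exactly two PEs per LSSN node; the function names are hypothetical.

```python
def multistage_cube_boxes(n_pes):
    """Interchange boxes in an N-PE multistage cube built from 2x2 boxes."""
    stages = n_pes.bit_length() - 1   # log2(N), for N a power of two
    return (n_pes // 2) * stages      # N/2 boxes in each stage


def single_stage_nodes(m):
    """Routing nodes in the single-stage network with m stages."""
    return m * 2 ** (m - 1)


def lssn_pes(m):
    """PEs in the two-PE-per-node variant (assumed: exactly two per node)."""
    return 2 * single_stage_nodes(m)


if __name__ == "__main__":
    m = 4
    print("LSSN PEs:", lssn_pes(m), "nodes:", single_stage_nodes(m))
    print("16-PE cube boxes:", multistage_cube_boxes(16))
    print("64-PE cube boxes:", multistage_cube_boxes(64))
    print("ratio:", single_stage_nodes(m) / multistage_cube_boxes(64))
```

With m = 4 the LSSN has 64 PEs on as many nodes as the 16-PE multistage cube has boxes, and one-sixth as many as the 64-PE multistage cube, consistent with the figures quoted from [42].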
D. Bidirectional with One PE per Node
The "Lambda network" [41] is a bidirectional single-stage network based on the multistage cube network topology. The Lambda network is very similar to the MAN-YO network except that bidirectional information travel is permitted over each link. PE positions, stage numbers, and loop labels are identical to those of the MAN-YO network. Each network routing node has five input and five output ports: one input-output port pair for the single attached PE and one input-output port pair for each of the four links incident to the node. The effect of bidirectional information travel is twofold: 1) it reduces the mean and maximum distances between PEs through the network and 2) it adds network fault tolerance while still using a local routing scheme.
The reduction in mean and maximum distances is shown in Fig. 13 as a
There are two drawbacks to the bidirectional information transfer in the Lambda network. First, the routing nodes become more complicated because there are more input and output ports for the information to be switched between inside the node. Second, routing through the network becomes more complicated. Routing is no longer "move until you are on the right loop and then follow that loop to the destination stage"; routing now becomes the choice of moves between loops and stages that achieves the minimum distance path between an arbitrary communicating pair of PEs.
to the XOR routing tag scheme of section III-C. Because the loop label portion of the destination routing tag contains a "don't care" (as in the MAN-YO network), the loop resultant has a "don't care" in its bit position corresponding to the packet's destination stage number (see Fig. 14). There are two pointers to positions on the ring: one pointer to a packet's current stage and another pointer to the packet's destination stage. Each step of a packet's travel moves the current stage pointer one position clockwise or counterclockwise on the ring. The loop resultant bit at a packet's current stage position is changed when the packet switches between the loops in a node at that stage (this is because the label of the loop the packet is traveling on has changed).
When the loop resultant is all zeros with a "don't care" at the destination stage bit position, this means that the packet is on one of the two loops going through the destination PE routing node. Every bit position in the initial loop resultant that contains a one indicates a stage that the packet must pass through on its way to its destination. The shortest length path between a sending PE and a receiving PE is a series of steps around the routing ring beginning at the stage position of the sending PE, ending at the stage position of the destination PE, and touching every nonzero position in the initial loop resultant. There may be several minimum length paths between a communicating PE pair. Each of these paths will require zero, one, or two changes in direction of the packet motion around the routing ring, corresponding to changes of packet direction through the network.
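The routing-ring view of a shortest path lends itself to a small search. The following Python sketch is an illustration under assumptions, not the routing mechanism of [41]: the ring has m stage positions, the loop resultant is an m-bit integer, switching loops inside a node is taken to cost no step, and `lambda_distance` and `max_lambda_distance` are hypothetical names.

```python
from collections import deque


def lambda_distance(m, src_stage, dst_stage, resultant):
    """Fewest ring steps from src_stage to dst_stage that touch every stage
    whose loop-resultant bit is 1, found by breadth-first search."""
    dont_care = 1 << dst_stage            # destination-stage bit is ignored
    start = (src_stage, resultant & ~dont_care)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        (stage, pending), steps = queue.popleft()
        # Switching loops at the current stage clears that resultant bit.
        pending &= ~(1 << stage)
        if stage == dst_stage and pending == 0:
            return steps
        # One step moves the current-stage pointer clockwise or counterclockwise.
        for nxt in ((stage + 1) % m, (stage - 1) % m):
            state = (nxt, pending)
            if state not in seen:
                seen.add(state)
                queue.append((state, steps + 1))
    raise ValueError("destination not reachable")


def max_lambda_distance(m):
    """Largest shortest-path length over all stage pairs and resultants."""
    return max(lambda_distance(m, s, d, r)
               for s in range(m)
               for d in range(m)
               for r in range(2 ** m))


if __name__ == "__main__":
    for m in (3, 4):
        print(m, max_lambda_distance(m))
```

An exhaustive search like this is only practical for small m; it is meant to make the ring formulation concrete, for example by comparing the result with the $2m - 2$ maximum of the unidirectional network.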
There are two routing decisions for each packet as it moves through a node. The first is whether to change loops, and this is determined by the bit position in the loop resultant corresponding to the current stage number (if it is a "1," the packet should change loops). The second decision is the direction (forward or backward) of packet exit. The optimal choice of direction is a complex function of loop resultant, current stage, and destination stage. The Lambda network uses a routing table in each node to determine exit direction. This routing table must have $2N$ entries for a network with N PEs (there are $2N$ possible values resulting from the concatenation of all stage numbers with all loop labels [41]). However, the exit direction can be written in a sum-of-products form (as a function of the loop resultant and the difference of the current stage number and the destination stage number) and this sum-of-products form can be minimized. For example, the sum-of-products form for a 1024-PE Lambda network has eleven products, the largest of which has only four binary inputs.
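The routing table size can be checked with a short calculation, assuming (as for the one-PE-per-node networks above) m stages, m-bit loop labels, and $N = m2^{m-1}$ PEs:

```latex
\begin{align*}
N &= m\,2^{m-1} && \text{(one PE per routing node)}\\
\text{table entries} &= \underbrace{m}_{\text{stage numbers}} \cdot
                        \underbrace{2^{m}}_{\text{loop labels}}
                      = 2\,\bigl(m\,2^{m-1}\bigr) = 2N\\
N = 1024 &\;\Rightarrow\; m = 8, \quad 8 \cdot 2^{7} = 1024, \quad 2N = 2048
\end{align*}
```

So a 1024-PE network corresponds to m = 8, and a full table would hold 2048 entries per node, which is the cost the minimized sum-of-products form avoids.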
The Lambda network can be shown to be similar to the lens network [99]. They differ in that the lens network uses busses to connect sets of PEs (typically three PEs per bus). There are no point-to-point connections as in the Lambda.
E. Deadlock Considerations
Single-stage versions of the multistage cube network share a common problem: the possibility of deadlock. Deadlock is a condition where packets in transit get stuck indefinitely in intermediate nodes [100]. This occurs because there can be a cyclical demand for network resources, e.g., a packet in node A wants to go to node B but cannot because node B is full of packets, a packet in node B wants to go to node C, which is also full, and a packet in node C wants to go to node A, but unfortunately A is also full. Schemes to avoid deadlock are either specific for particular networks (e.g., in [42]) or are designed for general types of networks [101]-[103]. The schemes fall into two classes: 1) ensuring that a possible deadlocked state cannot be entered [103], and 2) detecting and resolving a found deadlocked state, usually by permanently discarding a packet and forcing it to be resent [102] or by temporarily discarding a packet by placing it in a special buffer and later reentering it in the network [101].
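The cyclical demand in the A, B, C example can be made concrete with a tiny wait-for check. The Python sketch below is only an illustration of what a deadlocked state looks like; it is not one of the schemes of [101]-[103]. It assumes each full node is blocked on exactly one next node, and `find_wait_cycle` is a hypothetical name.

```python
def find_wait_cycle(waits_for):
    """Return a list of nodes forming a cyclic wait, or None if there is none.

    waits_for maps each full node to the (full) node that its blocked packet
    wants to enter. A node missing from the map can accept packets.
    """
    for start in waits_for:
        path, seen = [], set()
        node = start
        # Follow the chain of blocked nodes until it ends or repeats.
        while node in waits_for and node not in seen:
            seen.add(node)
            path.append(node)
            node = waits_for[node]
        if node in seen:
            # The chain came back to a node already on the path: a cycle.
            return path[path.index(node):]
    return None


if __name__ == "__main__":
    print(find_wait_cycle({"A": "B", "B": "C", "C": "A"}))  # cyclic wait
    print(find_wait_cycle({"A": "B", "B": "C"}))            # C can accept packets
```

A detect-and-resolve scheme would act on such a cycle, for instance by discarding one of its packets, whereas a prevention scheme would ensure the cycle can never form.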
F. Partitionability
Observe that single-stage variants share with the multistage cube network the ability to be partitioned that was discussed in section III-D. By restricting the exchange movement of a packet between loops in all nodes in a given stage of these networks, the networks are effectively partitioned in half just like the multistage cube network. By preventing the exchange movement in r stages, the systems are broken into $2^r$ identically sized subsystems. If PEs in nodes forced to the straight state are not used, each subsystem is a Lambda network. Variations that make use of the PEs in nodes forced to straight are also possible.
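The $2^r$ count can be illustrated with a short Python sketch. It assumes the labeling used earlier in this section, in which bit j of a loop label can change only when a packet exchanges loops in a stage-j node, so forcing a stage to straight freezes that bit; `partition_count` is a hypothetical name.

```python
def partition_count(m, straight_stages):
    """Number of groups of loops that packets can no longer move between.

    m is the number of stages (and loop-label bits); straight_stages is the
    set of stages whose nodes are forced to the straight state.
    """
    free_bits = [j for j in range(m) if j not in straight_stages]
    groups = set()
    for label in range(2 ** m):
        key = label
        for j in free_bits:
            # Bits that can still be exchanged do not separate subsystems.
            key &= ~(1 << j)
        groups.add(key)  # what remains are the frozen bits of the label
    return len(groups)


if __name__ == "__main__":
    print(partition_count(4, set()))    # no restriction
    print(partition_count(4, {2}))      # one stage forced straight
    print(partition_count(4, {0, 3}))   # two stages forced straight
```

Each group is identified by the values of its r frozen bits, so there are $2^r$ groups of equal size.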
G. Theoretical Results
The unidirectional single-stage networks with one PE per node discussed in this section are often referred to in the theory community as "butterfly" networks. These networks are considered more realistic than the "shared memory parallel computer" model, where each processor can access any memory location in constant time (this theoretical model assumes there are no delays resulting from network or memory conflicts). An N processor shared memory parallel computer is at least as fast as an N processor butterfly network. The question is how much faster is it? One way to determine this is to show how many steps it takes for a butterfly network to simulate one step of a shared memory network. There are two problems that must be overcome in such simulations. The first is to determine how fast a butterfly network can route messages between processors. The second comes from the fact that a butterfly network has N memory modules (one associated with each processor), and many processors may want to access the same memory module, thereby causing a bottleneck. It has been shown that a butterfly network can simulate one step of a shared-memory parallel computer deterministically in time $\Theta(\log^2 N)$ and probabilistically (i.e., on average
Theoreticians have also studied networks like the Lambda, but with two PEs per node. The cube-connected cycles network [108] is essentially of this type [109]. It was studied for its ability to perform "ascend-descend" algorithms (such as merging, sorting, and routing) with optimal implementation layout area.
H. Summary
Briefly summarizing, the single-stage variants of the multistage cube network address some of the problems of the multistage cube network but introduce problems of their own. The variants' one PE per network node structure provides a range of network distances between PEs, in contrast to the fixed distance between PEs in the multistage cube network. It has been shown that the single-stage network performs the same as the multistage cube network in terms of network throughput and mean packet transfer time when the network communication load is medium or low, but the multistage network outperforms the single-stage networks under high network loading conditions. The single-stage networks are subject to deadlock, and provisions for deadlock avoidance in these networks are necessary if the networks are to be implemented. However, both unidirectional and bidirectional variants have substantially lower network hardware requirements per PE than the multistage cube. Under what conditions one should use the multistage cube itself or one of its single-stage variants (and which variant, i.e., the unidirectional or bidirectional) is an open problem.
IX. CONCLUSION
Characteristics of the multistage cube topology were presented. A variety of different attributes of this interconnection network family were explored. The topics examined spanned theoretical issues, practical design and implementation concepts and tradeoffs, and examples of networks that have been constructed. Many open problems in this important area of research were pointed out to indicate some of the topics where further study is needed.
It is natural to ask why this paper concentrates on the multistage cube network. There could be many other network topologies that are just as good, if not better, for constructing parallel machines. This paper overviews the multistage cube network and then lists many of its desirable features. One can reverse the situation by describing desirable features that an interconnection network should have, and then listing which networks have these properties. The next few paragraphs examine several important properties: routing control, performance versus cost, and layout. In each case, the multistage cube network is among the networks that are optimal with respect to the property.
Typically, in a processor-to-memory configuration (section II-E) the address of a memory location can be divided so that $n = \log_2 N$ bits denote the desired memory and the remaining bits denote a memory location within the memory. It is then very convenient to use the destination tag scheme (described in section III-C): a packet needs only to know its destination, and the same routing scheme can be used for all packets (independent of the issuing processor). With a processor-to-memory system, it is also desirable that the destination tag scheme can be used when routing in the reverse direction; i.e., the same destination tag can be used to route from any memory to the same processor. It can be shown that any size N interconnection network (composed of two-input two-output interchange boxes) that has this routing property in both directions is isomorphic to the multistage cube network topology [28]. Thus routing considerations argue for the multistage cube network.
One way of measuring the cost of an interconnection network is to count the number of wires (which is essentially the same as counting the number of interchange boxes). The performance of a network can be measured by its bandwidth or by its delay (see section VI). It can be shown that the multistage cube network is among the networks that have optimal bandwidth/cost ratio. When network traffic is M/M/1 (Poisson arrivals, exponential service times), as is often assumed in queueing theory, the multistage cube network is also among the networks that have optimal delay for their cost [110], [111].
Another measure of interconnection network cost is
Wires have constant thickness, so the area of a wire is proportional to its length. The average length of a wire in the layout of Fig. 4 is $\Theta(N/\log N)$, because the area is $\Theta(N^2)$ and there are $\Theta(N \log N)$ wires. However, the longest wires (which are from the first stage) have length $\Theta(N)$. This could slow the network down if the network cycle time is governed by the longest wires. Also, long wires have less desirable electrical properties. It has recently been shown that the multistage cube network can be laid out in area $\Theta(N^2)$ without any long wires (i.e., all wires having length at most $\Theta(N/\log N)$) [113]. Thus multistage cube networks are among the optimal area networks that also have optimal maximum wire length.
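The average wire length quoted for the layout follows from dividing the layout area by the number of wires. Written out, under the simplifying assumption that wires of constant thickness account for essentially all of the area:

```latex
\[
\bar{\ell}
  \;=\; \frac{\text{layout area}}{\text{number of wires}}
  \;=\; \frac{\Theta(N^{2})}{\Theta(N \log N)}
  \;=\; \Theta\!\left(\frac{N}{\log N}\right),
\qquad
\frac{\ell_{\max}}{\bar{\ell}}
  \;=\; \frac{\Theta(N)}{\Theta(N/\log N)}
  \;=\; \Theta(\log N).
\]
```

The second ratio shows that the longest first-stage wires exceed the average by a logarithmic factor, which is the gap the layout of [113] removes.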
The goal of this paper was to explore the different ways in which the multistage cube topology can be used in supercomputer systems employing large-scale parallel processing. The attempt was made to present this material at a level appropriate for a reader who has a technical background, but not necessarily in the field of computing. Due to the importance of this network, there is a vast amount of research related to it being conducted, and hence a great deal of information in the open literature. Thus this paper was not intended to be an exhaustive description of all that is known about this network family, and the list of over one hundred references provided is representative and not exhaustive. The paper described what the authors consider to be some of the many important features and capabilities of this network topology. The reader interested in more depth and/or breadth can use the knowledge gained from this paper as the background needed to do further reading, employing the list of references as a guide to the relevant literature.
