Abstract-The Extra Stage Cube (ESC) interconnection network, a fault-tolerant structure, is proposed for use in large-scale parallel and distributed supercomputer systems. It has all of the interconnecting capabilities of the multistage cube-type networks that have been proposed for many supersystems. The ESC is derived from the Generalized Cube network by the addition of one stage of interchange boxes and a bypass capability for two stages. It is shown that the ESC provides fault tolerance for any single failure. Further, the network can be controlled even when it has a failure, using a simple modification of a routing tag scheme proposed for the Generalized Cube. Both one-to-one and broadcast connections under routing tag control are performable by the faulted ESC. The ability of the ESC to operate with multiple faults is examined. The ways in which the ESC can be partitioned and permute data are described.
II. DEFINITIONS
An SIMD (single instruction stream-multiple data stream) [7] machine typically consists of a control unit, N processors, N memory modules, and an interconnection network. The control unit broadcasts instructions to all of the processors, and all active processors execute the same instruction at the same time. Thus, there is a single instruction stream. Each active processor executes the instruction on data in its own associated memory module. Thus, there are multiple data streams. The interconnection network, sometimes referred to as an alignment or permutation network, provides a communications facility for the processors and memory modules [20]. The Massively Parallel Processor (MPP) [3] is an example of an SIMD supersystem.
An MIMD (multiple instruction stream-multiple data stream) machine [7] typically consists of N processors and N memories, where each processor can follow an independent instruction stream. As with SIMD architectures, there are multiple data streams and an interconnection network. Thus, there are N independent processors which can communicate among themselves. There may be a coordinator unit to help orchestrate the activities of the processors. Cm* [25] is an example of an MIMD supersystem.
An MSIMD (multiple-SIMD) machine is a parallel processing system which can be structured as one or more independent SIMD machines (e.g., MAP [14]). A partitionable SIMD/MIMD machine is a system which can be configured as one or more independent SIMD and/or MIMD machines (e.g., the DCA [9] and TRAC [17] supersystems).
The Extra Stage Cube (ESC) network can be used in large-scale SIMD, MIMD, MSIMD, and partitionable SIMD/MIMD supersystems. It can be defined by first considering the Generalized Cube network from which it is derived. The Generalized Cube network is a multistage cube-type network topology which was presented in [24]. This network has N input ports and N output ports, where $N = 2^n$. It is shown in Fig. 1 for N = 8. The network ports are numbered from 0 to N - 1. Input and output ports are network interfaces to external devices called sources and destinations, respectively, which have addresses corresponding to their port numbers. The Generalized Cube topology has $n = \log_2 N$ stages, where each stage consists of a set of N lines connected to N/2 interchange boxes. Each interchange box is a two-input, two-output device and is individually controlled. An interchange box can be set to one of four legitimate states. Let the upper input and output lines be labeled i and the lower input and output lines be labeled j. The four legitimate states are: 1) straight: input i to output i, input j to output j; 2) exchange: input i to output j, input j to output i; 3) lower broadcast: input j to outputs i and j; and 4) upper broadcast: input i to outputs i and j [1]. This is shown in Fig. 1.
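As an illustration of the four legitimate interchange box states, the following minimal Python sketch models a single box. The names `BoxState` and `apply_box` are hypothetical and chosen only for this example; they do not come from the paper.

```python
from enum import Enum


class BoxState(Enum):
    """The four legitimate states of a two-input, two-output interchange box."""
    STRAIGHT = "straight"
    EXCHANGE = "exchange"
    LOWER_BROADCAST = "lower broadcast"
    UPPER_BROADCAST = "upper broadcast"


def apply_box(state, upper_in, lower_in):
    """Return (upper_out, lower_out) for the given box state (illustrative model)."""
    if state is BoxState.STRAIGHT:
        return upper_in, lower_in          # input i -> output i, input j -> output j
    if state is BoxState.EXCHANGE:
        return lower_in, upper_in          # input i -> output j, input j -> output i
    if state is BoxState.LOWER_BROADCAST:
        return lower_in, lower_in          # input j -> outputs i and j
    if state is BoxState.UPPER_BROADCAST:
        return upper_in, upper_in          # input i -> outputs i and j
    raise ValueError(f"unknown state: {state!r}")
```

For instance, `apply_box(BoxState.EXCHANGE, "a", "b")` swaps the two data items, while either broadcast state copies one input onto both outputs.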
The interconnection network can be described as a set of interconnection functions, where each is a permutation (bijection) on the set of interchange box input/output line labels [18]. When interconnection function f is applied, input S is connected to output f(S) = D for all S, $0 \le S < N$, simultaneously. That is, saying that the interconnection function maps the source address S to the destination address D is equivalent to saying the interconnection function causes data sent on the input port with address S to be routed to the output port with address D. SIMD systems typically route data simultaneously from each network input via a sequence of interconnection functions to each output. For MIMD systems, communication from one source is typically independent of other sources. In this situation the interconnection function is viewed as being applied to the single source, rather than all sources. The connections in the Generalized Cube are based on the cube interconnection functions [18]. Let $P = p_{n-1} \cdots p_1 p_0$ be the binary representation of an arbitrary I/O line label. Then the n cube interconnection functions can be defined as
$$\mathrm{cube}_i(p_{n-1} \cdots p_1 p_0) = p_{n-1} \cdots p_{i+1} \bar{p}_i p_{i-1} \cdots p_1 p_0$$
where $0 \le i < n$, $0 \le P < N$, and $\bar{p}_i$ denotes the complement of $p_i$. This means that the cube interconnection function connects P to $\mathrm{cube}_i(P)$, where $\mathrm{cube}_i(P)$ is the I/O line whose label differs from P in just the ith bit position.
Stage enabling and disabling is performed by a system control unit. Normally, the network will be set so that stage n is disabled and stage 0 is enabled. The resulting structure is that of the Generalized Cube. If after running fault detection and location tests a fault is found, the network is reconfigured. If the fault is in stage 0 then stage n is enabled and stage 0 is disabled. For a fault in a link or box in stages n - 1 to 1, both stages n and 0 will be enabled. A fault in stage n requires no change in network configuration; stage n remains disabled. If a fault occurs in stages n - 1 through 1, in addition to reconfiguring the network the system informs each source device of the fault by sending it a fault identifier.
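The cube interconnection functions and the stage enabling rules just described can be sketched in a few lines of Python. This is only an illustrative model: the function names `cube` and `stage_config` are hypothetical, and a fault is assumed to be identified simply by the index of the stage in which it lies.

```python
def cube(i, p):
    """cube_i: complement bit i of the line label p."""
    return p ^ (1 << i)


def stage_config(n, fault_stage=None):
    """Return (stage_n_enabled, stage_0_enabled) for an ESC with n + 1 stages.

    fault_stage is None for a fault-free network, otherwise the index
    (0..n) of the stage containing the located fault (an assumed encoding).
    """
    if fault_stage is None or fault_stage == n:
        return (False, True)   # behaves as a Generalized Cube
    if fault_stage == 0:
        return (True, False)   # stage n substitutes for stage 0
    return (True, True)        # fault in stages n-1..1: both extra paths needed
```

For N = 8 (n = 3), `cube(0, 0b101)` gives `0b100`, and `stage_config(3, 2)` gives `(True, True)`.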
Intuitively, for both the Generalized Cube and the ESC, stage i, $0 \le i < n$, determines the ith bit of the address of the output port to which the data are sent. Consider the route from source $S = s_{n-1} \cdots s_1 s_0$ to destination $D = d_{n-1} \cdots d_1 d_0$. If the route passes through stage i using the straight connection, then the ith bit of the source and destination addresses will be the same, i.e., $d_i = s_i$. If the exchange setting is used, the ith bits will be complementary, i.e., $d_i = \bar{s}_i$. In the Generalized Cube, stage 0 determines the 0th bit position of the destination in a similar fashion. In the ESC, however, both stage n and stage 0 can affect the 0th bit of the output address. Using the straight connection in stage n performs routings as they occur in the Generalized Cube. The exchange setting makes available an alternate route not present in the Generalized Cube. In particular, the route enters stage n - 1 at label $s_{n-1} \cdots s_1 \bar{s}_0$, instead of $s_{n-1} \cdots s_1 s_0$.
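To make the bit-by-bit view of routing concrete, the sketch below traces the line labels along a route through an ESC in which both stage n and stage 0 are enabled. The function name `esc_path_labels` and its flag `exchange_in_stage_n` are assumptions made for this example.

```python
def esc_path_labels(s, d, n, exchange_in_stage_n=False):
    """Output labels after stages n, n-1, ..., 0 for a route from s to d.

    Stage n either keeps s (straight) or complements its low-order bit
    (exchange); each stage i < n then fixes bit i to the value d_i.
    """
    label = s ^ 1 if exchange_in_stage_n else s
    labels = [label]                      # stage n output
    for i in range(n - 1, -1, -1):
        label = (label & ~(1 << i)) | (d & (1 << i))
        labels.append(label)              # stage i output
    return labels
```

For S = 001 and D = 100 with N = 8, the straight setting in stage n yields the labels 1, 5, 5, 4, while the exchange setting yields 0, 4, 4, 4. Both routes end at D, and their intermediate labels differ in the low-order bit.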
III. FAULT TOLERANCE
A. Introduction
In the fault model to be used, failures may occur in network interchange boxes and links. However, the input and output ports and the multiplexers and demultiplexers directly connected to the ports of the ESC are always assumed to be functional. If a port or the stage n demultiplexers or stage 0 multiplexers were to be faulty, then the associated device would have no access to the network. Such a circumstance will not be considered. Once a fault has been detected and located in the ESC, the failing portion of the network is considered unusable until such time as the fault is remedied. Specifically, if an interchange box is faulty, data will not be routed through it, nor will data be passed over a faulty link. The extra stage of the ESC does increase the likelihood of a fault compared to the Generalized Cube due to the additional hardware. However, analysis of an independently developed related network shows that for reasonable values of interchange box reliability there is a gain in network reliability as a result of an extra stage [27]. It should also be noted that a failure in a stage n multiplexer or stage 0 demultiplexer has the effect of a link fault, which the ESC can tolerate as shown in this section.
Techniques such as test patterns [6] or dynamic parity checking [22] for fault detection and location have been described for use in the Generalized Cube topology. Test patterns are used to determine network integrity globally by checking the data arriving at the network outputs as a result of N strings (one per input port) of test inputs. With dynamic parity checking, each interchange box monitors the status of boxes and links connected to its inputs by examining incoming data for correct parity. It is assumed that the ESC can be tested to determine the existence and location of faults. This paper is not concerned with the procedures to accomplish this, but rather with how to recover once a fault is located. Recovery from such a fault is something of which the Generalized Cube and its related networks are incapable.
B. Single Fault Tolerance: One-to-One Connections
The ESC gets its fault-tolerant abilities by having redundant paths from any source to any destination. This is shown in the following theorem.
Theorem 1: In the ESC with both stages n and 0 enabled there exist exactly two paths between any source and any destination.
Proof: There is exactly one path from a source S to a destination D in the Generalized Cube [11]. Stage n of the ESC allows access to two distinct stage n - 1 inputs, S and $\mathrm{cube}_0(S)$. Stages n - 1 to 0 of the ESC form a Generalized Cube topology, so the two stage n - 1 inputs each have a single path to the destination and these paths are distinct (since they differ at stage n - 1 at least). □
The existence of at least two paths between any source/destination pair is a necessary condition for fault tolerance. Redundant paths allow continued communication between source and destination if after a fault at least one path remains functional. It can be shown that for the ESC two paths are sufficient to provide tolerance to single faults for one-to-one connections.
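Theorem 1 can be checked by brute force for a small network. The sketch below enumerates every straight/exchange setting along a route and keeps those that reach the destination; it is an illustrative check under the assumption that stage n implements cube_0 and stages n - 1 through 0 implement cube_{n-1} through cube_0. The name `esc_routes` is hypothetical.

```python
from itertools import product


def esc_routes(s, d, n):
    """All settings (stage n, n-1, ..., 0; 1 = exchange) that take s to d."""
    stage_bits = [0] + list(range(n - 1, -1, -1))  # bit toggled by each stage
    routes = []
    for settings in product((0, 1), repeat=n + 1):
        label = s
        for bit, exchange in zip(stage_bits, settings):
            if exchange:
                label ^= 1 << bit
        if label == d:
            routes.append(settings)
    return routes
```

For N = 8, `esc_routes(s, d, 3)` returns exactly two settings for every source/destination pair, one with stage n straight and one with stage n set to exchange.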
Lemma 1: The two paths between a given source and destination in the ESC with stages n and 0 enabled have no links in common.
Proof: A source S can connect to the stage n - 1 inputs S or $\mathrm{cube}_0(S)$. These two inputs differ in the 0th, or low-order, bit position. Other than stage n, only stage 0 can cause a source to be mapped to a destination which differs from the source in the low-order bit position. Therefore, the path from S through stage n - 1 input S to the destination D contains only links with labels which agree with S in the low-order bit position. Similarly, the path through stage n - 1 input $\mathrm{cube}_0(S)$ contains only links with labels agreeing with $\mathrm{cube}_0(S)$ in the low-order bit position. Thus, no link is part of both paths. □
Lemma 2: The two paths between a given source and destination in the ESC with stages n and 0 enabled have no interchange boxes from stage n - 1 through 1 in common.
Proof: Since the two paths have the same source and destination, they will pass through the same stage n and 0 interchange boxes. No box in stages n - 1 through 1 has input link labels which differ in the low-order bit position. One path from S to D contains only links with labels agreeing with S in the low-order bit position. The other path has only links with labels which are the complement of S in the low-order bit position. Therefore, no box in stages n - 1 through 1 belongs to both paths. □
Theorem 2: In the ESC with a single fault there exists at least one fault-free path between any source and destination.
Proof: Assume first that a link is faulty. If both stages n and 0 are enabled, Lemma 1 implies that at most one of the paths between a source and destination can be faulty. Hence, a fault-free path exists. Now assume that an interchange box is faulty. There are two cases to consider. If the faulty box is in stage n or 0, the stage can be disabled. The remaining n stages are sufficient to provide one path between any source and destination (i.e., all n cube functions are still available). If the faulty box is not in stage n or 0, Lemma 2 implies that if both stages n and 0 are enabled, then at most, one of the paths is faulty. So, again, a fault-free path exists.
Two paths exist when the fault is in neither of the two paths between source and destination. □
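The disjointness properties of Lemmas 1 and 2, on which Theorem 2 rests, can be verified exhaustively for small N. The sketch below is a hedged illustration: it assumes that a link is identified by the pair (stage, output line label) for stages n through 1, and that a box in stage i is identified by the pair (i, line label with bit i cleared). These identifiers, and the function names, are conventions chosen for the example rather than notation from the paper.

```python
def path_elements(s, d, n, exchange_in_stage_n):
    """Links and boxes (stages n-1..1) used by one ESC route from s to d."""
    label = s ^ 1 if exchange_in_stage_n else s
    links = {(n, label)}                  # link leaving stage n
    boxes = set()
    for i in range(n - 1, 0, -1):         # stages n-1 down to 1
        boxes.add((i, label & ~(1 << i)))  # box shared by lines differing in bit i
        label = (label & ~(1 << i)) | (d & (1 << i))
        links.add((i, label))             # link leaving stage i
    return links, boxes


def paths_are_disjoint(n):
    """Check that the two routes share no link and no box, for every pair."""
    size = 1 << n
    for s in range(size):
        for d in range(size):
            links_a, boxes_a = path_elements(s, d, n, False)
            links_b, boxes_b = path_elements(s, d, n, True)
            if links_a & links_b or boxes_a & boxes_b:
                return False
    return True
```

Under these assumptions `paths_are_disjoint(3)` holds, so a single faulty link or a single faulty box in stages n - 1 through 1 can block at most one of the two routes.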
C. Single Fault Tolerance: Broadcast Connections
The two paths between any source and destination of the ESC provide fault tolerance for performing broadcasts as well.
Theorem 3: In the ESC with both stages n and 0 enabled there exist exactly two broadcast paths for any broadcast performable on the Generalized Cube.
Proof: There is exactly one broadcast path from a source to its destinations in the Generalized Cube. Stage n of the ESC allows a source S access to two distinct stage n - 1 inputs, S and $\mathrm{cube}_0(S)$. Any set of destinations to which S can broadcast, $\mathrm{cube}_0(S)$ can broadcast, since a one-to-many broadcast is just a collection of one-to-one connections with the same source. □
Lemma 3: The two broadcast paths between a given source and destinations in the ESC with stages n and 0 enabled have no links in common.
Proof: All links in the broadcast path from the stage n - 1 input S have labels which agree with S in the low-order bit position. All links in the broadcast path from the stage n - 1 input $\mathrm{cube}_0(S)$ are the complement of S in the low-order bit position. Thus, no link is part of both broadcast paths. □
Lemma 4: The two broadcast paths between a given source and its destinations in the ESC with stages n and 0 enabled have no interchange boxes from stage n - 1 through 1 in common.
Proof: Since the two broadcast paths have the same source and destinations, they will pass through the same stage n and 0 interchange boxes. No box in stages n - 1 through 1 has input link labels which differ in the low-order bit position. From the proof of Lemma 3, the link labels of the two broadcast paths differ in the low-order bit position. Therefore, no box in stages n - 1 through 1 belongs to both broadcast paths. □
Lemma 5: With stage 0 disabled and stage n enabled, the ESC can form any broadcast path which can be formed by the Generalized Cube.
Proof: Stages n through 1 of the ESC provide a complete set of n cube interconnection functions in the order $\mathrm{cube}_0, \mathrm{cube}_{n-1}, \ldots, \mathrm{cube}_1$. A path exists between any source and destination with stage 0 disabled because all n cube functions are available. This is regardless of the order of the interconnection functions. So, a set of paths connecting an arbitrary source to any set of destinations exists. Therefore, any broadcast path can be formed. □
Theorem 4: In the ESC with a single fault there exists at least one fault-free broadcast path for any broadcast performable by the Generalized Cube.
Proof: Assume that the fault is in stage 0, i.e., disable stage 0, enable stage n. Lemma 5 implies that a fault-free broadcast path exists. Assume that the fault is in a link or a box in stages n - 1 to 1. From Lemmas 3 and 4, the two broadcast paths will have none of these network elements in common. Therefore, at least one broadcast path will be fault-free, possibly both. Finally, assume the fault is in stage n. Stage n will be disabled and the broadcast capability of the ESC will be the same as that of the Generalized Cube.
The ESC path routing S to D corresponding to the Generalized Cube path from S to D is called the primary path. This path must either bypass stage n or use the straight setting in stage n. The other path available to connect S to D is the secondary path. It must use the exchange setting in stage n. The concept of primary path can be extended for broadcasting. The broadcast path, or set of paths, in the ESC analogous to that available in the Generalized Cube is called the primary broadcast path. This is because each path, from the source to one of the destinations, is a primary path. If every primary path is replaced by its secondary path the result is the secondary broadcast path.
Given S and D, the network links and boxes used by a path can be found. As discussed in [11], for the source/destination pair S and D the path followed in the Generalized Cube topology uses the stage i output labeled $d_{n-1} \cdots d_{i+1} d_i s_{i-1} \cdots s_1 s_0$. The following theorem extends this for the ESC.
Also, where some pair of fault labels fails the test of Theorem 6, complete fault-free interconnection capability is lost.
For an SIMD system where interconnection network routing requirements are limited to a relatively small number of known mappings, multiple faults that preclude fault-free interconnection capability might not impact system function. This would occur if all needed permutations could be performed (although each would take two passes). Similar faults in MSIMD or MIMD systems may leave some processes unaffected. For these situations, and if fail-soft capability is important, it is useful to determine which source/destination pairs are unable to communicate. The system might then attempt to reschedule processes such that their needed communication paths will be available, or assess the impact the faults will have on its performance and report to the user.
IV. ROUTING TAGS
The use of routing tags to control the Generalized Cube topology has been discussed in [11] and [22]. A broadcast routing tag has also been developed [22], [28]. The details of one routing tag scheme are summarized here to provide a basis for describing the necessary modifications for use in the ESC.
For one-to-one connections, an n-bit tag is computed from the source address S and the destination address D. The routing tag is $T = S \oplus D$, where $\oplus$ means bitwise EXCLUSIVE-OR [22]. Let $t_{n-1} \cdots t_1 t_0$ be the binary representation of T.
To determine its required setting, an interchange box at stage i need only examine $t_i$. If $t_i = 0$, the straight state is used; if $t_i = 1$, an exchange is performed. For example, given S = 001 and D = 100, then T = 101, and the box settings are exchange, straight, and exchange. Fig. 4 illustrates this route in a fault-free ESC.
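The one-to-one routing tag scheme can be sketched directly from its definition. The helper names `routing_tag` and `box_settings` are hypothetical and serve only to reproduce the worked example.

```python
def routing_tag(s, d):
    """Routing tag T = S XOR D."""
    return s ^ d


def box_settings(tag, n):
    """Settings used at stages n-1 down to 0: bit t_i selects the state."""
    return ["exchange" if (tag >> i) & 1 else "straight"
            for i in range(n - 1, -1, -1)]
```

With S = 001 and D = 100, `routing_tag(0b001, 0b100)` is `0b101`, and `box_settings(0b101, 3)` gives exchange, straight, exchange for stages 2, 1, and 0, matching the example.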
The routing tag scheme can be extended to allow broadcasting from a source to a power of two destinations with one constraint. That is, if there are $2^j$ destinations, $0 \le j \le n$, then the Hamming distance (number of differing bit positions) [12] between any two destination addresses must be less than or equal to j [22]. Thus, there is a fixed set of j bit positions where any pair of destination addresses may disagree, and n - j positions where all agree. For example, the set of addresses {010, 011, 110, 111} meets the criterion.
To demonstrate how a broadcast routing tag is constructed, let S be the source address and $D_1, D_2, \ldots, D_{2^j}$ be the $2^j$ destination addresses. The routing tags are $T_i = S \oplus D_i$, $1 \le i \le 2^j$. These tags will differ from each other only in the same j bit positions in which S may differ from $D_i$, $1 \le i \le 2^j$.
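A destination set can be screened for the broadcast constraint with a short check. The sketch below tests the fixed-positions form of the condition, namely that $2^j$ distinct destinations disagree in exactly j bit positions overall; the function name `broadcast_positions` is an assumption made for this example.

```python
def broadcast_positions(dests, n):
    """Bit positions where the destinations disagree, or None if the set
    cannot be reached with a single broadcast tag.

    An empty list is a valid result: it means a single destination (j = 0).
    """
    count = len(set(dests))
    if count == 0 or count & (count - 1):
        return None                       # size is not a power of two
    j = count.bit_length() - 1            # count == 2**j
    differing = 0
    for d in dests:
        differing |= d ^ dests[0]         # accumulate every disagreeing bit
    positions = [i for i in range(n) if (differing >> i) & 1]
    return positions if len(positions) == j else None
```

For the set {010, 011, 110, 111} the result is `[0, 2]`: the four addresses disagree only in bits 0 and 2, so the corresponding tags $T_i$ differ only in those positions.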
The broadcast routing tag must provide information for routing and determining branching points.
Now routing tag and broadcast routing tag definitions for use in the ESC with a fault will be described. With regard to routing tags, the primary path in the ESC is that corresponding to the tag $T' = 0 t_{n-1} \cdots t_1 t_0$, and the secondary path is that associated with $T' = 1 t_{n-1} \cdots t_1 \bar{t}_0$. The primary broadcast path is specified by $R' = 0 r_{n-1} \cdots r_1 r_0$ and $B' = 0 b_{n-1} \cdots b_1 b_0$, whereas $R' = 1 r_{n-1} \cdots r_1 \bar{r}_0$ and $B' = 0 b_{n-1} \cdots b_1 b_0$ denote the secondary broadcast path.
It is assumed that the system has appropriately reconfigured the network and distributed fault labels to all sources as required. With the condition of the primary path known, a routing tag that avoids the network fault can be computed.
3) If the fault is in stage n, use $T' = t'_n t_{n-1} \cdots t_1 t_0$, where $t'_n$ is arbitrary.
Proof: Assume that the fault is in stage 0, i.e., stage n will be enabled and stage 0 disabled. Since stage n duplicates stage 0 (both perform $\mathrm{cube}_0$), a routing can be accomplished by substituting stage n for stage 0. The tag $T' = t_0 t_{n-1} \cdots t_1 t_0$ does this by placing a copy of $t_0$ in the nth bit position. Stage n then performs the necessary setting. Note that the low-order bit position of T', $t_0$, will be ignored since stage 0 is disabled.
Assume that the fault is in a link or a box in stages n - 1 to 1. T specifies the primary path. If this path is fault-free, setting $T' = 0 t_{n-1} \cdots t_1 t_0$ will use this path. The 0 in the nth bit position is necessary because stages n and 0 are enabled, given the assumed fault location. If the path denoted by T contains the fault, then the secondary path is fault-free by Theorem 2 and must be used. It is reached by setting the high-order bit of T' to 1. This maps S to the input $\mathrm{cube}_0(S)$ of stage n - 1. To complete the path to D, bits n - 1 to 0 of T' must be $\mathrm{cube}_0(S) \oplus D = t_{n-1} \cdots t_1 \bar{t}_0$. Thus, $T' = 1 t_{n-1} \cdots t_1 \bar{t}_0$.
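The tag modifications worked through in this proof can be sketched as follows. This is an illustrative model only: `esc_tag` and `route` are hypothetical names, the fault is assumed to be described by its stage index plus a flag saying whether the primary path is affected, and the secondary tag complements the low-order bit because $\mathrm{cube}_0(S) \oplus D$ differs from $S \oplus D$ only in bit 0.

```python
def esc_tag(s, d, n, fault_stage=None, primary_faulty=False):
    """(n+1)-bit tag T'; bit n controls stage n (illustrative encoding)."""
    t = s ^ d
    if fault_stage == 0:
        return ((t & 1) << n) | t         # copy t_0 into bit n; stage 0 is disabled
    if fault_stage is not None and 1 <= fault_stage <= n - 1 and primary_faulty:
        return (1 << n) | (t ^ 1)         # secondary path: exchange in stage n
    return t                              # bit n = 0: primary path


def route(s, tag, n, stage_n_enabled, stage_0_enabled):
    """Follow a tag through the ESC and return the output label reached."""
    label = s
    if stage_n_enabled and (tag >> n) & 1:
        label ^= 1                        # stage n performs cube_0
    for i in range(n - 1, 0, -1):
        if (tag >> i) & 1:
            label ^= 1 << i               # stage i performs cube_i
    if stage_0_enabled and tag & 1:
        label ^= 1                        # stage 0 performs cube_0
    return label
```

For S = 001 and D = 100, a stage 0 fault gives T' = 1101 and a faulty primary path gives T' = 1100; `route` delivers both to D under the corresponding stage configuration.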
Finally, assume that the fault is in stage n. Stage n will be disabled, and the routing tag needed will be the same as in the fault-free ESC.
$0 b_{n-1} \cdots b_1 b_0$ causes the broadcast to be performed using the secondary broadcast path. Finally, assume the fault is in stage n. Stage n will be disabled, and the broadcast routing tag needed will be the same as in the fault-free ESC. □
Theorems 7 and 8 are important for MIMD operation of the network because they show that the fault-tolerant capability of the ESC is available through simple manipulation of the usual routing or broadcast tags. Table I summarizes routing tags and Table II
V. PARTITIONING
The Generalized Cube can be partitioned into two subnetworks of size N/2 by forcing all interchange boxes to the straight state in any one stage [21]. All the input and output port addresses of a subnetwork will agree in the ith bit position if the stage that is set to all straight is the ith stage.
Proof: The cube functions n - 1 through 1 each occur once in the ESC. Setting stage i, $1 \le i \le n - 1$, to all straight separates the network input and output ports into two independent groups. Each group contains ports whose addresses agree in the ith bit position, i.e., all addresses have their ith bits equal to 0 in one group, and 1 in the other. The other n stages provide the $\mathrm{cube}_j$ functions for $0 \le j < n$ and $j \ne i$, where $\mathrm{cube}_0$ appears twice. This comprises an ESC network for the N/2 ports of each group. As with the Generalized Cube, each subnetwork can be further subdivided. Since the addresses of the interchange box outputs and links of a primary path and a secondary path differ only in the 0th bit position, both paths will be in the same partition (i.e., they will agree in the bit position(s) upon which the partitioning is based). Thus, the fault-tolerant routing scheme of the ESC is compatible with network partitioning.
If partitioning is attempted on stage n the result will clearly be a Generalized Cube topology of size N. Attempting to partition on stage 0 again yields a network of size N, in particular a Generalized Cube with $\mathrm{cube}_0$ first, not last. In neither case are independent subnetworks formed. □
In Fig. 7 the ESC for N = 8 is shown partitioned with respect to stage 2. The two subnetworks are indicated by the labels A and B. Subnetwork A consists of ports 0, 1, 2, and 3. These port addresses agree in the high-order bit position (it is 0). Subnetwork B contains ports 4, 5, 6, and 7, all of which agree in the high-order bit position (it is 1).
Partitioning can be readily accomplished by combining routing tags with masking [22]. By logically ANDing tags with masks to force to 0 those tag positions corresponding to interchange boxes that should be set to straight, partitions can be established. This process is external to the network and, so, independent of a network fault. Thus, partitioning is unimpeded by a fault.
In PASM, partitioning is designed to be based on I/O port addresses within a group agreeing in some number of low-order bit positions. The ESC as defined cannot support this type of partition. However, a variation of the ESC can perform low-order bit partitioning. Beginning with a Generalized Cube, an ESC-like network can be constructed by adding an extra stage to the output side of the network which implements $\mathrm{cube}_{n-1}$. Call this new stage -1. Thus, from the input to the output, the stages implement $\mathrm{cube}_{n-1}, \mathrm{cube}_{n-2}, \ldots, \mathrm{cube}_1, \mathrm{cube}_0$, and $\mathrm{cube}_{n-1}$. The same fault-tolerant capabilities are available in this new network, but partitioning may be done on stage 0. Hence, low-order bit partitioning is available. Partitioning on stages n - 1 and -1 is not available.
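The tag-masking approach to partitioning described in this section can be illustrated with a small sketch. The function name `masked_tag` and the representation of a partition as a list of all-straight stage indices are assumptions made for the example.

```python
def masked_tag(s, d, partition_stages):
    """Routing tag with the bits of all-straight stages forced to 0."""
    mask = ~sum(1 << i for i in partition_stages)
    return (s ^ d) & mask
```

With N = 8 partitioned on stage 2, as in the Fig. 7 configuration, `masked_tag(1, 6, [2])` is `0b011`. The high-order tag bit is forced to 0, so data from port 1 cannot leave the subnetwork containing ports 0 through 3.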
VI. PERMUTING
In SIMD mode generally all or most sources will be sending data simultaneously. Sending data from each source to a single, distinct destination is referred to as permuting data from input to output. A network can perform or pass a permutation if it can map each source to its destination without conflicts. A conflict occurs when two or more paths include the same stage output.
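Whether a permutation passes the Generalized Cube can be tested by looking for conflicts stage by stage. The sketch below is a hedged illustration that relies on the stage i output label $d_{n-1} \cdots d_i s_{i-1} \cdots s_0$ cited earlier from [11]; the function name `passes_generalized_cube` is hypothetical.

```python
def passes_generalized_cube(perm, n):
    """True if no two paths of perm (perm[s] = d) share a stage output."""
    for i in range(n - 1, -1, -1):        # stages n-1 down to 0
        outputs = set()
        for s, d in enumerate(perm):
            high = d >> i << i            # bits n-1..i come from d
            low = s & ((1 << i) - 1)      # bits i-1..0 still come from s
            label = high | low
            if label in outputs:
                return False              # two paths need the same output
            outputs.add(label)
    return True
```

For N = 4, the permutation that maps 0, 1, 2, 3 to 0, 2, 1, 3 fails this test, since sources 0 and 2 both require stage 1 output 00, whereas the identity permutation passes.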
The fault-free ESC clearly has the same permuting capability as the Generalized Cube. That is, any permutation performable by the Generalized Cube is performable by the ESC. If stage n in a fault-free ESC is enabled, the permuting capability is a superset of that of the Generalized Cube. Also, the ESC routing tags discussed in Section IV are entirely suitable for use in an SIMD environment.
Because of its fault-tolerant nature, it is possible to perform permutations on the ESC with a single fault, unlike the Generalized Cube. It can be shown that in this situation two passes are sufficient to realize any Generalized Cube performable permutation.
Theorem 10: In the ESC with one fault all Generalized Cube performable permutations can be performed in at most two passes.
Proof: If a stage n interchange box is faulty, the stage is bypassed and the remainder of the ESC performs any passable permutation with a single pass. If the fault is in a stage 0 box the permutation can be accomplished in two passes as follows. In the first pass, stages n and 0 are bypassed and the remaining stages are set as usual. On the second pass, stage n is set as stage 0 would have been, stages n -1 through 1 are set to straight, and stage 0 is again bypassed. This simulates a pass through a fault-free network.
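The two-pass procedure for a stage 0 fault can be sketched at the level of line labels. This is an illustrative model, not the paper's implementation: the function name `two_pass_route` is hypothetical, and the sketch only tracks where each data item ends up, without modeling conflicts.

```python
def two_pass_route(perm):
    """Final output label for each source when stage 0 is bypassed.

    perm[s] = d is the desired permutation; tags are T = S XOR D.
    """
    result = []
    for s, d in enumerate(perm):
        tag = s ^ d
        # Pass 1: stages n and 0 bypassed; stages n-1..1 follow the tag.
        after_pass1 = s ^ (tag & ~1)
        # Pass 2: stage n uses the setting stage 0 would have used;
        # stages n-1..1 are straight and stage 0 is bypassed again.
        after_pass2 = after_pass1 ^ (tag & 1)
        result.append(after_pass2)
    return result
```

For any permutation the returned list equals `perm`, which reflects the claim that the two passes together simulate one pass through a fault-free network.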
While stages n to 1 of the ESC provide the complete set of cube interconnection functions found in the Generalized Cube, a single pass through the stages in this order does not duplicate its permuting capability. For example, the Generalized Cube can perform a permutation which includes the mappings 0 to
When the fault is in a link or a box in stages n - 1 to 1, then at the point of the fault there are fewer than N paths through the network. Thus, N paths cannot exist simultaneously. The permutation can be completed in two passes in the following way. First, all sources with fault-free primary paths to their destination are routed. One source will not be routed if the failure was in a link, two if in a box. With a failed link, the second pass routes the remaining source to its destination using its fault-free secondary path. With a faulty box, the secondary paths of the two remaining sources will also route to their destinations without conflict. Recall that paths conflict when they include the same box output. From Theorem 5, the primary path output labels for these two paths at stage i are $d^1_{n-1} \cdots d^1_{i+1} d^1_i s^1_{i-1} \cdots s^1_1 s^1_0$ and $d^2_{n-1} \cdots d^2_{i+1} d^2_i s^2_{i-1} \cdots s^2_1 s^2_0$, 0 ≤ i <
VII. CONCLUSIONS
The reliability of large-scale multiprocessor supersystems is a function of system structure and the fault tolerance of system components. Fault-tolerant intercommunication networks can aid in achieving satisfactory reliability. This paper has presented the ESC network, a derivative of the Generalized Cube network that has fault tolerance. The fault-tolerant capabilities of the ESC topology were proven. The partitioning and permuting abilities of the ESC were discussed. A minor adaptation of the routing tag and broadcast routing tag schemes designed for the Generalized Cube was described. This allows the use of tags to control a faulted as well as fault-free ESC.
The family of multistage interconnection networks of which the Generalized Cube is representative has received much attention in the literature. These networks have been proposed for use in supersystems such as PASM, PUMPS, the Ballistic Missile Defense Agency test bed, Ultracomputer, the Numerical Aerodynamic Simulator, and data flow machines. The ESC has the capabilities of the Generalized Cube plus fault tolerance for a relatively low additional cost. Distributed control of the network with routing tags is straightforward. Thus, the ESC has the potential for being a useful interconnection network for large-scale parallel/distributed supersystems. 
