The omega network has various attractive topological properties. It supports both one-to-one message routing and broadcast routing. Independent of the system size, every node in the network has a fixed size; therefore, it is used intensively in large-sized systems. In this paper, we examine a reliable omega-based multiprocessor system that preserves its full rigid omega configuration even in the presence of faults. The proposed omega interconnection network can tolerate any single and many multiple node failures, giving rise to significantly improved reliability. Reconfiguration in response to a single or multiple faults in the new design is easy and may be performed in a distributed manner. Unlike the reliable butterfly network, in the proposed reliable omega network, if a node at stage zero fails, the system will not lose a connection to one of its input/output ports.
INTRODUCTION
The interconnection topology of a large distributed-memory multiprocessor is critically important to overall system performance. The omega network has many attractive features, such as simple routing, low diameter and good support for broadcasting. In addition, every node in an omega network has a constant degree, independent of the system size. As a result, it has been built into the Illinois Cedar multiprocessor [1], the IBM RP3 [2] and the NYU Ultracomputer [3].
The omega network is a member of a class of multistage interconnection networks (MINs). These networks include the flip network [4], the indirect binary n-cube network [5], the data manipulator [6] and the regular SW banyan network with spread and fanout of 2 (S = F = 2) [7, 8]. Each of these networks consists of L inputs and L outputs. All of these networks are capable of connecting an arbitrary input terminal to an arbitrary output terminal. However, when we try to connect two or more input and output terminal pairs simultaneously, there may be conflicts among their connection paths. Because of this, these networks are known as blocking networks. It has been proven that these networks are topologically equivalent. Converting one network to another can be achieved by rearranging the position of switching elements without breaking the link connections in one network. The resulting network is equivalent to another except for the input and output terminal numbers.
1 Currently working with the Digital Telecommunication Institution in the West Bank.
A crucial design attribute of a large network system is its reliability. As the system size grows, the probability that all system nodes remain fault free during a given operation period falls quickly and may reach an unacceptably low level. It is thus necessary to incorporate redundancy into the system design to guarantee proper continuous operation even after the failure of some nodes. As soon as a fault arises and is detected, a fault-tolerant system reconfigures itself so as to isolate the failed nodes. If a system fails to retain its rigid original structure, it may no longer support, after reconfiguration, the service level it was designed to deliver. The problem of reliability and fault tolerance is considered the most important issue in real-time applications. Therefore, this problem has recently attracted many researchers [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20].
We propose an omega-based parallel system with enhanced reliability.
The reconfiguration process in response to an arising fault in this design involves several nodes and can be carried out in a distributed manner. A reconfigured system is assured to deliver the same high performance since it maintains the full rigid omega topology. We assume fault-free links; the link failure problem is beyond the scope of this paper. However, we believe that link failure is as important as node failure, even though it occurs less frequently than node failure and the degradation of the overall network performance as a result of a node failure is less than that caused by a link failure. Therefore, we have recently discussed this problem in [21].
Similar work has been done for the butterfly network [22]. Butterfly networks are a restricted subclass of omega networks, as no broadcast connections are allowed in butterfly networks [23]. Moreover, omega networks have many attractive features, as mentioned earlier, and they belong to a general class of blocking MINs. More interestingly, we have overcome the input/output (I/O) problem that arises in the reliable butterfly design proposed in [21]. In our design, if a node at stage zero (the first stage to the left) fails, the system does not lose connection to one of its I/O ports. This was not true in the reliable butterfly design.
This paper is organized as follows. Section 2 briefly reviews the omega architecture. The proposed reliable design is introduced in Section 3, followed by a description of its reconfiguration procedure in Section 4. Section 5 evaluates the reliability of the proposed design. Section 6 extends the design approach for further reliability improvement.
OVERVIEW OF THE OMEGA TOPOLOGY
An omega distributed-memory multiprocessor (Ω) of size N = L × (log_2 L + 1) is arranged as L levels of (log_2 L + 1) stages, with nodes in consecutive stages interconnected by a shuffle permutation pattern. N is the number of switches in the network. Figure 1 illustrates the network with 32 nodes (i.e. L = 8). Every link between two nodes can be used for bidirectional data transmission. Stages in the network are numbered in sequence from 0 to log_2 L, with 0 for the left-most stage. Similarly, the nodes in every stage are labeled from 0 to L − 1 (referred to as the level number), with 0 for the top node. We denote the node at stage g and level w as (g, w).
Every node has four links; two of them connected to nodes in the preceding stage and the other two connected to nodes in the next stage.
The interconnection topology can be formally described as follows. Let the natural binary representation of the decimal number x be {b_n, b_{n−1}, ..., b_1, b_0}. The output link x from stage g is connected to input link σ_n(x) of stage g + 1, where 0 ≤ g < log_2 L and σ_n is the perfect shuffle permutation function, i.e. if x is an (n + 1)-bit number x = {b_n, b_{n−1}, ..., b_1, b_0}, then σ_n(x) = {b_{n−1}, ..., b_1, b_0, b_n}. In Figure 1, L = 8, N = 32, n = 3 and x = 0, 1, ..., 15, so each link has a four-bit representation (b_3, b_2, b_1, b_0).
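The shuffle wiring described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original design: the function names are hypothetical, and it assumes that node w at a stage owns output links 2w and 2w + 1 and that input link y of the next stage belongs to node ⌊y/2⌋, a numbering that is consistent with the four-bit link labels quoted for L = 8.

```python
def perfect_shuffle(x: int, bits: int) -> int:
    """Cyclic left shift of a `bits`-wide link number (the perfect shuffle)."""
    msb = (x >> (bits - 1)) & 1
    return ((x << 1) & ((1 << bits) - 1)) | msb


def successors(level: int, L: int) -> list:
    """Levels of the two next-stage nodes reached from node `level`.

    Assumption (not stated in the text): node w drives output links
    2w and 2w + 1, and input link y belongs to node y // 2.
    """
    bits = L.bit_length()  # L = 2**n, so the 2L links need n + 1 bits
    return [perfect_shuffle(2 * level + port, bits) // 2 for port in (0, 1)]


def network_size(L: int) -> int:
    """N = L * (log2 L + 1) switches for an omega network with L levels."""
    return L * L.bit_length()  # bit_length of 2**n is n + 1


if __name__ == "__main__":
    print(network_size(8))   # number of switches for L = 8
    print(successors(1, 8))  # next-stage neighbours of level 1
```

Under these assumptions, level 1 and level 5 at one stage both feed levels 2 and 3 at the next stage, which matches the shuffle pattern for L = 8.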
THE PROPOSED RELIABLE OMEGA DESIGN
The main objective in the proposed design is to ensure that the system maintains its full strict structure despite the fault. In order to achieve that, we add one more stage at the end of the network and some extra links between the nodes in the network. The links are added according to a well-defined set of rules so that the faulty node can be replaced.
The rules to add extra links are as follows.
1. A link is added between stages g and (g + 2), 0 ≤ g < n − 1, where n = log_2 L. The new link adds a new connection between node (g, w) and node (g + 2, …), where b_n is the value of the nth bit position in the natural binary representation of w.
2. A link is added between the two stages n − 1 and n + 1. The new link adds a new connection between node (n − 1, w) and node (n + 1, …).
3. A link is added between some nodes at the same stage such that node (g, w) is connected to node (g, w + 1) if w is even, 0 < g ≤ n.
4. Finally, two extra links are added between nodes (n, w) and (n + 1, w).
The extra links and the extra stage are shown in Figure 2. Note that the extra link connecting node (g, w) to a node at stage (g + 2) always carries the same information as either the upper or the lower port of node (g, w). If w < L/2, node (g, w) transmits on the extra link the same information as that transmitted on its lower port; if w ≥ L/2, it transmits the same information as that transmitted on its upper port. The reason for this will become clear when we discuss the reconfiguration process in the next section.
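As a rough aid to reading the stage-skipping links, the sketch below computes one plausible target for the extra link of a node. It is an inference, not the authors' stated formula: it assumes the extra link of node (g, w) ends at the dual of the regular successor that is not w's own dual, where the dual of (g, w) is (g + 1, 2 × (w mod L/2) + b_n) for g ≤ n − 1 and (n + 1, w) for g = n, with b_n taken as the most significant bit of the n-bit level. The function names are hypothetical. The assumption agrees with the two cases mentioned in the text, (0, 5) to (2, 4) and (2, 0) to (4, 1), and with the lower-port/upper-port rule above.

```python
def dual(g: int, w: int, n: int) -> tuple:
    """Dual (replacement) node of (g, w) in a network with L = 2**n levels."""
    L = 1 << n
    if 0 <= g < n:
        msb = (w >> (n - 1)) & 1           # assumed meaning of b_n
        return (g + 1, 2 * (w % (L // 2)) + msb)
    if g == n:
        return (n + 1, w)                  # spare node to the right
    raise ValueError("spare nodes have no dual")


def extra_link_target(g: int, w: int, n: int) -> tuple:
    """Assumed end point of the extra link leaving node (g, w), 0 <= g <= n - 1."""
    if not 0 <= g <= n - 1:
        raise ValueError("stage-skipping links start at stages 0..n-1")
    L = 1 << n
    msb = (w >> (n - 1)) & 1
    # The regular successor that is NOT the dual of (g, w).
    other_successor = 2 * (w % (L // 2)) + (1 - msb)
    return dual(g + 1, other_successor, n)


if __name__ == "__main__":
    print(extra_link_target(0, 5, 3))
    print(extra_link_target(2, 0, 3))
```

For w < L/2 the "other" successor is the one reached through the lower port, which is why the extra link would mirror the lower port in that case.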
In the proposed design, we add control switches to connect (2 × i + 1, 2 × i + L) pairs of the original links, i = 0, 1, ..., n. Circles in Figure 1 represent locations where control switches are added. The control switches are all set to the 'X' state in normal operation (fault-free network), and when a node malfunctions some of those switches are set to either '∨' or '∧', as shown in Figure 3. When a node fails, the reconfiguration procedure determines which of the control switches have to change from 'X' (normal state) to either '∨' or '∧'. Note that the two states of the control switches depicted in Figure 3 are suggested on the basis of how the links are drawn in Figure 1. In actual designs, they can be different. What matters is that an appropriate connection is established when a fault occurs, such that the omega topology is maintained.
Moreover, four switches are added at each node. Those switches can be configured so that if a node fails, it can be bypassed and viewed as if it is stuck in the straight state. Figure 4 clarifies the node structure and what happens if it is faulty. The control switch and the four added switches at each node are normally much simpler [24] than a node and can be made more reliable as pointed out in earlier studies [25] .
The proposed design requires the same number of spare nodes as the well-known FTBDM design [21]. Although it requires twice as many control switches as the FTBDM, it is more reliable. An omega (Ω) network with added extra links and control switches is referred to as an FTΩ (which stands for fault-tolerant omega) network.
NODE REPLACEMENT POLICY AND RECONFIGURATION PROCEDURE
When any node fails, a reconfiguration procedure is executed to replace the faulty node and to maintain the omega topology. In this section, we define the replacement policy and the reconfiguration procedure that must be processed in a distributed manner in the case when a node fails.
Replacement policy
When a node fails, it has to be replaced and another node takes over its role. The failed node can be either a spare node or an active node. If a spare node fails, it can simply be bypassed and no replacement is required. The method of bypassing the spare node is explained later in the reconfiguration procedure. However, if an active node fails, it is bypassed and its role is taken over by its dual node, which in turn is replaced by its own dual, and so on. This process continues until a spare node is used to replace an active (original) node. The dual of a node is determined as follows.
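• The dual of node (g, w) is node (g + 1, 2 × (w mod L/2) + b_n), where mod is the modulus operator, b_n is the value of the nth bit position in the natural binary representation of w, and g ≤ n − 1. For example, the dual of node (0, 1) is node (1, 2).
• The dual of node (n, w) is node (n + 1, w), which is the spare node to its right in the same level.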
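The replacement policy can be traced with a short Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, L = 2^n, and b_n is read as the most significant bit of the n-bit level number, which is the reading that reproduces the chain (0, 1), (1, 2), (2, 4), (3, 1), (4, 1) used in the paper's example.

```python
def dual(g: int, w: int, n: int) -> tuple:
    """Dual of node (g, w) for a network with L = 2**n levels."""
    L = 1 << n
    if 0 <= g <= n - 1:
        b_n = (w >> (n - 1)) & 1           # assumed: top bit of the level
        return (g + 1, 2 * (w % (L // 2)) + b_n)
    if g == n:
        return (n + 1, w)                  # spare node in the same level
    raise ValueError("a spare node has no dual")


def replacement_chain(g: int, w: int, n: int) -> list:
    """Nodes that successively take over after active node (g, w) fails."""
    chain = []
    node = (g, w)
    while node[0] <= n:                    # stop once a spare has been used
        node = dual(node[0], node[1], n)
        chain.append(node)
    return chain


if __name__ == "__main__":
    print(replacement_chain(0, 1, 3))
```

Each element of the returned list replaces the node before it, and the last element is always a spare node at stage n + 1.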
Reconfiguration procedure
It is important to clearly define the node path set before introducing the reconfiguration procedure. The node path set is the set of all nodes that replace each other when a fault occurs at a node in the first stage (stage 0). Thus any node path has n + 1 nodes and can tolerate only a single fault. For example, if node (0, 1) fails, then node (1, 2) will replace node (0, 1), node (2, 4) will replace node (1, 2), node (3, 1) will replace node (2, 4), and finally node (4, 1) will replace node (3, 1), as shown in Figure 5. The nodes (1, 2), (2, 4), (3, 1) and (4, 1) constitute the node path set. If a node (g, w) fails in the FTΩ network, all nodes in the faulty node path that are in stages greater than g participate in the reconfiguration procedure. The action that each node should carry out is as follows.
• The failed node is bypassed by the four added switches as shown in Figure 4.
• Each node (k, w) in the node path set carries out one of the following steps in a distributed manner, according to its stage k and the parity of its own level w.
1. For k = g + 1 such that k < n: (a) if w is even; (b) if w is odd.
2. For g + 1 < k < n + 1: (a) if w is even; (b) if w is odd.
3. For k = n + 1: (a) if w is even; (b) if w is odd.
(Diagrams: for each case, the link connections of the node before and after reconfiguration.)
In the above diagrams, the solid lines are the active links. Thus, both before and after reconfiguration, each node has two inputs (two solid lines) and two outputs (two solid lines). Although the dashed lines are physically connected, they are functionally inactive.
It is clear from the reconfiguration procedure that a fault arising at stage g causes no more than (n + 1 − g) nodes to participate in the reconfiguration procedure, and a failure at a later stage involves fewer participants. When node (g, w) fails, node (g + 1, 2 × (w mod L/2) + b_n) detects the failure (e.g. using watchdog timers) and sends a signal to all nodes in the path set located in later stages to start the reconfiguration procedure. After receiving the signal, every node starts reconfiguration independently. Since every node involved performs its reconfiguration independently and simultaneously, the overall reconfiguration completes in constant time.
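The case selection of the reconfiguration procedure can be summarised in a small Python sketch. It is a hedged illustration only: the function names and the rule labels such as "2b" are hypothetical conveniences, b_n is assumed to be the most significant bit of the n-bit level, and the combination k = g + 1 = n, which the listed stage conditions do not cover, is reported as None rather than guessed. The sketch says which rule each participating node would apply; it does not model the switch settings themselves.

```python
def dual(g: int, w: int, n: int) -> tuple:
    """Dual of node (g, w) for L = 2**n levels (b_n assumed to be the top bit)."""
    L = 1 << n
    if 0 <= g <= n - 1:
        return (g + 1, 2 * (w % (L // 2)) + ((w >> (n - 1)) & 1))
    if g == n:
        return (n + 1, w)
    raise ValueError("a spare node has no dual")


def reconfiguration_plan(g: int, w: int, n: int) -> list:
    """List of (node, rule) pairs for the path set after node (g, w) fails."""
    plan = []
    node = (g, w)
    while node[0] <= n:
        node = dual(node[0], node[1], n)
        k, level = node
        if k == n + 1:
            rule = "3"
        elif k == g + 1 and k < n:
            rule = "1"
        elif g + 1 < k < n + 1:
            rule = "2"
        else:
            rule = None                    # k = g + 1 = n: not covered as listed
        if rule is not None:
            rule += "a" if level % 2 == 0 else "b"   # (a) even level, (b) odd
        plan.append((node, rule))
    return plan


if __name__ == "__main__":
    for node, rule in reconfiguration_plan(0, 1, 3):
        print(node, rule)
```

The length of the returned plan equals n + 1 − g, the participant count stated above, and every entry can be evaluated independently of the others, which is what allows the nodes to reconfigure in parallel.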
Illustrative example
In the following example we illustrate the replacement and reconfiguration procedures that take place if a node fails in the network.
As shown in Figure 5 , let node (0, 1) fail. First, we have to determine the node path set. According to the rules of the replacement policy, node (1, 2) is a dual to node (0, 1) and so it will replace it. Likewise node (2, 4) replaces (1, 2), node (3, 1) replaces (2, 4) and the spare node (4, 1) replaces node (3, 1). So the nodes (1, 2), (2, 4), (3, 1) and (4, 1) constitute a node path set. Note that after the replacement process is completed, the spare node becomes part of the system. In other words, the process of replacing each node by its dual in the node path continues until the spare node is used to replace an active node. The reconfiguration procedure is performed as follows.
Node (0 + 1, 2 × (1 mod 8/2) + 0) = (1, 2) detects the failure of node (0, 1), and initiates a signal to all nodes in the node path set located in higher stages to start the reconfiguration procedure. Namely, a signal is passed to nodes (2, 4), (3, 1) and (4, 1) . Thereafter, each node starts reconfiguration steps independently as follows.
S. BATAINEH AND G. E. QANZU'A
• The failed node (0, 1) is bypassed by reconfiguring the control switches as shown in Figure 3.
• For node (1, 2), (k = g + 1 = 0 + 1 = 1) < (n = 3) and its level is even, thus it is reconfigured according to rule 1(a). The dark lines connecting node (0, 1) and node (1, 2) demonstrate the new position of the switch.
• For node (2, 4), (g + 1 = 0 + 1 = 1) < (k = 2) < (n + 1 = 4) and its level is even, so it is reconfigured according to rule 2(a).
• For node (3, 1), (g + 1 = 0 + 1 = 1) < (k = 3) < (n + 1 = 4) and its level is odd, so it is reconfigured according to rule 2(b). Node (3, 1) therefore changes the connection at its input as shown in Figure 5.
• For node (4, 1), (k = 4) = (n + 1 = 4) and its level is odd, so it is reconfigured according to rule 3(b), as is also clear in Figure 5.
As depicted in Figure 5, nodes (1, 2), (2, 4), (3, 1) and (4, 1) carry out the reconfiguration procedure. The links affected by the reconfiguration procedure are darkened. After reconfiguration, every node has exactly four active links (even though intermediate nodes have seven links) and the full strict topology is preserved. One may think that other nodes in Figure 5, such as (2, 0), (1, 6) and (0, 5), should be involved in the reconfiguration procedure. In fact, these nodes need no reconfiguration of their own, because they are taken care of when the nodes in the path set are reconfigured. For example, node (2, 0), as explained earlier, transmits the same information on the extra (middle) link and on the lower link. Since node (4, 1) activates the extra link on its input port, the information passes through and is received correctly at node (4, 1). At the same time, the lower port of node (2, 0) is blocked, since the control switch at the upper port of node (3, 1) changes its setting from 'X' to '∧' as shown in Figure 5.
RELIABILITY ANALYSIS
Any single failure is tolerable in the FTΩ network. However, in a node path, which has n + 1 nodes, only a single fault can be tolerated. For example, if node (0, 1) fails, nodes (1, 2), (2, 4), (3, 1) and (4, 1) take over the role of nodes (0, 1), (1, 2), (2, 4) and (3, 1), respectively, as shown in Figure 5.
Following the failure of node (0, 1), a failure in node (0, 5) will be intolerable: node (1, 3), which should take over for node (0, 5), will not be able to provide an extra active link between node (0, 5) and node (2, 4). By the same token, any failure at nodes (0, 0), (0, 3), (0, 5), (1, 0), (1, 6) and (2, 0) is intolerable. Thus, following the failure of node (0, 1), these six nodes and the nodes in the faulty node path are called critical nodes.
In general, the number of critical nodes, , due to a failure of a node at stage g is
Let φ be the set of all critical nodes; then a subsequent node failure is tolerable only if it is not in φ. After the second tolerable node fault, certain other nodes become critical and φ is updated to include them. A third failure is tolerable provided that it is not in φ. If a faulty node belongs to the critical set φ, then the FTΩ cannot be reconfigured to operate as a regular Ω network.
Unlike the reliable butterfly network [21], in the FTΩ network, if a node at stage 0 fails, the system will not lose a connection to one of its I/O ports, as shown in Figure 5.
All nodes are assumed to have the same reliability R, with exponentially distributed lifetimes and a constant failure rate λ (i.e. R = e^{−λt}). The reliability of the added extra nodes is also assumed to be R. Consequently, the reliability of the FTΩ network can be expressed by
where N = L × (n + 2) is the total number of nodes in the system, including the extra nodes, and β_i is the number of ways in which i tolerable faults can occur in the system. β_i equals 0 for i exceeding the maximum number of tolerable faults, L. In order to evaluate R_FTΩ(t), it is necessary to calculate the number of possible locations for the ith tolerable faulty node, given that the system has already reconfigured itself successfully in response to the first (i − 1) faults. The number of possible locations depends on the positions of the i − 1 faults. The two initial values are β_0 = 1 and β_1 = N (the first fault can be at any node location), and a lower bound on the number of possible locations for the ith fault is derived below.
The maximum value of is
which is obtained when g = 0. A subsequent fault is tolerable, provided that the faulty node is not at any one of the max critical nodes. Subsequently, the second fault is within N − max possible locations, which implies that
Continuing in the same way, it is easy to see that, in general, for any given fault pattern with i −1 failures, the ith tolerable fault can be within N − (i − 1) max possible locations, and
for all i ≥ 2.
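The expressions referred to in this section can be written out in the following form. This is a hedged reconstruction from the surrounding definitions, not the authors' exact typesetting. The symbol Δ_max, standing for the maximum number of critical nodes, is a name assumed here, and its closed form in terms of n is deliberately not given. The division by i, which counts unordered sets of fault locations, is likewise an assumption; it is consistent with β_0 = 1 and β_1 = N.

```latex
\begin{align}
  R_{FT\Omega}(t) &= \sum_{i=0}^{L} \beta_i \, R^{\,N-i} \, (1-R)^{i},
      \qquad R = e^{-\lambda t}, \quad N = L\,(n+2), \\
  \beta_0 &= 1, \qquad \beta_1 = N, \qquad
  \beta_2 \;\ge\; \frac{N \,(N - \Delta_{\max})}{2}, \\
  \beta_i &\;\ge\; \beta_{i-1} \, \frac{N - (i-1)\,\Delta_{\max}}{i}
      \qquad \text{for all } i \ge 2 .
\end{align}
```

Replacing each β_i by its lower bound in the first line gives a lower bound on the reliability, since every term of the sum is non-negative.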
Since β_i is a lower bound, we may calculate the lower bound of R_FTΩ(t). The reliability lower bounds of the FTΩ network with L = 2^4 and 2^8 levels are plotted in Figures 7 and 8, respectively, where the node failure rate λ is assumed to be 1.0 per unit time. We compare the lower bound with that of the FTBDM because the latest work in [21] showed that the reliability of the FTBDM is superior to that of the butterfly-based distributed-memory multiprocessor (BDM for short) and the reconfigurable chain-structured butterfly architecture (RECBAR) [26]. It may seem that the small reliability improvement obtained over the FTBDM is outweighed by the cost of the extra control switches used in the FTΩ. This is not entirely true, because the control switches are simple and cheap and thus do not add a significant cost to the overall design.
Moreover, reliability improvement is not the only benefit achieved by using the FTΩ. A network designer may prefer the FTΩ, even if its reliability is comparable to that of the FTBDM, because it possesses some good features which are not found in the FTBDM. For example, the FTBDM does not support broadcasting; hence, it is considered a restricted class of the FTΩ network. Moreover, omega can easily replace other blocking MINs by simply rearranging the position of switching elements without breaking the link connections. Finally, and most importantly, the FTΩ overcomes the I/O problem which arises in the reliable butterfly network proposed in [8]. In the FTΩ, if a node at stage 0 fails, the system does not lose a connection to one of its I/O ports.
We believe that finding a lower bound is often sufficient, since the system is guaranteed to be operational with a likelihood no less than that specified by the lower bound. Moreover, it is extremely difficult to find an exact form. Thus, no further effort has been made to obtain the actual reliability.
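A lower bound of this kind can be evaluated numerically along the following lines. The sketch is an assumption-laden illustration rather than the computation behind Figures 7 and 8: it takes the bound to be the sum over i of β_i R^(N−i) (1 − R)^i with β_0 = 1 and β_i = β_(i−1) × (N − (i − 1) × delta_max) / i, and delta_max (the maximum number of critical nodes) is a hypothetical parameter that the caller must supply, because its closed form is not reproduced here. The terms are accumulated in log space so that large N does not overflow.

```python
import math


def reliability_lower_bound(N: int, delta_max: int, lam: float,
                            t: float, max_faults: int) -> float:
    """Assumed lower bound: sum_i beta_i * R**(N - i) * (1 - R)**i, R = exp(-lam*t)."""
    if t <= 0:
        return 1.0                                  # no node can have failed yet
    log_r = -lam * t                                # log R
    log_q = math.log(-math.expm1(log_r))            # log(1 - R)
    log_beta = 0.0                                  # log beta_0 = log 1
    total = 0.0
    for i in range(max_faults + 1):
        if i > 0:
            locations = N - (i - 1) * delta_max     # places left for the i-th fault
            if locations <= 0:
                break                               # bound contributes nothing more
            log_beta += math.log(locations) - math.log(i)
        total += math.exp(log_beta + (N - i) * log_r + i * log_q)
    return total


if __name__ == "__main__":
    # L = 8, n = 3, so N = L * (n + 2) = 40; delta_max = 10 is purely illustrative.
    for t in (0.0, 0.01, 0.05, 0.1):
        print(t, reliability_lower_bound(40, 10, 1.0, t, 8))
```

With max_faults set to 0 the function reduces to R^N, the reliability of a system with no fault tolerance, which gives a simple sanity check on any value chosen for delta_max.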
DESIGN EXTENSION
The proposed design can tolerate at most one node failure in a node path. Large systems designed for a long mission duration may need an even more reliable architecture. Our method can easily be extended to incorporate more than one stage of extra nodes in Ω for further reliability enhancement. In theory, any number of extra spare stages can be added, but here we analyze only the case of two spare stages. The general case may be analyzed similarly.
Consider an Ω with L = 2^n levels. One spare stage, called S1, is placed at the right-most side as before, and the second spare stage, called S2, is added to the right of stage i, 0 ≤ i ≤ n. The resulting structure, denoted by FTΩ^n_i, can be viewed as being composed of two parts, with the left part containing stages 0, 1, ..., i and S2 and the right part containing stages i + 1, i + 2, ..., n and S1. The extra connections between stages i − 1 and S2 are the same as those between stages n − 1 and S1, and two parallel connections are added between node (i, w) and spare node (S2, w) for all 0 ≤ w < L. The regular links between stages i and i + 1 are removed and replaced by regular links between S2 and i + 1. Figure 6 shows the FTΩ^3_2. Such a design can tolerate up to two failures per node path, provided that one failure is in the left part and the other is in the right part.
The overall reliability of FTΩ^n_i depends on the position of the second spare stage S2 after stage i. As i increases, the right part becomes more reliable, while the left part becomes less reliable. This is because, for large values of i, S2 has to cover failures in a larger part than S1, and vice versa for small values of i. The highest reliability is therefore achieved if the spare stages are placed equidistantly. The reliability lower bound can easily be obtained as the product of the lower bound reliabilities of the left and right parts. To calculate the lower bound reliability of either part, we use the same equation for R_FTΩ(t) as in Section 5.
To calculate the maximum number of critical nodes, we use the same equation, except that we substitute the number of stages covered in the part of interest for n (i.e. substitute i for n in the left part).
CONCLUSION
The reliability of large-scale distributed memory multiprocessor systems is a function of system structure and the fault tolerance of system components. Fault-tolerant interconnection networks can aid in achieving satisfactory reliability.
This paper has presented a fault-tolerant omega network that can tolerate any single and many multiple node faults. Our goal was to maintain the full rigid omega interconnection even in the presence of faults, so a reconfigured system can deliver the same high performance. The reconfiguration procedure is simple and involves only a small fraction of the system nodes and can be carried out in a distributed manner. It is shown that our design is more reliable and more general than an earlier reconfigurable design. This design style can be extended to incorporate more than one stage of extra nodes, to achieve a higher reliability.
