Reconfigurable tree architectures using subtree oriented fault tolerance by Lowrie, Matthew B.
I 
I 
I 
I 
I 
I 
I 
I 
I 
- I  
I 
I 
I 
I 
hlarch 1987 
1 
( B A S  A-CR- 18 0677) R ECC N P I G U B A  €LE I B  EE 
107-21421 
Unclas 
I- 
ARCHIIECTDRES C S I h G  S U E T B E E  C K l E b l l E D  PAUL? 
'LCLEEANCE ( I l l i n c i s  Uaiv,) 46 F A v a i l :  b'lIS HC A03/BP A01 CSCL 09B 
G3/61 0064834 
\ 
https://ntrs.nasa.gov/search.jsp?R=19870017988 2020-03-20T09:36:03+00:00Z
RECONFIGUR4BLE TREE ARCHITECTURES USING 
SUBTREE ORIENTED FAULT TOLERANCE 
Matthew B. Layrie 
Computer Systems Group 
Coordinated Science Laboratory 
University of Illinois 
1101 W. Springfield Ave. 
Urbana, U. 61801 
Abstruct- An approach to the design of reconfigurable tree architectures is presented in which 
spare processors are allocated at  the leaves. The approach is unique in that spares are associated 
with subtrees and sharing of spares between these subtrees can occur. The Subtree Oriented Fault 
Tolerance (SOFT) approach is more reliable than previous approaches capable of tolerating link and 
switch failures for both single chip and multi-chip tree implementations while reducing 
redundancy in terms of both spare processors and links. VLSI layout is O(n> for binary trees and is 
directly extensible to N-ary trees and fault tolerance through performance degradation. 
This research was supported in part by the Microelectronics and Computer Technology Corporation (MCC) under a 
VLSVCAD grant, by the National Aeronautics and Space Administration (NASA) under Contract NASA NAG 1-613, and by 
a fellowship from the A.T.&T. Be11 Laboratories. 
8 
I 
I 
I 
~I I. INTRODUCI'ION 
A continually rising demand for high performance computation has created a need for highly 
concurrent computer architectures. One architecture which has received significant attention is the 
tree topology [l. 2.3,4]. Tree architectures have an inherent ability to compute concurrently with 
typical communication times between the n processors being O(logn 1. However. as the number of 
processor nodes and communication links increases, the probability of single or multiple failures 
within structured concurrent architectures becomes unacceptably large. Consequently, recent 
interest has arisen in designing the ability to reconfigure concurrent architectures with one or more 
faults. Reconfigurability. which is one aspect of fault tolerance, is especially significant in tightly 
coupled tree architectures where the failure of a single link or processor can result in the 
subsequent loss of all communication with proceqors in the subtree below the faulty element. 
One of the initial reconfigurable binary tree proposal was made by Hayes. who developed a 
procedure for constructing 'optimal' 1-fault tolerant trees [51. which has since been extended by 
Kwan and Toida [6].  Raghavendra, Avizienis. and Ercegovac (RAE) [7], improved on these 
proposals by adding I sufficient redundant lines in order to tolerate multiple failures. Link 
redundancy for binary trees in the RAE approach is as high as 200%. and VLSI layout may require 
O(n1ogn) area [SI. Hassan and Agarwal [SI, recently presented a modular technique which 
allocates one spare to multilevel groups of processors. This scheme is conceptually similar to the 
RAE approach in that it dedicates every spare to one specific group of processors, but has the 
advantage of O(n ) layout and modularity for multichip architectures. Another proposal for 
reconfiguration which is applicable to tree architectures has been proposed by Rosenberg [9. 101. 
The approach requires a collinear layout with each node requiring access to a logn bus. 
Redundancy in terms of switching transistors is O(1ogn for each node. The switching structure 
provides for efficient utilization of spare processors. However, fault tolerance for the 
communication lines and the switching transistors is not considered. 
2 
One of the important objectives in designing for reconfiguration is efficient utilization of 
spares. If the architecture has k spare processors then the objective is to be able to tolerate any 
combination of k processor failures through reconfiguration. This should be accomplished with a 
reasonable increase in interconnect, manageable layout complexity for large numbers of processors. 
and a bounded number of pins per chip. In addition, failures in interconnect and switching 
structures should also be tolerable through reconfiguration. 
A strategy for satisfying these objectives for binary tree architectures is presented in this 
paper. The approach places spare processors at the leaves of the tree and provides for considerable 
flexibility in reconfiguration through sharing of spares between adjacent subtrees. This strategy, 
which is referred t o  as Subtree Oriented Fault Tolerance (SOFT), utilizes a virtual displacement 
technique to reconfigure a spare processor into the tree. The capability of sharing spare processors 
between subtrees provides the SOFT approach with significantly higher reliability than previous 
techniques allowing for switch and link fault failures, where reliability is t l e  probability that the 
tree is functional at a time t , given that it was fault free at time 0. In contrast to other proposals. 
SOFT is able to tolerate link and switch failures while reducing the number of redundant links 
between processors. For binary trees, the approach is shown to yield O(n ) layout. The architecture 
can be partitioned on separate chips for arbitrarily large trees, while providing fault tolerance for 
both on and off-chip connections. Fault tolerance through performance degradation is also possible 
with a SOFT design, as well as application to N -ary trees. 
In Section 11 of this paper, the SOFT architecture is presented for binary trees. Considerations 
for implementing a SOFT binary tree are discussed, including the placement of spare processors and 
communication links. Section III provides a formal analysis of reconfigurability in SOFT binary 
trees in the presence of processor. switch, and link failures. In Section IV. comparisons between the 
reliability of SOFT and past reconfigurable designs are presented. Finally. Section V extends the 
SOFT concept to N -ary trees. 
I 
I 
1 
I 
I 
1~ 
I 
I 
u 
1 
I 
8 
I 
I 
b 
I 
U 
3 
11. THE SOFT DESIGN FOR BINARY TREES 
The SOFT approach to reconfigurable tree architectures employs both spare processors at the 
leaves of the tree and additional links between processors to maintain a complete tree topology in 
the presence of multiple faulty processors and links. An example of SOFT architecture is 
illustrated in Figure 1. At levels high in the tree, failure of a node results in bypassing the node, 
thereby allowing information to flow directly between the faulty node’s father and one of its sons. 
Thus. the faulty node’s son assumes the tasks allocated to its failed father. Since the son is 
performing the tasks of its father. another processor must be found to assume the son’s tasks. In a 
similar fashion, one of the son’s sons assumes its responsibilities. This ’logical displacement’ 
continues until a spare is configured in at the leaves. A detailed discussion and an example of SOFT 
reconfiguration are presented in Section 111. 
All trees are said to have i +1 levels. The root is at level 0 and the leaves are on level i . The 
term ’upper levels’ refers to levels 0 through i-1. The root is labeled 1, the left child of any node 
n is labeled 2n and the right child is labeled 2n +l. Two nodes are adjacent if they are connected 
by a nonredundant or redundant communication link. The father of a node n on level k (f, is 
the adjacent node on level k -1. Similarly, the son of a node n is son,. f l  and sonl represent the 
father and son nodes of link 2 ,  respectively. The b r o t h -  of a node n (b, is the single node having 
the same f , . bl refers to either node connected to a redundant link 2 .  The left-most descendant of 
a subtree is the node which can be found by following only left descendants of the root of the 
subtree. Right-most descendants are defined similarly. The carsin of a node n (cous,) is the 
left(right.1-most node on the same level if n is the right(1eft)-most son of the root. For all other n . 
cuus,, is n -(+I1 if n is a left(right1 son of its father. The czncestor of a node n on a level q (AZ) 
is the single node on level q which contains n in its subtree. 
4 
B. AUocuhm of Redundant Nodes and LinRr 
The number of spare processors supported by the SOFT architecture is 2". where c is an 
integer: O<c di -1. Algorithm 1 is an algorithm for positioning these spares. The redundant links 
required by the SOFT architecture are allocated as described by Algorithm 2. The SOFT binary tree 
of Figure 1 was generated by Algorithms 1 and 2 with i =4 and c =2. A subtree with leaves 
x +k2"" to x + ( k  t1)2iSsr-l. where x is the left-most leaf of the root and Odk <2"-1. is 
referred to as a Spare SubTree. or SST. Each SST has an associated spare which is adjacent to its 
right-most leaf. The spare adjacent to an SST's left-most leaf is referred to as its mnussociated 
spare. In contrast to  X-tree or Hyper-tree structures [ll, 12.131. the SOFT topology is not a half- 
ring structure in which each level contains cousin connections instead of the n to b, connections 
utilized by SOFT. 
ALGORITHM 1: Placement of Spares. 
Begin 
issr :=i -c : 
for k := 1 to 2'-jSSr do begin 
{height of spare subtree} 
x := left-most leaf of root: 
n := x +k 2jSn-1 
add spare and connect as right son to n : 
connect spare to cous, ; 
end 
{associated spare of SST k} 
end. 
I 
I 
I 
5 
ALGORITHM 2 Placement of Redundant Links. 
Procedure Brother-connect (n : node) 
begin 
connect n to b, : 
if level(n ) < i 
then begin 
In is not a leaf) 
Brother-connect(2eft son of n >; 
Brother-connect(2eft son of b, 1; 
end 
if nodadjacent-to-spare(n >) 
then connect n to cous,, : 
else 
end 
begin 
Brother-connect(2eft son of root >; 
end. 
Figure 1. SOFT Architecture with c =2 and i =4. 
C. Redundmury Cahddbs  
The calculations for percentage of spare redundancy are straightforward: 
2' 9L node redundancy = 2i . 
6 
The number of redundant links is 
redundant links = 
Thus, the percentage of redundancy for links is 
2'-1 + 2i + 2 c  
2' +lei 9i link redundancy = 
For large i , this is approximately .5 + -25 + ( % redundancy of spares). Table 1 enumerates the 
possible percentage node redundancies for large i and the corresponding percentage of link 
redundancy. 
D. Imphenting SOlTArdritedures 
Implementing SOFT architectures involves the following assumptions. 
ASSUMPTION k Input/output through the leaves is not required. 17 
This is not a deficiency in the SOFI' philosophy but a convenience in describing the 
architecture. The assumption is not unreasonable since many of the algorithms appropriate for tree 
architectures do not require such I/O fl.2.41. In fact, the classical H-tree layout cannot 
accommodate 110 through the leaves for large trees. 
Table 1. SOFT Redundancy. 
= 12.5% 
-6.25% 
1 
1 
1 
I 
I 
I 
1 
1 
1 
I 
I 
I 
I 
I 
I 
1 
1 
1 
1 
7 
ASSUMPTION 2 All processors in the tree are identical. 0 
The necessity of this assumption is clear in light of the manner in which reconfiguration 
proceeds. The assumption may be relaxed somewhat since each processor on level I will only be 
asked to serve as a replacement for a processor on level 2-1. Most well-known algorithms for tree 
architectures utilize identical processors [l, 2,41. 
1 
I 
I )  Switching Scheme 
The virtual processor displacement concept of SOFT reconfiguration can be implemented with 
the switching scheme of Figure 2. Since all switching due to reconfiguration is performed by these 
switches, design of the processing elements is independent of the reconfiguration scheme. 
2) Multichip Trees 
If an entire tree cannot fit onto a single chip or wafer, then the tree must be partitioned for 
chip allocation. The major consideration on dividing a tree into subpieces is imposed by pin 
limitations. Partitioning a SOFT tree is straightforward. From Figure 1, it can be seen that at most 
one additional link per chip is required for chips containing no leaf processors. At level i , more 
than one link must be added per chip. However, by observing that there are two redundant links 
per node a t  the leaves, which equates to  the two father-to-son links of nodes at upper levels, it is 
evident that the number of pins per chip a t  the lower levels will be less for chips containing leaf 
brother father 
leaf leaf 
AI I I 
PE 
a. upper level switching. b. leaf switching c. Spare switching. 
Figure 2. SOFT switching in binary trees. 
8 
processors. While the implication is that a SOFT tree will require a t  least two different kinds of 
chips. the same may be true for modular sparing approaches such as E7.81. or even nonredundant 
trees if the tree has an odd number of levels. 
3) VLSI Layout of SOFT Trees 
To efficiently lay out a SOFT binary tree in VLSI. an adjustment in the general architecture is 
made in order to employ a variation of the optimal O ( n )  H-tree layout 114,151 in which the 
leaves are not fully connected. VLSI layout for a tree of i <4 follows the layout of Figure 3a. 
The location of spares depends upon the percentage of spares allocated to the tree. As in Algorithm 
1. the spares are located on the nonbrother redundant links. For trees with i >4 the layout 
algorithm presented below results in optimal area of O(n ). The result of Algorithm 3 with i =7 is 
depicted in Figure 3b. The ellipses represent the 5-level subtrees constructed in the first 'for loop' 
of the algorithm. 
Figure 3. VLSI Layout of SOFT Architecture I 
1 
9 
THEOREM 1: The SOFT layout, as described in Algorithm 3. for a binary tree of n leaves, has 
PROOF: It is well known that the area of an H-tree is O(n 1151. The layout can be thought of 
as having 0 (& ) rows and columns. The layout produced by Algorithm 3 (as in Figure 3) has at  
most one redundant link (additional row or column) parallel to each row or column of the O(n 
H-tree layout. Since each spare can be thought of as lying in a redundant link, the edge of the 
square corresponding to the layout of a SOFT tree is at most 0 (k 6 ), where k is a constant 
which, in the worst case is between 2 and 3. Since this is still O(& ). the area of the SOFT layout 
is O(n 1. 
The modification to the general SOFT approach necessary for VLSI layout has the following 
implications: 
(1) While the maximum number of spares is ==25%, the minimum is increased from an 
arbitrarily small percentage (1 per tree) t o  -3% (1 per 5 level subtree). 
ALGORITHM 3: VLSI Layout of Trees with i>4. 
begin 
for k := 1 to 2i4 do 
for k :== 1 to 2i-5 do 
construct main tree (m) as an i -5 level nonredundant binary tree: 
apply Algorithm 2 to m omitting all cousin connections; 
for k :== 1 to 2i-5 
z(k) := a 5-level SOFT tree constructed by Algorithms 1 and 2: 
connect root(z(2*k)) to root(2*k-l)); 
x := left-most leaf of root: 
move to n +k -1 
connect z(2*k) as left son; 
connect z(2*k-1) as right son; 
end 
end. 
. 
10 
(2 )  Since the 5 level subtrees do not share any connections among spares, SSTs cannot 'borrow' 
spares from neighbors not within that subtree. This does not, however, affect the reliability 
analysis presented in Section IV, which shows significant reliability enhancement is gained 
with the SOFT strategy. 
1 
1 
1 
1 
1 
I 
I 
1 
1 
I 
I 
1 
1 
I 
1 
1 
I 
I 
11 
111. FAULT TOLERANCE IN SOFT 
Necessary and sufficient conditions are presented for reconfigurability in SOFT architectures. 
Reconfiguration for faulty .processors is considered in Section A, followed by an analysis of 
reconfiguration for link failures and switch failures in Section B. The fault model for PES and 
links is functional in nature and includes any fault affecting the correct operation of the processor 
or the link under consideration. The fault model for switches consists of stuck-open or stuck- 
closed faults. Algorithms for dynamic reconfiguration are presented in Section C. Finally, static 
reconfiguration is discussed in Section D. 
A. Tderrmbe of Rocessor Failures 
In Section 1, basic properties of SOFT reconfiguration are presented. Based on these properties, 
necessary and sufficient conditions for reconfigurability are derived in Section 2. 
As discussed in Section 11. failure of a node in an upper level results in a series of 
displacements until a spare is configured in. If a leaf node n fails and it is adjacent to its SST’s 
spare, n is simply bypassed and replaced by the spare. If n is replacing f ,, n takes b, as one son 
and the spare as its other son. If n is not adjacent to the SST’s associated spare, it must be replaced 
by or take as its second son the non-brother leaf adjacent to it. G .  If a son of f a n  is adjacent to a 
spare, f 4r may use that spare as a second son, otherwise it takes a leaf adjacent to one of its sons. 
This displacement continues along level i until a spare is configured in. By convention, if a failure 
occurs in an SST. it displaces toward its associated spare if possible. If there are two failures in the 
SST. it must configure in the nonassociated spare. If the nonassociated spare had been configured in 
to its associated SST. then this SST must take its nonassociated spare. This continues until an 
unused spare is configured in. In Figure 4, an example of reconfiguration with 4 faults and 4 spares 
is presented. 
In analyzing the reliability of a recodguration scheme. it is necessary to determine both what 
fixed fault subsets (a set of processors in the tree designated as having failed) are reconfigurable 
c 
12 
Figure 4. Reconfiguration in an i-4, c-2 tree with 4 faults. 
(Le.. static reconfiguration), and for an existing machine with failures, whether the failure of an 
additional specific processor can be tolerated (i.e., dynamic reconfiguration). For SOFT dynamic 
reconfiguration. the order in which faults occur does not affect the ability of the architecture to  
reconfigure. As a result, if it is known exactly what sets of faults can be reconfigured. then a fault 
which occurs at time t can be reconfigured if and only if the new fault subset is reconfigurable. 
Similarly. if it is known exactly what failures can be reconfigured given a set of faults. then all 
sets of reconfigurable faults can be inductively determined. Consequently, the reconfigurability 
analysis of this section is applicable to both scenarios. 
I )  Properties of SOFT Reconjguration 
DEFINITION 1: Displacement is the logical movement of a node n to the physical position 
corresponding to f ,, in order to replace f due to either the failure or displacement of f n. At 
the leaves. displacement includes the logical movement of a leaf to  the left or right in order to 
replace 6, or cous,, . Displacement through node n refers to the act of displacing n . 0 
DEFINITION 2 Double displacement refers to an attempt to displace a node twice. For ex- 
ample. if n is displaced. it is assuming f 's tasks. If displacement were to occur through n 
again. n would assume g,, 's tasks. 0 
I 
I 
1 
1 
I 
I 
I 
1 
1 
1 
1 
I 
1 
I 
8 
I 
I 
I 
The following Lemmas present basic aspects of reconfigurability 
the necessary and sufficient conditions presented in Section 2. 
13 
which enable derivation of 
LEMMA 1: Double displacement occurs at upper levels if and only if displacement is attempted 
through a faulty node. 
PROOF: Displacement only occurs in descendants of faulty nodes. The lemma is not concerned 
with displacement of leaves so no double displacement occurs due to unavailability of spares. For 
two displacements to intersect at one node (double displacement, by Definition 2). the displacement 
due to a failure higher in the tree must intersect a previously displaced node. But it is necessary to 
0 pass through a father in order to reach one of its descendants. 
LEMMA 2 SOFT trees can not reconfigure using double displacement. 
PROOF: By Definition 3. a t  upper levels some node n is assuming g, 's  tasks. But n is not 
adjacent to b,,. Thus communication with bf, would be lost, and it would not be possible to 
maintain the rigid tree topology if double displacement occurred. At level i ,  the father of the 
0 double displaced node would be adjacent to only one leaf processor. 
LEMMA 3: At most one of n and b, can be displaced. 
PROOF: Displacement of both would imply that either f , has been displaced twice. which is 
not possible by Lemma 2, or there is a displacement through f , and f , has failed. which is also 
0 
LEMMA 4: Two failures within an SST, or displacement of the root of the SST, and a failure 
within the SST, are reconfigurable if and only if both spares adjacent to the SST can be reconfigured 
in as sons of leaves in the SST. 
impossible by Lemmas 1 and 2. 
PROOF: Displacement of the root of the SST is the same as failure of the root of the SST in 
terms of reconfiguration below that level. Consequently, only failures within the SST need be 
considered. If only one adjacent spare can be configured into the SST. and there are two failures in 
the SST. then reconfiguration is not possible by Lemma 2. If both adjacent spares can be used, then 
. 
14 
reconfiguration through fault free nodes is clearly possible, i.e.. a fault subset such as Figure 5a 
0 cannot exist with only t\ko failures. 
2 )  Analysis of Reconfigurability 
In previous approaches, unreconfigurable multiple failures correspond to more than one fault 
in a group of nodes which have been allocated a single spare [7,8]. Unreconfigurable multiple 
failures in the SOFT approach. regardless of the number and location of spares, correspond to 
faults which force double displacement. For example, consider the failure of a node n. 
Displacement must occur through a son,. Consequently. the presence of a fault subset as depicted 
in Figure 5a is not reconfigurable. If left son, has failed, then displacement must occur through 
right son, which implies that one of right son, 's sons must be fault free for reconfiguration to 
occur. The failure of Zeft son,, can be thought of as forcing the reconfiguration into the subtree 
with right son, as its root. Thus, Figures 5b and 5c are not reconfigurable since there is .no path 
from the highest faulty node to the leaves through only good nodes. 
In order to determine what fault scenarios are reconfigurable, spare sufficiency (SS is defined. 
SS is 2 boolean value associated with each node of the tree. 
A 
Figure Sa. 
Figure 5b. Figure 5c. 
Figure 5. Unreconfigurable faults a t  upper levels. 
15 
DEFINITION 3: The SS of a spare s is 1 if and only if s is fault free and not configured 
into the array, or, the nonassociated spare of the SST has SS of 1 and is not configured into the 
array as a son of s 's associated SST. or s is faulty, and the SST for which s is not associated 
has SS of 1. The SS of a node within an SST is equal to the SS value of its associated spare. 
The SS of a node n in a level above the SSTs is 1 if and only if SS(2eft son,,) or 
SS (right son,, ) is 1 and n is fault free, where SS (n ) denotes the spare sufficiency of n . 0 
In addition, the shiftability of a displaced node n , denoted as s (n ). is formally defined as: 
where good (n ) is 1 only if node n is fault free. 
Theorem 2 describes reconfigurability at the SST level. Corollaries 1-3 provide necessary and 
sufficient conditions for reconfiguration of a failure anywhere in a SOFT tree. Theorem 3 describes 
the class of fault subsets which are reconfigurable in SOFT architectures. 
THEom2: The failure of any node n within an SST whose root is not displaced is 
reconfigurable if and only if SS (n is 1. 
- PROOF: From Definition 3, for a node in SST S to have SS of 1, either (A) its associated spare 
is available, or (B) the associated spare of S is faulty and its nonassociated SST has SS of 1. or (C) 
there exists an SST whose spare is available such that all SSTs 'left' of S and 'right' of the 
available spare have either faulty spares or  spares configured in to their associated SSTs. If (A) is 
true, then the failure is reconfigurable (see Lemma 4). If (A) is not true but (B) is true, then either 
reconfiguration is possible through the failed spare into its nonassociated SST. by the above analysis 
and Definition 3, or (C) is true. If (C) is true (but (A) and (B) are not true). then the 
nonassociated spare of S must be configured into S .  If the SS of S is 1 then at most one fault has 
occurred in S prior to this failure, by Definition 3. Lemma 4 indicates that configuration of the 
nonassociated spare is possible. If the nonassociated spare was not being used, then the 
reconfiguration is finished. Otherwise. the nonassociated spare had been configured in as a son of its 
associated ST. If the nonassociated spare has failed then reconfiguration proceeds by bypassing the 
16 
spare. as though the SST were larger (using the switches of Figure 2c). Since SS of this spare was 
1. reconfiguration can proceed toward its nonassociated spare, by the same analysis. This continues 
until the available spare is configured into the array. The algorithms of Section C guarantee that a 
nonassociated spare is configured in only if there is no configuration such that only associated 
spares are configured in. As a result, if a spare is configured in to its nonassociated SST, then 
further reconfiguration cannot proceed in this direction without double displacement, thus 
satisfying the necessary condition. 0 
COROLLARY 1: Failure of node n , above the SST level, is reconfigurable if and only if the node 
is undisplaced and SS (n is 1. or, the node is displaced and s (n ) is 1. 
PROOF: The proof considers the displaced and undisplaced conditions separately. If an 
undisplaced node n above the S T  level fails, then it reconfigures toward the root of an SST. If 
S ( n )  is 1 then there is a path through fault free nodes to an SST which has SS of 1, which 
follows from Definition 3. Since failure of the root of an SST and its displacement are equivalent 
below the SST level, this is reconfigurable by Theorem 2. If SS (n is 0 then either there is no path 
through fault free nodes to the SST level, in which case the failure is not reconfigurable by Lemmas 
1 and 2,  or such a path exists only to the root of an SST with SS of 0 which is not reconfigurable by 
Theorem 2. If n is displaced, the failure of n is tolerable if the displacement can be moved into 
another sub-tree. When n fails, if b, can assume f , 's position, then n 's displacement can be 
shifted into the b, sub-tree. By Lemma 3. b, can not already be displaced. As above. b, can 
assume its father's logical position if and only if SS (b, ) is 1. Shifting the displacement to b, is the 
only alternative if f, is the failure creating n 's displacement. If f , has not failed, then an 
alternative is to shift the displacement into b,,. This can occur if and only if SS(b t , )  is 1. 
Similarly. if neither f, nor f r n  has failed. the displacement can be shifted to the brother of f t n .  
Consequently, the displacement can be shifted to the brother of any ancestor, A , if and only if bA 
0 has SS of 1 (as above) und all ancestors of n which are on level below A are fault free. 
17 
COROLLARY 2 Failure of a non-redundant node n within SST S can be tolerated if and only 
if either SS (n ) is 1. or the root of S , r , is displaced and s (t. ) is 1. 
PROOF: When a failure in an SST is detected, there are three possibilities for reconfiguration. If 
and only if S S ( n )  is 1. reconfiguration can proceed toward either the associated or the 
nonassociated spare, from Theorem 2. If t has been displaced, then the third alternative is to shift 
r 's displacement into another SST, and use S's associated spare to reconfigure for n 's failure. From 
is 1. 0 
COROLLARY 3: The failure of a spare n is tolerable if and only if the failure of a non- 
Corollary 1. this is possible if and only if s (r 
redundant node in either the associated or unassociated SST is tolerable. 
PROOF. If the spare is not configured into the array, then its failure is tolerable. A failure in 
either SST adjacent to the SST is also tolerable. If the spare has been configured into the array, then 
there are two possibilities for reconfiguration. The first is to undo the displacement causing the 
spare to be configured into the array. The second is to bypass the spare and use the neighboring 
SSTs other adjacent spare. To do the first, it is necessary and sufficient that the spare had been 
configured into its associated SST (if the spare is configured into its unassociated SST, then the 
displacement cannot be shifted out of that SST) and a failure within the SST is tolerable (this 
follows from Lemma 4 and Corollary 2). If the spare is to be bypassed, then it is necessary and 
sufficient that a failure in the neighboring SST be tolerable. also by Lemma 4 and Corollary 2. 0 
THEOREM 3: A SOFT tree is properly reconfigured if and only if each faulty node above the 
SST level has a path through fault-free nodes to the i th  level (i.e.. no subsets such as in Figure 5). 
and there are no more than x +1 faults or displacements of the root of an SST in any x adjacent 
SSTs. and there are a t  least x +1 spares in the tree. 
PROOF: If there is no failure such that there is no path through fault-free nodes to the roots 
of the SSTs. then reconfiguration above the SSTs can proceed through fault free nodes into the roots 
of the SSTs (from Corollary 1). The question of reconfigurability then concerns only the 
availability of spares at the SST level. If there are x +1 faults in x SSTs. there are exactly x +1 
18 
spares available. At most two failures or a failure and a displacement of the root can occur within 
an SST. otherwise the condition of the Corollary is violated for x =1. Lemma 4 indicates that any 
two failures within an SST can be tolerated as long as the nonassociated adjacent spare can be 
0 reconfigured in, Since there are x +l spares available, the faulty PES are reconfigurable. 
B. Tdermrce of Switch and Link Failures 
In this section, a SOFT architecture is shown to be capable of tolerating functional failures in 
communication links and stuck-open and closed faults in the switches of Figure 2. The following 
abbreviations are used in describing each node: PFF denotes the processor of a node as being fault 
free, SFF represents the switches associated with a node as being fault free, and LFF indicates the 
links connected to a node are fault free. 
I )  Failwe of Links 
DEFINITION 4: A node of a tree is replaceable if and only if it is PFF, and, in the given 
fault scenario, the failure of that processor is reconfigurable. 0 
The difference between replaceability and SS is that replaceability assumes that there are no 
link/switch failures to prevent reconfiguration. Similarly, replaceability is substituted for spare 
sufficiency in the definition of shiftability, s (n ). 
DEFINITION 5: A rwnredundant link I (link between n and f is isolated i f  A) f I ,  
left sorat,, and r i g k  sonf, are SFF and LFF (except for the failed link), and B) b,, is replace- 
able. 0 
LEMMA 5: If a faulty nonredundant link is isolated, then the tree can be reconfigured around 
that link. 
PROOF: The switching scheme of Figure 2 allows the faulty link to be removed from the tree 
if b, can be displaced. The restriction of Definition 5 (B) guarantees that b,,, can be displaced. 
Part A of Definition 5 guarantees that the switching around the failed link will allow this 
I 
I 
I 
I 
I 
I 
I 
I 
I 
8 
I 
1 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
1 
1 
I 
I 
1 
I 
I 
I 
I 
19 
reconfiguration to occur. 0 
If no ancestor has failed, then the link failure is considered as a failure of f I .  If an ancestor 
has failed, then its displacement is shifted through f thereby freeing a spare in another SST for 
use in case of another failure. 
DEFINITION 6: A redundant link I ,  in an upper level, is isolated if: A) f b1 is PFF. SFF, 
and LFF. and both brothers are SFF, and B) s ( f b I )  is 1. 0 
LEMMA 6: If a redundant link, at upper levels, is isolated then its failure is reconfigurable. 
PROOF: The restrictions of the definition, and Corollary 2. guarantee that the displacement 
which necessitates the use of the redundant link can be shifted into another subtree. Thus, the 
0 faulty redundant link is no longer configured into the tree. 
DEFINITION 7: A redundant link I in level i is isolated if either: A) the displacement is to 
the left (right) and the left (right) brother is replaceable, or B) the displacement is from above 
and f b I  is PFF, SFF and LFF. and s ( f b l )  is 1. 0 
Consequently. if f b, has failed then the displacement is not shiftable. 
LEMMA 7: If any redundant link is isolated then its failure is reconfigurable. 
PROOF: At upper levels this is true by Lemma 6.  At level i ,  Definition 7 applies. If the 
displacement is to the left or right (Definition 7A). and if that leaf is replaceable. the displacement 
which caused the use of this redundant link can be reversed to  configure in a different spare. Thus, 
the redundant link once again becomes isolated from the tree. If the reconfiguration involving the 
redundant link is a result of a displacement of the father, then the displacement of the father must 
be shifted into another subtree, and reconfiguration in level i must be in the direction away from 
the faulty link, otherwise the faulty link will not be codgured out of the tree (B). This is possible 
0 under the condition of part B of Definition 7 by Corollary 2. 
20 
THEOREM 4: Isolated link failures can be tolerated in SOFT architectures. 
PROOF: From Lemmas 5 and 7, all inter-processor isolated link failures can be tolerated. A 
fault in a link connected directly to a processor (i.e.. links .A and B in Figure 2a) can be modeled as 
a failure of the processor itself. In this case, the link fault is reconfigurable if the processor is 
replaceable. Failure of link C in Figure 2a only prevents displacement to the left if the PE has 
failed (displacement to the right is possible), or displacement to the right 'if the PE has not failed 
(displacement to the left is still possible). In Figure 2b. failure of link C prevents displacement of 
0 the node to the left or right if the node is not faulty, and to the left only if node has failed. 
2 )  Failure of Switches 
DEFINITION 8: Any switch in a nonspare node is defined to be isdated if it is in a node 
which is PFF. SFF (except for the faulty switch), LFF. and the node is replaceable. Any switch 
in a spare node is defined to be isolated if the spare is SFF (except for the faulty switch), LFF. 
and the repiaceability of the SSTs adjacent to the spare are 1. 0 
THEOREM 5: If a switch is isolated, then its failure is reconfigurable. 
PROOF: By symmetry, the only switches which need to be considered are switches 1 - 5 in 
Figures 2a and 2b and the switches of 2c. Since the node is PFF. SFF. and LFF prior to the switch 
failure, the node can support any reconfiguration for which either the faulty switch is not 
necessary (stuck-open fault). or the switch should be closed, or the closing of the switch has no 
effect. The fact that the node is replaceable indicates that the failure of this node is tolerable. As a 
result. if the node is displaced it can be returned to the undisplaced state. This can be done at upper 
levels by shifting the displacement into another subtree. At the leaves, the reconfiguration is 
shiftable since the failure of this leaf is reconfigurable. Failure of the switches of Figures 2a and 
2b are tolerable by the following analysis: 
(1) Switch 1: A stuck-open fault is recodgurable as a failure of the processor. Reconfiguration of 
stuck-closed is not required since the node is PFF. 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
1 
I 
I 
D 
I 
I 
I 
I 
1 
I 
I 
I 
I 
I 
I 
1 
I 
I 
I 
I 
21 
(2) Switches 2,3,4,5: Stuck-open fault is reconfigurable by placing the node in the undisplaced 
state. Stuck-closed can also be properly configured in the undisplaced state. 
In spare switching (Figure 2c). for the case of stuck-closed faults, the spare is not configured in 
unless it is needed. Consequently, stuck-closed faults have no effect. For stuck-open faults, if the 
spare is not configured in, then the replacebility in both SSTs is certainly 1. and failure of any 
switch is tolerable. If the spare is configured in, then a stuck-open fault is tolerable since either SST 
can reconflgure a different spare in, which means that there is a configuration which does not 
employ these switches. 0 
In the first section, the reconfiguration algorithms are presented assuming that the links and 
reconfiguration mechanism are fault-free. In Section 2. the algorithms presented in this section are 
extended to include link and switch fault tolerance. Finally, failure of the control mechanism is 
considered in Section 3. 
. 
I )  Reconjgtrration of Node Failures. 
The reconfiguration algorithm for high-level nodes is presented first. This corresponds to all 
processing elements above the SST level. If the SOFT tree has been allocated one spare per leaf, or 
a5070 spares, then all nodes except for the leaves and spares follow this algorithm. Reconfiguration 
algorithms for SST nodes, leaves, and spares are presented next. The algorithms are appropriate for 
SSTs of arbitrary size. 
The five basic configurations for upper level nodes are shown in Figure 6.  Figure 6a shows the 
normal or starting configuration of all nodes. Figure 6b depicts a node which has been displaced. In 
this section, the configuration for a displaced node is the same, independent of which son is 
displaced. Determination of which son is displaced (and the subsequent switch settings) is done 
within the algorithms. Figure 6c shows a node whose brother has been displaced. By Lemma 3, 
22 
a. b. C. d. e. 
Figure 6. Switch configurations for upper level nodes. 
I 
I 
I 
I 
only one of two brothers can be displaced a t  any one point in time. Figures 6d and 6e represent I 
I failed nodes whose brother is and is not displaced, respectively. 
A simple 
variables with 
technique for controlling the configuration of the switches is to associate three state 
each node: good, disp, and fatherdisp. The variables are defined as follows: 
good : True if and only if the node is functioning properly. 
disp : 
fatherdisp : True if and only if the node's brother has been displaced. 
True if and only if the node is displaced (see Definition 2). 
The node's configuration is based on Boolean expressions involving these three variables. Table 
2 gives the configurations from Figure 6 and the conditions under which each configuration occurs. 
The control of the internal state variables requires some form of communication between 
nodes. For this purpose, three signals are defined: recon, brorecon, and unrecon. 
I 
I 
I 
I 
I 
I 
recon : Recon is issued from a displaced or failed father to the son which is to assume the 
identity of the father in the reconfigured array. I 
Table 2. Configuration versus Expression for High-level Nodes. I 
I 
I 
I 
I 
I 
I 
1 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
23 
unrecon: Unrecon is issued from a son to its father and brother to indicate that the father 
should no longer be displaced. 
brorecon : Brorecon is issued from the brother or father of a node to indicate that its brother has 
been displaced. 
Additionally, a signal (fail ) which is internal to the node is required which indicates that the 
failure of the node has just been detected. 
Two global variables are used to  govern the direction of reconfiguration: spare sufficiency (SS 1 
and p a t h .  SS in these algorithms is defined below, and is not identical to the SS defined in Section 
A, although the intuitive meanings are similar. SS and p a t h  are recursively defined as follows. 
ss: SS of a leaf is 1 if and only if the processor is fault-free and its associated spare is 
both fault-free and not configured into the array. SS of nodes in levels 0 to i -1 are 1 
if and only if SS of the node's left or right son is 1 a d  the node is fault-free. 
' - 
p a t h  : The path of a leaf is one if the leaf is fault-free. The p a t h  of all other nodes is one if 
and only if the node is fault-free and the p a t h  of its left or right son is one. 
With these parameters, a reconfiguration algorithm for upper level nodes has been derived, 
and is presented as Algorithm 4. A subscript such as SSrig, denotes the spare sufficiency of the 
right son of a given node. The function of the algorithm is to wait for a signal and then respond 
(reconfigure) appropriately. Each of the five if statements corresponds to signals. For instance, if a 
node receives a recon signal, then it sets its own switches and issues a recon to one of its sons on 
the basis of its internal and global variables. 
For Algorithms 4 and 5,  the definitions of p a t h  and SS assign the leaves values on the basis of 
their internal variables, and the upper level nodes formulate their values for these variables by 
ORing their sons' values. If SSTs have more than one node. then the root of the SST takes the place 
of leaves in those definitions. Thus. only nodes in levels above the SSTs follow Algorithm 4. 
24 
ALGORITHM 4: 5wo SPARING. 
While true do 
if recon 
then if then *tree fails' 
serdisp : issue brorecon 
if ss k f t , f t + G P a t h  IC 
then issue recon to Cft son 
else issue recon to right son 
if brorecon 
if wfrecon 
then set fatherdisp 
then if s i g d  is  from brother 
then clear fotherdisp 
else clear disp 
if good 
then issue unrecon 
else set fail 
if fail adisp 
if fail e d i s p  
then clear good ; clear disp ; issue m e c o n  
then issue Femn to left son 
else issue recon to right son ' - 
end. 
Reconfiguration of nonleaf nodes within SSTs is similar to the reconfiguration just described. 
In Algorithm 4. reconfiguration proceeds in the direction of the left son where possible. For SST 
nodes, the opposite is the case. The path variable is redefined, and spare availability (SA takes the 
place of SS . 
path : path for a leaf node is defined as one if and only if the leaf is fault-free and it is not 
displaced (to the left, right, or up). path of all other SST nodes is high if the path of 
its left or right son is high and the node is fault-free. 
SA : SA is high if and only if the SST's associated spare is fault-free and not in use. 
rootdisp : rootdisp is true if and only if the root of the SST has been displaced. 
While both SA and rootdisp depend only on the status of one node, this does not imply that 
separate lines need to be run between the spare or the root and all nodes within the SST. 
Alternatively, the signals could be passed between nodes in the same manner as SS and path. The 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
1 
I 
I 
25 
reconfiguration procedure for nonleaf SST nodes is presented in Algorithm 5. 
The algorithm for leaf nodes is presented for the general case. The algorithm is appropriate for 
arbitrarily large SSTs. Simplifications exist for small SSTs (25 - 50% sparing). With 2540% 
sparing, the leaf algorithm becomes similar to the SST algorithm. 
The possible line configurations for leaf nodes are shown in Figure 7. Since the possible 
switch configurations are unique from those at upper levels, a new set of internal state variables is 
defined. 
ALGORITHM 5: SST NODES. 
While true do 
if recon 
then if Pood then *tree fails' 
setvdisp * issue bt-wecon 
if SA +pcIth,ft+rmtdisp 
then issue recon to right son 
else issue recon to left son 
if brorecon 
if unrecon 
then set fatherdisp 
then if signal is from brother 
then clear fatherdisp 
else clear disp 
if good 
then issue unrecon 
else set fail 
if fail .disp 
if fail m d i s p  
then clear good ; clear disp ; issue w e c o n  
- thencl targood 
if SA +p-+r+rotdisp 
then issue recon to riaht son 
else issue recon to le i t  son 
end. 
4 L J - A  d -pJ--- 
a. b. C. d. e. 
Figure 7. Switch configurations for leaf nodes. 
26 
Figure 
7b. 
7d. 
7a. 
7C. 
7e. 
diJpright: disprigh( indicates that the leaf is in the displaced state to the right, i. e.. it is assuming 
the identity of its right neighbor. 
Expression 
diSpright 
disp left-disp ri ht 
disp kfr-  + u p  right 
goad * a-+- iSpri ht LsPh 
g x e e - d  
displeft: as above to the left. 
When dispright and dispkD are both high, this indicates that the node is displaced to level i -1. 
good : as before. 
The leaf node's configuration is based on Boolean expressions of these three variables. Table 3 
enumerates the configurations of Figure 7 and the conditions for their use. 
Similarly. the internode signals need to be redefined. 
recon,, : recondi, is as recon, in the direction (left. right, or up) indicated by dir, i. e., reconup is 
a request, from father to leaf, to displace the node in order to replace its father. 
ultrecon has the same function: however, it is issued only by leaves which are right 
sons and is sent to both the node's father and brother. 
unrecon : 
brorecon : as before. 
fail : as before. 
Using these parameters, Algorithm 6, on page 28. was derived. As in the previous algorithms. 
signals corresponding to the if statements are input to the node, and the node's configuration and 
output signals react on the basis of these. 
I 
I 
D 
d 
I 
I 
I 
1 
I 
I 
I 
1 
1 
1 
I 
I 
1 
I 
1 
I 
I 
1 
1 
1 
I 
1 
I 
1 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
27 
Figure 8 depicts the possible spare configurations, and Table 4 indicates the expressions which 
must be true for each configuration to be realized. Algorithm 7 presents the control strategy for 
spare reconfiguration (See p. 28.). 
The reconfiguration algorithms require communication between adjacent nodes. This includes 
both updating values such as SS and path as well as issuing signals such as recon and unrecon . 
While separate lines could be allocated for these signals, it is also possible to allocate the data lines 
for these purposes. 
A similar issue is whether to allocate separate hardware for reconfiguration or to implement 
the algorithms in soft/firmware. Implementing the algorithms in hardware is straightforward. To 
implement the algorithms (and diagnosis) in software, however, requires that a failed node never 
perform diagnosis or reconfiguration. This can be accomplished by having the father perform 
diagnosis and reconfiguration for both of its sons. Thus, the sons would have registers for their 
switch configurations and no other hardware to determine its switch configuration. When a node 
failure on level i has been discovered, the failing node's father, on level i-1.  reconfigures the 
node's switches and then performs reconfiguration for its new son on level i 4-1. This requires mild 
modification of the preceding algorithms, but is straightforward. 
28 
ALGORITHM 6: LEAF NODES. 
While true do 
if reconri h 
then if (not adjacent to spare I+- 
then set disp right; issue recon right 
else issue unrecon ; clear disp lcfi 
if brarecon 
then if node is a left son 
then if disp right then issue recon kft 
if good 
then set disp &ft; clear disp right 
else clear disp right; clear disp hft 
if g d  '$%n clear disp ~cfl; clear disp right 
else set dis 
- 
if recon up 
then if good *disp 
thenif tK%e is a ri ht son 
then if +- tsp right then issue recon r i g ~  
elre issue recon rcf t  
set disp right; set disp kft; issue brorecon 
else *tree fails* 
if recon l ~ f t  
then if disp r i g u * F  
then clear t sp  right; issue recon lcft 
if g d  
then clear disp left 
if disp lefl.disp right 
if nude is  a right son 
then clear disp 
else *tree fails' 
clear disp &ft 
issue unrecon 
if--good 
then set disp left; issue recon ~cft 
then clear disp hft 
else setdis left 
if unrecon (implies node is a left son) 
then if good 
if fail Sdisp right* 6-4 tsp hft 
then issue recon clear good 
if fail *dispbft*& then *tree fails. 
if fail -disp kft*disp ri ht 
if node ts  a righ son 
then issue unrecon; clear disp kft; clear good 
else 'tree fails* 
if fail .displ,disp 
then if d + r O O ~ ~ ? p  
then issue recon right ; set recon right; clear good 
else issue recon left ; set recon kfi;  clear good 
end. 
1 
1 
I 
I 
I 
I 
I 
I 
29 
I 
I 
~I 
1 
I 
I 
I 
I 
1 
-I 
‘ I  
iI 
I 
I 
11 I 
Table 4. Configuration versus Expression for Spare Nodes. 
v 
0 b d @  
a. b. C. d. 
Figure 8. Switch configurations for spare processors. 
ALGORITHM 7: SPARES. 
While true do 
if reconri ht 
then if gomi 
then set dispright 
else issue recon right 
if disp .i,ht+good 
if recon 
then set disp hft 
then issue recon kfi; clear disp right 
if fail *dispkft 
if fail -disprigu 
then issue recon kft; clear good 
then clear good 
if SA right+rddiSpright 
then issue recon right 
else issue recon kft; clear disp 
end. 
2) Reconjiguratwn for Link and Switch Failures 
As with processing elements, the fault model employed for links and switches is a functional 
fault model. Stuck-at faults and shorting of lines have been considered. The fault model for 
switches consists of stuck-open and stuck-closed. Such a fault model applies to faulty input a t  the 
switch control as well as to failure of the switch. 
In Figures 6b. 6d. and 6e. a displaced or bypassed node is shown as codguring the switches 
such that the son or bypass data are passed on to both sons. Naturally, this is unacceptable if link 
30 
fault tolerance is to be provided (although it allows the three internal state variables to be set 
mnemonically). Instead, the switches must be set based on the direction of reconfiguration. Thus. 
the number of possible switch conflgurations grows from five to eight. Three internal state 
variables are, therefore, still sufficient. A switching scheme which allows for tolerance of link 
failures was presented in Figure 2. The corresponding modifications to Algorithms 4 and 5 consist 
only of setting the internal state variables to distinguish the direction of reconfiguration. 
Reconfiguration around a failed link I can be implemented by: 
1. Clear path and SS of the son of the link. 
2. Set fail of father of the link. 
If an ancestor of f I has failed, then the corresponding displacement can be shifted through f 1 ,  
thereby avoiding the use of an extra spare. Assuming link failures occur with relatively low 
probability, however, this may not be worth the overhead. Although both the brother and father 
links of a node cannot fail without the tree failing, it should be noted that even if these two lines 
are shorted together the tree can still function. In this case, setting foil of the sonlbrother of the 
shorted links results in correct reconfiguration. 
Failure of a redundant link implies that it is either in use or attempted use. Reconfiguration 
for failure of a redundant link can be implemented as follows: 
1. Clear path and SS of both nodes adjacent to the link. 
2. Set foil of the displaced node. 
If the spare which is configured in to replace the node is needed, then an unrecon signal will travel 
from the leaves up to the node whose brother link has failed. At this point. the node can be 
reconfigured back into the array. 
Failure of the lines within the nodes must also be considered. Failure of lines such as A and B 
of Figure 2a can clearly be modeled as failure of the PE itself. Failure of C is handled in the 
31 
following manner: 
1. Clear path rigk. 
2. Set fail of the PE. 
These modifications apply to the switching scheme of upper level nodes. The modifications for 
leaves and spares are similar but with less complexity due to their simpler switching structure. 
In the stuck model of switch failure, the simplest way to deal with a stuck-open fault of a 
switch is to handle it as a break in the internode line. For the case of stuck-closed faults. the only 
time a stuck-closed fault requires an alternate reconfiguration strategy is when the switch is closed 
but the current switch configuration requires the switch to be open. Taking advantage of 
symmetry. switches 1 through 5 of Figure 2a will be considered. If switch 1 is stuck-closed, this is 
only an issue if the PE has failed. This implies that the line to the left son is no longer reliable. 
Thus, the steps to tolerate this failure are the same as the link failure. If switch 2 is stuck-closed. 
then it. too, can be modeled as a failureof the line to the left son. Significantly. this is only an issue 
if displacement to the left or brorecon is attempted. Similarly, switch 3 being stuck-closed can be 
handled by considering the PE to have failed displacing through the right son (set pathleft and 
S S ,  to zero). Switch 4 being stuck-closed can be handled by treating the PE as faulty. Similarly, 
switch 5 can be handled by considering the PE to have failed. Similar results have been obtained 
for leaf switching. 
3) Reconjigvation for control Failures 
The final area of switch and link failure to be considered concerns failure in the 
reconfiguration control hardware. If a soft/firmware approach is used, this is not as much of an 
issue, since only good processors perform the reconfiguration algorithms. The assumption that 
switch and link failure is isolated in its incidence is extended to control failure as well. For 
instance. if a node's good variable is stuck-at-one, then there is little that can be done if the node 
fails. This corresponds to an undetected fault, which cannot be tolerated. 
32 
I 
The fault model employed for control failure is also a functional fault model. It considers the 
case where a value or signal is one or zero when it should not be. The technique for handling good 
stuck-at-zero is self-evident. 
If disp is stuck-at-one then the response should be to clear good. If disp is stuck-at-zero. 
then its path and SS should be set to zero. If fatherdisp is stuck high, then this can be modeled as 
failure of the father. If it is stuck-at-zero, then the path and SS of its brother should be set to zero. 
If a global variable is zero when it should be one, then reconfiguration will not proceed in that 
direction. This is not significant, since the worst situation is the node failing, in which case 
reconfiguration cannot proceed in that direction anyway. If a variable is stuck-at-one then, under 
the assumption that switch and link failure are isolated in occurrence, the second variable is valid, 
Le.. if SS is stuck-at-one then, if reconfiguration cannot proceed in this direction, path will still be 
zero. Consequently, both SS and path must be checked before selecting a son for displacement. 
Finally, reconfiguration for signal failures (e.g.. recon, brorecon , ...I are considered. Since each 
(upper level) node has two recon signals, one to each son, if one recon is stuck low, then 
- reconfiguration simply cannot proceed in that direction (Le.. SS and puth are zero). If a recon is 
stuck high, then the node should be set as faulty. Thus, the recon signal can be ignored after that. 
Similarly, if unrecon is stuck high, then the node can be set faulty, further w e c o n  signals then 
being invalid. If unrecon is stuck low. then the node can be set faulty, thereby assuring that the 
unrecon need never be used. If brorecon is stuck low, then path and SS of the node must be set to 
zero. If it is stuck high, then this can be modeled as a failure of the father, with reconiiguration 
proceeding in the direction of the node. 
The algorithms presented in Section C are appropriate for dynamic reconfiguration of failures. 
Static reconfiguration. where a tree has a fixed subset of nodes which are faulty, requires 
reconfiguration and analysis of reconfigurability for the entire structure given a subset of faulty 
c 
33 
nodes. An algorithm has been developed which determines if an input fault subset is 
reconfigurable. The algorithm has been implemented in Pascal. 
Input to the algorithm is as follows. First, the faulty nodes in levels above and including the 
root of the SSTs are specified. Next, the associated spare of each SST is specified as being faulty or 
fault-free. Finally, the number of nonspare faults in each SST is input to the algorithm. 
Nine Boolean variables are assocated with each node at  and above the level of the roots of the 
SSTs. The algorithm begins by determining the reconfigurability of the tree a t  the SST level alone. 
This is done by initializing the Boolean values for the root of each SST. on the basis of the status of 
the ST's root, and the number of faults within the SST. These values are then altered based on 
the Boolean values of the node's neighbors (brothers and cousins) on the same level. If the tree has 
not been determined to be unreconfigurable, then the nodes on the level above the SST level are 
considered. The Boolean values of any node above the SST level are set on the basis of its son's 
Boolean values. These values are then altered on the basis of the Boolean values of the otheronodes 
on the same level. 
Since the reconfigurability of the tree can be determined at  each level, based only on the 
Boolean values from the level below, and the status of nodes on that level. the algorithm runs in 
time directly proportional to the number of nodes above the Spare SubTree level. Since the size of 
the tree above the SSTs equals the number of spares less one, the algorithm is O ( n )  for a fixed 
percentage of spares, where n is the number of nodes in the tree. For a fixed number of spares, the 
algorithm runs in order constant time. If the tree has been determined to be reconfigurable. then a 
proper reconfiguration for the tree can be determined on the basis of the Boolean values associated 
with each node. 
34 
IV. RELIABILITY ANALYSIS 
In this section it is shown that the reliability of a SOFT binary tree, even with the restrictions 
imposed by VLSI layout, is always superior to a tree implemented using the class of approaches 
employed by [7,8]. This is demonstrated by first establishing an upper bound on the reliability of 
the previous approaches, independent of the actual implementation, Le.. their optimal reliability, 
and comparing it to a lower bound for SOFT trees. Exact reliability calculations of some specific 
SOFT implementations are also derived and compared. 
A. ReLicrbiLity of Other Approaches 
It is assumed that the i + l  level tree is allocated k spares. A Modular Sparing Approach 
(MSA) to fault tolerance in binary trees is any approach to reconfigurable design which partitions a 
tree into k groups of processors and allocates each group of processors one spare to be used 
exclusively by that group. The work of Raghavendra. et al. (RAE) and Hassan and Agarwal (M- 
trees) can both be classified as MSA. Significantly, SOFT trees are not included in this category. It 
should be noted that the strategy of Rosenberg [91 is not MSA. However, Rosenberg's strategy does 
not allow for interconnect or switch failure. With an MSA. each module must be functioning in 
order for the tree to be functioning. Thus, the reliability can be expressed as the product of the 
reliability of all of the modules. In general this is: 
m =1 
where Rm is the reliability of the mrh module. Although some MSA schemes may tolerate 
interconnect failure. the following reliability analysis considers only processor failures. for sake of 
simplicity. Since the spare can be configured into the module in case of any single failure. the 
reliability of each module can be expressed as: 
I 
I 
I 
I 
I 
I 
1 
I 
I 
I 
I 
I 
I 
1 
I 
I 
1 
I 
I 
I 
1 
I 
1 
I 
I 
1 
I 
I 
I 
I 
I 
I 
I 
1 
I 
1 
I 
I 
35 
where R is the reliability of each individual processor and q is the number of processors in the 
module, including the spare. The reliability for all processors is assume; to be equal and 
exponentially distributed (Le.. R =e-)3 1. Although this assumption is not accurate for many 
environments. it does provide an initial point of comparison and is a common assumption in 
reliability analysis [7. 8.163. For simplicity. the failure rate of spares. p, is assumed to be equal to 
the failure rate of nonredundant processors. A. In the following discussiori, it is assumed that the 
trees can be divided evenly into modules of size q ,  although the theorem does not rely on this 
assumption. 
THEOREM 6: Optimal reliability in an MSA tree corresponds to when the tree is divided into 
modules of equal size: q =(2' +'- 1 + k )/k . 
PROOF: Co.nsider a tree with equal module size. The following inequality indicates that every 
time a single node is moved from a module of size q to another module of size q , thereby creating 
modules of size q +1 and q -1. Rvs decreases: 
Additionally, the following inequality indicates that moving nodes from modules of smaller size 
into modules of larger size will decrease reliability: 
for any 0 6 c .  and Obk < q .  0 
It has been shown that the reliability versus spares curve of M-trees is greater than that of 
RAE 181. The reason for this is that the RAE scheme allocates an entire spare to the root, whereas 
M-trees can 'spread' the spare out into level 1 (or more) nodes. At lower levels, however, the 
number of nonredundant nodes per spare are the same. Consequently, as i and/or k increases. the 
reliabilities of both schemes converge. In contrast, the next section demonstrates that the SOFT 
36 
approach dways results in superior reliability over MSA designs. 
The reliability of a redundant system composed of nodes with equal reliability R , which is 
not subject to degraded performance, can be thought of as a polynomial of degree (2i+1- 1 + k )  
where there are (2' +l- 1 + k ) nodes in the system. The polynomial can be expressed as: 
where R is the reliability of individual components. and cyI is the number of ways in which j 
reconfigurable faults can occur in the tree (ai =O for j > k  1. For comparison of a, an aj from a 
SOFT reliability equation is denoted as afs .  whereas, aim is associated with the optimal MSA 
reliability. 
In order to analyze SOFT reliability, it is necessary to calculate the number of possible 
processor locations for the j'" faulty processor to occur given that j -1 faults have already been 
successfully reconfigured. This is dependent on the specific SOFT implementation. however. using 
the following identity, a lower bound on the number of possible locations for the j f h  fault is 
derived in Theorem 7. 
2i +1 
Number of nodes per SST (including the spare) = - k 
If a given SOFT or MSA tree contains j -1 specific faults. and the tree has not failed. then the 
location of the faulty processors (fault subset) is referred to as an at-lr or aj-lm scenario, 
respectively. In SOFT. if j -1 faults have been reconfigured. this implies that the associated spares 
of j -1 SSTs are configured in (or failed), and k - j  +1 SSTs have associated spares which are not 
used. 
I 
1 
I 
1 
I 
I 
t 
1 
I 
I 
I 
I 
I 
I 
1 
1 
37 
THEOREM 7: (k - j  +l) -+1 is a lower bound on the number of reconfigurable faults 1";" I 
from an aj -ls scenario. 
PROOF: In each of the k - j  +l SSTs which has its associated spare unused, at least - 12y1 I
positions for the j r h  fault exist,, which can be reconfigured. If in aj-Is the father of one of the 
k - j  +1 SSTs has not yet failed, then the failure of the father can be tolerated and 1 can be added 
2i +1 
to - as the number of locations for k - j  +1 of the SSTs. If the father has failed, however, then k 
it must have been reconfigured into a neighboring SST, which means that a fault in that neighboring 
SST can be tolerated and the f ss 's reconfiguration shifted into the SST with the free spare. The 
lower bound, therefore, remains valid. If two of the k - j  +l SSTs are adjacent to the same father. 
then father has been counted twice in the lower bound. If the father, f , has not yet failed, 
however, then the failure of f is also tolerable, and by the same analysis, the lower bound 
remains valid. 0 
THEOREM8: The reliability of a SOFT tree is always greater than the reliability of an 
equivalent MSA tree for k (the number of spares) > 1. 
PROOR If it can be shown that for at least one j , cyjs >aj,. and cyj ,  >ajrn for all other j , then 
Rsysso,'Rv~hisA (assuming that the reliability of individual processors is the same for both 
architectures). Since ao, =aoS =1 and al,=aI, =(Zi +I- 1 + k ). Rvs with k =O or k =1 is identical 
for both schemes. Consider. however, a2. MSA can tolerate only one failure per module. As a 
result: 
(2'+'- 1 + k )  
n! 
where 1 = --- I 
38 
L 
and: 
(2'+'- 1 + k )  2' +'/k [ 2 2 
Now consider . Since a2, >cy2, , it will be shown by contradiction that ais >ai, given that 
aj -1, >ai -l,. From the definition of 01. it is evident that if ajm >'cyj,, at  least one fault 
configuration corresponding to a successful reconfiguration of j - 1  faults in MSA (denoted as a 
fault configuration in ayi -1,) must have more possible locations for successful reconfiguration of 
the j r h  fault than the possible locations for the j t h  fault in an aj -1, configuration. 
Derivation of the number of 'choices' for the j f h  fault to occur for each aJ -l in optimal MSA 
trees is straightforward and is less than the lower bound of Theorem 7. 
Due to the analytical complexity of a global reconfiguration strategy and the variety of 
possible SOFT implementations. a closed form expression for reliability has not been found for 
arbitrary size trees with arbitrary numbers of spares. However. the analysis presented in Section 
I11 is sufficient to determine the reliability of any specific SOFT tree. The following section presents 
some exact reliability calculations for example SOFT implementations. 
The reliabilities of four level trees, implemented by M-tree. RAE, and SOFT as a function of 
the number of spares. for R = e- .u,  t-.5, and t-1.0. are shown in Table 5. It should be noted that 
the SOFT reliability is superior even when the modular schemes are allocated more spares. The 
data for  t-1.0 is graphically displayed in Figure 9. In Table 6 and Figure 10, the reliabilities versus 
time curves of a tree with no redundancy, duplicated four level trees, and four level trees with 
four spares employing an optimal MSA. a SOFT approach, and a tree with optimal reliability are 
presented. A scheme with optimal reliability guarantees reconfiguration for any number of faults 
I 
I 
1 
1 
1 
1 
1 
1 
1 
1 
I 
I 
I 
I 
I 
I 
Table 5. Reliability for a four-level tree by number of spares. 
0.9504 
0.9281 
0.9033 
0.8274 
0.7655 
0.7073 
0.9350 
0.8499 
0.8179 
0.7832 
0.6194 
0.5416 
0.9512 - 
0.9350  
0.9248 0.9974 
0.88 11 0.9528 
0.8179 0.8179 
0.8298 - 
0.7832 
0.7563 0.9665 
0.6553 0.7841 
0.5416 05416 
Reliability 
1-00 
0.60 
0.50 
0.90 -. 
0.80 -I 
0.70 - 
Number of 
Y 
I I I I 1 1 1 1 
I I I I 1 1 1 I Spares 
0 1 2 3 4 5 6 7 8  
Figure 9. Reliabilities of four-level trees at t-1.0. 
less than or equal to the number of spares. Table 6 includes calculations for an RAE tree with 4 
spares and an M-tree with 5 spares. The M-tree approach was allocated five spares due to its 
inability to support four. 
40 
D . I w & g S O I ; T ~  
There are several possibilities for enhancing the reliability of a SOFT architecture. If a 
designer is not concerned with VLSI layout issues or is willing to pay O(n l o p  ) area, the SOFT 
tree can be implemented such that the leaves are fully connected and the full sharing of inter-SST 
spares is practical. As an alternative, sharing of spares between the i =4 leaf subtrees is possible 
using the procedure of Horowitz and Zorat [ll]. A second option is the addition of redundant lines 
Table 6. Reliability for X=.l 
II  no r e t  I 
0.153 
0.072 
0.902 
0.721 
0.544 
0.396 
0.283 
0.200 
0.140 
0.097 
R R A E  
4 spares 
1 .Ooo 
0.972 
0.903 
0.810 
0.707 
0.604 
0.505 
0.417 
0.339 
R M - t -  
5 spares 
1 .Ooo 
0.982 
0.935 
0.866 
0.783 
0.694 
0.604 
0.5 17 
0.437 
RMSA -opt 
4 spares 
1 .ooo 
0.979 
0.925 
0.847 
0.756 
0.661 
0.566 
0.447 
0.396 
%OFT 
4spares 
1.000 
1.000 
0.997 
0.987 
0.965 
0.927 
0.873 
0.807 
0.73 1 
Reliability 
I 
Obtimal 
I \ \ \ ODt. MSA 
R O p r  
4 spares 
1 .ooo 
1 .ow 
0.998 
0.990 
0.971 
0.936 
0.886 
0.822 
0.747 
- Time 
.25 s o  .75 1.M 1.25 1-50 1.75 2.00 
Figure 10. Reliability versus time for a four-level tree and four spares. 
0.0 
41 
between leaves which are adjacent to the same spare. This allows reconfiguration of SSTs with 
numerous faults, assuming that the spares are available in neighboring SSTs. The cost of adding 
these lines is an additional k lines (where k is the number of spares). Finally, for  applications 
such as yield enhancement, where processor yield may be quite low, -50’70 sparing is possible. Sn 
SOFT implementations with 50% sparing, spare and link placement is the same as in Algorithms 1 
and 2, with the exception that each spare is associated with a single leaf ‘and there are no cousin 
connections between leaves. 
E. S0I;T PerfcKmmroe 13egradatiopr 
The SOFT approach, as with the RAE technique, allows for graceful degradation in 
performance. No redundant spares are allocated for graceful degradation, however redundant lines 
are located as described by Algorithm 2. For SOFT architectures a spare at the root is unnecessary, 
in contrast to previous graceful degradation approaches 171. Since failures are passed to the leaves, 
the only nodes which must assume the functions of their brothers are the leaves. Also, as noted 
before, the redundancy in terms of links is reduced and the reliability is enhanced. The only 
failures which disable the tree are long runs of failures along the leaves and a failure of a node for 
which no path of good nodes into the leaves exist. i.e.. the class of failures depicted in Figure 5. 
42 
2~ -3 % link redundancy - + N 
V. SOFT N-ARY TREES 
N ' - ~ + N "  
Y N g  
N -ary trees, in which each nonleaf node has N sons, are more suitable for certain tasks than 
classical binary trees. For example, 4-ary (quad) tree architectures have been proposed for 
implementing several classes of artificial intelligence related algorithms [ 171. The following brief 
discussion summarizes how the SOFT approach is applicable to N -ary trees. 
A. - ofRecoprjgur~N-aryTrses 
I )  Location of Redundant Lines 
A link allocation approach which is applicable to arbitrarily large N is to restrict 
reconfiguration to  displacement of the 'outside' two children of each non-leaf node. Redundant 
links to each of a node's brothers are added only to the two outside brothers. The link redundancy 
2N -3 at upper levels is approximately -%, which is O( 1) instead of O(N ). N 
2 )  Location of Redundant Processors 
Allocation of up to one associated spare per group of N brothers is allowed. The spares are 
placed between SSTs with connections from a spare to both its associated SST and a neighboring 
SST. Redundancy is therefore: 
I g r o  
N" % node redundancy = - 
Y N P  
g =o 
where NC is the number of spares allocated to the tree, with 1 <c <i  -1. 
I 
1 
1 
43 
B.Rf?Xfm$gzw- 
Reconfiguration is fundamentally the same as in binary trees. A sample 
reconfiguration for four failures in a 5-ary tree is illustrated in Figure 11. 
b 0 6 
Figure 11. Reconfiguration in a 5-ary tree. 
b 
44 
VI. CONCLUSIONS 
A unique approach to the design of reconfigurable tree architectures has been presented. The 
design allocates spares at the leaves of trees and allows sharing of spares between subtrees. The 
architecture has O h  VLSI layout for binary trees and is directly extensible to N -ary trees. A 
lower bound on reliability for a SOFT tree was shown to be more reliable than all modular sparing 
approaches. with significantly less redundancy. The SOFT architecture is the only known approach 
to reconfigurable trees which tolerates both link and switch failures. The virtual displacement 
concept with sharing of spares between clusters of processors is also applicable to other concurrent 
architectures. The design strategy presented in this paper makes tree architectures attractive for 
environments where high reliability is required. 
I 
I 
I 
I 
1 
1 
I 
I 
1 
I 
I 
I 
1 
1 
B 
I 
1 
I 
I 
45 
REFERENCES 
C. A. Mead and L. A. Conway, Introduction to VLSI Systems. Reading, Mass.: Addison- 
Wesley. 1980. 
S. A. Browning, “A Tree Machine,” Lambda (VLSI Design), pp. 31-36. Second Quarter 1980. 
S. A. Browning, “Computations on a Tree of Processors.” 1979. PhD Thesis, California Insti- 
tute of Technology. 
M. J. Attallah and S. R. Kosaraju, “A Generalized Dictionary Machine for VLSI,” IEEE 
Trans. on Comput.. pp. 151-155. February 1985. 
J. P. Hayes. “A Graph Model for Fault Tolerant Computing Systems,” lEEE Trans. on Com- 
put.. pp. 875-883. September 1976. 
C. L. Kwan and S. Toida. “An Optimal 2-Fault Tolerant Realization of Symmetric Hierarchi- 
cal Tree Systems.” Networks , vol. 12. pp. 231-239, 1982. 
C. S. Raghavendra. A. Avizienis. and M. Ercegovac. “Fault Tolerance in Binary Tree Archi- 
tectures.” IEEE Transactions on Cornput.. pp. 568-572. June 1984. 
A. S. M. Hassan and V. K. Agarwal, “A Fault-Tolerant Modular Architecture for Binary 
Trees Architectures,” Proceedings of the Fifteenth Fault Tolerant Computing Symposium. pp. 
344-349. June 1985. 
A. L. Rosenberg. “The Diogenes Approach to Testable Fault Tolerant Arrays of Processors.” 
IEEE Trans. on C o m p . .  pp. 902-910, October 1983. 
F. R. K. Chung, F. T. Leighton, and A. L. Rosenberg, “Diogenes: A Methodology for Design- 
* ing Fault Tolerant VLSI Array Processors.” Proceedings of the Thirteenth Fault Tolerant 
Computing Symposium, pp. 23-32. June 1983. 
E. Horowitz and A. Zorat. “The Binary Tree as an Interconnection Network: Applications to 
Multiprocessor Systems and VLSI,” IEETrans.  on Cornput., pp. 247-253. April 1981. 
J. R. Goodman and C. H. Sequin, “Hypertree: A Multiprocessor Interconnection Topology.” 
IEEE Trans. on Comput.. pp. 923-933, December 1981. 
A. M. Despain and D. A. Patterson, “X-tree: A Tree Structured Multiprocessor Computer 
Architecture.” Proceedings of the Fifth Computer Architecture Syposium. pp. 144-150. April 
1978. 
C. E. Leiserson. Area Ejtcient VLSI Compz&ation. MIT Press, 1983. 
J. B. Ullman. Cornputatwd Aspects of VLSI. Rockville, Md.: Computer Science Press, 1984. 
R. S. Swarz and D. P. Siewiorek. The Theory and Practice of Reliable System Design. Bed- 
ford, Mass.: Digital Press, 1982. 
A. Rosenfeld. “Quadtrees and Pyramids for Pattern Recognition and Image Processing,” 
Proceedings of the Fifth Interrrcrtwnal Conference on Pattern Recognition. pp. 802-806,1981. 
[ll] 
[ 121 
[13] 
[14I 
[151 
[I61 
[171 
REPORT SECURITY ~LASSlFlCnTlON 
SECURITY CLASSIFICATION AUTHORITY 
U n c l a s s i f i e d  
Zb. OECLASSIFlCATlON I DOWNGRADING SCHEDULE 
. PERFORMING ORGANIZATION REPORT NUMBER(S) 
UILU-ENG-87-2216 (CSG-65) 
. NAME OF PERFORMING ORGANIZATION 6b. OFFICE SYMBOL 
Coordina ted  Sc ience  Lab (tr ap@icab(e) 
1 b. RESTRICTIVE MARKINGS 
None 
Approved for p u b l i c  release; 
d i s t r i b u t i o n  u n l i m i t e d  
3 . DISTRIBUTION I AVAILABILITY OF REPORT t 
5. MONITORING ORGANIZATION REPORT NUMBER(S) 
t 
i 70. NAME OF MONITORING ORGANIZATION 
U n i v e r s i t y  of I l l i n o i s  I N/A 
ADDRESS (City, State, and ZIP C&) 
1101 W. S p r i n g f i e l d  Avenue 
Urbana, I L  61801 
NASA I 
i 
I 7b. ADDRESS (Cify, State, a d  ZIP CO&) NASA Langley Research Center  
Hampton VA 23665 
I 
I I I 
?9. ABSTRACT (Continue on reverse if necessary and identify by Hock number) 
. NAME OF FUNDING /SPONSORING 8b. .OFFICE SYMBOL 
ORGANIZATION (If applicable) 
NASA 
ADDRESS (Cify, Stare, and ZIP Code) 
NASA Langley Research Center  
Hampton, VA 23665 
An approach t o  t h e  des ign  of r econf igu rab le  tree a r c h i t e c t u r e s  i s  p resen ted  i n  which 
The approach I s  unique i n  t h a t  s p a r e s  are spare p rocesso res  are a l l o c a t e d  a t  t h e  leaves .  
a s s o c i a t e d  w i t h  s u b t r e e s  and sha r ing  of spa res  between t h e s e  s u b t r e e s  can  occur .  The 
Subt ree  Or ien ted  F a u l t  Tolerance (SOFT) approach i s  more r e l i a b l e  than  previous  approaches 
caDable of t o l e r a t i n g  l i n k  and swi tch  f a i l u r e s  f o r  bo th  s i n g l e  ch ip  and mul t i -ch ip  tree 
implementat ions wh i l e  reducing  redundancy i n  terms of both s p a r e  p rocesso r s  and l i n k s .  
VLSI l ayou t  i s  O(n) f o r  b ina ry  trees and i s  d i r e c t l y  e x t e n s i b l e  t o  N-ary trees and f a u l t  
t o l e r a n c e  through performance degradat ion.  
1 
9. PROCUREMENT INSTRUMENT IDENTIFICATION NUMBER 
I 
i 
NASA NAG 1-613 
10. SOURCE OF FUNDING NUMBERS 
PROGRAM PROJECT TASK WORK UNIT 
ELEMENT NO. NO. NO. ACCESSION NO. 
17. COSATI CODES 
FIELD GROUP SUB-GROUP 
18. SUBJECT TERMS (ConbinUe on revern if mcesay a d  fdentify by block numkr) 
VLSI, binary tree, N-ary, f a u l t  t o l e r a n c e ,  tree a r c h i t e c t u r e  
Subtree Oriented F a u l t  Tolerance (SOFT) 
20. DISTRIBUTION I AVAILABILITY OF ABSTRACT 
22a. NAME OF RESPONSlBLE INOIVIOUAL 
~UNCLASSIFIEONNLIMITED 0 SAME AS RPT. 0 DTK USERS 
- 
21. ABSTRACT SECURITY CWSIFLCATION 
22b. TELEPHONE (Inrludc Area Code) 22C. OFFICE SYMBOL 
U n c l a s s i f i e d  
