Method and apparatus for fault tolerance by Sullivan, Gregory F. & Masson, Gerald M.
US005243607A 
United States Patent [19] [11] Patent Number: 5,243,607
Masson et al. [45] Date of Patent: Sep. 7, 1993
[54] METHOD AND APPARATUS FOR FAULT TOLERANCE
[75] Inventors: Gerald M. Masson; Gregory F. Sullivan, both of Baltimore, Md.
[73] Assignee: The Johns Hopkins University, Baltimore, Md.
[21] Appl. No.: 543,451
[22] Filed: Jun. 25, 1990
[51] Int. Cl.5 ............ H04L 1/08
[52] U.S. Cl. ............ 371/69.1; 371/68.3; 371/68.1; 371/19; 395/575
[58] Field of Search ............ 371/69.1, 68.3, 68.1, 371/19, 15.1, 16.1, 67.1; 364/200 MS File; 395/575
[56] References Cited
U.S. PATENT DOCUMENTS
4,696,003 9/1987 Kerr ............ 371/69.1 X
4,756,005 7/1988 Shedd ............ 371/69.1 X
5,005,174 4/1991 Bruckert et al. ............ 371/68.3
OTHER PUBLICATIONS
H. Geng, “Circuit for the Complete Check of a Data-Processing System”, IBM TDB, vol. 16, No. 4, Sep. 1974, pp. 1144-1145.
K. Knowlton, “A Combination Hardware-Software Debugging System,” IEEE Transactions on Computers, Jan. 1968, pp. 81-86.
Primary Examiner-Robert W. Beausoliel, Jr.
Assistant Examiner-Ly V. Hua
Attorney, Agent, or Firm-Ansel M. Schwartz
[57] ABSTRACT
A method and apparatus for achieving fault tolerance in
a computer system having at least a first central process- 
ing unit and a second central processing unit. The 
method comprises the steps of first executing a first 
algorithm in the first central processing unit on input 
which produces a first output as well as a certification 
trail. Next, executing a second algorithm in the second 
central processing unit on the input and on at least a 
portion of the certification trail which produces a sec- 
ond output. The second algorithm has a faster execution 
time than the first algorithm for a given input. Then, 
comparing the first and second outputs such that an 
error result is produced if the first and second outputs 
are not the same. The step of executing a first algorithm 
and the step of executing a second algorithm preferably 
takes place over essentially the same time period. 
18 Claims, 6 Drawing Sheets 
https://ntrs.nasa.gov/search.jsp?R=20080004260 2019-08-30T02:16:14+00:00Z
U.S. Patent Sep. 7, 1993 Sheet 1 of 6 5,243,607 
[FIG. 1: block diagram with blocks FIRST EXECUTION and SECOND EXECUTION and signals INPUT, CERTIFICATION TRAIL, and OUTPUT OR ERROR (one per execution).]
FIG. 1
Algorithm MINSPAN(G, weight)
Input: Connected graph G = (V, E) where V = {1, ..., n} with edge weights.
Output: Spanning tree of G which has minimum weight.
    CHOOSE root ∈ V
    FOR ALL v ∈ V, key(v) := ∞ END FOR
    h := ∅; v := root
    WHILE v ≠ empty DO
        key(v) := -∞
        FOR EACH [v,w] ∈ E DO
            IF weight([v,w]) < key(w) THEN
                key(w) := weight([v,w]); prefer(w) := [v,w]
                IF member(w,h) THEN changekey(w, key(w), h)
                ELSE insert(w, key(w), h) END IF
            END IF
        END FOR
        (v,k) := deletemin(h)
    END WHILE
    FOR ALL v ∈ V - {root}, OUTPUT(prefer(v)) END FOR
END MINSPAN
FIG. 3
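The pidgin code of FIG. 3 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation described in the patent: the function name `minspan`, the edge-list input format, and the use of a plain dict for h (so that deletemin is a linear scan instead of a balanced tree operation) are choices made here for brevity.

```python
import math

def minspan(n, edges, root=1):
    """Sketch of MINSPAN for a connected graph on vertices 1..n.

    edges is a list of (v, w, weight) tuples (an assumed input format).
    Returns the list of preferred edges, one per non-root vertex.
    """
    adj = {v: [] for v in range(1, n + 1)}
    for v, w, wt in edges:
        adj[v].append((w, wt))
        adj[w].append((v, wt))

    key = {v: math.inf for v in adj}   # key(v) := infinity
    prefer = {}                        # prefer(w): best edge found so far into w
    h = {}                             # item number -> key value
    v = root
    while v is not None:
        key[v] = -math.inf             # v is now spanned; no edge can improve it
        for w, wt in adj[v]:
            if wt < key[w]:
                key[w] = wt
                prefer[w] = (v, w)
                h[w] = wt              # acts as insert or changekey
        if h:
            # deletemin: smallest key, ties broken by item number
            v = min(h, key=lambda i: (h[i], i))
            del h[v]
        else:
            v = None                   # corresponds to the token "empty"
    return [prefer[u] for u in sorted(adj) if u != root]
```

For example, `minspan(4, [(1, 2, 4), (1, 3, 1), (2, 3, 2), (3, 4, 7), (2, 4, 3)])` selects the edges of weight 1, 2 and 3.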
[Sheet 2 of 6: FIG. 2, sample graph for the minimum spanning tree example.]
[Sheet 3 of 6: FIG. 4(a) and FIG. 4(b), data structure drawings.]
Algorithm HUFFMAN(FREQ)
Input: Sequence of positive integers FREQ = {f[1], f[2], ..., f[n]}
Output: Pointer to a Huffman tree for the input frequencies
    FOR i := 1 to n DO
        insert(i, f[i], h)
        ptr[i] := allocate()
        info[ptr[i]] := (i, f[i])
    END FOR
    FOR j := n+1 to 2n-1 DO
        (item1, key1) := deletemin(h)
        (item2, key2) := deletemin(h)
        ptr[j] := allocate()
        info[ptr[j]] := (j, key1 + key2)
        left[ptr[j]] := ptr[item1]
        right[ptr[j]] := ptr[item2]
        insert(j, key1 + key2, h)
    END FOR
    OUTPUT(ptr[2n-1])
END HUFFMAN
FIG. 5
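The pseudocode of FIG. 5 maps closely onto Python's `heapq` module. The sketch below is illustrative only: the function name `huffman` and the use of dicts keyed by item number in place of allocated nodes and the ptr array are assumptions made here, and a binary heap stands in for whatever structure implements h.

```python
import heapq

def huffman(freq):
    """Sketch of HUFFMAN for a list of positive integers freq (items 1..n).

    Returns (root, info, left, right), where info[j] = (item, key) and
    left/right map each internal node's item number to its children.
    """
    n = len(freq)
    info, left, right = {}, {}, {}
    h = []  # heap of (key, item): ordered by key, then by item number
    for i in range(1, n + 1):
        heapq.heappush(h, (freq[i - 1], i))      # insert(i, f[i], h)
        info[i] = (i, freq[i - 1])
    for j in range(n + 1, 2 * n):                # j := n+1 to 2n-1
        key1, item1 = heapq.heappop(h)           # deletemin(h)
        key2, item2 = heapq.heappop(h)           # deletemin(h)
        info[j] = (j, key1 + key2)
        left[j], right[j] = item1, item2
        heapq.heappush(h, (key1 + key2, j))      # insert(j, key1 + key2, h)
    return 2 * n - 1, info, left, right
```

Storing heap entries as `(key, item)` tuples means ties between equal keys are broken by item number, so the merge order is deterministic.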
[Sheet 4 of 6: FIG. 6, example of a Huffman tree.]
FIG. 6
Algorithm CONVEXHULL(S)
Input: Set of points, S, in R^2
Output: Counterclockwise sequence of points in R^2 which define convex hull of S
    Let p1 be the point with the largest x coordinate (and smallest y to break ties)
    For each point p (except p1) calculate the slope of the line through p1 and p
    Sort the points (except p1) from the smallest slope to the largest. Call them p2, ..., pn
    q1 := p1; q2 := p2; q3 := p3; m := 3
    FOR k = 4 to n DO
        WHILE the angle formed by q(m-1), qm, pk is ≥ 180 degrees DO m := m - 1 END WHILE
        m := m + 1
        qm := pk
    END FOR
    FOR i = 1 to m DO, OUTPUT(qi) END FOR
END CONVEXHULL
FIG. 7
[Sheets 5 and 6 of 6: FIG. 8(a), FIG. 8(b), FIG. 9, FIG. 10 and FIG. 11. Legible block labels: MEANS FOR FAULT TOLERANCE, FIRST ALGORITHM, SECOND ALGORITHM, INPUT, FIRST INPUT PORT, FIRST MEMORY, FIRST CENTRAL PROCESSING UNIT, SECOND COMPUTER, INPUT PORT, SECOND CENTRAL PROCESSING UNIT, CERTIFICATION TRAIL, COMPARING MECHANISM, FIRST OUTPUT, SECOND OUTPUT.]
METHOD AND APPARATUS FOR FAULT TOLERANCE

LICENSES
The United States Government has a paid-up non-exclusive license to practice the claimed invention herein as per NSF Grant CCR-8910569 and NASA Grant NSG 1442.

FIELD OF THE INVENTION
The present invention relates to fault tolerance. More specifically, the present invention relates to a first algorithm that provides a certification trail to a second algorithm for fault tolerance purposes.

BACKGROUND OF THE INVENTION
Traditionally, with respect to fault tolerance, the specification of a problem is given and an algorithm to solve it is constructed. This algorithm is executed on an input and the output is stored. Next, the same algorithm is executed again on the same input and the output is compared to the earlier output. If the outputs differ then an error is indicated, otherwise the output is accepted as correct. This software fault tolerance method requires additional time, so called time redundancy [Johnson, B., Design and analysis of fault tolerant digital systems, Addison-Wesley, Reading Mass., 1989; Siewiorek, D., and Swarz, R., The theory and practice of reliable design, Digital Press, Bedford, Mass., 1982]; however, it requires no additional software. It is particularly valuable for detecting errors caused by transient fault phenomena. If such faults cause an error during only one of the executions then either the error will be detected or the output will be correct.
A variation of the above method uses two separate algorithms, one for each execution, which have been written independently based on the problem specification. This N-version programming [Chen, L., and Avizienis A., “N-version programming: a fault tolerant approach to reliability of software operation,” Digest of the 1978 Fault Tolerant Computing Symposium, pp. 3-9, IEEE Computer Society Press, 1978; Avizienis, A., “The N-version approach to fault tolerant software,” IEEE Trans. on Software Engineering, vol. 11, pp. 1491-1501, December, 1985] (in this case N=2), allows for the detection of errors caused by some faults in the software in addition to those caused by transient hardware faults and utilizes both time and software redundancy. Errors caused by software faults are detected whenever the independently written programs do not generate coincident errors.

SUMMARY OF THE INVENTION
The present invention pertains to a method for achieving fault tolerance in a computer system having at least a first central processing system and a second central processing system. The method comprises the steps of first executing a first algorithm in the first central processing unit on input which produces a first output as well as a certification trail. Next, executing a second algorithm in the second central processing unit on the input and on at least a portion of the certification trail which produces a second output. The second algorithm has a faster execution time than the first algorithm for a given input. Then, comparing the first and second outputs such that an error result is produced if the first and second outputs are not the same. The step of executing a first algorithm and the step of executing a second algorithm preferably takes place over essentially the same time period.
The present invention also pertains to a method for achieving fault tolerance in a central processing unit. The method comprises the steps of executing a first algorithm in the central processing unit on input which produces the first output as well as a certification trail. Then, there is the step of executing a second algorithm in the central processing unit on the input and on at least a portion of the certification trail which produces a second output. The second algorithm has a faster execution time than the first for a given input. Then, there is the step of comparing the first and second outputs such that an error result is produced if the first and second outputs are not the same.
The present invention also pertains to a computer system. The computer system comprises a first computer. The first computer has a first memory. The first computer also has a first central processing unit in communication with the memory. The first computer additionally has a first input port in communication with the memory in the first central processing unit. There is a first algorithm disposed in the first memory which produces a first output as well as a certification trail based on input received by the input port when it is executed by the first central processor. The computer system is additionally comprised of a second computer. The second computer is comprised of a second memory. The second computer is also comprised of a second central processing unit in communication with the memory and the first central processing unit. The second computer additionally is comprised of a second input port in communication with the memory in the second central processing unit. There is a second algorithm disposed in the second memory which produces a second output based on the input and on at least a portion of the certification trail when the second algorithm is executed by the second central processing unit. The second algorithm has a faster execution time than the first algorithm for a given input. The computer system is comprised of a mechanism for comparing the first and second outputs such that an error result is produced if the first and second outputs are not the same.
Moreover, the present invention also pertains to a computer. The computer is comprised of a memory. Additionally, the computer is comprised of a central processing unit in communication with the memory. The computer is additionally comprised of a first input port in communication with the memory and the central processing unit. There is a first algorithm disposed in the memory which produces a first output as well as a certification trail based on input received by the input port when the input is executed by the first central processor. There is a second algorithm also disposed in the memory which produces a second output based on the input and on at least a portion of the certification trail when the second algorithm is executed by the central processing unit. The second algorithm has a faster execution time than the first algorithm for a given input. Moreover, the computer is comprised of a mechanism for comparing the first and second outputs such that an error result is produced if the first and second outputs are not the same.
BRIEF DESCRIPTION OF THE DRAWINGS
In the accompanying drawings, the preferred embodiments of the invention and preferred methods of practicing the invention are illustrated in which:
FIG. 1 is a block diagram of the present invention.
FIGS. 2A through 2F show an example of a minimum spanning tree algorithm.
FIG. 3 shows the source code for a MINSPAN algorithm.
FIGS. 4A and 4B show an example of a data structure used in the second execution of a MINSPAN algorithm.
FIG. 5 shows the source code for a Huffman algorithm.
FIG. 6 shows an example of a Huffman tree.
FIG. 7 shows the source code for Graham’s scan algorithm.
FIGS. 8A through 8C show a convex hull example.
FIG. 9 is a block diagram of an apparatus of the present invention.
FIG. 10 is a block diagram of another embodiment of the present invention.
FIG. 11 is a block diagram of another embodiment of the present invention.
DESCRIPTION OF THE PREFERRED 
EMBODIMENT 
The central idea of the present invention, essentially a 
fault tolerance mechanism, as illustrated in FIG. 1, is to 
modify a first algorithm so that it leaves behind a trail of 
data which is called a certification trail. This data is 
chosen so that it can allow a second algorithm to exe- 
cute more quickly and/or have a simpler structure than 
the first algorithm. The outputs of the two executions 
are compared and are considered correct only if they 
agree. Note, however, that care must be taken in defining
this method or else its error detection capability might
be reduced by the introduction of data dependencies
between the two algorithm executions. For example,
suppose the first algorithm execution contains an error
which causes an incorrect output and an incorrect trail
of data to be generated. Further suppose that no error
occurs during the execution of the second algorithm. It
still appears possible that the execution of the second 
algorithm might use the incorrect trail to generate an 
incorrect output which matches the incorrect output 
given by the execution of the first algorithm. Intu- 
itively, the second execution would be “fooled” by the 
data left behind by the first execution. The definitions 
given below exclude this possibility. They demand that 
the second execution either generates a correct answer 
or signals the fact that an error has been detected in the 
data trail. Finally, it should be noted that in FIG. 1 both
executions can signal an error. These errors would
include run-time errors such as divide-by-zero or
non-terminating computation. In addition, the second
execution can signal an error due to an incorrect certification
trail. The fault tolerance means can be used in hardware
or software systems and manifested as firmware or
software in a central processing unit.
A formal definition of a certification trail is the fol- 
lowing. 
Definition 2.1. A problem $P$ is formalized as a relation (that is, a set of ordered pairs). Let $D$ be the domain (that is, the set of inputs) of the relation $P$ and let $S$ be the range (that is, the set of solutions) for the problem. It can be said an algorithm $A$ solves a problem $P$ if for all $d \in D$, when $d$ is input to $A$ then an $s \in S$ is output such that $(d,s) \in P$.
Definition 2.2. Let $P : D \to S$ be a problem. Let $T$ be the set of certification trails. A solution to this problem using a certification trail consists of two functions $F_1$ and $F_2$ with the following domains and ranges: $F_1 : D \to S \times T$ and $F_2 : D \times T \to S \cup \{\text{error}\}$. The functions must satisfy the following two properties:
(1) for all $d \in D$ there exists $s \in S$ and there exists $t \in T$ such that $F_1(d) = (s,t)$ and $F_2(d,t) = s$ and $(d,s) \in P$;
(2) for all $d \in D$ and for all $t \in T$, either ($F_2(d,t) = s$ and $(d,s) \in P$) or $F_2(d,t) = \text{error}$.
The definitions above assure that the error detection 
capability of the certification trail approach is compara- 
ble to that obtained with the simple time redundancy 
approach discussed earlier. That is, if transient hard- 
ware faults occur during only one of the executions 
then either an error will be detected or the output will 
be correct. It should be further noted, however, the 
examples to be considered will indicate that this new 
approach can also save overall execution time. 
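Properties (1) and (2) can be illustrated with a toy problem. The sketch below is an illustration under stated assumptions and is not taken from the source: sorting stands in for the problem P, the trail is the sorting permutation, and the names `f1`, `f2`, `run_with_trail` and `ERROR` are hypothetical. The second execution rebuilds the answer in linear time and rejects any trail that is not a permutation yielding a nondecreasing sequence, so it either returns the correct output or signals an error.

```python
ERROR = "error"

def f1(d):
    """First execution: sort d and emit the sorting permutation as the trail."""
    trail = sorted(range(len(d)), key=lambda idx: d[idx])
    return [d[idx] for idx in trail], trail

def f2(d, trail):
    """Second execution: rebuild the output from the trail, checking each step."""
    n = len(d)
    if len(trail) != n:
        return ERROR
    seen = [False] * n
    out = []
    for idx in trail:
        # The trail must name each input position exactly once.
        if not isinstance(idx, int) or not 0 <= idx < n or seen[idx]:
            return ERROR
        seen[idx] = True
        # The values must come out in nondecreasing order.
        if out and d[idx] < out[-1]:
            return ERROR
        out.append(d[idx])
    return out

def run_with_trail(d, corrupt=None):
    """Run both executions and compare; corrupt optionally tampers with the trail."""
    s1, trail = f1(d)
    if corrupt is not None:
        trail = corrupt(trail)
    s2 = f2(d, trail)
    return s1 if s2 != ERROR and s1 == s2 else ERROR
```

Calling `run_with_trail([3, 1, 2])` accepts the output, while passing `corrupt=lambda t: t[::-1]` models a faulty first execution and yields the error result.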
The certification trail approach also allows for the
detection of faults in software. As in N-version
programming, separate teams can write the two algorithms;
the specification now must include precise information
describing the generation and use of the certification trail.
Because of the additional data available to the second
execution, the specifications of the two phases can be very
different; similarly, the two algorithms used to implement
the phases can be very different. This will be illustrated in
the convex hull example to be considered later.
Alternatively, the two algorithms can be very similar, differing
only in data structure manipulations. This will be
illustrated in the minimum spanning tree and Huffman tree
examples to be considered later. When significantly
different algorithms are used it is sometimes possible to
save programming effort by sharing program code.
While this reduces the ability to detect errors in the
software it does not change the ability to detect
transient hardware errors as discussed earlier.
With respect to the above, it has been assumed that
our method is implemented with software; however, it
is clearly possible to implement the certification trail
technique by using dedicated hardware. It is also
possible to generalize the basic two-level hierarchy of the
certification trail approach as illustrated in FIG. 1 to
higher levels.
Examples of the Certification Trail Technique 
In this section, there is illustrated the use of certification
trails by means of applications to three well-known
and significant problems in computer science: the
minimum spanning tree problem, the Huffman tree problem,
and the convex hull problem. It should be stressed here
that the certification trail approach is not limited to
these problems. Rather, these algorithms have been
selected only to give illustrations of this technique.
Minimum Spanning Tree Example 
The minimum spanning tree problem has been examined extensively in the literature and an historical survey is given in [Graham, R.L., “An efficient algorithm for determining the convex hull of a planar set”, Information Processing Letters, pp. 132-133, 1, 1972]. The certification trail approach is applied to a variant of the Prim/Dijkstra algorithm [Prim, R.C., “Shortest connection networks and some generalizations,” Bell Syst. Tech. J., pp. 1389-1401, November, 1957; Dijkstra, E. W., “A note on two problems in connexion with graphs,” Numer. Math. 1, pp. 269-1984, Jun. 20-22] as explicated in [Tarjan, R.E., Data Structures and Network Algorithms, Society for Industrial and Applied Mathematics, Philadelphia, Pa. 1983]. The discussion of the application of the certification trail approach to the minimum spanning tree problem begins with some preliminary definitions.
Definition 3.1. A graph $G = (V,E)$ consists of a vertex set $V$ and an edge set $E$. An edge is an unordered pair of distinct vertices which is notated as, for example, $[v,w]$, and it is said $v$ is adjacent to $w$. A path in a graph from $v_1$ to $v_k$ is a sequence of vertices $v_1, v_2, \ldots, v_k$ such that $[v_i, v_{i+1}]$ is an edge for $i \in \{1, \ldots, k-1\}$. A path is a cycle if $k > 1$ and $v_1 = v_k$. An acyclic graph is a graph which contains no cycles. A connected graph is a graph such that for all pairs of vertices $v,w$ there is a path from $v$ to $w$. A tree is an acyclic and connected graph.
Definition 3.2. Let $G = (V,E)$ be a graph and let $w$ be a positive rational valued function defined on $E$. A subtree of $G$ is a tree, $T(V',E')$, with $V' \subseteq V$ and $E' \subseteq E$. It is said $T$ spans $V'$ and $V'$ is spanned by $T$. If $V' = V$ then we say $T$ is a spanning tree of $G$. The weight of this tree is $\sum_{e \in E'} w(e)$. A minimum spanning tree is a spanning tree of minimum weight.

Data Structures and Supported Operations
Before discussion of the minimum spanning tree algorithm, there must be described the properties of the principal data structure that are required. Since many different data structures can be used to implement the algorithm, initially there is described abstractly the data that can be stored by the data structure and the operations that can be used to manipulate this data. The data consists of a set of ordered pairs. The first element in these ordered pairs is referred to as the item number and the second element is called the key value. Ordered pairs may be added and removed from the set; however, at all times, the item numbers of distinct ordered pairs must be distinct. It is possible, though, for multiple ordered pairs to have the same key value. In this paper the item numbers are integers between 1 and n, inclusive. Our default convention is that i is an item number, k is a key value and h is a set of ordered pairs. A total ordering on the pairs of a set can be defined lexicographically as follows: $(i,k) < (i',k')$ iff $k < k'$ or ($k = k'$ and $i < i'$). The data structure should support a subset of the following operations.
member(i,h) returns a boolean value of true if h contains an ordered pair with item number i, otherwise returns false.
insert(i,k,h) adds the ordered pair (i,k) to the set h.
delete(i,h) deletes the unique ordered pair with item number i from h.
changekey(i,k,h) is executed only when there is an ordered pair with item number i in h. This pair is replaced by (i,k).
deletemin(h) returns the ordered pair which is smallest according to the total order defined above and deletes this pair. If h is the empty set then the token “empty” is returned.
predecessor(i,h) returns the item number of the ordered pair which immediately precedes the pair with item number i in the total order. If there is no predecessor then the token “smallest” is returned.
Many different types and combinations of data structures can be used to support these operations efficiently. In our case, two different data structure methods are used to support these operations. One method will be used in the first execution of the algorithm and another, faster and simpler, method will be used in the second execution. The second method relies on a trail of data which is output by the first execution.

MINSPAN ALGORITHM
Before discussing precise implementation details for these methods the overall algorithm used in both executions is presented. Pidgin code for this algorithm appears below. In addition, FIG. 2 illustrates the execution of the algorithm on a sample graph and the table below records the data structure operations the algorithm must perform when executed on the sample graph. The first column of the table gives the operations, except member, with the parameter h dropped to reduce clutter. The second column gives the evolving contents of h. The third column records the ordered pair deleted by the deletemin operation. The fourth column records the certification trail corresponding to these operations and is further discussed below.
The algorithm uses a greedy method to grow a minimum spanning tree. The algorithm starts by choosing an arbitrary vertex from which to grow the tree. During each iteration of the algorithm a new edge is added to the tree being constructed. Thus, the set of vertices spanned by the tree increases by exactly one vertex for each iteration. The edge which is added to the tree is the one with the minimum weight. FIG. 2 shows this process in action. FIG. 2(a) shows the input graph, FIGS. 2(b) through 2(e) show stages of the tree growth and FIG. 2(f) shows the output of the minimum spanning tree. The solid edges in FIGS. 2(b) through 2(e) represent the current tree and the dotted edges represent candidates for addition to the tree.
To efficiently find the edge to add to the current tree the algorithm uses the data structure operations described above. As soon as a vertex, say v, is adjacent to some vertex which is currently spanned it is inserted in the set h. The key value for v is the weight of the minimum edge between v and some vertex spanned by the current tree. The array element prefer(v) is used to keep track of this minimum weight edge. As the tree grows, information is updated by operations such as insert(i,k,h) and changekey(i,k,h).

TABLE 1
Data structure operations and certification trail for MINSPAN

Operation          Set of Ordered Pairs        Delete     Trail
insert(2,200)      (2,200)                                smallest
insert(6,500)      (2,200),(6,500)                        2
deletemin          (6,500)                     (2,200)
insert(3,800)      (6,500),(3,800)                        6
changekey(6,450)   (6,450),(3,800)                        smallest
insert(7,505)      (6,450),(7,505),(3,800)                6
deletemin          (7,505),(3,800)             (6,450)
insert(5,250)      (5,250),(7,505),(3,800)                smallest
changekey(7,495)   (5,250),(7,495),(3,800)                5
deletemin          (7,495),(3,800)             (5,250)
changekey(3,350)   (3,350),(7,495)                        smallest
insert(4,700)      (3,350),(7,495),(4,700)                7
deletemin          (7,495),(4,700)             (3,350)
changekey(4,650)   (7,495),(4,650)                        7
deletemin          (4,650)                     (7,495)
deletemin                                      (4,650)
deletemin                                      empty
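The Trail column of Table 1 can be reproduced with a short sketch that records the predecessor after every insert and changekey. The code below is a sketch under stated assumptions: a sorted Python list stands in for the balanced search tree, so operations are not logarithmic, and the class name `TrailRecorder` and the helper `replay_table_1` are hypothetical. The operations replayed are the ones listed in the table.

```python
import bisect

class TrailRecorder:
    """Set of (item, key) pairs ordered by key, then item number."""

    def __init__(self):
        self.pairs = []    # sorted list of (key, item) tuples
        self.key_of = {}   # item number -> current key
        self.trail = []    # predecessor recorded after each insert/changekey

    def _predecessor(self, i):
        pos = bisect.bisect_left(self.pairs, (self.key_of[i], i))
        return "smallest" if pos == 0 else self.pairs[pos - 1][1]

    def insert(self, i, k):
        bisect.insort(self.pairs, (k, i))
        self.key_of[i] = k
        self.trail.append(self._predecessor(i))

    def changekey(self, i, k):
        # Replace the existing pair for item i, then record the new predecessor.
        self.pairs.remove((self.key_of.pop(i), i))
        self.insert(i, k)

    def deletemin(self):
        if not self.pairs:
            return "empty"
        k, i = self.pairs.pop(0)
        del self.key_of[i]
        return (i, k)

def replay_table_1():
    """Replay the Operation column of Table 1; return (trail, deleted pairs)."""
    ops = [("insert", 2, 200), ("insert", 6, 500), ("deletemin",),
           ("insert", 3, 800), ("changekey", 6, 450), ("insert", 7, 505),
           ("deletemin",), ("insert", 5, 250), ("changekey", 7, 495),
           ("deletemin",), ("changekey", 3, 350), ("insert", 4, 700),
           ("deletemin",), ("changekey", 4, 650), ("deletemin",),
           ("deletemin",), ("deletemin",)]
    h, deleted = TrailRecorder(), []
    for op in ops:
        if op[0] == "deletemin":
            deleted.append(h.deletemin())
        else:
            getattr(h, op[0])(op[1], op[2])
    return h.trail, deleted
```

The returned trail lists one entry per insert or changekey, in the order the operations were issued, and the deleted list ends with the token "empty".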
8 
5,243,607 
7 
The deletemin (h) operation is used to select the next FIG. 3(u) is before the insertion and FIG. 3(b) is after 
vertex to add to the span of the current tree. Note, the the insertion. 
algorithm does not explicitly keep a set of edges repre- When the insert operation is performed, some checks 
senting the current tree. Implicitly, however, if (v,k) is must be conducted. First, the ith array pointer must be 
returned by deletemin then prefer (v) is added to the 5 nil before the operation is performed. Section, the 
current tree. sorted order of the pairs stored in the linked list must be 
In the first execution of the MINSPAN algorithm, preserved after the operation. That is, if (i’,r) is stored 
the MINSPAN code is used and the principle data in the node before (i,k) in the linked list and (i”,r’) is 
structure is implemented with a balanced tree such as an stored after (i,k), then (?,Y) < 0,k) < (i”, F’) must hold 
AVL tree [Aderson-Vel’skii, G.M., and Landis, E.M., 10 in the to& order. If either Of these checks f d S  then 
“& algorithm for the organization of information”, execution halts and “elTOr” is Output. 
Soviet Math. Dokl., pp. 1259-1262,3, 19621, a red-black To perform delete (iyh) the i* array pointer is tra- 
tree [Guibas, L.J., and Sedgewick, R., “A dichromatic versed and the node found is from the linked 
Nineteenth ~ ~ ~ a l  symposium on Foundations of 15 the deletion of item number 7 if one considers FIG. 3(u) 
pp ation is performed one check is made. If the ith array 
173-189, 1, 19721. In addition, an array of pointers in- pointer is nil before the operation then the execution 
dexed from 1 to n is used. The balanced search tree halts and ‘krror” is output‘ 
To perform changekey (i,i,h) it suffices to perform 
delete (i,h) followed by insert (i,k,h). Note, this means stores the ordered pairs in h and is based on the total 
the next item in the certification trail is read. Also, the order described earlier. The array of pointers is initially 
checks associated with both these two operations are all nil. For each item i, the ith pointer of the array is used to point to the location of the ordered pair with 25 performed and the execution halts with ,.error., output 
item number i in the balanced search tree. If there is no if any check fails. such ordered pair in the tree then the ith pointer is nil. the Oth array pointer is 
list is accessed. If there is no such node then “empty” is member (i,h) and delete (i,h). 
The certification trail is generated during the first 3o returned and the operation is complete. Otherwise, 
execution as follows: When CHOOSE root c V is exe- suppose the node is 
output. Also, each time insert (Lkh) Or changekey the ith array pointer is set to nil, and (i,k) is returned. 
(i,k,h) are executed, predecessor is executed after- Lastly, to perform member (i,h) the ith array pointer 
wards, and the answer returned is output. This is illus- 35 is If it is nil then false is returned, otherwise, 
trated in column labeled “Trail” in the table above. true is returned. The predecessor (i,h) operation is not 
The second execution of the MINSPAN algorithm used int he second execution. 
also uses the MINSPAN code; however, the CHOOSE mS completes the description of the second execu- 
construct and the data structure operations are imple- tion. T~ show that there is described a correct imple- 
mented differently than in the fist execution. The 40 mentation of the certification trail method requires a 
CHOOSE is Performed by simply reading the first de- proof. The proof has several parts of varying difficulty. 
merit Of the Certification trail. This guarantees the Same First, one must show that if the first execution is fault- 
choice of a starting vertex is made in both executions. free then it outputs a minimum tree. Second, 
FIG. depicts the Principal data structure used which is one must show that if the first and second executions are 
called an indexed linked list. The array is indexed from 45 fault-free then they both output the -e minimum 
1 to n and Contains pointers to a Singly linked list which spanning tree. Both these parts of the proof are not 
represents the current contents of h from smallest to difficult to show. 
largest. The ith dement of the m a y  Points to the node The third more subtle part of the proof deals with the 
containing the ordered pair with the item number i if it situation in which only the second execution is fault- 
is present in h; otherwise, the pointer is nil. The 0th 50 free. - means an incorrect certification trail may be 
element of the array points to the node containing (0, generated in the first execution. In this case, it must be 
-1NF). Initially, the m a y  contains nil pointers except shown that the second execution outputs either the 
the 0th element. In order to implement the data struc- correct minimum spanning tree or “error”. The checks 
ture operations, the following is provided. that were described this property by detecting any er- 
To perform insert (i,k,h), it is necessary to read the 55 rors that would prevent the execution from generating 
next value in the certification trail. This value, say j, is the correct output. 
the item number of the ordered pair which is the prede- In the first execution each data structure operation 
cessor of (i,k) in the current contents of h. A new linked can be performed in O(log(n)) time where M =n. 
list node is allocated and the trail information is used to There are at most O(m) such operations and O(m) addi- 
insert the node into the data structure. Specifically, the 60 tional time overhead where [E] =m. Thus, the first 
ith array pointer is traversed to a node in the linked h t ,  execution can be performed in O(mlog(n)). It is noted 
say Y. (If j = “smallest” then the 0th array pointer is that th is algorithm does not achieve the fastest known 
traversed.) The new node is inserted in the list just after asymptotic time complexity which appears in Gabow, 
- n d e  .Y and before the next node in the linked list (if H.N., Galil, Z., Spencer, T., and Tarjan, R.E., “Effi- 
.---tipxe is one). The data field in the new node is set to (i,k) 65 cient algorithms for finding minimum spanning trees in 
%-’and ._ the ith pointer of the array is set to point to the new undirected and directed graphs,” Cornbinatorica 6, pp. 
.’ node. FIG. 4 shows the insertion of (7,505) into the data 109-122, 2, 1986. However, the algorithm presented 
structure given that the certification trail value is 6. here has a significantly smaller constant of proportion- 
Framework for balanced trees**, prmdings of the 
Computing, pp. 8-21, IEEE Computer Society Press, 
19781 or a b-tree [Bayer, R., and McCreight, E., “Orga- 
h t .  Next, the ith array pointer is set to d. FIG. 4 Shows 
depicting the data StlWctUre before the Operation and 
‘per- FIG. depicting it *rWmds* When the 
of large ordered indexes”, Acta 
To detelemin This array allows rapid execution Of operations such traversed TO the head ofthe list and the next node in the 
and suppose it the or-. 
cuted in the first step, the vertex is chosen is &red pair (i,k), then the node y is deleted from the list, 
ally which makes it competitive for reasonably sized graphs. In addition, it provides us with a relatively simple and illustrative example of the use of a certification trail.
In the second execution each data structure operation can be performed in O(1) time. There are still at most O(m) such operations and O(m) additional time overhead. Hence, the second execution can be performed in O(m) time. In other words, because of the availability of the certification trail, the second execution is performed in linear time. There are no known O(m) time algorithms for the minimum spanning tree problem. Komlos [2q was able to show that O(m) comparisons suffice to find the minimum spanning tree. However, there is no known O(m) time algorithm to actually find and perform these comparisons. Even the related “verification” problem has no known linear time solution. In the verification problem the input consists of an edge weighted graph and a subtree. The output is “yes” if the subtree is the minimum spanning tree and “no” otherwise. The best known algorithm for this problem was created by Tarjan [Tarjan, R.E., “Applications of path compression on balanced trees”, J. ACM, pp. 690-715, October, 1979] and has the nonlinear time complexity of O(m α(m,n)), where α(m,n) is a functional inverse of Ackermann’s function. The fact that the data in a certification trail enables a minimum spanning tree to be found in linear time is, we believe, intriguing, significant, and indicative of the great promise of the certification trail technique.

Huffman Tree Example
Huffman trees represent another classic algorithmic problem, one of the original solutions being attributed to Huffman [Huffman, D., “A method for the construction of minimum redundancy codes”, Proc. IRE, pp. 1098-1101, 40, 1952]. This solution has been used extensively to perform data compression through the design and use of so-called Huffman codes. These codes are prefix codes which are based on the Huffman tree and which yield excellent data compression ratios. The tree structure and the code design are based on the frequencies of individual characters in the data to be compressed. See Huffman, D., “A method for the construction of minimum redundancy codes”, Proc. IRE, pp. 1098-1101, 40, 1952, for information about the coding application.
Definition 3.3. The Huffman tree problem is the following: Given a sequence of frequencies (positive integers) f[1], f[2], . . . , f[n], construct a tree with n leaves and with one frequency value assigned to each leaf so that the weighted path length is minimized. Specifically, the tree should minimize the following sum: $\sum_{l_i \in LEAF} len(i) \cdot f[i]$, where LEAF is the set of leaves, len(i) is the length of the path from the root of the tree to the leaf l_i, and f[i] is the frequency assigned to the leaf l_i.
An example of a Huffman tree is given in FIG. 6. The input frequencies are: f[1] = 35, f[2] = 20, f[3] = 44, f[4] = 77, f[5] = 23, f[6] = 38, and f[7] = 88. The frequencies appear inside the leaf nodes as the second elements of the ordered pairs in the figure.
The algorithm to construct the Huffman tree uses a data structure which is able to implement the insert and the deletemin operations which are defined above in the minimum spanning tree example. This type of data structure is often called a priority queue. The algorithm also uses the command allocate to construct the tree. This command allocates a new node and returns a pointer to it. Each node is able to store an item number and a key value in the field called info. The item numbers are in the set {1, . . . , 2n - 1} and the key values are sums of frequency values. The nodes also contain fields for left and right pointers since the tree being constructed is binary.
The Huffman tree is built from the bottom up and the overall structure of the algorithm is based on the greedy “merging” of subtrees. An array of pointers called ptr is used to point to the subtrees as they are constructed. Initially, n single vertex subtrees are constructed, and the algorithm repeatedly merges the two subtrees with the smallest associated frequency values. To perform a merge a new subtree is created by first allocating a new root node and next setting the left and right pointers to the two subtrees being merged. The frequency associated with the new subtree is the sum of the frequencies of the two subtrees being merged. In FIG. 6 the frequency associated with each subtree is shown as the second value in the root vertex of the subtree. Details of the algorithm are given below. Note that the priority queue data structure allows the algorithm to quickly determine which subtrees should be merged by enabling the two smallest frequency values to be found efficiently during each iteration.
Table 2 below illustrates the data structure operations performed when the Huffman tree in FIG. 6 is constructed. For conciseness the initial insert operations have been omitted. The first column gives the set of ordered pairs in h. The second column gives the result of the two deletemin operations during each iteration. Note that this column is labeled “Trail” because it is also output as the certification trail. The third column records the elements which are inserted by the command on line 13.

TABLE 2
Data structure operations and certification trail for HUFFMAN

Set of Ordered Pairs                                 Trail                Insert
(2,20),(5,23),(1,35),(6,38),(3,44),(4,77),(7,88)
(1,35),(6,38),(8,43),(3,44),(4,77),(7,88)            (2,20),(5,23)        (8,43)
(8,43),(3,44),(9,73),(4,77),(7,88)                   (1,35),(6,38)        (9,73)
(9,73),(4,77),(10,87),(7,88)                         (8,43),(3,44)        (10,87)
(10,87),(7,88),(11,150)                              (9,73),(4,77)        (11,150)
(11,150),(12,175)                                    (10,87),(7,88)       (12,175)
                                                     (11,150),(12,175)    (13,325)

HUFFMAN ALGORITHM [listing not recoverable from the source text]

First Execution of HUFFMAN
In this execution the code entitled HUFFMAN is used and the priority queue data structure is implemented with a heap [Tarjan, R.E., Data Structures and Network Algorithms, Society for Industrial and Applied Mathematics, Philadelphia, Pa. 1983] or a balanced search tree [Guibas, L.J., and Sedgewick, R., “A dichromatic framework for balanced trees”, Proceedings of the Nineteenth Annual Symposium on Foundations of Computing, pp. 8-21, IEEE Computer Society Press, 1978; Adel’son-Vel’skii, G.M., and Landis, E.M., “An algorithm for the organization of information”, Soviet Math. Dokl., pp. 1259-1262, 3, 1962; Bayer, R., and McCreight, E., “Organization of large ordered indexes”, Acta Inform., pp. 173-189, 1, 1972]. Actually, any correct implementation is acceptable; however, to achieve a reasonable time complexity for this execution the suggested implementations are desirable. The certification trail is generated as follows:
whenever deletemin (h) is executed the item number 
and the key value which are returned are both output. 
In the table, the certification trail is listed in the second 
column. 
Second Execution of HUFFMAN 
This execution consists of two parts which may be logically separated but which are performed together. In the first logical part, the code called HUFFMAN is executed again except that the data structure operations are treated differently. No insert operations are performed, and all deletemin operations are performed by simply reading the ordered pairs from the certification trail. In the second logical part, the data structure operations are “verified”. Note that “verify” here does not mean a formal proof of correctness based on the text of an algorithm. The problem of verification can be formulated as follows: given a sequence of insert (i,k,h) and deletemin (h) operations, check to see if the answers are correct. It should be noted that while in our example there is only one h, in general there can be multiple h’s to be handled.
The description of the algorithm for the second execution can be further simplified because only some restricted types of operation sequences are generated by the HUFFMAN code. First, it can be observed that all elements are ultimately deleted from h before the algorithm terminates; second, it can be further observed that when an element is inserted into h, its key value is larger than the key value of the last element deleted from h. These two important observations allow us to check a sequence using the simplified method which is described next.
Our simplified method uses an array of integers indexed from 1 to 2n - 1. This array is used to track the contents of h. If the ordered pair (i,k) is in h, then array element i is set to a value of k; and if no ordered pair with item number i is in h, then array element i is set to a value of -1. Initially, all array elements are set to -1 and then the operation sequence is processed. If insert (i,k) is executed then array element i is checked to see if it contains -1. (The value of -1 is an arbitrary selection meant to serve only as an indicator.) If array element i does contain -1, then it is set to k. If deletemin (h) is executed, then the answer indicated by the certification trail, say (i,k), is examined. Array element i is checked to see if it contains k. In addition, k is compared to the key value of the previous element in the certification trail sequence to see if it is greater than or equal to that previous value. If both these checks succeed then array element i is set to -1.
If any of the checks just described above fails, then the execution halts and “error” is output. Otherwise the operation sequence is considered “verified”. It can be rigorously shown that the checks described are sufficient for determining whether the answers given in the certification trail are correct; this proof, however, has been omitted for the sake of brevity. Finally, it is worth noting that to combine the two logical parts of this execution, one can perform the data structure checking in tandem with the code execution of HUFFMAN. Each time an insert or deletemin is encountered in the code, the appropriate set of checks is performed.
Time Complexity Comparison of the Two Executions
Again, as in the minimum spanning tree example, the availability of the certification trail permits the second execution for the Huffman tree problem to be dramatically more efficient than the first.
In the first execution of HUFFMAN, each data structure operation can be performed in O(log(n)) time where n is the number of frequencies in the input. There are O(n) such operations and O(n) additional time overhead; hence, the execution can be performed in O(n log(n)) time. This is the same complexity as the best known algorithm for constructing Huffman trees.
In the second code execution of HUFFMAN, each data structure operation is performed in constant time. Further, verifying that the data structure operations are correct takes only constant time per operation. Thus, it follows that the overall complexity of the second execution is only O(n).
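The two executions of the Huffman example can be illustrated with a minimal Python sketch. It is not the HUFFMAN listing of the patent, which is not reproduced in this text. The function names huffman_first and huffman_second are hypothetical, Python's heapq module stands in for the priority queue, a list of (new item, left item, right item) triples stands in for the allocated tree nodes, and the final deletemin that removes the root is an assumption made so that every element is ultimately deleted from h.

```python
import heapq

def huffman_first(freq):
    """First execution: heap-based, O(n log n). Returns (merges, trail)."""
    n = len(freq)
    h = [(k, i) for i, k in enumerate(freq, start=1)]  # (key, item number)
    heapq.heapify(h)
    merges, trail = [], []
    for new_item in range(n + 1, 2 * n):
        k1, i1 = heapq.heappop(h)
        trail.append((i1, k1))          # every deletemin answer goes to the trail
        k2, i2 = heapq.heappop(h)
        trail.append((i2, k2))
        merges.append((new_item, i1, i2))
        heapq.heappush(h, (k1 + k2, new_item))
    k, i = heapq.heappop(h)             # assumed final deletemin of the root
    trail.append((i, k))
    return merges, trail

def huffman_second(freq, trail):
    """Second execution: O(n). deletemin answers are read from the trail and
    checked with an array; raises ValueError("error") if any check fails."""
    n = len(freq)
    slot = [-1] * (2 * n)               # slot[i] = k if (i, k) is in h, else -1
    pos, prev_key = 0, 0

    def insert(i, k):
        if slot[i] != -1:
            raise ValueError("error")
        slot[i] = k

    def deletemin():
        nonlocal pos, prev_key
        if pos >= len(trail):
            raise ValueError("error")
        i, k = trail[pos]
        pos += 1
        # (i, k) must be in h, and keys must be non-decreasing along the trail.
        if not (1 <= i < 2 * n) or slot[i] != k or k < prev_key:
            raise ValueError("error")
        slot[i], prev_key = -1, k
        return i, k

    for i, k in enumerate(freq, start=1):
        insert(i, k)
    merges = []
    for new_item in range(n + 1, 2 * n):
        i1, k1 = deletemin()
        i2, k2 = deletemin()
        merges.append((new_item, i1, i2))
        insert(new_item, k1 + k2)
    deletemin()                         # assumed final deletemin of the root
    if pos != len(trail):
        raise ValueError("error")
    return merges
```

Running huffman_first on the frequencies of FIG. 6 (35, 20, 44, 77, 23, 38, 88) yields a trail that begins (2,20), (5,23), (1,35), (6,38), in agreement with the Trail column of Table 2, and huffman_second reproduces the same merges from that trail without using a heap. The outputs of the two functions would then be compared, as in the general method.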
Convex Hull Example 
The convex hull problem is fundamental in computational geometry. The certification trail solution to the generation of a convex hull is based on a solution due to Graham [Graham, R.L., “An efficient algorithm for determining the convex hull of a planar set”, Information Processing Letters, pp. 132-133, 1, 1972] which is called “Graham’s Scan.” (For basic definitions and concepts in computational geometry, see the text of Preparata and Shamos [Preparata F.P., and Shamos M.I., Computational geometry: an introduction, Springer-Verlag, New York, N.Y., 1985].) For simplicity in the discussion which follows, it is assumed the points are in so-called “general position” (that is, no three points are collinear). It is not difficult to remove this restriction.
Definition 3.4. A convex region in R^2 is a set of points, say Q, in R^2 such that for every pair of points in Q the line segment connecting the points lies entirely within Q. A polygon is a circularly ordered set of line segments such that each line segment shares one of its endpoints with the preceding line segment and shares the other endpoint with the succeeding line segment in the ordering. The shared endpoints are called the vertices of the polygon. A polygon may also be specified by an ordering of its vertices. A convex polygon is a polygon which is the boundary of some convex region. The convex hull of a set of points, S, in the Euclidean plane is defined as the smallest convex polygon enclosing all the points. This polygon is unique and its vertices are a subset of the points in S. It is specified by a counterclockwise sequence of its vertices.
FIG. 8(c) shows a convex hull for the points indicated by black dots. Graham’s scan algorithm given below constructs the convex hull incrementally in a counterclockwise fashion. Sometimes it is necessary for the algorithm to “back up” the construction by throwing some vertices out and then continuing. The first step of the algorithm selects an “extreme” point and calls it p1. The next two steps sort the remaining points in a way which is depicted in FIG. 8(a). It is not hard to show that after these three steps the points when taken in order, p1, p2, . . . , pn, form a simple polygon; although, in general, this polygon is not convex.
Graham’s Scan Algorithm [listing not recoverable from the source text]
It is possible to think of Graham’s scan algorithm as removing points from this simple polygon until it becomes convex. The main FOR loop iteration adds vertices to the polygon under construction and the inner WHILE loop removes vertices from the construction. A point is removed when the angle test performed at
Step 6 reveals that it is not on the convex hull because it falls within the triangle defined by three other points. A “snapshot” of the algorithm given in FIG. 8(b) shows that q5 is removed from the hull. The angle formed by q4, q5, p6 is less than 180 degrees. This means q5 lies within the triangle formed by q4, p1, p6. (Note, q1 = p1.) In general, when the angle test is performed, if the angle formed by q_{m-1}, q_m, p_k is less than 180 degrees, then q_m lies within the triangle formed by q_{m-1}, p_1, p_k. Below it will be revealed that this is the primary information relied on in our certification trail. When the main FOR loop is complete, the convex hull has been constructed.

First Execution of Graham’s Scan
In this execution the code CONVEXHULL is used. The certification trail is generated by adding an output statement within the WHILE loop. Specifically, if an angle of less than 180 degrees is found in the WHILE loop test then the four tuple consisting of q_m, q_{m-1}, p_1, p_k is output to the certification trail. Table 3 below shows the four tuples of points that would be output by the algorithm when run on the example in FIG. 8. The points in Table 3 are given the same names as in FIG. 8(b). The final convex hull points q1, . . . , qm are also output to the certification trail. Strictly speaking the trail output does not consist of the actual points in R^2. Instead, it consists of indices to the original input data. This means if the original data consists of s1, s2, . . . , sn then rather than output the element in R^2 corresponding to s_i the number i is output. It is not hard to code the program so that this is done.

TABLE 3
First part of certification trail for Graham’s scan

Point not on convex hull    Three surrounding points
p5                          p4, p1, p6
p7                          p6, p1, p8
p4                          p3, p1, p6

Second Execution for the Convex Hull Problem
Let the certification trail consist of a set of four tuples, (x1,a1,b1,c1), (x2,a2,b2,c2), . . . , (xr,ar,br,cr) followed by the supposed convex hull, q1, q2, . . . , qm. The code for CONVEXHULL is not used in this execution. Indeed, the algorithm performed is dramatically different than CONVEXHULL. It consists of five checks on the trail data.
First, the algorithm checks for i ∈ {1, . . . , r} that x_i lies within the triangle defined by a_i, b_i, and c_i.
Second, the algorithm checks that for each triple of counterclockwise consecutive points on the supposed convex hull the angle formed by the points is less than or equal to 180 degrees.
Third, it checks that there is a one to one correspondence between the input points and the points in {x1, . . . , xr} ∪ {q1, . . . , qm}.
Fourth, it checks that for i ∈ {1, . . . , r}, a_i, b_i, and c_i are among the input points.
Fifth, it checks that there is a unique point among the points on the supposed convex hull which is a local extreme point. A point q on the hull is a local extreme point if its predecessor in the counterclockwise ordering has a strictly smaller y coordinate and its successor in the ordering has a smaller or equal y coordinate.
If any of these checks fail then execution halts and “error” is output. As mentioned above, the trail data actually consists of indices into the input data. This does not unduly complicate the checks above; instead it makes them easier. The correctness and adequacy of these checks must be proven.

Time Complexity of the Two Executions
In the first execution the sorting of the input points takes O(n log(n)) time where n is the number of input points. One can show that this cost dominates and the complexity is O(n log(n)).
It is possible to note that, unlike the minimum spanning tree example and the Huffman tree example, the convex hull example utilizes an algorithm in the second execution that is not a close variant of that used in the first execution. However, like the previous two examples, the second execution for the convex hull problem depends fundamentally on the information in the certification trail for efficiency and performance.

Concurrency of Executions
In the three examples discussed above, it is possible to start the second execution before the first execution has terminated. This is a highly desirable capability when additional hardware is available to run the second execution (for example, with multiprocessor machines, or machines with coprocessors or hardware monitors).
In the case of the minimum spanning tree problem, the two executions can be run concurrently. It is only necessary for the second execution to read the certification trail as it is generated, one item number at a time. Thus, there is a slight time lag in the second execution.
The case of the Huffman tree problem is similar. Both executions can be run concurrently if the second execution reads the certification trail as it is generated by the first execution.
The case of the convex hull problem is not quite as favorable, but it is still possible to partially overlap the two executions. For example, as each 4-tuple of points is generated by the first execution, it can be checked by the second execution. But the second execution must wait for the points on the convex hull to be output at the end of the first execution before they can be checked.
An additional opportunity for overlapping execution occurs when the system has a dedicated comparator. In this case it is sometimes possible for the two executions to send their output to the comparator as they generate it. For example, this can be done in the minimum spanning tree problem where the edges of the tree can be sent individually as they are discovered by both executions.
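Returning to the convex hull example, the two executions can be sketched in Python as follows. This is an illustration under stated assumptions, not the CONVEXHULL listing of the patent, which is not reproduced in this text. The function names are hypothetical, points are tuples of integers in general position, the extreme point is taken to be the lowest point, the angle tests are expressed through the sign of a cross product, and the trail holds indices into the input as described for the first execution.

```python
import math

def cross(a, b, c):
    """Signed area test: > 0 if a, b, c make a left (counterclockwise) turn."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def graham_scan_with_trail(s):
    """First execution: returns (trail, hull) as indices into s."""
    n = len(s)
    p1 = min(range(n), key=lambda i: (s[i][1], s[i][0]))   # extreme point
    rest = sorted((i for i in range(n) if i != p1),
                  key=lambda i: math.atan2(s[i][1] - s[p1][1], s[i][0] - s[p1][0]))
    hull, trail = [p1, rest[0]], []
    for pk in rest[1:]:
        # Back up while the newest hull vertex fails the angle test.
        while len(hull) > 2 and cross(s[hull[-2]], s[hull[-1]], s[pk]) < 0:
            qm = hull.pop()
            trail.append((qm, hull[-1], p1, pk))  # removed point, then its triangle
        hull.append(pk)
    return trail, hull

def strictly_inside(x, a, b, c):
    d = (cross(a, b, x), cross(b, c, x), cross(c, a, x))
    return all(v > 0 for v in d) or all(v < 0 for v in d)

def check_convex_hull_trail(s, trail, hull):
    """Second execution: the five checks; True means "verified"."""
    n, m = len(s), len(hull)
    # Fourth check: every a_i, b_i, c_i is an input point (a valid index).
    if any(not 0 <= j < n for (_, a, b, c) in trail for j in (a, b, c)):
        return False
    # Third check: removed points and hull points match the input one to one.
    if sorted([x for (x, _, _, _) in trail] + list(hull)) != list(range(n)):
        return False
    # First check: each removed point lies within its recorded triangle.
    if not all(strictly_inside(s[x], s[a], s[b], s[c]) for (x, a, b, c) in trail):
        return False
    if m < 3:
        return False
    # Second check: no consecutive triple on the hull makes a right turn.
    for i in range(m):
        if cross(s[hull[i - 1]], s[hull[i]], s[hull[(i + 1) % m]]) < 0:
            return False
    # Fifth check: exactly one local extreme point on the hull.
    extremes = sum(1 for i in range(m)
                   if s[hull[i - 1]][1] < s[hull[i]][1] >= s[hull[(i + 1) % m]][1])
    return extremes == 1
```

The checker never sorts, so apart from the one to one correspondence test (written here with sorted for brevity; a counting array over the n indices would keep it linear) each check is a single pass over the trail or the hull. The range and correspondence checks run first only so that the later checks can index the input safely.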
Comparison of Techniques 
The certification trail approach to fault tolerance, 
whether implemented in hardware or software or some 
combination thereof, has resemblances with other fault 
tolerant techniques that have been previously proposed 
and examined, but in each case there are significant and 
fundamental distinctions. These distinctions are primar- 
ily related to the generation and character of the certifi- 
cation trail and the manner in which the secondary 
algorithm or system uses the certification trail to indi- 
cate whether the execution of the primary system or 
algorithm was in error and/or to produce an output to 
be compared with that of the primary system. 
To begin, the certification trail approach might be
viewed as a form of N-version programming [Chen, L., 
and Avizienis A., “N-version programming: a fault 
tolerant approach to reliability of software operation,” Digest of the 1978 Fault Tolerant Computing Symposium, pp. 3-9, IEEE Computer Society Press, 1978; Avizienis, A., and Kelly J., “Fault tolerance by design diversity: concepts and experiments,” Computer, vol. 17, pp. 67-80, August, 1984]. This approach specifies that N different implementations of an algorithm be independently executed with subsequent comparison of the resulting N outputs. There is no relationship among the executions of the different versions of the algorithms other than they all use the same input; each algorithm is executed independently without any information about the execution of the other algorithms. In marked contrast, the certification trail approach allows the primary system to generate a trail of information while executing its algorithm that is critical to the secondary system’s execution of its algorithm. In effect, N-version programming can be thought of relative to the certification trail approach as the employment of a null trail.
A software/hardware fault tolerance technique known as the recovery block approach [Randell, B., “System structure for software fault tolerance,” IEEE Trans. on Software Engineering vol. 1, pp. 202-232, June, 1975; Anderson, T., and Lee, P., Fault tolerance: principles and practices, Prentice-Hall, Englewood Cliffs, N.J., 1981; Lee, Y. H. and Shin, K. G., “Design and evaluation of a fault-tolerant multiprocessor using hardware recovery blocks,” IEEE Trans. Comput., vol. C-33, pp. 113-124, February 1984] uses acceptance tests and alternative procedures to produce what is to be regarded as a correct output from a program. When using recovery blocks, a program is viewed as being structured into blocks of operations which after execution yield outputs which can be tested in some informal sense for correctness. The rigor, completeness, and nature of the acceptance test is left to the program designer, and many of the acceptance tests that have been proposed for use tend to be somewhat straightforward [Anderson, T., and Lee, P., Fault tolerance: principles and practices, Prentice-Hall, Englewood Cliffs, N.J., 1981]. Indeed, formal methodologies for the definition and generation of acceptance tests have thus far not been established. Regardless, the certification trail notion of a secondary system that receives the same input as the primary system and executes an algorithm that takes advantage of this trail to efficiently produce the correct output and/or to indicate that the execution of the first algorithm was correct does not fall into the category of an acceptance test.
A watchdog processor is a small and simple (relative to the primary system being monitored) hardware monitor that detects errors by examining information relative to the behavior of the primary system [Mahmood, A., and McCluskey, E., “Concurrent error detection using watchdog processors,” IEEE Trans. on Computers, vol. 37, pp. 160-174, February, 1988; Mahmood, A., and McCluskey, E., “Concurrent error detection using watchdog processors - a survey,” IEEE Trans. on Computers, vol. 37, pp. 160-174, February, 1988; Namjoo, M., and McCluskey, E., “Watchdog processors and capability checking,” Digest of the 1982 Fault Tolerant Computing Symposium, pp. 245-248, IEEE Computer Society Press, 1982]. Error detection using a watchdog processor is a two-phase process: in the set-up phase, information about system behavior is provided a priori to the watchdog processor about the system to be monitored; in the monitoring phase, the watchdog processor collects or is sent information about the operation of the system to be compared with that which was provided during the set-up phase. On the basis of this comparison, a decision is made by the watchdog processor as to whether or not an error has occurred. The information about system behavior by means of which a watchdog processor must monitor for errors includes memory access behavior [Namjoo, M., and McCluskey, E., “Watchdog processors and capability checking,” Digest of the 1982 Fault Tolerant Computing Symposium, pp. 245-248, IEEE Computer Society Press, 1982], control and program flow [Eifert, J. B. and Shen, J. P., “Processor monitoring using asynchronous signatured instruction streams,” Dig. 14th Int. Conf. Fault-Tolerant Comput., pp. 394-399, 1984, June 20-22; Iyengar, V. S. and Kinney, L. L., “Concurrent fault detection in microprogrammed control units,” IEEE Trans. Comput., vol. C-34, pp. 810-821, September 1985; Kane, J. R. and Yau, S. S., “Concurrent software fault detection,” IEEE Trans. Software Eng., vol. SE-1, pp. 87-99, March 1975; Lu, D., “Watchdog processor and structural integrity checking,” IEEE Trans. Comput., vol. C-31, pp. 681-685, July 1982; Namjoo, M., “Techniques for concurrent testing of VLSI processor operation,” Dig. 1982 Int. Test Conf., pp. 461-468, November 1982; Namjoo, M., “CERBERUS-16: An architecture for a general purpose watchdog processor,” Dig. Papers 13th Annu. Int. Symp. Fault Tolerant Comput., pp. 216-219, June, 1983; Shen, J. P. and Schuette, M.A., “On-line self-monitoring using signatured instruction streams,” Proc. 1983 Int. Test Conf., pp. 275-282, October, 1983; Sridhar, T. and Thatte, S. M., “Concurrent checking of program flow in VLSI processors,” Dig. 1982 Int. Test Conf., pp. 191-199, November, 1982; 46, 47], or reasonableness of results [Mahmood, A., Lu, D. J. and McCluskey, E. J., “Concurrent fault detection using a watchdog processor and assertions,” Proc. 1983 Int. Test Conf., pp. 622-628, October, 1983; Mahmood, A., Ersoz, A., and McCluskey, E.J., “Concurrent system level error detection using a watchdog processor,” Proc. 1985 Int. Test Conf., pp. 145-152, November, 1985]. Using physical fault injection techniques, distributions of errors that could be detected using such types of information have been determined for some specific systems [Schmid, M., Trapp, R., Davidoff, A., and Masson, G., “Upset exposure by means of abstraction verification,” Dig. of the 1982 Fault Tolerant Computing Symposium, pp. 237-244, June, 1982; Gunneflo, U., Karlsson, J., and Torin, J., “Evaluation of error detection schemes using fault injection by heavy-ion radiation,” Dig. of the 1989 Fault Tolerant Computing Symposium, pp. 340-347, June, 1989], and the performance of models of error monitoring techniques that could be realized in the form of watchdog processors has been analyzed [Blough, D., and Masson, G., “Performance analysis of a generalized concurrent error detection procedure,” IEEE Trans. on Computers vol. 39, January, 1990]. However, in contrast to the certification trail technique, a watchdog processor uses only a priori defined behavior checks, none of which is sufficient together with the input to the primary system to efficiently reproduce the output for direct comparison with that of the primary system.
Related to the watchdog processor approach is that of using executable assertions [Andrews, D., “Software fault tolerance through executable assertions,” Rec. 12th Asilomar Conf. Circuits, Syst., Comput., pp. 641-645, 1978, November 6-8; Andrews, D., “Using
executable assertions for testing and fault tolerance,” Dig. 9th Annu. Int. Symp. Fault-Tolerant Comput., pp. 102-105, 1979, June 20-22; Mahmood, A., Lu, D. J. and McCluskey E. J., “Concurrent fault detection using a watchdog processor and assertions,” Proc. 1983 Int. Test Conf., pp. 622-628, October 1983]. An assertion can be defined as an invariant relationship among variables of a process. In a program, for example, assertions can be written as logical statements and can be inserted into the code to signify that which has been predetermined to be invariably true at that point in the execution of the program. Assertions are based on a priori determined properties of the primary system or algorithm. This, however, again serves to distinguish the executable assertion technique from the use of certification trails in that a certification trail is a key to the solution of a problem or the execution of an algorithm that can be utilized to efficiently and correctly produce the solution.
Algorithm-based fault tolerance [Huang, K.-H., and Abraham, J., “Algorithm-based fault tolerance for matrix operations,” IEEE Trans. on Computers, pp. 518-529, vol. C-33, June, 1984; Nair, V., and Abraham, J., “General linear codes for fault-tolerant matrix operations on processor arrays,” Dig. of the 1988 Fault Tolerant Computing Symposium, pp. 180-185, June, 1988; “Fault tolerant FFT networks,” Dig. of the 1985 Fault Tolerant Computing Symposium, June, 1985] uses error detecting and correcting codes for performing reliable computations with specific algorithms. This technique encodes data at a high level and algorithms are specifically designed or modified to operate on encoded data and produce encoded output data. Algorithm-based fault tolerance is distinguished from other fault tolerance techniques by three characteristics: the encoding of the data used by the algorithm; the modification of the algorithm to operate on the encoded data; and the distribution of the computation steps in the algorithm among computational units. It is assumed that at most one computational unit is faulty during a specified time period. The error detection capabilities of the algorithm-based fault tolerance approach are directly related to that of the error correction encoding utilized. The certification trail approach does not require that the data to be executed be modified nor that the fundamental operations of the algorithm be changed to account for these modifications. Instead, only a trail indicative of aspects of the algorithm’s operations must be generated by the algorithm. As seen from the above examples, the production of this trail does not burden the algorithm with a significant overhead. Moreover, any combination of computational errors can be handled.
Recently Blum and Kannan [Blum, M., and Kannan, S., “Designing programs that check their work,” Proceedings of the 1989 ACM Symposium on Theory of Computing, pp. 86-97, ACM Press, 1989] have defined what they call a program checker. A program checker [...] because it is allowed to be probabilistic in a carefully specified way. There are two main differences between this approach and the certification trail approach. First, a program checker may call the algorithm it is checking a polynomial number of times. In the certification trail approach the algorithm being checked is run once. Second, the checker is designed to work for a problem and not a specific algorithm. That is, the checker design is based on the input/output specification of a problem. The certification trail approach is explicitly algorithm oriented. In other words, a specific algorithm for a problem is modified to output a certification trail. This trail sometimes allows the second execution to be faster than any known program checkers for the problem. This is the case for the [...]
Other hardware and software fault tolerance and error monitoring techniques have been proposed and studied that might be thought of as bearing some resemblance to the certification trail approach. Extensive summaries and descriptions of these techniques can be found in the literature [Siewiorek, D., and Swarz, R., The theory and practice of reliable design, Digital Press, Bedford, Mass., 1982; Avizienis, A., “Fault tolerance by means of external monitoring of computer systems,” Proceedings of the 1981 National Computer Conference, pp. 27-40, AFIPS Press, 1980; Johnson, B., Design and analysis of fault tolerant digital systems, Addison-Wesley, Reading, Mass., 1989; Mahmood, A., and McCluskey, E., “Concurrent error detection using watchdog processors - a survey,” IEEE Trans. on Computers, vol. 37, pp. 160-174, February, 1988]. Examination of these techniques reveals, however, that in each case there are fundamental distinctions from the certification trail approach. In summary, the certification trail approach stands alone in its employment of secondary algorithms/systems for the computation of an output for comparison that, because of the availability of the trail, not only proceeds in a more efficient manner than that of the primary but also can indicate whether the execution of the primary algorithm was correct.
Although the invention has been described in detail in the foregoing embodiments for the purpose of illustration, it is to be understood that such detail is solely for that purpose and that variations can be made therein by those skilled in the art without departing from the spirit and scope of the invention except as it may be described by the following claims.
What is claimed is:
1. A method for achieving fault tolerance in a computer system having at least a first central processing unit and a second central processing unit comprising the steps of:
executing a first algorithm in the first central process-
is an algorithm which checks the output of an other ing unit on input so that a first output and a certifi- 
algorithm for correctness and thus it is similar to an 60 cation trail are produced; 
acceptance test in a recovery block. An example of a executing a second algorithm in the second central 
program checker is the algorithm developed by Tarjan processing unit on the input and on the certification 
[Tarjan, R. E., “Applications of path compression on trail so that a second output is produced, said sec- 
balanced trees,” J. ACM, pp. 690-715, October, 19791 ond algorithm having a faster execution time than 
which takes as input a graph and a supposed minimum 65 the first algorithm for a given input; and 
spanning tree and indicates whether or not the tree comparing the first and second outputs such that an 
actually is a minimum spanning tree. The Blum and error result is produced if the first and second out- 
Kannan checker is actually more general than this be- puts are not the same. 
Algorithm-based fault tolerance [Huang, K.-H., and 20 minimum spanning tree problem. 
2. A method as described in claim 1 wherein the step 
of executing the second algorithm includes the step of 
determining whether the certification trail is in error. 
3. A method as described in claim 2 including before 
the step of executing the first algorithm, there is the step 
of duplicating the input such that the input that is
provided to the step of executing the first algorithm is also
the input that is provided to the step of executing the 
second algorithm. 
4. A method as described in claim 3 wherein the step 
of executing the first algorithm includes the step of 
determining whether the first output is in error. 
5. A method as described in claim 4 wherein the step 
of executing the first algorithm includes the step of 
determining whether the second output is in error. 
6. A method as described in claim 5 wherein the 
second algorithm generates the second output correctly 
when the second algorithm is executed by the second 
processing unit even if the certification trail produced
by the first algorithm when the first algorithm is exe- 
cuted by the first processing unit is incorrect. 
7. A method as described in claim 1 wherein the 
second algorithm is derived from the first algorithm. 
8. A computer system comprising: 
a first computer comprising: 
a first memory, 
a first central processing unit in communication with 
the memory, 
a first input port in communication with the memory 
and the first central processing unit, 
a first algorithm disposed in the first memory, said 
first algorithm produces a first output and produces 
a certification trail based on input received by the 
input port when the first algorithm is executed by 
the first central processor; 
a second computer comprising a second memory, 
a second central processing unit in communication 
with the second memory and the first central pro- 
cessing unit; 
a second input port in communication with the sec- 
ond memory and the second central processing 
unit; 
a second algorithm disposed in the second memory, 
said second algorithm produces a second output 
based on the input and the certification trail when 
the second algorithm is executed by the second 
central processing unit, said second algorithm hav- 
ing a faster execution time than the first algorithm 
for a given input; and 
a mechanism for comparing the first and second out- 
puts such that an error result is produced if the first 
and second outputs are not the same. 
9. A computer as described in claim 8 wherein the 
second algorithm generates the second output correctly 
when the second algorithm is executed by the second 
processing unit even if the certification trail produced 
by the first algorithm when the first algorithm is exe- 
cuted by the first processing unit is incorrect. 
10. A computer system as described in claim 9 
wherein the mechanism for comparing is a comparator. 
11. An apparatus as described in claim 10 wherein the 
second algorithm is derived from the first algorithm. 
12. A method for achieving fault tolerance in a cen- 
tral processing unit comprising the steps of: 
executing a first algorithm in the central processing 
unit on input so that a first output and a certifica- 
tion trail are produced; 
executing a second algorithm in the central process- 
ing unit on the input and on the certification trail so 
that a second output is produced, said second algo- 
rithm having a faster execution time than the first 
algorithm for a given input; and 
comparing the first and second outputs such that an
error result is produced if the first and second out- 
puts are not the same. 
13. A method as described in claim 12 wherein the 
second algorithm generates the second output correctly 
when the second algorithm is executed by the process- 
ing unit even if the certification trail produced by the 
first algorithm when it is executed by the processing
unit is incorrect.
14. A method as described in claim 13 wherein the
second algorithm is derived from the first algorithm.
15. A computer comprising:
a memory,
a central processing unit in communication with the
memory,
a first input port in communication with the memory
and the central processing unit,
a first algorithm disposed in the memory, said first
algorithm produces a first output and a certifica-
tion trail based on input received by the input port
when the input is executed by the central process-
ing unit;
a second algorithm disposed in the memory, said
second algorithm produces a second output based
on the input and on at least a portion of the certifi-
cation trail when the second algorithm is executed
by the central processing unit, said second algo-
rithm having a faster execution time than the first
algorithm for a given input; and
a mechanism for comparing the first and second out-
puts such that an error result is produced if the first
and second outputs are not the same.
16. A computer as described in claim 15 wherein the
second algorithm generates the second output correctly
when the second algorithm is executed by the process-
ing unit even if the certification trail produced by the
first algorithm when the first algorithm is executed by
the processing unit is incorrect.
17. A computer as described in claim 16 wherein the
mechanism for comparing is a comparator.
18. An apparatus as described in claim 15 wherein the
second algorithm is derived from the first algorithm.
* * * * *
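
By way of illustration only, the steps recited in claims 1 and 12 (execute a first algorithm that emits an output and a certification trail, execute a faster second algorithm on the input and the trail, then compare the outputs) might be sketched as follows. Sorting is used purely as an assumed example problem chosen for brevity. The function names first_algorithm, second_algorithm and run_with_certification_trail are hypothetical and do not appear in the claims. In this sketch the second algorithm signals a detected trail error by returning None rather than producing an output.

```python
from typing import List, Optional, Tuple


def first_algorithm(data: List[int]) -> Tuple[List[int], List[int]]:
    """Sort `data` (O(n log n)) and emit a certification trail.

    Assumption for this sketch: the trail is the list of input indices
    in the order in which their values appear in the sorted output.
    """
    trail = sorted(range(len(data)), key=lambda i: data[i])
    output = [data[i] for i in trail]
    return output, trail


def second_algorithm(data: List[int], trail: List[int]) -> Optional[List[int]]:
    """Rebuild the sorted output in O(n) from the input and the trail.

    Returns None if the trail is found to be in error, i.e. it is not a
    permutation of the input indices or does not yield a sorted order.
    """
    n = len(data)
    if len(trail) != n:
        return None
    seen = [False] * n
    output: List[int] = []
    for idx in trail:
        # Each index must be in range and used exactly once.
        if not isinstance(idx, int) or idx < 0 or idx >= n or seen[idx]:
            return None
        seen[idx] = True
        # Values read in trail order must be non-decreasing.
        if output and data[idx] < output[-1]:
            return None
        output.append(data[idx])
    return output


def run_with_certification_trail(data: List[int]) -> List[int]:
    """Run both algorithms and compare their outputs."""
    first_output, trail = first_algorithm(data)
    second_output = second_algorithm(data, trail)
    if second_output is None or second_output != first_output:
        raise RuntimeError("error result: outputs differ or trail is invalid")
    return first_output
```

The second algorithm never trusts the trail: it checks that the trail is a permutation and that it induces a sorted order, so a faulty trail leads to an error indication rather than a silently wrong output. The two functions could run on separate processing units or on the same one.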
