Design and Scheduling for Periodic Concurrent Error Detection and Recovery in Processor Arrays by Wang, Yi-Min et al.
May 1992 UILU-ENG-92-2214
CRHC-92-08
Center for Reliable and High-Performance Computing
DESIGN AND SCHEDULING 
FOR PERIODIC
CONCURRENT ERROR DETECTION 
AND RECOVERY 
IN PROCESSOR ARRAYS
Yi-Min Wang, Pi-Yu Chung, and W. Kent Fuchs
Coordinated Science Laboratory 
College of Engineering
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
Approved for Public Release. Distribution Unlimited.
Design and Scheduling for Periodic Concurrent Error Detection 
and Recovery in Processor Arrays
Yi-Min Wang, Pi-Yu Chung and W. Kent Fuchs
Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
Correspondent: Yi-Min Wang
Coordinated Science Laboratory 
1101 W. Springfield Ave. 
University of Illinois 
Urbana, IL 61801
E-mail: ymwang@crhc.uiuc.edu 
Phone: (217) 244-7161 
FAX: (217) 244-5686
Abstract
Periodic application of time-redundant error checking provides a trade-off between error
detection latency and performance degradation. The goal is to achieve high error coverage while
satisfying performance requirements. In this paper, we derive the optimal scheduling of checking
patterns in order to uniformly distribute the available checking capability and maximize the
error coverage. Synchronous buffering designs using data forwarding and dynamic
reconfiguration are described. Efficient single-cycle diagnosis is implemented by error pattern
analysis and a direct-mapped recovery cache. A rollback recovery scheme using start-up control
for local recovery is also presented.
Acknowledgement: This research was supported in part by the National Aeronautics and Space Administration 
(NASA) under Grant NASA NAG 1-613, in cooperation with the Illinois Computer Laboratory for Aerospace 
Systems and Software (ICLASS), and in part by the Joint Services Electronics Program (U.S. Army, U. S. Navy 
and U. S. Air Force) under Contract N00014-90-J-1270.
1. INTRODUCTION
A variety of processor arrays have been proposed for signal and image processing and
scientific computation applications [1]. In order to detect errors produced by faults in these
arrays, a variety of off-line testing procedures have been developed for detecting permanent faults,
and concurrent error detection (CED) techniques have been developed for transient and
intermittent failures [2]. The focus of this paper is on processor arrays for systolic algorithms and the
use of time redundancy techniques for concurrent detection of errors [3,17].
Traditionally, CED is applied continuously to each computation activity so that an error
resulting from a fault in the processing element (PE) can be detected immediately. However,
when time redundancy techniques are used for error detection, this continuous checking scheme
may greatly degrade the array performance, e.g., by a factor of two for RESO [4] or alternating
logic [5]. For some applications where high processing speed is crucial and error detection
latency is tolerable, it may be possible to maintain the desired throughput while keeping a
reasonably high error coverage by turning the CED mechanism on and off periodically. Periodic
Application of CED (PACED) offers such a trade-off in error latency and probability of error
detection versus performance degradation.
Several techniques regarding the utilization of idle processor cycles for CED have been
proposed [6-12]. For general-purpose machines with processor-level parallelism, a technique
called saturation has been introduced for utilizing the idle processors to execute replicated
versions of tasks and employing majority voting to determine the output [6]. For processors with
multiple pipelined functional units, like the Cray-1, RESO has been applied to the idle functional
units and was shown to equip the scalar unit with error checking capability at the cost of minor
performance degradation [7]. Another recently proposed technique, called Available-Resource
Control-flow monitoring (ARC) [8], is aimed at the resource parallelism of instruction-level
parallel processors such as superscalar and Very Long Instruction Word (VLIW) processors. The
idle resources in these processors were utilized to detect the control-flow errors.
In the area of systolic architectures, one approach has been developed to take advantage of
the existing bypassing links in a reconfigurable array to pass the same input data to two adjacent
PEs and then compare the outputs to perform error detection [9]. A control bit, called the test token,
was inserted periodically from outside and passed along the array to determine when a particular
PE should invoke a duplicated operation on its neighbor. Related results using error checking
codes to achieve algorithm-based fault tolerance for a systolic sorter have also been developed
[10].
The incorporation of CED capability in systolic arrays for band matrix multiplication has
been developed with design parameters such as throughput latency, per-cycle PE utilization rate
and I/O bandwidth [11]. The arrays were required to have a per-cycle PE utilization rate less
than 50% in order to leave room for the RESO-based CED technique. Flexible designs were
proposed [12] which allow the user to either employ the full throughput rate capability of the system
or trade off the throughput rate for greater reliability.
In the initial description of the general concept of periodic application of CED (PACED)
[3] by Chen et al., error pattern analysis was performed only for a specific set of PACED
parameters and the actual implementation was not discussed. The major contribution of this
paper is that we start from a general formulation by defining a set of PACED parameters which are
optimized to achieve the maximum error coverage and reduce the hardware cost.
The PACED implementation considered utilizes the following properties:
(1) Each PE is capable of performing time-redundant computation checking for itself as well as 
input code checking for the possibly erroneous output data produced and propagated by 
previous PEs.
(2) A single fault is present between the time of the initial fault occurrence and error detection.
The processor arrays considered in this study are unidirectional linear processor arrays
consisting of Q processing elements with inputs entering from the top and left [1,13,14]. A PACED
array driven by the original clock and equipped with the capability of concurrent error detection
and automatic error recovery is shown in Fig. 1. The control logic consists of circuitry to
perform buffering, diagnosis, rollback and start-up control.
The outline of the paper is as follows: Section 2 establishes the system parameters; Section
3 gives the optimization of system parameters with respect to various metrics; Section 4
proposes the required design changes for data buffering, error diagnosis and recovery; Section 5
concludes the paper.
2. SYSTEM PARAMETERS
Figure 1. Block diagram of the PACED array with buffering and recovery caches.
For two PEs in our processor array, $PE_i$ is upstream of $PE_j$ and $PE_j$ is downstream from
$PE_i$ if i < j. The PEs need not be identical; however, each has approximately the same processing time
so that, without PACED, the array forms a balanced pipeline with clock cycle time equal to $a$
time units. When CED is applied, each PE needs another $b$ time units to perform error checking.
For the purpose of preserving the synchronous nature of the original processor array, $b$ is
rounded to a multiple of $a$, $b = ka$. Therefore, for example, k = 1 corresponds to 100% time
overhead. The entire activity applied to a certain set of data at each PE is called a computation
cycle with or without checking, as opposed to the physical clock cycle, which always takes $a$ time
units. At the beginning of a clock cycle, each PE reads the checking bit from its local counter or a
global counter to determine whether it should perform the checking (1) or not (0). Checking
patterns are the plots of checking bits as a function of computation cycle number as shown in
Fig. 2.
Figure 2. Checking pattern as a function of computation cycle number.
The basic idea of PACED is to schedule the checking patterns with the same checking
frequency among PEs. Therefore, while some PEs are executing computation cycles with checking,
some are not. The Corresponding Computation Cycles (CCCs) are defined to be all those cycles
on different PEs which were originally executed at the same time in the array without CED. A
task is defined to consist of all the activities applied to each input data item by the processor array to
obtain the corresponding output. The task path consists of all those cycles on different PEs at
which a certain task is processed as it travels across the array. Each set of checking patterns is
characterized by the following four parameters, all in terms of computation cycles:
(1) M : length of one period;
(2) N : length of one checking burst;
(3) $O_M$ : offset between checking patterns for adjacent PEs;
(4) $O_I$ : initial offset (with respect to computation cycle 0) of the checking pattern for the first
PE. The numbering of the computation cycles is shown in Fig. 2.
While the checking pattern plot in terms of computation cycle as in Fig. 2 is used to
illustrate the idea of PACED, it is more convenient to use the Task/PE diagram shown in Fig. 3 for
our analysis. In such a diagram, the checking patterns are adjusted so that each column
corresponds to a single task path. The offset between adjacent patterns, J, becomes $O_M$ plus one.
The Task/PE diagram will be used to analyze problems related to computation cycles such as
Error Detection Latency (EDL) analysis and diagnosis.
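As an illustration of the Task/PE diagram, the following Python sketch prints one row per PE and one column per task, marking the computation cycles with checking. It assumes an initial offset of zero and that the pattern seen by a task advances by J = O_M + 1 positions per PE, which agrees with the expression (j + i x J) mod M used in the analysis of Section 3.2. The function name `task_pe_diagram` and the sample parameters are illustrative only and do not come from the report.

```python
def task_pe_diagram(Q, num_tasks, M, N, O_M):
    """Return one string per PE; '#' marks a computation cycle with checking.

    Assumption: the initial offset is 0 and a task that enters PE 0 at
    pattern position t is processed by PE i at position (t + i*J) mod M,
    where J = O_M + 1.
    """
    J = O_M + 1
    rows = []
    for pe in range(Q):
        row = "".join(
            "#" if (task + pe * J) % M < N else "."
            for task in range(num_tasks)
        )
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Hypothetical parameters: period M = 5, burst N = 2, offset O_M = 1.
    for pe, row in enumerate(task_pe_diagram(Q=3, num_tasks=5, M=5, N=2, O_M=1)):
        print(f"PE{pe}: {row}")
```

Each column of the printed diagram is one task path, so reading down a column shows at which PEs that task is checked.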
It can be shown that the choice of $O_I$ does not affect the analysis. Moreover, we will
consider N and $N/M$ as two of the parameters instead of M and N. Therefore, the three parameters
involved in the optimization problem will be N, $N/M$ and $O_M$.
Figure 3. Task/PE diagram of the same array as in Fig. 2 with shaded squares
representing computation cycles with checking. (Horizontal axis: task number; annotations: $J = O_M + 1$, CCCs, task path.)
3. OPTIMIZATION OF SYSTEM PARAMETERS
3.1. Performance Analysis
It is intuitive that $N/M$ is a measure of the checking frequency and, therefore, a measure of
time overhead. However, because of the imbalance introduced by PACED, the wait time
between the output of the upstream PE and the input of the downstream PE of each adjacent pair
constitutes another time overhead in addition to the checking time. The total execution time of a
certain task is then given as
Total execution time = computation time + checking time + wait time
Fig. 4(a) shows the task waiting pattern with each arrow starting from the time the
upstream PE outputs the data and pointing to the time the downstream PE reads in the data. The
wait time after computation cycle j, Wait(j), in terms of number of clock cycles has the following
dependence on $O_M$ as well as on N and M and is shown in Fig. 4(b).
$$
\mathrm{Wait}(j) =
\begin{cases}
O_M \times k, & 0 \le j \le N - O_M - 1 \\
(N - j - 1) \times k, & N - O_M \le j \le N - 1 \\
0, & N \le j \le M - O_M - 1 \\
(O_M - M + j + 1) \times k, & M - O_M \le j \le M - 1
\end{cases}
$$
Figure 4. (a) Task waiting pattern between adjacent PEs; (b) wait time as a function of
computation cycle number.
Therefore, fixing $N/M$ does not necessarily fix the time overhead. However, the most important
performance measure for a processor array is the throughput instead of the total execution
time of each single task. As shown in Fig. 4(a), although the data has to wait for the downstream
PE to become free, the PEs are always kept busy. Therefore, the processor array will produce M
outputs for every $(M + N \times k)$ clock cycles and the throughput can be calculated as
$$
\text{Throughput} = \frac{M}{(M + N \times k)\,a} = \frac{1}{\left(1 + \frac{N}{M} k\right) a}
$$
which is independent of $O_M$.
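The wait-time pattern and the throughput figure can be cross-checked with a small simulation of one adjacent PE pair. The sketch below is a simplified model, not the authors' design: it assumes unlimited buffering between the two PEs, an upstream PE that is never starved, an initial offset of zero, and parameters satisfying $0 \le O_M \le N - 1$ and $N + O_M \le M$. All function names are hypothetical.

```python
def checking(position, M, N):
    """Checking bit of the pattern at a given position (1 = with checking)."""
    return 1 if position % M < N else 0


def wait_closed_form(j, M, N, O_M, k):
    """Piecewise Wait(j) in clock cycles, as stated in Section 3.1."""
    if j <= N - O_M - 1:
        return O_M * k
    if j <= N - 1:
        return (N - j - 1) * k
    if j <= M - O_M - 1:
        return 0
    return (O_M - M + j + 1) * k


def simulate_adjacent_pair(M, N, O_M, k, periods=3):
    """Simulate an upstream/downstream PE pair and return
    (waits over the last period, clock cycles per M outputs)."""
    J = O_M + 1
    up_done = 0      # clock at which the upstream PE outputs the current task
    down_free = 0    # clock at which the downstream PE becomes free
    waits, finish_times = [], []
    for t in range(periods * M):
        # Upstream PE is always busy: 1 clock to compute, k more if checking.
        up_done += 1 + k * checking(t, M, N)
        start = max(up_done, down_free)
        waits.append(start - up_done)
        # Downstream PE sees the same task J pattern positions later.
        down_free = start + 1 + k * checking(t + J, M, N)
        finish_times.append(down_free)
    clocks_per_period = finish_times[-1] - finish_times[-1 - M]
    return waits[-M:], clocks_per_period


if __name__ == "__main__":
    M, N, O_M, k = 7, 3, 2, 1   # illustrative values only
    waits, period = simulate_adjacent_pair(M, N, O_M, k)
    print("simulated  :", waits)
    print("closed form:", [wait_closed_form(j, M, N, O_M, k) for j in range(M)])
    print("clock cycles per M outputs:", period, "expected:", M + N * k)
```

The first period is discarded as start-up transient; under the stated assumptions the later periods repeat exactly, and M outputs leave the pair every M + N x k clock cycles regardless of $O_M$.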
For N = 0, the PACED array reduces to the original array without CED which has the
highest throughput $1/a$ but no error detection capability. For N = M, PACED reduces to continuous
checking which has the lowest throughput $\frac{1}{(1 + k)a}$
but can detect errors without latency.
The problem of optimal scheduling for PACED is then formulated as: given a throughput
requirement $\frac{1}{\left(1 + \frac{N}{M} k\right) a}$, how to choose $O_M$ to minimize the error detection latency and
maximize the error coverage.
3.2. Potentially Infinite Error Detection Latency
There are two cases in which transient faults occurring at certain cycles will have infinite
error detection latency (EDL), which means that the errors will escape with 100% probability.
Case 1 : Improper choice of the values for M, N and $O_M$.
According to the task/PE diagram in Fig. 3, as a task travels through the array, it can be
viewed as advancing in the checking pattern with a speed of J computation cycles per PE where
$J = O_M + 1$. If we can make sure that at least one cycle with checking, i.e., $0 \le j \le N-1$, is on the
path of each task, we can prevent the infinite EDL from occurring under the assumption that
errors can not be masked during the propagation. (Error masking is considered in Section 3.5.) The
following Lemma 1, used for proving Fermat's Little Theorem [15], is used to prove Theorem 1.
LEMMA 1. The M numbers 0 mod M, J mod M, 2J mod M, ..., (M-1)J mod M consist
of precisely d copies of the M/d numbers 0, d, 2d, ..., (M/d - 1)d where d = gcd(M, J)
(gcd stands for greatest common divisor).
LEMMA 2. With the same notation as in Lemma 1, define the remainder set
$R_V = \{V + nd \mid 0 \le n \le M/d - 1\}$ for $0 \le V \le d-1$, then
(1) $(j + i \times J) \bmod M \in R_{j \bmod d}$ for $0 \le j \le M-1$ and all non-negative integers i.
(2) $\{(j + i \times J) \bmod M \mid 0 \le i \le M-1\} = R_{j \bmod d}$, where d = gcd(M, J). More precisely, $(j + i \times J) \bmod M$,
$0 \le i \le M-1$, contain d copies of each of the elements in $R_{j \bmod d}$.
Proof. See Appendix.
THEOREM 1. Except for some faults occurring in the last (M-1) PEs of the array, all
transient faults have finite EDL (less than M) if and only if $\gcd(M, J) = d \le N$.
Proof. By the task/PE diagram in Fig. 3, a task entering $PE_0$ at cycle j will be processed by
$PE_i$ at cycle $(j + i \times J) \bmod M$. By Lemma 2(1), $(j + i \times J) \bmod M \in R_{j \bmod d}$. If N < d, for any
j such that $N \le j < d$, every integer in $R_{j \bmod d}$ is greater than N - 1; hence for all non-negative
integers i, $(j + i \times J) \bmod M > N - 1$, which means that a task entering $PE_0$ at such a cycle j will never be
checked, resulting in infinite EDL if it is affected by some transient fault.
Conversely, if $d \le N$, then $(j \bmod d) \le N - 1$. By Lemma 2(2), we have
$\{(j + i \times J) \bmod M \mid 0 \le i \le M-1\} = R_{j \bmod d}$, which means that an erroneous task produced at any cycle j will be checked
at least once at cycle (j mod d) by the faulty PE itself or one of its (M-1) immediate downstream
PEs. Therefore, as long as the faulty PE is not one of the last (M-1) PEs of the array, the error
will be detected. □
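The condition of Theorem 1 can be checked by exhaustive search for small parameters. The following Python sketch ignores the end-of-array effect (it assumes at least M PEs downstream of the entry point) and compares the brute-force result with the gcd condition; the function names are hypothetical.

```python
from math import gcd


def every_task_is_checked(M, N, J):
    """True if, for every entry cycle j, at least one of the M positions
    (j + i*J) mod M, 0 <= i <= M-1, is a cycle with checking (position < N)."""
    return all(
        any((j + i * J) % M < N for i in range(M))
        for j in range(M)
    )


def theorem1_holds(M, N, J):
    """Compare the brute-force result with the condition gcd(M, J) <= N."""
    return every_task_is_checked(M, N, J) == (gcd(M, J) <= N)


if __name__ == "__main__":
    ok = all(
        theorem1_holds(M, N, J)
        for M in range(2, 13)
        for N in range(1, M)
        for J in range(1, M + 1)
    )
    print("Theorem 1 condition matches brute force:", ok)
    # Example of a bad choice: M = 6, N = 2, J = 3 gives gcd = 3 > N.
    print(every_task_is_checked(6, 2, 3))
```

For instance, with M = 6, N = 2 and J = 3, a task entering at cycle 2 only ever visits positions 2 and 5, so it is never checked.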
Case 2 : Faults occurring near the end of the array.
As long as the CED is not applied continuously, some EDL is unavoidable. When a transient fault
occurs at some PE in a computation cycle with EDL larger than the number of PEs downstream
from it, the error will escape. This phenomenon exists no matter how we schedule the checking
patterns (only the severity varies). One possible solution, if desired, is to add a code checker with lower
complexity and higher reliability at the end of the array to perform continuous code checking in
order to intercept the escaping errors.
3.3. Uniform Distribution of Checking Capability
When the checking frequency is set to $N/M$, it is just an average over all tasks and does not
necessarily guarantee that each task will be processed with checking $N/M$ of the time along its
path. Since all tasks have equal significance, it is desirable to schedule the checking patterns
such that each task is treated as uniformly as possible. Theorem 2 gives the condition for this
purpose.
THEOREM 2. Except for the difference due to the fact that array length Q might not
be a multiple of M, the checking capability is uniformly distributed among all tasks if
gcd(M, J) = d divides N.
Proof. If N = md, where m is a positive integer, for every $R_V$, $0 \le V \le d-1$, the first m
elements $V + nd$, $0 \le n \le m-1$, are less than N and others are not. By Lemma 2, $(j + i \times J) \bmod M$,
$0 \le i \le M-1$, contain d copies of each of the elements in $R_{j \bmod d}$, which implies that a task
entering $PE_0$ at any cycle j will be checked $m \times d$ times among M computation cycles. Therefore, all
tasks are fairly treated and checked with the same frequency $\frac{m \times d}{M} = \frac{N}{M}$. □
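To make the uniformity condition concrete, the sketch below counts how many times each task is checked while passing through M consecutive PEs. It is an illustration under the same model as the proof (a task entering at cycle j visits positions (j + i x J) mod M); the function name and the example parameters are assumptions.

```python
def checks_per_task(M, N, J):
    """For each entry cycle j, count the checking cycles met over M consecutive PEs."""
    return [
        sum(1 for i in range(M) if (j + i * J) % M < N)
        for j in range(M)
    ]


if __name__ == "__main__":
    # d = gcd(6, 2) = 2 divides N = 2: every task is checked N times per M PEs.
    print(checks_per_task(6, 2, 2))
    # d = 2 does not divide N = 3: the checking is unevenly distributed,
    # although the average over all tasks is still N.
    print(checks_per_task(6, 3, 2))
```

In the second case some tasks receive twice as many checks as others, which is the imbalance the theorem's condition rules out.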
3.4. Minimization of Maximum Error Detection Latency
The EDL(j) of both permanent and transient faults occurring at computation cycle j with
checking ($0 \le j \le N-1$) are defined to be zero in terms of number of computation cycles because
they are detected immediately. The EDL(j) of other non-checking cycles under the constraint
that $1 \le J \le N$ are given by Lemma 3, with superscript "C" indicating that the error is detected as a
computation error and "I" that it is detected as an input error. We can prove the optimal solution under this
constraint actually achieves the optimality for general values of J. For the rest of this paper, we
will assume the duration of a transient fault is much less than the clock cycle time so that it can
only affect the result of one computation.¹
LEMMA 3. For $1 \le J \le N$ and $N \le j \le M-1$,
(1) under transient faults:
$$\mathrm{EDL}^I(j) = \left\lceil \frac{M-j}{J} \right\rceil ; \quad \mathrm{EDL}^C(j) = \infty ,$$
(2) under permanent faults:
$$\mathrm{EDL}^I(j) = \left\lceil \frac{M-j}{J} \right\rceil ; \quad \mathrm{EDL}^C(j) = M - j .$$
Proof. (1) Because we can view a task traveling across the processor array as advancing on
the checking pattern with J cycles per step, we have the following formula for the EDL of a
transient fault for general J:
$$\mathrm{EDL}^I(j) = \left\lceil \frac{rM - j}{J} \right\rceil \quad \text{where } r = \min \left\{ \left\lfloor \frac{j + i \times J}{M} \right\rfloor \;\middle|\; 0 \le (j + i \times J) \bmod M \le N-1,\ i > 0 \right\}.$$
To our knowledge, there is no closed-form solution for this general problem. The constraint
$1 \le J \le N$ will make sure that an error caused by a fault occurring at a non-checking cycle will be
captured by the next checking burst (r = 1) of some downstream PE. Therefore,
$$\mathrm{EDL}^I(j) = \left\lceil \frac{M-j}{J} \right\rceil .$$
Under the assumption about the length of the transient faults, a PE can not detect such faults
occurring at non-checking cycles, so the $\mathrm{EDL}^C(j)$ is infinity.
(2) A permanent fault can be considered as consisting of a large number of transient faults
occurring at consecutive computation cycles. Using the above formula, we have, for a permanent
fault,
$$\mathrm{EDL}^I(j) = \min_{0 \le q < M-j} \left\{ \left\lceil \frac{M-(j+q)}{J} \right\rceil + q \right\} .$$
By manipulating the expression inside the bracket,
$$\left\lceil \frac{M-(j+q)}{J} \right\rceil + q = \left\lceil \frac{M-j+(J-1)q}{J} \right\rceil \ge \left\lceil \frac{M-j}{J} \right\rceil \quad \text{for } J \ge 1;$$
hence
$$\mathrm{EDL}^I(j) = \left\lceil \frac{M-j}{J} \right\rceil .$$
For a permanent fault starting at a non-checking cycle, the faulty PE will detect it as a
computation error as soon as it enters the next checking burst. Therefore, $\mathrm{EDL}^C(j) = M - j$. □
¹ Most of the results will still be valid without this assumption except for more complicated error pattern analysis described
later.
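LEMMA 4. For $1 \le J \le N$ and $N \le j \le M-1$, if we define EDL(j) to be the latency until
the first error indication (computation or input error), then for both permanent and
transient faults: $\mathrm{EDL}(j) = \left\lceil \frac{M-j}{J} \right\rceil$.
Proof. This follows immediately from Lemma 3, because $\left\lceil \frac{M-j}{J} \right\rceil < \infty$ and $\left\lceil \frac{M-j}{J} \right\rceil \le M - j$,
where equality holds for J = 1 or M - j = 1. Hence,
$$\mathrm{EDL}(j) = \min\{\mathrm{EDL}^I(j), \mathrm{EDL}^C(j)\} = \mathrm{EDL}^I(j) = \left\lceil \frac{M-j}{J} \right\rceil . \; □$$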
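The closed form of Lemmas 3 and 4 can be compared against a direct step-by-step count. The sketch below is a brute-force check for small parameters under the constraint $1 \le J \le N$, ignoring the end-of-array effect; the helper names are hypothetical.

```python
from math import inf


def ceil_div(a, b):
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def edl_transient_input(j, M, N, J):
    """Number of PE-to-PE steps until a task produced at non-checking cycle j
    first lands on a cycle with checking; inf if it never does."""
    for i in range(1, M):
        if (j + i * J) % M < N:
            return i
    return inf


def edl_permanent_input(j, M, N, J):
    """Earliest input-error indication over the erroneous tasks produced
    at cycles j, j+1, ..., M-1 by a permanent fault starting at cycle j."""
    return min(edl_transient_input(j + q, M, N, J) + q for q in range(M - j))


def lemma3_holds(M, N, J):
    """Check both input-error latencies against ceil((M - j) / J)."""
    return all(
        edl_transient_input(j, M, N, J) == ceil_div(M - j, J)
        and edl_permanent_input(j, M, N, J) == ceil_div(M - j, J)
        for j in range(N, M)
    )


if __name__ == "__main__":
    print(all(
        lemma3_holds(M, N, J)
        for M in range(2, 11)
        for N in range(1, M)
        for J in range(1, N + 1)
    ))
```

The check relies on the same observation as the proof: with $J \le N$ a task cannot step over the next checking burst.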
Next we give some definitions for proving the optimal scheduling.
DEFINITION 1. Given a sequence of n numbers $S = (s_1, s_2, \ldots, s_n)$, $\Pi(S)$ is defined to
be a nondecreasing permutation of S, i.e., $\Pi(S) = (\pi_1(S), \pi_2(S), \ldots, \pi_n(S))$ is a permutation
of S where $\pi_1(S) \le \pi_2(S) \le \cdots \le \pi_n(S)$.
DEFINITION 2. Given two sequences of n numbers S and T, we define $S \le T$ if $s_i \le t_i$ for
$1 \le i \le n$.
LEMMA 5. Define $E_J = (e_{1J}, e_{2J}, \ldots, e_{(M-N)J}) = (\mathrm{EDL}_J(M-1), \mathrm{EDL}_J(M-2), \ldots, \mathrm{EDL}_J(N))$
for $0 \le J \le M-1$, then
(1) $\Pi(E_J) = E_J$ for $1 \le J \le N$
(2) $E_J \ge E_{J+1}$ for $1 \le J \le N-1$
(3) $\Pi(E_N) \le \Pi(E_J)$ for all integer J.
Proof. (1) By Lemma 4,
$$e_{iJ} = \mathrm{EDL}_J(M-i) = \left\lceil \frac{M-(M-i)}{J} \right\rceil = \left\lceil \frac{i}{J} \right\rceil \quad \text{for } 1 \le J \le N \text{ and } 1 \le i \le M-N; \qquad (1)$$
hence $e_{iJ} \le e_{jJ}$ for $1 \le i \le j \le M-N$, and $\Pi(E_J) = E_J$.
(2) By Eq. (1) we also have $e_{iJ} \ge e_{i(J+1)}$ for $1 \le i \le M-N$; hence
$E_J \ge E_{J+1}$ for $1 \le J \le N-1$.
(3) By (1) and (2), we have proved $\Pi(E_N) \le \Pi(E_J)$ for $1 \le J \le N$. For general values of J, the
proof is by contradiction. Suppose there exists J such that $\Pi(E_N) \le \Pi(E_J)$ is not true. This
means there exists i, $1 \le i \le M-N$, such that $\pi_i(E_N) > \pi_i(E_J)$ and
$$\pi_i(E_N) - 1 \ge \pi_i(E_J) \ge \cdots \ge \pi_1(E_J) \ge 1 . \qquad (2)$$
Because $\pi_i(E_N) = e_{iN} = \left\lceil \frac{i}{N} \right\rceil$ and $\frac{i}{N} + 1 > \left\lceil \frac{i}{N} \right\rceil$
by definition, we have $\frac{i}{N} + 1 > \pi_i(E_N)$ and
$$\frac{i}{\pi_i(E_N) - 1} > N . \qquad (3)$$
Hence, by Eqs. (2) and (3), there must exist m, $1 \le m \le \pi_i(E_N) - 1$, such that more than N
elements of $\{\pi_1(E_J), \ldots, \pi_i(E_J)\}$ are equal to m. However, for a checking burst of length N, the
maximum number of non-checking cycles with the same EDL is N for any of the EDL values.
Therefore, we have reached a contradiction and $\Pi(E_N) \le \Pi(E_J)$ for all J. □
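THEOREM 3. The maximum EDL, $\mathrm{EDL}_J^{max}$, is minimized by setting J = N and the
minimum value is $\left\lceil \frac{M-N}{N} \right\rceil$.
Proof. By (3) of Lemma 5, $\mathrm{EDL}_J^{max} = \pi_{M-N}(E_J) \ge \pi_{M-N}(E_N) = \mathrm{EDL}_N^{max}$ for all integer J.
Hence J = N minimizes the maximum EDL and
$$\mathrm{EDL}_N^{max} = \max_{N \le j < M} \left\lceil \frac{M-j}{N} \right\rceil = \left\lceil \frac{M-N}{N} \right\rceil . \; □$$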
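A small numerical illustration of Lemma 5(3) and Theorem 3 is given below. The sketch builds the sorted EDL sequence for every J by brute force (treating a task that is never checked as having infinite EDL) and compares it elementwise with the sequence for J = N. The function names and the values M = 7, N = 3 are illustrative assumptions.

```python
from math import inf


def edl(j, M, N, J):
    """EDL of a transient fault at cycle j: 0 in a checking cycle, otherwise the
    number of steps of size J until a checking cycle is reached (inf if never)."""
    if j % M < N:
        return 0
    for i in range(1, M):
        if (j + i * J) % M < N:
            return i
    return inf


def sorted_edl_vector(M, N, J):
    """Nondecreasing permutation of the EDLs of the non-checking cycles N..M-1."""
    return sorted(edl(j, M, N, J) for j in range(N, M))


def j_equals_n_is_optimal(M, N):
    """True if the sorted EDL vector for J = N is elementwise <= that of every J."""
    best = sorted_edl_vector(M, N, N)
    return all(
        all(a <= b for a, b in zip(best, sorted_edl_vector(M, N, J)))
        for J in range(1, M + 1)
    )


if __name__ == "__main__":
    M, N = 7, 3
    for J in range(1, M + 1):
        print(J, sorted_edl_vector(M, N, J))
    print("max EDL at J = N:", max(sorted_edl_vector(M, N, N)))
```

For these values the maximum EDL at J = N is 2, which equals the ceiling of (M - N)/N.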
3.5. Minimization of Error Escape Probability
The price paid by PACED to maintain the desired throughput is lower error coverage. The
longer the error detection latency, the larger the possibility that an error will be masked during
the propagation and escape. Assume the probability that an error will be masked at any
computation cycle is $p_m$. The error escape probability, $P_{esc}(j)$, for a transient fault occurring at
computation cycle j is then given by $P_{esc}(j) = 1 - (1 - p_m)^{\mathrm{EDL}_J(j)}$, $0 \le j \le M-1$.
THEOREM 4. The average error escape probability is minimized by setting J = N.
Proof. Assume a transient fault occurs at each cycle j, $0 \le j \le M-1$, with equal probability.
The average error escape probability is
$$\frac{1}{M} \sum_{j=0}^{M-1} P_{esc}(j) = \frac{1}{M} \sum_{j=N}^{M-1} \left[ 1 - (1 - p_m)^{\mathrm{EDL}_J(j)} \right]$$
because EDL(j) = 0 for $0 \le j \le N-1$. By (3) of Lemma 5, $\pi_i(E_N) \le \pi_i(E_J)$ for $1 \le i \le M-N$ and all J,
and $1 - (1 - p_m)^x$ is nondecreasing in x.
Hence, setting J = N minimizes the average error escape probability. □
By Theorems 3 and 4, we conclude that for a given $N/M$, J should be set equal to the length
of the checking burst N, i.e., the pattern offset between adjacent PEs should be $O_M = N - 1$, in
order to minimize the maximum EDL and maximize the error coverage. This optimality is
independent of the choice of N.
3.6. Summary of Optimization Results
Given a fixed checking frequency $N/M$, we choose M and N to be relatively prime in order
to minimize both M and N. Minimizing M allows Theorem 2 to more accurately state the
condition for uniform distribution of checking capability. We will show in the next section that
minimizing N can minimize the hardware overhead for data buffering. Since J should be equal
to N for optimal error detection, we have gcd(M, J) = gcd(M, N) = 1 which satisfies both
conditions in Theorems 1 and 2. Therefore, the possibility of infinite error detection latency is
eliminated and the checking capability is uniformly distributed among all tasks.
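The effect of the offset on error coverage (Theorem 4) can be illustrated numerically. The sketch below averages the escape probability $1 - (1 - p_m)^{\mathrm{EDL}}$ over all M cycles for each J, computing the EDL by brute force and treating a never-checked task as escaping with probability 1. The masking probability 0.1 and the values M = 7, N = 3 are arbitrary illustrative choices, and the function names are hypothetical.

```python
from math import inf


def edl(j, M, N, J):
    """EDL of a transient fault at cycle j (0 if checked immediately, inf if never)."""
    if j % M < N:
        return 0
    for i in range(1, M):
        if (j + i * J) % M < N:
            return i
    return inf


def average_escape_probability(M, N, J, p_m):
    """Average of 1 - (1 - p_m)**EDL(j) over the M cycles of one period."""
    total = 0.0
    for j in range(M):
        latency = edl(j, M, N, J)
        total += 1.0 if latency == inf else 1.0 - (1.0 - p_m) ** latency
    return total / M


if __name__ == "__main__":
    M, N, p_m = 7, 3, 0.1
    for J in range(1, M + 1):
        print(J, round(average_escape_probability(M, N, J, p_m), 4))
```

With M and N relatively prime and J = N, every task is eventually checked and the average escape probability is the smallest among all offsets, although other values of J may tie with it.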
4. DESIGN CHANGES
4.1. Synchronous Buffering Design
By scheduling checking patterns among PEs, resource (PE) conflict may occur when a PE
is still checking old data but new data has been produced by its upstream PEs. It was shown in
Fig. 4(b) that the maximum wait time is equal to $O_M \times k$ clock cycles. Hence, it is adequate to
insert $O_M \times k$ buffers between each adjacent PE pair, driven by the original clock.² Since $O_M$
should be equal to N-1 for optimal scheduling, the number of buffers will decrease as N
decreases. Therefore, choosing M and N to be relatively prime also minimizes the hardware
overhead for data buffering.
However, as also shown in Section 3.1, the wait time, which determines the number of
needed buffers, is not a constant but a function of computation cycle number. It becomes clear at
this point that some kind of dynamic buffering technique has to be used to make the pipeline
flow smoothly and correctly. We propose two such approaches, namely, data forwarding and
dynamic reconfiguration.
The data forwarding approach to buffering is described as follows. When a PE is ready to
output the processed data, the wait time logic shown in Fig. 5, which monitors the checking bit
sequence, has determined the wait time, Wait(j), of the current cycle and connected the PE output to
the buffer which is Wait(j) stages away from the downstream PE. Once the data is placed into
the appropriate buffer, the synchronous buffering design will ensure the data arrives at the
downstream PE at the correct clock cycle.
Figure 5. Buffering by data forwarding
² It can be shown that the minimum number of required buffers is equal to $\left\lceil \frac{O_M \times k}{k+1} \right\rceil$. However, more complicated
control circuits are needed to reuse the buffers.
As an alternative, since the number of buffers needed varies with time, we can treat the
extra buffers at each clock cycle as being "faulty" and use the Diogenes approach [16] to
dynamically reconfigure the "buffer arrays" by bypassing the "faulty" ones. A shifter clocked by the
falling edge of the clock (assume the PEs are clocked by the rising edge) is used to set up the
proper configuration of the buffers for the next data movement. The basic rule is:
(1) Include one more buffer if the downstream PE is in a computation cycle with checking
while the upstream one is not;
(2) Bypass one more buffer if the upstream PE is in a computation cycle with checking while
the downstream one is not;
(3) Maintain the current configuration (by disabling the clock input of the shifter) if the two
PEs are both checking or non-checking.
Figure 6. Dynamic reconfiguration circuit for buffering using Diogenes approach
Because of the regular pattern of wait time variation (Fig. 4(b)), the reconfiguration circuit 
is very simple, as shown in Fig. 6. An example showing the correct buffering at each clock cycle 
by using the reconfiguration approach is given in Fig. 7.
4.2. Diagnosis
Because permanent and transient faults occurring at different computation cycles will result
in different combinations of computation and input error indications after various lengths of
latency, it is important for diagnosis to analyze all possible error patterns and classify the faults
into several categories according to their resultant error patterns.
Since the error escape probability is related to EDL in terms of computation cycles, and in
order to make the diagnosis procedure independent of the checking overhead k, we will use the
task/PE diagram (Fig. 3) and the error indications from CCCs for error pattern analysis.
However, for a PACED array, the CCCs do not happen at the same clock cycle. It would be
unacceptable if we had to wait for all the CCCs to finish before the analysis because that would delay
the diagnosis and rollback by a considerable number of clock cycles. The following theorem gives
the upper bound for the number of error indications by any set of CCCs for $2 \le J \le N$. (The case
when J = 1 will result in peculiar error patterns which can not share the same diagnosis and
rollback procedures with other choices of J. Since J = 1 also results in large EDL, it will be excluded
from future discussion.)
Figure 7. Example of dynamic reconfiguration for buffering. The parameters are N = 4,
$O_M$ = 3 and k = 1. When Q is 1, the corresponding "faulty" buffer is bypassed.
THEOREM 5. For $2 \le J \le N$, at most two PEs will have the earliest error indications
among all the CCCs for a single fault, and they must be adjacent.
Proof. Since a transient fault can only create one erroneous task, only one PE will detect it.
For a permanent fault occurring at cycle j, $N \le j < M$, the faulty PE itself will detect a
computation error after M-j cycles and the erroneous task produced at cycle j + q, $0 \le q < M-j$, will be
detected $\left\lceil \frac{M-(j+q)}{J} \right\rceil + q$ cycles after the fault occurrence as an input error by the
$\left\lceil \frac{M-(j+q)}{J} \right\rceil$-th downstream PE from the faulty PE. For $J \ge 2$ and $q \ge 2$, we have
$$\left\lceil \frac{M-(j+q)}{J} \right\rceil + q = \left\lceil \frac{M-j+(J-1)q}{J} \right\rceil \ge \left\lceil \frac{M-j}{J} + \frac{1}{2} \times 2 \right\rceil = \left\lceil \frac{M-j}{J} \right\rceil + 1$$
which is larger than $\mathrm{EDL}(j) = \left\lceil \frac{M-j}{J} \right\rceil$. Hence, the only erroneous tasks which will possibly
be detected as the earliest input error indications are the ones produced at cycle j and j+1.
If M-j = 1, the two earliest error indications with EDL(M-1) = 1 are the computation error
detected by the faulty PE and the input error detected by the immediate downstream PE. If
$M-j \ge 2$, then $(M-j) > \left\lceil \frac{M-j}{J} \right\rceil$ for $J \ge 2$, so the two possible earliest error indications are both input
errors detected by two adjacent PEs executing at CCCs
($\left\lceil \frac{M-j}{J} \right\rceil - \left\lceil \frac{M-(j+1)}{J} \right\rceil = 0$ or 1 for $J \ge 2$). □
Therefore, in order to design a diagnosis procedure, we will always keep the pipeline
flowing until the immediate downstream PE finishes the CCC once the first error indication is raised
by some PE (called the detective PE). The number of clock cycles that the detective PE has to
wait is equal to the wait time of the current cycle because when the downstream PE is ready to
process the data corresponding to the task resulting in the first error indication, it must have
finished processing the previous task at the CCC and set up the error flags.
The next step is to classify all the faults according to their resultant error patterns. The
notation is defined as: Class a.b where a = 1 means transient, a = 2 means permanent and b is the
further classification within each category. The corresponding error patterns are shown in Fig. 8.
"P" stands for permanent fault, "T" for transient fault and "F" can be either "P" or "T". "I"
indicates an input error and "C" represents a computation error.
Figure 8. Error pattern analysis. (The thick line passes through all CCCs which are
related to the detection of the present fault.)
(1) Class 1.1 and 2.1: For both permanent and transient faults, Fig. 8(a) represents the case
where a computation error indication occurs at computation cycle j, $1 \le j \le N-1$. The fault
must have just occurred at the detective PE; otherwise, it should have been detected at an
earlier cycle with checking. In order to distinguish between permanent and transient faults, the
faulty PE is given a second chance to do the recomputation. If the recomputation still sets
an error flag, the fault is permanent under the assumption that a transient fault never affects
more than one computation; otherwise, it is transient.
A more complicated situation occurs when the computation error flag is raised at cycle 0
because it is possible that the fault is a permanent one which occurred during previous
non-checking cycles. However, the proof of Lemma 4 shows that the input error will be
detected no later than the computation error for such faults and the two kinds of error will
be in the CCCs if and only if such faults occurred at cycle M-1. Therefore, if there is no
input error indication in the immediate downstream PE, as in Fig. 8(b), the fault must have
just occurred and the faulty PE is given a second chance.
(2) Class 1.2: A transient fault occurring at cycle j with $N \le j \le M-1$ will be detected as an
input error by the downstream PE which is EDL(j) stages away from the error source PE (Fig.
8(c)).
(3) Class 2.2: A computation error detected at cycle 0 and an input error detected by the
immediate downstream PE at the CCC indicate that the fault is permanent and occurred at cycle
M-1 in the upstream PE (Fig. 8(d)).
(4) Class 2.3: A permanent fault at cycle j with EDL(j) = EDL(j+1), $N \le j \le M-2$, will be
detected by a single downstream PE as an input error (Fig. 8(e)).
(5) Class 2.4: A permanent fault at cycle j with EDL(j) = EDL(j+1) + 1, $N \le j \le M-2$, will be
detected by two downstream PEs as input errors (Fig. 8(f)).
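The classification above can be read as a small decision procedure. The following is a minimal sketch of that reading, not the control logic of the array: the function name `classify`, its parameters, and the returned labels are hypothetical, and the check for Class 2.2 is simplified to "at least one input error" rather than testing that the reporting PE is the immediate downstream one.

```python
from typing import Optional


def classify(comp_cycle: Optional[int], input_errors: int,
             retry_fails: Optional[bool] = None) -> str:
    """Map an observed error pattern to a fault class label.

    comp_cycle   -- checking cycle at which a computation error flag was raised,
                    or None if no computation error was seen
    input_errors -- number of downstream PEs reporting an input error (0, 1 or 2)
    retry_fails  -- result of the second-chance recomputation, if it was run
    """
    if comp_cycle is not None:
        if comp_cycle == 0 and input_errors >= 1:
            return "2.2"                      # permanent, occurred at cycle M-1
        if retry_fails is None:
            return "1.1 or 2.1"               # second chance still pending
        return "2.1" if retry_fails else "1.1"
    if input_errors == 1:
        return "1.2 or 2.3"                   # needs recovery-cache diagnosis
    if input_errors == 2:
        return "2.4"                          # needs recovery-cache diagnosis
    return "none"
```

Patterns that return a combined label are the ones the error pattern alone cannot separate, either because the second-chance recomputation has not run yet or because further diagnosis is required.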
Among the classes of faults defined above, Classes 1.1, 2.1 and 2.2 are
successfully identified by the error pattern analysis, while Classes 1.2, 2.3 and 2.4 need further
diagnosis. The basic idea is to provide each PE with a recovery cache for storing recent input data.
When an input error is detected by some PE, each upstream PE suspected of producing the error
reads from its recovery cache the input corresponding to the erroneous task and uses it as test
input to perform recomputation with checking. The computation and input error flags resulting
from these recomputations are used as syndromes and will uniquely identify the faulty PE and
cycle under the assumption of a single fault. Therefore, the diagnosis takes only one computation
cycle with checking. Again, for the regularity and simplicity of the diagnosis, the following rules
are adopted:
(1) Although it is possible to calculate the exact number of suspects, which is less than or equal
to the maximum EDL, for each input error detected we will always use the maximum EDL as
the number of suspects. Because the diagnosis procedure for each PE is done in parallel,
this does not increase the time overhead and allows regular hardware connections.
(2) For the faults in Class 2.4, we will ignore the first input error and only use the second error 
indication for further diagnosis because the second one corresponds to the erroneous task 
produced earlier.
The success of the above simple diagnosis procedure depends on the capability of each PE
to retrieve the correct data from the recovery cache. Because the CCCs are skewed in a PACED
array, it is very difficult to determine in which location of the cache the required data resides.
The design of the direct-mapped recovery cache is aimed at simplifying the searching procedure.
Similar to the direct-mapped cache design in the memory hierarchy of general-purpose
computers, where each position of the cache can only hold data from addresses with identical
least significant bits, each position i of our direct-mapped recovery cache can only hold data for
those tasks with id number n such that n mod (cache size) = i. Consequently, as long as the
recovery cache is of sufficient size, i.e. larger than the maximum EDL, so that no entry is
overwritten by the data from later tasks before it is needed for diagnosis, every
suspected PE only has to read the test input from the same position as that in the detective PE,
and the hardware connection is simplified.
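The mapping rule n mod (cache size) can be illustrated in a few lines. This is a minimal software sketch under stated assumptions: the class name `RecoveryCache` and its methods are hypothetical, and storing the task id next to the data is an addition made here so that a stale read is caught, not something described in the text.

```python
class RecoveryCache:
    """Direct-mapped store of recent inputs: task n lives in slot n mod size."""

    def __init__(self, size: int):
        self.size = size
        self.slots = [None] * size  # each slot holds (task_id, data) or None

    def store(self, task_id: int, data) -> None:
        """Place the input for task_id in its fixed slot, overwriting older data."""
        self.slots[task_id % self.size] = (task_id, data)

    def load(self, task_id: int):
        """Return the input for task_id, or fail if it was overwritten (tag check is an assumption)."""
        entry = self.slots[task_id % self.size]
        if entry is None or entry[0] != task_id:
            raise KeyError(f"task {task_id} is not in the cache")
        return entry[1]


if __name__ == "__main__":
    cache = RecoveryCache(size=4)      # illustrative: maximum EDL of 3, plus one
    for n in range(10):
        cache.store(n, f"input-{n}")
    print(cache.load(7))               # still present: tasks 6..9 occupy the 4 slots
```

Because the slot depends only on the task id, every suspected PE can use the same slot index as the detective PE, which is the property the paragraph above relies on.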
Start-up control is a mechanism to set up the cache correctly once and for all when the
pipeline starts flowing, so that whenever new data has to be placed into the cache, it is put into the
next position or, when the end is reached, the first position. Fig. 9 shows how the start-up control
works. The start-up delay for each PE is computed by accumulating the wait times. Each PE can
only start reading in the data after the start-up delay. Therefore, the first data each PE places in
the recovery cache will be for task 0, and later data can simply follow in order. The start-up control
is also utilized for rollback, which is discussed next.
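The two ingredients of start-up control, accumulated delays and a wrap-around write position, can be sketched as follows. The text does not spell out which wait times enter the sum for a given PE, so the input list `wait_times` is a placeholder and treating the delay as a plain running sum is an assumption; both function names are hypothetical.

```python
from itertools import accumulate


def startup_delays(wait_times):
    """Running sums of the given wait times (in clock cycles), one delay per PE."""
    return list(accumulate(wait_times))


def next_position(position: int, cache_size: int) -> int:
    """Advance the cache write position, wrapping to the first slot at the end."""
    return (position + 1) % cache_size


if __name__ == "__main__":
    print(startup_delays([1, 1, 2, 1]))   # illustrative wait times only
    print(next_position(3, 4))            # last slot wraps back to slot 0
```

With every PE starting at task 0 in slot 0, the wrap-around rule reproduces the direct-mapped placement without any address computation.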
Figure 9. Start-up control (SD: start-up delay). The diagram plots the task numbers processed
by each PE against clock cycles, with the start-up delays SD1 through SD4 marked.
4.3. Rollback
Once the faulty PE and faulty cycle are identified, if the fault is permanent, a spare PE is 
brought in to replace the faulty one, and then rollback recovery starts after the reconfiguration; if 
the fault is transient, rollback directly follows diagnosis, or actually, overlaps with it because the 
recomputation in diagnosis can be used as the first step in rollback.
The rollback procedure can be divided into two steps: flushing and local recovery. Similar
to the simplification in diagnosis, since the rollback is done in parallel for each PE, we will use
the same procedure for the recovery of both permanent and transient faults, even though the latter
result in fewer erroneous data items.
(1) Flushing: First, we define the erroneous block to consist of the following data: (a) data
inside the PEs and buffers between the error source PE and the detective PE; (b) data inside the
recovery caches between these two PEs, from the position containing the data corresponding
to the erroneous task up to the most recent position. The region enclosed by dashed lines in
Fig. 10(a) shows the erroneous block for the case where a transient fault occurred in PE1
when task number 7 was being processed and is detected by PE4 as an input error. The first
step of rollback is to flush all the data in the erroneous block, as shown in Fig. 10(b).
(2) Local recovery: Since the fault only affects the PEs between the error source PE and the 
detective PE inclusive (called the local recovery set), the portion of the pipeline containing
all the other PEs is frozen during the rollback. The local recovery line is defined to consist
of all the PEs in the local recovery set with the erroneous task number. The local recovery 
scheme is to apply the start-up control to the local recovery line by viewing the local 
recovery set as a short pipeline, the erroneous task as the first task and the data in the 
recovery cache of the error source PE, which is correct and thus not flushed, as the input 
data. The erroneous block is rebuilt as shown in Fig. 10(c)-(h), after which all PEs proceed
as before (Fig. 10(i)).
Figure 10. Local recovery procedure: (a) fault occurrence and error detection; (b) data
flushing; (c)-(h) local recovery using start-up control; (i) resumption of normal
processing. Panels (a) through (i) show clock cycles i through i+8.
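The partition used by local recovery can be written down directly. The sketch below only computes which PEs take part in the rollback and which are frozen; it does not model flushing or the cycle-by-cycle rebuild. The function name `rollback_plan`, the dictionary keys, the downstream-increasing PE numbering, and the array size of 6 in the example are assumptions.

```python
def rollback_plan(num_pes: int, source_pe: int, detective_pe: int, bad_task: int) -> dict:
    """Split the array into the local recovery set and the frozen PEs.

    Assumes PEs are numbered 0 .. num_pes-1 in the direction of data flow.
    """
    if not 0 <= source_pe <= detective_pe < num_pes:
        raise ValueError("source PE must be at or upstream of the detective PE")
    recover = list(range(source_pe, detective_pe + 1))  # inclusive on both ends
    frozen = [p for p in range(num_pes) if p < source_pe or p > detective_pe]
    # The erroneous task plays the role of "task 0" when start-up control is re-applied.
    return {"first_task": bad_task, "recover": recover, "frozen": frozen}


if __name__ == "__main__":
    # Example from the text: fault in PE1 while processing task 7, detected by PE4.
    print(rollback_plan(num_pes=6, source_pe=1, detective_pe=4, bad_task=7))
```

Treating the local recovery set as a short pipeline lets the same start-up control used at power-on drive the rollback.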
4.4. Summary of Design Changes
A PACED array equipped with the proposed design changes was shown in Fig. 1. Error
checking circuits are built into each PE for time-redundant computation checking and input code
checking. A code checker is added at the end of the array as discussed in Section 3.2. With
optimal scheduling of checking patterns and 100% overhead time-redundant checking (k=1), N-1
buffers are inserted between each adjacent PE pair and a recovery cache of size
$2 \times \lceil (M-N)/N \rceil$ is attached to each PE for storing incoming data from the top and the left.
The control logic is
responsible for correct data buffering, start-up control, error diagnosis and local recovery. The
techniques described in this paper have been simulated on an Alliant multiprocessor with eight
processors to show their correct operation.
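The hardware quantities in this summary follow directly from M and N. The sketch below tabulates them for the k = 1 case with optimal scheduling; it assumes the cache size formula is 2 x ceil((M-N)/N), and the function name `design_parameters` and its dictionary keys are hypothetical.

```python
from math import gcd


def design_parameters(M: int, N: int) -> dict:
    """Buffer and cache sizes for period M and checking burst N (k = 1, optimal offset)."""
    if not 0 < N < M:
        raise ValueError("expected 0 < N < M")
    max_edl = -(-(M - N) // N)  # ceil((M - N) / N), assumed maximum latency
    return {
        "offset": N - 1,                      # checking pattern offset
        "buffers_per_pe_pair": N - 1,
        "recovery_cache_size": 2 * max_edl,   # top and left inputs
        "relatively_prime": gcd(M, N) == 1,
    }


if __name__ == "__main__":
    print(design_parameters(M=7, N=2))  # illustrative values only
```

The `relatively_prime` flag reflects the condition on M and N under which the buffering overhead is minimized.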
5. CONCLUSIONS
It was shown that, for a PACED array with the period of checking pattern equal to M
computation cycles, the length of checking burst equal to N computation cycles and fixed throughput
(determined by the checking frequency $N/M$), the optimal scheduling in terms of minimizing the
maximum error detection latency and error escape probability is achieved by setting the
checking pattern offset Om to N-1. Also, by choosing M and N to be relatively prime, the hardware
overhead for data buffering is minimized and the checking capability is uniformly distributed
among the tasks.
Dynamic buffering techniques to preserve the systolic nature and the implementation for
rollback recovery under faults were presented. It was shown that the complexity in the diagnosis
and recovery process, resulting from the error latency as a trade-off for performance, can be
reduced through the use of a direct-mapped recovery cache and start-up control. The design
flexibility can be further improved by using a programmable control unit.
APPENDIX
Proof of Lemma 2.
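$(j + i \times J) \bmod M = (j + (i \times J) \bmod M) \bmod M$
$= (j + ((i \bmod M) \times J) \bmod M) \bmod M$
$\in \{ (j + nd) \bmod M \mid 0 \le n \le M/d - 1 \}$ by Lemma 1.
Also, $\{ (j + i \times J) \bmod M \mid 0 \le i \le M-1 \} = \{ (j + nd) \bmod M \mid 0 \le n \le M/d - 1 \}$, and the values
$(j + i \times J) \bmod M$, $0 \le i \le M-1$, contain d copies of each element in the set on the right-hand side.
For $0 \le n \le M/d - \lfloor j/d \rfloor - 1$,
$\Rightarrow 0 \le j + nd \le M + j \bmod d - d < M$
$\Rightarrow (j + nd) \bmod M = j + nd = j \bmod d + (n + \lfloor j/d \rfloor) d = j \bmod d + md$, $\lfloor j/d \rfloor \le m \le M/d - 1$.
For $M/d - \lfloor j/d \rfloor \le n \le M/d - 1$,
$\Rightarrow M + j \bmod d \le j + nd \le M + j - d < 2M$
$\Rightarrow (j + nd) \bmod M = j + nd - M = j \bmod d + (n + \lfloor j/d \rfloor - M/d) d = j \bmod d + md$, $0 \le m \le \lfloor j/d \rfloor - 1$.
Therefore,
$\{ (j + nd) \bmod M \mid 0 \le n \le M/d - 1 \} = \{ j \bmod d + md \mid 0 \le m \le M/d - 1 \}$.
Finally, we have $(j + i \times J) \bmod M \in R_{j \bmod d}$ and $\{ (j + i \times J) \bmod M \mid 0 \le i \le M-1 \} = R_{j \bmod d}$. □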
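The statement proved above can be spot-checked by enumeration. The sketch below assumes that d denotes gcd(J, M) and that R_r is the set {r + md | 0 <= m <= M/d - 1}; the function names and the values M = 12, J = 8, j = 5 are illustrative choices.

```python
from collections import Counter
from math import gcd


def visited_cycles(j: int, J: int, M: int) -> Counter:
    """Multiset of (j + i*J) mod M for i = 0, ..., M-1."""
    return Counter((j + i * J) % M for i in range(M))


def residue_class(r: int, d: int, M: int) -> set:
    """R_r = {r + m*d : 0 <= m <= M/d - 1}, assuming d divides M."""
    return {r + m * d for m in range(M // d)}


if __name__ == "__main__":
    M, J, j = 12, 8, 5          # illustrative values only
    d = gcd(J, M)               # assumed meaning of d
    counts = visited_cycles(j, J, M)
    print(sorted(counts), set(counts.values()))
    print(sorted(residue_class(j % d, d, M)))
```

For these values the visited cycles are exactly the residue class of j mod d, and each one is visited d times, in line with the lemma.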
REFERENCES
[1] W. Moore, A. McCabe and R. Urquhart, (eds.) Systolic Arrays, Adam Hilger, 1987.
[2] J. A. Abraham, P. Baneijee, C.-Y. Chen, W. K. Fuchs, S. Y. Kuo, A.L. N. Reddy, "Fault 
tolerance techniques for systolic arrays," IEEE Computer, vol. 20, no. 7, July 1987, pp. 65- 
75.
30
[3] P. P. Chen, A. N. Mourad and W. K. Fuchs, "Confidence in processor array outputs under 
periodic application of concurrent error detection," IEEE Workshop on Defect and Fault 
Tolerance in VLSI Systems, Nov. 1990.
[4] W. T. Cheng and J. H. Patel, "Concurrent error detection in iterative logic arrays", Proc. 
14th IEEE International Conference on Fault-Tolerant Computing, 1984, pp. 286-291.
[5] D. A. Reynolds and G. Metze, "Fault detection capabilities of alternating logic", IEEE 
Trans. Computers, Vol. C-27, No. 12, pp. 1093-1098, Dec. 1978.
[6] J-C Fabre, Y. Deswarte, J-C Laprie and D. Powell, "Saturation: reduced idleness for im­
proved fault-tolerance", Proc. 18th IEEE Fault Tol. Comp. Symp., 1988, pp. 200-205.
[7] G. S. Sohi, M. Franklin and K. K. Saluja, "A study of time-redundant fault tolerance tech­
niques for high-performance pipelined computers", Proc. 19th IEEE Fault Tol. Comp. 
Symp., 1989, pp. 436-443.
[8] M. A. Schuette and J. P. Shen, "Exploiting instruction-level resource parallelism for tran­
sparent control-flow monitoring", Research Report No. CMUCAD-90-42, Dec. 1990, 
Camegie-Mellon University.
[9] Y. H. Choi, S. H. Han and M. Malek, "Fault diagnosis of reconfigurable systolic arrays", 
Proc. IEEE International Conference on Computer Design, Port Chester, NY, Oct. 1984, 
pp. 451-455.
[10] Y. H. Choi and M. Malek, "A fault-tolerant systolic sorter", IEEE Trans, on Computers, 
Vol. 37, No. 5, May 1988, pp. 621-624.
[11] S. W. Chan and C. L. Wey, "The design of concurrent error diagnosable systolic arrays for 
band matrix multiplications", IEEE Trans, on Computer-Aided Design, Vol. 7, No. 1, Jan. 
1988, pp. 21-37.
[12] R. J. Cosentino, "Concurrent error correction in systolic architectures", IEEE Trans, on 
Computer-Aided Design, Vol. 7, No. 1, Jan. 1988, pp. 117-125.
[13] H. T. Kung and M. S. Lam, "Wafer-scale integration and two-level pipelined implementa­
tions of systolic arrays", J. of Parallel and Distributed Computing, Vol. 1, 1984, pp. 32-63.
[14] S. Y. Kung, VLSI Array Processors, Prentice Hall, Englewood Cliffs, 1988.
[15] R. L. Graham, D. E. Knuth, O. Patashnik, "Concrete Mathematics", Addison-Wesley Pub­
lishing, New York, 1989.
[16] A. L. Rosenberg, "The Diogenes approach to testable fault-tolerant arrays of processors", 
IEEE Trans, on Computers, Vol. C-32, No. 10, Oct. 1983, pp. 902-910.
31
[17] E. S. Manolakos, "Transient fault recovery techniques for the VLSI processor arrays", 
PhD. Dissertation, University of Southern California, May 1989.
