Design and scheduling for periodic concurrent error detection and recovery in processor arrays by Chung, Pi-Yu et al.
May 1992 UILU-ENG-92-2214
CRHC-92-08
Center for Reliable and High-Performance Computing
DESIGN AND SCHEDULING
FOR PERIODIC
CONCURRENT ERROR DETECTION
AND RECOVERY
IN PROCESSOR ARRAYS
Yi-Min Wang, Pi-Yu Chung, and W. Kent Fuchs
(NASA-CR-190571) DESIGN AND SCHEDULING FOR
PERIODIC CONCURRENT ERROR DETECTION AND
RECOVERY IN PROCESSOR ARRAYS (Illinois
Univ.) 33 p
G3/_t
NQZ-ZQ605
Unclas
0109244
Coordinated Science Laboratory
College of Engineering
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
Approved for Public Release. Distribution Unlimited.
https://ntrs.nasa.gov/search.jsp?R=19920020452 2020-03-17T11:30:32+00:00Z

Design and Scheduling for Periodic Concurrent Error Detection
and Recovery in Processor Arrays
Yi-Min Wang, Pi-Yu Chung and W. Kent Fuchs
Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
Correspondent: Yi-Min Wang
Coordinated Science Laboratory
1101 W. Springfield Ave.
University of Illinois
Urbana, IL 61801
E-mail: ymwang@crhc.uiuc.edu
Phone: (217) 244-7161
FAX: (217) 244-5686
Abstract
Periodic application of time-redundant error checking provides the trade-off between error
detection latency and performance degradation. The goal is to achieve high error coverage while
satisfying performance requirements. In this paper, we derive the optimal scheduling of check-
ing patterns in order to uniformly distribute the available checking capability and maximize the
error coverage. Synchronous buffering designs using data forwarding and dynamic
reconfiguration are described. Efficient single-cycle diagnosis is implemented by error pattern
analysis and direct-mapped recovery cache. A rollback recovery scheme using start-up control
for local recovery is also presented.
Acknowledgement: This research was supported in part by the National Aeronautics and Space Administration
(NASA) under Grant NASA NAG 1-613, in cooperation with the Illinois Computer Laboratory for Aerospace
Systems and Software (ICLASS), and in part by the Joint Services Electronics Program (U.S. Army, U.S. Navy
and U.S. Air Force) under Contract N00014-90-J-1270.
1. INTRODUCTION
A variety of processor arrays have been proposed for signal and image processing and
scientific computation applications [1]. In order to detect errors produced by faults in these ar-
rays a variety of off-line testing procedures have been developed for detecting permanent faults
and concurrent error detection (CED) techniques have been developed for transient and intermit-
tent failures [2]. The focus of this paper is on processor arrays for systolic algorithms and the
use of time redundancy techniques for concurrent detection of errors [3,17].
Traditionally, CED is applied continuously to each computation activity so that an error
resulting from a fault in the processing element (PE) can be detected immediately. However,
when time redundancy techniques are used for error detection this continuous checking scheme
may greatly degrade the array performance, e.g., by a factor of two for RESO [4] or alternating
logic [5]. For some applications where high processing speed is crucial and error detection
latency is tolerable, it may be possible to maintain the desired throughput while keeping a
reasonably high error coverage by turning the CED mechanism on and off periodically. Periodic
Application of CED (PACED) offers such a trade-off in error latency and probability of error
detection versus performance degradation.
Several techniques regarding the utilization of idle processor cycles for CED have been
proposed [6-12]. For general-purpose machines with processor-level parallelism, a technique
called saturation has been introduced for utilizing the idle processors to execute replicated ver-
sions of tasks and employing majority voting to determine the output [6]. For processors with
multiple pipelined functional units, like the Cray-1, RESO has been applied to the idle function
units and was shown to equip the scalar unit with error checking capability at the cost of minor
performance degradation [7]. Another recently proposed technique, called Available-Resource
Control-flow monitoring (ARC) [8], is aimed at the resource parallelism of instruction-level
parallel processors such as superscalar and Very Long Instruction Word (VLIW) processors. The
idle resources in these processors were utilized to detect the control-flow errors.
In the area of systolic architectures, one approach has been developed to take advantage of
the existing bypassing links in a reconfigurable array to pass the same input data to two adjacent
PEs and then compare the outputs to do the error detection [9]. A control bit, called test token,
was inserted periodically from outside and passed along the array to determine when a particular
PE should invoke a duplicated operation on its neighbor. Related results using error checking
code to achieve algorithm-based fault tolerance for a systolic sorter have also been developed
[10].
The incorporation of CED capability in systolic arrays for band matrix multiplication has
been developed with design parameters such as throughput latency, per-cycle PE utilization rate
and I/O bandwidth [11]. The arrays were required to have a per-cycle PE utilization rate less
than 50% in order to leave room for the RESO-based CED technique. Flexible designs were pro-
posed [12] which allow the user to either employ the full throughput rate capability of the system
or trade off the throughput rate for greater reliability.
In the initial description of the general concept of periodic application of CED (PACED)
[3] by Chen et al., error pattern analysis was performed only for a specific set of PACED param-
eters and the actual implementation was not discussed. The major contribution of this current pa-
per is that we start from a general formulation by defining a set of PACED parameters which are
optimized to achieve the maximum error coverage and reduce the hardware cost.
The PACED implementation considered utilizes the following properties:
(1) Each PE is capable of performing time-redundant computation checking for itself as well as
input code checking for the possibly erroneous output data produced and propagated by
previous PEs.
(2) A single fault is present between the time of the initial fault occurrence and error detection.
The processor arrays considered in this study are unidirectional linear processor arrays con-
sisting of Q processing elements with inputs entering from the top and left [1,13,14]. A PACED
array driven by the original clock and equipped with the capability of concurrent error detection
and automatic error recovery is shown in Fig. 1. The control logic consists of circuitry to per-
form buffering, diagnosis, rollback and start-up control.
The outline of the paper is as follows: Section 2 establishes the system parameters; Section
3 gives the optimization of system parameters with respect to various metrics; Section 4 pro-
poses the required design changes for data buffering, error diagnosis and recovery; Section 5
concludes the paper.
2. SYSTEM PARAMETERS
For two PEs in our processor array, PEi is upstream of PEj and PEj is downstream from
PEi if i < j. PEs may not be identical; however, each has approximately the same processing time
so that, without PACED, the array forms a balanced pipeline with clock cycle time equal to a
time units. When CED is applied, each PE needs another b time units to perform error checking.
For the purpose of preserving the synchronous nature of the original processor array, b is
rounded off to multiples of a, b = ka. Therefore, for example, k = 1 corresponds to 100% time
overhead. The entire activity applied to a certain set of data at each PE is called a computation
cycle with or without checking, as opposed to the physical clock cycle which always takes a time
units. At the beginning of a clock cycle, each PE reads from its local counter or a global counter
the checking bit to determine whether it should perform the checking (1) or not (0). Checking
patterns are the plots of checking bits as a function of computation cycle number as shown in
Fig. 2.
Figure 1. Block diagram of the PACED array with buffering and recovery caches.
[Figure 2: checking bits of PE0 and PE1 plotted against computation cycles, marking the period M,
the checking burst N, the offset OM, the Corresponding Computation Cycles (CCCs) and a task path.]
Figure 2. Checking pattern as a function of computation cycle number.
The basic idea of PACED is to schedule the checking patterns with the same checking fre-
quency among PEs. Therefore, while some PEs are executing computation cycles with checking,
some are not. The Corresponding Computation Cycles (CCCs) are defined to be all those cycles
on different PEs which were originally executed at the same time in the array without CED. A
task is defined to consist of all the activities applied to each input data by the processor array to
obtain the corresponding output. The task path consists of all those cycles on different PEs at
which a certain task is processed as it travels across the array. Each set of checking patterns is
characterized by the following four parameters all in terms of computation cycles :
(1) M :length of one period;
(2) N : length of one checking burst;
(3) OM : offset between checking patterns for adjacent PEs;
(4) OI : initial offset (with respect to computation cycle 0) of the checking pattern for the first
PE. The numbering of the computation cycles is shown in Fig. 2.
While the checking pattern plot in terms of computation cycle as in Fig. 2 is used to illus-
trate the idea of PACED, it is more convenient to use the Task/PE diagram shown in Fig. 3 for
our analysis. In such a diagram, the checking patterns are adjusted so that each column
corresponds to a single task path. The offset between adjacent patterns, J, becomes OM plus one.
The Task/PE diagram will be used to analyze problems related to computation cycle such as Er-
ror Detection Latency (EDL) analysis and diagnosis.
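As an illustrative sketch (not part of the original report), the Task/PE diagram can be generated from the relation used in the proof of Theorem 1, namely that a task entering PE0 at pattern position j is processed by PEi at position (j + i×J) mod M, with checking applied at positions 0 to N−1. The function names, the rendering characters and the example parameters are assumptions made for illustration only; the initial offset OI is assumed to be folded into the task numbering.

```python
def task_pe_diagram(num_pes, num_tasks, M, N, OM):
    """Checking bits of the Task/PE diagram: one row per PE, one column per task.

    Assumption: task t enters PE0 at pattern position t mod M, and PEi then
    sees position (t + i*J) mod M, where J = OM + 1.
    """
    J = OM + 1  # offset between adjacent patterns in the Task/PE diagram
    return [
        [1 if (t + i * J) % M < N else 0 for t in range(num_tasks)]
        for i in range(num_pes)
    ]


def render(diagram):
    """Draw checking cycles as '#' and non-checking cycles as '.'."""
    return ["".join("#" if bit else "." for bit in row) for row in diagram]


if __name__ == "__main__":
    # Hypothetical parameters chosen only for illustration.
    for i, line in enumerate(render(task_pe_diagram(3, 12, M=6, N=2, OM=1))):
        print(f"PE{i} {line}")
```

Each column of the result is one task path, so counting the 1s in a column gives the number of times that task is checked on its way through the array.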
It can be shown that the choice of OI does not affect the analysis. Moreover, we will
consider N and N/M as two of the parameters instead of M and N. Therefore, the three parameters
involved in the optimization problem will be N, N/M and OM.
[Figure 3: one row per PE plotted against task number, with the pattern offset J = OM + 1,
the CCCs and a task path marked.]
Figure 3. Task/PE diagram of the same array as in Fig. 2 with shaded squares
representing computation cycles with checking.
3. OPTIMIZATION OF SYSTEM PARAMETERS
3.1. Performance Analysis
It is intuitive that N/M is a measure of the checking frequency and, therefore, a measure of
time overhead. However, because of the imbalance introduced by PACED, the wait time
between the output of the upstream PE and the input of the downstream PE of each adjacent pair
constitutes another time overhead in addition to the checking time. The total execution time of a
certain task is then given as
Total execution time = computation time + checking time + wait time
Fig. 4(a) shows the task waiting pattern with each arrow starting from the time the
upstream PE outputs the data and pointing to the time the downstream PE reads in the data. The
wait time after computation cycle j, Wait(j), in terms of number of clock cycles has the follow-
ing dependence on OM as well as on N and M and is shown in Fig. 4(b).
$$
\mathrm{Wait}(j) =
\begin{cases}
O_M \times k, & 0 \le j \le N - O_M - 1 \\
(N - j - 1) \times k, & N - O_M \le j \le N - 1 \\
0, & N \le j \le M - O_M - 1 \\
(O_M - M + j + 1) \times k, & M - O_M \le j \le M - 1
\end{cases}
$$
Therefore, fixing N/M does not necessarily fix the time overhead. However, the most important
performance measure for a processor array is the throughput instead of the total execution
time of each single task. As shown in Fig. 4(a), although the data has to wait for the downstream
PE to become free, the PEs are always kept busy. Therefore, the processor array will produce M
outputs for every (M+N×k) clock cycles and the throughput can be calculated as
$$
\text{throughput} = \frac{M}{(M + N \times k)\,a} = \frac{1}{\left(1 + \frac{N}{M} \times k\right) a}
$$
which is independent of OM.
Figure 4. (a) Task waiting pattern between adjacent PEs (b) wait time as a function of
computation cycle number.
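The wait-time and throughput expressions of this section can be evaluated with a short sketch. It is illustrative only: the function names are assumptions, the clock cycle time a is taken to be an integer number of time units so that exact fractions can be used, and the piecewise form of Wait(j) is assumed to require 0 ≤ OM ≤ N and N + OM ≤ M so that its four ranges do not overlap.

```python
from fractions import Fraction


def wait_time(j, M, N, OM, k):
    """Wait(j) in clock cycles after computation cycle j, 0 <= j <= M-1."""
    # Assumption: the four ranges are only well formed under these bounds.
    if not (0 <= OM <= N and N + OM <= M):
        raise ValueError("expected 0 <= OM <= N and N + OM <= M")
    if not 0 <= j < M:
        raise ValueError("expected 0 <= j <= M-1")
    if j <= N - OM - 1:
        return OM * k
    if j <= N - 1:
        return (N - j - 1) * k
    if j <= M - OM - 1:
        return 0
    return (OM - M + j + 1) * k


def throughput(M, N, k, a):
    """Outputs per time unit: M outputs every (M + N*k) clock cycles of length a."""
    return Fraction(M, (M + N * k) * a)


if __name__ == "__main__":
    # Hypothetical parameters: M = 7, N = 3, OM = N - 1 = 2, k = 2, a = 1.
    print([wait_time(j, 7, 3, 2, 2) for j in range(7)])
    print(throughput(7, 3, 2, 1))
```

The largest value returned by wait_time over one period is OM×k, and throughput does not take OM as an argument, which mirrors the observation that the throughput is independent of the pattern offset.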
For N = 0, the PACED array reduces to the original array without CED which has the
highest throughput 1/a but no error detection capability. For N = M, PACED reduces to
continuous checking which has the lowest throughput 1/((1+k)a) but can detect errors without latency.
The problem of optimal scheduling for PACED is then formulated as: given a throughput
requirement 1/((1+(N/M)×k)a), how to choose OM to minimize the error detection latency and
maximize the error coverage.
3.2. Potentially Infinite Error Detection Latency
There are two cases in which transient faults occurring at certain cycles will have infinite
error detection latency (EDL), which means that the errors will escape with 100% probability.
Case 1 : Improper choice of the values for M, N and OM.
According to the task/PE diagram in Fig. 3, as a task travels through the array, it can be
viewed as advancing in the checking pattern with a speed of J computation cycles per PE where
J = OM + 1. If we can make sure that at least one cycle with checking, i.e., 0 ≤ j ≤ N−1, is on the
path of each task, we can prevent the infinite EDL from occurring under the assumption that
errors can not be masked during the propagation. (Error masking is considered in Section 3.5.) The
following Lemma 1 for proving Fermat's Little Theorem [15] is used to prove Theorem 1.
LEMMA 1. The M numbers 0 mod M, J mod M, 2J mod M, ..., (M−1)J mod M consist
of precisely d copies of the M/d numbers 0, d, 2d, ..., (M/d − 1)d where d = gcd(M,J)
(gcd stands for greatest common divisor).
LEMMA 2. With the same notation as in Lemma 1, define the remainder set
$R_V = \{V + nd \mid 0 \le n \le M/d - 1\}$ for 0 ≤ V ≤ d−1, then
(1) $(j + i \times J) \bmod M \in R_{j \bmod d}$ for 0 ≤ j ≤ M−1 and all non-negative integers i.
(2) $\{(j + i \times J) \bmod M \mid 0 \le i \le M-1\} = R_{j \bmod d}$, where d = gcd(M, J). More precisely,
(j + i×J) mod M, 0 ≤ i ≤ M−1, contain d copies of each of the elements in $R_{j \bmod d}$.
Proof. See Appendix.
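Lemmas 1 and 2 can be checked numerically for small parameters. The following is a minimal brute-force sketch, not a proof; the function names are assumptions introduced here.

```python
from collections import Counter
from math import gcd


def residues(M, J, j=0):
    """Multiset of (j + i*J) mod M for i = 0, ..., M-1."""
    return Counter((j + i * J) % M for i in range(M))


def remainder_set(M, d, V):
    """R_V = {V + n*d : 0 <= n <= M/d - 1}."""
    return {V + n * d for n in range(M // d)}


def check_lemma_2(M, J):
    """True if every start j yields d copies of each element of R_(j mod d)."""
    d = gcd(M, J)
    for j in range(M):
        counts = residues(M, J, j)
        if set(counts) != remainder_set(M, d, j % d):
            return False
        if any(c != d for c in counts.values()):
            return False
    return True


if __name__ == "__main__":
    print(all(check_lemma_2(M, J) for M in range(1, 13) for J in range(1, M + 1)))
```

Lemma 1 is the special case j = 0, for which the remainder set is the set of multiples of d below M.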
THEOREM 1. Except for some faults occurring in the last (M−1) PEs of the array, all
transient faults have finite EDL (less than M) if and only if gcd(M,J) = d ≤ N.
Proof. By the task/PE diagram in Fig. 3, a task entering PE0 at cycle j will be processed by
PEi at cycle (j + i×J) mod M. By Lemma 2(1), $(j + i \times J) \bmod M \in R_{j \bmod d}$. If N < d, for any
j such that N ≤ j < d, every integer in $R_{j \bmod d}$ is greater than N − 1, hence for all non-negative
integers i, (j + i×J) mod M > N − 1, which means a task entering PE0 at such a cycle j will never be
checked, resulting in infinite EDL if it is affected by some transient fault.
Conversely, if d ≤ N, (j mod d) ≤ N − 1. By Lemma 2(2), we have
$\{(j + i \times J) \bmod M \mid 0 \le i \le M-1\} = R_{j \bmod d}$, which means that an erroneous task produced at
any cycle j will be checked at least once at cycle (j mod d) by the faulty PE itself or one of its
(M−1) immediate downstream PEs. Therefore, as long as the faulty PE is not one of the last
(M−1) PEs of the array, the error will be detected. □
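The condition of Theorem 1 can be cross-checked by exhaustive search over small parameter values. The sketch below is illustrative only and ignores the end-of-array effect by assuming the array is long enough; the function names are assumptions.

```python
from math import gcd


def first_check(j, M, N, J):
    """PE hops until a task at pattern position j first meets a checking cycle.

    Returns None if no position on the task path is a checking cycle.
    Positions repeat after at most M hops, so M iterations are enough.
    """
    for i in range(M):
        if (j + i * J) % M < N:
            return i
    return None


def all_latencies_finite(M, N, J):
    """True if every task path contains at least one cycle with checking."""
    return all(first_check(j, M, N, J) is not None for j in range(M))


if __name__ == "__main__":
    agree = all(
        all_latencies_finite(M, N, J) == (gcd(M, J) <= N)
        for M in range(2, 13)
        for N in range(1, M + 1)
        for J in range(1, M + 1)
    )
    print(agree)
```

For example, with M = 6, J = 3 and N = 2 the gcd is 3 > N, and a task starting at position 2 only ever visits positions 2 and 5, neither of which is checked.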
Case 2 : Faults occurring near the end of the array.
As long as the CED is not applied continuously, EDL must exist. When a transient fault oc-
curs at some PE in a computation cycle with EDL larger than the number of the downstream PEs
from it, the error will escape. This phenomenon exists no matter how we schedule the checking
patterns (only the severity varies). One possible solution is to add a code checker with lower
complexity and higher reliability at the end of the array to perform continuous code checking in
order to intercept the escaping errors if desired.
3.3. Uniform Distribution of Checking Capability
When the checking frequency is set to N/M, it is just an average over all tasks and does not
necessarily guarantee that each task will be processed with checking N/M of the time along its
path. Since all tasks have equal significance, it is desirable to schedule the checking patterns
such that each task is treated as uniformly as possible. Theorem 2 gives the condition for this
purpose.
THEOREM 2. Except for the difference due to the fact that array length Q might not
be a multiple of M, the checking capability is uniformly distributed among all tasks if
gcd(M,J) = d divides N.
Proof. If N = m d, where m is a positive integer, for every $R_V$, 0 ≤ V ≤ d − 1, the first m
elements V + n d, 0 ≤ n ≤ m − 1, are less than N and others are not. By Lemma 2, (j + i×J) mod M,
0 ≤ i ≤ M − 1, contain d copies of each of the elements in $R_{j \bmod d}$, which implies that a task
entering PE0 at any cycle j will be checked m × d times among M computation cycles. Therefore, all
tasks are fairly treated and checked with the same frequency (m×d)/M = N/M. □
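The counting argument of Theorem 2 can be illustrated with a small brute-force sketch. It is not taken from the report; the function names are assumptions, and the array is assumed to span exactly one period of M PEs.

```python
from math import gcd


def checks_per_period(j, M, N, J):
    """Number of checked cycles met by a task over M consecutive PEs."""
    return sum(1 for i in range(M) if (j + i * J) % M < N)


def uniformly_distributed(M, N, J):
    """True if every task is checked exactly N times per M PEs."""
    return all(checks_per_period(j, M, N, J) == N for j in range(M))


if __name__ == "__main__":
    # Hypothetical example: M = 6, J = 3 gives d = 3.
    print(uniformly_distributed(6, 3, 3))  # d divides N
    print([checks_per_period(j, 6, 2, 3) for j in range(6)])  # d does not divide N
```

With M = 6, J = 3 and N = 2, tasks entering at positions 0 and 1 are checked three times per period while the task entering at position 2 is never checked, which shows how the average N/M can hide an uneven distribution.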
3.4. Minimization of Maximum Error Detection Latency
The EDL(j) of both permanent and transient faults occurring at computation cycle j with
checking (0 ≤ j ≤ N−1) are defined to be zero in terms of number of computation cycles because
they are detected immediately. The EDL(j) of other non-checking cycles under the constraint
that 1 ≤ J ≤ N are given by Lemma 3 with superscript "C" indicating the error is detected as a
computation error and "I" standing for input error. We can prove the optimal solution under this
constraint actually achieves the optimality for general values of J. For the rest of this paper, we
will assume the duration of a transient fault is much less than the clock cycle time so that it can
only affect the result of one computation¹.
LEMMA 3. For 1 ≤ J ≤ N and N ≤ j ≤ M−1,
(1) under transient faults: $EDL^I(j) = \lceil (M-j)/J \rceil$; $EDL^C(j) = \infty$.
(2) under permanent faults: $EDL^I(j) = \lceil (M-j)/J \rceil$; $EDL^C(j) = M - j$.
Proof. (1) Because we can view a task traveling across the processor array as advancing on
the checking pattern with J cycles per step, we have the following formula for the EDL of a tran-
sient fault for general J:
To our knowledge, there is no closed-form solution for this general problem. The constraint
1 ≤ J ≤ N will make sure that an error caused by a fault occurring at a non-checking cycle will be
captured by the next checking burst (r = 1) of some downstream PE. Therefore,
$$
EDL^I(j) = \left\lceil \frac{M-j}{J} \right\rceil
$$
Under the assumption about the length of the transient faults, a PE can not detect such faults
occurring at non-checking cycles, so the $EDL^C(j)$ is infinity.
(2) A permanent fault can be considered as consisting of a large number of transient faults
occurring at consecutive computation cycles. Using the above formula, we have, for permanent
fault,
$$
EDL^I(j) = \min_{0 \le q < M-j} \left\{ \left\lceil \frac{M-(j+q)}{J} \right\rceil + q \right\}
$$
hence
$$
EDL^I(j) = \left\lceil \frac{M-j}{J} \right\rceil
$$
For a permanent fault starting at a non-checking cycle, the faulty PE will detect it as a
computation error as soon as it enters the next checking burst. Therefore, $EDL^C(j) = M - j$. □
¹ Most of the results will still be valid without this assumption except for more complicated error pattern analysis described
later.
LEMMA 4. For 1 ≤ J ≤ N and N ≤ j ≤ M−1, if we define EDL(j) to be the latency until
the first error indication (computation or input error), then for both permanent and
transient faults
$$
EDL(j) = \left\lceil \frac{M-j}{J} \right\rceil
$$
Proof. By Lemma 3, $EDL^I(j) = \lceil (M-j)/J \rceil \le M - j$, where equality holds for J = 1 or
M − j = 1. Hence,
$$
EDL(j) = \min\{EDL^C(j), EDL^I(j)\} = EDL^I(j)
$$
□
Next we give some definitions for proving the optimal scheduling.
DEFINITION 1. Given a sequence of n numbers $S = (s_1, s_2, \ldots, s_n)$, $\Pi(S)$ is defined to
be a nondecreasing permutation of S, i.e., $\Pi(S) = (\pi_1(S), \pi_2(S), \ldots, \pi_n(S))$ is a permutation
of S where $\pi_1(S) \le \pi_2(S) \le \cdots \le \pi_n(S)$.
DEFINITION 2. Given two sequences of n numbers S and T, we define $S \le T$ if $s_i \le t_i$ for
1 ≤ i ≤ n.
LEMMA 5. Define $E_J = (e_{1J}, e_{2J}, \ldots, e_{(M-N)J}) = (EDL_J(M-1), EDL_J(M-2), \ldots, EDL_J(N))$
for 0 ≤ J ≤ M−1, then
(1) $\Pi(E_J) = E_J$ for 1 ≤ J ≤ N
(2) $E_J \ge E_{J+1}$ for 1 ≤ J ≤ N−1
(3) $\Pi(E_N) \le \Pi(E_J)$ for all integer J.
Proof. (1) By Lemma 4,
$$
e_{iJ} = EDL_J(M-i) = \left\lceil \frac{i}{J} \right\rceil \quad \text{for } 1 \le J \le N \text{ and } 1 \le i \le M-N; \tag{1}
$$
hence $e_{iJ} \le e_{jJ}$ for 1 ≤ i ≤ j ≤ M−N and $\Pi(E_J) = E_J$.
(2) By Eq. (1) we also have $e_{iJ} \ge e_{i(J+1)}$ for 1 ≤ i ≤ M−N; hence
$E_J \ge E_{J+1}$ for 1 ≤ J ≤ N−1.
(3) By (1) and (2), we have proved $\Pi(E_N) \le \Pi(E_J)$ for 1 ≤ J ≤ N. For general values of J, the
proof is by contradiction. Suppose there exists J such that $\Pi(E_N) \le \Pi(E_J)$ is not true. This
means there exists i, 1 ≤ i ≤ M−N, such that $\pi_i(E_N) > \pi_i(E_J)$ and hence
$$
\pi_i(E_N) - 1 \ge \pi_i(E_J) \ge \cdots \ge \pi_1(E_J) \ge 1. \tag{2}
$$
Because $\pi_i(E_N) = e_{iN} = \lceil i/N \rceil$ and $i/N + 1 > \lceil i/N \rceil$ by definition, we have
$i/N + 1 > \pi_i(E_N)$ and
$$
\frac{i}{\pi_i(E_N) - 1} > N. \tag{3}
$$
Hence, by Eqs. (2) and (3), there must exist m, $1 \le m \le \pi_i(E_N) - 1$, such that more than N
elements of $\{\pi_1(E_J), \ldots, \pi_i(E_J)\}$ are equal to m. However, for a checking burst of length N, the
maximum number of non-checking cycles with the same EDL is N for any of the EDL values.
Therefore, we have reached a contradiction and $\Pi(E_N) \le \Pi(E_J)$ for all J. □
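THEOREM 3. The maximum EDL, $EDL_J^{max}$, is minimized by setting J = N and the
minimum value is $\lceil M/N \rceil - 1$.
Proof. By (3) of Lemma 5, $EDL_J^{max} = \pi_{M-N}(E_J) \ge \pi_{M-N}(E_N) = EDL_N^{max}$ for all integer J.
Hence J = N minimizes the maximum EDL and $EDL_N^{max} = \lceil (M-N)/N \rceil = \lceil M/N \rceil - 1$. □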
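The statement of Theorem 3 can be explored with a brute-force sketch that computes the EDL of every cycle directly from the task path, without relying on the closed form. This is an illustration under the single-fault, no-masking assumptions of this section; the function names are assumptions and the array is assumed long enough for the error to reach a checking cycle.

```python
def ceil_div(x, y):
    """Integer ceiling of x / y for positive y."""
    return -(-x // y)


def edl(j, M, N, J):
    """EDL in computation cycles for a transient fault at pattern position j.

    Returns float('inf') when the task path never meets a checking cycle.
    """
    for i in range(M):
        if (j + i * J) % M < N:
            return i
    return float("inf")


def max_edl(M, N, J):
    """Largest EDL over one period of the checking pattern."""
    return max(edl(j, M, N, J) for j in range(M))


if __name__ == "__main__":
    M, N = 7, 3  # hypothetical parameters
    for J in range(1, M + 1):
        print(J, max_edl(M, N, J))
```

For 1 ≤ J ≤ N the value returned by edl on a non-checking cycle agrees with ceil((M − j)/J), and over all J from 1 to M the smallest max_edl is obtained at J = N.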
3.5. Minimization of Error Escape Probability
The price paid by PACED to maintain the desired throughput is lower error coverage. The
longer the error detection latency, the larger the possibility that an error will be masked during
the propagation and escape. Assume the probability that an error will be masked at any
computation cycle is $p_m$. The error escape probability, $P_{esc}(j)$, for a transient fault occurring at
computation cycle j is then given by $P_{esc}(j) = 1 - (1 - p_m)^{EDL_J(j)}$, 0 ≤ j ≤ M−1.
THEOREM 4. The average error escape probability is minimized by setting J = N.
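Proof. Assume a transient fault occurs at each cycle j, 0 ≤ j ≤ M−1, with equal probability.
The average error escape probability is
$$
\frac{1}{M} \sum_{j=0}^{M-1} \left[ 1 - (1-p_m)^{EDL_J(j)} \right]
= \frac{1}{M} \sum_{j=N}^{M-1} \left[ 1 - (1-p_m)^{EDL_J(j)} \right]
$$
because EDL(j) = 0 for 0 ≤ j ≤ N−1. By (3) of Lemma 5,
$$
\frac{1}{M} \sum_{j=N}^{M-1} \left[ 1 - (1-p_m)^{EDL_J(j)} \right]
= \frac{1}{M} \sum_{i=1}^{M-N} \left[ 1 - (1-p_m)^{\pi_i(E_J)} \right]
\ge \frac{1}{M} \sum_{i=1}^{M-N} \left[ 1 - (1-p_m)^{\pi_i(E_N)} \right]
= \frac{1}{M} \sum_{j=N}^{M-1} \left[ 1 - (1-p_m)^{EDL_N(j)} \right]
$$
for all integer J.
Hence, setting J = N minimizes the average error escape probability. □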
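The average escape probability of Theorem 4 can be evaluated numerically. The sketch below is illustrative: it assumes faults are equally likely at every cycle, as in the proof, treats a task that is never checked as escaping with probability 1, and uses function names introduced here for convenience.

```python
def edl(j, M, N, J):
    """EDL for a transient fault at position j, or None if never checked."""
    for i in range(M):
        if (j + i * J) % M < N:
            return i
    return None


def average_escape_probability(M, N, J, p_m):
    """Mean of 1 - (1 - p_m)**EDL(j) over j = 0, ..., M-1."""
    total = 0.0
    for j in range(M):
        latency = edl(j, M, N, J)
        if latency is None:
            total += 1.0  # infinite latency: the error always escapes
        else:
            total += 1.0 - (1.0 - p_m) ** latency
    return total / M


if __name__ == "__main__":
    M, N, p_m = 7, 3, 0.05  # hypothetical parameters
    for J in range(1, M + 1):
        print(J, round(average_escape_probability(M, N, J, p_m), 4))
```

Scanning J from 1 to M in this way shows the average reaching its smallest value at J = N, although other values of J may tie with it.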
By Theorems 3 and 4, we conclude that for a given N/M, J should be set equal to the length
of the checking burst N, i.e., the pattern offset between adjacent PEs should be OM = N − 1, in
order to minimize the maximum EDL and maximize the error coverage. This optimality is
independent of the choice of N.
3.6. Summary of Optimization Results
Given a fixed checking frequency N/M, we choose M and N to be relatively prime in order
to minimize both M and N. Minimizing M allows Theorem 2 to more accurately state the
condition for uniform distribution of checking capability. We will show in the next section that
tion for uniform distribution of checking capability. We will show in the next section that
minimizing N can minimize the hardware overhead for data buffering. Since J should be equal
to N for optimal error detection, we have gcd(M,J) = gcd(M,N) = 1 which satisfies both
conditions in Theorems 1 and 2. Therefore, the possibility of infinite error detection latency is
eliminated and the checking capability is uniformly distributed among all tasks.
4. DESIGN CHANGES
4.1. Synchronous Buffering Design
By scheduling checking patterns among PEs, resource (PE) conflict may occur when a PE
is still checking old data but new data has been produced by its upstream PEs. It was shown in
Fig. 4(b) that the maximum wait time is equal to OM×k clock cycles. Hence, it is adequate to
insert OM×k buffers between each adjacent PE pair, driven by the original clock². Since OM
should be equal to N−1 for optimal scheduling, the number of buffers will decrease as N
decreases. Therefore, choosing M and N to be relatively prime also minimizes the hardware
overhead for data buffering.
However as also shown in Section 3.1, the wait time, which determines the number of
needed buffers, is not a constant but a function of computation cycle number. It becomes clear at
this point that some kind of dynamic buffering technique has to be used to make the pipeline
flow smoothly and correctly. We propose two such approaches, namely, data forwarding and
dynamic reconfiguration.
The data forwarding approach to buffering is described as follows. When a PE is ready to
output the processed data, the wait time logic shown in Fig. 5, which monitors the checking bit
sequence, has determined the wait time, Wait(j), of the current cycle and connected the PE output to
the buffer which is Wait(j) stages away from the downstream PE. Once the data is placed into
the appropriate buffer, the synchronous buffering design will ensure the data arrives at the
downstream PE at the correct clock cycle.
Figure 5. Buffering by data forwarding
² It can be shown that the minimum number of required buffers is a fraction of OM×k (the exact
expression is illegible in the source). However, more complicated control circuits are needed to
reuse the buffers.
As an alternative, since the number of buffers needed varies with time, we can treat the
extra buffers at each clock cycle as being "faulty" and use the Diogenes approach [16] to
dynamically reconfigure the "buffer arrays" by bypassing the "faulty" ones. A shifter clocked by the
falling edge of the clock (assume the PEs are clocked by the rising edge) is used to set up the
proper configuration of the buffers for the next data movement. The basic rule is:
(1) Include one more buffer if the downstream PE is in a computation cycle with checking
while the upstream one is not;
(2) Bypass one more buffer if the upstream PE is in a computation cycle with checking while
the downstream one is not;
(3) Maintain the current configuration (by disabling the clock input of the shifter) if the two
PEs are both checking or non-checking.
Because of the regular pattern of wait time variation (Fig. 4(b)), the reconfiguration circuit
is very simple, as shown in Fig. 6. An example showing the correct buffering at each clock cycle
by using the reconfiguration approach is given in Fig. 7.
Figure 6. Dynamic reconfiguration circuit for buffering using Diogenes approach
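Rules (1) to (3) of the dynamic reconfiguration scheme amount to an up/down counter on the number of buffers currently in the data path. The following behavioral sketch is an assumption-laden illustration, not the circuit of Fig. 6: it evaluates the rules once per step on a pair of checking bits, takes k = 1, starts with no buffer in the path, and uses function names invented here.

```python
def next_buffer_count(active, upstream_checking, downstream_checking):
    """Apply one reconfiguration step to the number of active buffers."""
    if downstream_checking and not upstream_checking:
        return active + 1  # rule (1): include one more buffer
    if upstream_checking and not downstream_checking:
        if active == 0:
            raise ValueError("no buffer left to bypass")
        return active - 1  # rule (2): bypass one more buffer
    return active  # rule (3): keep the current configuration


def trace(upstream_bits, downstream_bits, active=0):
    """Buffer count after each step for two sequences of checking bits."""
    history = []
    for up, down in zip(upstream_bits, downstream_bits):
        active = next_buffer_count(active, bool(up), bool(down))
        history.append(active)
    return history


if __name__ == "__main__":
    # Hypothetical bit sequences: the downstream PE checks first, then the upstream PE.
    print(trace([0, 0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0, 0]))
```

In this model the count rises by one per step while only the downstream PE is checking and falls by one per step while only the upstream PE is checking, so it stays within the range of buffers provisioned between the pair as long as the checking patterns are scheduled as described.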
4.2. Diagnosis
Because permanent and transient faults occurring at different computation cycles will result
in different combinations of computation and input error indications after various length of la-
tency, it is important for diagnosis to analyze all possible error patterns and classify the faults
into several categories according to their resultant error patterns.
Since the error escape probability is related to EDL in terms of computation cycles, and in
order to make the diagnosis procedure independent of the checking overhead k, we will use the
task/PE diagram (Fig. 3) and the error indications from CCCs for error pattern analysis. How-
ever, for a PACED array, the CCCs do not happen at the same clock cycle. It would be unac-
ceptable if we have to wait for all the CCCs to finish before the analysis because that will delay
the diagnosis and rollback by a considerable number of clock cycles. The following proof gives
the upper bound for the number of error indications by any set of CCCs for 2 ≤ J ≤ N. (The case
when J = 1 will result in peculiar error patterns which can not share the same diagnosis and
rollback procedures with other choices of J. Since J = 1 also results in large EDL, it will be excluded
from future discussion.)
Figure 7. Example of dynamic reconfiguration for buffering. The parameters are N = 4,
OM = 3 and k = 1. When Ci is 1, the corresponding "faulty" buffer is bypassed.
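THEOREM 5. For 2 ≤ J ≤ N, at most two PEs will have the earliest error indications
among all the CCCs for a single fault, and they must be adjacent.
Proof. Since a transient fault can only create one erroneous task, only one PE will detect it.
For a permanent fault occurring at cycle j, N ≤ j < M, the faulty PE itself will detect a
computation error after M−j cycles and the erroneous task produced at cycle j + q, 0 ≤ q < M−j, will be
detected $\lceil (M-(j+q))/J \rceil + q$ cycles after the fault occurrence as an input error by the
$\lceil (M-(j+q))/J \rceil$th downstream PE from the faulty PE. For J ≥ 2 and q ≥ 2, we have
$$
\left\lceil \frac{M-(j+q)}{J} \right\rceil + q
= \left\lceil \frac{M-j+(J-1)q}{J} \right\rceil
\ge \left\lceil \frac{M-j}{J} + \frac{1}{2} \times 2 \right\rceil
= \left\lceil \frac{M-j}{J} \right\rceil + 1
$$
which is larger than the EDL(j) = $\lceil (M-j)/J \rceil$. Hence, the only erroneous tasks which will possibly
be detected as the earliest input error indications are the ones produced at cycle j and j+1.
If M−j = 1, the two earliest error indications with EDL(M−1) = 1 are the computation error
detected by the faulty PE and the input error detected by the immediate downstream PE. If
M−j > 1, the computation error is not among the earliest indications because
$M - j > \lceil (M-j)/J \rceil$ for J ≥ 2, and the input errors caused by the tasks produced at cycles j and
j+1 are detected by the $\lceil (M-j)/J \rceil$th and $\lceil (M-j-1)/J \rceil$th downstream PEs, which are
either the same PE or adjacent PEs. □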
Therefore, in order to design a diagnosis procedure, we will always keep the pipeline flow-
ing until the immediate downstream PE finishes the CCC once the first error indication is raised
by some PE (called the detective PE). The number of clock cycles that the detective PE has to
wait is equal to the wait time of the current cycle because when the downstream PE is ready to
process the data corresponding to the task resulting in the first error indication, it must have
finished processing the previous task at the CCC and setup the error flags.
The next step is to classify all the faults according to their resultant error patterns. The no-
tation is defined as: Class a.b where a = 1 means transient, a = 2 means permanent and b is the
further classification within each category. The corresponding error patterns are shown in Fig. 8.
"P" stands for permanent fault, "T" for transient fault and "F" can be either "P" or "T". "I"
indicates an input error and "C" represents a computation error.
(1) Class 1.1 and 2.1: For both permanent and transient faults, Fig. 8(a) represents the case
where a computation error indication occurs at computation cycle j, 1 ≤ j ≤ N-1. The fault
must have just occurred at the detective PE; otherwise, it would have been detected at an
earlier cycle with checking. In order to distinguish between permanent and transient faults, the
faulty PE is given a second chance to do the recomputation. If the recomputation still sets
an error flag, the fault is permanent under the assumption that a transient fault never affects
more than one computation; otherwise, it is transient.
A more complicated situation occurs when the computation error flag is raised at cycle 0
because it is possible that the fault is a permanent one which occurred during previous
non-checking cycles. However, the proof of Lemma 4 shows that the input error will be
detected no later than the computation error for such faults and the two kinds of error will
be in the CCCs if and only if such faults occurred at cycle M-1. Therefore, if there is no
input error indication in the immediate downstream PE, as in Fig. 8(b), the fault must have
just occurred and the faulty PE is given a second chance.
(2) Class 1.2: A transient fault occurring at cycle j with N ≤ j ≤ M-1 will be detected as an
input error by the downstream PE which is EDL(j) stages away from the error source PE
(Fig. 8(c)).
(3) Class 2.2: A computation error detected at cycle 0 and an input error detected by the
immediate downstream PE at CCC indicates the fault is permanent and occurred at cycle
M-1 in the upstream PE (Fig. 8(d)).
(4) Class 2.3: A permanent fault at cycle j with EDL(j) = EDL(j+1), N ≤ j ≤ M-2 will be
detected by a single downstream PE as an input error (Fig. 8(e)).
(5) Class 2.4: A permanent fault at cycle j with EDL(j) = EDL(j+1) + 1, N ≤ j ≤ M-2 will be
detected by two downstream PEs as input errors (Fig. 8(f)).
Figure 8. Error pattern analysis, panels (a)-(f). (The thick line passes through all CCCs which are
related to the detection of the present fault.)
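The case analysis above can be summarized as a small decision procedure. The following is a minimal sketch, not the authors' hardware logic: the function names, the argument encoding (a checking-cycle index or None, a flag for the immediate downstream PE, and a count of PEs raising input errors) and the returned class labels are all assumptions made for illustration.

```python
from typing import Optional


def classify_error_pattern(comp_error_cycle: Optional[int],
                           immediate_input_error: bool,
                           input_error_count: int,
                           n_check: int) -> str:
    """Map observed error flags to the fault classes of the text.

    comp_error_cycle: checking cycle (0..n_check-1) at which a computation
        error flag was raised, or None if no computation error was seen.
    immediate_input_error: True if the immediate downstream PE raised an
        input error flag.
    input_error_count: number of downstream PEs that raised input errors.
    n_check: length N of the checking burst (hypothetical parameter name).
    """
    if comp_error_cycle is not None:
        if not 0 <= comp_error_cycle < n_check:
            raise ValueError("computation errors are only flagged in checking cycles")
        # Computation error at cycle 0 together with an input error at the
        # immediate downstream PE: permanent fault that occurred at cycle M-1.
        if comp_error_cycle == 0 and immediate_input_error:
            return "2.2"
        # Otherwise the fault has just occurred; a second chance decides
        # between permanent (2.1) and transient (1.1).
        return "1.1/2.1"
    if input_error_count == 1:
        return "1.2/2.3"  # needs further diagnosis with the recovery cache
    if input_error_count == 2:
        return "2.4"      # permanent fault seen by two downstream PEs
    raise ValueError("pattern not covered by the single-fault classes")


def second_chance(recomputation_flags_error: bool) -> str:
    """A transient fault is assumed to affect at most one computation."""
    return "permanent" if recomputation_flags_error else "transient"
```

For example, `classify_error_pattern(0, True, 1, n_check=4)` returns "2.2", while an input error reported by a single downstream PE with no computation error returns "1.2/2.3" and is passed on to the cache-based diagnosis.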
Among the classes of faults defined in the previous paragraph, Class 1.1, 2.1 and 2.2 are
successfully identified by the error pattern analysis and Class 1.2, 2.3 and 2.4 need further diag-
nosis. The basic idea is to design a recovery cache for each PE for storing recent input data.
When an input error is detected by some PE, each upstream PE suspected of producing the error
reads from its recovery cache the input corresponding to the erroneous task and uses it as test in-
put to perform recomputation with checking. The computation and input error flags resulting
from these recomputations are used as syndromes and will uniquely identify the faulty PE and
cycle under the assumption of a single fault. Therefore, the diagnosis takes only one computation
cycle with checking. Again, for the regularity and simplicity of the diagnosis, the following rules
are adopted:
(1) Although it is possible to calculate the exact number of suspects, which is less than or equal
to the maximum EDL, for each input error detected we will always use the maximum EDL as
the number of suspects. Because the diagnosis procedure for each PE is done in parallel,
this does not increase the time overhead and allows regular hardware connection.
(2) For the faults in Class 2.4, we will ignore the first input error and only use the second error
indication for further diagnosis because the second one corresponds to the erroneous task
produced earlier.
The success of the above simple diagnosis procedure depends on the capability of each PE
to retrieve the correct data from the recovery cache. Because the CCCs are skewed in a PACED
array, it is very difficult to determine in which location of the cache the required data resides.
The design of the direct-mapped recovery cache is aimed at simplifying the searching procedure.
Similar to the direct-mapped cache design in the memory hierarchy for general-purpose
computers, where each position of the cache can only hold data from certain addresses with
identical least significant bits, each position i of our direct-mapped recovery cache can only
hold data for those tasks with id number n such that n mod (cache size) = i. Consequently, as
long as we have a recovery cache of sufficient size, i.e. larger than the maximum EDL, so that
no data will have been overwritten by the data from later tasks when it is needed for diagnosis,
every suspected PE only has to read the test input from the same position as that in the
detective PE and the hardware connection is simplified.
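The mapping rule can be illustrated with a short software model. This is a sketch under stated assumptions: the class name and methods are hypothetical, and storing the task id next to the data is added here only so the example can show when an entry has been overwritten; the text does not say the hardware keeps such a tag.

```python
class DirectMappedRecoveryCache:
    """Software model of a per-PE recovery cache (illustrative only)."""

    def __init__(self, size: int):
        self.size = size
        # Each slot holds (task_id, data) or None while still empty.
        self.slots = [None] * size

    def store(self, task_id: int, data) -> None:
        # Position i only ever holds tasks n with n mod size == i.
        self.slots[task_id % self.size] = (task_id, data)

    def load(self, task_id: int):
        # Returns the stored input, or None if a later task overwrote it.
        entry = self.slots[task_id % self.size]
        if entry is None or entry[0] != task_id:
            return None
        return entry[1]


# Hypothetical numbers: maximum EDL of 3, so a cache of size 4 is sufficient.
cache = DirectMappedRecoveryCache(size=4)
for task in range(7):                 # tasks 0..6 have passed through the PE
    cache.store(task, f"input-{task}")

recent = cache.load(3)                # 3 tasks back: still available
stale = cache.load(2)                 # 4 tasks back: overwritten by task 6
```

Because every PE uses the same rule, a suspected PE and the detective PE find the input of a given task at the same position, which is the property the text relies on to keep the hardware connection regular.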
Start-up control is a mechanism to set up the cache correctly once and for all when the
pipeline starts flowing, so that whenever new data has to be placed into the cache, it is put into
the next position or, when reaching the end, the first position. Fig. 9 shows how the start-up control
works. The start-up delay for each PE is computed by accumulating the wait times. Each PE can
only start reading in the data after the start-up delay. Therefore, the first data each PE places in
the recovery cache will be for task 0 and later data can simply follow on top. The start-up control
is also utilized for rollback which is discussed next.
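As a rough illustration of the accumulation step, the sketch below computes start-up delays as running sums and shows that sequential placement with wrap-around agrees with the direct-mapped rule once the first stored item is task 0. The wait-time values and function names are assumptions for illustration; the actual wait times follow from the checking schedule defined earlier in the paper.

```python
from itertools import accumulate


def start_up_delays(wait_times):
    """wait_times[k] is the (hypothetical) wait of PE k+1 relative to PE k.

    Returns one delay per PE, with PE 0 starting immediately.
    """
    return [0] + list(accumulate(wait_times))


def sequential_positions(num_tasks, cache_size):
    """Positions used when each new item goes to the next slot, wrapping at the end."""
    positions, pos = [], 0
    for _ in range(num_tasks):
        positions.append(pos)
        pos = (pos + 1) % cache_size
    return positions


delays = start_up_delays([2, 1, 2, 1])   # five PEs, made-up wait times
same_mapping = sequential_positions(10, 4) == [t % 4 for t in range(10)]
```

With these made-up wait times the delays are 0, 2, 3, 5 and 6 cycles, and `same_mapping` is true: starting every cache at task 0 is what lets a simple wrap-around pointer implement the "task id mod cache size" placement.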
Figure 9. Start-up control (SD: start-up delay). [Timing diagram: task numbers processed by
PE 0 to PE 4 over clock cycles, with start-up delays SD1 to SD4.]
4.3. Rollback
Once the faulty PE and faulty cycle are identified, if the fault is permanent, a spare PE is
brought in to replace the faulty one, and then rollback recovery starts after the reconfiguration; if
the fault is transient, rollback directly follows diagnosis, or actually, overlaps with it because the
recomputation in diagnosis can be used as the first step in rollback.
The rollback procedure can be divided into two steps: flushing and local recovery. Similar
to the simplification in diagnosis, since the rollback is done in parallel for each PE, we will use
the same procedure for the recovery of both permanent and transient faults even if the latter
results in fewer erroneous data.
(1) Flushing: First, we define the erroneous block to consist of all the following data: (1) data
inside the PEs and buffers between error source PE and detective PE; (2) data inside the
recovery cache between these two PEs, from the position containing the data corresponding
to the erroneous task up to the most recent position. The region enclosed by dashed lines in
Fig. 10(a) shows the erroneous block for the case where a transient fault occurred in PE1
when task number 7 was being processed and is detected by PE4 as an input error. The first
step of rollback is to flush all the data in the erroneous block as shown in Fig. 10(b).
(2) Local recovery: Since the fault only affects the PEs between the error source PE and the
detective PE inclusive (called the local recovery set), the portion of the pipeline containing
all the other PEs is frozen during the rollback. The local recovery line is defined to consist
of all the PEs in the local recovery set with the erroneous task number. The local recovery
scheme is to apply the start-up control to the local recovery line by viewing the local
recovery set as a short pipeline, the erroneous task as the first task and the data in the
recovery cache of the error source PE, which is correct and thus not flushed, as the input
data. The erroneous block is rebuilt as shown in Fig. 10(c)-(h), after which all PEs proceed
as before (Fig. 10(i)).
Figure 10. Local recovery procedure: (a) fault occurrence and error detection; (b) data
flushing; (c)-(h) local recovery using start-up control; (i) resumption of normal
processing. [Panels (a) to (i) show PE 0 to PE 5 at clock cycles i to i+8.]
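The bookkeeping behind the two rollback steps can be sketched as follows. This is one reading of the text, not the authors' control logic: the function names are hypothetical, PEs are identified by their index along the pipeline, and it is assumed that the caches to flush are those of the PEs after the error source up to and including the detective PE, since the source PE's own cache is stated to be correct.

```python
def local_recovery_set(source_pe: int, detective_pe: int):
    """PEs between the error source and the detective PE, inclusive."""
    if source_pe > detective_pe:
        raise ValueError("the detective PE must be at or downstream of the source")
    return list(range(source_pe, detective_pe + 1))


def caches_to_flush(source_pe: int, detective_pe: int):
    """Recovery caches holding erroneous inputs (assumed reading of the text).

    The source PE's cache holds correct input data and is kept as the
    starting point for local recovery.
    """
    return list(range(source_pe + 1, detective_pe + 1))


def flushed_positions(erroneous_task: int, most_recent_task: int, cache_size: int):
    """Direct-mapped positions from the erroneous task up to the most recent one."""
    return [t % cache_size for t in range(erroneous_task, most_recent_task + 1)]


# Scenario of Fig. 10: fault in PE1 while processing task 7, detected by PE4.
recovery_set = local_recovery_set(1, 4)
flush_caches = caches_to_flush(1, 4)
# Hypothetical cache of size 8 whose most recent entry is task 9.
positions = flushed_positions(7, 9, 8)
```

All PEs outside `recovery_set` stay frozen; start-up control is then applied to the PEs in `recovery_set` as if they formed a short pipeline whose first task is task 7.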
4.4. Summary of Design Changes
A PACED array equipped with the proposed design changes was shown in Fig. 1. Error
checking circuits are built into each PE for time-redundant computation checking and input code
checking. A code checker is added at the end of the array as discussed in Section 3.2. With
optimal scheduling of checking patterns and 100% overhead time-redundant checking (k=1), N-1
buffers are inserted between each adjacent PE pair and a recovery cache of sufficient size, i.e.
larger than the maximum EDL, is attached to each PE for storing incoming data from the top and
the left. The control logic is responsible for correct data buffering, start-up control, error
diagnosis and local recovery. The techniques described in this paper have been simulated on an
Alliant multiprocessor with eight processors to show their correct operation.
5. CONCLUSIONS
It was shown that, for a PACED array with the period of checking pattern equal to M
computation cycles, the length of checking burst equal to N computation cycles and fixed throughput
(determined by the checking frequency N/M), the optimal scheduling in terms of minimizing the
maximum error detection latency and error escape probability is achieved by setting the
checking pattern offset OM to N - 1. Also, by choosing M and N to be relatively prime, the hardware
overhead for data buffering is minimized and the checking capability is uniformly distributed
among the tasks.
Dynamic buffering techniques to preserve the systolic nature and the implementation for
rollback recovery under faults were presented. It was shown that the complexity in the diagnosis
and recovery process, resulting from the error latency as a trade-off for performance, can be re-
duced through the use of direct-mapped recovery cache and start-up control. The design flexibil-
ity can be further improved by using a programmable control unit.
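APPENDIX
Proof of Lemma 2.
$$
\begin{aligned}
(j + i \times J) \bmod M &= (j + (i \times J) \bmod M) \bmod M \\
&= (j + ((i \bmod M) \times J) \bmod M) \bmod M \\
&\in \{ (j + nd) \bmod M \mid 0 \le n \le M/d - 1 \} \quad \text{by Lemma 1.}
\end{aligned}
$$
Also, $\{ (j + i \times J) \bmod M \mid 0 \le i \le M/d - 1 \} = \{ (j + nd) \bmod M \mid 0 \le n \le M/d - 1 \}$, and
$(j + i \times J) \bmod M$, $0 \le i \le M - 1$, contain $d$ copies of each element in the set on the right hand side.
For $0 \le n \le M/d - \lfloor j/d \rfloor - 1$,
$$
\begin{aligned}
&\Rightarrow 0 \le j + nd \le M + j \bmod d - d < M \\
&\Rightarrow (j + nd) \bmod M = j + nd = j \bmod d + (n + \lfloor j/d \rfloor) d = j \bmod d + md, \quad \lfloor j/d \rfloor \le m \le M/d - 1.
\end{aligned}
$$
For $M/d - \lfloor j/d \rfloor \le n \le M/d - 1$,
$$
\begin{aligned}
&\Rightarrow M + j \bmod d \le j + nd \le M + j - d < 2M \\
&\Rightarrow (j + nd) \bmod M = j + nd - M = j \bmod d + (n + \lfloor j/d \rfloor - M/d) d = j \bmod d + md, \quad 0 \le m \le \lfloor j/d \rfloor - 1.
\end{aligned}
$$
Therefore,
$$
\{ (j + nd) \bmod M \mid 0 \le n \le M/d - 1 \} = \{ (j \bmod d + md) \mid 0 \le m \le M/d - 1 \}.
$$
Finally, we have $(j + i \times J) \bmod M \in R_{j \bmod d}$ and $\{ (j + i \times J) \bmod M \mid 0 \le i \le M - 1 \} = R_{j \bmod d}$. □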
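The statement proved above can be checked numerically for small parameters. The sketch below assumes that d denotes gcd(J, M) and that 0 ≤ j < M, which matches how d is used in the proof but is not restated in this appendix; the function names are hypothetical.

```python
from collections import Counter
from math import gcd


def residue_class(j: int, d: int, M: int):
    """The set R_{j mod d} = { j mod d + m*d : 0 <= m <= M/d - 1 }."""
    return {j % d + m * d for m in range(M // d)}


def check_lemma2(j: int, J: int, M: int) -> bool:
    """Verify that (j + i*J) mod M, 0 <= i <= M-1, covers R_{j mod d} with d copies each."""
    d = gcd(J, M)  # assumption: d is the greatest common divisor of J and M
    visited = Counter((j + i * J) % M for i in range(M))
    return (set(visited) == residue_class(j, d, M)
            and all(count == d for count in visited.values()))


# Exhaustive check over small parameter values.
all_hold = all(check_lemma2(j, J, M)
               for M in range(1, 13)
               for J in range(1, 13)
               for j in range(M))
```

For instance, with M = 6, J = 4 and j = 1, d is 2 and the visited values are 1, 5, 3, 1, 5, 3, i.e. the class {1, 3, 5} with two copies of each element. When J and M are relatively prime, d is 1 and every value from 0 to M-1 is visited exactly once.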
REFERENCES
[1] W. Moore, A. McCabe and R. Urquhart, (eds.) Systolic Arrays, Adam Hilger, 1987.
[2] J. A. Abraham, P. Banerjee, C.-Y. Chen, W. K. Fuchs, S. Y. Kuo and A. L. N. Reddy, "Fault
tolerance techniques for systolic arrays", IEEE Computer, Vol. 20, No. 7, July 1987, pp. 65-75.
[3] P. P. Chen, A. N. Mourad and W. K. Fuchs, "Confidence in processor array outputs under
periodic application of concurrent error detection", IEEE Workshop on Defect and Fault
Tolerance in VLSI Systems, Nov. 1990.
[4] W. T. Cheng and J. H. Patel, "Concurrent error detection in iterative logic arrays", Proc.
14th IEEE International Conference on Fault-Tolerant Computing, 1984, pp. 286-291.
[5] D. A. Reynolds and G. Metze, "Fault detection capabilities of alternating logic", IEEE
Trans. Computers, Vol. C-27, No. 12, Dec. 1978, pp. 1093-1098.
[6] J-C Fabre, Y. Deswarte, J-C Laprie and D. Powell, "Saturation: reduced idleness for
improved fault-tolerance", Proc. 18th IEEE Fault Tol. Comp. Symp., 1988, pp. 200-205.
[7] G. S. Sohi, M. Franklin and K. K. Saluja, "A study of time-redundant fault tolerance
techniques for high-performance pipelined computers", Proc. 19th IEEE Fault Tol. Comp.
Symp., 1989, pp. 436-443.
[8] M. A. Schuette and J. P. Shen, "Exploiting instruction-level resource parallelism for
transparent control-flow monitoring", Research Report No. CMUCAD-90-42, Dec. 1990,
Carnegie-Mellon University.
[9] Y. H. Choi, S. H. Han and M. Malek, "Fault diagnosis of reconfigurable systolic arrays",
Proc. IEEE International Conference on Computer Design, Port Chester, NY, Oct. 1984,
pp. 451-455.
[10] Y. H. Choi and M. Malek, "A fault-tolerant systolic sorter", IEEE Trans. on Computers,
Vol. 37, No. 5, May 1988, pp. 621-624.
[11] S. W. Chan and C. L. Wey, "The design of concurrent error diagnosable systolic arrays for
band matrix multiplications", IEEE Trans. on Computer-Aided Design, Vol. 7, No. 1, Jan.
1988, pp. 21-37.
[12] R. J. Cosentino, "Concurrent error correction in systolic architectures", IEEE Trans. on
Computer-Aided Design, Vol. 7, No. 1, Jan. 1988, pp. 117-125.
[13] H. T. Kung and M. S. Lam, "Wafer-scale integration and two-level pipelined
implementations of systolic arrays", J. of Parallel and Distributed Computing, Vol. 1, 1984, pp. 32-63.
[14] S. Y. Kung, VLSI Array Processors, Prentice Hall, Englewood Cliffs, 1988.
[15] R. L. Graham, D. E. Knuth and O. Patashnik, Concrete Mathematics, Addison-Wesley
Publishing, New York, 1989.
[16] A. L. Rosenberg, "The Diogenes approach to testable fault-tolerant arrays of processors",
IEEE Trans. on Computers, Vol. C-32, No. 10, Oct. 1983, pp. 902-910.
[17] E. S. Manolakos, "Transient fault recovery techniques for the VLSI processor arrays",
Ph.D. Dissertation, University of Southern California, May 1989.

