Confidence in Processor Array Outputs Under Periodic Application of Concurrent Error Detection by Chen, Paul P. et al.
June 1993 UILU-ENG-93-2222
CRHC-93-12
Center for Reliable and High-Performance Computing
CONFIDENCE IN 
PROCESSOR ARRAY OUTPUTS 
UNDER PERIODIC 
APPLICATION OF 
CONCURRENT
ERROR DETECTION
P. P. Chen, A. N. Mourad, and W. Kent Fuchs
Coordinated Science Laboratory 
College of Engineering
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
Approved for Public Release. Distribution Unlimited.
UNCLASSIFIED
SECURITY CLASSIFICATION OF THIS PAGE
REPORT DOCUMENTATION PAGE
1a. REPORT SECURITY CLASSIFICATION
Unclassified
1b. RESTRICTIVE MARKINGS 
None
2a. SECURITY CLASSIFICATION AUTHORITY
2b. DECLASSIFICATION/DOWNGRADING SCHEDULE
3. DISTRIBUTION/AVAILABILITY OF REPORT 
Approved for public release; 
distribution unlimited
4. PERFORMING ORGANIZATION REPORT NUMBER(S)
UILU-ENG-93-2222 CRHC-93-12
5. MONITORING ORGANIZATION REPORT NUMBER(S)
6a. NAME OF PERFORMING ORGANIZATION 
Coordinated Science Lab
University of Illinois
6b. OFFICE SYMBOL 
(If applicable)
N/A
7a. NAME OF MONITORING ORGANIZATION
Office of Naval Research
National Aeronautics and Space Admin.
6c. ADDRESS (City, State, and ZIP Code)
1101 W. Springfield Avenue
Urbana, IL 61801
7b. ADDRESS (City, State, and ZIP Code)
Arlington, VA
Moffett Field, CA
8a. NAME OF FUNDING/SPONSORING
ORGANIZATION 7a
8b. OFFICE SYMBOL 
(If applicable)
9. PROCUREMENT INSTRUMENT IDENTIFICATION NUMBER
8c ADDRESS (City, State, and ZIP Code)
7b
10. SOURCE OF FUNDING NUMBERS
PROGRAM 
ELEMENT NO.
PROJECT
NO.
TASK
NO.
WORK UNIT 
ACCESSION NO.
11. TITLE (Include Security Classification)
Confidence in Processor Array Outputs Under Periodic Application of Concurrent Error Detection
12. PERSONAL AUTHOR(S) Chen, P. P., A. N. Mourad, and W. Kent Fuchs
13a. TYPE OF REPORT 
Technical
13b. TIME COVERED 
FROM__________TO
14. DATE OF REPORT (Year, Month, Day)
1993 June
15. PAGE COUNT
16. SUPPLEMENTARY NOTATION
17. COSATI CODES
FIELD GROUP SUB-GROUP
18. SUBJECT TERMS (Continue on reverse if necessary and identify by block number)
fault tolerance, concurrent error detection, processor
arrays, confidence in outputs, error coverage
19. ABSTRACT (Continue on reverse if necessary and identify by block number)
Processor arrays, featuring modularity, regular interconnection and high parallelism, are well suited for
VLSI/WSI implementations and specific applications with high computational requirements. Preserving
data integrity can be important for certain applications; error detection is one aspect of fault tolerance.
Concurrent error detection (CED) techniques, which check normal system operations, may detect transient
and intermittent faults with greater probability than off-line testing methods.
This paper describes the Periodic Application of CED (PACED) technique which
allows the performance costs incurred through the use of time-redundant CED in processor array
architectures to be reduced. The application of CED is varied in both time and space to provide probabilistic
detection of errors in processor arrays. The confidence to place on the outputs of processor arrays using
PACED is studied. Formulae are derived that predict, upon error detection, the amount of output to suspect
as possibly erroneous, for single processors, linear arrays, and two-dimensional mesh processor arrays. The
results indicate that the error coverage can be surprisingly high when PACED is applied in processor arrays.
20. DISTRIBUTION/AVAILABILITY OF ABSTRACT
UNCLASSIFIED/UNLIMITED □ SAME AS RPT. □ DTIC USERS
21. ABSTRACT SECURITY CLASSIFICATION
Unclassified
22a. NAME OF RESPONSIBLE INDIVIDUAL 22b. TELEPHONE (Include Area Code) 22c. OFFICE SYMBOL
DD FORM 1473,84 MAR
All other editions are obsolete.
SECURITY CLASSIFICATION OF THIS PAGE 
UNCLASSIFIED
CONFIDENCE IN PROCESSOR ARRAY OUTPUTS 
UNDER PERIODIC APPLICATION OF CONCURRENT ERROR DETECTION
Paul P. Chen, Antoine N. Mourad, and W. Kent Fuchs
Center for Reliable and High-Performance Computing 
Coordinated Science Laboratory 
University of Illinois at Urbana-Champaign 
1308 West Main St 
Urbana, IL 61801
Contact: W. Kent Fuchs
Phone: 217/333-8294
FAX: 217/244-5686
Email: fuchs@crhc.uiuc.edu
Key words — Fault tolerance, Concurrent error detection, Processor arrays, Error analysis 
Reader Aids —
Purpose: Present an analysis
Special math needed for explanations: Probability theory 
Special math needed to use results: None
Results useful to: Fault-tolerant system designers, Processor array designers & users
Summary & Conclusions — Processor arrays, featuring modularity, regular interconnection 
and high parallelism, are well suited for VLSI/WSI implementations and specific applications 
with high computational requirements. Error detection and recovery can be important for certain 
applications of processor arrays. Concurrent error detection (CED) techniques, which check
normal system operations, have been designed to detect errors caused by transient and intermittent
faults. However, CED techniques typically suffer from costly hardware penalties or performance
costs.
This paper describes the Periodic Application of Concurrent Error Detection (PACED) 
technique which allows the performance costs incurred through the use of time-redundant CED 
in processor array architectures to be reduced. The application of CED is varied in both time and 
space to provide probabilistic detection of errors in processor arrays. Formulae are derived that 
predict, upon error detection, the amount of output to suspect as possibly erroneous, for single 
processors, linear arrays, and two-dimensional mesh processor arrays. The results indicate that 
the error coverage can be surprisingly high when PACED is applied in processor arrays, e.g., 
95% for checking performed 50% of the time.
This research was supported in part by the SDIO/IST and managed by the Office of Naval Research under contract N00014-89-K-0070, in
part by the National Aeronautics and Space Administration (NASA) under Contract NAG 1-613, and in part by the Department of the Navy and
managed by the Office of the Chief of Naval Research under Grant N00014-91-J-1283.
1. INTRODUCTION
Acronyms
CED concurrent error detection
PACED periodic application of concurrent error detection
PE processing element (of an array)
Notation
M period of CED application
N duration of CED application
O checking offset
CS_{M,N} checking sequence array
Preserving data integrity in processor arrays that feature modularity, regular interconnection,
and high parallelism, can be important for certain applications; error detection is one aspect
of fault tolerance. Concurrent error detection (CED) techniques, which check normal system
operations, may detect transient and intermittent faults with greater probability than off-line
testing methods. Techniques such as rollback, instruction retry, and roll forward can be combined
with CED for error recovery.
Use of time-redundant CED techniques can reduce the hardware overhead of error detection,
but may degrade system performance. Periodic application of CED (PACED), as described
in this paper, can reduce the performance degradation incurred through the use of time-redundant
CED in processor array architectures [2].
Without continuous checking, undetected errors may occur. In this paper, the confidence to
place on the outputs of a single processor using PACED is studied; formulae are derived that
predict, upon error detection, the amount of output to suspect as possibly erroneous. In linear and
two-dimensional mesh processor arrays, if detectable errors propagate, then the amount of output
to suspect can be limited. The estimated PACED error coverages for the single processor and
array architectures are also studied.
When a single processor uses PACED, it can be parameterized by M, the period of CED
application, and N, the duration of CED application (0 < N ≤ M). M and N govern the time
distribution of CED: in any period of M computation cycles, N cycles are checked and M − N
cycles are unchecked. With small N/M, less performance degradation can usually be expected,
but small N/M also reduces the probability of error detection.
When PACED is applied to PEs of a processor array, M and N may vary at each PE in the
array. The PE checking offset O determines the initialization of each PE's first M-cycle period;
varying O at each PE in an array can create patterns of checking in the array. The checking
sequence, CS_{M,N}, is defined as an array of M values:
$$CS_{M,N}[r] = 1, \quad \text{for } 0 \le r \le N - 1,$$
$$CS_{M,N}[r] = 0, \quad \text{for } N \le r \le M - 1.$$
EXAMPLE 1.1: The checking sequence for M = 5 and N = 2 is CS_{5,2} = (1, 1, 0, 0, 0). □
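The definition of the checking sequence can be sketched in a few lines of Python. The function name `checking_sequence` is an illustrative choice, not something taken from the report; the sketch only restates the definition above (N checked cycles followed by M − N unchecked cycles).

```python
def checking_sequence(M, N):
    """Return CS_{M,N} as a tuple: 1 for the N checked cycles, 0 for the rest."""
    if not 0 < N <= M:
        raise ValueError("expected 0 < N <= M")
    return tuple(1 if r < N else 0 for r in range(M))


cs_5_2 = checking_sequence(5, 2)
print(cs_5_2)  # (1, 1, 0, 0, 0), the sequence of Example 1.1
```

The ratio `sum(cs) / len(cs)` is the checking ratio N/M used throughout the analysis.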
2. PACED IN A SINGLE PROCESSOR
2.1. Error Arrival Model 
Notation
C confidence to place on processor outputs 
K  length of "cluster interval"
L length of "undetected-errors interval"
q Pr{particular CED technique detects error | error exists}
λ mean error arrival rate
This study concentrates on the correctness of outputs, and thus on errors. It is assumed that 
faults cause errors, that errors arrive in clusters or bursts [3], that error clusters follow a Poisson 
arrival process with a constant mean arrival rate, and that errors within clusters are also Poisson 
distributed. No assumptions are made concerning the types or distributions of the faults
themselves. The following example supports this assumption.
EXAMPLE 2.1: A two-phase hyperexponential function was fit to error arrival data from one
machine in a seven-unit VAXcluster system [4]. The density of the fitted distribution is
$$\hat{f}(t) = 0.88\,(0.829\,e^{-0.829t}) + 0.12\,(0.012\,e^{-0.012t}) \quad (t \text{ in minutes}).$$
The fit was tested using the chi-square test
and could not be rejected at the 0.28 significance level, with r² = 0.99997. □
The two-phase hyperexponential distribution function can be interpreted as exponentially- 
distributed errors (with parameter 0.829/min) arriving in infrequent, exponentially-distributed 
clusters (with parameter 0.012/min) [4].
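As a rough illustration of this interpretation, the fitted density can be evaluated directly. The sketch below assumes the density is the mixture 0.88(0.829 e^{−0.829t}) + 0.12(0.012 e^{−0.012t}) with t in minutes; the function and constant names are illustrative only.

```python
import math

# Parameters of the fitted distribution in Example 2.1 (rates in 1/min).
ALPHA, LAM1, LAM2 = 0.88, 0.829, 0.012


def hyperexp_pdf(t, alpha=ALPHA, lam1=LAM1, lam2=LAM2):
    """Two-phase hyperexponential density at time t (minutes)."""
    return (alpha * lam1 * math.exp(-lam1 * t)
            + (1 - alpha) * lam2 * math.exp(-lam2 * t))


def hyperexp_mean(alpha=ALPHA, lam1=LAM1, lam2=LAM2):
    """Mean interarrival time: weighted mix of the two exponential means."""
    return alpha / lam1 + (1 - alpha) / lam2


mean_gap = hyperexp_mean()
print(f"mean interarrival time = {mean_gap:.2f} min")  # about 11.06 min
```

Most gaps are short (the 0.829/min phase), while the rare long gaps from the 0.012/min phase dominate the mean.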
2.2. Confidence Analysis
When an error is detected at a processor using PACED, the current outputs of the processor 
should be suspected as possibly erroneous: the detected error may be part of an error cluster and 
other errors in the cluster may have gone undetected. The next two subsections determine, when 
an error is detected, two intervals of time during which outputs should be suspected as possibly 
erroneous: one interval prior, and one subsequent, to the detected error. (If, upon error detection,
specific action is taken either to eliminate the source of errors or to increase the amount of
checking performed, then the outputs produced subsequent to the error detection need not be
suspected.) These intervals are determined using two criteria, assuming that the detection of an error
is independent of whether any other error is detected. 1/ Cluster Intervals: Outputs produced in 
intervals which bound an error cluster are suspected. 2/ Undetected-Errors Intervals: Outputs 
produced in intervals from the time of the current detected error backward to the first undetected 
error and forward to the last undetected error are suspected.
The proofs for the theorems in this section use the following lemmas.
LEMMA 2.1: In a processor using PACED with 1 ≤ N ≤ M and where the CED technique
has detection probability q ≤ 1, detected and undetected error arrivals are each exponentially
distributed.
PROOF: See Appendix. □
LEMMA 2.2: The detected and undetected intracluster error Poisson arrival processes are 
independent.
PROOF: See Appendix. □
2.2.1. Fault-active intervals
When an error is detected, if no other errors were detected within K  time units either prior 
to or subsequent to the detection, then with probability C the error cluster is contained within 
these cluster intervals. Assuming clusters occur infrequently, then outputs produced earlier than 
K time units before, or later than K time units after, the detected error can be trusted with
confidence C (i.e., are correct with probability C). All outputs produced within K time units before or
after the detected error should be suspected as possibly erroneous: the outputs may be used but
the user should be aware that some of this output may be incorrect. Figure 2.1 illustrates the
K-length cluster intervals.
[Figure 2.1: time line centered on the detected error; outputs within K time units on either side are suspected, and outputs farther away are trusted.]
Figure 2.1. Cluster intervals of length K.
THEOREM 2.1: Let a processor use PACED with 1 ≤ N ≤ M and q ≤ 1. Upon error detection,
outputs produced either before K time units prior to, or after K time units subsequent to, the
detected error can be trusted with confidence C, where K satisfies:
$$K \ge -\frac{M}{N \lambda q} \ln(1 - C)$$
Outputs produced within K time units before or after the detected error should be suspected as
possibly erroneous, since the error cluster is contained in those intervals with probability C.
PROOF: See Appendix. □
For multiple detections, cluster intervals are taken around each detection with no significance
attached to overlaps. If a second error detection occurs within K of a first, then the following
outputs should be suspected: those produced within K before the first detection, those
produced between the two detections, and those produced within K after the second detection.
EXAMPLE 2.2: Let q = 1, N/M = 0.5, and C = 0.99. Using λ = 0.829 err/min (Example 2.1),
if an error is detected, then outputs generated earlier than 11.1 min prior to the detected error, or
later than 11.1 min after the detected error, can be trusted with a confidence of 0.99, provided no
other errors are detected within 11.1 min of the detected error. All outputs produced less than
11.1 min before or after the detected error should be suspected as possibly erroneous. □
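The arithmetic of Example 2.2 can be checked with a short sketch. It assumes the bound of Theorem 2.1 reads K ≥ −(M/(Nλq)) ln(1 − C), a reading that reproduces the 11.1 min quoted in the example; the function names are illustrative.

```python
import math


def cluster_interval(lam, n_over_m, q, confidence):
    """Smallest K (same time unit as 1/lam) meeting the assumed bound."""
    return -math.log(1.0 - confidence) / (lam * q * n_over_m)


def cluster_confidence(lam, n_over_m, q, K):
    """Inverse relation: confidence C reached by a cluster interval of length K."""
    return 1.0 - math.exp(-lam * q * n_over_m * K)


K = cluster_interval(lam=0.829, n_over_m=0.5, q=1.0, confidence=0.99)
print(round(K, 1))  # 11.1
```

Because only the product q·N/M enters the expression, halving the checking ratio has the same effect on K as halving the detection probability.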
Figure 2.2(a) shows C versus λ and K, given q = 1 and N/M = 0.5. With small λ, K needs to
be larger to achieve a given C. Small values of K can reach high confidence levels when λ is
larger: for λ ≥ 0.5 err/min, C ≥ 0.95 can be achieved with K ≥ 12 min with N/M = 0.5.
Figure 2.2(b) shows how C is affected either by N/M (with q = 1) or by q (with N/M = 1),
given λ = 0.829 err/min. (In the expression for K given in Theorem 2.1, setting q = 1 and varying
N/M from 0 to 1 is equivalent to setting N/M = 1 and varying q from 0 to 1.) Given K = 12 min,
N/M ≥ 0.3 (if q = 1) or q ≥ 0.3 (if N/M = 1) suffices to give C of about 0.95. This is an encouraging
result: if q = 1, designers can use non-continuous checking (the goal of PACED) and still achieve
high confidence in outputs produced near a detected error. If N/M = 1, then the precise value of q
is not critical, for large enough K: this is encouraging as well, as it may be difficult to estimate q
accurately.
Figure 2.2(a). Cluster intervals, C vs. λ,
N = 5, M = 10, q = 1.
Figure 2.2(b). Cluster intervals,
C vs. N (q = 1) or q (N/M = 1).
2.2.2. Undetected-errors intervals
Theorem 2.1 determined the length of cluster intervals during which the error cluster
probably existed. If, upon error detection, only those outputs since the first undetected error and until
the last undetected error are suspected, then the time intervals in which to suspect outputs will be
shorter. These intervals have length L. Figure 2.3 shows the relationship between K and L.
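[Figure 2.3: time line centered on the detected error, showing the undetected-errors intervals of length L, which run from the first undetected error to the last undetected error, inside the cluster intervals of length K; outputs outside the intervals are trusted.]
Figure 2.3. Undetected-errors intervals of length L.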
THEOREM 2.2: Let a processor use PACED with 1 ≤ N ≤ M and q ≤ 1. Upon error detection:
Case (a): Outputs produced prior to L time units before the detected error can be trusted
with confidence C; the interval extends from the detected error back to the probable time of the
first undetected error.
Case (b): Outputs produced subsequent to L time units after the detected error can be
trusted with confidence C; the interval extends from the detected error forward to the probable
time of the last undetected error.
In both cases, the length L satisfies:
$$L \ge -\frac{M}{N}\,\frac{1}{\lambda q}\,\ln\!\left(\frac{1 - C}{1 - q\,\frac{N}{M}}\right)$$
Outputs produced within L time units before or after the detected error should be suspected as
possibly erroneous.
PROOF: See Appendix. □
EXAMPLE 2.3: For the processor in Example 2.2 (q = 1, λ = 0.829 err/min, N/M = 0.5, C =
0.99), when an error is detected, the outputs generated prior to 9.4 min before, or subsequent to
9.4 min after, the detected error can be trusted with a confidence of 0.99, provided no other errors
are detected in those time intervals. All outputs produced less than 9.4 min before or after should
be suspected as possibly erroneous. In this example, use of L-length intervals reduces the amount
of output to suspect by about 15%. □
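The figures in Example 2.3 can be reproduced with the sketch below. It assumes the bound of Theorem 2.2 reads L ≥ −(M/(Nλq)) ln((1 − C)/(1 − qN/M)); this reading gives 9.4 min here and 17.1 min for N/M = 0.3, matching the values quoted in the examples. The function name is illustrative.

```python
import math


def undetected_errors_interval(lam, n_over_m, q, confidence):
    """Smallest L meeting the assumed bound of Theorem 2.2."""
    d = q * n_over_m  # probability that a given error is detected
    if not 0.0 < d < 1.0:
        raise ValueError("the expression needs 0 < q*N/M < 1")
    return -math.log((1.0 - confidence) / (1.0 - d)) / (lam * d)


L = undetected_errors_interval(lam=0.829, n_over_m=0.5, q=1.0, confidence=0.99)
K = -math.log(1.0 - 0.99) / (0.829 * 1.0 * 0.5)  # cluster interval, same parameters
print(round(L, 1), f"{1 - L / K:.0%}")  # 9.4 15%
```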
Plotting C versus L yields graphs similar in shape to Figure 2.2. However, for a given set of 
parameter values, a desired confidence level can be achieved with a value of L smaller than the 
necessary value of K.
2.3. Error Coverage
Notation
t_0 an error arrival time
x length of a time interval
G(z, x) generating function for the number of error arrivals
a_1, a_2, b, c parameters of G(z, x)
α, λ_1, λ_2 parameters of two-phase hyperexponential distribution
The error coverage of the PACED technique is the probability that an error will be 
detected. When an error is detected, the backward undetected-errors interval will, in effect, 
"cover" all the undetected errors in that interval by casting them under suspicion. Similarly, the 
forward undetected-errors interval will "cover" any future undetected errors. Hence, an error will 
not be covered only if no other errors are detected in the time intervals of length L before and 
after it.
The probability that an error will not be covered is the probability that the error itself goes 
undetected and that no other errors are detected in the L-length intervals before and after the 
error. This probability will be determined using the following generating function.
LEMMA 2.3: The generating function G(z, x), for the number of error arrivals from a two-phase
hyperexponential distribution with pdf of the form α λ_1 e^{−λ_1 t} + (1 − α) λ_2 e^{−λ_2 t}, in an interval
[t_0, t_0 + x] or [t_0 − x, t_0], given that there was an arrival at t_0, is given by:
$$G(z, x) = \frac{a_1 - b}{a_1 - a_2}\, e^{-a_1 x} + \frac{b - a_2}{a_1 - a_2}\, e^{-a_2 x}$$
where 0 ≤ z ≤ 1, −a_1 and −a_2 are the roots of the following quadratic equation in s:
$$s^2 + \bigl(\lambda_1 (1 - \alpha z) + \lambda_2 (1 - (1 - \alpha) z)\bigr)\, s + (1 - z)\,\lambda_1 \lambda_2 = 0,$$
b = (1 − α)λ_1 + αλ_2, and α, λ_1, and λ_2 are the parameters of the distribution (Example 2.1).
PROOF: See Appendix. □
Then,
$$\Pr\{\text{error not covered}\} = G\!\left(1 - q\frac{N}{M},\, L\right)^{2} \left(1 - q\frac{N}{M}\right)$$
Since Pr{error covered} = 1 − Pr{error not covered}, then:
$$\text{estimated error coverage} = 1 - G\!\left(1 - q\frac{N}{M},\, L\right)^{2} \left(1 - q\frac{N}{M}\right)$$
EXAMPLE 2.4: For a single processor using PACED, let q = 1, N/M = 0.3, C = 0.99, and the
distribution of Example 2.1 model error arrivals, viz., f(t) = 0.88(0.829 e^{−0.829t}) + 0.12(0.012
e^{−0.012t}). This gives L = 17.1 min and the estimated error coverage = 94.5%. Hence, with only
30% checking, almost 95% coverage can be achieved.
Figure 2.5 plots the estimated error coverage versus N/M when q = 1, M  = 10, and C = 0.99. 
The plot shows the coverage is > 95% for all values of N/M > 0.4. High coverage can be 
obtained with small N/M because, as the length L of the undetected-errors interval is long, many 
error arrivals would be expected to occur. Then, at least one of them would likely be detected, 
leading to the coverage of the error by casting suspicion on outputs produced at the time of the
error. □
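The coverage estimate of Example 2.4 can be reproduced numerically. The sketch below rests on two assumptions that are readings of the garbled formulas rather than confirmed statements of the report: G(z, x) = [(a_1 − b) e^{−a_1 x} + (b − a_2) e^{−a_2 x}] / (a_1 − a_2) with −a_1, −a_2 the roots of the quadratic in Lemma 2.3, and coverage = 1 − G(1 − qN/M, L)² (1 − qN/M), with L computed from the intracluster rate 0.829/min. Under these assumptions the result matches the quoted 94.5%.

```python
import math


def generating_function(z, x, alpha, lam1, lam2):
    """Assumed form of G(z, x) for two-phase hyperexponential interarrival times."""
    b = (1 - alpha) * lam1 + alpha * lam2
    p = lam1 * (1 - alpha * z) + lam2 * (1 - (1 - alpha) * z)
    c0 = (1 - z) * lam1 * lam2
    root = math.sqrt(p * p - 4 * c0)
    # a1, a2 are the negated roots of s^2 + p*s + c0 = 0 (both non-negative).
    a1, a2 = (p + root) / 2, (p - root) / 2
    return ((a1 - b) * math.exp(-a1 * x) + (b - a2) * math.exp(-a2 * x)) / (a1 - a2)


def estimated_coverage(n_over_m, q, confidence, alpha, lam1, lam2):
    """1 - Pr{error not covered}, using the L-length undetected-errors interval."""
    d = q * n_over_m
    L = -math.log((1 - confidence) / (1 - d)) / (lam1 * d)
    g = generating_function(1 - d, L, alpha, lam1, lam2)
    return 1 - g * g * (1 - d)


coverage = estimated_coverage(0.3, 1.0, 0.99, alpha=0.88, lam1=0.829, lam2=0.012)
print(f"{coverage:.1%}")  # 94.5%
```

The square of G appears because the backward and forward intervals around the error are treated as independent, and the final factor is the probability that the error itself goes undetected.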
Figure 2.5. Single processor estimated error coverage, 
q = 1, μ = 11.1 min/err.
3. PACED IN A LINEAR ARRAY
Notation
V number of PEs in linear array
L error detection latency
D error propagation distance
For a V-PE unidirectional linear processor array, inputs enter at the top and left of the array; 
outputs are produced at the bottom and right. Data flow only from left to right and from top to 
bottom. The computational activity at each PE consists of receiving input, performing a task with 
or without applying CED, and sending output. A task is a fine-grained set of data manipulations,
such as a multiply-accumulate operation. Such arrays have implemented algorithms such as FFT
processing [5] and image edge detection [6]. For two PEs in the array, PE_i and PE_j, if i < j, then
PE_i is upstream of PE_j and PE_j is downstream from PE_i.
Assumptions
1. All communication channels in the array are fault-free.
2. If an erroneous array output is produced by a PE, an erroneous propagating output will also 
be produced and sent downstream (e.g., by using the AN-code [7]).
3. PEs are code-disjoint: erroneous inputs or state values will cause erroneous propagating 
outputs.
Assumption 2 ensures an error can be detected downstream if an erroneous array output is 
produced; Assumption 3 ensures PEs that are not checking will propagate errors.
3.1. Error Detection Latency
The error detection latency L is the number of computation cycles an error propagates until
it is detected; L_max is the maximal value. Use of O_i = (N_i i) mod M_i has been shown to minimize
L_max in linear arrays [8].
LEMMA 3.1: Given a V-PE unidirectional linear processor array using PACED with q = 1,
M_i = M, N_i = N, 1 ≤ N ≤ M, and O_i = (Ni) mod M, it can be shown that the detection latency of
an error created in the unchecked cycle r at PE_i, L_r, is ⌈(M − r)/N⌉, where N ≤ r ≤ M − 1 and
i < V − L_max. The maximum error detection latency in the array, L_max, is ⌈(M − N)/N⌉, for all
PE_i such that i < V − L_max.
PROOF: See Appendix. □
EXAMPLE 3.1: Figure 3.1 shows the checking pattern in a 7-PE unidirectional array with
M_i = 5, N_i = 2, and O_i = (2i) mod 5. Computation cycles proceed vertically; each row shows the
checking activity in the array during a cycle, where — and X represent a PE doing a task and a
checked task, respectively. The checking pattern sets up waves of checked cycles that advance
upstream over time to catch propagating errors.
An error created at PE2 in cycle 10 (marked by *) would be detected by PE4 in cycle 12
(labeled L2), hence L_2 = 2. An error created at PE2 in cycle 11 (marked by o) or cycle 12 ($)
would be detected by PE3 in cycle 12 (L3) or cycle 13 (L4): L_3 = 1 and L_4 = 1, respectively.
Finally, L_max = L_2 = 2. □
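The latencies of Example 3.1 can be reproduced by a small simulation and compared with the closed form ⌈(M − r)/N⌉ of Lemma 3.1. The phase model in `is_checked` is an assumption inferred from the pattern of Figure 3.1, not a rule stated in the text: PE i is taken to work on task (cycle − i) and to check whenever (task + O_i) mod M < N, with O_i = (N·i) mod M. All function names are illustrative.

```python
def is_checked(pe, cycle, M, N):
    # ASSUMPTION: PE i handles task (cycle - i); its checking phase is
    # (task + O_i) mod M with O_i = (N * i) mod M, and phases 0..N-1 are checked.
    offset = (N * pe) % M
    return (cycle - pe + offset) % M < N


def simulated_latency(pe, cycle, M, N, V):
    """Cycles until a downstream PE checks the propagating error (None if it exits)."""
    hops = 1
    while pe + hops < V:
        if is_checked(pe + hops, cycle + hops, M, N):
            return hops
        hops += 1
    return None


def lemma_latency(r, M, N):
    """Closed form of Lemma 3.1: ceil((M - r) / N) for an unchecked cycle r."""
    return -(-(M - r) // N)


# Example 3.1: errors created at PE2 in cycles 10, 11, and 12 of a 7-PE array.
latencies = [simulated_latency(2, c, M=5, N=2, V=7) for c in (10, 11, 12)]
print(latencies)  # [2, 1, 1]
```

Under this model each downstream hop advances the checking phase by N, which is why the latency depends only on how far the unchecked cycle r is from the end of the period.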
3.2. Error Propagation Distance
The maximum number of unchecked cycles through which a detected error could have 
propagated will enable determination of the amount of previously produced output to suspect as 
possibly erroneous from each PE in the linear array, upon error detection.
LEMMA 3.2: Given a V-PE unidirectional linear processor array using PACED with perfect
detection (q = 1), let M_i = M, N_i = N, and 1 ≤ N ≤ M. Using O_i = (Ni) mod M, it can be shown
that an error detected by PE_i's r-th checked cycle, 0 ≤ r ≤ N − 1, propagated through at most D_r
unchecked cycles, where D_r = min(i, ⌈(M + r + 1)/N⌉ − 2).
PROOF: See Appendix. □
EXAMPLE 3.2: Using the array of Example 3.1 (Figure 3.1), D_0 for an error detected at PE3
at computation cycle 12 is ⌈(5 + 0 + 1)/2⌉ − 2 = 1, because PE1 checked computation cycle 10.
For an error detected at PE3 at computation cycle 13, D_1 = ⌈(5 + 1 + 1)/2⌉ − 2 = 2, because PE0
checked computation cycle 10. □
computation      PE
cycle        0   1   2   3   4   5   6
10           X   X   *   —   —   X   X
11           X   —   o   —   X   X   —
12           —   —   $   L3  L2  —   —
13           —   —   X   L4  —   —   —
14           —   X   X   —   —   —   X
Figure 3.1. Checking pattern in a 7-PE array.
3.3. Suspected Outputs
Upon error detection, outputs produced both in the recent past (Theorem 3.1) and the near 
future (Theorem 3.2) should be suspected as possibly erroneous.
THEOREM 3.1: Given a V-PE unidirectional linear array using PACED with perfect detection
(q = 1), let M_i = M, N_i = N, 1 ≤ N ≤ M, and O_i = (Ni) mod M. If PE_i detects an error at its
r-th checked cycle in computation cycle c, 0 ≤ r ≤ N − 1, then the output from PE_i in c should be
suspected as possibly erroneous. Also, the outputs produced by PE_{i−k} in cycle c − k, for 1 ≤ k ≤
D_r, should be suspected. All other unsuspected, previously-produced outputs can be trusted with
a confidence of 1, unless a later error detection makes it necessary to suspect them.
PROOF: See Appendix. □
EXAMPLE 3.3: Figure 3.2 shows a 10-PE unidirectional linear array with M_i = 13, N_i = 3,
and O_i = (3i) mod 13. Let PE8 detect an error in cycle c (X in the figure). The output from PE8
in cycle c should be suspected. Also, the outputs of PE7, PE6, PE5, and PE4 in cycles c − 1, c − 2,
c − 3, and c − 4, respectively (marked by *), should be suspected as possibly erroneous. All other
outputs generated up through cycle c can be trusted with a confidence of 1, unless a later error
detection makes it necessary to suspect them. □
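The set of previously produced outputs to suspect follows mechanically from Lemma 3.2 and Theorem 3.1, as the sketch below illustrates. It assumes D_r = min(i, ⌈(M + r + 1)/N⌉ − 2) and that the detection in Example 3.3 happens at checked cycle r = 2, which is the value implied by the four suspected upstream outputs; the function names are illustrative.

```python
def propagation_distance(i, r, M, N):
    """D_r from Lemma 3.2: min(i, ceil((M + r + 1) / N) - 2)."""
    ceil_term = -(-(M + r + 1) // N)
    return min(i, ceil_term - 2)


def suspected_previous_outputs(i, r, M, N):
    """(PE index, cycle offset from c) pairs to suspect, following Theorem 3.1."""
    d = propagation_distance(i, r, M, N)
    return [(i, 0)] + [(i - k, -k) for k in range(1, d + 1)]


suspects = suspected_previous_outputs(i=8, r=2, M=13, N=3)
print(suspects)  # [(8, 0), (7, -1), (6, -2), (5, -3), (4, -4)]
```

Since the result depends only on i, r, M, and N, the patterns can be tabulated before run time, as the text notes later for the linear array.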
computation      PE
cycle        0   1   2   3   4   5   6   7   8   9
c − 5        —   —   —   X   X   —   —   —   —   —
c − 4        —   —   —   X   *   —   —   —   —   X
c − 3        —   —   X   X   —   *   —   —   —   X
c − 2        —   —   X   —   —   —   *   —   X   X
c − 1        —   X   X   —   —   —   —   *   X   —
c            —   X   —   —   —   —   —   X   X   —
Figure 3.2. Suspected previously produced outputs, 10-PE array.
Future outputs need to be suspected when an error is detected at one of the end elements
PE_i, where i ≥ V − L_max.
THEOREM 3.2: Given a V-PE unidirectional linear array using PACED with perfect detection
(q = 1), let M_i = M, N_i = N, 1 ≤ N ≤ M, and O_i = (Ni) mod M. If PE_{V−L_max+i} detects an error
at its r-th checked cycle in computation cycle c, where 0 ≤ r ≤ N − 1 and 0 ≤ i ≤ L_max − 1, then the
following outputs should be suspected as possibly erroneous.
Case (a) If (r + (L_max − 1 − i)N + k) mod M ≥ N, then the outputs from PE_{V−L_max+i+j} in
cycle c + j + k should be suspected, where 0 ≤ j ≤ L_max − 1 − i; if r < N − 1, then k = 0, otherwise
0 ≤ k ≤ M − N (r = N − 1).
Case (b) All output from PE_{V−1} in cycles c + L_max − 1 − i until its next checked cycle should
be suspected.
All other unsuspected, future outputs can be trusted with a confidence of 1, unless a future 
error detection makes it necessary to suspect them.
PROOF: See Appendix. □
EXAMPLE 3.4: Figure 3.3 shows the 10-PE linear array of Example 3.3 where Lmax = 4. 
PE6 has detected an error at check r = 2 in cycle c (marked X). The future outputs to suspect are 
those from: PE6 in cycle c + 1, PE7 in cycles c + 1 and c + 2, PE8 in cycles c + 2 and c + 3, and 
PE9 in cycles c + 3 and c + 4 (marked *), plus the detection (X). All other unsuspected, future 
outputs can be trusted with a confidence of 1, unless a later error detection makes it necessary to 
suspect them. □
The patterns of outputs to suspect upon error detection are static; they can be determined 
pre-run time and retrieved, when needed, with little run time overhead.
[Figure 3.3: checking pattern of the 10-PE array in the computation cycles following cycle c, with the error detection at PE6 marked X and the suspected future outputs marked *.]
Figure 3.3. Suspected future outputs, 10-PE array.
The amount of output to suspect upon error detection in the linear array is much less than
that for the single processor: using the undetected-errors intervals in the single processor
(Example 2.3: q = 1, λ = 0.829 err/min, N/M = 0.5, C = 0.99), outputs from 18.8 min (9.4 min
before and after an error detection) should be suspected. In the linear array, only a few tens of
computation cycles' outputs need ever be suspected; with 15-20 μs cycle times in VLSI
implementations [6], less than one second's output would need to be suspected. Since PEs can
check other PE outputs, PACED can give high confidence in most array outputs upon error
detection with non-continuous checking.
3.4. Error Coverage
Assuming errors occur uniformly distributed throughout the linear array, an estimate of the
error coverage can be made. Since all M-cycle periods are identical, it suffices to examine the
coverage of one M-cycle period. There are MV potential sites in one M-cycle period at which
errors may occur: one for each PE in each cycle. Since it is assumed errors propagate unmasked
through the array, only some of these sites could lead to propagation of undetected errors out of
the array, if an error were to occur. Dividing the number of these sites by the number of potential
sites estimates the fraction of errors that escape; the remainder is the estimated error coverage.
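The site-counting estimate can be sketched as follows. The escape rule used here is an assumption about how the count is made: with q = 1, an error created at PE i in unchecked cycle r is taken to escape when its Lemma 3.1 latency ⌈(M − r)/N⌉ would carry it past the last PE. Under that assumption a 16-PE array with M = 10 gives about 72% coverage at N/M = 0.1 and 95% at N/M = 0.4, in line with the figures quoted for Figure 3.4.

```python
def estimated_linear_coverage(V, M, N):
    """Fraction of the M*V error sites whose errors cannot leave the array undetected."""
    escaping = 0
    for r in range(N, M):               # unchecked cycles of one M-cycle period
        latency = -(-(M - r) // N)      # ceil((M - r) / N), Lemma 3.1
        # ASSUMPTION: the last `latency` PEs are too close to the array end
        # for any downstream PE to catch an error created in cycle r.
        escaping += min(latency, V)
    return 1 - escaping / (V * M)


for n in (1, 4):
    print(n / 10, round(estimated_linear_coverage(V=16, M=10, N=n), 3))
# 0.1 0.719
# 0.4 0.95
```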
Figure 3.4 shows the estimated error coverage for a 16-PE linear array as a function of N/M,
when M = 10 and q = 1. The graph shows that even for small values of N/M, the error coverage
is quite high (greater than 70% for N/M = 0.1). The coverage climbs quickly as N/M increases;
any checking ratio ≥ 0.4 gives an estimated error coverage ≥ 95%. The cooperation amongst the
PEs that allows propagated errors to be detected affords this rise in coverage for small N/M. This
result is promising, as it allows low checking ratios and thus, low performance cost, while still
maintaining good error coverage.
Figure 3.4. Estimated error coverage for a 16-PE linear array.
4. PACED IN A TWO-DIMENSIONAL ARRAY
Notation
U number of rows of PEs in 2-D array
V number of columns of PEs in 2-D array
RISE, RUN determine O_{i,j}, giving slope of checking pattern
L error detection latency
For a UxV two-dimensional (2-D) mesh-connected processor array, inputs enter at the top
and left of the array; outputs are produced at the bottom and right. Data may only flow from left
to right and from top to bottom. Similar arrays have implemented algorithms such as matrix
operations [9] and image processing [10].
For two PEs in the array, PE_{i,j} and PE_{k,l}, if i < k or j < l, then PE_{i,j} is upstream of PE_{k,l} and PE_{k,l}
is downstream from PE_{i,j}. The offset O_{i,j} is determined by two parameters called RISE and RUN:
RISE/RUN gives the slope of the waves of checking in the checking pattern. The confidence
analysis of the 2-D array is based on the same assumptions as in Section 3.
4.1. Error Detection Latency
An algorithm is used to determine L_max and L_r, the latency of an error created in an
unchecked computation cycle r of PE_{i,j}, for N_{i,j} ≤ r ≤ M_{i,j} − 1. When PACED is applied to a 2-D
array, O_{i,j} = (M_{i,j} + i + j − (U − 1 − i)RUN − (V − 1 − j)RISE) mod M_{i,j}. Depending on RISE
and RUN, any PE in the 2-D array may create an error that propagates undetected, so L_max is
defined as the largest finite error detection latency.
EXAMPLE 4.1: Figure 4.1 shows a 10x10 array amidst a computation, with M_{i,j} = 10, N_{i,j} =
3, RISE/RUN = 2/1, and O_{i,j} = (2i + 3j − 17) mod 10. The detection latency for an error created
at PE_{2,5} in cycle c (e in the figure), when PE_{2,5} performs its 6th check, is L_5 = 2, since both PE_{3,6}
and PE_{2,7} detect the error in cycle c + 2. The figure shows how the error propagates through the
array (* in the figure) until detection (X). For this array, L_max = L_3 = 3. □
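The offset expression can be checked against the simplified forms quoted in the examples. The sketch below assumes the general form O_{i,j} = (M + i + j − (U − 1 − i)RUN − (V − 1 − j)RISE) mod M with the same M at every PE; the function name and argument order are illustrative.

```python
def offset_2d(i, j, U, V, M, rise, run):
    """Checking offset O_{i,j} for PE (i, j) in a U x V array (same M at every PE)."""
    return (M + i + j - (U - 1 - i) * run - (V - 1 - j) * rise) % M


# 10x10 array with M = 10 and RISE/RUN = 2/1, as in Example 4.1.
matches = all(
    offset_2d(i, j, 10, 10, 10, rise=2, run=1) == (2 * i + 3 * j - 17) % 10
    for i in range(10)
    for j in range(10)
)
print(matches)  # True
```

Python's `%` returns a non-negative result for a positive modulus, so negative intermediate values such as 2i + 3j − 17 need no special handling.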
j j j
0 1 2 3 4 5 6 7 8 9 i 0 1 2 3 4 5 6 7 8 9 i 0 1 2 3 4 5 6 7 8 9
0 — — — — X — — — — X 0 — — — X X — — — X X 0 — - — X — — — — X —
1 — — — X X — — - X X 1 - - - X - - - - X - 1 - - X X - - - X X -
2 — — — X - e — - X - 2 - - X X - - * X X - 2 - - X - - - - X - -
3 — — X X — — — X X — 3 - - X - - * - X - - 3 - X X - - - X X - -
4 — — X — — - — X - — 4 - X X - - - X X - - 4 - X - - - * X - - -
5 — X X — - — X X - - 5 - X - - - - X - - - 5 X X - - - X X - - -
6 — X — — — — X - - - 6 X X - - - X X - - - 6 X - - - - X - - - -
7 X X — — - X X - - - 7 X - - - - X - - - - 7 X - - - X X - - - X
8 X — — — - X — - - - 8 X - - - X X - - - X 8 - - - - X - - - - X
9 X - - — X X — - — X 9 — — — — X — — — — X 9 — — — X X — — — X X
computation cycle c computation cycle c + 1 computation cycle c + 2
Figure 4.1. Error detection latency L5.
4.2. Suspected Outputs
In the 2-D array, the outputs to suspect upon error detection are determined using algo­
rithms that run in 0(UV  • N itj) time, assuming N itjis constant for all PEi ; .
EXAMPLE 4.2: Figure 4.2 shows a 10x10 processor array using PACED with $M_{i,j} = 13$, $N_{i,j} = 5$, 
RISE/RUN = 3/1, and $O_{i,j} = (2i + 4j - 23) \bmod 13$. Each grid shows the array in one 
computation cycle. The outputs to suspect are marked @ (the error detection) and * (whence the error 
might have propagated).
If an error is detected at PE$_{9,9}$ in cycle $c$, its output should be suspected as possibly 
erroneous. The outputs from the following PE$_{i,j}$ should also be suspected: PE$_{9,8}$ and PE$_{8,9}$ in 
cycle $c - 1$; PE$_{9,7}$, PE$_{8,8}$, and PE$_{7,9}$ in cycle $c - 2$; PE$_{7,8}$ and PE$_{6,9}$ in cycle $c - 3$; and PE$_{5,9}$ in 
cycle $c - 4$. All other unsuspected, previously-produced outputs can be trusted with a confidence 
of 1, unless a later error detection makes it necessary to suspect them. □
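The list of previously produced outputs in Example 4.2 can be regenerated by walking backwards from the detecting PE through unchecked cycles. The following Python sketch is one possible way to do this and is not the algorithm used by the authors. It assumes (as an inference from the linear-array relation in the proof of Lemma 3.1, not a statement made in this section) that PE$_{i,j}$ is at position $(O_{i,j} - i - j + t) \bmod M$ of its checking sequence in cycle $t$, that positions 0 through $N - 1$ are checked, and that PE$_{i,j}$ receives its inputs from PE$_{i-1,j}$ and PE$_{i,j-1}$. The function names are hypothetical.

```python
def offset(i, j, U, V, M, rise, run):
    """Checking offset O_{i,j} as given in Section 4.1."""
    return (M + i + j - (U - 1 - i) * run - (V - 1 - j) * rise) % M


def is_checking(i, j, t, U, V, M, N, rise, run):
    """ASSUMPTION: PE_{i,j} is at position (O_{i,j} - i - j + t) mod M in
    cycle t, and positions 0..N-1 are the checked ones."""
    return (offset(i, j, U, V, M, rise, run) - i - j + t) % M < N


def suspected_previous_outputs(i, j, c, U, V, M, N, rise, run):
    """Map each earlier cycle to the PEs whose output in that cycle is
    suspect after a detection at PE_{i,j} in cycle c.  The output of the
    detecting PE in cycle c is suspect as well and is not listed."""
    suspects = {}
    frontier = {(i, j)}
    t = c
    while frontier:
        t -= 1
        # ASSUMPTION: PE_{a,b} takes inputs from PE_{a-1,b} and PE_{a,b-1}.
        sources = {(p, q)
                   for (a, b) in frontier
                   for (p, q) in ((a - 1, b), (a, b - 1))
                   if p >= 0 and q >= 0}
        # A source that was checking produced a verified output, so the
        # error cannot have passed through it.
        frontier = {pe for pe in sources
                    if not is_checking(pe[0], pe[1], t, U, V, M, N, rise, run)}
        if frontier:
            suspects[t] = sorted(frontier)
    return suspects


# Parameters of Example 4.2; cycle c is taken as t = 0.
params = dict(U=10, V=10, M=13, N=5, rise=3, run=1)
for cycle, pes in suspected_previous_outputs(9, 9, 0, **params).items():
    print(cycle, pes)
```

With cycle $c$ taken as $t = 0$, the sketch lists PE$_{8,9}$ and PE$_{9,8}$ for cycle $-1$, PE$_{7,9}$, PE$_{8,8}$, and PE$_{9,7}$ for cycle $-2$, PE$_{6,9}$ and PE$_{7,8}$ for cycle $-3$, and PE$_{5,9}$ for cycle $-4$.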
EXAMPLE 4.3: Figure 4.3 shows a 10x10 processor array using PACED with $M_{i,j} = 10$, $N_{i,j} = 3$, 
RISE/RUN = 2/1, and $O_{i,j} = (2i + 3j - 17) \bmod 10$. The figure is notated as in Figure 4.2.
Figure 4.2. Suspected previously produced outputs, 10x10 array.
If an error is detected at PE$_{8,8}$ in cycle $c$ (marked @), its output should be suspected as 
possibly erroneous. The outputs from the following PE$_{i,j}$ should also be suspected: PE$_{8,9}$ and 
PE$_{9,8}$ in cycle $c + 1$, and PE$_{9,9}$ in cycle $c + 2$ (all marked by *). All other unsuspected, future 
outputs can be trusted with a confidence of 1 (until, of course, the next error detection). □
As in the linear array, the patterns of outputs to suspect upon error detection are static, and 
can be determined before run time and retrieved as needed upon error detection at run time.
A C analysis program was written to determine the minimum number of outputs to suspect 
for varying PACED parameters. A 20x20 array was tested, using $M_{i,j} = 15$, $N_{i,j} = 1, 2, \ldots, 15$, 
$O_{i,j} = (15 + i + j - (19 - i)\,\mathrm{RUN} - (19 - j)\,\mathrm{RISE}) \bmod 15$, and $q = 1$, and varying RISE and RUN. The 
experiment found that only approximately one second's output needed to be suspected upon error 
detection. Again, this is a great improvement over the amount of suspected output in the 
single-processor case, due to the cooperation of PEs checking other PE outputs to afford high 
confidence in outputs with only periodic checking.
[Figure 4.3 shows the array in computation cycle c. Legend: checked task; unchecked task; error detected (suspect); suspected output.]
Figure 4.3. Suspected future outputs, 10x10 array.
4.3. Error Coverage
The error coverage in the 2-D array can be estimated if it is assumed that errors occur 
uniformly distributed throughout the array. Only one M-cycle period need be examined, as all other 
M-cycle periods are identical and have the same coverage. In one M-cycle period a UxV 2-D 
mesh array has MUV potential sites at which an error may occur: one for each PE of the array, in 
each cycle. Since it is assumed that errors propagate unmasked through the array, only a fraction 
of the potential sites can lead to errors propagating undetected out of the array, if an error occurs. 
The estimated error coverage is the fraction of potential sites that do not lead to such an escape, 
i.e., one minus the number of these sites divided by the total number of potential sites.
Figure 4.4 shows the estimated error coverage for a 4x4 PE mesh array as a function of 
N/M, when M = 10 and q = 1. When N/M is small, the error coverage is low, but the coverage 
increases quickly as N/M increases: greater than 95% coverage can be achieved with N/M of 
0.5 or greater. As for the linear array, infrequent checking can yield high error coverage, and 
infrequent checking can reduce the performance cost of applying CED.
[Plot: estimated error coverage versus N/M, with M = 10.]
Figure 4.4. Estimated error coverage for a 4x4 mesh array.
APPENDIX
PROOF OF LEMMA 2.1: Let $E_t$ represent the number of error arrivals in a time interval of 
length $t$, with exponentially distributed interarrival times. Let $D_t$ and $U_t$ represent the number of 
detected and undetected errors that arrive in the same time interval, respectively. If the 
interarrival times are exponentially distributed, this implies that the arrivals follow a Poisson process.
$$
\begin{aligned}
\Pr\{D_t = k\} &= \sum_{n=k}^{\infty} \Pr\{E_t = n\} \binom{n}{k} \left(q\frac{N}{M}\right)^{k} \left(1 - q\frac{N}{M}\right)^{n-k} \\
&= \left(\frac{q\frac{N}{M}}{1 - q\frac{N}{M}}\right)^{k} \sum_{n=k}^{\infty} \frac{(\lambda t)^{n} e^{-\lambda t}}{n!} \cdot \frac{n!}{k!\,(n-k)!} \left(1 - q\frac{N}{M}\right)^{n} \\
&= \frac{\left(\lambda q \frac{N}{M} t\right)^{k}}{k!}\, e^{-\lambda q \frac{N}{M} t}
\end{aligned}
$$
This is a Poisson distribution, with modified error arrival rate $\lambda' = \lambda qN/M$. The proof that 
undetected arrivals are exponentially distributed is substantially similar and results in a Poisson 
process with a modified error arrival rate $\lambda'' = \lambda(1 - qN/M)$. □
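The thinning identity in this proof can be checked numerically. The Python sketch below compares the binomially thinned sum over $E_t = n$, truncated after 100 terms, with the Poisson probability at rate $\lambda qN/M$. The parameter values are illustrative only and do not come from the report, and the function names are hypothetical.

```python
import math


def poisson_pmf(k, mean):
    """Pr{X = k} for a Poisson random variable with the given mean."""
    return mean ** k * math.exp(-mean) / math.factorial(k)


def detected_pmf_by_sum(k, lam, t, p, terms=100):
    """Pr{D_t = k}: each of the E_t = n arrivals is detected independently
    with probability p = q*N/M.  The infinite sum is truncated."""
    return sum(poisson_pmf(n, lam * t)
               * math.comb(n, k) * p ** k * (1 - p) ** (n - k)
               for n in range(k, k + terms))


# Illustrative values (assumptions, not taken from the report).
lam, t, q, N, M = 0.8, 5.0, 0.9, 3, 10
p = q * N / M
for k in range(4):
    print(k, detected_pmf_by_sum(k, lam, t, p), poisson_pmf(k, lam * p * t))
```

For each $k$ the two printed values should agree up to floating-point rounding, which is the statement that detected errors arrive as a Poisson process with rate $\lambda qN/M$.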
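PROOF OF LEMMA 2.2:
$$
\begin{aligned}
\Pr\{U_t = k \mid D_t = l\} &= \frac{\Pr\{U_t = k \;\&\; D_t = l\}}{\Pr\{D_t = l\}} \\
&= \frac{\Pr\{E_t = k + l\} \binom{k+l}{l} \left(1 - q\frac{N}{M}\right)^{k} \left(q\frac{N}{M}\right)^{l}}{\dfrac{\left(\lambda q \frac{N}{M} t\right)^{l}}{l!}\, e^{-\lambda q \frac{N}{M} t}} \\
&= \frac{\dfrac{(\lambda t)^{k+l} e^{-\lambda t}}{(k+l)!} \binom{k+l}{l} \left(1 - q\frac{N}{M}\right)^{k} \left(q\frac{N}{M}\right)^{l}}{\dfrac{\left(\lambda q \frac{N}{M} t\right)^{l}}{l!}\, e^{-\lambda q \frac{N}{M} t}} \\
&= \frac{\left(\lambda \left(1 - q\frac{N}{M}\right) t\right)^{k}}{k!}\, e^{-\lambda \left(1 - q\frac{N}{M}\right) t} \\
&= \Pr\{U_t = k\}
\end{aligned}
$$
Thus,
$$\Pr\{U_t = k \;\&\; D_t = l\} = \Pr\{U_t = k\} \cdot \Pr\{D_t = l\} \qquad □$$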
PROOF OF THEOREM 2.1: Let $D$ represent the detected error interarrival time. Since 
detected errors follow a Poisson process with parameter $\lambda qN/M$, $\Pr\{D > t\} = e^{-\lambda qNt/M}$ and 
$\Pr\{D \le t\} = 1 - e^{-\lambda qNt/M}$.
Let $K$ be the length of a time interval such that $\Pr\{D \le K\} \ge C$. Then,
$$1 - e^{-\lambda q \frac{N}{M} K} \ge C$$
$$K \ge -\frac{M}{\lambda q N} \ln(1 - C) \qquad □$$
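As a worked instance of the bound $K \ge -\frac{M}{\lambda q N}\ln(1 - C)$, the Python sketch below evaluates the smallest admissible $K$. The function name and the numeric values are hypothetical and chosen only for illustration.

```python
import math


def detection_window(lam, q, N, M, C):
    """Smallest K with Pr{D <= K} >= C, where D is exponentially
    distributed with rate lam*q*N/M (Theorem 2.1)."""
    if not 0 <= C < 1:
        raise ValueError("C must satisfy 0 <= C < 1")
    return -M / (lam * q * N) * math.log(1 - C)


# Illustrative values (assumptions, not taken from the report).
K = detection_window(lam=0.01, q=1.0, N=3, M=10, C=0.95)
print(K)
```

Substituting the returned $K$ into $1 - e^{-\lambda q N K / M}$ gives back $C$, so the bound is met with equality.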
PROOF OF THEOREM 2.2: Case (a): Let $D$ and $U$ represent the detected and undetected 
error interarrival times, respectively. From Lemma 2.1, both random variables are exponentially 
distributed with parameters $\lambda' = \lambda qN/M$ and $\lambda'' = \lambda(1 - qN/M)$, respectively.
The quantity $D - U$ represents the time between the first undetected error and the first 
detected error. The probability $\Pr\{D - U > t\}$ is now determined using a joint probability 
distribution.
$$\Pr\{D - U > t\} = \int_{0}^{\infty} \int_{x+t}^{\infty} \lambda'' e^{-\lambda'' x} \cdot \lambda' e^{-\lambda' y}\, dy\, dx = \frac{\lambda''}{\lambda'' + \lambda'}\, e^{-\lambda' t}$$
It follows that $\Pr\{D - U \le t\} = 1 - \frac{\lambda''}{\lambda'' + \lambda'}\, e^{-\lambda' t}$. Let $L$ be the length of a time interval such 
that $\Pr\{D - U \le L\} \ge C$. Then,
$$1 - \frac{\lambda''}{\lambda'' + \lambda'}\, e^{-\lambda' L} \ge C$$
$$L \ge -\frac{M}{\lambda q N} \ln\left(\frac{1 - C}{1 - q\frac{N}{M}}\right)$$
Hence, with confidence $C$, the first undetected error occurred within $L$ time units before the 
detected error.
Case (b): Let U represent the time to the last undetected error before the next detected error, 
and V, the time to the next detected error.
First, the probability is determined that the last undetected error occurs in some infinitesi­
mal time slice du at time u while the next detected error occurs in some infinitesimal time slice 
dv at time v, where v > u. (If it were known that v < u, i.e., no undetected errors occur before the 
next detected error, then none of the outputs produced between the two error detections would 
have to be suspected.)
The expression below has a term for each of the following conditions: 1) no errors are 
detected in interval u starting from the current error detection; 2) at least one error is undetected 
in interval du; 3) no errors occur in interval v -  u; and 4) at least one error is detected in interval 
dv. (The variable U should be defined as the time of the last undetected error before the fault 
becomes inactive, but since the distribution of fault lifetimes is unknown, U is predicated instead 
on the next error detection. In the derivation, then, the next detection is allowed to take place at 
any time slice dv from u to infinity, in effect allowing the fault to become inactive.)
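$$
\begin{aligned}
\Pr\{(u \le U < u + du) \;\&\; (v \le V < v + dv)\} &= e^{-\lambda' u}\,(1 - e^{-\lambda'' du})\, e^{-\lambda (v - u)}\,(1 - e^{-\lambda' dv}) \\
&= \lambda' \lambda'' e^{-\lambda' u} e^{-\lambda (v - u)}\, du\, dv
\end{aligned}
$$
The terms $1 - e^{-\lambda'' du}$ and $1 - e^{-\lambda' dv}$ have been simplified using the approximation $1 - e^{-x} = x + o(x)$ 
as $x \to 0$.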
Now, the probability that U is greater than some L is determined, using the joint probability 
just derived.
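$$\Pr\{U > L\} = \int_{L}^{\infty} \int_{u}^{\infty} \lambda' \lambda'' e^{-\lambda' u} e^{\lambda u} e^{-\lambda v}\, dv\, du = \left(1 - q\frac{N}{M}\right) e^{-\lambda' L}$$
$L$ is determined such that $\Pr\{U > L\} \le 1 - C$, where $C$, the confidence, is set arbitrarily 
close to 1.
$$\Pr\{U > L\} \le 1 - C$$
$$\left(1 - q\frac{N}{M}\right) e^{-\lambda' L} \le 1 - C$$
$$L \ge -\frac{M}{\lambda q N} \ln\left(\frac{1 - C}{1 - q\frac{N}{M}}\right)$$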
Hence, with confidence C, the last undetected error occurred within L time units after the 
detected error. Outputs produced prior to L time units before the detected error, or subsequent to 
L time units after the detected error, can be trusted with confidence C; outputs produced within L 
units of time either before or after the detected error should be suspected as possibly erroneous. □
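The window $L$ can be evaluated directly from the bound $L \ge -\frac{M}{\lambda q N}\ln\left(\frac{1 - C}{1 - qN/M}\right)$ reached in both cases of the proof. The Python sketch below is a minimal illustration; the function names and numeric values are hypothetical, and clamping $L$ to zero when the requested confidence is already met without any window is an added convenience that the report does not discuss.

```python
import math


def tail_probability(lam, q, N, M, L):
    """Pr{U > L} = (1 - q*N/M) * exp(-lam*q*N/M * L), from case (b)."""
    p = q * N / M
    return (1 - p) * math.exp(-lam * p * L)


def suspect_window(lam, q, N, M, C):
    """Smallest L >= 0 with tail_probability(...) <= 1 - C (Theorem 2.2)."""
    p = q * N / M
    if not (0 < p <= 1 and 0 <= C < 1):
        raise ValueError("need 0 < q*N/M <= 1 and 0 <= C < 1")
    if 1 - C >= 1 - p:
        # ASSUMPTION (not in the report): the bound already holds at L = 0.
        return 0.0
    return -math.log((1 - C) / (1 - p)) / (lam * p)


# Illustrative values (assumptions, not taken from the report).
L = suspect_window(lam=0.01, q=1.0, N=3, M=10, C=0.99)
print(L, tail_probability(0.01, 1.0, 3, 10, L))
```

Outputs produced more than the returned $L$ time units before or after a detection are then the ones that can be trusted with confidence $C$.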
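PROOF OF LEMMA 2.3: Let $N(\tau)$ represent the number of error arrivals in the time interval 
$[t_0, t_0 + \tau]$. Let $S_n$ be the sum of $n$ error interarrival times. Since there was an arrival at $t_0$,
$$\Pr\{N(\tau) = n\} = \Pr\{S_n \le \tau < S_{n+1}\}.$$
This equation also applies to the number of arrivals in the interval $[t_0 - \tau, t_0]$. Let $F^{(n)}(\tau) = \Pr\{S_n \le \tau\}$ 
be the CDF of $S_n$. Then,
$$\Pr\{N(\tau) = n\} = F^{(n)}(\tau) - F^{(n+1)}(\tau)$$
$$\sum_{n=0}^{\infty} \Pr\{N(\tau) = n\}\, z^{n} = \sum_{n=0}^{\infty} F^{(n)}(\tau)\, z^{n} - z^{-1} \left( \sum_{n=0}^{\infty} F^{(n+1)}(\tau)\, z^{n+1} \right)$$
$$
\begin{aligned}
G(z, \tau) &= G_F(z, \tau) - z^{-1}\left(G_F(z, \tau) - 1\right) \\
&= (1 - z^{-1})\, G_F(z, \tau) + z^{-1}
\end{aligned}
$$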
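Taking the Laplace transform of $G_F(z, \tau)$:
$$
\begin{aligned}
\mathcal{L}(G_F(z, \tau)) &= \sum_{n=0}^{\infty} \mathcal{L}(F^{(n)}(\tau))\, z^{n} \\
&= \sum_{n=0}^{\infty} \frac{1}{s} \left( \frac{\alpha\lambda_1}{\lambda_1 + s} + \frac{(1 - \alpha)\lambda_2}{\lambda_2 + s} \right)^{n} z^{n} \\
&= \frac{1}{s} \cdot \frac{1}{1 - \left( \dfrac{\alpha\lambda_1}{\lambda_1 + s} + \dfrac{(1 - \alpha)\lambda_2}{\lambda_2 + s} \right) z}
\end{aligned}
$$
Hence,
$$
\begin{aligned}
G(z, \tau) &= (1 - z^{-1})\, \mathcal{L}^{-1}\left[ \frac{1}{s} \cdot \frac{1}{1 - \left( \dfrac{\alpha\lambda_1}{\lambda_1 + s} + \dfrac{(1 - \alpha)\lambda_2}{\lambda_2 + s} \right) z} \right] + z^{-1} \\
&= \frac{a_1 + b}{a_1 - a_2}\, e^{a_1 \tau} - \frac{a_2 + b}{a_1 - a_2}\, e^{a_2 \tau}
\end{aligned}
$$
where $a_1$ and $a_2$ are the roots of:
$$s^2 + \left(\lambda_1(1 - \alpha z) + \lambda_2(1 - (1 - \alpha) z)\right) s + (1 - z)\lambda_1\lambda_2 = 0$$
and $b = (1 - \alpha)\lambda_1 + \alpha\lambda_2$. □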
PROOF OF LEMMA 3.1: By design of the checking pattern, if $CS_{M,N}[r]$ is the checking 
activity at PE$_i$ in some computation cycle $c$, then $CS_{M,N}[(r + y(N - 1) + z) \bmod M]$ is the 
checking activity at PE$_{i+y}$ in cycle $c + z$. With perfect detection, errors only propagate through 
unchecked cycles, so the proof only considers $N \le r \le M - 1$.
If an error occurs at PE$_i$ during its $N$th cycle, it will go undetected: this cycle is unchecked 
($CS_{M,N}[N] = 0$). In the next cycle, the error will propagate to PE$_{i+1}$ and be detected if 
$CS_{M,N}[(2N) \bmod M] = 1$ (i.e., if PE$_{i+1}$ is checking). If $CS_{M,N}[(2N) \bmod M] = 0$, then the error 
will propagate to PE$_{i+2}$ in the next cycle, where it will be detected if $CS_{M,N}[(3N) \bmod M] = 1$; 
and so on.
The latency of detection of this error, $L_N$, is the number of computation cycles required for 
the error to reach a checked cycle. In terms of the checking sequence, $L_N$ is the smallest integer 
number of $N$-bit hops needed to reach $s$ such that $CS_{M,N}[s] = 1$ (i.e., $0 \le s \le N - 1$) from $N$, 
where $CS_{M,N}[N] = 0$. This is a distance of $M - N$ bits.
$$L_N \cdot N \ge M - N$$
$$L_N = \lceil (M - N)/N \rceil$$
Similarly, $L_{N+1}$, the latency of an error created during the $(N + 1)$st cycle (an unchecked 
cycle, since $CS_{M,N}[N + 1] = 0$), is $\lceil (M - N - 1)/N \rceil$. In general, an error created during cycle $r$ 
(an unchecked cycle: $CS_{M,N}[r] = 0$) will have latency $L_r = \lceil (M - N - (r - N))/N \rceil = \lceil (M - r)/N \rceil$, 
$N \le r \le M - 1$. Clearly, $L_N \ge L_{N+1} \ge \cdots \ge L_{M-1}$. Therefore, the maximum error 
detection latency, $L_{max}$, is $L_N$: $L_{max} = L_N = \lceil (M - N)/N \rceil$.
This analysis applies to all PEs in the array except the end elements, PE$_i$ where $i \ge V - L_{max}$. 
At these PE$_i$, an error may propagate undetected out of the array since for these PE$_i$ 
there are fewer than $L_{max}$ PEs downstream. □
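The closed form $L_r = \lceil (M - r)/N \rceil$ can be cross-checked by counting hops through the checking sequence directly. The Python sketch below does this for one illustrative choice of $M$ and $N$; it takes positions 0 through $N - 1$ as the checked cycles and advances the position by $N$ per hop, as in the proof. The function names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def latency_by_hops(r, M, N):
    """Count N-position hops from unchecked position r (N <= r <= M-1)
    until a checked position (0..N-1) is reached."""
    hops, pos = 0, r
    while pos >= N:
        pos = (pos + N) % M  # next PE in the next cycle: position + N
        hops += 1
    return hops


def latency_closed_form(r, M, N):
    """L_r from Lemma 3.1."""
    return ceil_div(M - r, N)


M, N = 10, 3  # illustrative values
for r in range(N, M):
    print(r, latency_by_hops(r, M, N), latency_closed_form(r, M, N))
```

The largest value occurs at $r = N$, which is the $L_{max} = \lceil (M - N)/N \rceil$ of the lemma.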
PROOF OF LEMMA 3.2: Let $CS_{M,N}[0]$ at PE$_i$ detect an error in computation cycle $c$. The 
checking activity at PE$_{i-1}$ during cycle $c - 1$ is $CS_{M,N}[(-N) \bmod M]$. The maximum number of 
unchecked cycles through which the detected error may have propagated, $D_0$, is the number of 
computation cycles required to reach a checked cycle, minus 1, counting backwards in time. In 
terms of the checking sequence, $D_0 + 1$ is the smallest integer number of $N$-bit hops needed to 
reach $CS_{M,N}[r]$, $0 \le r \le N - 1$, from $CS_{M,N}[0]$. This is a distance of $M - N + 1$ bits.
$$(D_0 + 1) N \ge M - N + 1$$
$$D_0 = \lceil (M - N + 1)/N \rceil - 1 = \lceil (M + 1)/N \rceil - 2$$
Similarly, $D_1 = \lceil (M + 2)/N \rceil - 2$. In general, $D_r = \lceil (M + r + 1)/N \rceil - 2$, $0 \le r \le N - 1$. For 
PEs near the beginning of the array, there may be fewer than $D_r$ PEs through which the error 
propagated. Hence, at PE$_i$, $D_r = \min(i, \lceil (M + r + 1)/N \rceil - 2)$, for $0 \le r \le N - 1$. □
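The expression for $D_r$ can be cross-checked in the same way, by stepping backwards through the checking sequence $N$ positions at a time. The Python sketch below uses illustrative values of $M$ and $N$ and hypothetical function names; it treats positions 0 through $N - 1$ as checked.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def unchecked_cycles_by_hops(r, M, N):
    """Step back one PE and one cycle at a time from checked position r
    (0 <= r <= N-1) and count the unchecked positions passed through
    before the previous checked position."""
    hops, pos = 0, r
    while True:
        pos = (pos - N) % M  # previous PE in the previous cycle
        hops += 1
        if pos < N:          # reached a checked cycle
            return hops - 1


def unchecked_cycles(i, r, M, N):
    """D_r at PE_i from Lemma 3.2, including the array-boundary limit."""
    return min(i, ceil_div(M + r + 1, N) - 2)


M, N = 10, 3  # illustrative values
for r in range(N):
    print(r, unchecked_cycles_by_hops(r, M, N), ceil_div(M + r + 1, N) - 2)
```

The `min(i, ...)` term only matters for PEs close to the start of the array, where fewer than $D_r$ upstream PEs exist.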
PROOF OF THEOREM 3.1: By Lemma 3.2, the detected error propagated through at most 
$D_r$ unchecked cycles to reach PE$_i$. Thus, the error was created at some PE$_{i-k}$ in a cycle 
$c - (k + y)$, where $1 \le k \le D_r$ and $y = 1, 2, 3, \ldots$.
Figure A.1 shows a 10-PE array with $M = \infty$ and $N = 2$. The X marks an error detection at 
PE$_5$ in cycle $c$ and the *s mark the $D_r$ cycles through which an error may have propagated to 
reach PE$_5$.
Suppose that the error had occurred at PE$_4$ in cycle $c - 2$, $c - 3$, or $c - 4$. The error would 
have been detected by PE$_6$ in cycle $c$, $c - 1$, or $c - 1$, respectively. Suppose the error had 
occurred at PE$_3$ in cycle $c - 3$, $c - 4$, or $c - 5$. This error would have been detected by PE$_6$ in 
cycle $c$ or $c - 1$, or by PE$_7$ in cycle $c - 1$, respectively.
In general, any error created at PE$_{i-k}$ before cycle $c - k$ would either have been detected by 
cycle $c$ (and the appropriate outputs already suspected), or gone undetected (if the error 
propagated out of the array). This is a result of the checking pattern, in which each PE$_i$ performs its 
last checked cycle ($CS_{M,N}[N - 1]$) during the same computation cycle that PE$_{i-1}$ performs its 
first checked cycle ($CS_{M,N}[0]$). Hence, only the outputs from PE$_{i-k}$ in cycles $c - k$ need be 
suspected, $1 \le k \le D_r$, as well as that from PE$_i$ in $c$. All other unsuspected, previously produced 
outputs can be trusted with a confidence of 1, unless a later error detection makes it necessary to 
suspect them. □
[Figure A.1: computation cycles c - 6 through c (rows) versus PE 0 through 9 (columns); X marks the error detection at PE5 in cycle c and * marks the cycles through which the error may have propagated.]
Figure A.1. Error propagation in a 10-PE array.
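The set of outputs named in this proof, namely PE$_i$ in cycle $c$ together with PE$_{i-k}$ in cycle $c - k$ for $1 \le k \le D_r$, is static and can be tabulated ahead of time. The Python sketch below enumerates it for a linear array; the function name is hypothetical, and the example uses a checking period long enough that the array boundary, rather than $M$, limits the list.

```python
def suspected_previous_outputs(i, c, r, M, N):
    """(PE index, cycle) pairs to suspect when PE_i detects an error in
    cycle c while at checked position r (0 <= r <= N-1)."""
    d_r = min(i, -(-(M + r + 1) // N) - 2)  # D_r from Lemma 3.2
    return [(i, c)] + [(i - k, c - k) for k in range(1, d_r + 1)]


# Detection at PE_5 with N = 2; M = 1000 is an illustrative stand-in for a
# period so long that only the array boundary limits the backward walk.
for pe, cycle in suspected_previous_outputs(i=5, c=0, r=0, M=1000, N=2):
    print(pe, cycle)
```

With these values the list runs diagonally from PE$_5$ in cycle $c$ back to PE$_0$ in cycle $c - 5$, which is the pattern of marks described for Figure A.1.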
PROOF OF THEOREM 3.2: By use of $O_i = (Ni) \bmod M$ in the linear array, when PE$_i$ in cycle 
$c$ performs its checked task ($CS_{M,N}[r] = 1$), then PE$_{i+y}$ in cycle $c + z$ will perform 
$CS_{M,N}[(r + y(N - 1) + z) \bmod M]$.
Now, let PE$_{V-L_{max}+i}$ detect an error in cycle $c$ by $CS_{M,N}[r]$, for $0 \le i \le L_{max} - 1$. These 
PE$_{V-L_{max}+i}$ are those PEs that could create errors that propagate undetected out of the array. The 
detected error will propagate to PE$_{V-1}$ in cycle $c + L_{max} - 1 - i$. In that cycle, if PE$_{V-1}$ is not 
checking (i.e., $(r + (L_{max} - 1 - i)N) \bmod M \ge N$), then this error will propagate out of the array 
and outputs from all PEs and cycles through which the error propagated should be suspected as 
possibly erroneous. That is, if $(r + (L_{max} - 1 - i)N) \bmod M \ge N$, then the output from PE$_{V-L_{max}+i+j}$ 
in cycle $c + j$ should be suspected, $0 \le j \le L_{max} - 1 - i$. If PE$_{V-L_{max}+i}$ will check at the next cycle 
$c + 1$, then this gives part a) when $r < N - 1$ ($k = 0$).
If $r = N - 1$ (PE$_{V-L_{max}+i}$ won't check in cycle $c + 1$), then as in the above case when $r < N - 1$, 
if $(r + (L_{max} - 1 - i)N + k) \bmod M \ge N$, then the output from PE$_{V-L_{max}+i+j}$ in cycle 
$c + j + k$ should be suspected, where $0 \le j \le L_{max} - 1 - i$ and $k = 0$. In addition, for each of the 
next $M - N$ unchecked cycles, errors may propagate out of the array. This is likely since an 
error has already been detected at PE$_{V-L_{max}+i}$ and the fault may still be active while that PE is not 
checking. The additional outputs to suspect depend upon whether PE$_{V-1}$ is not checking when 
the errors arrive there. That is, for each cycle $c + k$, $1 \le k \le M - N$, if $(r + (L_{max} - 1 - i)N + k) \bmod M \ge N$, 
then the output from PE$_{V-L_{max}+i+j}$ in cycle $c + j + k$ should be suspected, for $0 \le j \le L_{max} - 1 - i$. 
This completes part a) when $r = N - 1$.
Once an error propagates to PE$_{V-1}$ while it is not checking, all of its outputs until its next 
checked cycle should be suspected as possibly erroneous since its outputs are not checked by any 
other PE. Hence, all of the outputs from PE$_{V-1}$ in cycles $c + L_{max} - 1 - i$ (the earliest that the 
error, first detected at PE$_{V-L_{max}+i}$ in cycle $c$, could corrupt PE$_{V-1}$) until its next checked cycle 
should be suspected as possibly erroneous. This gives part b) in the statement of the theorem.
All other unsuspected, future outputs from the array can be trusted with a confidence of 1, 
unless a future error detection makes it necessary to suspect them. □
ACKNOWLEDGMENTS
The authors wish to thank Robert Dimpsey, Kumar Goswami, Inhwan Lee, and Dong Tang 
for their help with the curve fitting performed in Section 2.1.
REFERENCES
[1] P. P. Chen, A. N. Mourad, and W. K. Fuchs, “Confidence in processor array outputs 
under periodic application of concurrent error detection,” Technical Report 
CRHC-93-12, Center for Reliable and High-Performance Computing, Univ. of Illinois, 
Urbana, IL, May 1993.
[2] P. P. Chen and W. K. Fuchs, “Periodic application of concurrent error detection in pro­
cessor arrays,” Digest of Papers, Gov't. Microcircuits Applications Conf. (GOMAC), 
vol. XV, pp. 451-454, Nov. 1989.
[3] R. K. Iyer and P. Velardi, “Hardware-related software errors: Measurement and analy­
sis,” IEEE Trans. Software Engineering, vol. SE-11, no. 2, pp. 223-231, Feb. 1985.
[4] D. Tang and R. K. Iyer, “Dependability measurement and modeling of a multicomputer 
system,” IEEE Trans. Computers, vol. 42, no. 1, pp. 62-75, Jan. 1993.
[5] E. Swartzlander, Jr., “Systolic FFT processors,” pp. 133-140 in Systolic Arrays. Ed. W. 
Moore, A. McCabe, R. Urquhart. Bristol: Adam Hilger, 1987.
[6] J. A. Vlontzos and S. Y. Kung, “A wavefront array processor using dataflow processing 
elements,” Proc. 1st Int. Conf. Supercomputing (Lecture Notes in Computer Science v. 
297), pp. 744-767, Springer-Verlag, 1987.
[7] J. F. Wakerly, Error Detecting Codes, Self-Checking Circuits, and Applications. New 
York: North-Holland, 1978.
[8] Y. M. Wang, P. Y. Chung, and W. K. Fuchs, “Design and scheduling for periodic con­
current error detection and recovery in processor arrays,” Technical Report 
CRHC-92-08, Center for Reliable and High-Performance Computing, Univ. of Illinois, 
Urbana, IL, May 1992.
[9] S. Y. Kung, VLSI Array Processors. Englewood Cliffs: Prentice Hall, 1988.
[10] R. Bayford, “The bit-serial systolic back-projection engine (BSSBPE),” pp. 43-54 in 
Application Specific Array Processors. Ed. S. Y. Kung, E. Swartzlander, Jr., J. A. B. 
Fortes, K. W. Przytula. Los Alamitos: IEEE Computer Society Press, 1990.
