Measurement-Based Analysis of Multiple Errors and Near-Coincident Fault Discovery in a Shared Memory Multiprocessor by Mitra, Samir G. et al.
July 1988    UILU-ENG-88-2238
CSG-90
COORDINATED SCIENCE LABORATORY
College of Engineering
MEASUREMENT-BASED 
ANALYSIS OF 
MULTIPLE ERRORS 
AND NEAR-COINCIDENT 
FAULT DISCOVERY 
IN A SHARED MEMORY 
MULTIPROCESSOR
Samir G. Mitra 
Ravishankar K. Iyer 
Mark Sloan
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
Approved for Public Release. Distribution Unlimited.
SECURITY CLASSIFICATION OF THIS PAGE: UNCLASSIFIED
4. PERFORMING ORGANIZATION REPORT NUMBER(S): UILU-ENG-88-2238 (CSG-90)
6a. NAME OF PERFORMING ORGANIZATION: Coordinated Science Lab, University of Illinois
6b. OFFICE SYMBOL (If applicable): N/A
6c. ADDRESS (City, State, and ZIP Code): 1101 W. Springfield Avenue, Urbana, IL 61801
7a. NAME OF MONITORING ORGANIZATION: NASA
7b. ADDRESS (City, State, and ZIP Code): NASA Langley Research Center, Hampton, VA 23665
8c. ADDRESS (City, State, and ZIP Code): see 7b.
9. PROCUREMENT INSTRUMENT IDENTIFICATION NUMBER: NASA-NAG-1-613
11. TITLE (Include Security Classification): Measurement-Based Analysis of Multiple Errors and Near-Coincident Fault Discovery in a Shared Memory Multiprocessor
12. PERSONAL AUTHOR(S): Mitra, Samir G.; Iyer, Ravishankar K.; Sloan, Mark
13a. TYPE OF REPORT: Technical
14. DATE OF REPORT (Year, Month, Day): February 1988
15. PAGE COUNT: 37
18. SUBJECT TERMS: parallel processing, multiple errors, simulation, probability distributions, near-coincident faults
21. ABSTRACT SECURITY CLASSIFICATION: Unclassified
DD FORM 1473, 84 MAR

MEASUREMENT-BASED ANALYSIS 
OF MULTIPLE ERRORS 
AND NEAR-COINCIDENT FAULT DISCOVERY 
IN A SHARED MEMORY MULTIPROCESSOR
Samir G. Mitra 
Ravishankar K. Iyer 
Mark Sloan
February, 1988
Computer Systems Group 
Coordinated Science Laboratory 
University of Illinois 
1101 W. Springfield Ave. 
Urbana, IL 61801
ABSTRACT
This paper presents a methodology to study multiple error presence and near-coincident
fault discovery in the memory of a shared memory multiprocessor. The delay between the
generation of an error due to a fault and its detection (error latency) can cause multiple
errors and near-coincident fault discovery in a system. The latter effect is widely known
to be catastrophic to the continued operation of a system, but very little research has
been done to understand its behavioral characteristics. The methodology is illustrated on
the Alliant FX/8, the basic cluster component of the Cedar Supercomputer at the Center for
Supercomputing Research and Development at the University of Illinois. Experimental
results are provided under real concurrent workload conditions over a five-day period.
This study finds that for a conservative error occurrence rate of one error a day, there
is a 25% chance that latent errors cause a multiple error condition. Thus one out of four
errors may manifest itself as a multiple error. At the same error occurrence rate, it was
found that 8% of the error manifestations are near-coincident in nature for a time-window
size of 50 microseconds (approximately 250 instruction cycles on the Alliant FX/8). It was
seen that the probability of multiple errors tends to saturate after a threshold error
occurrence rate of approximately one error a day. A high degree of similarity between the
behaviors of multiple errors and near-coincident fault discovery suggested a strong
correlation between the existence of multiple errors and their discovery in
near-coincidence.
ACKNOWLEDGMENTS
Acknowledgments are due to Ed Davidson, Tracy Tilton, Mark Washburn, Madhu Sharma, Al Malony, Pen
Yew, Mike Haney and the other members of the Center for Supercomputing Research and Development for their
support and many useful discussions during the course of this work. This work was supported by the National
Aeronautics and Space Administration under NASA grant NAG-1-613.
TABLE OF CONTENTS
1. INTRODUCTION
1.1. Related Research
2. EXPERIMENTAL METHODOLOGY
2.1. Hardware Measurements
2.2. Simulation
2.3. Measurement of Multiple Errors
2.4. Measurement of Near-Coincident Faults
3. RESULTS
3.1. Multiple Errors
3.2. Near-Coincident Fault Discovery
4. CONCLUSIONS
REFERENCES
APPENDIX: THE SIMULATOR

LIST OF TABLES
1. Concurrent Workload Memory Address Trace
2. Error Latency Simulator Output
3. Multiple Error Simulator Output
4. Near-Coincident Fault Simulator Output

LIST OF FIGURES
1. Configuration of Measured Alliant FX/8
2. Example of a Latency Profile
3(a). No Overlap
3(b). Type 1 Overlap
3(c). Type 2 Overlap
4(a). Error Latency Distribution at Measurement Time = 0.0 hr
4(b). Error Latency Distribution at Measurement Time = 20.03 hr
5. An Example of Multiple Error Presence During the Measurement Period
6. Multiple Error Presence at Different Error Occurrence Rates
7. An Example of Multiple Error Presence with Successive Injections
8. Near-Coincident Fault Discovery at Different Time-Window Sizes (Macroscopic)
9. Near-Coincident Fault Discovery at Different Time-Window Sizes (Microscopic)
10. Near-Coincident Fault Discovery at Different Error Occurrence Rates
11. The Simulation Environment
SECTION 1
INTRODUCTION
Reliability and availability have become key aspects in a wide range of computer system
design methodologies. A prerequisite in designing for reliability is to understand the
effect of faults and their manifestation mechanisms. The behavior of faults in a computer
system is not easy to realize or understand. This is even more so in a multiprocessing
environment, where the number of ways in which error conditions may manifest themselves
is usually incomprehensible. Analytical modeling techniques suffer from constraining
assumptions and developmental complexity. Alternatives are measurements and experiments
on production multiprocessor systems. These aid the model building process and provide
valuable insight for designing new systems.
This paper studies the fault discovery process in a shared memory multiprocessor system.
There is usually a delay between the generation of an error (caused by a fault) and its
discovery by a detection mechanism. This time is commonly referred to as error latency.
These latent errors (sometimes referred to as "lurking errors") can be major threats to
the reliability of the system. This is so because if more than one latent error exists,
there is a possibility that they can be discovered simultaneously, behaving as though
multiple faults have occurred. Most recovery mechanisms, however, are not designed to
handle multiple faults. Latent errors have been observed to be discovered close in time
to each other, thus stressing the error recovery mechanism. Such situations are referred
to as near-coincident fault discovery and are known to be catastrophic in real systems
[1,2].
The purpose of this experimental study is to quantify the characteristics of multiple
errors and near-coincident fault discovery in a shared memory multiprocessor system under
a real concurrent workload¹. A multiprocessing system presents new complications from the
workload point of view, since a number of processes can be active at the same time. This
casts a new perspective on the study of latent error behavior, as the probability of
error discovery is potentially high.
The experiment employs actual hardware measurements from an Alliant FX/8 system to
simulate error occurrence in the system and to investigate multiple error occurrence and
near-coincident fault discoveries. The Alliant FX/8 is a key component in the "Cedar"
parallel supercomputer project at the Center for Supercomputing Research and Development
at the University of Illinois at Urbana-Champaign [3]. The measured Alliant FX/8 runs the
current version of "Xylem," the Cedar operating system. Specifically, the methodology is
applied to the Alliant memory subsystem. The fault model used in this study assumes that
a permanent error has already occurred². The physical mechanisms causing the faults can
vary and do not affect the results.
The results are unique in that they provide new insight into the behavior of multiple
errors and near-coincident fault discovery in a complex parallel processing environment.
At a conservative error occurrence rate of approximately one error a day, there is a 25%
chance that latent errors cause a multiple error condition in the system. Thus one out of
four errors may manifest itself as a multiple error. Further, it was found that 8% of the
error manifestations are near-coincident in nature at the same error occurrence rate for
a time-window size of 50 microseconds. It was also found that the probability of multiple
errors tends to saturate after a threshold error occurrence rate of approximately one
error a day. Although it was seen that the near-coincident fault discovery behavior has a
high correlation with multiple error behavior, the saturation effect on the probability
of near-coincident fault discovery becomes apparent at a higher error occurrence rate.

¹ When two or more processors are active the system is said to be in concurrent operation.
² An error is that part of the system state which is liable to lead to failure. The cause,
in its phenomenological sense, of an error is a fault.
1.1 Related Research
There is little or no research cited in the literature which experimentally investigates
the occurrence of multiple or near-coincident fault discovery. Fault injection studies in
the FTMP (Fault Tolerant Multiprocessor) showed that the most likely threat to system
failure in the short run was the arrival of two failures so close to each other that
system reconfiguration was not possible [1,2]. These experiments related to specific
programs and involved pin-level fault injection. An analytical model for near-coincident
faults in NMR systems with different voting schemes is presented in [4]. The general
validity of such a model, however, is not established.
Other related research includes experiments conducted to measure fault/error latency.
Experiments to measure fault latency via pin-level fault injections in FTMP are discussed
in [5]. In that study, the researchers measured latency times for faults in different
system components and obtained a standard distribution fit to their measured fault
latency distribution. CPU fault latency for the digital microprocessor in FTMP was
studied in [6,7] via gate-level simulation. A set of specific programs was used to
exercise the CPU to reveal faults injected into the simulation.
The above approaches and results are, however, not applicable in general to multiuser
systems. More recently, latent fault behavior in the memory of a VAX 11/780 was studied
in [8]. The memory system was instrumented for measurements, and fault/error latencies
were calculated by simulated fault injection in the memory. The effect of workload on
fault/error latencies was investigated in [9].
Although the above studies investigate the subject of latency quite systematically, the
question of multiple errors or near-coincident fault discovery is not addressed. Several
past measurements indicate that these problems are usually catastrophic to the system,
which points toward a great need for the kind of investigation presented in this work.
Section 2 describes the experimental methodology used to calculate the multiple error and
near-coincident fault discovery probabilities. Section 3 presents results and discusses
the multiple error and near-coincident fault discovery behavior. Section 4 is the
concluding section, which highlights important results of the paper.
SECTION 2
EXPERIMENTAL METHODOLOGY
The measured system is an Alliant FX/8, a shared memory multiprocessor. Figure 1 shows
the Alliant FX/8 components related to our study. The system runs the current version of
"Xylem," the Cedar operating system. From the software point of view, many features of
the Cedar supercomputer are running on the Alliant FX/8. Detailed information on the
Alliant FX/8 is given in [10,11]. The workload on the Alliant FX/8 consisted mostly of
scientific applications such as circuit simulation, weather modeling, digital animation
and fluid dynamics.
Figure 1. Configuration of Measured Alliant FX/8 (Computational Complex with CE0 through CE7; Main Memory)
We investigate the failure characteristics of the main memory. An important reason for
this is that measured field results show that the largest number of failures occur in the
memory [12]. A large number of CPU errors can also be traced to the memory [13]. As
shared memory is a common resource, the possibility of it being the source of failures is
significant.
2.1 Hardware Measurements
The Alliant FX/8 backplane was sampled to collect data on memory access operations from
the shared cache. A Tek DAS 9200 with a 32K trace buffer was used for this purpose [14].
The hardware probes were attached primarily to the main memory address bus on the
backplane of the system. Other probes were used to monitor signals so that appropriate
triggering could be performed.
As mentioned in the introduction, the measurements were performed while the system was
executing a concurrent workload. The measurements were conducted over a five-day period,
8am to 5:30pm daily, Monday to Thursday, and 8am to 3:45pm on Friday (primarily due to a
drop in concurrent operations). Samples were taken approximately every 4 minutes³, each
containing 8K address references (representing 8K machine cycles). The total measurement
period was approximately 46 hours.
Table 1 shows the filtered version of the raw data output. Addresses represent memory
block start addresses. The memory is accessed in blocks of 32 bytes (the transfer size
between the shared cache and memory). The fields cntl0 and cntl1 provide additional
status information about the state of the memory bus.

Table 1. Concurrent Workload Memory Address Trace

Line no.   time stamp    address   cntl0   cntl1
1          00033316579   0D3FF8    F       0
2          0003331666B   000230    F       4
3          000333166BC   000232    F       4
4          0003331670D   000234    F       4
5          000333167BB   0D3FF7    F       0
6          000333167CC   1AE0F7    F       8
7          00033316869   0C2FF5    F       0

³ The sampling rate chosen reflects a compromise between an adequate sample size and delay
in transferring data to a data logger.
2.2 Simulation
The memory address trace obtained above was then used as input to a simulation system,
which essentially reconstructed the address space into which simulated error injections
are performed over the entire measurement period (the simulator is driven by the address
trace). An error is discovered when the time of error injection at an address location is
less than or equal to the time of arrival of that address in the concurrent workload
address trace. By observation of the discovery behavior of these faults, the simulator
provided results on the probability of multiple errors and near-coincident faults.
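The discovery rule described above can be sketched in Python as follows. This is only an illustration, not the simulator described in the Appendix: the function names `index_trace` and `discovery_time`, the `(time, address)` tuple layout, and the toy trace values are assumptions made for the example.

```python
from bisect import bisect_left
from collections import defaultdict

def index_trace(trace):
    """Map each block address to the sorted list of times it was referenced.

    `trace` is an iterable of (time, block_address) pairs (hypothetical layout).
    """
    by_addr = defaultdict(list)
    for t, addr in trace:
        by_addr[addr].append(t)
    for times in by_addr.values():
        times.sort()
    return by_addr

def discovery_time(by_addr, inject_time, addr):
    """Return the first reference time >= inject_time for `addr`, or None.

    None means the error stays latent for the rest of the trace.
    """
    times = by_addr.get(addr, [])
    i = bisect_left(times, inject_time)
    return times[i] if i < len(times) else None

# Toy trace: block 0x230 is referenced at t=10 and t=30, block 0x232 at t=20.
toy_trace = [(10, 0x230), (20, 0x232), (30, 0x230)]
toy_index = index_trace(toy_trace)
found_at = discovery_time(toy_index, 15, 0x230)
latency = found_at - 15 if found_at is not None else None
```

The error latency of a discovered error is then simply the discovery time minus the injection time; errors whose address never reappears in the trace remain undetected.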
For error injection purposes no distinction was made between specific locations within a
block. Since the transfers from main memory occur in blocks of 32 bytes, an error in one
location within the block is equivalent to an error in any other location in the block
from a discovery point of view. This greatly simplified the simulation and smoothed out
discontinuities arising out of the fact that the data were sampled.
Simulated error occurrences (i.e., error injections) were performed assuming an
exponential distribution for error occurrence over the entire measurement period. The
error injection rates ($\lambda$) were varied from 0.009 to 0.058 (×6 error occurrences
per hour). Address locations for error injection were randomly chosen. The exponentially
distributed intervals between error injections were also randomly chosen.
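A minimal Python sketch of such an injection schedule is given below. It assumes that the quoted unit "×6 error occurrences per hour" means the hourly rate is six times $\lambda$; the function names, the fixed seed, and the address pool are hypothetical and serve only to illustrate the idea.

```python
import random

def injection_times(lam, period_hours, rng):
    """Return injection times (in hours) with exponentially distributed gaps.

    Assumption: the hourly rate is 6 * lam, following the "x6 error
    occurrences per hour" unit quoted in the text.
    """
    rate_per_hour = 6.0 * lam
    times = []
    t = rng.expovariate(rate_per_hour)
    while t < period_hours:
        times.append(t)
        t += rng.expovariate(rate_per_hour)
    return times

def random_block_addresses(address_pool, count, rng):
    """Pick `count` injection addresses uniformly at random, with replacement."""
    return [rng.choice(address_pool) for _ in range(count)]

rng = random.Random(1988)  # fixed seed so the sketch is repeatable
times = injection_times(0.022, 46.0, rng)
addresses = random_block_addresses([0x230, 0x232, 0x234], 600, rng)
```

Each entry of `times` would correspond to one error injection, at which a batch of randomly located errors is placed in memory.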
In order to obtain statistically consistent results, approximately 600 faults were
injected at each error injection time. This is equivalent to the simulation being run 600
times for each error injection rate. In each run, a randomly chosen location is injected
with an error.
2.3 Measurement of Multiple Errors
Multiple errors occur when two or more errors are yet undiscovered in the system. In
order to determine the probability of multiple errors at a given error injection rate, we
first construct a latency profile for each injection. The latency profile for an
injection is the discovery time profile for all errors injected at that injection time.
Once the time to the discovery of each error injected is available, a latency profile can
be plotted as in Figure 2.

Figure 2. Example of a Latency Profile

Consider for simplicity a case in which two error injections are made in the measurement
period. Figure 3 shows the three possible latency profile overlaps. Note that at each
injection time a number of errors are injected into the memory. The multiple error
regions between the two error injections are shown. Errors whose discovery latencies do
not lie within the multiple error region are those that do not exist as multiple errors
in the system. The multiple error region area versus the total latency profile area of
both injections gives a rough view of the probability of multiple errors in the system.
The probability of multiple errors would be the ratio of the number of errors in the
multiple error region to the total number of errors injected.
Let $E_i$ represent the number of errors injected at error injection number $i$. Also let
$Me_{ij,\lambda}$ represent the number of errors of error injection $i$ that exist as
multiple errors with error injection $j$ at the error occurrence rate $\lambda$ (e.g.,
$Me_{12,\lambda}$ represents the number of errors in injection 1 that exist as multiple
errors with injection 2 at the error occurrence rate $\lambda$, and $Me_{21,\lambda}$ is
the number of errors in injection 2 that exist as multiple errors with injection 1 at the
error occurrence rate $\lambda$). Then between two error injections $n$ and $m$, where
$n < m$, the probability of multiple errors $Mp_{nm,\lambda}$ for an error occurrence rate
of $\lambda$ is

$$Mp_{nm,\lambda} = \frac{Me_{nm,\lambda} + Me_{mn,\lambda}}{E_n + E_m}$$
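As a rough illustration of the pairwise ratio, the following Python sketch counts the errors of two injections whose discovery times fall inside a multiple error region. The region used here (from the later injection time to the earlier of the two last discovery times, i.e. the overlap of the two latency profiles) is an assumption made for the example, since the text defines the region only graphically; the function name and sample values are likewise hypothetical.

```python
def pairwise_multiple_error_probability(t_n, disc_n, e_n, t_m, disc_m, e_m):
    """Estimate Mp_nm for two injections n and m with t_n < t_m.

    disc_n, disc_m: discovery times of the errors that were detected.
    e_n, e_m: number of errors injected (detected or not).
    Assumption: the multiple error region is the overlap of the two latency
    profiles, from t_m to the earlier of the two last discovery times.
    """
    if t_n >= t_m:
        raise ValueError("injection n must precede injection m")
    if not disc_n or not disc_m:
        return 0.0
    start = t_m
    end = min(max(disc_n), max(disc_m))
    if end < start:  # profiles do not overlap, as in Figure 3(a)
        return 0.0
    me_nm = sum(1 for d in disc_n if start <= d <= end)
    me_mn = sum(1 for d in disc_m if start <= d <= end)
    return (me_nm + me_mn) / (e_n + e_m)

# Injection n at t=0 (4 errors, 3 detected), injection m at t=4 (3 errors).
mp = pairwise_multiple_error_probability(0, [1, 5, 9], 4, 4, [6, 7, 12], 3)
```

With a different definition of the multiple error region only the `start` and `end` lines would change; the ratio itself follows the equation above.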
Figure 3(a). No Overlap (Error Injection 1, Error Injection 2)
Figure 3(b). Type 1 Overlap (Error Injection 1, Error Injection 2, multiple error region)
Figure 3(c). Type 2 Overlap (Error Injection 1, Error Injection 2, multiple error region)
In the complex case of more than two error injections within the measurement period, the
multiple error probabilities can be individually calculated with respect to one
particular error injection for the given error occurrence rate $\lambda$ (i.e.,
$Mp_{12,\lambda}$ is the multiple error probability between fault injections 1 and 2,
$Mp_{13,\lambda}$ is the multiple error probability between 1 and 3, etc.). For each
$Mp_{nm,\lambda}$ the multiple errors may exist in either of the two forms shown in
Figures 3(b) and 3(c). But given the definition of multiple errors, where at least two
errors must exist undiscovered in the system, only adjacent error injection probabilities
need be considered. Thus only the $Mp_{12,\lambda}$, $Mp_{23,\lambda}$, $Mp_{34,\lambda}$,
etc. values are used to give an overall multiple error probability ($Mp_\lambda$) for the
error occurrence rate chosen. Thus, if $n_{ei}$ represents the number of error injections
achieved at the error occurrence rate $\lambda$, the overall probability of multiple
errors at error occurrence rate $\lambda$ is

$$Mp_\lambda = \sum_{i=1}^{n_{ei}-1} Mp_{i(i+1),\lambda}$$
2.4 Measurement of Near-Coincident Faults
In order to measure the probability of near-coincident fault discoveries, we choose an
appropriate time window of size $T$. Next we move this window over the total measurement
time in increments equal to $T$, each time observing the number of errors (from different
error injections) discovered within the time window. The ratio of the total number of
errors found in these time windows to the total number of errors injected gives the
probability of near-coincident faults in the system. Note that if errors from the same
error injection are discovered within the time window, they do not qualify as a
near-coincident fault discovery.
The total measurement period is divided into $n$ time slices $t_1, \ldots, t_n$, each $T$
long except one (if true integer division is not possible). The number of errors
discovered (from different error injections) in each $t_k$ is $N_{k,\lambda}$ at an error
occurrence rate $\lambda$, where $1 \le k \le n$. Again $n_{ei}$ represents the number of
error injections achieved at error occurrence rate $\lambda$. If the total number of
errors injected into the system is $E$, then the probability of near-coincident faults
($NC_\lambda$) for an error occurrence rate of $\lambda$ is

$$NC_\lambda = \frac{\sum_{k=1}^{n} N_{k,\lambda}}{E}, \qquad \text{where } E = \sum_{i=1}^{n_{ei}} E_i$$
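The windowing procedure can be sketched in Python as below. The sketch assumes that a time slice contributes all of its discovered errors to the count when those errors come from at least two different error injections, and contributes nothing otherwise; this reading of $N_{k,\lambda}$, the function name, and the sample data are assumptions for illustration.

```python
from collections import defaultdict

def near_coincident_probability(discoveries, total_injected, window):
    """Estimate the near-coincident fault probability for one window size.

    discoveries: iterable of (discovery_time, injection_number) pairs.
    total_injected: E, the total number of errors injected.
    window: the time-window size T, in the same unit as the discovery times.
    Assumption: a slice counts only if it holds discoveries from at least
    two different error injections.
    """
    slices = defaultdict(list)
    for t, injection in discoveries:
        slices[int(t // window)].append(injection)
    n_total = 0
    for injections in slices.values():
        if len(set(injections)) >= 2:
            n_total += len(injections)
    return n_total / total_injected

# Seven discoveries from three injections, ten errors injected, T = 1.0.
sample = [(0.5, 1), (0.7, 2), (1.2, 1), (1.4, 1), (2.1, 3), (2.9, 1), (5.0, 2)]
nc = near_coincident_probability(sample, 10, 1.0)
```

In this sample, slices 0 and 2 each hold two errors from different injections, while slice 1 holds two errors from the same injection and is therefore ignored.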
SECTION 3 
RESULTS
This section presents results of multiple error and near-coincident fault behaviors seen
on a shared memory multiprocessor (Alliant FX/8) memory subsystem. Recall that errors
were injected at exponentially distributed intervals (with an error injection rate
$\lambda$). The memory address trace was then used for determining multiple error and
near-coincident fault discovery probabilities. For purposes of this study, errors were
injected in the high usage regions of the memory. The region of injection represented 96%
of the address references in the real concurrent workload trace but occupied only an
eighth of the memory address space available. Clearly, the behavior of faults in this
region is more critical for continued system operation.
On average, 14% of all injected errors remained undetected during the measurement period
(approximately 5 days). The error injection rate $\lambda$ for the experiment was chosen
to reflect realistic error occurrence rates (see [10]). The range was chosen to be
$0.009 \le \lambda \le 0.058$ (×6 error occurrences per hour, approximately 2 to 16 error
injections over 5 days). The time-window sizes chosen for analysis in the near-coincident
fault discovery calculations represent reasonable error recovery times for a high
performance system. The time-window range was varied from 1 microsecond to 250
microseconds (approximately 6 to 1500 instructions on the Alliant FX/8 [9]).
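The correspondence between the rate range and the number of injections can be checked with a short calculation. The sketch below assumes that the hourly rate is six times $\lambda$ and that the measurement period is the approximately 46 hours quoted in Section 2.1; the function name is hypothetical.

```python
def expected_injections(lam, period_hours=46.0):
    """Expected number of error injections over the measurement period.

    Assumption: the hourly rate is 6 * lam ("x6 error occurrences per hour").
    """
    return 6.0 * lam * period_hours

low = expected_injections(0.009)   # lower end of the lambda range
high = expected_injections(0.058)  # upper end of the lambda range
```

Under these assumptions the two ends of the range give about 2.5 and about 16 injections, in line with the "approximately 2 to 16" figure in the text.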
Figures 4(a) and 4(b) show the error latency distribution of detected errors at an error
occurrence rate of 0.03 (×6 error injections/hr, or ei/hr). In Figure 4(a) errors are
injected at the start of the measurement period (detected errors = 549), whereas in
Figure 4(b) errors are injected near the beginning of the third day of measurement
(detected errors = 497). From Figure 4(a), we find the distribution has three distinct
peaks. The first peak occurs during the beginning of the first day of measurement. The
second and third peaks also occur during the beginning of the second and third days of
measurement, respectively. Close to the start of the working day there is a sudden
increase of workload on the system. This leads to a high number of error discoveries. At
the first peak there are some error discoveries due to the injection of errors at that
point. This behavior was also seen in [8], where error latencies were measured on a
uniprocessor system (VAX 11/780). Contrary to the variation observed in [8] during midday
hours, Figure 4(a) does not show any significant variation during these hours. The
conjecture is that during midday hours the workload is more stable. This is primarily due
to the assignment of large processes (which run longer) on the CEs during midday hours.
Figure 4(b) also shows the same behavior for errors injected close to the beginning of
the third day of measurement. The difference in the behavior of latent errors on a
multiprocessor system versus that on a uniprocessor system is closely related to the
difference in workload characteristics between the two types of systems.
Figure 4(a). Error Latency Distribution at Measurement Time = 0.0 hr (Frequency versus Error Latency in minutes)
Figure 4(b). Error Latency Distribution at Measurement Time = 20.03 hr (Frequency versus Error Latency in minutes)
3.1 Multiple Errors
Figure 5 shows the variation in the probability of multiple errors during the measurement
period for an error occurrence rate of approximately two errors a day (0.043 ×6 ei/hr).
Figure 6 shows the probability of multiple errors being present in the system at
different error occurrence rates. We find that the probability of multiple errors
increases from a low of 0.04 at an error occurrence rate of approximately one error every
two days to a high of 0.50, which is more or less a saturation probability. The
oscillatory behavior of the graph is primarily due to statistical variations.
Figure 6 shows that, at a conservative error occurrence rate of approximately one error a
day ($\lambda = 0.022$), there exists a 25% chance that latent errors will cause a
multiple error condition. This suggests that one out of four errors may manifest itself
as a multiple error ($Mp_\lambda \approx 0.25$ for one error a day). At higher error
injection rates, the multiple error probability is high enough to be potentially
hazardous. Examining the plot in Figure 6, we find that at low error injection rates the
plot has a higher slope than at high error injection rates. As expected, the error
occurrence rate (or the number of error injections) does have an impact on the multiple
error probability, but this effect subsides as the error occurrence rate increases. The
threshold is not high, about four errors in a 5-day period (i.e., at less than one error
a day). The reason for this is that at higher error occurrence rates seemingly more
latent faults tend to be discovered or "swept away", thereby resulting in a tapering
effect on the plot.
To show in detail how the multiple error probability changes during the course of error
injections, a plot of the variation in the probability of multiple errors for an error
occurrence rate of 0.031 (×6 ei/hour) is shown in Figure 7. There are nine error
injections in the measurement period for this error occurrence rate. Each dotted line
represents one multiple error probability plot with respect to a specific error injection
number. $L_1$ represents the multiple error probabilities of error injection 1 ($E_1$)
with error injections $E_2$, $E_3$, $E_4$ and $E_5$. The $Mp_{12,0.031}$,
$Mp_{13,0.031}$, $Mp_{14,0.031}$ and $Mp_{15,0.031}$ values are represented on this line.
Similarly, $L_2$ represents the multiple error probabilities of error injection two
($Mp_{23,0.031}$, $Mp_{24,0.031}$ and $Mp_{25,0.031}$), and so on. A downward behavior is
seen for all the lines. This seems intuitive: for $L_1$, say, the errors of $E_1$ will
tend to be discovered as time progresses, thereby reducing the probability of multiple
errors being present in the system when $E_5$ is introduced.

Figure 5. An Example of Multiple Error Presence During the Measurement Period (Prob. of Multiple Errors versus Measurement Time in hours, max. = 46.7)
Figure 6. Multiple Error Presence at Different Error Occurrence Rates (horizontal axis: Error Occurrence Rate, ×6 ei/hour)
To explain the peculiarities of the error discovery behavior in the system, consider the
high multiple error probability of 0.9 in $L_2$. We find that most of the errors injected
in error injection 2 are discovered during the interval between error injections 3 and 4
(as $Mp_{23,0.031} = 0.9$ suggests high values of $Me_{23,0.031}$ and $Me_{32,0.031}$). Of
the few that are left, some are discovered between error injections 4 and 5, though very
few, as $Mp_{24,0.031} = 0.46$ (which indicates that most of the contribution is from
$Me_{42,0.031}$). The rest (of those discovered in the measurement period) are discovered
between error injections 5 and 6. Most multiple error probability distributions over the
measurement period for different error occurrence rates show this behavior.
Figure 7. An Example of Multiple Error Presence with Successive Injections
(Vertical axis: probability of multiple errors.)
3.2 Near-Coincident Fault Discovery
We will first observe how the near-coincident fault probability changes with
varying time-window sizes. Figure 8 shows the variation of the probability of
near-coincident fault discovery with time-window sizes from 10 to 250 microseconds for
three different sets of error injections. As expected, we see a monotonically increasing
function of time-window size. However, the rate of increase in the probability of
near-coincident fault discovery slowly decreases for larger time-window sizes. Figure 9
shows a microscopic view (1 to 10 microseconds) of the behavioral change in the
near-coincident fault probabilities, which increase in a stepwise fashion. This is easily
understood: if we have near-coincident faults in time-window size $\tau$, then those same
near-coincident faults must exist in time-window size $\tau + 1$.
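The containment argument above can be illustrated with a minimal sketch. It assumes that
a near-coincident fault is counted whenever one error discovery is followed by another
within the time window; the function name, the counting rule and the discovery times are
illustrative assumptions, not the NCFS implementation.

```python
from bisect import bisect_right

def count_near_coincident(discovery_times, window):
    """Count discoveries that are followed by another discovery within `window`.

    Assumed counting rule (hypothetical): a discovery at time t is near-coincident
    if some later discovery falls in the interval [t, t + window].
    """
    times = sorted(discovery_times)
    count = 0
    for i, t in enumerate(times):
        # Index just past the last discovery at or before t + window.
        j = bisect_right(times, t + window)
        if j - 1 > i:  # at least one later discovery lies inside the window
            count += 1
    return count

# Hypothetical discovery times in microseconds.
times = [0, 3, 4, 20, 27, 100]
counts = [count_near_coincident(times, w) for w in range(1, 11)]
print(counts)
```

Because every pair of discoveries that fits in a window of size w also fits in a window
of size w + 1, the counts can only stay flat or step upward as the window grows.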
The variation of the probability of near-coincident fault discovery with respect to error
injection rate for three time-window sizes is shown in Figure 10. In Figure 10, the range
of error occurrence rates is $0.009 < \lambda < 0.049$ (x6 error occurrences per hour),
about 2 to 14 error injections over the measurement period. From Figure 10, the
near-coincident fault probability values range from 0.003 to approximately 0.21 over the
10 to 250 microsecond time-window sizes.
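As a rough check on the injection counts quoted for these rates, the arithmetic can be
sketched as below. It assumes that "x6" is a multiplicative scale on the quoted rate and
that the measurement time is the 46.7-hour maximum shown on the time axis of Figure 5;
both are readings of the text, and the names are illustrative.

```python
SCALE = 6      # the "x6" factor quoted with each rate (assumed multiplicative)
HOURS = 46.7   # maximum measurement time on the Figure 5 axis (assumed to apply here)

def expected_injections(rate):
    """Approximate number of error injections over the measurement time."""
    return rate * SCALE * HOURS

for rate in (0.009, 0.031, 0.049):
    print(f"rate {rate}: about {expected_injections(rate):.1f} injections")
```

Under these assumptions the three rates give roughly 2.5, 8.7 and 13.7 injections, in
line with the "about 2 to 14" range and with the nine injections reported for a rate of
0.031.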
Figure 8. Near-Coincident Fault Discovery at Different Time-Window Sizes (Macroscopic)
(Vertical axis: probability of near-coincident faults.)

Figure 9. Near-Coincident Fault Discovery at Different Time-Window Sizes (Microscopic)

There exists a high degree of similarity between Figure 6 and Figure 10, which indicates
a high correlation between the existence of multiple errors and their discovery in
near-coincidence. In Figure 10, after the initial steep rise the plot starts to taper, as
in the multiple error probability case. But the saturation effect comes about slowly in
Figure 10, becoming apparent at a higher error occurrence rate than that seen for the
probability of multiple errors. This can be explained as follows: as the rate at which
latent errors are "swept out" increases, the probability of near-coincident fault
discovery can increase as a side effect. But beyond a certain error occurrence rate, the
rate of removal of latent errors from the system has more effect on the probability of
near-coincident fault discovery. Thus we find that although the behaviors of multiple
errors and near-coincident fault discovery are highly correlated, the saturation effect
on the probability of near-coincident fault discovery sets in more slowly than that for
multiple errors.
From Figure 10 we find that the percentage increase in probability for near-coincident
faults from its lowest value to its highest value is more than twice that of
multiple errors (Figure 6). This shows that the probability of near-coincident fault
discovery is more sensitive to change in error occurrence rate than the probability of
multiple errors.
Figure 10. Near-Coincident Fault Discovery at Different Error Occurrence Rates
SECTION 4 
CONCLUSIONS
Past studies have shown that multiple errors and near-coincident fault discovery
can seriously degrade the reliability of a system, even in a highly fault-tolerant
environment such as the FTMP. Usually there exists a delay between the generation of
an error due to a fault and its detection. Due to different error latencies, many "lurking"
errors may be present in the system, behaving as though the cause were a multiple fault.
Most error recovery mechanisms cannot handle multiple faults or very closely spaced
discovery of these faults. A sound understanding of multiple errors and near-coincident
fault discovery can aid the development of intelligent memory scrubbing techniques to
improve system reliability. This paper describes a methodology to study the behavior of
multiple errors and near-coincident fault discovery in the memory subsystem of a
shared memory multiprocessor.
The methodology is illustrated on a production multiprocessor system, the Alliant
FX/8, in an operating system environment for the "Cedar" supercomputer under real
concurrent workload conditions. A measurement period of five days was used to monitor
the behavior of multiple errors and near-coincident fault discovery in a simulated error
occurrence environment. Results are provided for a multiprocessing environment at
realistic error occurrence rates.
The results show that for a conservative error occurrence rate of one error per day,
there is a 25% chance that latent errors cause a multiple error condition. Thus one out of
four errors may manifest itself as a multiple error. It was also found that 8% of the
error manifestations are near-coincident in nature at the same error occurrence rate for a
time-window size of 50 microseconds. The probability of multiple errors tends to saturate
after a threshold error occurrence rate of approximately one error a day. The saturating
effect develops as the number of latent errors being "swept out" from the system
increases at higher error occurrence rates. This can lower the chances of a multiple
error condition in the system and is observed beyond the threshold error occurrence rate.
There is also a high degree of similarity between the behaviors of multiple errors and
near-coincident fault discovery, suggesting a strong correlation between the existence of
multiple errors and their discovery in near-coincidence. The saturation effect on the
probability of near-coincident fault discovery was seen to set in more slowly than that
for the probability of multiple errors. The reason is that as the rate at which latent
errors are swept out increases with the error occurrence rate, the probability of
near-coincident fault discovery increases as a side effect. The effect of time-window
size on the probability of near-coincident fault discovery was also studied: the
near-coincident fault probability increases monotonically with larger time-window sizes.
To further understand the behavior of multiple errors and near-coincident fault
discovery, more experimental research on other systems should be attempted. In the
future, the effects of transient errors on multiple error occurrence and near-coincident
fault discovery will also be investigated.
REFERENCES
[1] A. L. Hopkins, T. B. Smith, and J. H. Lala, “FTMP -  A highly reliable fault- 
tolerant multiprocessor for aircraft,” Proceedings of the IEEE, vol. 66, no. 10, pp. 
1221-1239, October 1978.
[2] J. H. Lala, “Fault detection, isolation and reconfiguration in FTMP : methods and 
experimental results,” Proceedings of the IEEE National Aerospace Electronics, 
vol. 1, pp. 21.3.1-21.3.9, 1984.
[3] D. Kuck, D. Lawrie, A. Sameh, and D. Gajski, “CEDAR -  A large scale 
multiprocessor,” Proceedings of the 1983 International Conference on Parallel 
Processing, pp. 524-529, August 1983.
[4] J. McGough, “Effects of near-coincident faults in multiprocessor systems,”
Proceedings of the IEEE/AIAA Fifth Digital Avionics Systems Conference, pp.
16.6.1-16.6.7, 1983.
[5] K. G. Shin and Y. H. Lee, “Measurement and application of fault latency,” IEEE
Transactions on Computers, vol. C-35, no. 4, pp. 370-375, April 1986.
[6] F. L. Swern, S. J. Bavuso, A. L. Martensen, and P. S. Miner, “The effects of latent
faults on highly reliable computer systems,” IEEE Transactions on Computers,
vol. C-36, no. 8, pp. 1000-1005, August 1987.
[7] J. McGough and F. L. Swern, Measurement of Fault Latency in a Digital Avionic
Miniprocessor Part II. NASA Contractor Report 3651, 1983.
[8] R. Chillarege and R. K. Iyer, “Measurement-based analysis of error latency,” 
IEEE Transactions on Computers, vol. C-36, no. 5, pp. 529-537, May 1987.
[9] R. Chillarege and R. K. Iyer, “The effect of system workload on error latency: an 
experimental study,” ACM SIGMETRICS Conference on Measurement and 
Modeling of Computer Systems, pp. 69-77, 1985.
[10] Alliant FX/Series Product Summary. Alliant Computer Systems Corporation, 
Littleton, MA, October 1986.
[11] Alliant FX/Series Architecture Manual. Alliant Computer Systems Corporation,
Littleton, MA, October 1986.
[12] M. C. Hsueh, R. K. Iyer, and D. J. Rossetti, “Measurement and modeling of 
computer reliability as affected by system activity,” ACM Transactions on 
Computer Systems, vol. 4, no. 3, pp. 214-237, August 1986.
[13] R.K. Iyer and D.J. Rossetti, “A statistical load dependency of CPU errors at 
SLAC,” Proceedings of the 12th International Symposium on Fault Tolerant 
Computing, pp. 363-372, 1982.
[14] DAS 9200 System & Module A60 User's Guide. Tektronix, Beaverton, OR, 1987.
APPENDIX 
THE SIMULATOR
The simulation environment consisted of three separate simulators: ELS, the Error
Latency Simulator; MES, the Multiple Error Simulator; and NCFS, the Near-Coincident
Fault Simulator. Figure 11 shows how the simulation environment is set up.
ELS uses the concurrent workload address trace (RCW) and the error injector (EI, set to
a particular error injection rate) as inputs to monitor the discovery behavior of faults.
The ELS is a real-time simulator whose clock is incremented by both the time stamps in
the address trace and the sampling times.

Figure 11. The Simulation Environment
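The role of ELS can be pictured with a small trace-replay sketch. It rests on an
assumption that is not stated in this appendix: an injected error is taken to be
discovered at the first trace access to its address at or after the injection time. The
function, the record layout and the sample trace are hypothetical, and sampling times
are ignored.

```python
def simulate_latency(trace, injections):
    """Replay an address trace and report when injected errors are discovered.

    trace:      list of (time, address) accesses, sorted by time.
    injections: list of (inject_time, address) pairs.
    Returns (address, inject_time, discover_time, latency) for each discovered
    error. Assumed rule: discovery happens at the first access to the address
    at or after the injection time.
    """
    pending = {}                 # address -> inject times still latent
    ordered = sorted(injections)
    k = 0
    discovered = []
    for time, address in trace:
        # Activate every injection that has occurred by this point in the trace.
        while k < len(ordered) and ordered[k][0] <= time:
            pending.setdefault(ordered[k][1], []).append(ordered[k][0])
            k += 1
        # An access to a corrupted address discovers all errors latent there.
        for t_inject in pending.pop(address, []):
            discovered.append((address, t_inject, time, time - t_inject))
    return discovered

# Hypothetical trace and injections (times in clock units).
trace = [(2, 0xA), (5, 0xB), (9, 0xA), (12, 0xC)]
injections = [(1, 0xA), (1, 0xB), (6, 0xA), (6, 0xD)]
for record in simulate_latency(trace, injections):
    print(record)
```

In this sketch the error injected at address 0xD is never accessed, so it stays latent
for the whole trace and produces no output record.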
Table 2. Error Latency Simulator Output

Er. Latency   Address   Er. Inject Time   Er. Discover Time   Er. Inject Rate
10905974.0    102F00    1.0               10905975.0          0.030
10925851.0    1D7600    1.0               10925852.0          0.030
110409.0      01FB00    1.0               110410.0            0.030
1117435.0     04D500    1.0               1117436.0           0.030
11618402.0    0478F2    1.0               11618403.0          0.030
Table 2 shows a portion of a typical output from ELS. All times are given in time
units of the real clock. We note that all the errors shown in Table 2 originated from the
same error injection, but each one has a different error latency.
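The relationship among the Table 2 columns can be checked with a short sketch: in every
row shown, the latency equals the discovery time minus the injection time. The parser
and field names are illustrative assumptions; only the five rows come from Table 2.

```python
ELS_ROWS = """\
10905974.0 102F00 1.0 10905975.0 0.030
10925851.0 1D7600 1.0 10925852.0 0.030
110409.0 01FB00 1.0 110410.0 0.030
1117435.0 04D500 1.0 1117436.0 0.030
11618402.0 0478F2 1.0 11618403.0 0.030
"""

def parse_els(text):
    """Parse whitespace-separated ELS rows into dictionaries (hypothetical layout)."""
    records = []
    for line in text.splitlines():
        latency, address, t_inject, t_discover, rate = line.split()
        records.append({
            "latency": float(latency),
            "address": int(address, 16),   # addresses are printed in hexadecimal
            "inject": float(t_inject),
            "discover": float(t_discover),
            "rate": float(rate),
        })
    return records

records = parse_els(ELS_ROWS)
for r in records:
    # Latency is the gap between injection and discovery.
    print(hex(r["address"]), r["discover"] - r["inject"] == r["latency"])
```

All five rows share the injection time 1.0, which matches the remark that they come from
the same error injection.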
The multiple error simulator MES uses the output generated by ELS and examines 
the latency profiles to find multiple error situations. The MES calculates the multiple 
error probability seen at a given error injection rate.
Table 3 shows a portion of the MES output at an error injection rate of 0.30; this
corresponds to 9 error injections over the complete measurement period. Note that the
time for the last fault injection is not required, as it does not have any multiple error
probability. We see that for any one source error injection, its corresponding multiple
error probability with respect to other error injections decreases. From these data, MES
calculates the overall multiple error probability at the given fault injection rate.
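The per-source trend in the Table 3 rows can be inspected with a small sketch that groups
the rows by source injection. The grouping helper is an illustrative assumption; how MES
combines these values into the overall probability is not specified here and is not
modeled.

```python
from collections import defaultdict

# (source injection, destination injection, multiple error probability), from Table 3.
MES_ROWS = [
    (0, 1, 0.431487), (0, 2, 0.437318), (0, 3, 0.369546), (0, 4, 0.060724),
    (1, 2, 0.834617), (1, 3, 0.422660), (1, 4, 0.048530),
    (2, 3, 0.388702), (2, 4, 0.051163),
    (3, 4, 0.203210), (3, 5, 0.038282),
]

def by_source(rows):
    """Group (destination, probability) pairs under their source injection."""
    groups = defaultdict(list)
    for source, dest, prob in rows:
        groups[source].append((dest, prob))
    return {source: sorted(pairs) for source, pairs in groups.items()}

groups = by_source(MES_ROWS)
for source, series in groups.items():
    first, last = series[0][1], series[-1][1]
    print(f"source {source}: {first:.3f} -> {last:.3f}")
```

For every source in this excerpt the last value is lower than the first, although the
decline is not strict at every step: source 0 rises slightly from destination 1 to
destination 2 before falling.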
Table 3. Multiple Error Simulator Output

Er. Inject. Time   Source Inject no.   Dest. Inject no.   Mult. Prob.
1.0                0                   1                  0.431487
1.0                0                   2                  0.437318
1.0                0                   3                  0.369546
1.0                0                   4                  0.060724
2573348352.0       1                   2                  0.834617
2573348352.0       1                   3                  0.422660
2573348352.0       1                   4                  0.048530
3340261376.0       2                   3                  0.388702
3340261376.0       2                   4                  0.051163
5145478656.0       3                   4                  0.203210
5145478656.0       3                   5                  0.038282

NCFS also uses the output from ELS (Table 2) and observes, along real time and for a
specific time-window size, the probability of near-coincident faults. NCFS uses the
discovery times of errors (from different error injections) to calculate the number of
near-coincident faults. Table 4 shows a portion of the output from NCFS for time-window
sizes ranging from 1 microsecond to 10 microseconds. From these data the probability of
near-coincident faults can be calculated at a given error injection rate.
Table 4. Near-Coincident Fault Simulator Output

Time-window size   No. of NCF   Total Errors Injected
1                  158          2697
2                  158          2697
3                  158          2697
4                  158          2697
5                  158          2697
6                  159          2697
7                  159          2697
8                  159          2697
9                  159          2697
10                 159          2697
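One plausible way to turn the Table 4 counts into a probability is sketched below. It
assumes the probability is the number of near-coincident faults divided by the total
number of errors injected; that ratio is an assumption made for illustration, since the
exact formula is not given in this appendix.

```python
# (time-window size in microseconds, no. of NCF, total errors injected), from Table 4.
NCFS_ROWS = [
    (1, 158, 2697), (2, 158, 2697), (3, 158, 2697), (4, 158, 2697), (5, 158, 2697),
    (6, 159, 2697), (7, 159, 2697), (8, 159, 2697), (9, 159, 2697), (10, 159, 2697),
]

def ncf_probability(n_ncf, total_errors):
    """Assumed estimate: fraction of injected errors counted as near-coincident."""
    return n_ncf / total_errors

probabilities = {window: ncf_probability(n, total) for window, n, total in NCFS_ROWS}
for window, p in probabilities.items():
    print(f"{window:2d} us: {p:.4f}")
```

Under this assumption the estimate is about 0.059 across the 1 to 10 microsecond range,
with a single small step between window sizes 5 and 6, which is the stepwise behavior
described for Figure 9.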
