Fault and Error Latency Under Real Workload: an Experimental Study by Chillarege, Ram
August 1986 UILU-ENG-86-2230
CSG-55 
COORDINATED SCIENCE LABORATORY 
College of Engineering 
(NASA-CR-179802) FAULT AND ERROR LATENCY UNDER REAL WORKLOAD:
AN EXPERIMENTAL STUDY Ph.D. Thesis (Illinois Univ.,
Urbana-Champaign.) 100 p CSCL 09B N87-10733 Unclas G3/62 44301
FAULT AND ERROR LATENCY 
UNDER REAL WORKLOAD--
AN EXPERIMENTAL STUDY
Ram Chillarege 
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
Approved for Public Release. Distribution Unlimited.
https://ntrs.nasa.gov/search.jsp?R=19870001300 2020-03-20T13:54:59+00:00Z
© Copyright by Ram Chillarege
1986
FAULT AND ERROR LATENCY UNDER REAL WORKLOAD 
- AN EXPERIMENTAL STUDY
BY 
RAM CHILLAREGE 
B.Sc., University of Mysore, 1974
B.E., Indian Institute of Science, 1977
M.E., Indian Institute of Science, 1979
THESIS 
Submitted in partial fulfillment of the requirements 
for the degree of Doctor of Philosophy in Electrical Engineering 
in the Graduate College of the 
University of Illinois at Urbana-Champaign, 1986
Urbana, Illinois
FAULT AND ERROR LATENCY UNDER REAL WORKLOAD 
- AN EXPERIMENTAL STUDY
Ram Chillarege, Ph.D.
Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign, 1986
This thesis demonstrates a practical methodology for the study of fault 
and error latency under real workload. This is the first study that measures 
and quantifies the latency under real workload and fills a major gap in the 
current understanding of workload-failure relationships. The methodology is 
based on low level data gathered on a VAX 11/780 during the normal
workload conditions of the installation. Fault occurrence is simulated on the
data, and the error generation and discovery process is reconstructed to
determine latency. The analysis proceeds to combine the low level activity 
data with high level machine performance data to yield a better understanding 
of the phenomenon. This study finds a strong relationship between latency and 
workload and quantifies the relationship. The sampling and reconstruction 
techniques used are also validated. 
Error latency in the memory where the operating system resides is studied
using data on physical memory access. These data are gathered through
hardware probes in the machine that sample the system during the normal
workload cycle of the installation. The technique provides a means to study
the system under different workloads and for multiple days. These data are
used to reconstruct the error discovery process in the system. An approach to
determine the fault miss percentage is developed and a verification of the entire
methodology is also performed. This study finds that the mean error latency,
in the memory containing the operating system, varies by a factor of 10 to 1 (in
hours) between the low and high workloads. It is also found that of all errors
occurring within a day, 70% are detected in the same day, 82% within the
following day, and 91% within the third day.
Fault latency in the paged sections of memory is determined using data
from physical memory scans. Fault latency distributions are generated for
s-a-0 and s-a-1 permanent fault models. Results show that the mean fault
latency of a s-a-0 fault is nearly 5 times that of the s-a-1 fault. Performance
data gathered on the machine are used to study a workload-latency behavior.
An analysis of variance model to quantify the relative influence of various
workload measures on the evaluated latency is also given.
Error latency in the microcontrol store is studied using data on the
microcode access and usage. These data are acquired using probes in the
microsequencer of the CPU. It is found that the latency distribution has a large
mode between 50 and 100 microcycles and two additional smaller modes. It is
interesting to note that the error latency distribution in the microcontrol store
is not exponential as suggested by other reported research.
ACKNOWLEDGEMENTS 
I would like to express my gratitude and appreciation to my thesis advi-
sors, Professors Ravi Iyer and Edward Davidson. I have learnt from them, 
more than what a mere academic interaction would have ordinarily provided, at
different times and in different ways. Their support, friendship, and guidance 
have been invaluable. Professor Iyer has spent many long hours in discussing 
the research and in helping put this thesis together. 
It has been a pleasure to be a part of the Computer Systems Group, and I 
thank all the members, past and present, staff and students, for the stimulating 
environment and their friendship. Although there are a number of people that
I owe thanks to, I would like to mention some in particular: Ron Odom and
Joel Emer for the timely loan of measurement equipment; Rick Berry, Subhasis
Laba, and Geoff McNiven for their help during the instrumentation of the VAX;
and Bill Rogers for the slides. Jackie Ziemer, Paula Pachciarz, and Katy
Lindquist have been very helpful in our office.
Kelly, my wife, has been a constant source of encouragement and
sustenance. I met her during the course of this research and she has been a
significant influence towards this end. I thank her for being a friend and a
partner.
I am very grateful to my parents for their selfless pursuit to help me get a 
higher education. Their encouragement and belief have meant a lot to me. My 
sister, Meera and her husband, Chandra, have always been there whenever I 
needed them. 
You cannot see the seer of sight,
you cannot hear the hearer of hearing,
you cannot think the thinker of thought,
you cannot know the knower of knowledge.
This is your self that is within all.
- Brhadaranyaka Upanishad
TABLE OF CONTENTS
CHAPTER
1. INTRODUCTION
1.1. Goal of the Thesis
1.2. Thesis Outline
2. LATENCY AND ITS MEASUREMENT
2.1. Fault, Error, and Failure
2.2. Fault and Error Latency
2.3. Classical Failure Analysis
2.4. Earlier Latency Studies
2.5. Methodology of Measurement
2.5.1. Counter-based techniques
2.5.2. Trace-based techniques
3. ERROR LATENCY
3.1. Introduction
3.2. Instrumentation
3.2.1. The system
3.2.2. Experimental setup
3.3. Measurement
3.4. Error Latency Determination
3.4.1. Fault model and latency calculation
3.4.2. Algorithm implementation
3.5. Error Latency Distributions
3.5.1. Faults at low and high workloads
3.5.2. Multiple day measurement
3.5.3. Latency and hazard
3.6. Validation and an Analysis of Fault-miss Percentage
3.6.1. The effect of class size and sampling factor
3.6.2. Fault-miss percentage
3.6.3. Verifying the miss percentage estimation
3.7. Summary
4. FAULT LATENCY
4.1. Introduction
4.2. Computation of Fault Latency
4.2.1. Fault latency calculation
4.2.2. Estimating error latency
4.3. The Experiment
4.3.1. ... scanner
4.3.2. The experimental setup
4.4. Latency Distributions
4.4.1. S-A-0 and S-A-1 distributions
4.4.2. Workload-latency model
4.5. Discussion and Significance of Results
4.6. Summary
5. ERROR LATENCY IN THE MICROCONTROL STORE
5.1. Introduction
5.2. Instrumentation
5.2.1. The microsequencer
5.2.2. Data acquisition
5.3. Measurement and Analysis
5.3.1. Microcode usage distribution
5.3.2. Interaccess time
5.4. Error Latency Calculation
5.5. Summary
6. CONCLUSIONS
6.1. Summary and Discussion of Results
6.2. Suggestions for Future Research
REFERENCES
VITA
LIST OF FIGURES
FIGURE
2.1. Fault and Error Latency
3.1. VAX 11/780 System Organization
3.2. SBI Information Transfer Group Fields
3.3. Acquired Data in Regular Mode
3.4. Decoding the Data
3.5. Memory Usage Histogram
3.6. User CPU Usage Versus Time of Day
3.7. Latency Time Calculation
3.8. Acquisition and Processing of Data
3.9. Error Latency Distribution - Fault at 00:00 hrs
3.10. Error Latency Distribution - Fault at 12:00 hrs
3.11. Error Latency Distributions for 2 Consecutive Days
3.12. Hazard Rates for 2 Consecutive Days
3.13. Miss Percentage Versus Class Size
3.14. Miss Percentage Versus Sampling Factor
3.15. Verification: Miss Percentage Versus Class Size
3.16. Verification: Miss Percentage Versus Sampling Factor
3.17. Variation in Miss Percentage Between Regions
4.1. Latency Calculation
4.2. Acquisition and Processing of Data
4.3. S-A-0 Latency Distributions
4.4. S-A-1 Latency Distributions
4.5. Pie Chart Showing Workload-latency Relationship
4.6. Plot of Three Workload Measures
5.1. Usage Distribution for the Microcontrol Store
5.2. Microword Interaccess Time Distribution
5.3. Error Latency Time Calculation
5.4. Error Latency Distribution for the Microcontrol Store
CHAPTER 1 
INTRODUCTION 
The widespread dependence on computing resources in our society has 
made reliability and integrity central issues in computer system design. An 
important prerequisite in designing for reliability is an understanding of the 
effect of faults in the system and their manifestation. Due to the complex
nature of a system, its behavior under fault is not easy to comprehend.
This thesis concerns itself with the discovery process of faults in the
system and how it is affected by the workload. Since a system has components
that are not always utilized, there is usually a time delay between the
occurrence of a fault and its manifestation as a malfunction. This time delay is
defined as the fault or error latency. An analytical study of latency is
unfeasible at this stage, due to a lack of understanding of the complex
interactions involved with fault occurrence and manifestation. Hence, a
systematic experimental methodology is adopted in this thesis for studying
latency and the associated issues.
The understanding of fault and error latency has many ramifications in
reliable system design. Primarily, the knowledge of latency is essential for
accurate reliability prediction. Large latencies may result in multiple errors,
thereby rendering many detection and recovery mechanisms ineffective.
Furthermore, knowledge of latency is essential to design effective rollback
recovery mechanisms. This is necessary since rolling back less than the latency
time may result in repeated failures due to corrupt information.
Another motivation for determining latency arises due to the dependency
of failures on system activity as reported in [1, 2]. These studies reported a
sharp decline in the reliability of computing systems at high utilization. One
possible cause of this phenomenon is the latent discovery of faults and errors.
Latent discovery suggests that faults occur randomly and an increase in
workload reveals the faults, thus resulting in a noticeably higher failure rate at
higher loads. Hence, it becomes imperative to study fault and error latency
under real workload conditions.
1.1. Goal of the Thesis 
The primary focus of this research is to develop and demonstrate an exper-
imental approach to study fault and error latency in a machine under real 
workload. 
Fault and error latency are fundamental issues in determining the reliabil-
ity of large systems and have not been well understood because of the complex 
nature of the system. Measurements to determine latency become even more 
complicated when they need to be performed on a production installation. 
This research is timely since the present growth of computers is headed
toward multiple machines or co-operating clusters of machines where problems
due to latency are crucial. The strong emphasis on high availability computing
further necessitates a good understanding of the fundamentals of the fault and
error discovery process in machines.
1.2. Thesis Outline 
Chapter 2 defines fault and error latency and their ramifications. A sur-
vey of earlier latency studies and their limitations is presented. The methodol-
ogy adopted in this thesis for the measurement of latency under real workload 
follows. The subsequent three chapters are each self-contained and present 
research on latency in specific subsystems of the machine under study. 
Chapter 3 presents a study of error latency in the operating system region
of the memory. The instrumentation used, and the measurement and
computation of error latency, are discussed, followed by the error latency
distributions that are generated. The validation of the technique is given,
followed by an estimation of the fault miss percentage, i.e., the chance that a
fault is not discovered, an important parameter associated with error latency
that can only be estimated in an experimental setup.
Chapter 4 illustrates fault latency in the paged regions of memory. The 
computation of fault latency and the experimental setup are discussed followed 
by the latency distributions. The latency information taken together with 
workload information is further used to develop a workload-latency model. 
Chapter 5 examines error latency in the microcontrol store of the
processor. The experimental setup is described followed by the analysis and latency
distributions. This experimental setup is versatile and can be used for a variety
of performance studies as well. The applicability of the data and
instrumentation are also discussed.
Finally, Chapter 6 presents a summary of this research and describes the
future scope of this work.
CHAPTER 2 
LATENCY AND ITS MEASUREMENT 
2.1. Fault, Error, and Failure 
There has been considerable confusion in terminology regarding fault,
error, and failure. In an attempt to provide a conceptual framework for
expressing the attributes that constitute dependable and reliable computing,
these terms have been informally but precisely defined in [3]. The basic
definitions of the terminology are best quoted from the text:
The service delivered by a system is the system behavior as it is perceived by
another special system(s) interacting with the considered system: its user(s).
A system failure occurs when the delivered service deviates from the specified
service, where the service specification is an agreed description of the expected
service. The failure occurred because the system was erroneous: an error is that
part of the system state which is liable to lead to failure, i.e., to the delivery of
a service not complying with the specified service. The cause -- in its
phenomenological sense -- of an error is a fault [3].
This definition, although precise, is general enough to be applied to a wide
variety of systems or applications. In addition, the definitions can also be
suitably interpreted in different planes of applicability, such as the physical,
electrical or logical planes. This thesis deals with stuck-at fault models that
refer to the logical plane. This fault model is widely used and is representative
of a number of physical faults. Since it is a commonly used fault model, the
results from this study provide an insight which is useful to a large body of
applications.
2.2. Fault and Error Latency 
The time between the occurrence of the fault and the time of its
discovery, i.e., the failure, is defined to be its total latency. The fault model
chosen for this study is the single stuck-at fault model. This fault model refers
to the logical plane and can be caused by a variety of malfunctions at the
physical or electrical plane.
In order to express the effects seen at the logical plane two new terms are
introduced, namely, active and inactive faults. Using Figure 2.1 for
illustration, consider a bit in a word containing data with a value of 1. If a s-a-1 fault
occurs on that bit, then the fault cannot cause a failure. This fault is called
inactive and is latent. If during a write into the word the new data attempts to
change the value of this bit to a 0, then the fault becomes active. An active fault
is defined as an error since it is that part of the system state which is liable to
lead to failure. During a subsequent read operation, either the Error Correcting
Code (ECC) will detect and correct the error, or, lacking ECC, a failure will
occur. The time taken for the inactive fault to become active is defined as fault
latency. The time taken for the error (an active fault) to cause a failure is
defined as error latency. The sum of the error and fault latency is the total
latency. If the bit in Figure 2.1 originally contained a value of 0, then the
fault is active at the time of its occurrence and hence has a fault latency of
zero. Note that the fault latency can be zero, but the error latency is always
nonzero.
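The definitions above can be illustrated with a minimal sketch that replays a trace of reads and writes against a single bit carrying a stuck-at fault. The names `Access` and `stuck_at_latencies` are hypothetical and are not taken from the thesis. The sketch further assumes that a write of the stuck value returns the fault to the inactive state, and that both latencies are measured relative to the activation that immediately precedes the failing read.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Access:
    time: float      # timestamp of the access, in any consistent unit
    kind: str        # "W" for a write, "R" for a read
    value: int = 0   # bit value being written; ignored for reads


def stuck_at_latencies(
    initial_bit: int,
    stuck_at: int,
    fault_time: float,
    trace: Iterable[Access],
) -> Tuple[Optional[float], Optional[float]]:
    """Return (fault_latency, error_latency); None where the event never occurs."""
    ordered = sorted(trace, key=lambda a: a.time)

    # Intended bit value at the moment the fault occurs.
    stored = initial_bit
    for acc in ordered:
        if acc.time < fault_time and acc.kind == "W":
            stored = acc.value

    # The fault is active immediately if the intended value differs
    # from the stuck value (fault latency of zero).
    activated_at = fault_time if stored != stuck_at else None

    for acc in ordered:
        if acc.time < fault_time:
            continue
        if acc.kind == "W":
            if acc.value == stuck_at:
                activated_at = None          # assumption: fault is inactive again
            elif activated_at is None:
                activated_at = acc.time      # inactive fault becomes an error
        elif acc.kind == "R" and activated_at is not None:
            # Reading an active fault is the failure (no ECC assumed).
            return activated_at - fault_time, acc.time - activated_at

    fault_latency = None if activated_at is None else activated_at - fault_time
    return fault_latency, None


# The case of Figure 2.1: bit holds 1, s-a-1 fault at t=0, write of 0 at t=5, read at t=8.
example = stuck_at_latencies(1, 1, 0.0, [Access(5.0, "W", 0), Access(8.0, "R")])
```

For the Figure 2.1 case the fault latency is the interval up to the write and the error latency is the interval from the write to the read. When the bit initially holds 0, the same function returns a fault latency of zero.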
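Figure 2.1. Fault and Error Latency. (Timeline for a bit holding the value 1
with a s-a-1 fault: the fault is inactive until a WRITE attempts to change the
bit to 0, which ends the fault latency; the active fault, or error, persists until a
READ causes a failure if there is no ECC, which ends the error latency. The
total latency spans both intervals.)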
2.3. Classical Failure Analysis 
A number of studies on failure data obtained from computer installations
have been performed to study machine reliability, failure trends, patterns of
failure, etc. The data on failures are usually obtained from machine-recorded
information in cases when such error logs exist and from operator-recorded
information when they do not. Since such data only contain information on
the detected failures, nothing is known about failures that were undetected.
Additionally, the moment of fault occurrence and error generation is not
observable from these data. Thus, the studies are usually limited in their scope
to studying the failure event and the associated history of other failures and
the environment under which these failures occurred.
A variety of interesting studies has been performed using machine failure 
data. These studies have analyzed failures that occur in different parts of the 
system and also separately analyzed hardware and software failures. 
2.4. Earlier Latency Studies 
There have been a number of studies on latency; however, there have been
no studies using real workloads. There is no general technique for the
measurement of latency. A gate-level emulation of an avionic miniprocessor is reported
in [4] and [5]. A set of special programs was used to exercise the machine. The
programs do not, however, represent a real workload environment. Therefore,
the methodology and results are not generally applicable. Another similar
experiment is described in [6]. The delay between the occurrence of an error
and the moment of its detection is defined as detection time in [7, 8] and as
latency difference in [9]. In [8], Courtois presents a methodology for on-line
testing of microprocessors and develops the distribution of detection time of
failures affecting the heart of the M6800 CPU die. In [10], Shedletsky shows
that error latency is a geometrically distributed random variable for a very
general class of faults in combinational digital circuits. Most studies have
almost always measured error latency or the sum of both fault and error
latency. A significant attempt at determining fault latency by Shin is [11]. The
authors use an indirect technique to estimate fault latency at the pins of the
chips in the CPU of the Fault Tolerant Multiprocessor (FTMP). Since the exact
moment when the fault becomes active is not known, the technique provides
only an upper bound for fault latency.
All the studies so far reported have used specific programs or fault
injection on special purpose machines. Thus there have been no studies that measure
latency in a real workload environment. This study is the first study that
measures both fault and error latency under a real workload. These error
latency measurements, in the unpaged section of the operating system under a
real workload, are reported in [12]. Further work on this including the
validation of the technique appears in [13]. The measurements on fault latency in the
paged sections of memory are reported in [14].
2.5. Methodology of Measurement
Hardware faults mostly occur at a device level; hence, measurements to
determine fault or error latency have to be performed at a low level. The
measurement must acquire data that contain the change of state which activates the
fault and propagates the error condition. There are a few alternatives for this 
measurement, and they predominantly use hardware probes to gather the data. 
In certain cases a software probe can also be used. Data acquisition with probes 
is usually complicated by physical problems due to restricted access to circui-
try and circuit integration. 
The key to making such measurements is to define appropriate probe points
that will yield the most relevant data for the specific study. The choice of such
probe points is an important issue in the design of the experiment. The probe
points are typically defined from a functional standpoint; in practice, however,
the choice is a trade-off among the points that are actually accessible. Since the
volume of such data can be very large, the choice should also take into
consideration whether the instrumentation used can handle that volume. The
final design of the instrumentation is predominantly influenced by the available
instruments. Although it is conceivable that instruments are designed to suit
the requirements of the measurement, in reality, the availability of
instruments can in fact dominate the project.
There exist two popular schemes for hardware measurements. One of
them is counter based and the other trace based. The applicability of each is
very dependent on the type of problem. In certain cases either can be used with
some degree of adaptability. For this thesis both types of instruments were
available; however, only the trace based instrument was used.
2.5.1. Counter-based techniques 
Counter-based techniques essentially use counting to make measurements
on a certain set of events [15, 16]. A large number of counters, typically
ranging anywhere from 128 to as large as 16K, are used for this purpose.
The probes along with some encoding/decoding logic detect events in the
machine that need to be counted. At the end of the measurement period the
counters provide a histogram of the various events that were counted. The
period of time that the system can be observed continuously is limited only by
the length of the counters. Hence, the maximum sampling interval is the time
taken by the most frequent event to overflow the counter. The instrument
generally provides for backup of the counters and reinitialization.
The advantage that this technique provides is the large sampling interval
since providing counters with larger length is relatively simple. Additionally,
it is practical to provide a large number of counters since they can be made
using memory. The major drawback of the technique is the fact that it is based
on counting. Counting necessitates that all the events to be measured are
known a priori and are finite. Further, the timing and history information
associated with the events are lost.
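The trade-offs of the counter-based technique can be made concrete with a small sketch. The function names, the event labels, and the assumption of a constant event rate are illustrative only and do not describe any particular instrument. The first function shows that only events declared a priori are counted and that the result is a histogram with no timing or ordering information. The second computes the maximum sampling interval as the time for the most frequent event to overflow a counter of a given width.

```python
from collections import Counter
from typing import Dict, Iterable


def count_events(events: Iterable[str], known_events: Iterable[str]) -> Dict[str, int]:
    """Histogram of pre-declared events; undeclared events are simply not seen."""
    known = set(known_events)
    return dict(Counter(e for e in events if e in known))


def max_sampling_interval(counter_bits: int, rates_per_second: Dict[str, float]) -> float:
    """Seconds until the busiest counter overflows, assuming constant event rates."""
    if not rates_per_second:
        raise ValueError("at least one event rate is required")
    peak_rate = max(rates_per_second.values())
    # A counter of n bits overflows on its 2**n-th event.
    return (2 ** counter_bits) / peak_rate


# Hypothetical event stream: the "dma" event was not declared, so it is lost.
histogram = count_events(["read", "write", "read", "dma"], ["read", "write"])

# Hypothetical rates: 16-bit counters with the busiest event at 1024 events/s.
interval = max_sampling_interval(16, {"read": 1024.0, "write": 256.0})
```

Doubling the counter width squares the counter capacity, which is why long sampling intervals are inexpensive to obtain with this technique.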
2.5.2. Trace-based techniques 
Trace-based techniques, as the name suggests, provide a trace of events or
data [15, 11]. The data from the probes do not necessarily have to be decoded
or encoded to be stored in a buffer memory. Thus, the maximum sampling
period is determined by the depth of the buffer memory. The start of the
sampling may, however, be triggered through a sequence of events. The data
may also be qualified with logic so that only a subset of the data seen by the
probes is recorded in storage. These instruments also provide for backup of the
buffer and reinitialization.
The advantage of this technique is that it provides a faithful
representation of events or data as they occur in the machine. Since a priori knowledge of
specific events is not necessary, as it is in the case of the counter-based
technique, this is an excellent method for exploratory studies. The data contain
history and timing information which are valuable. However, since the buffer
memory has only a finite storage, the data acquisition is forced into being a
sampling system. This impacts the measurement technique by requiring the
sampling method to be validated.
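The behavior of a trace-based instrument can be sketched as a finite buffer with a trigger and a qualifier. This is a simplified model under stated assumptions: the name `capture_trace` is hypothetical, the trigger is reduced to a single predicate rather than a sequence of events, and each sample is taken to be a (timestamp, probe word) pair.

```python
from typing import Callable, Iterable, List, Tuple

Sample = Tuple[int, int]  # (timestamp, raw probe word); an assumed record layout


def capture_trace(
    samples: Iterable[Sample],
    depth: int,
    trigger: Callable[[Sample], bool],
    qualifier: Callable[[Sample], bool],
) -> List[Sample]:
    """Record qualified samples, in order, from the trigger until the buffer is full."""
    buffer: List[Sample] = []
    armed = False
    for sample in samples:
        if not armed:
            armed = trigger(sample)      # sampling starts at the trigger
            if not armed:
                continue
        if qualifier(sample):            # only a subset of the data is stored
            buffer.append(sample)
            if len(buffer) == depth:     # finite buffer ends the sampling period
                break
    return buffer


# Hypothetical probe stream: 20 samples whose probe word cycles through 0..3.
stream = [(t, t % 4) for t in range(20)]
trace = capture_trace(
    stream,
    depth=3,
    trigger=lambda s: s[0] >= 5,
    qualifier=lambda s: s[1] == 0,
)
```

Unlike the counter histogram, the recorded samples keep their timestamps and order, but everything after the buffer fills is lost, which is why the acquisition becomes a sampling system.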
CHAPTER 3 
ERROR LATENCY 
3.1. Introduction 
The study of error latency is an important issue in fault tolerant
computing with significant implications in both reliability prediction and testing. The
time between the occurrence of a fault and its manifestation as an error has
been referred to as error latency [10, 3]. Many errors can only be detected
when a particular module or subsystem is exercised. Thus, although the
failures may not be caused by increased utilization, they are revealed by this
factor, causing a higher observed error rate due to increased workload. The
difficulties with the measurement of error latency are that the moment of error
generation is unknown and failure records only contain information on
detected errors.
There is, in addition, considerable experimental evidence to show that
computer reliability is a dynamic function of system activity (as measured by
the workload). Workload-failure studies [17, 1] (on IBM machines) and [18, 2]
(on DEC machines) provide evidence that CPU and memory failure rates
increase rapidly as the system workload approaches saturation. The cause-effect
relationship in this dependency is unknown; however, it is speculated that one
component in this relationship is due to latency [1]. An explicit model for this
is given in [10]. Another possible reason for the observed workload-failure
dependence is the stresses imposed by high currents and voltages. A model
based on this is given in [20].
There is no general technique for determining error latency under various
workload conditions. Studies on CPU fault latency for an avionic miniprocessor,
determined through a gate-level simulation, are described in [4, 21]. A set
of specific programs was used to exercise the machine to reveal faults that
were injected into the simulation. Another similar experimental study is found
in [6]. Although the approach and results are valid for the case studied, they
are not applicable to multi-user systems. Furthermore, it is not practical to
measure workload effects through such simulations. Similar studies that do not
use a real workload can be found in [22, 23, 24].
In this chapter, a methodology to study the latency characteristics of
medium-to-large computer systems is developed. The technique is applied to
the memory subsystem; however, the methodology, in principle, is also
applicable to the microcontrol store of the CPU. The scope and implication of
failure in memory go far beyond the memory subsystem. In addition to the largest
number of failures occurring in the memory [25], it has been shown that a
large number of the CPU errors are traced to originate from the memory [26].
This is the first attempt at jointly studying error latency and workload
variations in a full production environment. The method is based on sampled
data of physical memory activity gathered, through hardware instrumentation,
during the normal workload of the installation. The data are then used to
reconstruct the error discovery process in the system. The measured system is
a VAX 11/780 that runs the Unix Berkeley Version 4.2 operating system. The
system has 20 to 25 interactive users during the peak hours. The workload
comprises a variety of scientific and miscellaneous word and data processing
applications. The hardware instrumentation has the advantage of not biasing
the workload of the machine during measurement, and sampling provides an
effective means to work with the large volumes of data generated by observing
the system for multiple days. A detailed validation of the sampling technique
is performed, and it is shown that the approach can successfully predict the
percentage of undetected errors.
Section 3.2 discusses the instrumentation used, Section 3.3 the measurement,
and Section 3.4 the computation of error latency. Section 3.5 shows and
discusses the error latency distributions that are generated, and Section 3.6 the
validation of the technique. Section 3.6 also discusses the estimation of miss
percentage, an interesting attribute of error latency that can only be estimated
in an experimental setup.
3.2. Instrumentation 
3.2.1. The system 
The instrumentation is an interesting project in itself. As explained above,
for the purpose of this study, the emphasis is on memory activity. For the
purposes of studying error latency, physical memory activity needs to be
measured. This is only possible through hardware instrumentation and direct
access to the memory. The backplane of the VAX CPU was probed and the data
sampled by the instrumentation. The hardware instrumentation has the additional
advantage of not interfering with the regular workload of the system.
The VAX central processor and the memory subsystem are linked through
a data path called the Synchronous Backplane Interconnect (SBI). Figure 3.1
shows the organization of the machine, details of which are given in [27] and
[28]. The SBI is a parallel datapath that is multiplexed for address and data,
and uses a 200 nsec clock to achieve a maximum information transfer rate of
13.3 million bytes per sec.
The best approach for obtaining memory activity information on the VAX
is to monitor the SBI, through which all transactions occur. Requests to
memory can arise from either the CPU or from the I/O devices, and all of them
are transacted through the SBI. Therefore, monitoring the SBI captures all
requests to the memory subsystem. The address space on the SBI is partitioned
so that addresses to the main memory subsystem, Unibus subsystem, or other
adapters are unique, thus permitting accesses to the individual subsystems to
be extracted separately.
The SBI consists of 84 signal lines that belong to five different groups,
namely, arbitration, information transfer, response, interrupt, and control. The
information transfer group with 46 signal lines contains the memory activity
information. It is used to transfer addresses, data, and interrupt summary
information. This group is subdivided into five fields that represent parity
check (P), information tag (TAG), source or destination identification (ID),
masks (MASKS), and 32 bits of information lines (B), as in Figure 3.2.
[Figure 3.1: block diagram. Components shown: VAX 11/780 CPU with cache
memory, 4 MByte memory with memory controller, Mass Bus adapter, Unibus
adapter, disk controller, tape controller, and terminal controller, all on the
Synchronous Backplane Interconnect (SBI), with the DAS probes attached to
the SBI.]
Figure 3.1. VAX 11/780 System Organization.
(1) P field: The parity field of 2 bits provides even parity for detecting single
bit errors in the information transfer group. One of the bits provides parity
over the TAG, ID and MASK fields and the other over the B field.
(2) TAG field: The TAG field is 3 bits wide and indicates the information type
(being transmitted) on the information lines (B field). This field also
determines the interpretation of the ID and the B fields. For example, when
the tag code represents COMMAND ADDRESS, the B field contains the
address.
(3) ID field: The ID field of 5 bits is used to identify the logical source of the
data in a write command and the logical destination of the data in a read
command. The address of the location is contained in the B field.
(4) MASK field: The mask field is 4 bits wide and is used to specify operations
on any or all bytes of the data in the B field. Each bit in the mask field
corresponds to a particular byte in B.
(5) B field: The B field is 32 bits wide (4 bytes) and is used to carry
information/data. Depending on the TAG field, the 32 bits are interpreted
either as one data field of 32 bits or as containing two subfields: a FUNC
field of 4 bits which identifies read or write mode and an ADDRESS field
of 28 bits containing the physical address, which can be either main
memory or I/O.
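
The field widths listed above can be illustrated with a short decoding sketch. It is only a sketch under stated assumptions: the text gives the widths (2, 3, 5, 4, and 32 bits) but not the bit positions, so the packing order used here (P in the most significant bits, then TAG, ID, MASK, and B) and the placement of FUNC in the upper 4 bits of B are assumptions, and the names `TransferGroup`, `decode_transfer_group`, and `split_command_address` are hypothetical.

```python
from typing import NamedTuple, Tuple

# Field widths as listed in the text: 2 + 3 + 5 + 4 + 32 = 46 bits.
# ASSUMPTION: fields are packed most-significant-first in this order.
FIELD_WIDTHS = (("p", 2), ("tag", 3), ("id", 5), ("mask", 4), ("b", 32))


class TransferGroup(NamedTuple):
    p: int      # parity bits
    tag: int    # information type on the B lines
    id: int     # logical source or destination
    mask: int   # byte mask for the B field
    b: int      # 32 information bits


def decode_transfer_group(word: int) -> TransferGroup:
    """Split a 46-bit integer into the five information transfer fields."""
    total = sum(width for _, width in FIELD_WIDTHS)
    if not 0 <= word < (1 << total):
        raise ValueError("word does not fit in %d bits" % total)
    values = {}
    shift = total
    for name, width in FIELD_WIDTHS:
        shift -= width
        values[name] = (word >> shift) & ((1 << width) - 1)
    return TransferGroup(**values)


def split_command_address(b: int) -> Tuple[int, int]:
    """Split B into (FUNC, ADDRESS).

    ASSUMPTION: FUNC occupies the upper 4 bits and ADDRESS the lower 28.
    """
    return (b >> 28) & 0xF, b & 0x0FFFFFFF
```

Whether a given B value should be split this way depends on the TAG code, which is not enumerated in the text, so the caller is expected to make that decision.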
3.2.2. Experimental setup 
A Tektronix Digital Analysis System (DAS) 9100 Series was used to monitor
and sample data transfer activity on the SBI. The DAS probes used can
acquire data at sample intervals as short as 40 nsec, which is faster than the
clock period of the SBI (200 nsec). These probes were placed at the card edge
connector of the SBI control cards [29] where the SBI signals were accessible.
The data were read into the DAS using the SBI clock for external
synchronization.
The experiment was controlled from the VAX with the aid of Tektronix
91DVV1 software and some additional programs. The DAS is programmable
via an IEEE-488 interface, or an alternative serial line RS232c link to a host
machine. It is connected to the VAX via the serial line interface. The software
which controlled the experiment was such that it caused negligible overhead
and did not bias the experiment. The acquisition system has been tested for
data bias against itself. This is done by externally triggering the DAS, acquiring
the data, and comparing memory usage distributions generated by this data
with the distributions generated from automatically acquired data. It is found
that the instrumentation is sound and does not indicate any significant
influences of self-bias. The DAS was periodically triggered to acquire data
from the SBI, download the acquisition memory, and time-stamp the data.
This data was then preprocessed to make it compatible with subsequent input
into statistical analysis programs and archived on tapes.
The instrumentation was tested for correct operation and acquisition. This
was performed by taking the system down into single user operation, turning
the cache memory off, and running a test program that accesses specific
locations of memory in sequence. Data collected on the DAS was then examined
for correct acquisition against the known test program.
The acquisition memory of the DAS, for the probes used for this
instrumentation, has a depth of 511 words. Two types of data were collected. The
first is referred to as regular mode. This involves logging transactions of every
SBI cycle. A sample of data acquired in regular mode is shown in Figure 3.3.
The first line contains a time stamp for the sample. Each line of data represents
one SBI cycle. The data is in one's complement form. Note that there are a
number of idle cycles (all 1's). Figure 3.4 shows the decoded version of a single
observation. The second is referred to as compressed mode and consists of a
dense trace of addresses that are acquired by storing only those cycles that
contain addresses.
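
As a small illustration of the regular-mode format just described, the following sketch inverts the acquired bits of one cycle and discards idle cycles. It assumes each cycle arrives as a string of space-separated bit groups, as in the sample of Figure 3.3, with any trailing comment column already removed; the function names `parse_cycle` and `non_idle_cycles` are hypothetical. Dropping idle cycles is a simpler filter than the compressed mode, which keeps only the cycles that contain addresses.

```python
from typing import Iterable, List


def parse_cycle(bit_columns: str) -> int:
    """Convert one acquired cycle to an integer, undoing the inversion.

    The acquired data are in one's complement form, so every bit is flipped.
    An idle cycle (all 1's as acquired) therefore decodes to 0.
    """
    bits = "".join(bit_columns.split())
    if not bits or set(bits) - {"0", "1"}:
        raise ValueError("expected only 0/1 characters and spaces")
    raw = int(bits, 2)
    return raw ^ ((1 << len(bits)) - 1)


def non_idle_cycles(lines: Iterable[str]) -> List[int]:
    """Return the decoded value of every cycle that is not idle."""
    decoded = (parse_cycle(line) for line in lines)
    return [value for value in decoded if value != 0]
```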
3.3. Measurement 
The experiment collects data on memory activity, i.e., physical memory
address, access rate, and read/write mode. The data capture the memory activity
of the whole physical memory. However, for the purposes of this project, the
concentration is upon the region of memory where the operating system
resides, which is largely unpaged. It, therefore, provides an estimate of inherent
latency characteristics unaffected by paging. In addition, the errors in the
operating system can be fatal. The methodology, however, is equally valid for
both the
FIELDS   P   TAG   ID   MASK   B (FUNC, ADDRESS)
BITS     2   3     5    4      32 (4 + 28)
Figure 3.2. SBI Information Transfer Group Fields.
3D       3C       3B       3A       2C       2B       2A       COMMENTS
Fri Nov 23 12:15:11 CST 1984
11111111 11111111 11111111 11111111 11111111 11111111 11111111 idle 
10100101 10010101 11111010 01111111 11111111 11110011 10001111 cmd/adr 
11111011 00000010 11111111 11111111 11111111 11111011 11101111 data 
11111111 11111111 11111111 11111111 11111111 11111111 11111111 idle 
00101010 01001100 11110100 01111111 11111111 11111011 10001111 cmd/adr 
11111001 11001001 10001010 10000011 11111111 11111011 11101111 data 
11111111 11111111 11111111 11111111 11111111 11111111 11111111 idle 
Figure 3.3. Acquired Data in Regular Mode. 
[Figure 3.4: a single observation decoded into the TAG, ID, FUNC, ADDRESS,
CNF, MASK, and P fields, with unused bits marked, mapped onto probes 3D to
3A and 2C to 2A.]
Figure 3.4. Decoding the Data.
unpaged and paged portions of the memory. The possible alternatives in col-
lecting representative data were: 
(1) Collect data all the time.
(2) Sample the measured system sufficiently, so as to get a representative dis-
tribution of memory activity. 
The first is not only wasteful but also impractical, given the voluminous
nature of the data involved and the large buffer sizes which would be required
in the acquisition instrumentation. The second technique is adopted for its
versatility and ease of implementation. The data acquisition was performed at
intervals of approximately 40 seconds. The sampling is sufficiently frequent to
capture the workload behavior. In the preliminary analysis the distribution of
memory access stabilizes within 15 to 20 minutes of sampling. Figure 3.5
shows a memory usage histogram of the region studied that is generated from
the acquired data. This means that if two 15-minute samples are considered,
and the workload changes considerably during this period, it will be reflected in
the samples. Thus, the error in the measured latency distribution is limited to
less than 15 minutes. Figure 3.6 shows the user CPU usage as a function of the
time of day for a typical day. Notice that there is a significant variation in user
CPU usage during a day, which suggests that the effect of workload on latency
should be detectable. In examining the workload profiles, it can be seen that the
average workload can be considered to be reasonably stable in any 15-minute
period.
[Figure 3.5: percentage bar chart. Vertical axis: percentage. Horizontal axis:
address midpoint, 3k to 198k.]
Figure 3.5. Memory Usage Histogram.
[Figure 3.6: plot of user CPU usage against time of day in hours.]
Figure 3.6. User CPU Usage versus Time of Day.
3.4. Error Latency Determination 
For the purpose of this study, it is assumed that any location in the measured
region of memory is equally likely to fail. The assumption implies a uniform
failure behavior within the region, i.e., the workload variation within the
region does not cause any additional failures but merely influences the latency.
This allows the determination of the distribution of the discovery process,
without being biased by factors which cause faults. The memory addresses
chosen to contain a fault are picked from a uniform random distribution. It
should be noted that the methodology is equally applicable to other
distributions of failure.
3.4.1. Fault model and latency calculation 
The fault inserted is a flipped or inverted bit in a memory location. This is
chosen since it results in an active fault which will be detected, during a read
on the memory location, by the error detection and correction code (ECC). The
time between the occurrence of the active fault, i.e., flipped bit, and its
detection is the error latency of the fault (as discussed in Section 1.3). The fault
model chosen also conforms to the definitions proposed in the IFIP Working
Group 10.4.
Figure 3.7 shows the algorithm used to generate error latency distributions.
A random memory location, say m1, has a fault f1. Let the fault be
inserted at time t. The data are now scanned to find the first memory read to the
location m1. This is when the fault would be detected by the ECC circuitry. In
Figure 3.7, location m1 has three memory reads: one before and two after the
[Figure 3.7: timelines for two memory locations. Location m1 (fault f1) has
reads before and after the fault occurrence time, giving latency l1 up to the
first read at t1. Location m2 (fault f2) has no read after the fault occurrence
time, giving l2 = infinity.]
Figure 3.7. Latency Time Calculation.
fault f1. Let t1 be the time of the first memory read after the fault. Then the
latency associated with fault f1 is l1 = t1 - t. The same location may be
reaccessed (as in the figure) but that amounts to rediscovery and is not a part of
this latency study. If, however, the data set never contains a read to the
memory location in fault, then it goes undetected and amounts to a miss. For
example, fault f2 inserted in memory location m2 is never detected. The misses
are used to estimate the percentage of undiscovered faults. This process, of
inserting faults and determining the latency, is repeated for a large number of
faults, yielding a latency time for each inserted fault. The different latency
times, taken together, generate a distribution of the error latency for the fault
occurrence time t.
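
The latency calculation described above can be sketched as follows. This is a minimal illustration rather than the program used in the study: it assumes the trace has already been reduced to a sorted list of read times per memory location, it treats a read at exactly the fault time as not detecting the fault, and the names `error_latency` and `latency_distribution` are hypothetical.

```python
import bisect
import random
from typing import Dict, List, Optional, Sequence, Tuple


def error_latency(read_times: Sequence[float], fault_time: float) -> Optional[float]:
    """Latency from fault_time to the first later read, or None for a miss.

    read_times must be sorted in ascending order.
    ASSUMPTION: only reads strictly after fault_time detect the fault.
    """
    i = bisect.bisect_right(read_times, fault_time)
    if i == len(read_times):
        return None  # never read again in the data set: a miss
    return read_times[i] - fault_time


def latency_distribution(
    reads_by_location: Dict[int, List[float]],
    locations: Sequence[int],
    fault_time: float,
    n_faults: int,
    rng: random.Random,
) -> Tuple[List[float], float]:
    """Insert n_faults faults at uniformly chosen locations.

    Returns the list of latencies for detected faults and the miss percentage.
    """
    latencies: List[float] = []
    misses = 0
    for _ in range(n_faults):
        location = rng.choice(locations)  # uniform random fault placement
        latency = error_latency(reads_by_location.get(location, []), fault_time)
        if latency is None:
            misses += 1
        else:
            latencies.append(latency)
    return latencies, 100.0 * misses / n_faults
```

With one location read before and twice after the fault, and another never read after it, the first yields the time to its first later read and the second counts as a miss, as in Figure 3.7.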
3.4.2. Algorithm implementation 
In order to work with sampled data a class is defined. A class is a set of
neighboring memory locations which are assumed to have a uniform probability
of access. The number of memory locations in a class is termed the class
size. The class size is chosen to reflect approximately uniform access rates
within the class. The class caters to the fact that, although the sampled data
are representative of the memory access pattern, they need not contain every
distinct address that is generated. Thus, in the computation of error latency, the
access to any member of the class can be considered equivalent to access of
the whole class. The algorithm, therefore, uses classes, in place of memory
locations, to reconstruct the error discovery process. The class size is chosen
small enough so that the memory usage within a class is uniform, i.e., each
location in a class has similar access probability, and large enough so that the
computation is still tractable. The class sizes that are chosen do not bear any
relationship to the physical organization of the memory, although in reality the
memory is organized in classes to a limited degree. An access to a byte or a
word (two bytes) invokes a long word (4 bytes) to be read, which corresponds
to a class size of 4 bytes. Class sizes varying from 1/4 page to 3 pages have been
experimented with, and it was found that the class size did not appreciably
change the distribution and that the median varied by less than 5%. Section
3.6 discusses the class size and its ramifications in detail. Figure 3.8 shows the
flow of data in the experimental setup and the offline processing.
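
The use of classes in place of individual locations can be sketched as a grouping step applied to the trace before latencies are computed. The sketch below is illustrative only: the trace format (time, address, read flag), the function names, and the example class size of a quarter page are assumptions, with the page taken to be the 512-byte VAX page.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

PAGE_BYTES = 512  # ASSUMPTION: VAX page size, used only to express class sizes


def class_of(address: int, class_size_bytes: int) -> int:
    """Map a physical address to the index of the class that contains it."""
    return address // class_size_bytes


def reads_by_class(
    trace: Iterable[Tuple[float, int, bool]], class_size_bytes: int
) -> Dict[int, List[float]]:
    """Group read times by class; writes are ignored.

    A read of any member of a class is treated as a read of the whole class.
    """
    grouped: Dict[int, List[float]] = defaultdict(list)
    for time, address, is_read in trace:
        if is_read:
            grouped[class_of(address, class_size_bytes)].append(time)
    for times in grouped.values():
        times.sort()
    return dict(grouped)
```

The resulting per-class read times can then stand in for per-location read times when the first read after a fault is looked up.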
3.5. Error Latency Distributions 
In this section, the latency distributions generated by the technique are
described. The workload effect on error latency is determined by placing faults
in the data at a specific time of day and computing the error latency in the
hours that follow the fault. The fault occurrence time is then moved in time
across the entire measurement period, generating a distribution at each step.
This generates a family of distributions (one for each fault occurrence time)
which, taken together with the workload profile, show how the changes in
workload affect error latency. A set of error latency distributions is shown for
faults placed under low and high workload conditions. This demonstrates the
variability of the mean latencies, showing that it is a strong function of
workload. Another set of error latency distributions is shown for measurement
[Figure 3.8: data flow diagram. Real-time acquisition on the VAX 11/780:
performance monitor, VAX, probes, DAS, performance data. Off-line processing
and simulation on the IBM 3081: fault time, faults, inverted bit, data
pre-processing, sort, selective merge, determine latency, latency, latency
distributions.]
Figure 3.8. Acquisition and Processing of Data.
periods that span two days. The cyclic nature of the workload causes the
latency distribution on the first day to repeat itself in the second. This
demonstrates that the results are representative and not an isolated case. The
hazard calculated from the error latency distribution shows that the observed
error rate, due to the error latency, increases with workload.
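
The idea of moving the fault occurrence time across the measurement period can be sketched as a simple sweep. This is a hedged illustration, not the procedure used in the study: for brevity it places a fault in every location at each fault time rather than drawing locations at random, it summarizes each distribution by its mean only, and the names `latencies_at` and `sweep_fault_times` are hypothetical.

```python
import bisect
from statistics import mean
from typing import Dict, Iterable, List, Optional


def latencies_at(reads_by_location: Dict[str, List[float]], fault_time: float) -> List[float]:
    """Latencies of detected faults when every location is faulted at fault_time.

    Each list of read times must be sorted in ascending order.
    """
    latencies = []
    for times in reads_by_location.values():
        i = bisect.bisect_right(times, fault_time)
        if i < len(times):  # a later read exists, so the fault is detected
            latencies.append(times[i] - fault_time)
    return latencies


def sweep_fault_times(
    reads_by_location: Dict[str, List[float]], fault_times: Iterable[float]
) -> Dict[float, Optional[float]]:
    """Mean latency for each fault occurrence time (None if nothing is detected)."""
    result: Dict[float, Optional[float]] = {}
    for fault_time in fault_times:
        latencies = latencies_at(reads_by_location, fault_time)
        result[fault_time] = mean(latencies) if latencies else None
    return result
```

Plotting the resulting values against the workload profile would give one summary point per fault occurrence time; the study itself examines the full distribution at each step.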
3.5.1. Faults at low and high workloads 
This subsection shows the error latency distributions for faults placed at
two different times of the day: one when the workload is very low and
another when the workload is high. It is found that there is considerable
difference in the two error latency distributions. The mean latency can vary
from as large as 8 hours for a fault at low workload to as short as 40 minutes
for a fault at high workload. This clearly demonstrates that error latency is a
strong function of the workload that follows the fault.
Recall from Figure 3.6 that the system has a low workload from midnight
to 7 a.m. and an increasing workload (intermediate) from 8 to 10 a.m.,
with a peak around 11 a.m. The intermediate period where workload changes
from low to high is of particular interest. Figure 3.9 shows the latency
distribution generated with faults inserted at midnight. The distribution is
bimodal with the second mode being the larger of the two. The initial peak
corresponds to a small period of high activity which usually occurs around
midnight. Within the first hour about 10% of the detected errors are found. The
bulk of the errors (70%) are found in the second mode. There is a sharp increase
in the number of errors being detected about 8 hours after the fault. This
corresponds
[Figure 3.9: latency histogram with cumulative percent annotations.
Horizontal axis: midpoint latency in hours, 0 to 14.]
Parameter   H:MM:SS
Mean        8:03:28
Std Dev     4:01:19
Med         9:18:43
Figure 3.9. Error Latency Distribution - Fault at 00:00 hrs.
to 8 a.m. (real time of day), which is the start of the increasing workload
period. Also note that there is a dip in the distribution after the mode. This
corresponds to a lull in the activity that occurs around 10:30 a.m. or so in this
system. This system is largely used by graduate students, whose day starts at
around 10:30 a.m. and continues past the lunch hour. The early morning (8
a.m.) rise in activity is due to secretarial and staff users. This clearly shows
the influence of workload in determining error latency. The mean latency is
8:03 h:m. Listed in the figure are the percentages which correspond to detected
errors only. Nearly 25% of the faults inserted were undetected. Missed faults
and the associated miss percentage are discussed in Section 3.6.
Although the distributions presented here are of a specific day, data from
a number of different days have been similarly analyzed. No matter how low
the workload when the fault occurs, there is always an initial discovery of
faults that contributes to a mode (though small) in the latency distribution. In
Figure 3.9 the initial peak in the distribution is due to a combined effect of the
initial discovery and also to the fact that there is a peak in the early hours of
the morning caused by some system routines. The second mode (larger) is due
to the workload that discovers the faults. If, however, the fault occurred at a
time during the high workload, say 12 p.m., then the initial discovery mode
would be dominated by the large discovery due to the high workload. Figure
3.10 has faults inserted at 12 p.m. (well into the high workload period). In
contrast to Figure 3.9, the mean error latency is now down to 44 minutes with
70% of the detected errors discovered in the 1st hour. Thus faults occurring at
[Figure 3.10: latency histogram with cumulative percent annotations.
Vertical axis: frequency. Horizontal axis: midpoint latency in hours, 0.0 to 2.0.]
Parameter   H:MM:SS
Mean        0:44:26
Std Dev     0:29:19
Med         0:43:57
Figure 3.10. Error Latency Distribution - Fault at 12:00 hrs.
low workload can be discovered with latencies as large as 10 hours (on the 
average) versus only 1 hour for errors occurring at high workload. 
3.5.2. Multiple day measurement 
In this subsection the effect of a cyclic workload on the error latency
distribution is studied by considering a measurement period that spans two
consecutive days. The error latency distribution of the first day reappears in the
second. This further demonstrates that the distributions generated by this
technique are stable and hence representative of the workload in general. This
is also explicitly shown by generating three distributions for three different
fault occurrence times during these 2 days. A fault is detected with 70%
confidence within the first day and incrementally in the following days.
Figure 3.11 shows three latency distributions with the fault occurrence
times advanced relative to each other. To make the latency distributions easier
to compare, the latency times have been shifted to match up with the real time
of day. In Figure 3.11a, the faults occur at 00:00 hours on the first day and the
latency times (abscissa) are the same as the time of day. In Figure 3.11b, the
faults are inserted at 8:00 a.m., and the latency times shifted by 8 hours to
match up with the time of day. Figure 3.11c has faults at 4:00 a.m. on the
second day with the latency times shifted by 28 hours. Notice that when the
faults are inserted in the first day, the pattern of latency distribution of the
first day reappears on the second day.
[Figure 3.11: three panels, each plotting frequency against time in hours
(0 to 48).
Figure 3.11a. Fault at 00 hrs on day 1.
Figure 3.11b. Fault at 08 hrs on day 1.
Figure 3.11c. (panel title not legible in the source)]
Figure 3.11. Error Latency Distributions for 2 Consecutive Days.
From the latency distributions it is clear that there is a finite number of
inserted faults discovered in the second day. When three days of data were
studied, there was incremental discovery on the third day. It was found that,
typically, the first day reveals a fault with 70% confidence, the second 82%,
and the third 91%. Thus, when considering the unit of time as a day, the
confidence level of detecting a fault is not very dependent on the specific
variability in workload. This is due to the fact that the workload cycle over a day
reveals faults with a large degree of confidence (70%), and the subsequent fault
discovery is incremental. However, the median or 50% confidence level is
reached within a day, and this is highly dependent on the workload that
follows the fault. The issue of the fault-miss percentage is discussed in detail in
the following section.
3.5.3. Latency and hazard 
These results suggest that a steady rise in workload sweeps the errors out
(higher error discovery), after which few, if any, remain to be discovered (low
error discovery). An increase in workload causes a temporary increase in the
observed error rate. The error rate drops again after the errors have been
discovered. In Figure 3.9, this phenomenon is observable with the steady
decline in the number of errors discovered after a large initial discovery. It is
of value, therefore, to explicitly determine the change in failure rate that
results from the discovery of latent errors by workload changes (in this case
the memory access).
The hazard or failure rate, h(t), is a measure of the instantaneous speed of
failure. Let F(t) be the failure distribution function, f(t) the failure density
function, and R(t) = 1 - F(t) the reliability. Then the hazard h(t) is defined to
be:
\[ h(t) = \lim_{x \to 0} \frac{1}{x} \cdot \frac{F(t+x) - F(t)}{R(t)} \]
\[ h(t) = \lim_{x \to 0} \frac{R(t) - R(t+x)}{x \, R(t)} \]
\[ h(t) = \frac{f(t)}{R(t)} \]
Thus, h(t)Δt represents the conditional probability that a component surviving
to age t will fail in the interval (t, t+Δt] [30]. For computation of failure rate
from data on failure, a discrete definition is used. The discrete functions
approach the continuous functions in the limit when the data become large.
Thus, hazard over the interval (t, t+Δt] is defined as the ratio of the number of
failures occurring in the time interval to the number of survivors at the
beginning of the time interval, divided by the length of the time interval [31].
Thus, with n(t) denoting the number of survivors at time t,
\[ h(t) = \frac{[\, n(t) - n(t+\Delta t) \,] / n(t)}{\Delta t} \]
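
The discrete definition above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the input is a list of latencies for detected faults, each latency is assigned to the bin given by integer division by the interval length (a half-open binning convention chosen for simplicity), missed faults are counted as survivors throughout via the optional `n_faults` argument, and the function name `discrete_hazard` is hypothetical.

```python
from typing import List, Optional, Sequence


def discrete_hazard(
    latencies: Sequence[float], dt: float, n_faults: Optional[int] = None
) -> List[float]:
    """Hazard per interval: (failures in interval / survivors at start) / dt.

    n_faults is the total number of inserted faults, including misses;
    it defaults to the number of detected faults.
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    if not latencies:
        return []
    survivors = len(latencies) if n_faults is None else n_faults
    n_bins = int(max(latencies) // dt) + 1
    failures = [0] * n_bins
    for latency in latencies:
        failures[int(latency // dt)] += 1  # discoveries in this interval
    hazards = []
    for count in failures:
        hazards.append((count / survivors) / dt if survivors else 0.0)
        survivors -= count  # survivors at the start of the next interval
    return hazards
```

Passing the total number of inserted faults keeps undetected faults in the survivor count, which lowers the computed hazard relative to using detected faults alone.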
Figure 3.12 shows three hazard rate plots computed from the three error
latency distributions in Figure 3.11. These hazard rate plots reveal some
interesting and important characteristics of error latency.
Note that the hazard rate is not constant. This clearly establishes that
classical models, assuming exponential distributions to model failure rate due to
error latency, are not valid in a varying workload environment. Furthermore,
simplifying assumptions such as linearly increasing or linearly decreasing
[Figure 3.12, panels a to c: hazard rate plotted against time of day in hours.]
failure rates are also not representative.
Notice that the change in the hazard in Figure 3.12a on the second day is
comparable to that in the first day even though around 30% of the faults
remain. This is seen even when the fault occurrence time is moved across two
days. The increase in failure rate due to latency is not so much a function of
the number of remaining faults but is dependent on whether or not there is a
latent fault.
3.6. Validation and an Analysis of Fault-miss Percentage 
The use of sampling to study error latency raises some important questions:
does data not recorded between samples significantly affect the computation of
error latency and its distribution? In particular:
• Is the error latency distribution that is computed from the sampled data
similar to the real error latency distribution?
• Do the memory references not recorded between samples result in a larger
computed percentage of undetected faults as against continuous measurement?
If yes, what is the real miss percentage?
• What is the effect of the class size parameter (see Section 3.1) and the
sampling frequency on the results obtained?

This section answers these questions. The distributions are not sensitive to
the sampling; however, the computed fault-miss percentage is a function of the
sampling frequency and class size. This is best illustrated with a simple
example. Consider the implementation of numerical integration. It has one
degree of freedom, namely, the step size. In this technique, there are two
degrees of freedom, namely, class size and sampling frequency. The step size
does affect the accuracy of the result. However, with the step size within the
right range, the computation can be both fast and accurate.

The problem of validating the method is one of estimating the true fault-miss
percentage, given the computed values of fault-miss percentage. It must be
noted that the percentage of the missed faults provides an estimate of the
probability that a fault goes undetected. A technique for this purpose is
discussed below and the technique is verified using data from a region of
memory in which the fault-miss percentage is known.
3.6.1. The effect of class size and sampling factor 
The first step in this analysis was to look for the possible errors in the
latency distribution due to sampling. The effect of sampling on the latency
distribution was studied by further sampling the data. For this purpose a
sampling factor, s, which measures the decrease in sampling rate over the
original sampling, was defined, i.e.,

$$s = \frac{\text{sampling frequency of collected data}}{\text{sampling frequency of new sampled data}}$$

Thus s = 1 for the collected data, and s = 0 for continuous measurement.

Error latency distributions were then generated for a range of sampling
rates. The distributions did not differ significantly as s increased. This shows
the insensitivity of the error latency distribution to the sampling factor used.
However, the computed fault-miss percentage of undetected faults varied with
the sampling factor (s). Recall that the computed fault-miss percentage is
defined as the percentage of inserted faults that remain undetected during an
observation period. A similar dependence was found between the computed
fault-miss percentage and class size.
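The dependence of the computed fault-miss percentage on s and on class size can be sketched as follows. This is a minimal illustration, not the thesis implementation: the helper names, the trace, and the fault addresses are hypothetical, and the sketch assumes that a fault counts as detected when any recorded reference falls in the same class, with classes formed by integer division of the address by the class size.

```python
def subsample(samples, s):
    """Keep every s-th recorded sample, emulating sampling factor s (s >= 1)."""
    return samples[::s]


def miss_percentage(samples, fault_addresses, fault_time, class_size):
    """Percentage of inserted faults with no recorded reference to their class.

    samples is a list of (time, address) references recorded by the sampler.
    A fault at address a is treated as detected when some sample taken at or
    after fault_time satisfies address // class_size == a // class_size
    (an assumed reading of the class size parameter).
    """
    touched = {addr // class_size for t, addr in samples if t >= fault_time}
    missed = sum(1 for a in fault_addresses if a // class_size not in touched)
    return 100.0 * missed / len(fault_addresses)


# Hypothetical trace of (time, address) pairs and fault addresses.
trace = [(1, 0), (2, 40), (3, 100), (4, 260), (5, 410), (6, 36)]
faults = [8, 44, 120, 300, 420]

for s in (1, 2, 3):
    thinned = subsample(trace, s)
    print(s, miss_percentage(thinned, faults, fault_time=0, class_size=32))
```

In this toy trace the computed miss percentage grows as s increases and shrinks as the class size is enlarged, which is the qualitative behavior described above.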
3.6.2. Fault-miss percentage 
The computed miss percentage, during the generation of an error latency
distribution, is a function of the sampling factor (s) and the class size (c),
i.e., together they form a 3-dimensional surface. Figure 3.13 and Figure 3.14
show the variations in 2 dimensions for a 30 K byte region. Figure 3.13 shows
miss percentage (m) versus c for three different values of s. The computed m
increases with decreasing c. The curves plotted for different values of s show
that m decreases with decreasing values of s. Thus the curve for continuous
measurement (corresponding to s = 0) will be below the lowest curve. This
curve is estimated in order to determine the real m.

Figure 3.14 shows m versus s for three different values of c. The computed m
decreases with decreasing s. The curves plotted for different values of c,
however, show an increase in m with decreasing c. Recall that for the measured
system, the real class size is 4 bytes. The curve for this real value of c is
above the highest curve and is also estimated in order to determine the real m.

The real miss percentage was determined by fitting a multiple regression
model to these data and substituting c = 4 and s = 0 in the regression model.
This, of course, requires backward extrapolation of the regression plane.
Following this technique, the real miss percentage for the 30 K byte region was
Figure 3.13. Miss Percentage Versus Class Size. (For a 30 K byte region;
horizontal axis: class size in bytes.)
Figure 3.14. Miss Percentage Versus Sampling Factor. (For a 30 K byte region;
horizontal axis: sampling factor; curves for c = 100, 200, 300, 400 and 500.)
Although such extrapolation is a commonly used technique, questions may
be raised about its validity. For this purpose, further analysis was performed,
which is discussed in the following section.
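The regression-based projection can be sketched in Python. This is an illustrative least-squares plane fit of m on c and s, followed by evaluation at c = 4 and s = 0. The function names and the data points are hypothetical (the points are placed on an exact plane so that the extrapolated value can be checked), and the specific form of the regression model used in the study is assumed here to be a plane with an intercept.

```python
def fit_plane(points):
    """Least-squares fit of m = b0 + b1*c + b2*s to (c, s, m) triples."""
    # Build the normal equations (X^T X) b = X^T m for rows [1, c, s].
    rows = [(1.0, float(c), float(s)) for c, s, _ in points]
    ms = [float(m) for _, _, m in points]
    a = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    y = [sum(r[i] * m for r, m in zip(rows, ms)) for i in range(3)]
    # Solve the 3x3 system by Gaussian elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        y[col], y[pivot] = y[pivot], y[col]
        for r in range(col + 1, 3):
            factor = a[r][col] / a[col][col]
            for k in range(col, 3):
                a[r][k] -= factor * a[col][k]
            y[r] -= factor * y[col]
    # Back substitution.
    b = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        tail = sum(a[i][k] * b[k] for k in range(i + 1, 3))
        b[i] = (y[i] - tail) / a[i][i]
    return b


def predict(b, c, s):
    """Evaluate the fitted plane at class size c and sampling factor s."""
    return b[0] + b[1] * c + b[2] * s


# Hypothetical computed miss percentages lying on the plane
# m = 50 - 0.05*c + 0.5*s (illustrative numbers, not the thesis data).
data = [(c, s, 50 - 0.05 * c + 0.5 * s)
        for c in (100, 200, 300, 400, 500) for s in (12, 24, 36)]
coeffs = fit_plane(data)
# Backward extrapolation to the real class size and continuous measurement.
print(predict(coeffs, c=4, s=0))
```

The signs of the hypothetical coefficients follow the trends reported above: m rises as c falls and falls as s falls. The evaluation point lies outside the range of the fitted data, which is why the extrapolation needs separate verification.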
3.6.3. Verifying the miss percentage estimation 
The key to verifying the extrapolation is to show that the estimated miss
percentage is indeed correct. This is possible since in the data there exists a
region of memory where the real miss percentage is known. Data from this
region (which contains the kernel of the operating system) are used in
verification. The region has very high usage and complete representation in the
sampled data. Access to this region is exhaustive; hence, it has a zero m during
a 24 hour period.

The data from this region were truncated in time, to decrease the period of
observation, thereby increasing the m. They were then further sampled to emulate
higher sampling factors. Analysis, similar to that in Section 5.1, was then
performed to study the variation in m as a function of c and s. This analysis
showed relationships among m, c and s similar to that observed in other regions,
i.e., a plane. Figure 3.15 and Figure 3.16 show these variations in 2
dimensions. A regression model was then fitted to this and the m determined.
The miss percentage obtained by extrapolating the regression plane was
compared with the known miss percentage for this region. The real miss
percentage is 0.06 and the predicted miss percentage is 0.09, which is close
(with a regression coefficient of 0.91). This proves the validity of the
technique used to predict miss percentage. In addition, it also shows that the
fault-miss percentage can vary significantly from one region of memory to
another, depending on the level of usage.

Figure 3.15. Verification: Miss Percentage Versus Class Size. (For the very
high usage region; horizontal axis: class size in bytes; curves for s = 120, 60,
30 and 10.)
Figure 3.16. Verification: Miss Percentage Versus Sampling Factor. (For the
very high usage region; horizontal axis: sampling factor; curves for several
values of c.)
It is important to note that the real miss percentage varies significantly
depending on the activity in the region. Thus, different regions of memory can
have widely varying miss percentages. Figure 3.17 illustrates this by comparing
the miss percentage versus class size curves from two different regions of
memory. Region A is the same as used in the earlier figures and is 30 K bytes in
size, and Region B is 200 K bytes. From Figure 3.17 it is clear that region B
has a higher miss percentage than region A. These figures are generated from
data when the system was observed for 24 hours. It is to be noted that the
miss percentage will also vary depending on the period of observation of the
system. This is evident from the latency distributions of multiple days, where
there is a small but significant discovery during the second day (Figure 3.11).
These issues illustrate that the miss percentage in a system has a large
variability between regions: from 9% to 80% during a 24 hour period. Hence,
using an average value does not well reflect its variability. It can only be
expressed with reference to a specific region of memory and a period of time
that the system is observed.

Figure 3.17. Variation in Miss Percentage Between Regions. (Class size: 300
bytes; horizontal axis: sampling factor.)

In summary, it has been shown that:
(1) the error latency distribution is insensitive to the sampling technique used
for measurement.
(2) the computed fault-miss percentage, during the generation of an error
latency distribution, is a function of the sampling factor and class size.
(3) the real fault-miss percentage is estimated using a multiple regression
model. The technique is shown to be valid using data for which the miss
percentage is known.
(4) the fault-miss percentage can only be expressed with reference to a specific
region of memory and for a finite period of time.
3.7. Summary 
This chapter illustrates a practical approach to the study of error latency
under real workload. The determination of error latency is an important and
unsolved issue in fault tolerant computing with significant implications in both
reliability prediction and testing. The method is based on sampled data of the
physical memory activity gathered by hardware instrumentation on a VAX
11/780 during the normal workload cycle of the installation. The data collected
are then used to reconstruct the error discovery process in memory under
different workload conditions. The use of sampling is validated by an analysis
of the sampling factor, the class size, and the computed fault-miss percentage.
A regression based projection is used to determine the real fault-miss
percentage. This is verified using data for which the miss percentage is known.
The analysis and its verification substantiate the overall approach of using
sampling to reconstruct the error discovery process.

The results provide general guidelines for understanding latency behavior.
The study finds that the mean error latency, in the unpaged memory containing
the operating system, varies by a factor of 10 to 1 (in hours) between the low
and high workloads within a day. The hazard rate, computed from the error
latency distribution, clearly shows that the observed failure rate increases
during higher workloads. Analysis using consecutive days of data shows that a
fault is typically discovered the same day with 70% confidence, within the next
day with 82% confidence, and within the third day with 91% confidence, i.e.,
there is a small but significant fault discovery in the second day and third
day. This method, in addition to determining error latency, also provides a
means to study the fault-miss probability. The fault-miss percentage is seen to
vary widely between regions of memory depending on the activity, i.e., workload,
and can only be expressed with reference to a specific region of memory and a
finite observation period. As with any statistical analysis, caution should be
exercised in extrapolating the absolute numbers obtained in this study to other
non-similar systems. However, the development of workload based reliability
models, based on the general characteristics of the latency distribution found
here, is an area of future study.
CHAPTER 4 
FAULT LATENCY 
4.1. Introduction 
The study of latency is an important issue in fault tolerant computing
with far-reaching implications to both reliability measurement and evaluation.
Fault latency is the time between the physical occurrence of a fault and its
corruption of data, causing an error [3]. The difficulty with fault latency
measurement is that the time of fault occurrence and the exact moment of error
generation are unknown. The detection of a failure is the only record of the
errors caused by a fault. Thus, faults that do not cause a failure are
completely missed. The only feasible way to quantify latency is through an
experimental setup, wherein the time of fault is controlled, and the error
generation time is observable.

This chapter describes an experiment to accurately study the fault latency
in the memory subsystem. This is the first attempt to measure fault latency in
the memory with a real workload on the machine. The experiment employs real
memory data from a VAX 11/780 at the University of Illinois. Fault latency
distributions are generated for stuck-at-0 (s-a-0) and stuck-at-1 (s-a-1)
permanent fault models. Results show that the mean fault latency of a s-a-0
fault is nearly 5 times that of the s-a-1 fault. Large variations in fault
latency are found for different regions in memory. An analysis of variance
(ANOVA) model to quantify the effect of various workload measures on the
evaluated latency is also given.

There have been a number of studies on latency; however, they have
almost always measured error latency or the sum of both fault and error
latency [6, 7, 5]. An evaluation of these techniques is given in [11].
Considerable confusion in terminology regarding fault, error and failure has
occurred in the literature. In this thesis the definitions for fault, error and
failure are stated as proposed by the IFIP WG 10.4 [3]. Almost all the studies
so far have used specific programs or fault injection on special purpose
machines. In [12] the measurement of error latency under a real workload in the
unpaged section of the operating system is described. The first significant
attempt at determining fault latency is found in [11]. The authors use an
indirect technique to estimate fault latency at the pins of the chips in the CPU
of the Fault Tolerant Multiprocessor (FTMP). More discussion on this is
presented in Section 4.6.
4.2. Computation of Fault Latency 
This section describes the algorithm used to calculate fault latency.
Subsequent estimation of error latency based on the calculated fault latency is
also described. The computation is best described by following the calculations
with respect to a single bit position in a word that is chosen to contain a
fault. Consider a bit position b of a word w. The value of b changes between 0
and 1 as a function of time. Figure 4.1 shows the contents of b as it changes in
time. The times of change are indicated as $t_1$, $t_2$, etc. However, although
the bit b might not change, the word w could have changed without affecting bit
b. The times of change in the words are $T_1$, $T_2$, etc. A change in the bit b
implies a change in the word w but the converse need not be true.
4.2.1. Fault latency calculation 
With reference to Figure 4.1, consider two fault times $t_{f1}$ and $t_{f2}$.
Let $F_1$ be a s-a-0 and $F_3$ be a s-a-1 fault at time $t_{f1}$. Let $F_2$ be a
s-a-0 and $F_4$ be a s-a-1 fault at time $t_{f2}$. Note that in the description,
a s-a-0 and a s-a-1 fault are inserted at the same time to illustrate the
computation. In practice, however, only one of the faults can occur on a memory
location at any given instant.

Fault $F_1$ occurs at $t_{f1}$, during which time bit b is 0, and hence the
fault is latent. At $t_3$ the bit b is written with a 1. The fault $F_1$ causes
bit b to be stuck at 0 and the fault becomes active. Therefore, the fault
latency associated with the fault $F_1$, namely $L^{f}_{s\text{-}a\text{-}0}$, is

$$L^{f}_{s\text{-}a\text{-}0} = t_3 - t_{f1}$$

A read performed on the word w any time after $t_3$ will be detected by the ECC
as an error. In the figure the fault $F_3$ is a s-a-1 fault occurring at the
same time as $F_1$, i.e., $t_{f1}$. In this case, however, the value of the bit
b is 0, and the fault is active as soon as it occurs. Hence, the fault latency
is

$$L^{f}_{s\text{-}a\text{-}1} = 0$$

Similarly, the fault latencies for faults $F_2$ and $F_4$ are obtained in the
same way from the value of bit b at and after $t_{f2}$ in Figure 4.1.
Figure 4.1. Latency Calculation. (Timelines of bit b, with its times of change
t, and of word w, with its times of change T, showing the faults F1: s-a-0,
F2: s-a-0, F3: s-a-1 and F4: s-a-1 together with their fault and error
latencies.)
The above discussion referred to a single fault (either s-a-0 or s-a-1). A
distribution of fault latency for a given fault occurrence time is generated by
inserting a large number of faults (approximately 1000) into randomly chosen
bit positions of randomly chosen words. The resulting latencies, taken together,
yield a fault latency distribution for the given fault occurrence time.
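The fault latency rule for a single bit can be sketched in Python. This is a minimal illustration under stated assumptions: the function name, the argument layout, and the change times are hypothetical, and the bit is assumed to toggle at each recorded time of change.

```python
def fault_latency(bit_changes, initial_value, fault_time, stuck_at):
    """Fault latency of a stuck-at fault inserted at fault_time.

    bit_changes: sorted times at which the fault-free bit toggles.
    initial_value: value of the bit before the first change (0 or 1).
    Returns 0 if the bit already differs from the stuck value at fault_time,
    the delay until the bit is first written with the opposite value
    otherwise, or None if that never happens in the observed data.
    """
    value = initial_value
    for t in bit_changes:
        if t > fault_time:
            break
        value ^= 1  # replay the toggles that happened up to the fault time
    if value != stuck_at:
        return 0  # the fault is active as soon as it occurs
    for t in bit_changes:
        if t > fault_time:
            return t - fault_time  # next toggle writes the opposite value
    return None  # the fault stays latent for the whole observation period


# Hypothetical change times; the bit starts at 0 and toggles at each time.
changes = [10, 20, 50, 70, 90]
print(fault_latency(changes, 0, fault_time=30, stuck_at=0))
print(fault_latency(changes, 0, fault_time=30, stuck_at=1))
```

With these hypothetical times the bit holds 0 at time 30, so the s-a-0 fault stays latent until the next change while the s-a-1 fault is active immediately, mirroring the treatment of $F_1$ and $F_3$. Repeating the call over many randomly chosen bits yields the latency distribution.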
4.2.2. Estimating error latency 
An error occurs when a fault becomes active. The error latency is the time
from when the error occurs until a failure results or a discovery of the error
is made. For the memory subsystem, a single active fault is discovered on the
read following the fault to the word in memory. In the VAX 11/780, the memory
write operation, called write masked, checks the ECC before it updates the
location. Thus, the write operation will also detect the error. The data (as
will be described in Section 4.3) used for this experiment are generated by
periodic memory scans, which detect changes that take place in the contents of
the scanned memory locations. Although these data only contain information on
when the write operations take place and not the read operations, they can still
be used to estimate error latency.

In [12], extensive instrumentation was performed using hardware probes to
observe low-level operations on the memory and I/O. From this study it was found
that about 73% of the memory operations were write masked and the remaining
read extended. This high percentage of writes is most likely explained by the
fact that the machine has a write-through cache. Since the majority of the
memory operations are write masked, using only the write times for error
detection provides a good estimate of error latency. However, as a read
operation can occur before a write operation, this estimate provides an upper
bound for error latency.

Using this method to estimate the error latency, it can be seen from Figure 4.1
that $T_8$ is the first write that takes place after fault $F_1$ becomes active.
Hence the error latency is

$$L^{e}_{s\text{-}a\text{-}0} = T_8 - t_3$$

The total latency for fault $F_1$ is

$$L_{s\text{-}a\text{-}0} = L^{f}_{s\text{-}a\text{-}0} + L^{e}_{s\text{-}a\text{-}0} = T_8 - t_{f1}$$

Similarly, the error latency for fault $F_3$ is

$$L^{e}_{s\text{-}a\text{-}1} = T_6 - t_{f1}$$

and as it had a fault latency $L^{f}_{s\text{-}a\text{-}1} = 0$, the total
latency for the fault is the same as its error latency. The error latencies for
faults $F_2$ and $F_4$ are obtained in the same way from Figure 4.1. From these
estimates the total latency, which is the sum of the fault and error latency,
can be calculated.
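The error latency estimate and the total latency can be sketched in the same style. The function names and the word change times are hypothetical; the sketch assumes that the error is discovered at the first word write strictly after the fault becomes active.

```python
import bisect


def error_latency(word_changes, activation_time):
    """Estimated error latency: delay from activation to the next word write.

    word_changes: sorted times at which the word was observed to change.
    Returns None if no write follows the activation in the observed data.
    """
    i = bisect.bisect_right(word_changes, activation_time)
    if i == len(word_changes):
        return None
    return word_changes[i] - activation_time


def total_latency(fault_time, fault_lat, word_changes):
    """Sum of the fault latency and the estimated error latency."""
    err = error_latency(word_changes, fault_time + fault_lat)
    return None if err is None else fault_lat + err


# Hypothetical word change times (illustrative only).
word_times = [5, 10, 15, 20, 40, 45, 50, 65, 70, 90]
print(total_latency(fault_time=30, fault_lat=20, word_changes=word_times))
print(total_latency(fault_time=30, fault_lat=0, word_changes=word_times))
```

In the first call the fault inserted at time 30 becomes active at time 50 and is discovered at the next write, so the total equals that write time minus the fault time, in the same form as the expression for $F_1$. In the second call the fault latency is zero and the total equals the error latency, as for $F_3$.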
The calculation requires data on the contents of physical memory and its time
of change. The collection and implementation to evaluate fault and error
latency distributions are discussed in the next section.
4.3. The Experiment 
The purpose of the experiment is to get data from which the fault latency
can be determined as described in the earlier section. The measured system is a
VAX 11/780 that runs the Unix operating system Berkeley Ver 4.2, which is
used mostly for scientific computing and a variety of miscellaneous data
processing activities. The VAX 11/780 system studied has 4 M bytes of main
memory, three 300 M byte disk drives, a tape drive, and many miscellaneous
terminals and printers. During the peak hours it has about 20 to 25 interactive
users who work on a wide range of applications. The primary data used in the
experiment are derived from actual scans of physical memory.
4.3.1. Memory data scanner 
The physical memory of the VAX at this installation is 4 M bytes. Since
the size of the memory is very large, it is impractical to scan the whole memory
periodically. Representative samples from different regions of memory were
chosen to capture the variation in memory usage. The choice of sample size was
based on engineering judgment so as to keep the data manageable, yet ensure
that it well reflected the system behavior. Concepts used to determine
appropriate sample sizes were similar to those discussed in [12]. Four regions
of size 10 K bytes evenly spaced in the 4 M bytes of physical memory were
sampled. The memory data scanner copies the contents of randomly chosen
locations from the selected regions at periodic intervals.

The scanning rate was chosen from knowledge of the distributions of the
lifetimes of data in the memory locations. Data were initially acquired at a
high rate (< 5 sec), the rate being qualified by the number of identical memory
content values that were generated in the consecutive scans. The distribution of
the lifetimes of the data showed a mode around 30 seconds, and more than 75%
of the changes in the contents of memory locations occurred after 1 minute.
Hence, a 15-20 second scan interval was considered reasonable.
4.3.2. The experimental setup 
Figure 4.2 shows the experimental setup and a flow of data. Concurrent
simulation is performed for all the inserted faults to determine latency times
and, hence, generate latency distributions. A computationally efficient scheme
is used, whereby a distribution for a chosen fault occurrence time is generated
in one pass over the data. This is accomplished by preprocessing the raw data,
as shown in Figure 4.2, to generate two different data sets. The bit change
dataset contains only the times of transition of the randomly chosen bit
positions that contain faults. The other, the word change dataset, contains the
times of changes in the words that contain faults. The primary reason for this
separation is that a word in memory can change in value without affecting the
value of a particular bit in it. Concurrent simulation of all faults can now be
performed for a given fault occurrence time by one selective merge of the two
data sets. Distributions for different fault occurrence times use the same data
sets but use another pass through the merge.

Workload and performance data are also gathered on the machine during
the memory scans. These data are merged with the estimated total latency to
generate a workload-latency model.
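The preprocessing step that separates bit changes from word changes can be sketched as follows. The function name and the data layout (a list of timestamped scans, each mapping an address to a word value) are assumptions made for illustration; each change is timestamped with the scan that first observes it.

```python
def change_datasets(scans, fault_sites):
    """Build bit change and word change datasets from periodic memory scans.

    scans: list of (time, {address: word_value}) in time order.
    fault_sites: list of (address, bit_position) chosen to contain faults.
    Returns two dicts keyed by fault site: the times at which the chosen bit
    changed, and the times at which the containing word changed.
    """
    bit_changes = {site: [] for site in fault_sites}
    word_changes = {site: [] for site in fault_sites}
    for (_, prev), (t, cur) in zip(scans, scans[1:]):
        for addr, bit in fault_sites:
            if cur[addr] != prev[addr]:
                word_changes[(addr, bit)].append(t)
                # The XOR of old and new contents marks the bits that flipped.
                if ((cur[addr] ^ prev[addr]) >> bit) & 1:
                    bit_changes[(addr, bit)].append(t)
    return bit_changes, word_changes


# Hypothetical scans of one word at address 100, taken every 15 seconds.
scans = [(0, {100: 0b0000}), (15, {100: 0b0001}), (30, {100: 0b0101}),
         (45, {100: 0b0101}), (60, {100: 0b0100})]
bits, words = change_datasets(scans, [(100, 2)])
print(bits[(100, 2)], words[(100, 2)])
```

In this toy example the word changes three times but the chosen bit changes only once, which is the reason for keeping the two datasets separate. Both datasets are built once and can then be reused for every fault occurrence time.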
Figure 4.2. Acquisition and Processing of Data. (Real-time acquisition on the
VAX 11/780: the performance monitor produces performance data and the memory
scanner produces memory dump data. Off-line processing and simulation on an
IBM 3081: data pre-processing with random bit locations, bit change dataset and
word change dataset, sort by words, merge for a given fault time, latency
(fault, error, total), and ANOVA of workload and latency giving the latency
model.)
4.4. Latency Distributions 
4.4.1. S-A-0 and S-A-1 distributions
Latency distributions for both s-a-0 and s-a-1 faults are computed (as
explained in Section 3) by inserting a large number of faults at a specified
fault occurrence time. A family of such distributions is generated for different
fault occurrence times for different regions of memory. Since the salient
characteristics of the distributions repeat themselves among the different
regions of memory and repetitions of the experiment, only the distribution for a
representative fault occurrence time is discussed here.

Figure 4.3 and Figure 4.4 show the latency distributions for a s-a-0 and
a s-a-1 fault, respectively. Figure 4.3a shows the fault latency distribution
for a s-a-0 fault at 9:30 a.m. when there is a medium-to-high workload at this
installation. The vertical axis is the latency midpoint of the histogram and the
horizontal axis the frequency. Beside each horizontal bar the frequency and its
cumulative percent contribution are shown. A total of 960 faults was inserted to
generate the distribution. Figure 4.3b shows the estimated error latency
distribution and Figure 4.3c the estimated total latency distribution. Figure
4.4 similarly shows the corresponding latency distributions for a s-a-1 fault,
at the same time.

Notice that the fault latency for the s-a-0 fault is nearly 5 times the fault
latency of the s-a-1 fault. However, the error latency estimate of the s-a-1
fault is more than twice that of the s-a-0. The total latencies of the two are
comparable. An explanation for the observed results follows.
Fault latency, s-a-0 fault (Mean: 1:10:24, Std.: 1:20:14)

Midpoint (h:mm)   Freq.   Cum. percent
0:00               376       39.17
0:30                73       46.77
1:00               139       61.25
1:30               105       72.19
2:00                52       77.60
2:30               114       89.48
3:00                49       94.58
3:30                15       96.15
4:00                11       97.29
4:30                 4       97.71
5:00                22      100.00

Figure 4.3a. DISTRIBUTION OF FAULT LATENCY - FOR S-A-0 FAULT.

Error latency estimate, s-a-0 fault (Mean: 0:20:40, Std.: 0:31:14)

Midpoint (h:mm)   Freq.   Cum. percent
0:00               513       53.44
0:15               220       76.35
0:30                28       79.27
0:45                82       87.81
1:00                 2       88.02
1:15                65       94.79
1:30                 0       94.79
1:45                41       99.06
2:00                 9      100.00

Figure 4.3b. DISTRIBUTION OF ERROR LATENCY ESTIMATE - FOR S-A-0 FAULT.

Total latency estimate, s-a-0 fault (Mean: 1:31:04, Std.: 1:16:55)

Midpoint (h:mm)   Freq.   Cum. percent
0:00               142       14.79
0:30               120       27.29
1:00               268       55.21
1:30                87       64.27
2:00                66       71.15
2:30               144       86.15
3:00                68       93.23
3:30                19       95.21
4:00                20       97.29
4:30                 2       97.50
5:00                24      100.00

Figure 4.3c. DISTRIBUTION OF TOTAL LATENCY ESTIMATE - FOR S-A-0 FAULT.

Figure 4.3. S-A-0 Latency Distributions
Fault latency, s-a-1 fault (Mean: 0:14:39, Std.: 0:31:55)

Midpoint (h:mm)   Freq.   Cum. percent
0:00               725       75.52
0:15                72       83.02
0:30                 0       83.02
0:45                34       86.56
1:00                 4       86.98
1:15                73       94.58
1:30                52      100.00

Figure 4.4a. DISTRIBUTION OF FAULT LATENCY - FOR S-A-1 FAULT.

Error latency estimate, s-a-1 fault (Mean: 0:45:25, Std.: 0:47:53)

Midpoint (h:mm)   Freq.   Cum. percent
0:00               407       42.40
0:30               116       54.48
1:00               213       76.67
1:30                87       85.73
2:00                55       91.46
2:30                82      100.00
3:00                 0      100.00

Figure 4.4b. DISTRIBUTION OF ERROR LATENCY ESTIMATE - FOR S-A-1 FAULT.

Total latency estimate, s-a-1 fault (Mean: 1:00:22, Std.: 0:47:35)

Midpoint (h:mm)   Freq.   Cum. percent
0:00               276       28.75
0:30                84       37.50
1:00               306       69.38
1:30               106       80.42
2:00                96       90.42
2:30                92      100.00
3:00                 0      100.00

Figure 4.4c. DISTRIBUTION OF TOTAL LATENCY ESTIMATE - FOR S-A-1 FAULT.

Figure 4.4. S-A-1 Latency Distributions
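The arithmetic behind the charts can be rechecked with a short Python sketch. The frequencies and means are those reported in Figure 4.3a and Figure 4.4a; the helper names are hypothetical.

```python
def cumulative_percent(freqs):
    """Cumulative percentage of the total after each histogram bin."""
    total = sum(freqs)
    running, out = 0, []
    for f in freqs:
        running += f
        out.append(100.0 * running / total)
    return out


def to_seconds(hms):
    """Convert an h:mm:ss string to seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return 3600 * h + 60 * m + s


# Frequencies of the s-a-0 fault latency histogram (Figure 4.3a).
sa0_fault_freq = [376, 73, 139, 105, 52, 114, 49, 15, 11, 4, 22]
print(sum(sa0_fault_freq))
print(cumulative_percent(sa0_fault_freq))

# Ratio of the mean fault latencies: s-a-0 (1:10:24) to s-a-1 (0:14:39).
print(to_seconds("1:10:24") / to_seconds("0:14:39"))
```

The frequencies sum to the 960 inserted faults, and the ratio of the two mean fault latencies is a little under 5, in line with the comparison made in the text.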
These investigations suggest that the difference between s-a-0 and s-a-1
fault latencies is due to the unequal lifetimes of 1's and 0's in the system.
This difference in the lifetimes is most likely due to the way memory is
allocated and released by programs. It is conjectured that often block storage
allocations may result in clearing large sections of memory to zero, e.g., array
initialization. The program, however, may use only a fraction of the allocated
storage, resulting in a large number of zeros in the memory. Discussions with
system programmers have suggested that this phenomenon can occur in both user
programs and system utilities. This may be true in non-Unix operating systems as
well.

The results obtained for fault and error latency are now compared. Intuition
leads one to believe that in general fault latency should be larger than the
error latency, because updates to a word need not necessarily change the faulty
bit. However, a (single) active fault is always discovered by the next access or
update to the word. It can be shown that this intuition would be true, provided
that the probability of a fault being inactive, i.e., the fault being latent,
were the same for both s-a-0 and s-a-1 faults. It is found that the above
intuition is true for the s-a-0 fault but not for the s-a-1 fault. This
difference is attributed to the fact that the average lifetime of a 0 is much
longer than that of a 1. As a consequence of the difference in lifetimes, the
s-a-0 fault remains latent with a probability of approximately 0.7 (the
probability for a s-a-1 is about 0.3).

So far the analysis pertains to the distributions in a region of memory. It
is found that the mean fault latency and error latency estimates vary
considerably from one region of memory to another. Typical variations in the
mean fault latency for a s-a-0 fault can range from 9 minutes to 50 minutes and
for a s-a-1 fault from 8 seconds to 6 minutes. Although the mean latencies have
large variations, the distributions from different regions are similar. This
variation in the means is attributed to the existence of different activity
spots in the memory. Thus it is clear that a single estimate of latency is not
adequate. The variation of latency with activity, caused by the workload, is
analyzed and quantified in the next section.
4.4.2. Workload-latency model 
Workload-failure models generated in [32, 2] relate failure rates to workload.
It is believed that an important component of the workload-failure
relationship is due to error latency. Since the time of error occurrence was not
known in the above studies, an explicit workload-latency model could only be
surmised.
To investigate the workload-latency dependency, latency due to faults
injected under various workload conditions is determined using the method
described earlier. The mean latency under different workloads is analyzed
using an analysis of variance (ANOVA) [33]. The ANOVA analysis can be used
to estimate the relative influence of different sources of variation on the values
of a performance index. Thus, in this case the relative influence of various
workload measures on latency is estimated. Workload data are gathered by
running a performance monitor on a machine. This performance monitor
records the average value of a number of high level performance parameters
every 30 seconds. Table 4.1 shows the various performance measures that were
recorded and used in the ANOVA analysis.
The ANOVA analysis reveals that the workload measures, namely, active
virtual memory (AVM), user CPU (CPUUS), and page reclaims (PRE), had a
large main effect. More than half of the contribution due to the interaction
terms was due to the interaction between AVM and PRE. Figure 4.5 is a pie
chart that shows the different workload parameters and their contributions to
the model. Figure 4.6 contains a time-of-day plot for the three workload
measures that had a large main effect. The other workload terms that also
influenced the variation are system CPU (CPUSY) and the context switch rate
(INTCS). The resulting linear model had an R-Square of 0.84 for latency. This
analysis was done for a workload range that can be termed as medium-low to
high and which corresponds to a CPU utilization of above 25 percent. The very
low workload range has been specifically excluded since activity in that
workload range tends to be very low and needs to be independently studied.
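The idea of attributing latency variation to a workload measure can be sketched
with a one-way analysis of variance. The sketch below is only an illustration:
the latency values, the three workload bins, and the function name
variance_explained are hypothetical and are not taken from the measurements. It
computes the share of the total sum of squares that lies between workload
groups, which plays the role of an R-Square for a single factor.

```python
from statistics import mean

def variance_explained(groups):
    """Return SS_between / SS_total for latencies grouped by workload level."""
    all_values = [x for g in groups.values() for x in g]
    grand = mean(all_values)
    # Total variation of every observation about the grand mean.
    ss_total = sum((x - grand) ** 2 for x in all_values)
    # Variation of the group means about the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    return ss_between / ss_total

# Hypothetical mean latencies (minutes), binned by a workload measure such as AVM.
latency_by_avm = {
    "low": [48.0, 52.0, 50.0],
    "medium": [30.0, 27.0, 33.0],
    "high": [11.0, 9.0, 10.0],
}
r_square = variance_explained(latency_by_avm)
print(f"fraction of latency variation explained: {r_square:.3f}")
```

A full model of the kind described above would carry several factors and their
interaction terms; the single-factor decomposition shown here is the building
block of that calculation.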
4.5. Discussion and Significance of Results 
The scope and implications of fault latency in the memory go far beyond
the memory subsystem. In [26] it is shown that the largest number of faults
occur in the memory, and, in addition, it has been shown that a large number
of the CPU errors originate in the memory. Thus, the importance of fault
latency in the memory cannot be overemphasized.
In [11] the fault latency of CPU pin level faults is studied through fault
injection and error detection. It is not possible to compare the two results since
TABLE 4.1. WORKLOAD PARAMETERS RECORDED
BY THE PERFORMANCE MONITOR.

Mnemonic  Function  Description                     Units
CPUUS     CPU       User time                       percent
CPUSY     CPU       System time                     percent
CPUID     CPU       Idle time                       percent
MAVM      MEM       Active virtual pages            number
MFRE      MEM       Size of free list               number
PGRE      PAGE      Page reclaims                   per sec
PGPI      PAGE      Pages paged in                  per sec
PGPO      PAGE      Pages paged out                 per sec
PGFR      PAGE      Pages freed                     per sec
ININ      FAULTS    Device interrupts               per sec
INSY      FAULTS    System calls                    per sec
INCS      FAULTS    Context switch                  per sec
PRR       PROCS     Processes in run queue          number
PRB       PROCS     Processes blocked (I/O, etc.)   number
PRW       PROCS     Processes runnable but swapped  number
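[Figure 4.5: pie chart of the contributions of the workload parameters to the
model. Slice labels: INTCS, CPUSY, CPUUS, AVM, PRE, Other, and Interaction
Terms. Legible slice values: 12.89%, 17.76%, 19.25%, 14.43%.
R-Square = 0.84157.]
Figure 4.5. Pie Chart Showing Workload-Latency Relationship.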
[Figure 4.6: three time-of-day plots: Active Virtual Memory (pages), User CPU
(percent), and Page Reclaims (pages/sec), each plotted against time of day.]
Figure 4.6. Plot of Three Workload Measures.
they relate to two entirely different system components. However, a
comparison of the methodology is instructive. The method in [11] is somewhat
indirect since the exact moment when the fault becomes active is not known.
Thus, this provides only an upper bound for fault latency. Since, in this
technique, the exact moment when the fault becomes active is known, the fault
latency computation is accurate. This is primarily due to the nature of the
memory data that were collected.
The measurements on fault and error latency distributions of s-a-0 and
s-a-1 faults show:
(1) The fault latency of a s-a-0 is much larger than that for s-a-1. The ratio
of the two is approximately 5:1.
(2) The estimated error latency for a s-a-0 is smaller than that for s-a-1.
(3) The differences in (1) and (2) are attributed to the difference in lifetimes
of zeros and ones in the memory.
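The link between bit lifetimes and fault latency in points (1) to (3) can be
illustrated with a small sketch. It assumes, as a simplification, that a
stuck-at fault becomes active at the first scan in which the cell is supposed
to hold the complement of the stuck value. The function name fault_latency and
the bit history are hypothetical and are not taken from the memory scan data.

```python
def fault_latency(bit_history, inject_at, stuck_value):
    """Scans elapsed from injection until the fault becomes active.

    bit_history holds the fault-free value of one memory bit at successive
    scans. The fault is active as soon as the fault-free value differs from
    stuck_value. Returns None if the fault stays latent to the end of the data.
    """
    for offset, value in enumerate(bit_history[inject_at:]):
        if value != stuck_value:
            return offset
    return None

# Hypothetical bit that holds 0 most of the time (long lifetime of a 0).
history = [0, 0, 0, 0, 1, 1, 0, 0, 0, 0]
sa0 = fault_latency(history, 0, stuck_value=0)  # must wait for a 1 to be stored
sa1 = fault_latency(history, 0, stuck_value=1)  # active at once, the cell holds 0
print(sa0, sa1)
```

With a cell that holds 0 most of the time, a s-a-0 fault waits for the next 1
while a s-a-1 fault is exposed almost immediately, which is the asymmetry
summarized in the list above.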
4.6. Summary 
This chapter has demonstrated a technique to accurately determine fault
latency under real workload conditions in the memory subsystem. This
technique used real memory scan data from a VAX 11/780 running Unix. Fault
latency distributions were generated for s-a-0 and s-a-1 permanent fault
models. The mean fault latency of a s-a-0 fault is nearly 5 times that of a s-a-1
fault. It is likely that the above phenomenon is characteristic of other systems
as well. Large fault latencies are a reason for concern since they can result in
multiple errors. From the data, the s-a-0 fault is clearly a cause for greater
concern. An estimate of the error latency was also provided and a
workload-latency model developed using ANOVA. Workload and latency have a linear
relationship for a medium-to-high workload range.
CHAPTER 5
ERROR LATENCY IN THE MICROCONTROL STORE
5.1. Introduction 
In many Central Processing Unit (CPU) designs, the area taken up by the
microcontrol store is significant; therefore, its reliability is important. The
VAX 11/780 is a microprogrammed machine; hence, the microcontrol store is a
key element in the operation of the CPU. Errors in the microcontrol store can
cause catastrophic failures.
The execution of an instruction in the 11/780 requires a sequence of
microoperations. The sequence of microoperations is determined by the
microprogram contained in the microcontrol store. The microcontrol store in
the 11/780 consists of programmable read only storage (PCS) and a writable
diagnostic control store (WDCS). The microword in the 11/780 is 96 bits wide
with an additional 3 parity bits. Each microword is comprised of several fields
which control specific functions in the processor. A detailed description of the
fields and the format can be found in [34]. The PCS provides storage for 4K
microwords and the WDCS has a writable storage for 1K words. Thus the
microaddress is 13 bits wide. The WDCS is mainly used for modifications to
the original microprogram and for user-written microcode. The microcode is
loaded into the WDCS during system startup and, for the most part, during the
[...] i.e., an error, in the microcode can cause failure. Although the parity will
detect any single errors in each 32 bit field of the 96 bit word, there is no
recovery from these errors. In this chapter the error latency associated with such
faults in the microcontrol store will be measured and analyzed.
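The per-field parity protection mentioned above can be sketched as follows.
This is a minimal illustration, assuming one parity bit per 32-bit field of the
96-bit microword. The parity convention (count of 1 bits modulo 2), the bit
numbering, and the names field_parities and failing_fields are assumptions made
for the example and are not taken from the KA780 documentation.

```python
FIELD_BITS = 32
FIELDS = 3  # 3 x 32 bits = 96-bit microword

def field_parities(word):
    """Parity (number of 1 bits mod 2) of each 32-bit field, lowest field first."""
    mask = (1 << FIELD_BITS) - 1
    return [bin((word >> (i * FIELD_BITS)) & mask).count("1") % 2
            for i in range(FIELDS)]

def failing_fields(word, stored_parities):
    """Indices of the fields whose recomputed parity disagrees with the stored bit."""
    return [i for i, p in enumerate(field_parities(word))
            if p != stored_parities[i]]

good = 0x0123456789ABCDEF00FF00FF        # arbitrary 96-bit microword
stored = field_parities(good)            # parity bits written with the word
single = good ^ (1 << 40)                # one inverted bit in field 1
double = single ^ (1 << 41)              # a second inverted bit in the same field
print(failing_fields(single, stored), failing_fields(double, stored))
```

A single inverted bit is flagged in its field, while two inverted bits in the
same field cancel out. In either case parity only detects; it does not correct.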
5.2. Instrumentation
The type of data that are needed for measuring error latency in the
microcontrol store is similar to the data used for the error latency measurements in
the memory (Chapter 3). Essentially, data on the access and use of the
different microwords are required to measure error latency. These data are
generally available from the microsequencer in the machine or the instruction
decode logic. The primary function of the microsequencer is to provide the
address of a word in the control store. A description of the 11/780
microsequencer that is necessary for the instrumentation is presented in the following
section. Full details appear in [34].
5.2.1. The microsequencer
The microsequencer controls the entry to the microprogram during the
normal program flow and also during special conditions such as powerup,
microtraps, stalls, console operations, and microword patches. The address of
the next microword to be executed is broadcast on the microprogram counter
bus (UPC bus) to the PCS and the WDCS. In the case of a decision point
branch, the lower-order bits of the microaddress are generated by the
instruction decode logic. The most significant bit (bit 12) of the microaddress
determines which control store is addressed. When bit 12 of the microaddress is 0,
the PCS is addressed, and when it is 1, the WDCS is addressed.
The source of the microword address is dependent on the mode of the
microsequencer operation. These modes are dependent on conditions that are
generated in other parts of the CPU. The typical conditions are power up/down
initialization, maintenance, cache stall, microtrap, microword patch, and
normal operation. During the normal mode of operation, the next microaddress is
selected from the Jump and Branch Enable field of the current microword.
Under a microword patch, the microsequencer generates an address for the
WDCS. This is done by a lookup table through a Field Programmable Logic
Array (FPLA) which contains PCS addresses that require changes to the
corresponding WDCS addresses containing the new microcode. This causes a
no-op cycle to fetch the new microaddress. When a microtrap occurs, the
microsequencer generates specific vector addresses which contain trap handling
conditions in the CPU. This also causes a no-op cycle in which the new
microaddress can be formed. The microtrap causes microword registers to be
cleared and an abort cycle to be generated. A cache stall mode is initiated when
a cache miss occurs. In this mode the execution of the next microinstruction is
temporarily prevented. Under a cache stall mode the microprogram is in a no-op
state, and this can continue for several cycles until the stall condition is
negated. In the maintenance mode the console can control various operations of
the microsequencer. During power up/down initialization the microsequencer is
forced to a constant microtrap vector.
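The control store selection rule described at the start of this section (bit 12
of the 13-bit microaddress selects the PCS when 0 and the WDCS when 1) can be
written out as a short sketch. The function name decode_microaddress, and the
treatment of the low-order 12 bits as an offset within the selected store, are
illustrative assumptions.

```python
def decode_microaddress(addr):
    """Split a 13-bit microaddress into (control store name, 12-bit offset)."""
    if not 0 <= addr < (1 << 13):
        raise ValueError("microaddress must fit in 13 bits")
    # Bit 12 selects the store: 0 addresses the PCS, 1 addresses the WDCS.
    store = "WDCS" if (addr >> 12) & 1 else "PCS"
    return store, addr & 0x0FFF

print(decode_microaddress(0x0ABC))
print(decode_microaddress(0x1004))
```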
5.2.2. Data acquisition 
The addresses that are generated by the microsequencer are visible on the
UPC bus. These addresses are accessed by probes placed on the backplane of the
microsequencer card. From the discussion above, it can be seen that the
microaddress is not valid on every processor cycle due to the different modes of
operation. Hence, the stall and abort cycle signals are used to qualify the cycles
when the data are valid.
The probes on the DAS that are used to acquire the data have an
acquisition memory that is 1024 words deep. Thus, each sample will contain 1024
microaddresses. Similar to the data acquisition in Chapter 3, the acquisition can
be in either a regular or a compressed mode. In the regular mode, each CPU
cycle is stored; thus, the data contain cycles which include cache stalls and
abort cycles. In the compressed mode, the stall and abort cycles are ignored.
The regular mode acquisition is particularly useful for performance
measurements since cache stall can be studied. For analysis that only needs the
microaddress trace, the compressed mode is preferred.
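The difference between the two acquisition modes can be mimicked in software.
In the sketch below a regular-mode sample is a list of cycles, each carrying the
microaddress together with the stall and abort qualifiers, and the compressed
trace is obtained by dropping the qualified cycles. The Cycle record and the
compress function are hypothetical names; in the instrumentation itself the
qualification is done by the acquisition hardware.

```python
from typing import NamedTuple

class Cycle(NamedTuple):
    microaddress: int
    stall: bool   # cache stall: the microprogram is in a no-op state
    abort: bool   # abort cycle, e.g. generated by a microtrap

def compress(cycles):
    """Keep only the microaddresses of cycles that are neither stalls nor aborts."""
    return [c.microaddress for c in cycles if not (c.stall or c.abort)]

# Hypothetical regular-mode sample of five CPU cycles.
regular = [
    Cycle(0x010, False, False),
    Cycle(0x010, True, False),   # stalled, microaddress not valid
    Cycle(0x011, False, False),
    Cycle(0x200, False, True),   # aborted
    Cycle(0x201, False, False),
]
trace = compress(regular)
print([hex(a) for a in trace])
```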
The DAS is connected to the host machine (a Gould 9050) via an RS232-C
port. This facilitates programming the DAS through a GPIB protocol and
provides up-loading of the acquisition memory to store on tape. In this setup, the
DAS can be periodically triggered and the system repeatedly sampled.
The VAX 11/780 provides a facility called the Performance Monitor Enable
(PME). The PME is a signal in hardware that can be seen on the backplane and
is also a bit in one of the registers in the process control block. If the PME bit is
set, then the hardware PME signal is set whenever the particular process
executes. This enables monitoring of a specific process in a mix. This
instrumentation facilitates qualifying the data acquisition as the PME bit provides a means
to make microaddress or cache stall measurements on a specific process. By
looking at the transition times of the PME signal, one can study the context
switch times, which are particularly hard to study in a simulation environment.
5.3. Measurement and Analysis 
5.3.1. Microcode usage distribution 
The total address space of the microcontrol store is 5K words. This is
comprised of 4K of PCS and 1K of WDCS. Measurements were made during
the medium workload and a mix of interactive and batch programs. In this
workload the usage distribution of the microcode stabilized with around 32 to
~8 acquisitions, each containing about 1000 microaddresses. This microcode
usage distribution is shown in Figure 5.1. By studying the microcode usage
distribution, it is clear that a small portion of the microcode accounts for a large
part of the accesses. This type of usage is typical for machines with large
instruction sets.
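A usage distribution of this kind can be tabulated from a microaddress trace
with a few lines of code. The sketch below is an illustration under stated
assumptions: the bins are taken to be 200 addresses wide and centred on the
reported midpoints, and the function name usage_distribution and the short
trace are hypothetical.

```python
from collections import Counter

def usage_distribution(trace, bin_width=200):
    """Return rows of (bin midpoint, percent, cumulative percent)."""
    half = bin_width // 2
    # Assign each microaddress to the bin whose midpoint is nearest.
    counts = Counter((addr + half) // bin_width * bin_width for addr in trace)
    total = len(trace)
    rows, cumulative = [], 0.0
    for midpoint in sorted(counts):
        percent = 100.0 * counts[midpoint] / total
        cumulative += percent
        rows.append((midpoint, percent, cumulative))
    return rows

# Hypothetical trace of five microaddresses.
rows = usage_distribution([5, 10, 150, 210, 4100])
for midpoint, percent, cumulative in rows:
    print(f"{midpoint:6d} {percent:7.2f} {cumulative:7.2f}")
```

Accumulating the counts over successive acquisitions and watching the
percentages settle is one way to judge when the distribution has stabilized.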
5.3.2. Interaccess time
The time between accesses to the same microword in the control store is
called the interaccess time. This interaccess time can be measured from the data
provided that it is less than 1000 cycles, since the acquisition buffer is 1000
words deep. Computing the interaccess time provides two useful measures.
[Figure 5.1: horizontal bar chart of access frequency by microaddress bin. The
values printed beside the bars are tabulated below.]

MIDPOINT                  CUM.
MICROADDRESS   PERCENT    PERCENT
       0        28.27      28.27
     200        15.31      43.57
     400         3.18      46.75
     600        10.46      57.21
     800         5.85      63.06
    1000         0.71      63.77
    1200         1.81      65.58
    1400         0.32      65.90
    1600         0.34      66.24
    1800         0.28      66.52
    2000         0.07      66.59
    2200         2.74      69.33
    2400         0.19      69.52
    2600         0.47      69.99
    2800         0.77      70.76
    3000         0.22      70.98
    3200         6.76      77.74
    3400         1.36      79.10
    3600         1.20      80.30
    3800         5.91      86.21
    4000         0.38      86.59
    4200         2.04      88.63
    4400         1.45      90.08
    4600         0.35      90.42
    4800         2.07      92.49
    5000         1.24      93.73
    5200         3.62      97.35
    5400         2.33      99.68
    5600         0.04      99.73
    5800         0.27     100.00
    6000         0.00     100.00

Figure 5.1. Usage Distribution for the Microcontrol Store.
First, one sees the distribution of the interaccess time. Second, the percentage of
the used microaddresses that have an interaccess time that is less than 1000
cycles can be determined. This is necessary to develop a measure of confidence
for the error latency distribution that is later computed.
Figure 5.2 shows the interaccess time distribution computed over a large
number of acquisition samples. The number of samples required to stabilize the
interaccess time distribution is comparable with the number needed to stabilize
the usage distribution. It is found that the microwords that have, on the
average, an interaccess time less than 1000 cycles constitute over 80% of the
microwords that are commonly used. This percentage is determined by comparing
the microwords in the interaccess time distribution with those in the microcode
usage distribution. This essentially quantifies the coverage of the error latency
distribution that can be computed from these data. In summary, the error
latency distribution that is generated will be limited to latencies that are a
maximum of 1000 cycles, which is the case 80% of the time.
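The two measures discussed in this section can be computed from a compressed
microaddress trace as sketched below. The sketch treats the position in the
trace as the cycle count and takes the coverage to be the fraction of used
microaddresses that are accessed at least twice within the sample; both are
simplifying assumptions, and the function names and the short trace are
hypothetical.

```python
def interaccess_times(trace):
    """Map each microaddress to the gaps (in cycles) between consecutive accesses."""
    last_seen, gaps = {}, {}
    for cycle, addr in enumerate(trace):
        if addr in last_seen:
            gaps.setdefault(addr, []).append(cycle - last_seen[addr])
        last_seen[addr] = cycle
    return gaps

def coverage(trace):
    """Fraction of the used microaddresses with a measurable interaccess time."""
    used = set(trace)
    return len(interaccess_times(trace)) / len(used)

# Hypothetical six-cycle sample; microaddress 9 is used only once.
sample = [7, 3, 7, 9, 3, 7]
print(interaccess_times(sample), coverage(sample))
```

An address that is seen only once in a sample has an interaccess time longer
than what remains of the acquisition buffer, so it contributes to the uncovered
fraction rather than to the distribution.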
5.4. Error Latency Calculation 
Recall from Chapter 2 that the error latency is the time between the
occurrence of an error and the consequent failure. In the case of the
microcontrol store, there is only an error latency issue since the measured microcode is
read only. Error latency is computed by simulating the occurrence of active
faults (inverted bit) in the data and determining the time taken to cause
failure. The failure will be caused on the following use of the microword.
[Figure 5.2: horizontal bar chart of the microword interaccess time
distribution, for 80% of the addresses. The values printed beside the bars are
tabulated below.]

MIDPOINT                  CUM.
DEL.TIM        PERCENT    PERCENT
       0         5.31       5.31
      25         6.00      11.32
      50         4.62      15.94
      75         3.46      19.40
     100         5.77      25.17
     125         4.39      29.56
     150        10.85      40.42
     175        10.16      50.58
     200        12.24      62.82
     225         7.39      70.21
     250         7.85      78.06
     275         2.31      80.37
     300         1.39      81.76
     325         0.92      82.68
     350         6.70      89.38
     375         0.69      90.07
     400         1.39      91.45
     425         1.15      92.61
     450         0.23      92.84
     475         0.00      92.84
     500         0.23      93.07
     525         0.00      93.07
     550         0.23      93.30
     575         0.00      93.30
     600         0.00      93.30
     625         0.00      93.30
     650         0.00      93.30
     675         0.00      93.30
     700         0.00      93.30
     725         6.70     100.00

Figure 5.2. Microword Interaccess Time Distribution.
From the discussion in the previous section on interaccess time, it is clear
that 80% of the errors have a latency which is less than 1000 cycles. Figure
5.3 illustrates the error latency calculation. Errors are inserted on randomly
chosen microwords at a given instant in time. With reference to Figure 5.3, at
time t errors e_1 occur on microword w_1 and e_2 on w_2. The first access and use
of microword w_1 will cause a failure at t'; hence, the latency for e_1 is l_1. If,
however, there is no access to a microword in the sample, as is the case of w_2,
the latency of that error may be larger than 1000 cycles. This process of
determining latency is repeated over a large number of samples which then provide
an average behavior. The intersample time is randomly distributed, and the
microwords chosen to contain a fault are drawn from a uniform distribution.
Error latency is computed for the same set of randomly chosen erroneous
locations over many samples. This results in generating a stable error latency
distribution. The number of samples required to stabilize the distribution is
comparable to that needed to stabilize the usage distribution.
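The calculation described above can be sketched as follows. The function
error_latency returns the number of cycles from the error occurrence at time t
to the first later use of the erroneous microword, or None when the microword
is not used again within the sample (latency unknown, as for w_2). The function
names, the short sample, and the fixed random seed are illustrative
assumptions; counting an access at time t itself as latency 0 is also a choice
made for the sketch.

```python
import random

def error_latency(sample, word, t):
    """Cycles from the error at time t to the first later use of `word`."""
    for cycle in range(t, len(sample)):
        if sample[cycle] == word:
            return cycle - t
    return None  # not used again in this sample: latency unknown

def inject_errors(sample, words, t):
    """Latency of an error on each chosen microword, all occurring at time t."""
    return {w: error_latency(sample, w, t) for w in words}

# Hypothetical compressed microaddress sample of eight cycles.
sample = [4, 8, 15, 4, 16, 23, 8, 42]
print(inject_errors(sample, [4, 99], t=1))

# Erroneous locations drawn uniformly from the 5K-word control store.
rng = random.Random(0)  # fixed seed so the same locations are reused per sample
chosen = rng.sample(range(5 * 1024), 3)
print(inject_errors(sample, chosen, t=0))
```

Repeating inject_errors with the same chosen locations over many samples and
pooling the results that are not None gives an empirical latency distribution
of the kind reported in this section.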
Figure 5.4 shows the error latency distribution. Note that there is a large
mode in the 50 to 100 cycle range. There are also two other modes around 250
and 600 cycles. The mean for this distribution is 310 cycles with a standard
deviation of 267. Unlike the error latency distribution in the memory, which
can be very large and can have a second large mode, this is skewed to have the
shorter latencies dominate. It is interesting to compare this with the interaccess
time distribution. The interaccess time distribution has modes around 200 and
350 cycles. If the access of the microcode was uniform, which it is not, one
[Figure 5.3: timing diagram against a time axis that runs to the end of the
sample. Error e_1 is on microword w_1 and error e_2 is on microword w_2. For
w_1, the interval from the error occurrence time to the next access on w_1 is
the latency l_1. For w_2, no access follows the error occurrence before the end
of the sample, so l_2 is unknown.]
Figure 5.3. Error Latency Time Calculation.
[Figure 5.4: horizontal bar chart of the error latency distribution (Mean 310
cycles; Std Dev 267). The values printed beside the bars are tabulated below.]

MIDPOINT
LATENCY                   CUM.
CYCLES         PERCENT    PERCENT
       0        10.30      10.30
      50        13.74      24.04
     100        10.37      34.41
     150         7.91      42.32
     200         5.61      47.93
     250         7.39      55.31
     300         5.67      60.99
     350         4.20      65.19
     400         4.51      69.69
     450         3.58      73.27
     500         3.35      76.62
     550         2.53      79.15
     600         3.91      83.06
     650         4.10      87.16
     700         3.22      90.38
     750         2.66      93.04
     800         1.45      94.48
     850         1.99      96.47
     900         1.55      98.02
     950         1.36      99.38
    1000         0.62     100.00

Figure 5.4. Error Latency Distribution for the Microcontrol Store.
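As a consistency check, the mean and standard deviation quoted with Figure 5.4
can be approximately recovered from the binned percentages. The sketch below
treats every latency in a bin as if it sat at the bin midpoint, which is an
approximation, and takes the 250-cycle bin as 7.39, the value implied by the
cumulative column.

```python
from math import sqrt

# Bin midpoints (cycles) and percentages as reported with Figure 5.4.
midpoints = list(range(0, 1001, 50))
percent = [10.30, 13.74, 10.37, 7.91, 5.61, 7.39, 5.67,
           4.20, 4.51, 3.58, 3.35, 2.53, 3.91, 4.10,
           3.22, 2.66, 1.45, 1.99, 1.55, 1.36, 0.62]

total = sum(percent)  # close to 100; normalizing absorbs rounding in the table
mean = sum(m * p for m, p in zip(midpoints, percent)) / total
variance = sum(p * (m - mean) ** 2 for m, p in zip(midpoints, percent)) / total
std_dev = sqrt(variance)
print(f"mean = {mean:.1f} cycles, std dev = {std_dev:.1f} cycles")
```

The binned estimate agrees with the reported mean of 310 cycles and standard
deviation of 267 to within a few cycles.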
would expect a one-one relationship between the distributions. Results show
that this is not the case.
The study in [5] on error latency in an avionic miniprocessor, using a gate
level simulation and test programs, had faults injected into the ALU, address
processor, and micromemory. Although a direct comparison between the two is
not possible, since this study is restricted to the microcontrol store, a discussion
on the two results is instructive. The latency distribution reported in the
McGough study showed an almost exponential behavior, with the slight
nonmonotonicity attributed to statistical fluctuation. In this study it is found that
the distribution is not exponential. The distribution has a large mode with two
other smaller modes.
5.5. Summary 
This chapter determines error latency in the microcontrol store of a VAX
11/780 processor. The microcontrol store is a significant part of the processor;
hence, errors in the control store cause a catastrophic failure of the machine.
Microaddress traces occurring during the regular workload of the machine are
gathered from probes placed in the microsequencer of the processor. The
latency distribution has a large mode between 50 and 100 microcycles and two
additional smaller modes. It is interesting to note that the error latency
distribution in the microcontrol store is not exponential, in contrast to the nearly
exponential distribution reported in a similar study performed using a gate
level simulation of an avionic processor.
CHAPTER 6 
CONCLUSIONS 
6.1. Summary and Discussion of Results 
This thesis has developed a systematic experimental approach to study
fault and error latency under real workload. The determination of these
latencies is an important and unsolved issue in fault tolerant computing with
significant implications in both reliability prediction and testing. The
methodology, based on gathering relevant low level data from the machine during the
normal workload cycle of the installation, was demonstrated on a VAX 11/780
system. In particular, error latency in the system region (largely unpaged) of
the memory was studied by using hardware probes placed on the synchronous
backplane interconnect and gathering data on physical memory access and
usage. Fault latency in the user sections of the paged memory was determined
using data from memory scans. This latency information, taken together with
the workload data on the machine, was used to develop a workload-latency
relationship. Error latency in the microcontrol store is determined by using
probes in the microsequencer and gathering data on the microaddress sequence
executed.
Chapter 3 studies the error latency in the system region of the memory.
The data collected are used to reconstruct the error discovery process in
memory under different workload conditions. The use of sampling is validated
by an analysis of the sampling factor, the class size, and the computed fault
miss percentage. A regression based projection is used to determine the real
fault miss percentage. This is verified using data for which the miss percentages
are known. The analysis and its verification substantiate the overall approach
of using sampling to reconstruct the error discovery process.
The results provide general guidelines for understanding latency behavior.
The study finds that the mean error latency, in the unpaged memory containing
the operating system, varies by a factor of 10 to 1 (in hours) between the low
and high workloads within a day. The hazard rate, computed from the error
latency distribution, clearly shows that the observed failure rate increases
during higher workloads. Analysis using consecutive days of data shows that a
fault is typically discovered the same day with 70% confidence, within the next
day with 82% confidence, and within the third with 91% confidence; i.e., there
is a small but significant fault discovery in the second and third days. This
method, in addition to determining error latency, provides a means to study the
fault miss probability. The fault miss percentage varies widely between regions
of memory depending on the activity, i.e., workload, and can only be expressed
with reference to a specific region of memory and a finite observation period.
Chapter 4 demonstrates a technique to accurately determine fault latency
under real workload conditions in the memory subsystem. This technique used
real memory scan data from a VAX 11/780 running Unix. Fault latency
distributions were generated for s-a-0 and s-a-1 permanent fault models. The mean
fault latency of a s-a-0 fault is nearly 5 times that of a s-a-1 fault. It is likely
that the above phenomenon is characteristic of other systems as well. Large
fault latencies are a reason for concern since they can result in multiple errors.
From the data, the s-a-0 fault is clearly a cause for greater concern. An
estimate of the error latency was provided and a workload-latency model
developed using ANOVA. Workload and latency have a linear relationship for
a medium-to-high workload range. More experimental analysis on different
machines is suggested to further understand the latency problem.
Chapter 5 determines error latency in the microcontrol store of a VAX
11/780 processor. The microcontrol store is a significant part of the processor;
hence, errors in the control store cause a catastrophic failure of the machine.
Microaddress traces occurring during the regular workload of the machine are
gathered from probes placed in the microsequencer of the processor. It is found
that the latency distribution has a large mode between 50 and 100 microcycles
and two additional smaller modes. It is interesting to note that the error
latency distribution in the microcontrol store is not exponential, in contrast to
a similar study performed using a gate level simulation of an avionic
processor.
As with any statistical analysis, caution should be exercised in
extrapolating the absolute numbers obtained in this study to other nonsimilar systems.
6.2. Suggestions for Future Research 
This thesis has extensively analyzed the problem of fault and error
latency with respect to hardware faults in a single machine. With the growth
of the computing environment into distributed machines and clusters of
machines, the problems of latency gain even greater importance. Fault and
error latency in such environments remain to be studied. Although a similar
methodology may be employed, the joint collection of data and simulation of
error in multiple environments provides a challenging problem for future
research. Some extensions to this research are:
Specific Environment
It will be interesting to see this study performed on different computing
environments. Specifically, it is likely that the behavior of batch and
interactive workloads may be different. Real-time applications and other specific
applications are also likely to have different latency behaviors.
CPU Study Extension
The data acquired on microcode usage have a number of different
applications. In particular, the microcode usage trace, taken together with the
microcode of the machine, can be used for some excellent studies on fault propagation
and diagnosis. Since the fields of the microword are known, the severity of a
fault in the CPU can be determined by simulating faults in the CPU and
exercising them with the microcode trace. This study can also lead to some
very useful information on designing diagnostics for the machine.
The possibility of multiple faults occurring due to large latencies is an
important consequence. This issue has to be specifically studied. It will be
interesting to see the effect of different fault arrival distributions given that
there now exists some idea of the latency distributions.
Software Failures
A large number of failures occur due to software faults. Latency of
software faults in a production system needs investigation. By definition,
software faults are latent; however, the concept of a software fault needs more
precise definition. It is likely that the existence of an active software fault is
conditional on a variety of workload conditions. This condition may in some
sense be termed the moment of error generation. The problem of detecting
software error generation is quite complex and requires further research.
REFERENCES 
[1] R. K. Iyer, S. E. Butner, and E. J. McCluskey, "A Statistical Failure/Load
Relationship: Results of a Multi-Computer Study," IEEE Transactions on
Computers, vol. C-31 no. 7, pp. 697-706, 1982.
[2] X. Castillo and D. P. Siewiorek, "Workload, Performance and Reliability
of Digital Computing Systems," in Proceedings 11th International
Symposium on Fault-Tolerant Computing, Portland, Maine, pp. 84-89, 1981.
[3] J. C. Laprie, "Dependable Computing and Fault Tolerance: Concepts and
Terminology," in Proceedings 15th International Symposium on
Fault-Tolerant Computing, Ann Arbor, Michigan, pp. 2-11, 1985.
[4] J. G. McGough and F. L. Swern, "Measurement of Fault Latency in a
Digital Avionic Mini Processor Part II," NASA Contractor Report 3651,
1983.
[5] J. G. McGough, F. L. Swern, and S. Bavuso, "New Results in Fault
Latency Modelling," in Proceedings IEEE EASCON Conference, Washington,
DC, 1983.
[6] J. H. Lala, "Fault Detection, Isolation and Reconfiguration in FTMP:
Methods and Experimental Results," Proceedings 5th Avionics Systems
Conference, 1983.
[7] B. Courtois, "Some Results about the Efficiency of Simple Mechanisms for
the Detection of Microcomputer Malfunction," in Proceedings 9th
International Symposium on Fault-Tolerant Computing, Madison, Wisconsin,
pp. 71-74, 1979.
[8] B. Courtois, "A Methodology for On-line Testing of Microprocessors," in
Proceedings 11th International Symposium on Fault-Tolerant Computing,
Portland, Maine, pp. 272-274, 1981.
[9] J. J. Shedletsky, "A Rollback Interval for Networks with an Imperfect
Self-checking Property," IEEE Transactions on Computers, vol. C-27 no.
6, pp. 500-508, 1978.
[10] J. J. Shedletsky and E. J. McCluskey, "The Error Latency of a Fault in a
Combinational Digital Circuit," in Proceedings 5th International
Symposium on Fault-Tolerant Computing, Paris, France, pp. 210-214, 1975.
[11] K. G. Shin and Y. H. Lee, "Measurement and Application of Fault
Latency," IEEE Transactions on Computers, vol. C-35 no. 4, pp. 370-375,
1986.
[12] R. Chillarege and R. K. Iyer, "The Effect of System Workload on Error
Latency: An Experimental Study," in ACM SIGMETRICS Conference on
Measurement and Modeling of Computer Systems, Austin, Texas, pp.
69-77, 1985.
[28] DEC, VAX Architecture Handbook, Digital Equipment Corporation,
1980.
[29] DEC, KA780 Field Maintenance Print Set, Digital Equipment
Corporation, 1980.
[30] K. S. Trivedi, Probability and Statistics with Reliability, Queueing, and
Computer Science Applications. Englewood Cliffs, N.J.: Prentice-Hall,
1982.
[31] M. L. Shooman, Probabilistic Reliability: An Engineering Approach. New
York: McGraw-Hill, 1968.
[32] R. K. Iyer and D. J. Rossetti, "A Statistical Load Dependency of CPU
Errors at SLAC," in Proceedings 12th International Symposium on
Fault-Tolerant Computing, Santa Monica, California, pp. 363-372, 1982.
[33] W. Mendenhall, Statistics for the Engineering and Computer Sciences.
California: Dellen Publishing Co., 1984.
[34] DEC, KA780 Central Processor Technical Description, Digital Equipment
Corporation, 1979.
VITA

Ram Chillarege was born in ____ on ____. He received his B.Sc. degree
from the University of Mysore, India, in September 1974. He obtained his
B.E. degree in Electrical and Electronics Engineering and his M.E. degree
in Automation from the Indian Institute of Science, Bangalore, India, in
1977 and 1979, respectively. At the University of Illinois he was employed
as a research assistant with the Computer Systems Group at the Coordinated
Science Laboratory from 1981 to 1986.

His current research interests include measurement and performance
analysis, computer architecture, high-performance systems, reliability, and
VLSI systems.
REPORT DOCUMENTATION PAGE

1a. Report Security Classification: Unclassified
1b. Restrictive Markings: None
2a. Security Classification Authority: NA
2b. Declassification/Downgrading Schedule: NA
3. Distribution/Availability of Report: Approved for public release, distribution unlimited
4. Performing Organization Report Number(s): UILU-ENG-86-2230 (CSG-55)
5. Monitoring Organization Report Number(s): NA
6a. Name of Performing Organization: Coordinated Science Laboratory, University of Illinois
6b. Office Symbol (if applicable): NA
6c. Address (City, State and ZIP Code): 1101 W. Springfield Ave., Urbana, IL 61801
7a. Name of Monitoring Organization: JSEP (Joint Services Electronics Program); NASA (National Aeronautics and Space Administration); GRB (Graduate Research Board)
7b. Address (City, State and ZIP Code):
    NASA: NASA Langley Research Center, Hampton, VA
    JSEP: Office of Naval Research, 800 N. Quincy, Arlington, VA
    GRB: 125 Coble Hall, 801 S. Wright, Champaign, IL
8a. Name of Funding/Sponsoring Organization: JSEP, NASA, GRB (see 7a)
8b. Office Symbol (if applicable): NA
8c. Address (City, State and ZIP Code): (see 7b)
9. Procurement Instrument Identification Number:
    JSEP: N00014-84-C-0149
    NASA: NAG-1-613
    GRB: no id number
10. Source of Funding Nos.: Program Element No.: N/A; Project No.: N/A; Task No.: N/A; Work Unit No.: N/A
11. Title (Include Security Classification): Fault and Error Latency Under Real Workload - An Experimental Study
12. Personal Author(s): Ram Chillarege
13a. Type of Report: Technical
13b. Time Covered: From - To -
14. Date of Report (Yr., Mo., Day): August, 1986
15. Page Count:
16. Supplementary Notation: NA
17. COSATI Codes (Field, Group, Sub. Gr.):
18. Subject Terms (Continue on reverse if necessary and identify by block number): Error and Fault Latency, Failure Rate, Experimental Study, Workload Effects, Analysis of Variance
19. Abstract (Continue on reverse if necessary and identify by block number):

This thesis demonstrates a practical methodology for the study of fault and error
latency under real workload. This is the first study that measures and quantifies the
latency under real workload and fills a major gap in the current understanding of
workload-failure relationships. The methodology is based on low level data gathered on a
VAX 11/780 during the normal workload conditions of the installation. Fault occurrence is
simulated on the data, and the error generation and discovery process is reconstructed to
determine latency. The analysis proceeds to combine the low level activity data with high
level machine performance data to yield a better understanding of the phenomenon. This
study finds a strong relationship between latency and workload and quantifies the
relationship. The sampling and reconstruction techniques used are also validated.

Error latency in the memory where the operating system resides is studied using data
on physical memory access. These data are gathered through hardware probes in the machine
that sample the system during the normal workload cycle of the installation. The technique
provides a means to study the system under different workloads and for multiple days. These
data are used to reconstruct the error discovery process in the system. An approach to
determine the fault miss percentage is developed and a verification of the entire
methodology is also performed. This study finds that the mean error latency, in the memory
containing the operating system, varies by a factor of 10 to 1 (in hours) between the low
and high workloads. It is also found that of all errors occurring within a day, 70% are
detected in the same day, 82% within the following day, and 91% within the third day.

Fault latency in the paged sections of memory is determined using data from physical
memory scans. Fault latency distributions are generated for s-a-0 and s-a-1 permanent fault
models. Results show that the mean fault latency of a s-a-0 fault is nearly 5 times that
of the s-a-1 fault. Performance data gathered on the machine are used to study
workload-latency behavior. An analysis of variance model to quantify the relative influence
of various workload measures on the evaluated latency is also given.

Error latency in the microcontrol store is studied using data on the microcode [...]
and usage. These data are acquired using probes in the microsequencer of the CPU. It is
found that the latency distribution has a large mode between 50 and 100 microcycles and
additional smaller modes. It is interesting to note that the error latency distribution
in the microcontrol store is not exponential as suggested by other reported research.

20. Distribution/Availability of Abstract (options: Unclassified/Unlimited, Same as Rpt., DTIC Users):
21. Abstract Security Classification: Unclassified
22a. Name of Responsible Individual:
22b. Telephone Number (Include Area Code):
22c. Office Symbol:

DD FORM 1473, 83 APR. Edition of 1 Jan 73 is obsolete.
Security Classification of This Page: Unclassified
