The Effect of System Workload on Error Latency: An Experimental Study by Chillarege, Ram & Iyer, Ravishankar K.
REPORT CSG-40 JANUARY 1985
COORDINATED SCIENCE LABORATORY
COMPUTER SYSTEMS GROUP
THE EFFECT OF SYSTEM 
WORKLOAD ON ERROR LATENCY: 
AN EXPERIMENTAL STUDY
RAM CHILLAREGE 
RAVISHANKAR K. IYER
APPROVED FOR PUBLIC RELEASE. DISTRIBUTION UNLIMITED
REPORT R-1027 UILU-ENG 85-2202
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
Unclassified
SECURITY CLASSIFICATION OF THIS PAGE
REPORT DOCUMENTATION PAGE
1a. REPORT SECURITY CLASSIFICATION: Unclassified
1b. RESTRICTIVE MARKINGS: None
2a. SECURITY CLASSIFICATION AUTHORITY: N/A
2b. DECLASSIFICATION/DOWNGRADING SCHEDULE: N/A
3. DISTRIBUTION/AVAILABILITY OF REPORT: Approved for public release, distribution unlimited.
4. PERFORMING ORGANIZATION REPORT NUMBER(S): CSG #40 R-1027
5. MONITORING ORGANIZATION REPORT NUMBER(S): N/A
6a. NAME OF PERFORMING ORGANIZATION: Coordinated Science Lab., University of Illinois
6b. OFFICE SYMBOL (If applicable): N/A
6c. ADDRESS (City, State and ZIP Code): 1101 W. Springfield Avenue, Urbana, Illinois 61801
7a. NAME OF MONITORING ORGANIZATION: Joint Services Electronics Program
7b. ADDRESS (City, State and ZIP Code): Office of Naval Research, 800 Quincy Street, Arlington, VA 22217
8a. NAME OF FUNDING/SPONSORING ORGANIZATION: Joint Services Electronics Program
8b. OFFICE SYMBOL (If applicable): N/A
8c. ADDRESS (City, State and ZIP Code): Office of Naval Research, 800 Quincy Street, Arlington, VA 22217
9. PROCUREMENT INSTRUMENT IDENTIFICATION NUMBER: N00014-84-C-0149
10. SOURCE OF FUNDING NOS.: PROGRAM ELEMENT NO. N/A, PROJECT NO. N/A, TASK NO. N/A, WORK UNIT NO. N/A
11. TITLE (Include Security Classification): THE EFFECT OF SYSTEM WORKLOAD ON ERROR LATENCY
12. PERSONAL AUTHOR(S): RAM CHILLAREGE, RAVISHANKAR K. IYER
13a. TYPE OF REPORT: Technical
13b. TIME COVERED: FROM    TO
14. DATE OF REPORT (Yr., Mo., Day): January 1, 1985
15. PAGE COUNT: 25
16. SUPPLEMENTARY NOTATION: N/A
17. COSATI CODES: FIELD, GROUP, SUB. GR.
18. SUBJECT TERMS (Continue on reverse if necessary and identify by block number):
Workload Measurements, Error Latency Experiments, Fault Tolerance, Error Detection, Reliability
19. ABSTRACT (Continue on reverse if necessary and identify by block number)
In this paper we have established a methodology for determining and characterizing 
error latency. The method is based on real workload data gathered by an experiment 
instrumented on a VAX 11/780 during the normal workload cycle of the installation. 
This is the first attempt at jointly studying error latency and workload variations 
in a full production system. Distributions of error latency were generated by 
seeding errors under varying workload conditions. The family of error latency 
distributions so generated illustrates that error latency is not so much a function of 
when in time an error occurred but rather a function of the workload that followed 
the error. The study finds that the mean error latency varies by a ratio of 1 to 10 
(in hours) between high and low workloads. The method is general and can be applied to 
any system.
20. DISTRIBUTION/AVAILABILITY OF ABSTRACT: UNCLASSIFIED/UNLIMITED ☒  SAME AS RPT. □  DTIC USERS □
21. ABSTRACT SECURITY CLASSIFICATION: Unclassified
22a. NAME OF RESPONSIBLE INDIVIDUAL
22b. TELEPHONE NUMBER (Include Area Code)
22c. OFFICE SYMBOL
None
DD FORM 1473, 83 APR. EDITION OF 1 JAN 73 IS OBSOLETE. Unclassified
SECURITY CLASSIFICATION OF THIS PAGE
THE EFFECT OF SYSTEM WORKLOAD ON ERROR LATENCY: 
AN EXPERIMENTAL STUDY
Ram Chillarege and Ravishankar K. Iyer 
Computer Systems Group 
Coordinated Science Laboratory 
University of Illinois at Urbana-Champaign, 1985
In this report we have established a methodology for determining and characterizing error 
latency. The method is based on real workload data gathered by an experiment instrumented 
on a VAX 11/780 during the normal workload cycle of the installation. This is the first 
attempt at jointly studying error latency and workload variations in a full production system. 
Distributions of error latency were generated by seeding errors under varying workload 
conditions. The family of error latency distributions so generated illustrates that error 
latency is not so much a function of when in time an error occurred but rather a function of 
the workload that followed the error. The study finds that the mean error latency varies by a 
ratio of 1 to 10 (in hours) between high and low workloads. The method is general and can be 
applied to any system.
TABLE OF CONTENTS
1. Introduction
2. Instrumentation
3. Measurements and Analysis
3.1 Workload and Memory Usage Profiles
3.2 Estimating Error Latency
3.3 Algorithm Implementation
4. Error Latency Distributions
4.1 Discussion of Results
5. Concluding Remarks
Acknowledgment
References
1. INTRODUCTION
There is now considerable experimental evidence that computer reliability is a dynamic 
function of system activity (as measured by the workload). Experimental studies on a number 
of machines with varying configurations, [Butner 80], [Iyer 82a], [Iyer 82b] (on IBM 
machines) and [Castillo 80], [Castillo 81], [Castillo 82] (on DEC machines), provide evidence 
that CPU and memory failure rates increase exponentially as the system workload approaches 
saturation; i.e., there is a threshold beyond which an increase in workload results in a 
nonlinear increase in the error rate. Measurements show that the increase in the error rate 
can be more than two orders of magnitude. This is an important issue since it affects both the 
reliability and the performance of computer systems. It suggests that it is not useful to push 
a system close to its performance limits, since the gain in performance is more than offset by 
the loss in reliability as the end points are reached.
Two possible reasons for the observed workload/failure dependency are thought to be the 
discovery of latent errors [Iyer 82] and stresses imposed by high currents and voltages.¹ Many 
failures can only be detected when a particular module or subsystem is "exercised." Thus, 
although the failures may not be caused by increased utilization, they are "revealed" by it. 
The time between the occurrence of a failure and its manifestation as a system error has 
been referred to as "error latency" [Shedletsky 73]. The latent discovery effect can cause a 
noticeably high error rate during high-workload periods. A theoretical random walk model to 
describe this dependency is discussed in [Guther 80].
However, to date no measure of the relative contributions of the two effects discussed above 
exists. This is an important issue because an engineering solution depends on which of the two 
factors is predominant. For example, if latent discovery is the predominant effect, new 
research into workload scheduling and periodic scrubbing is suggested. If, however, this is 
not so, we have to consider new designs which take workload patterns into account (much the 
same way as instruction mixes).
¹ As the activity increases within a computer, the higher rates of exercise of the gates can 
lead to higher temperatures at the devices themselves due to switching-induced transients, 
resulting in a higher failure rate. A model based on this is given in [Cortes 84].
There is no accurate technique for determining error latency. The only available study is 
[McGough 81 & 83], which performed a gate-level emulation of an avionic miniprocessor. A 
set of specific programs was used to exercise the machine. These programs do not, however, 
represent a real workload environment. Therefore neither the methodology nor the results are 
generally applicable.
We propose a methodology to study the latency characteristics of medium to large computer 
systems. This is the first attempt at jointly studying error latency and workload variations 
in a full production environment. The technique is developed using real data from a 
VAX 11/780 timeshared system at the University of Illinois, Coordinated Science Laboratory. 
The system runs the UNIX operating system Ver 4.2, and is used mostly for scientific computing 
and for a variety of miscellaneous data processing activities. The measurements and 
analysis show a strong relationship between system workload and the error latency in the 
unpaged portion of memory. The mean error latency varies by nearly a 1 to 10 ratio between 
high and low workload. Distributions of error latency are generated by seeding errors under 
varying workload conditions. The resulting family of error latency distributions illustrates 
that latency is not so much a function of when in time an error occurred but rather a function 
of the workload that followed the error. The methodology is not system specific and has 
general applicability.
2. INSTRUMENTATION
For the purposes of this study we concentrated on memory activity. An important reason 
for this was that previous studies show that the memory subsystem has the largest number of 
errors [Rossetti 81] [Iyer 83].
The VAX 11/780 system has 4 MByte of main memory and three 300 MByte disk drives, and 
during peak hours has about 20 to 25 interactive users. The backplane of the VAX CPU was 
probed to obtain data on memory activity. This data was sampled by the instrumentation 
connected to the backplane and forms the basis for the analysis from which error latency is 
measured.
The VAX central processor and the memory subsystem are linked through a data path 
called the Synchronous Backplane Interconnect (SBI). Figure 1 shows the organization of the 
machine. The details of the machine organization are in [DEC 80a] and [DEC 80b]. The SBI is a 
parallel datapath which is multiplexed for address and data and uses a 200 nsec clock to 
achieve a maximum information transfer rate of 13.3 million bytes per sec.
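As a rough check on the quoted transfer rate, the sketch below assumes (this is not stated in the report) that the peak rate corresponds to 8-byte transfers that each occupy three 200 nsec SBI cycles: one command/address cycle followed by two 32-bit data cycles. The function and constant names are illustrative only.

```python
# Back-of-the-envelope check of the SBI peak transfer rate.
# ASSUMPTION (not from the report): the peak is reached with 8-byte transfers
# that occupy 3 bus cycles (1 command/address cycle + 2 data cycles).

SBI_CYCLE_NS = 200          # clock period quoted in the text
BYTES_PER_DATA_CYCLE = 4    # the B field is 32 bits wide


def peak_rate_bytes_per_sec(data_cycles: int = 2, overhead_cycles: int = 1) -> float:
    """Bytes per second when each transfer costs overhead + data cycles."""
    total_ns = (data_cycles + overhead_cycles) * SBI_CYCLE_NS
    payload_bytes = data_cycles * BYTES_PER_DATA_CYCLE
    return payload_bytes / (total_ns * 1e-9)


if __name__ == "__main__":
    print(f"{peak_rate_bytes_per_sec() / 1e6:.1f} million bytes/sec")
```

Under this assumption the result agrees with the 13.3 million bytes per sec figure; with no address overhead the same arithmetic would give 20 million bytes per sec.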
The best approach for obtaining memory activity information is to monitor the SBI, 
through which all transactions occur. High-speed devices such as disks connect to the SBI 
through an interface called the Massbus Adapter. Unibus devices [DEC 80a] similarly connect 
to the SBI through a Unibus Adapter. Requests to memory can arise from either the CPU or 
the I/O devices, and all of them are transacted through the SBI. Monitoring the SBI therefore 
captures all requests to the memory subsystem. The address space on the SBI is partitioned 
so that addresses to the main memory subsystem, Unibus subsystem or other adapters are 
unique, permitting them to be individually extracted.
[Figure 1. VAX 11/780 System Configuration]
The SBI consists of 84 signal lines that belong to five different groups, namely, arbitration, 
information transfer, response, interrupt and control. The information transfer group, with 46 
signal lines, contains the memory activity information. It is used to transfer addresses, data and 
interrupt summary information. This group is subdivided into five fields that represent parity 
check (P), information tag (TAG), source or destination identification (ID), masks (MASK), and 
32 bits of information lines (B), as in Figure 2.
Field:          P    TAG    ID    MASK    FUNC    ADDRESS
Field length:   2    3      5     4       4       28
Figure 2. SBI Information Transfer group fields
(1) P field: The parity field of 2 bits provides even parity for detecting single bit errors in the 
information transfer group. One of the bits provides parity over the TAG, ID and MASK 
fields and the other over the B field.
(2) TAG field: The TAG field is 3 bits wide and indicates the information type (being 
transmitted) on the information lines (B field). This field also determines the interpretation 
of the ID and the B fields. For example, when the tag code represents COMMAND 
ADDRESS, the B field contains the address.
(3) ID field: The ID field of 5 bits is used to identify the logical source of the data in a write 
command and the logical destination of the data in a read command. The address of the 
location is contained in the B field.
(4) MASK field: The mask field is 4 bits wide and is used to specify operations on any or all 
bytes of the data in the B field. Each bit in the mask field corresponds to a particular byte 
in B.
(5) B field: The B field is 32 bits wide (4 bytes) and is used to carry information/data. 
Depending on the TAG field the 32 bits are interpreted either as one data field of 32 bits 
or as containing two subfields: a FUNC field of 4 bits which identifies read or write mode 
and an ADDRESS field of 28 bits containing the physical address which can be either 
main memory or I/O.
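The field layout above can be sketched as a small decoder. The field widths are those given in the text; the bit ordering (P in the most significant bits, B in the least significant, FUNC in the upper 4 bits of B) is an assumption made for illustration, as are the function names and any sample values.

```python
from typing import Dict

# Field widths from the text, 46 bits in total. The ORDER (P in the most
# significant bits, B in the least significant) is an assumption.
FIELD_WIDTHS = [("P", 2), ("TAG", 3), ("ID", 5), ("MASK", 4), ("B", 32)]


def split_fields(word: int) -> Dict[str, int]:
    """Split a 46-bit information-transfer word into its five fields."""
    total = sum(width for _, width in FIELD_WIDTHS)
    if not 0 <= word < (1 << total):
        raise ValueError("expected a 46-bit value")
    fields = {}
    shift = total
    for name, width in FIELD_WIDTHS:
        shift -= width
        fields[name] = (word >> shift) & ((1 << width) - 1)
    return fields


def split_command_address(b: int) -> Dict[str, int]:
    """Interpret B as FUNC (upper 4 bits, assumed) and ADDRESS (lower 28 bits)."""
    return {"FUNC": (b >> 28) & 0xF, "ADDRESS": b & 0x0FFFFFFF}


if __name__ == "__main__":
    # Hypothetical word built from made-up field values.
    word = (1 << 44) | (3 << 41) | (7 << 36) | (0xF << 32) | 0x10001234
    fields = split_fields(word)
    print(fields, split_command_address(fields["B"]))
```

Whether B should be split into FUNC and ADDRESS depends on the TAG code; the report does not list the code values, so the caller decides when to apply `split_command_address`.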
2.1. Experimental Setup
A Tektronix Digital Analysis System (DAS) 9100 Series was used to monitor and sample 
data transfer activity on the SBI. The DAS probes can strobe data at intervals as short as 40 
nsec, which is faster than the clock period of the SBI (200 nsec). The SBI signals are accessible 
at the card edge of the SBI control cards [DEC c]. The data was read into the DAS using the SBI 
clock for external synchronization.
The experiment was controlled from the VAX² with the aid of Tektronix 91DW1 
software [TEK 84] and some of our own programs. The software which controlled the experiment 
was designed so that it caused negligible overhead and did not bias the experiment.³ The DAS 
was periodically triggered to acquire data from the SBI, download the acquisition memory and 
time-stamp the data. This data was then preprocessed to make it compatible for subsequent 
input into statistical analysis programs and archived on tapes.
The instrumentation was tested for correct operation and acquisition. This was performed 
by taking the system down into single-user operation, turning the cache memory off and 
running a test program that accesses specific locations of memory in sequence. Data collected 
on the DAS was then examined for correct acquisition against the known test program.
Two types of data are collected. The first involves logging the transactions of every cycle on 
the SBI. We refer to this as Regular Mode. The DAS has a memory of 511 words, and therefore 
each sample contains the trace of 511 cycles on the SBI. Note that not every cycle of the 
SBI contains a command/address, as some cycles may contain data and some may be idle. A 
sample of data acquired in Regular Mode is shown in Figure 3a. The first line contains a time 
stamp for the sample. Each line of data represents one SBI cycle. The data is in one's 
complement form. Note that there are a number of idle cycles (all 1's). Figure 3b shows the 
decoded version of a single observation.
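A minimal sketch of the first preprocessing step this implies: undoing the one's complement and flagging idle cycles. It assumes each acquired cycle is available as a list of bit strings, one per DAS probe; that representation, the function names and the sample rows are hypothetical.

```python
def invert_bits(bits: str) -> str:
    """Return the one's complement of a string of '0'/'1' characters."""
    if set(bits) - {"0", "1"}:
        raise ValueError("expected only '0' and '1' characters")
    return "".join("1" if b == "0" else "0" for b in bits)


def is_idle(raw_probe_bytes) -> bool:
    """A raw (still complemented) cycle is idle when every acquired bit is 1."""
    return all(set(byte) == {"1"} for byte in raw_probe_bytes)


if __name__ == "__main__":
    # Made-up rows in the raw, complemented form described in the text.
    raw_cycles = [
        ["11111111", "11111111"],   # idle cycle: all 1's
        ["10100101", "10010101"],   # a cycle carrying information
    ]
    for cycle in raw_cycles:
        if is_idle(cycle):
            print("idle")
        else:
            print([invert_bits(byte) for byte in cycle])
```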
In certain types of experiments, such as the latency study, it is advantageous to produce a 
dense trace of addresses by storing only those cycles that contain a command/address transmitted 
on the B lines. Toward this end a compressed mode acquisition was performed by collecting only 
those signals which transmitted a memory address. This is accomplished by generating a "store 
only" signal to the DAS by decoding the TAG field, which contains a code identifying a 
command/address transaction. In the Compressed Mode the DAS acquires 511 consecutive 
addresses that are generated on the SBI. Figure 3c shows a sample in the Compressed Mode.
² The DAS is programmable via an IEEE-488 interface, or alternatively via an RS232c serial 
line link to a host machine. We have it connected to the VAX via the serial line interface.
³ The acquisition system has been tested for data bias against itself. This is done by 
externally triggering the DAS, acquiring the data, and comparing the memory usage distributions 
generated by this data with the distributions generated from automatically generated data. We 
find that the instrumentation is sound and does not indicate any significant influence of 
self-bias.
3. MEASUREMENT AND ANALYSIS
Recall that the experiment collects data on memory activity, e.g. physical memory 
address, access rate and read/write mode. For the purposes of this project we studied the region 
in memory where the OS resides. We looked at this region because errors in the operating 
system can be fatal. This region also has the advantage of being the unpaged portion of memory, 
and hence provides an estimate of inherent latency characteristics unaffected by paging. The 
methodology, however, is equally valid for both the unpaged and paged portions of the memory.
The data acquisition was performed at intervals of 25 to 40 seconds per sample. This was 
found to be sufficiently frequent to provide an estimate of memory access patterns and usage. 
For this experiment, we acquired data for a number of different usage periods which include 
the varying workload environments. On this base, we implemented a methodology for 
accurately estimating error latency characteristics for different usage regions and workloads. 
In particular, the effect of increasing the workload on error latency characteristics was also 
determined.
3.1. Workload and Memory Usage Profiles
Recall that we were interested in estimating the errors discovered during the different 
workload environments and also the effect of a changing workload on error latency. Therefore 
it was important to have an idea of possible variations in system workload. Figure 4 shows two 
workload parameters, CPU user usage and CPU system usage, as a function of the time of day. 
CPU user usage represents the percent time spent by the CPU in executing user code and CPU 
system usage corresponds to the percent time spent executing system code. The data was 
collected by a modified Unix system utility.
[Figure 3a. Acquired data in Regular Mode: sample time-stamped Fri Nov 23 11:28:59 CST 1984, one row per SBI cycle (observations 15 to 21) across DAS probes 3D, 3C, 3B, 3A, 2C, 2B, 2A, with rows commented as idle or cmd/adr]
[Figure 3b. Decoding of the data fields: the bits of probes 3D to 2A mapped onto the B (FUNC, ADDRESS), CNF, MASK, P, TAG and ID fields]
[Figure 3c. Acquired data in Compressed Mode: sample time-stamped Fri Nov 23 12:15:11 CST 1984, observations 136 to 140, all commented as cmd/adr]
These load measures are cyclic over a 24-hour period and form a stationary pattern for 
the weekdays. This pattern of workload is typical of many computer installations [Iyer 82] 
[Castillo 81]. Between 8 am and 10 am there is a sharp increase in workload, as is noticeable in 
Figure 4. This period of the day is of particular interest for determining how latency 
characteristics change with increasing workload. During the peak periods the machine has about 
20 to 25 interactive users. The workload is mostly interactive but there is a certain amount of 
batch processing.
An analysis of the memory access data showed two distinct patterns in the unpaged area. 
A 300 byte region with very high usage is shown in Figure 5a and the rest is shown in Figure 
5b. These are shown separately since, plotted together, the high usage region totally dominates 
the rest. The high usage region, which is most likely the kernel, accounts for approximately 70% 
of the references made to system memory. In each of these plots it was noticed that a sampling 
of 20 to 30 minutes was sufficient to obtain a stable pattern. Interestingly enough, the 
distribution shapes did not change appreciably with workload, i.e. the relative access 
probabilities remained the same although the access rates changed significantly.
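The distinction between the shape of the usage distribution and the access rate can be illustrated with a small sketch. It assumes the sampled trace is simply a list of physical addresses; the region size, function names and sample data are hypothetical.

```python
from collections import Counter


def relative_access_probabilities(addresses, region_size):
    """Fraction of references falling in each region of `region_size` bytes."""
    counts = Counter(addr // region_size for addr in addresses)
    total = sum(counts.values())
    return {region: n / total for region, n in sorted(counts.items())}


def access_rate(addresses, duration_seconds):
    """References per second over the observation period."""
    return len(addresses) / duration_seconds


if __name__ == "__main__":
    low = [0, 10, 20, 300]     # made-up addresses from a low-workload sample
    high = low * 5             # same pattern, five times as many references
    print(relative_access_probabilities(low, 300), access_rate(low, 60))
    print(relative_access_probabilities(high, 300), access_rate(high, 60))
```

In this toy data the two samples give identical relative probabilities while the rate differs by a factor of five, which is the behavior the paragraph above describes.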
3.2. Estimating Error Latency
For the purpose of this study we assume that the error rate is constant and can equally 
affect all memory locations. This implies that the workload does not cause any additional 
errors. It allows us to determine the intrinsic latency characteristics without being biased by 
the workload distributions. Another way to look at this is that we find the system error rate 
(distribution) contributed only by the latent discovery effect.
[Figure 4. System workload: percent usage (CPU user usage and CPU system usage) versus time of day]
[Figure 5a. Memory usage of the high usage region: percentage bar chart by address midpoint]
[Figure 5b. Memory usage of the rest: percentage bar chart by address midpoint]
The memory addresses chosen to contain an error are picked from a uniform random 
distribution. However, if one intends to study the error latency under the assumption that 
increased memory usage increases the probability of error [Cortes 84], then the algorithm can be 
easily modified. The system memory has error detection and correction codes which correct a 
single error and detect up to two errors. An error is said to be discovered when the faulty 
memory location is read.
Figure 6 shows the algorithm used to generate error latency distributions. We commence 
by picking a random address of a memory location, say $m_1$, and assume that it has an error, 
say $e_1$. This is termed seeding an error. The error also has a time associated with it, say 
$t_1$. The data is now scanned to find the first memory read to the location $m_1$. This is when 
the error would be detected by the ECC circuitry. In Figure 6 location $m_1$ has three memory 
reads to it, one before and two after the error $e_1$. Let $t'_1$ be the time of the first memory 
read after the error. Then the latency of error $e_1$ is $l_1 = t'_1 - t_1$. The same location may 
be reaccessed (as in the figure) but that amounts to rediscovery and is not a part of this latency 
study. If, however, the data set does not contain a read to the memory location in error then 
the error goes undetected and amounts to a miss. For example, in Figure 6 error $e_2$ is seeded 
on memory location $m_2$. However, the data does not contain a read to $m_2$ and therefore $e_2$ 
is undiscovered. This process can be repeated for a large number of errors, yielding a latency 
time $l_i$ for every random error $e_i$ chosen. These different latency times now form a 
distribution of the error latency. The misses generated by this process are used to determine 
the percentage of undiscovered errors.
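The seeding procedure described above can be sketched as follows. This is an illustrative reading of the text, not the authors' program: the trace format (time, location, is_read), the function names and the use of Python's `random` module are all assumptions.

```python
import random
from bisect import bisect_right
from collections import defaultdict


def build_read_index(trace):
    """Map each location to the sorted list of times at which it was read.

    `trace` is an iterable of (time, location, is_read) tuples (assumed format).
    """
    reads = defaultdict(list)
    for time, location, is_read in trace:
        if is_read:
            reads[location].append(time)
    for times in reads.values():
        times.sort()
    return reads


def latency(reads, location, seed_time):
    """Time from the seeded error to the first later read, or None for a miss."""
    times = reads.get(location, [])
    i = bisect_right(times, seed_time)   # index of first read strictly after the error
    return times[i] - seed_time if i < len(times) else None


def seed_errors(reads, locations, window, n_errors, rng):
    """Seed errors at uniformly random locations and times within `window`.

    Returns (list of latencies of discovered errors, fraction of misses).
    """
    start, end = window
    latencies, misses = [], 0
    for _ in range(n_errors):
        loc = rng.choice(locations)
        t = rng.uniform(start, end)
        found = latency(reads, loc, t)
        if found is None:
            misses += 1
        else:
            latencies.append(found)
    return latencies, misses / n_errors


if __name__ == "__main__":
    # Toy trace in the spirit of Figure 6: m1 is read once before and twice
    # after the seeding time; m2 is never read.
    trace = [(5, "m1", True), (20, "m1", True), (30, "m1", True), (25, "m2", False)]
    reads = build_read_index(trace)
    print(latency(reads, "m1", 10), latency(reads, "m2", 12))
    print(seed_errors(reads, ["m1", "m2"], (8, 12), 1000, random.Random(0)))
```

Later reads of the same location are ignored, which corresponds to the rediscovery case in the text, and a location with no later read counts as a miss.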
3.3. Algorithm Implementation
Error seeding was the means by which workload effects on latency were studied. The error 
seeding time was varied over the entire usage period to investigate the change in latency with 
workload. To be able to determine the first time that the location is accessed, the data should 
be available for a few hours past the error instant. Rather than seed all the errors at the same 
time, we seeded the errors in a 1 hour interval. The interval was chosen so that the workload 
did not change appreciably over this period. This interval is referred to as the error seeding 
window. Different workload environments can then be studied by choosing the error seeding 
window to lie within the workload period of interest.
[Figure 6. Latency time calculation: seeded errors shown against the reads on location m1 and on location m2]
To keep the computation tractable we define a class as a set of neighboring memory 
locations, where an access to any member of the class is considered equivalent to an access to 
the whole class. The class serves two purposes. Firstly, it reduces the number of memory 
locations that one has to compute on. The number of memory locations is very large and the 
implementation of the above algorithm would necessitate a lot of computation. Secondly, it 
caters for the fact that the sampled data, although representative of the memory access pattern, 
need not contain every distinct address that is generated. The class size is chosen small enough 
so that the memory usage within a class is uniform (i.e. each location in a class has similar 
access probability) and large enough so that the computation is still tractable. We experimented 
with class sizes varying from 1/a page to 3 pages and found that this did not appreciably change 
the distribution; the median shifted by less than 5%. For a fixed class size, the percentage of 
misses remained almost constant.
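A minimal sketch of the class idea: addresses are collapsed to class indices before the latency computation, and the sensitivity to class size is judged by the relative shift of the median. The 512-byte page size used in the example, the trace format and the function names are assumptions for illustration.

```python
from statistics import median

PAGE_BYTES = 512  # ASSUMPTION: page size used only for this example


def class_of(address: int, class_size_bytes: int) -> int:
    """Index of the class (block of neighboring locations) holding `address`."""
    return address // class_size_bytes


def to_class_trace(trace, class_size_bytes):
    """Replace each address in (time, address, is_read) by its class index."""
    return [(t, class_of(addr, class_size_bytes), is_read)
            for t, addr, is_read in trace]


def median_shift_percent(latencies_a, latencies_b):
    """Relative shift of the median of b with respect to a, in percent."""
    base = median(latencies_a)
    return abs(median(latencies_b) - base) / base * 100.0


if __name__ == "__main__":
    trace = [(1.0, 4608, True), (2.0, 5119, True), (3.0, 5120, False)]
    print(to_class_trace(trace, PAGE_BYTES))        # one-page classes
    print(to_class_trace(trace, 3 * PAGE_BYTES))    # three-page classes
    print(median_shift_percent([10, 20, 30], [11, 21, 31]))
```

With classes in place, an error seeded anywhere in a class is treated as discovered by the first later read to any address of that class.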
4. ERROR LATENCY DISTRIBUTIONS
We wished to determine error latency characteristics for low, high and intermediate 
workload periods. The movement of the error seeding window was the vehicle by which 
latency characteristics were investigated. The error seeding window was shifted across the time 
of day axis to obtain a family of distributions. The moving of the window provides an insight 
into the nature of the error latency distribution and isolates the effect of workload. The change 
in these distributions then reflects the effect of changing workload on error latency. We keep 
moving the window until a cyclic behavior is seen in the distributions (which is over a 24 hour 
period in our data).
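The moving window can be sketched as a loop over window start times, producing one latency list per window. The data layout (a dictionary from location to sorted read times in seconds), the parameter defaults and the synthetic example are assumptions, not the report's implementation.

```python
import random
from bisect import bisect_right
from statistics import mean

DAY = 24 * 3600  # seconds


def first_read_after(read_times, t):
    """First read time strictly after t in a sorted list, or None."""
    i = bisect_right(read_times, t)
    return read_times[i] if i < len(read_times) else None


def family_of_distributions(reads, window_len=3600, step=3600, n_errors=200, seed=0):
    """Shift the seeding window across one day; return {window_start: latencies}.

    `reads` maps location -> sorted read times in seconds (assumed layout) and
    should extend past the last window so late errors can still be discovered.
    """
    rng = random.Random(seed)
    locations = sorted(reads)
    family = {}
    for start in range(0, DAY, step):
        latencies = []
        for _ in range(n_errors):
            loc = rng.choice(locations)
            t = rng.uniform(start, start + window_len)
            hit = first_read_after(reads[loc], t)
            if hit is not None:              # misses are simply skipped here
                latencies.append(hit - t)
        family[start] = latencies
    return family


if __name__ == "__main__":
    # Synthetic data: a single location read at 09:00 on two consecutive days.
    reads = {"k": [9 * 3600, 33 * 3600]}
    family = family_of_distributions(reads, n_errors=50)
    for start in (0, 4 * 3600, 8 * 3600):
        print(start // 3600, round(mean(family[start]) / 3600, 2), "hours")
```

In the synthetic example the mean latency shrinks as the window approaches the 09:00 read, mirroring the way the distributions in the report shift as the window moves toward the high workload period.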
To determine the consequences of workload variations we consider a large span of time 
which includes the low and high workload periods. Recall from Figure 4 that midnight to 7 am 
is the low workload period, and that 8 to 10 am has an increasing (intermediate) workload with 
a peak around 11. The intermediate period, where the workload changes from low to high, is of 
particular interest. Since the system workload forms a cyclic pattern every day, we expect a 
24 hour observation of error seeding to be sufficient. This allows studying of both low and high 
workload periods. As the workload is stable for a large length of time, the error seeding time 
will not appreciably change the distribution so long as the measurement period is sufficiently 
larger than the latency period.
Figure 7 shows the latency distribution generated with a one hour error seeding window 
starting at midnight. The distribution is bi-modal with the second mode being the larger of the 
two. The initial peak corresponds to a small period of high activity which usually occurs 
around midnight. Within the first hour about 10% of the detected errors are found. The bulk 
of the errors (70%) are found in the second mode. There is a sharp increase in the number of 
errors being detected from about 8 hours after the seeding. This corresponds to 8 am (real time 
of day), which is the start of the increasing workload period. The mean latency is 8:03 (h:mm). 
The percentages listed in the figure correspond to detected errors only. Nearly 25% of the 
errors that were seeded went undetected. This miss figure of 25% is almost constant for all 
the following latency distributions.
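The distribution parameters quoted with the figures (mean, standard deviation, coefficient of variation, quartiles) can be computed from a list of latencies along the following lines. The helper names and the sample latencies are hypothetical; the last lines recompute the coefficient of variation from the Figure 7 values, taking the mean as 8:03:28 and the standard deviation as 4:01:19.

```python
from statistics import mean, quantiles, stdev


def hms(seconds: float) -> str:
    """Format a duration in seconds as h:mm:ss."""
    sign = "-" if seconds < 0 else ""
    s = int(round(abs(seconds)))
    return f"{sign}{s // 3600}:{s % 3600 // 60:02d}:{s % 60:02d}"


def parse_hms(text: str) -> int:
    """Parse h:mm:ss into seconds."""
    h, m, s = (int(part) for part in text.split(":"))
    return h * 3600 + m * 60 + s


def summarize(latencies):
    """Mean, standard deviation, CV (percent) and quartiles of the latencies."""
    q1, med, q3 = quantiles(latencies, n=4)
    mu, sd = mean(latencies), stdev(latencies)
    return {"mean": hms(mu), "std": hms(sd), "cv": 100.0 * sd / mu,
            "q1": hms(q1), "median": hms(med), "q3": hms(q3)}


if __name__ == "__main__":
    print(summarize([60, 120, 180, 240]))            # made-up latencies in seconds
    cv = 100.0 * parse_hms("4:01:19") / parse_hms("8:03:28")
    print(f"CV from Figure 7 mean and std dev: {cv:.2f}")
```

The recomputed coefficient of variation comes out at about 49.9, in line with the CV listed in Figure 7, which suggests that CV there is the standard deviation expressed as a percentage of the mean.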
Figure 8 shows a similar latency distribution but with the error seeding window shifted 
in time towards the increasing workload. In Figure 8 the error seeding window is between 4 
and 5 am. Notice that the two modes have begun coming closer. It is interesting to note that 
the movement of the window forward in time has caused a decrease in the mean latency. This 
is caused by the movement of the second increase in error discovery (the beginning of the second 
mode) closer and closer to the seeding time (shorter latency). The time of the sharp increase 
corresponds to 8 am, when the workload starts to increase. Also note that there is a dip in the 
distribution after the mode. This corresponds to a lull in the activity that occurs around 10:30 
or so on our system.⁴ This clearly shows the influence of workload in determining error latency.
⁴ In our system, which is largely used by graduate students, their day starts at around 10:30 am 
and continues past 12:30 pm or so. The early morning (8 am) rise in activity is due to 
secretarial and staff users.
Figure 9 has errors seeded at 8 am (start of the rise in workload) and Figure 10 has errors 
seeded at 12 pm (well into the high workload period). In contrast to Figure 7 it is seen that the 
mean error latency is now (Fig 10) down to 44 mins and that 70% of the detected errors are
[Figure 7: Error latency distribution, error seeding window from 00:00:00 to 01:00:00. Frequency bar chart of latency midpoint against frequency, with percent and cumulative percent columns.]
Distribution parameters (h:mm:ss):
  Mean       8:03:28
  Std Dev    4:01:19
  Skewness   -0.690
  CV         49.917
  Kurtosis   -0.844
  75% Q3     11:18:15
  50% Med    9:18:43
  25% Q1     4:26:33
  Mode       (illegible)
[Figure 8: Error latency distribution, error seeding window from 04:00:00 to 05:00:00. Frequency bar chart of latency midpoint against frequency, with percent and cumulative percent columns.]
Distribution parameters (h:mm:ss):
  Mean       5:54:56
  Std Dev    2:19:59
  Skewness   -0.822
  CV         39.442
  Kurtosis   0.185
  75% Q3     7:41:15
  50% Med    6:07:25
  25% Q1     4:52:34
  Mode       5:09:37
[Figure 9: Error latency distribution, error seeding window from 08:00:00 to 09:00:00. Frequency bar chart of latency midpoint against frequency, with percent and cumulative percent columns.]
Distribution parameters (h:mm:ss):
  Mean       2:40:42
  Std Dev    1:26:31
  Skewness   0.110
  CV         53.839
  Kurtosis   -1.073
  75% Q3     3:50:19
  50% Med    2:43:16
  25% Q1     1:24:36
  Mode       1:30:13
[Figure 10: Error latency distribution, error seeding window from 12:00:00 to 13:00:00. Frequency bar chart of latency midpoint against frequency, with percent and cumulative percent columns.]
Distribution parameters (h:mm:ss):
  Mean       0:44:26
  Std Dev    0:29:19
  Skewness   0.243
  CV         65.994
  Kurtosis   -0.985
  75% Q3     1:07:11
  50% Med    0:43:57
  25% Q1     0:17:50
  Mode       (illegible)
discovered in the first hour. Thus errors occurring at low workload are discovered within 10 
hours (on the average) versus only 1 hour for errors occurring at high workload.
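As a quick arithmetic check on this comparison, the two mean latencies reported in the distribution parameter boxes of Figure 7 (8:03:28, seeding at low workload) and Figure 10 (0:44:26, seeding at high workload) can be compared directly. The helper `to_seconds` is a hypothetical convenience function introduced here for illustration only.

```python
def to_seconds(hms):
    """Convert an 'h:mm:ss' string to seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


mean_low_workload = to_seconds("8:03:28")   # Figure 7, seeding window 00:00 to 01:00
mean_high_workload = to_seconds("0:44:26")  # Figure 10, seeding window 12:00 to 13:00

# Ratio of mean latencies between the low and high workload seeding windows.
ratio = mean_low_workload / mean_high_workload
print(f"mean latency ratio (low/high workload): {ratio:.1f}")
```

The ratio comes out at a little under 11, which is in line with the roughly 10 to 1 difference described in the text.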
4.1. Discussion of Results
The family of distributions generated by moving the error seeding window reveals a 
number of interesting issues.
It is seen that error latency is not so much a function of when in time the error occurred 
but rather a function of the workload that followed the occurrence of the error. A mode in 
the error latency distribution always occurs when there is a rise in system workload. Thus any 
rise in the workload results in a mode appearing in the latency distribution. A steady rise in 
workload sweeps the errors out (higher error discovery), after which few if any errors remain to be 
discovered (low error discovery). The results suggest that an increase in workload causes a 
temporary increase in the observed error rate. The error rate drops again after the errors have been 
discovered.
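The sweeping effect described above can be illustrated with a toy model. This is not the model or the measurement procedure of this report; it rests on the assumption, made here purely for illustration, that in each hour a latent error is detected with probability proportional to a normalized workload. The workload profile, the constant `k`, and the function name `sweep` are all hypothetical.

```python
def sweep(workload, initial_errors=1000.0, k=0.5):
    """Expected detections per hour under an assumed detection model.

    Assumption (illustrative only): each hour, every remaining latent error
    is detected with probability k * w, where w is the workload in [0, 1].
    """
    remaining = initial_errors
    detected = []
    for w in workload:
        found = remaining * min(1.0, k * w)
        detected.append(found)
        remaining -= found
    return detected, remaining


# Hypothetical hourly workload: 4 quiet hours, a steady rise, then a plateau.
workload = [0.05] * 4 + [0.2, 0.4, 0.6, 0.8, 1.0] + [1.0] * 5
detected, remaining = sweep(workload)

# Hour with the largest expected number of detections.
peak_hour = max(range(len(detected)), key=detected.__getitem__)
print("peak detection hour:", peak_hour)
print("detections in final hour:", round(detected[-1], 2))
```

In this sketch the detections peak while the workload is still rising and then fall off even though the workload stays at its maximum, because the pool of latent errors has been depleted.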
These results, apart from providing insight into the latency behavior, also suggest that the 
safest period for computing (assuming only latency effects) is at the tail end of an increasing 
workload, since errors that have occurred in the past would have been discovered and the 
error rate will be nearly constant. The SLAC [Iyer 82b] and CMU [Castillo 81] results, however, 
show a sustained increase in the error rate as workload increases. Thus we have fresh evidence 
to suggest that there is more to the workload/error relationship than simply a latent discovery 
effect. This work, however, serves to isolate the impact of error latency, which was not possible 
so far.
5. CONCLUDING REMARKS
In this report we established a methodology for determining and characterizing error 
latency. The method is based on real workload data gathered by an experiment instrumented 
on a VAX 11/780 during the normal workload cycle of the installation. Distributions of error 
latency were generated by seeding errors under varying workload conditions. The family of 
error latency distributions obtained illustrates that error latency is not so much a function of 
when in time an error occurred but rather a function of the workload that followed the error. 
The study finds that the mean error latency varies by a factor of 10 to 1 between low and high 
workloads. The methodology is not restricted to the system studied and has general applicability.
ACKNOWLEDGMENT
We thank Professor E. S. Davidson for valuable discussions during the course of this 
work; G. D. McNiven, R. Berry and S. Laha for their assistance in instrumenting the acquisition 
system on the VAX.
This work was supported in part by the Joint Services Electronics Program (U.S. Army, 
U.S. Navy and the U.S. Air Force) under contract N00014-84-C-0149 and in part by the 
Graduate Research Board, University of Illinois, Urbana-Champaign.
REFERENCES
[Butner 80]
S. E. Butner and R. K. Iyer, "A Statistical Study of Reliability and System Load at 
SLAC," Digest, Tenth International Symposium on Fault Tolerant Computing, 
Kyoto, Japan, Oct 1980.
[Castillo 80]
X. Castillo and D. P. Siewiorek, "A Performance Reliability Model for Computing 
Systems," Digest, Tenth International Symposium on Fault Tolerant Computing, 
Kyoto, Japan, Oct 1980.
[Castillo 81]
X. Castillo and D. P. Siewiorek, "Workload, Performance and Reliability of Digital 
Computing Systems," Digest, Eleventh International Symposium on Fault-Tolerant 
Computing, Portland, Maine, June 1981, pp. 84-89.
[Cortes 84]
Mario L. Cortes and R. K. Iyer, "Device Failures and System Activity: A Thermal 
Effects Model," Digest, Fourteenth Inter. Symposium on Fault-Tolerant Computing, 
Orlando, Florida, June 1984.
[DEC 80a]
Digital Equipment Corporation, VAX Architecture Handbook, Digital Equipment 
Corporation, Maynard, MA, 1980.
[DEC 80b]
Digital Equipment Corporation, VAX Architecture Handbook, Digital Equipment 
Corporation, Maynard, MA, 1980.
[DEC 80c]
Digital Equipment Corporation, KA780 Field Maintenance Print Set, Digital 
Equipment Corporation, Maynard, MA, 1980.
[Gunther 80]
N. L. Gunther and W. C. Carter, "Remarks on the Prob. of detecting faults," Digest, 
10th International Symposium on Fault-Tolerant Computing, Kyoto, Japan, Oct 
1980.
[Iyer 82a]
R. K. Iyer, S. E. Butner and E. J. McCluskey, "A Statistical Failure/Load Relationship: 
Results of a Multi-Computer Study," IEEE Transactions on Computers, July 1982.
[Iyer 82b]
R. K. Iyer and D. J. Rossetti, "A Statistical Load Dependency of CPU Errors at SLAC," 
Digest, 12th International Symposium on Fault Tolerant Computing, Santa Monica, 
California, June 1982.
[Iyer 83]
R. K. Iyer and David J. Rossetti, "Permanent CPU Errors and System Activity: 
Measurement and Modeling," Digest, Real-Time Systems Symposium, Arlington, Virginia,
[McGough 81]
John G. McGough and Fred L. Swern, "Measurement of Fault Latency in a Digital 
Avionic Mini Processor," NASA Contractor Report 3462, Oct 1981.
[McGough 83]
John G. McGough and Fred L. Swern, "Measurement of Fault Latency in a Digital 
Avionic Mini Processor Part II," NASA Contractor Report 3651, Jan 1983.
[Rossetti 81]
D. J. Rossetti and R. K. Iyer, "A Software System for Reliability and Workload 
Analysis," CRC Tech Rpt 81-18, Center for Reliable Computing, Computer Systems 
Laboratory, Stanford Univ, Stanford, CA, Dec 1981.
[Shedletsky 73]
J. J. Shedletsky and E. J. McCluskey, "The Error Latency of a Fault in a 
Combinational Circuit," Digest FTCS-3, June 1973.
[TEK 84]
Tektronix, Users Manual 91DW1 For VAX/UNIX 4.1bsd Release 1, 1984, 
Tektronix, Oregon, USA.
