Fault-free behavior of reliable multiprocessor systems:  FTMP experiments in AIRLAB by Siewiorek, D. et al.
NASA Contractor Report 177967 
FAULT-FREE BEHAVIOR OF RELIABLE 
MULTIPROCESSOR SYSTEMS: FTMP 
EXPERIMENTS IN AIRLAB 
Ed Clune, Zary Segall, and 
Daniel Siewiorek 
CARNEGIE-MELLON UNIVERSITY 
Pittsburgh, Pennsylvania 
NASA-CR-177967 
19860001381 
Grant NAG1-190
August 1985
NASA
National Aeronautics and 
Space Administration 
Langley Research Center
Hampton, Virginia 23665 
https://ntrs.nasa.gov/search.jsp?R=19860001381 2020-03-20T16:05:56+00:00Z
1. Report No.: NASA CR-177967
2. Government Accession No.:
3. Recipient's Catalog No.:
4. Title and Subtitle: Fault-Free Behavior of Reliable Multiprocessor Systems: FTMP Experiments in AIRLAB
5. Report Date: August 1985
6. Performing Organization Code:
7. Author(s): Ed Clune, Zary Segall, and Daniel Siewiorek
8. Performing Organization Report No.:
9. Performing Organization Name and Address: Carnegie-Mellon University, Pittsburgh, PA
10. Work Unit No.:
11. Contract or Grant No.: NAG1-190
12. Sponsoring Agency Name and Address: National Aeronautics and Space Administration, Washington, DC 20546
13. Type of Report and Period Covered: Contractor Report
14. Sponsoring Agency Code: 505-34-13-32
15. Supplementary Notes: Langley Technical Monitor: George B. Finelli
16. Abstract
This report describes a set of experiments which were implemented on the Fault
Tolerant Multi-Processor (FTMP) at NASA/Langley's AIRLAB facility. These
experiments are part of an effort to formulate and evaluate validation
methodologies for fault-tolerant computers. This report deals with the
measurement of single parameters (baselines) of a fault free system.
The initial set of baseline experiments led to the following conclusions:
1. The system clock is constant and independent of workload in the tested cases;
2. The instruction execution times are constant;
3. The R4 frame size is 40 ms with some variation;
4. The frame stretching mechanism has some flaws in its implementation that
allow the possibility of an infinite stretching of frame duration.
Future experiments are planned. Some will broaden the results of these initial
experiments. Others will measure the system more dynamically. The
implementation of a synthetic workload generation mechanism for FTMP is planned
to enhance the experimental environment of the system.
17. Key Words (Suggested by Author(s)): Validation; Fault-free; Fault-tolerant; Multiprocessors; Performance measurement; Synthetic workload
18. Distribution Statement: Unclassified - Unlimited; Subject Category 62
19. Security Classif. (of this report): Unclassified
20. Security Classif. (of this page): Unclassified
21. No. of Pages: 32
22. Price: A03
For sale by the National Technical Information Service, Springfield, Virginia 22161
Table of Contents
Abstract
1. Introduction
2. Background
2.1 Proposed Methodology
2.2 Experiment Environment
2.2.1 Stage 1 - Standalone
2.2.2 Stage 2 - Operating System (OS)
2.2.3 Stage 3 - Integrated Instrumentation Environment
2.3 Present and Future Experiment Environment
2.4 The Fault Tolerant Multi-Processor (FTMP)
2.5 Proposed Experiments
3. The Experiments
3.1 Clock Read Time Delay
3.2 Instruction Times
3.3 Measuring R4 Frame Size
4. Results
4.1 Read Time Clock Delay
4.2 Instruction Measurement
4.3 Measuring R4 Frame Size
5. Summary and Future Work
6. Acknowledgment
List of Figures
Figure 2-1: System Evaluation List
Figure 2-2: Levels of Abstraction in Multiprocessor Systems
Figure 2-3: FTMP System
Figure 2-4: Frame Structure
Figure 2-5: Frame Stretch Mechanisms
Figure 3-1: Basic Experiment Task Algorithm
Figure 4-1: Single Triad R4 Frame Distribution
Figure 4-2: Double Triad R4 Frame Distribution
Figure 4-3: Stretched Frame - 2000 Iterations
Figure 4-4: Stretched Frame - 3000 Iterations
Figure 4-5: Stretched Frame - 5000 Iterations
Figure 4-6: Frame Size (mSeconds) vs. Iteration Count
List of Tables
Table 4-1: Instruction Results
Table 4-2: Frame Measurement Results
Table 4-3: Frame Stretching Results
Executive Summary 
This is a report on research by Carnegie-Mellon University for NASA-Langley Research Center
under contract NAG-1-190 on the validation of fault-tolerant avionics multiprocessors. This report
covers the initial phase of experimentation.
In this period a series of basic performance measurements were conducted on the Fault Tolerant
Multi-Processor (FTMP) as part of a process to evaluate validation methodologies. The results of
these experiments and proposals for future work are presented in this report.
A paper based on this work, entitled "Validation of Fault-Free Behavior of a Reliable Multiprocessor
System - FTMP: A Case Study", was presented at the American Control Conference in June 1984.
Abstract 
This report describes a set of experiments which were implemented on the Fault Tolerant Multi-
Processor (FTMP) at NASA/Langley's AIRLAB facility. These experiments are part of an effort to
formulate and evaluate validation methodologies for fault tolerant computers. This report deals with
the measurement of single parameters (baselines) of a fault free system.
The initial set of baseline experiments led to the following conclusions:
1. The system clock is constant and independent of workload in the tested cases;
2. The instruction execution times are constant;
3. The R4 frame size is 40 ms with some variation;
4. The frame stretching mechanism has some flaws in its implementation that allow the
possibility of an infinite stretching of frame duration.
Future experiments are planned. Some will broaden the results of these initial experiments. Others
will measure the system more dynamically. The implementation of a synthetic workload generation
mechanism for FTMP is planned to enhance the experimental environment of the system. [Footnote 1]
Footnote 1: This research was sponsored by the National Aeronautics and Space Administration, Langley Research Center, under
contract NAG-1-190. The views and conclusions contained in this document are those of the authors and should not be
interpreted as representing the official policies, either expressed or implied, of NASA, the United States Government or
Carnegie-Mellon University.
1. Introduction 
An aircraft of the 1990's will have computer systems that must function correctly for the aircraft to
fly. Many studies have been performed on fault tolerant avionics computers. One such study by
NASA, in its Aircraft Energy Efficiency (ACEE) program, requires that the probability of an aircraft
computer failure be less than 10^-10 per hour. Systems have been built with this goal in mind (SIFT
and FTMP [5, 6, 11]). Techniques must be developed for measuring the performance and reliability of
these systems.
Since a probability of failure of 10^-10 per hour translates to less than one failure per million years of
operation, it is not feasible to wait for enough accumulated operational hours to demonstrate
compliance with the goal prior to the release of the aircraft computer. A comprehensive validation
methodology can greatly reduce the amount of time required to determine if a system meets its design
goals. An overall validation methodology has many components, including theorem proving,
mathematical modeling, and physical experimentation. NASA has held several workshops to develop
a system validation procedure. One workshop in particular [8] produced a detailed list of validation
tasks to verify a system in an orderly manner.
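The arithmetic behind the "one failure per million years" figure can be checked with a short sketch. It assumes continuous operation and an average year of 365.25 days; the function and constant names are illustrative and do not come from the report.

```python
# Convert a per-hour failure probability into expected years between failures.
HOURS_PER_YEAR = 24 * 365.25  # average calendar year (an assumption of this sketch)


def mean_years_between_failures(failure_rate_per_hour: float) -> float:
    """Return the expected years of continuous operation per failure."""
    hours_between_failures = 1.0 / failure_rate_per_hour
    return hours_between_failures / HOURS_PER_YEAR


years = mean_years_between_failures(1e-10)
print(f"{years:,.0f} years between failures")
```

Under these assumptions the result is a little over one million years, which is why accumulating operating hours is not a practical way to demonstrate compliance.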
Theorem proving and mathematical modeling are often based on a simplification and abstraction of
the physical system. These simplifications are required to reduce the complexity of the mathematics
to a tractable level. Experimentation is a key element in a validation methodology since it serves to
validate the model and abstractions assumed in the mathematical treatments as well as to discover
unanticipated phenomena.
The foundations for an experimental validation methodology have been developed and are being
tested at the Avionics Integrated Research Laboratory (AIRLAB) at the NASA Langley Research
Center. AIRLAB is a facility for developing technologies and methodologies to evaluate and integrate
avionics and control functions of future aircraft and to establish a store of performance evaluation
and reliability evaluation statistics.
In parallel, Carnegie-Mellon University had developed several multiprocessor systems including
C.mmp, a system with 16 processors communicating with 16 memories through a crossbar switch
[12], and Cm*, a 50 processor system with a hierarchical processor-memory switch [10]. Over the
past decade researchers at CMU were evolving experimental methodologies for evaluating these
multiprocessor systems. AIRLAB provided an opportunity to apply and extend the experimental
methodologies to real time fault tolerant multiprocessor systems. The generality of the experimental
methodology could be demonstrated by its application to four diverse multiprocessor systems while
producing actual measurements which compared and contrasted these architectures.
The remainder of this document is organized as follows. Section 2 gives background on the
validation methodology used as the basis of the baseline experiments. It also introduces the Fault
Tolerant Multiprocessor (FTMP), on which the baseline experiments were performed, and gives a
short description of the types of baseline experiments that were run on FTMP. Section 3 describes
the basic experimental structure and the variants used to measure different baseline parameters. In
Section 4, the results of the experiments are presented and conclusions drawn. A summary of the
experiment results and a discussion of future work are presented in Section 5.
2. Background 
This section contains information necessary to understand the motivation for the experiments
discussed in the next section. Included are a description of the validation methodology used as a
basis for the experiments and a description of the Fault Tolerant Multi-Processor (FTMP).
2.1 Proposed Methodology
NASA held several workshops to determine system validation procedures. One in particular [8]
produced a detailed list of validation tasks to verify a system in an orderly manner. The methodology
was based on a building block approach in that confidence would be built up in an incremental
manner through the understanding and measurement of primitive activities. Once these primitive
activities were characterized, more complex experiments would be devised to explore the interaction
of primitive activities as well as more complex activities constructed from these primitive activities.
This orderly progression ensures uniform coverage and maximizes the ability to locate the cause
of an unexpected phenomenon. A modified version of this list is shown in Figure 2-1.
1. Fault Free Evaluation
   a. Initial Checkout and Diagnostics
   b. Programmer's Manual Verification
   c. Executive Routine Verification
   d. Multiprocessor Interconnect Verification
   e. Multiprocessor Executive Routine Verification
   f. Application Program Verification and Performance Baseline
2. Fault Handling Evaluation
   a. Simulation of Inaccessible Physical Failures
   b. Single Processor Fault Insertion
   c. Multiprocessor Fault Insertion
   d. Single Processor Executive Failure Response Characterization
   e. Multiprocessor System Executive Fault Handling Capabilities
   f. Application Program Verification on Multiprocessor
   g. Multiple Application Program Verification on Multiprocessor
Figure 2-1: System Evaluation List
The first set of six tasks verifies the fault free functionality of the system while the next set of seven
verifies fault handling capabilities. The reader is referred to [8] for a more detailed explanation of the
above listed tasks. The experiments described in this paper deal only with the set of fault free
performance evaluation tasks. The experiments run on FTMP were primarily concerned with performance
baselines (part of task 1f), although verification at other levels was accomplished as well (for example,
Executive Routine Verification, task 1c) while running baseline experiments.
2.2 Experiment Environment
Multiprocessor systems are enormously complex. In order to make them easier to comprehend, it is
necessary to divide the system into several levels. One can then proceed from the most primitive level
upwards to the highest conceptual level by introducing a series of abstractions. Each abstraction
contains only information important to its particular level, and suppresses unnecessary information
about lower levels. The levels in a digital system frequently coincide with the system's physical
boundaries since the concept of levels was utilized by the system's designers to manage complexity.
Once details at one level are comprehended, only the functionality provided for the next higher level
need be considered. Figure 2-2 depicts one possible set of levels of abstraction.
Levels: Multiprocessor, Program, Hardware
Level or Sublevel        Typical Components
Multiprocessor           Processor, memory, switches
Application Software     Display, navigation, flight control
Executive Software       Message system, task scheduler, memory allocator
Instruction Set          Memory state, processor state, effective address calculation, instruction execution
Logic                    Gates, flip-flops, registers, sequential machines
Figure 2-2: Levels of Abstraction in Multiprocessor Systems
Our experience at CMU indicates that multiprocessors go through a series of evolutionary stages. A
stage is defined by the amount of functionality available to the user. This functionality, in turn,
determines the complexity and sophistication of experiments that can be conducted. This
functionality can usually be defined in terms of the activities in the life of an experiment. First, the
code has to be designed and written. Next, the code must be compiled, followed by loading,
debugging, measurement, and analysis. Consequently, one view of the stages of a system's life is the
number of these activities that are directly supported by the system for the user. The following are
three representative stages in the evolution of a typical multiprocessor system.
2.2.1 Stage 1 - Standalone
The system is completed through the instruction set level of abstraction. That is, the instruction set
has been defined and the hardware has been implemented. There is virtually no software to support
user applications. The only software utility would be a loader whereby programs compiled on another
machine can be loaded into the system under test. Experiments are limited to simple, regular,
compute bound algorithms. Only a limited number of parameters may be varied, and this variation
requires rewriting of the source code of the experiment. There are several attributes to Stage 1
experiments. The programmer must be a hardware expert since there is little software to provide a
higher level virtual (abstract) machine. Hence the program is tied closely to the hardware. The user
must specify where code is placed, define the memory map, and write code to initialize the memory,
create processes, manage resources, and collect data.
Typical baseline experiments in Stage 1 include:
• Hardware Saturation. Programs consist of two or three instruction loops with variation
in placement of code and data. The capacity of various system hardware resources is
determined as well as the impact of contention for those resources.
• Speedup due to Algorithm/Data Variation. Experiments seek the impact of
synchronization for data, as well as variation due to data dependencies and size of data.
• Errors. Diagnostic programs can be continuously run and monitored on the system.
Distribution of diagnostic detected errors can be studied.
2.2.2 Stage 2 - Operating System (OS)
The user is presented the abstraction provided by the executive software. This software provides
basic functions such as resource management and scheduling. In programming experiments, the
user employs operating system primitives. Hence, the user needs substantial operating system
expertise. Also characteristic of this phase is the discrete, incremental nature of the experimentation
process; each experiment represents one point in the design space.
The attributes of Stage 2 applications can be stated as follows:
• very regular, data bound with limited variation of parameters
• the general program organization has a Master process controlling a collection of Slave
processes doing the actual computation
• code is replicated
• heavy use of OS mechanisms
Typical baseline experiments in Stage 2 include:
• Measurements of the cost-per-feature of the operating system's functions.
Experiments statically exercise each OS function on a one by one basis. Examples
include memory management, communication primitives, synchronization, scheduling
and exception handling.
• Measurements of different implementations of parallel algorithms. The impact of
using various strategies in parallel program organization, data structure and resource
allocation is studied.
2.2.3 Stage 3 - Integrated Instrumentation Environment
At this stage hardware and software have been provided for generating experimental stimulus,
dynamically observing hardware and software activities, and analyzing results [9]. With this
enhanced support, the user can experiment at the application level of abstraction with full variation of
parameters. A major characteristic of this stage is the provision of stimulus generation, monitoring,
data collection and analysis grouped under a unique user interface. Also, the OS, the support
software and the user application are uniformly instrumented, enabling improved behavior visibility.
Only with this capability does the interaction between OS, support software and user application become
measurable with acceptable effort. Hence, the programmer could be a relative system novice.
Furthermore, the effort to conduct advanced experiments becomes manageable. Experiments at this
stage have the following attributes:
• Measurements of dynamic behavior of OS and applications.
• Measurements are continuous. Programs could be monitored on-line and sometimes in
real-time.
• Studies of different virtual machines.
• Studies of different logical intercommunication structures.
• Scaling application performance with respect to different virtual machines.
Examples of advanced experiments that can be conducted in Stage 3 include:
• Comparison of various OS policies as reflected by classes of applications.
• Tuning a virtual machine for a specific application.
• Designing application oriented architectures.
• Study of multiprocessor intercommunication strategies.
• Study of the architectural effectiveness and efficiency.
• The handling of faults represents additional load for the avionics system. The fault
capabilities represent another aspect of system functionality. Whereas a system without
faults may be able to meet all of its deadlines, the addition of fault handling workload may
cause schedule slippage and/or violations of real-time constraints.
A key part of the Stage 3 methodology is the specification and generation of a controlled parallel
workload [9]. Such a workload for avionics applications is given in [2]. The workload is represented
as a special purpose parallel data-flow graph. A run-time experimentation environment provides the
capability of controlling, varying and measuring the workload without having to recompile or re-debug
the parallel program.
2.3 Present and Future Experiment Environment 
A significant amount of work was required by AIRLAB personnel to bring the system environment up
to Stage 2. At present, each experiment generally requires some code compilation, followed by
linkage and downloading of the whole FTMP binary file. Experiments can be designed with
modifiable variables so that some variation can be made by changing values in memory without
having to go through the entire code development cycle. An example of a modifiable variable is the
number of iterations in a loop. The experiments described in this paper used the modifiable variable
approach.
A more specialized workload generation mechanism is being developed for use on real-time
multiprocessor systems (FTMP in particular [1]). With this mechanism in place, experiments can be
run in an environment somewhere between Stage 2 and Stage 3. This model considers tasks of a
specific organization and deals with a simple set of parameters. The system is assumed to be made
up of a bus with several processors (each with local memory), one global memory, and I/O.
Operating system tasks are considered part of the system under measurement. Traffic on the bus is
restricted to I/O and Inter-Process Communication (IPC), each of which accesses memory.
In this real-time model, tasks are made up of five sections: read in I/O data,
read in IPC data, perform some representative operation, write out I/O data and write out IPC data.
The amount of work performed in each section can be varied by parameters.
The workload structure was designed for simplicity so that variations in the workload parameters
and the resulting measurements could be easily understood. The system parameters consist of total
I/O, total IPC, and total instruction executions per second. Each system parameter is divided
between functions as a percentage of the total work each function performs. Each function is in turn
made up of tasks which divide the work of the function as evenly as possible. Measurement of the
throughput, system utilization and interaction of the system is done by using the system clock to
measure when a task begins and ends.
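The way system-level totals flow down to functions and then to tasks can be sketched as follows. This is only an illustration under stated assumptions: the totals, the function names, the shares and the task counts are invented placeholders, and the rounding rule is one reasonable reading of "as evenly as possible", not the mechanism described in [1].

```python
def split_evenly(total: int, parts: int) -> list[int]:
    """Divide an integer amount of work so that shares differ by at most one."""
    base, extra = divmod(total, parts)
    return [base + 1 if i < extra else base for i in range(parts)]


def allocate(totals: dict[str, int],
             shares: dict[str, float],
             task_counts: dict[str, int]) -> dict[str, dict[str, list[int]]]:
    """Give each function its percentage of every system parameter,
    then split that amount across the function's tasks."""
    plan = {}
    for function, share in shares.items():
        plan[function] = {
            parameter: split_evenly(round(total * share), task_counts[function])
            for parameter, total in totals.items()
        }
    return plan


# Hypothetical per-second totals and function mix (not values from the report).
totals = {"io": 1000, "ipc": 400, "instructions": 50000}
shares = {"flight_control": 0.5, "navigation": 0.3, "display": 0.2}
task_counts = {"flight_control": 3, "navigation": 2, "display": 1}

plan = allocate(totals, shares, task_counts)
for function, parameters in plan.items():
    print(function, parameters)
```

Keeping the split in a single small function makes it easy to see how a change in one workload parameter propagates to every task.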
Prior to presenting a detailed definition of a set of baseline experiments, we will first describe the
experimental vehicle so that the reader can observe how individual baseline experiments have to be
modified to take into account the specific implementation of an architecture.
2.4 The Fault Tolerant Multi-Processor (FTMP) 
The Fault Tolerant Multi-Processor (FTMP) has been discussed in several papers and manuals [5,
6]. This section will only describe those details necessary for understanding the experimental results.
For more details the interested reader is referred to the literature.
[Figure 2-3: FTMP System. Block diagram showing processors 1, 2 and 3, each with 8K PROM and 8K RAM, connected by the system bus to a 32K global memory, I/O ports, a real-time clock and error latches.]
Figure 2-3 depicts FTMP at the software level (as seen by the application programmers). There are up
to three triads, each with local memory. A triad consists of three processors that the programmer
sees as a single processor. A bus connects the triads to global or main memory, I/O devices, a
real-time clock and several latches needed for fault handling. The triads only execute independently
when accessing local memory.
Work is performed by tasks. A task is a process that can be started independently of other tasks.
Each triad will run tasks according to a schedule. Due to the real-time nature of the application, triads
do not necessarily execute the same tasks in the same order. Each task is assigned a time limit. If a
task cannot be completed within the time limit, the task is stopped and the next task started.
Tasks are run within frames. Frames also act as a synchronization mechanism between triads. One
of the triads becomes the leader and starts a frame for that triad and signals all of the other triads to
start the frame. In the time allotted by the frame, the group of working triads must execute all of the
tasks they are assigned. The tasks are in a global linked list with each pointing to the next task
(except the last, which has a null pointer). The individual triads access the global list to select a task.
If there is more than one triad, some tasks will be executed in parallel. When there are no more tasks
available for a triad to execute, the triad becomes idle until the end of the frame. At that time, a triad
becomes leader and starts a new frame.
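The dispatching rule described above (each free triad takes the next task from a shared list, a task is cut off at its time limit, and a triad with nothing left to do idles until the frame ends) can be sketched as a small simulation. This is a simplified model under stated assumptions, not the FTMP executive: leader selection, bus contention and frame overrun are ignored, and all durations are invented.

```python
import heapq


def run_frame(durations_ms, limits_ms, n_triads, frame_ms):
    """Simulate one frame of tasks taken in order from a shared task list."""
    # Heap of (time at which the triad becomes free, triad id).
    free_at = [(0.0, triad) for triad in range(n_triads)]
    heapq.heapify(free_at)

    schedule = []
    for task, (duration, limit) in enumerate(zip(durations_ms, limits_ms)):
        start, triad = heapq.heappop(free_at)   # earliest free triad takes the task
        ran = min(duration, limit)              # a task is stopped at its time limit
        schedule.append((task, triad, start, start + ran))
        heapq.heappush(free_at, (start + ran, triad))

    # Whatever time remains in the frame is idle time for that triad.
    idle = {triad: max(0.0, frame_ms - finished) for finished, triad in free_at}
    return schedule, idle


schedule, idle = run_frame(
    durations_ms=[10, 5, 8, 4],   # hypothetical task run times
    limits_ms=[12, 12, 6, 12],    # hypothetical per-task time limits
    n_triads=2,
    frame_ms=40,
)
for entry in schedule:
    print(entry)
print(idle)
```

In this example the third task is stopped after 6 ms because of its limit, and both triads finish well before the 40 ms frame ends.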
In FTMP there are actually three frame sizes, each having a different frequency of execution, as seen
in Figure 2-4. Each triad has separate pointers to tasks for each rate group.
[Figure 2-4: Frame Structure. Three timelines with frame marks: R4 frames (minor frames, 40 Hz), R3 frames (20 Hz) and R1 frames (major frames, 5 Hz).]
The frame sizes are:
• R4, the basic frame size
• R3, equivalent to 2 R4 frames
• R1, equivalent to 4 R3 frames, the 'major' frame
Task execution becomes more complicated with multiple frame sizes. One triad still signals the start
of the R4 frame; however, every second R4 frame it also starts an R3 frame and every eighth R4 frame
it starts an R1 frame. The order in which a triad executes tasks for the different frame groups is fairly
simple. First, it executes all of the R4 tasks; then (still in the same R4 frame) it executes R3 tasks. If
the R3 tasks do not finish before the next R4 frame, execution of the R3 task is suspended and
another R4 frame is started. Again, when all of the R4 tasks are done, the R3 tasks are continued. If
the R3 tasks are finished, the R1 tasks are started. If these tasks are not finished before the beginning
of the next R4 frame, they are suspended and resumed after the R4 tasks are done in the next frame. If
another R3 frame starts before the R1 tasks finish, the current R1 task is suspended in the triad until
time is available in a frame.
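The relationship between the three rate groups can be made concrete with a short sketch. It assumes the nominal 40 ms R4 frame reported in the abstract and assumes that all three frames are aligned at R4 frame number 0; the function names are illustrative only.

```python
R4_MS = 40          # nominal R4 frame size reported in the abstract
R3_MS = 2 * R4_MS   # an R3 frame spans two R4 frames
R1_MS = 4 * R3_MS   # an R1 (major) frame spans four R3 frames

PRIORITY = ["R4", "R3", "R1"]  # order in which a triad works through rate groups


def frames_started(r4_index: int) -> list[str]:
    """Return the frames that begin at the start of the given R4 frame."""
    started = ["R4"]
    if r4_index % 2 == 0:
        started.append("R3")
    if r4_index % 8 == 0:
        started.append("R1")
    return started


def next_rate_group(pending: dict[str, int]):
    """Pick the rate group a triad works on next: R4 first, then R3, then R1."""
    for group in PRIORITY:
        if pending.get(group, 0) > 0:
            return group
    return None  # nothing left: the triad idles until the next frame


for index in range(9):
    print(index, frames_started(index))
print(R3_MS, R1_MS)
```

Because the choice is re-evaluated whenever a new R4 frame starts, unfinished R3 or R1 work is naturally suspended behind the fresh R4 tasks.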
[Figure 2-5: Frame Stretch Mechanisms. Three timelines of frame initiation time marks: R4 frames (slip due to R4 longer than 60% of R4 frame), R3 frames (slip of one R4 frame due to R3 incompletion) and R1 frames (slip of one R3 frame due to R1 incompletion).]
There is another interesting item concerning frames. According to the documentation, if a task
needs more time in a frame, the frame can be stretched as illustrated in Figure 2-5. An R4 frame is
stretched by a specific amount, and R3 and R1 frames are stretched by giving them more R4 frames.
The third baseline experiment uncovered some interesting properties of this stretching mechanism.
Time is kept using a global clock. The clock has a resolution of 25 ms (that is, each clock tick is
25 ms). The clock and I/O devices are accessed by using a function called HREAD. A
complementary function is HWRITE. HREAD allows a program to transfer bytes between a device and
the local memory of a triad. Transfers between local and global memory occur by invoking the
functions RD and WRT. Knowing the amount of time that these transfers take and how the time can
vary is crucial to understanding the accuracy of measurements.
Several computer systems are used to run the experiments. Programs for FTMP are written in a
language called AED and in assembly language. The compiler, assembler and linker for these
languages reside on an IBM 4341. Load files are transferred to a VAX-11/750. Special interface
programs on the VAX are used to load, read and write global memory locations in FTMP. Also, a
batch facility on the VAX allows experiments to be run unattended. An HP terminal displays the status
and other features of FTMP while it is running. Recently, it became possible to remotely access FTMP
through the VAX so that experimenters do not have to be present at AIRLAB to conduct experiments.
2.5 Proposed Experiments 
A candidate set of baseline experiments was organized according to the levels of abstraction
depicted in Figure 2-2:
• Instruction Set Level
o Assembly and high-level language instruction times
• Executive Software Level
o Executive primitive and overhead times
o Interrupt procedure times
o Memory access time
o Bus access and contention delays
o Fault tolerance overheads
• System and Application Level
o Frame utilization characteristics (including OS overhead, bus contention delay
and fault tolerance overhead)
The specific baseline experiments that are reported upon in this paper are:
• Clock Read Delay. In order for subsequent experiments to be valid, the delay and
variation in reading the clock must be determined.
• Processor Performance for Simple Operations. This is a measure of the amount of time
required by the processor to perform simple AED instructions, for example 'A = 1' or
'A = B + C'.
• R4 Frame Iteration Rate. The measurement of the R4 frame under nominal conditions as
well as when stressed by long tasks.
3. The Experiments 
The goal of the experiments described in this report is to obtain simple performance measures of
processor and operating system functions. The method used to do this consists of three steps:
• Record start time
• Perform operation(s)
• Record end time
Variations in implementing this approach are due to the constraints of the FTMP system. The
experiments use a framework as in Figure 3-1, which is described in the following paragraphs. A
framework related to this one has been used to implement a synthetic workload environment on
FTMP [3].
Begin
  EXEC = Read(CMU.EXEC);
  If CMU.EXEC <= SomeCount Then
  begin
    RTCNUM = Read(CMU.RTCNUM);
    Hold = Read(RT.CLOCK);
    For X = 1 to RTCNUM do
    begin
      SomeInstructions;
    end;
    Hold1 = Read(RT.CLOCK);
    Write(Hold, CMU.TIME(1));
    Write(Hold1, CMU.TIME(2));
    EXEC = EXEC + 1;
    Write(EXEC, CMU.EXEC);
  end;
End
Figure 3-1: Basic Experiment Task Algorithm
When a task starts, a global variable called CMU.EXEC is read from global memory. If it is above a
certain value (which depends on the experiment) the task is terminated. If it is not, a second variable,
CMU.RTCNUM, is read from global memory. CMU.RTCNUM is the number of iterations that a loop
must execute. In most cases, the global time is read, an instruction is repeated a number of times
(defined by CMU.RTCNUM) and the time is read again. These numbers are then stored in the global array
CMU.TIME. Finally, CMU.EXEC is incremented.
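For readers unfamiliar with AED, the same task logic can be sketched in Python against a simulated system. The FakeSystem class, its dictionary of global variables and its tick-based clock are hypothetical stand-ins for FTMP global memory and the real-time clock; only the control flow mirrors Figure 3-1.

```python
class FakeSystem:
    """Hypothetical stand-in for FTMP global memory and the real-time clock."""

    def __init__(self, rtcnum):
        self.global_memory = {"CMU.EXEC": 0, "CMU.RTCNUM": rtcnum, "CMU.TIME": [0, 0]}
        self.now = 0  # simulated clock, in ticks

    def read_clock(self):
        return self.now

    def work(self):
        """Simulated operation under test: costs exactly one tick."""
        self.now += 1


def experiment_task(system, some_count, operation):
    """Run one activation of the experiment task; return True if it measured."""
    memory = system.global_memory
    exec_count = memory["CMU.EXEC"]
    if exec_count > some_count:
        return False                      # enough runs recorded: terminate at once
    iterations = memory["CMU.RTCNUM"]     # modifiable without recompiling
    start = system.read_clock()           # record start time
    for _ in range(iterations):
        operation()                       # perform operation(s)
    end = system.read_clock()             # record end time
    memory["CMU.TIME"] = [start, end]
    memory["CMU.EXEC"] = exec_count + 1
    return True


system = FakeSystem(rtcnum=100)
first = experiment_task(system, some_count=0, operation=system.work)
second = experiment_task(system, some_count=0, operation=system.work)
print(first, second, system.global_memory)
```

The second activation returns immediately because CMU.EXEC has already passed the threshold, which is how the task stays dormant until the counter is reset from outside.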
The time limit for each experiment task must be large enough that the task can finish. Also, some of
the experimental tasks must finish before the 60% mark of the R4 frame has been reached. According
to the documentation, after that time an interrupt will occur. The interrupt would invalidate any clock
interval times.
Using the VAX/FTMP interface program called CTA and a batch command script, the values in the
global array can be read and stored in a file. CTA can also set memory locations. Therefore, a
command script can set CMU.EXEC to 0, wait, read the global array and repeat as many times as
desired.
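The host-side collection loop can be sketched as follows. The report does not give CTA's command syntax, so the interface object here is a hypothetical stub with write and read methods; it simply pretends that the experiment task runs once each time CMU.EXEC is reset.

```python
import time


class StubInterface:
    """Hypothetical stand-in for the VAX/FTMP interface; not real CTA syntax."""

    def __init__(self, interval_ticks=7):
        self.memory = {"CMU.EXEC": 1, "CMU.TIME": (0, 0)}
        self.interval_ticks = interval_ticks
        self.clock = 0

    def write(self, name, value):
        self.memory[name] = value
        if name == "CMU.EXEC" and value == 0:
            # Pretend the experiment task ran once as soon as it was re-armed.
            start = self.clock
            self.clock += self.interval_ticks
            self.memory["CMU.TIME"] = (start, self.clock)
            self.memory["CMU.EXEC"] = 1

    def read(self, name):
        return self.memory[name]


def collect_runs(interface, runs, wait_s=0.0):
    """Re-arm the task, wait, read the global array, and repeat."""
    intervals = []
    for _ in range(runs):
        interface.write("CMU.EXEC", 0)            # set CMU.EXEC to 0
        time.sleep(wait_s)                        # wait for the task to run
        start, end = interface.read("CMU.TIME")   # read the global array
        intervals.append(end - start)
    return intervals


intervals = collect_runs(StubInterface(), runs=3)
print(intervals)
```

Against the real system the wait would have to be long enough for at least one frame in which the task is scheduled to complete.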
It is important to note that the experiments allow the number of iterations to be changed using CTA.
For example, if the number of iterations is found to be too small to obtain useful results,
CMU.RTCNUM can be increased using CTA and the experiment can be run again. Changes can be
made without having to recompile, relink and reload FTMP. Thus a great deal of experimentation
overhead is saved.
3.1 Clock Read Time Delay 
In order for any subsequent experimental results to be considered valid, the characteristics of the
clock must be determined. The delay and variation in reading the clock must be determined, as well
as the causes of any variations. If these variations cannot be characterized or minimized, any further
experiments using the clock would be suspect. For example, in the Cm* multiprocessor system, there
was as much as 4.6% difference in clock frequency, and substantial variation of clock read delays [7].
An example of a possible characteristic of clock read delay would be a constant offset that could be
subtracted from any future experiment results using the clock.
On FTMP the time is read with the instruction 'HREAD(RT CLOCK,variable,2);'. In the experiment
task, 16 iterations, each of 5 clock reads, were made, with the time before starting and the time of the
last read being stored in global memory. Referring to the framework in Figure 3-1, 'SomeInstructions'
is replaced by 5 consecutive clock reads and RTCNUM becomes 16. A second task was created that
did exactly the same as the first task except that it did not write to global memory. This second task
was placed so that it would be the second task to start execution. If two triads were in use, the
second task would execute in parallel with the first and add contention for the clock.
The experiment was repeated about 100 times for three situations:
1. Triad 1 running alone,
2. Triad 2 running alone,
3. Triads 1 and 2 running simultaneously (contention for the clock).
The first two runs determined the single triad clock read time with no contention, and the variation
between triads. The third case determined how contention for the common clock resource affects the
clock read time.
3.2 Instruction Times 
The times for the following AED instructions were measured:
1. 'Null'
2. A = 1; (integer assign)
3. A1 = 1; (real assign)
4. A2 = 1; (long assign)
5. A = B + C; (integer add)
6. A1 = B1 + C1; (real add)
7. A2 = B2 + C2; (long add)
8. A = B * C; (integer multiply)
Each of these instructions was executed in a loop 100 times along with the instruction 'A = 1;'. The
'A = 1;' instruction was added because the compiler would not accept a null statement for the first
instruction. The 'Null' statement was included so that the overhead from clock reading and loop
control can be eliminated from the other instructions, leaving only the time for instruction execution.
Again, referring to Figure 3-1, 'SomeInstructions' is replaced by 'A = 1;' and the instruction being
measured, and RTCNUM becomes 100. This task was executed 308 times.
3.3 Measuring R4 Frame Size 
There were three parts to this experiment. In the first part, the time from the real-time clock was read at
the beginning of the first R4 task. Referring to Figure 3-1, this time was stored in the CMU TIME array
if appropriate (depending on the value of CMU EXEC) and the instruction 'A = 1;' was executed a
specified number of times as determined by CMU RTCNUM.
If CMU EXEC was above the value eight, the task would finish without doing anything else. Eight
consecutive R4 TASK1 start times were stored in each iteration of the experiment. The objective was
to determine the R4 frame duration. The experiment was run about 100 times for a single triad and
for two triads.
The second part of the experiment was to determine how the system behaved when an R4 frame
was stretched. Only one triad was used. The time limit for the task was set to a very large number (so
that a task would not abort before it was finished). Finally, RTCNUM had to be set to several values
that would stretch the frame. These values were 2000, 3000 and 5000. Data was recorded about 100
times for each iteration value.
The last part of the experiment was to determine the effect on the system of a rate group with an
infinite number of tasks. This could be easily done because each task had control information
associated with it. One of the words of information was a pointer to the next task. An infinite string of
tasks could be generated by having a task point to itself as the next task. One R4 task was caused to
execute over and over in this way. Another task was checked to see if it ran once the R4 task started
repeating.
4. Results 
4.1 Read Time Clock Delay 
The clock read overhead was virtually constant for a single triad configuration. The data never
varied more than a clock tick. For the two different triads the results were:

Triad 1:  55.9   ± 0.047    ticks / 16 iterations (95% confidence; see footnote 2)
          14.0   ± 0.012    mSec / 16 iterations
          0.874  ± 0.00073  mSec / iteration
Triad 2:  56.0   ± 0.036    ticks / 16 iterations
          14.0   ± 0.0091   mSec / 16 iterations
          0.875  ± 0.00057  mSec / iteration
Each iteration has 5 clock reads plus loop overhead. Loop overhead per iteration is 15.7 µSeconds
(see Experiment 2). This is subtracted from the iteration time, then the result is divided by 5.

    Triad 1           Triad 2
    (µSec)            (µSec)
     874   ± 0.73      875   ± 0.57     Initial Data
    -15.7  ± 0.11     -15.7  ± 0.11     Overhead
     858.3 ± 0.84      859.3 ± 0.68
      / 5               / 5             Number of Reads
     172   ± 0.17      172   ± 0.14     Clock Read Time
A read with no contention on the bus requires 172 µSeconds. Although there is an indication of some
variation between triads, it is not significant and is within the margin of error for a 95% confidence
interval.
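The arithmetic behind the clock read time can be replayed with a few lines of Python. This is a sketch, not the authors' analysis code: the tick length of 0.25 mSec is an assumption inferred from the reported figures (14.0 mSec for about 56 ticks), and the function name is hypothetical.

```python
TICK_MS = 0.25            # assumed tick length, inferred from 14.0 mSec / 56 ticks
ITERATIONS = 16           # loop iterations per task execution
READS_PER_ITERATION = 5   # clock reads inside each iteration
LOOP_OVERHEAD_US = 15.7   # loop overhead per iteration, as quoted in the text


def clock_read_time_us(ticks_per_run):
    """Convert a measured tick count into microseconds per clock read."""
    run_us = ticks_per_run * TICK_MS * 1000.0        # whole run, in microseconds
    iteration_us = run_us / ITERATIONS               # one iteration of 5 reads
    return (iteration_us - LOOP_OVERHEAD_US) / READS_PER_ITERATION


for triad, ticks in (("Triad 1", 55.9), ("Triad 2", 56.0)):
    print(triad, round(clock_read_time_us(ticks)))
```

Both single-triad tick counts round to the 172 µSeconds quoted in the text.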
In the second measurement, two triads were started, each executing roughly the same code so that
contention for the bus is created. The result for two triads was:

    56.3 ± 0.091  ticks / 16 iterations
    173  ± 0.31   µSec / clock read
It is evident that the contention for the clock at this rate does not affect the delay in reading the clock
greatly (less than 1%). However, the contention is large enough that the ranges of the 95% confidence
intervals for the single triad read time and the double triad read time do not overlap. These results do not
take into account other contention for the bus, such as memory access or I/O device access.

2 All intervals are 95% confidence intervals assuming a normal distribution for the variables. Refer to
Ferrari [4] for a description of confidence intervals and how they are calculated.
The reason that this variation is so small is that the section of code in the read procedure that
actually uses the bus is a small percentage of the whole clock read procedure. Since both
contending procedures are exactly the same when in the iteration section, they will tend to be
synchronized so that only one will actually request control of the bus at a time. The slight variation
from the single triad case could be due to slight variations in the execution rates of the different
processors, so that occasionally the two triads do conflict. However, this would seem to be very
minor.
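Whether two reported intervals overlap is a simple comparison of their end points. The sketch below applies that test to the single-triad and two-triad clock read times quoted in this section and also computes the relative change; the helper name is hypothetical.

```python
def intervals_overlap(mean_a, half_a, mean_b, half_b):
    """True if [mean_a - half_a, mean_a + half_a] and the b interval share a point."""
    return (mean_a - half_a) <= (mean_b + half_b) and \
           (mean_b - half_b) <= (mean_a + half_a)


single = (172.0, 0.17)   # Triad 1 clock read time and half-width, µSec
double = (173.0, 0.31)   # two-triad clock read time and half-width, µSec

overlap = intervals_overlap(*single, *double)
relative_change = (double[0] - single[0]) / single[0]
print(overlap, f"{relative_change:.2%}")
```

With these figures the intervals are disjoint even though the change in the mean is well under 1%.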
On the whole, the real-time clock on FTMP should serve as a reliable measurement device with
predictable delays that can be factored out of experiments. This is especially true in the single triad
case. However, this assumes that the experimenter has complete control of all of the tasks. If an
experimenter on the system with multiple triads lets one triad run uncontrolled, the clock results may
not be reliable. The range of system activities under which the clock times are repeatable should be
explored further.
4.2 Instruction Measurement

                      Clock Ticks    µSec per         µSec per         µSec per
Instruction           per 100        Instruction,     Instruction,     Instruction,
                      Instr. (ave)   w/ Overhead      w/o Overhead     Predicted
Null                  12.3           30.7 ± 0.013
Integer Assignment    18.3           45.7 ± 0.013     15.0 ± 0.026      8.3
Real Assignment       18.4           46.1 ± 0.014     15.4 ± 0.027      8.3
Long Assignment       19.6           49.1 ± 0.014     18.4 ± 0.027     12.3
Integer Add           23.0           57.7 ± 0.004     27.0 ± 0.017     22.3
Real Add              23.2           58.0 ± 0.011     27.3 ± 0.024     22.3
Long Add              27.4           68.6 ± 0.014     37.9 ± 0.027     30.0
Integer Multiply      25.1           62.9 ± 0.010     32.2 ± 0.023     27.4
Table 4-1: Instruction Results
The results of the measurements are shown in Table 4-1. The three times given for each instruction
are as follows. The first column is the time to execute each instruction including the overhead of
reading the clock, maintaining the loop, and the time to execute 'A = 1;'. The second column adjusts
the time from the first column by subtracting out these overheads. The third column represents the time
per instruction predicted by the assembler. The range is for a 95% confidence interval.
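The second column of Table 4-1 can be reproduced from the first by subtracting the 'Null' loop time. A minimal Python sketch, using the with-overhead times from the table (the dictionary and function names are hypothetical):

```python
NULL_US = 30.7  # 'Null' loop: clock reads, loop control, and 'A = 1;'

# µSec per instruction including overhead, from Table 4-1
with_overhead_us = {
    "Integer Assignment": 45.7,
    "Real Assignment": 46.1,
    "Long Assignment": 49.1,
    "Integer Add": 57.7,
    "Real Add": 58.0,
    "Long Add": 68.6,
    "Integer Multiply": 62.9,
}


def without_overhead(times_us, null_us=NULL_US):
    """Subtract the Null baseline from every measured time."""
    return {name: round(t - null_us, 1) for name, t in times_us.items()}


adjusted = without_overhead(with_overhead_us)
for name, t in adjusted.items():
    print(f"{name:20s}{t:6.1f}")
```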
The results showed little variance. The number of 'clock ticks' per frame varied only by one for each
AED instruction. The instructions took longer than suggested by the times given by the assembler
and Draper Labs documents. The predicted times according to the document are actually the times
under best conditions. This makes the predicted times of marginal value in real-time applications. In
order to get a complete view of the instruction execution times, all of the important AED instructions
must be measured on the actual machine.
The overhead needed to measure the instruction (the iteration time and the two clock read times)
can be found by subtracting the Null instruction from the time for the instruction 'A = 1'. If the
overhead is assumed to consist of only the loop instructions, then the amount of overhead per
instruction iteration is 15.0 ± 0.039 µSeconds. This overhead is useful for calculations in other
experiments.
Another aspect of the looping overhead is the error due to the clock resolution. On average this
turns out to be half a clock tick. This value would be subtracted from any absolute time average to
give the actual average time that was measured.
Using 'A = B + C' as an average high level instruction, a rough order of magnitude of the number of
instructions that can be executed in an R4 frame and the rough high level throughput of a triad can be
calculated.

$$\frac{40\ \text{mS/R4 frame}}{27.0\ \mu\text{S/instruction}} \approx 1500\ \text{instructions/R4 frame}$$

$$\frac{1}{27.0\ \mu\text{S/instruction}} \approx 37\ \text{KOPS (AED) throughput}$$

The instruction 'A = B + C;' actually used four assembly instructions. Therefore, a rough assembly
level throughput would be 150 KOPS.
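The throughput estimate is plain arithmetic and can be checked with a short Python sketch. The constant names are hypothetical; the values are the ones quoted in the text.

```python
FRAME_MS = 40.0      # nominal R4 frame length
ADD_US = 27.0        # measured time for 'A = B + C;' without overhead
ASM_PER_AED = 4      # assembly instructions generated for 'A = B + C;'

instructions_per_frame = FRAME_MS * 1000.0 / ADD_US
aed_kops = 1000.0 / ADD_US             # thousands of AED operations per second
asm_kops = aed_kops * ASM_PER_AED      # thousands of assembly operations per second

print(round(instructions_per_frame), round(aed_kops), round(asm_kops))
```

The unrounded values (about 1481 instructions per frame, 37 KOPS and 148 KOPS) agree with the rounded figures given in the text.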
4.3 Measuring R4 Frame Size

Triads     Average        Standard       Range
           Time           Deviation
           (mSeconds)     (mSeconds)     (mSeconds)
Single     40.0           0.741          37.75 - 42.25
Double     40.0           0.623          37.75 - 42.25
Table 4-2: Frame Measurement Results
R4 frames varied considerably in size (the amount of time between consecutive frames) from one
frame to another. There may be cyclic variation; however, it is hard to determine from the method
used to obtain the data. The nominal R4 frame measures in the single and double triad cases are
shown in Table 4-2. The distributions of frame sizes are shown in Figures 4-1 and 4-2. The
distribution looks approximately normal except that the frame sizes near the average occur less
frequently than would be expected. The reason for this is unclear.
Figure 4-1: Single Triad R4 Frame Distribution (frame time in mSeconds, 37.50 to 42.50, versus Amount, 0 to 150)
In the second part of the experiment, the R4 frame was stretched. The results of the stretching are
shown in Table 4-3. In all of these runs, the time measured for a frame was usually close to the
average (within a few clock ticks), with some taking several ticks longer and none taking more than 2
clock ticks less than the average (see Figures 4-3, 4-4 and 4-5). The reason for this distribution is
again unknown, but it is probably due to the operating system and dispatcher variations rather than
the task that runs within the frame (see Experiment 2). The actual variations compare to roughly nine
instructions per tick. This could be the difference due to one conditional (if - then - else)
statement.
Figure 4-2: Double Triad R4 Frame Distribution (frame time in mSeconds, 37.50 to 42.50, versus Amount, 0 to 200)

Frame          Average        Standard       Range
Size           Time           Deviation
(Iterations)   (mSeconds)     (mSeconds)     (mSeconds)
2000           80.8           0.480          80.5 - 83.0
3000           108            0.480          107.8 - 110.5
5000           163            0.481          162.3 - 165.0
Table 4-3: Frame Stretching Results

When the average times were plotted against the iteration count, a linear relation emerged (see Figure
4-6). From the documentation, a step function increase was assumed, with a step of 24 mSeconds.
This is also shown on the graph. When the actual code was read, the linear increase was to be
expected. The reason for the supposed step function was a timer interrupt that was to happen every
24 mSeconds. In fact, after the first timer interrupt, 24 mSeconds into the R4 frame, the timer was not
used until the R4 tasks finished. Therefore, the size of the frame would increase linearly above 40 mS.
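The linear relation can be checked against the three averages in Table 4-3 with an ordinary least-squares fit. The Python sketch below is an illustration only, not the authors' analysis: it uses just the three tabulated averages, and the function and variable names are hypothetical.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# (iterations, average frame time in mSeconds) from Table 4-3
stretched = [(2000, 80.8), (3000, 108.0), (5000, 163.0)]

slope, intercept = fit_line(stretched)
residuals = [y - (slope * x + intercept) for x, y in stretched]
worst = max(abs(r) for r in residuals)
print(f"{slope * 1000:.1f} µSec per iteration, worst residual {worst:.2f} mSec")
```

The fitted slope is about 27.4 µSeconds per iteration, and every residual is under 0.2 mSeconds, which is consistent with a linear rather than stepwise increase.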
Figure 4-3: Stretched Frame - 2000 Iterations (frame time in mSeconds, 80.25 to 83.25, versus Amount, 0 to 300)
Figure 4-4: Stretched Frame - 3000 Iterations (frame time in mSeconds, 107.50 to 110.75, versus Amount, 0 to 300)
Figure 4-5: Stretched Frame - 5000 Iterations (frame time in mSeconds, 162.00 to 165.25, versus Amount, 0 to 330)

The final part of the experiment was to determine the behavior of the system when an infinite set of
R4 tasks was started. In the experiment, an R4 task pointed to itself as the next task. If there were no
mechanism for aborting a frame, the R4 frame would continue forever. This could be shown by
attempting to use another task while the R4 task continues to loop. For this experiment the task that
was used to test whether the system was running was an R3 task that failed and restored processors.
Normally, it took only a few seconds from entering a request to reconfiguring the system. However,
when the R4 task began to repeat infinitely, the R3 task could not execute at all. When the infinite
loop was stopped (by nullifying the R4 pointer), the R3 task ran immediately.
This last test points out a flaw in the scheduling software. Although tasks are regulated by giving
them time limits, frames are not limited in this manner. A frame of any rate is simply stretched until all
of the tasks within the frame can finish. This mechanism is not reliable in at least two situations. The
first was described above, in which all other tasks were locked out by one task that pointed to itself.
Another possibly hazardous situation would be a task with its time limit set too high. If, in most cases,
the task takes much less time than the limit, this error may not be noticed. However, if some untested
section of the code starts a long, virtually infinite loop, the system will hang (at least at that rate group)
until that task has stopped. In a real-time application this is equivalent to failing.
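The lock-out described above can be illustrated with a toy dispatcher in Python. Everything in this sketch is hypothetical: the `Task` class, the `run_frame` function and the task costs are invented for illustration, and the optional frame limit is shown only for contrast, since the text states that FTMP frames had no such limit.

```python
class Task:
    """Hypothetical task control block with a pointer to the next task."""
    def __init__(self, name, cost_ms):
        self.name = name
        self.cost_ms = cost_ms
        self.next = None


def run_frame(first_task, frame_limit_ms=None):
    """Walk the task chain; optionally stop once the frame budget is spent.

    Returns (elapsed_ms, completed). completed is False when the walk was
    cut off by the budget instead of reaching the end of the chain.
    """
    elapsed = 0.0
    task = first_task
    while task is not None:
        if frame_limit_ms is not None and elapsed + task.cost_ms > frame_limit_ms:
            return elapsed, False
        elapsed += task.cost_ms        # "execute" the task
        task = task.next               # follow the next-task pointer
    return elapsed, True


# A well-formed chain ends with a null pointer, so the frame terminates.
a, b = Task("a", 10.0), Task("b", 15.0)
a.next = b
print(run_frame(a))

# A task that points to itself never ends the chain. Without a frame-level
# limit the walk would never return; with one, it stops at the budget.
loop = Task("loop", 10.0)
loop.next = loop
print(run_frame(loop, frame_limit_ms=40.0))
```

Calling `run_frame(loop)` without a limit would never return, which mirrors the behavior observed when the R3 task could not run.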
Figure 4-6: Frame Size (mSeconds) vs Iteration Count (time, 0 to 200 ms, versus iterations, 0 to 5000)
5. Summary and Future Work 
This paper described three experiments that were designed within the framework of a validation
methodology. The methodology was derived earlier and is undergoing changes as experience
increases. The experiments were concerned with baseline measurements of the running system. The
major results of these experiments were:
1. The real-time clock is a reliable measurement device and can be used in timing
experiments.
2. The instruction execution times are constant and reproducible. The measured times are
slower than the documented best times.
3. The frames are nominally 40 milliseconds long. There is a variation of many clock ticks in
all measurements.
4. The stretching mechanism allows a linear increase in the size of the frame depending on
the number of instructions to be executed, not a stepwise increase as expected from
reading the documentation.
5. Frame stretching continues until all tasks finish or abort. This is unreliable in some cases.
More work needs to be done to fully characterize the FTMP system. This is especially true of
instruction and procedure call measurements. Major omissions of the present results were the
call/return times for different types of procedures and the system reaction to arithmetic faults. Other
AED instructions should also be measured to get a more complete evaluation of the system.
Enhancement of the experiment environment is planned. The goal of the enhancement is to have
the capability of running several different experiments on FTMP by only changing certain values in
memory. With this environment it is hoped that information can be collected on the time to run
various sizes and types of tasks in many combinations. Information on scheduling and other
operating system overhead might also be obtained with this environment.
6. Acknowledgment 
We wish to acknowledge the help of all of the people of AIRLAB at NASA/Langley Research Center. 
We would especially like to thank Carlos Liceaga, Frank Hill, Dan Koppen, Brian Lupton, George
Finelli and Dale Holden. We would also like to thank Matt Reilly for his initial work on the FTMP
system and Frank Feather for his help in the data analysis.
References 
[1] Clune, E.
Analysis of the Fault Free Behavior of the FTMP Multiprocessor System:
Baseline Measurements and Synthetic Workload.
Master's Project, Carnegie-Mellon University, September, 1984.
[2] Draper Labs.
AIPS System Requirements.
Technical Report AIPS-83-50, Charles Draper Laboratory, 1983.
[3] Feather, Frank E.
Validation of Fault-Free Behavior of a Reliable Multiprocessor System.
Workload Implementation.
Master's Project, Carnegie-Mellon University, March, 1985.
[4] Ferrari, D.
Computer System Performance Evaluation.
Prentice-Hall, Inc, 1975.
[5] Development and Evaluation of a Fault Tolerant Multiprocessor
(FTMP) Computer, Volume I, II, III, IV.
Contract Reports 166071, 166072, 166073, 166074.
Draper Laboratories, 1983.
[6] Hopkins, A. L., et al.
FTMP - A Highly Reliable Multiprocessor.
IEEE Trans on Computers, October, 1978.
[7] Kong, T. H.
Measuring Time for Performance Evaluation of Multiprocessor Systems.
Master's Thesis, Carnegie-Mellon University, November, 1982.
[8] Research Triangle Institute.
Validation Methods Research for Fault-Tolerant Computer Systems--
Preliminary Working Group II Report.
NASA Conference Publication 2130, NASA-Langley Research Center, 1979.
[9] Segall, Z., A. Singh, R. T. Snodgrass, A. K. Jones, D. P. Siewiorek.
An Integrated Instrumentation Environment for Multiprocessors.
IEEE Trans on Computers C-32(1), January, 1982.
[10] Swan, R. J., S. H. Fuller, D. P. Siewiorek.
Cm*: A Modular, Multi-Microprocessor.
Proc AFIPS NCC, vol. 46, 1977.
[11] Wensley, J. H., et al.
SIFT: A Computer for Aircraft Control.
IEEE Trans on Computers, October, 1978.
[12] Wulf, W. A., C. G. Bell.
C.mmp: A Multi-Mini-Processor.
Proc AFIPS FJCC, vol. 41, pt. 2, 1972.
