Multiprocessor architectural study by Vandever, W. H. et al.
Final Report
MULTIPROCESSOR
ARCHITECTURAL
STUDY
By : Alex L. Kosmala, Saul F. Stanten, Woodrow H. Vandever
November 1972
Prepared for the George C. Marshall Space Flight Center,
Huntsville, Alabama 35812, under Contract NAS8-28605
by : Intermetrics, Incorporated
701 Concord Avenue
Cambridge,
Massachusetts 02138
Intermetrics Technical Report #01-73
INTERMETRICS INCORPORATED • 701 CONCORD AVENUE • CAMBRIDGE, MASSACHUSETTS 02138 • (617) 661-1840
https://ntrs.nasa.gov/search.jsp?R=19730011492 2020-03-23T02:54:05+00:00Z

TABLE OF CONTENTS
FOREWORD
ABSTRACT
Chapter 1: Introduction
1.1 Scope and Objectives
1.2 Overview of Intermetrics' Multiprocessor
References
Chapter 2: Multiprocessor Operating System Design
2.1 Introduction
2.2 Problems of Multiprocessing
2.2.1 Parallelism
2.2.2 Exclusive Sections
2.2.3 Shared Data
2.2.4 Conflict Over System Resources
2.2.5 Overhead
2.3 Exclusion and Synchronization
2.3.1 Exclusion Primitives
2.3.2 Synchronization
2.4 Scheduling
2.4.1 Space and Time Allocation
2.4.2 Deadlock Prevention
2.5 Memory Management
2.5.1 Operating Memory Multiplexing
2.6 Implementational Aspects
2.6.1 System Specification
2.6.2 Structure
2.6.3 Systems Programming Language
References

Chapter 3: Interrupt Structure
3.1 Assumptions
3.2 Interrupt Categorization
3.2.1 Process Oriented
3.2.2 System Oriented
3.2.3 Processor Oriented
3.3 Multiprocessor Interrupt Problem Areas
3.3.1 Which Processor to Interrupt?
3.3.2 Response Time
3.3.3 Innovations
3.3.4 The Interrupt Sequence
3.3.5 Interrupt Functional Response
Chapter 4: Memory Hierarchy
4.1 Basic Hierarchy Description
4.1.1 M0 - Micro Level Control Memory
4.1.2 M1 - Local Memory
4.1.3 M2 - Operating Memory
4.1.4 M3 - Mass Memory
4.1.5 M4 - Archival Storage
4.2 Local Storage
    The Problem - Memory Contention vs. Performance
    Two Approaches to an Implementation
4.3 Operating Memory and Memory Management
4.3.1 Background
4.3.2 Segmentation
4.3.3 Paging
4.3.4 Implementing Virtual Memory
References
Chapter 5: Addressing
5.1 Addressing and Instruction Architecture
5.1.1 The Number of Operands in an Instruction

    Single Accumulator and General Registers
    How to Address Operating Memory
5.2 The IBM 360 and Burroughs B6700
5.2.1 Two Dimensional Addressing (Static and Dynamic)
5.2.2 Implicit Addressing
5.2.3 Descriptors
5.2.4 Type Differences
5.2.5 Semantic Conciseness
5.3 Implementation Aspects of a Stack Machine
5.3.1 Definitions
5.3.2 PUSH
5.3.3 POP
5.4 Effective Address Generation (EA) (Lexical Level Offset Addressing)
5.5 Stack Fetch
References
Chapter 6: I/O Considerations
6.1 Space Station System Requirements
6.2 Data Bus I/O
6.3 Mass Storage I/O
6.4 I/O Controller Design
    Central Control (CC)
    Interprocessor Communication Interfaces (IPCI)
6.4.3 Operating Memory Interface
6.4.4 Channels
6.4.5 Interrupt Handler
6.4.6 Timer
6.5 I/O Configuration Organized for Recovery
References
Chapter 7: Fault Tolerance Philosophy for the SUMC Multiprocessor
7.1 Requirements
7.2 Error Detection

7.2.1 Implementing Hardware Error Detection
7.3 Recovery
7.3.1 Processing Unit (P-M1)
7.3.2 Recovery from an Operating Memory (M2) Failure
7.3.3 Fault Tolerant Aspects of the I/OC, Channel
7.4 The Implications of Fail Safe
Chapter 8: Concept Verification
8.1 Background
8.2 Phase 1 - Initial Analysis and High-Level Simulation
    Objectives
    Tools for High-Level Simulation
8.3 Phase 2 - Low-Level, Detailed, Mixed Simulation
8.3.1 The Simulation Process
8.3.2 Simulation Design Issues
References
Chapter 9: Critique of SUMC's Architectural Characteristics
9.1 Design Goals
9.2 Micro Instruction Sequencing
9.3 Choosing Functions to Optimize
9.4 Field Manipulation - Masking - Shifting - Bit Addressing and Shifting
9.5 Limited Scratch Pad Addressing
9.6 Micro and Main Memory Speed Ratio
9.7 Main Memory Synchronization
9.8 Limited Modularity Concept
9.9 The "U" in SUMC - Ultra Reliability
9.10 Confusion Between Design Levels
References

FOREWORD
This document is the Final Report of a multiprocessor
architectural design study, whose objective was to
establish a baseline design for a central multiprocessor
for a Space Station Data Management System exploiting
the NASA/MSFC-developed SUMC hardware where possible.
The study was sponsored by the NASA Marshall Space Flight
Center, Huntsville, Alabama, under contract NAS8-28605,
entitled "Research Study on Memory Hierarchy". It was
performed by Intermetrics, Inc., Cambridge, Massachusetts,
over the period June to October 1972, under the direc-
tion of Alex L. Kosmala. Technical monitors for MSFC
were Mr. Gerald L. Turner and Mr. James L. Lewis.
Publication of this report does not constitute
approval by NASA of the findings or conclusions contained
therein.

ABSTRACT
This is an architectural design study of a multipro-
cessor computing system intended to meet functional and per-
formance specifications appropriate to a manned space station
application as defined by NASA's Marshall Space Flight Center.
Intermetrics' previous experience and accumulated knowledge of
the multiprocessor field is used to generate a baseline philo-
sophy for the design of a future SUMC* multiprocessor.
The operating system design problem for multiproces-
sors is to approach the theoretical performance without sacri-
ficing fault tolerance, flexibility, and expandability. Para-
llel tasking is described as a necessary operating capability
in this regard, while exclusive operators are also needed to
avoid critical section conflicts. Synchronization, scheduling,
and deadlock prevention are other system design features which
are discussed, along with memory management. Treatment of the
topics of operating system specification and structuring, and
the use of a higher order language complete the discussion of
multiprocessor operating systems.
Interrupts are defined and the crucial questions of
interrupt structure, such as processor selection and response
time, are discussed. Memory hierarchy and performance are dis-
cussed extensively with particular attention to the design ap-
proach which utilizes a cache memory associated with each pro-
cessor. The ability of an individual processor to approach its
theoretical maximum performance is then analyzed in terms of a
hit ratio, which is the proportion of time that a memory re-
quest can be supplied from cache only. Memory management is
envisioned as a virtual memory system implemented either through
segmentation or paging.
Addressing is discussed in terms of the various register
designs adopted by current computers and those of advanced
design. Using examples, two dimensional addressing, implicit
addressing, and the use of descriptors are described.
Implementation of a stack-oriented machine is explained, along with
the generation of an Effective Address scheme. The overall I/O
architecture set forth is based upon a Data Bus I/O, to service an
advanced data bus concept, and a Mass Storage I/O. The I/O
Controller design is then discussed in terms of interfaces to the
processors and to the memories, with special emphasis given to
recovery from failure.
* Space Ultra-reliable Modular Computer
A complete chapter is devoted to error detection,
fault isolation, and recovery philosophy as applied to a mul-
tiprocessor system. The important topic of concept verifica-
tion is given careful scrutiny in terms of
a) analytical techniques and high-level computer simula-
tion, and
b) detailed, low-level simulation.
Finally, the report concludes with a detailed critique of SUMC's
architectural characteristics in relationship to the overall
design objectives.

Chapter 1
INTRODUCTION
1.1 Scope and Objectives
The work described in this report is the result of a
study of multiprocessing system design principles, performed
in support of the MSFC in-house multiprocessor computer
development. The initial objectives of the study were to achieve
a top-level architectural design capable of meeting the
functional and performance specifications established for the Phase
B Space Station Information Management System Central Processor,
and in doing so to exploit as much as possible the current MSFC-
developed SUMC processor design. However, during the early
phases of the study it became apparent that in order to preserve
the value of an independently derived evaluation of multipro-
cessor design features by Intermetrics, some deviation from
these objectives would be necessary. The basic philosophies of
multiprocessor design and operation espoused by Intermetrics in
defining an architecture appropriate to the Space Station re-
quirements were found to be incompatible with those already adop-
ted by MSFC in arriving at the current SUMC design. Consequently
it was mutually agreed that rather than using the existing SUMC
design as the basis for the study, Intermetrics should apply the re-
sults of their previous experience and accumulated knowledge of
the multiprocessor field to establishing a SUMC architecture
from an entirely independent point of view. Much of that point
of view was gathered in the performance of a previous design
study [1] with very similar objectives to those expressed for
the SUMC multiprocessor. Although some of the philosophies
which are embodied in that design were directly applicable, it
was decided not to tailor the complete design to the SUMC app-
lication by adopting some features and discarding others. In-
stead, it was decided to select certain multiprocessor design
areas and hardware features and perform an in-depth analysis,
review and evaluation for each, in order to establish the phil-
osophies and the rationale developed by Intermetrics in their
approach to a multiprocessor design. The objective was to pro-
vide a baseline philosophy for the design of a future version
of the SUMC multiprocessor, radically different from the one
proposed in the present MSFC in-house development program.
In Intermetrics' opinion the design of a multiprocessor
for a Space Station application should be guided by the follow-
ing considerations:
a) The performance potentially achievable through the use
of multiple processors (often quoted as the main moti-
vation of multiprocessing but, as will be explained in
Chapter 2, very difficult to achieve) should not be
compromised by implementational incompatibilities, es-
pecially in the executive system, nor sacrificed to
achieve other MP objectives such as fault tolerance,
flexibility, and expandability.
b) Since the overall cost of providing computational capa-
bilities (especially in a difficult environment like a
Space Station) may be dominated by software costs rather
than hardware, the architecture and operating character-
istics of the computer must reflect the needs, desires
and techniques of the programmer rather than those of
the logic designer.
c) The outstanding advantage of a multiprocessor is its
potential tolerance to failures of its components.
This capability should be realized in the initial ar-
chitectural design, and not provided as a final touch
after most design decisions have been made.
The detailed analyses of the areas of multiprocessor
design which were selected for this study reflect the above
basic attitude. They form most of the chapters in the remainder
of this report, and include the following topics: Operating
System design (Chapter 2); Interrupt Structure (Chapter 3);
Memory Hierarchy (Chapter 4); Addressing (Chapter 5); I/O Con-
siderations (Chapter 6); Fault Tolerance (Chapter 7). Additional
chapters cover Concept Verification (Chapter 8), since it was of
some concern to MSFC how any given multiprocessor design could
be given a quantitative evaluation without incurring the initial
investment of a hardware build phase, and a critique of the SUMC
processor internal architecture (Chapter 9).
Much of the description and terminology found in this
report assumes a familiarity with Intermetrics' previous multi-
processor design. To prevent unnecessary (and probably inade-
quate) repetition of the details of that design, the reader is
referred to reference [1]. However, to provide an introduction
to at least some of the terms used we present the following over-
view of the configuration of hardware and software elements of
the design, as extracted from sections of reference [1].
1.2 Overview of Intermetrics' Multiprocessor
[Figure 1. Basic configuration of the multiprocessor]
The basic configuration of the multiprocessor is shown
in Figure 1. The MP was specified to consist of a number of
identical, interchangeable processing elements which would execute
the major processing workload, and a single, more specialized pro-
cessor to handle I/O processing and a number of other unique func-
tions. These functions include interrupt handling, interprocessor
communication control, and the central timer. The executive was
specified to be non-dedicated (to any given processor), and its
functions are performed by any of the processors. The choice of
which processor is made on the basis of status (e.g., by having
completed its current assignment), or by reason of its greatest
interruptibility as determined by the priority of its current
process. The number of computational processors was specified
as three, because the resulting configuration represents the
simplest which possesses completely all the characteristics (and
problems) of the n-processor case. The two-processor system,
which has received the greatest amount of development and opera-
tional experience of all configurations, represents a degenerate
form of multiprocessor: while certainly exhibiting true concur-
rency of processes, nevertheless the dual processor allows cer-
tain simplifications of executive functions to be made because
of the binary number of active elements in the system. The mem-
ory terminology in the figure is used in parts of this report,
and is defined as follows:
a) M1: Local memory, dedicated to, and only for use by
a processor. This is a general term and refers
to all aspects of buffer, scratchpad, control
and associative memory, required by a processing
element. The contents of any M1 storage cell are
available only to the processor of which M1 is
an intimate component. Only in case of recovery
after a P and/or M1 failure are these contents
made available to another processor. In this MP
design M1 is not, strictly, a member of the mem-
ory hierarchy.
b) M2: Operating memory (main memory, or, in popular
terms, "core"). M2 consists of several individual
memory modules, all of which are accessible to all
processors, including the I/O controller. Each
access takes place via a data path dedicated to
each processing element, through a port in each
M2 module. The basic MP configuration, therefore,
requires four ports per M2 module. Each module is
four-way interleaved, for purposes of speed, access,
conflict resolution, and fault recovery.
c) M3: Secondary storage (backup or Mass Memory). Being
a conventional drum or disk, it was decided to
interface this level of the memory hierarchy with
the rest of the computer system in the more con-
ventional manner, via an I/O channel. The use of
M3 to implement the concept of virtual memory then
places the heaviest requirement on the design of
the I/O controller and the I/O executive routines.
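The four-way interleaving of an M2 module described in item b) can be pictured with a small address-mapping sketch. The report does not state how addresses are distributed, so this assumes low-order interleaving (consecutive words rotate through the four banks of a module); the module size `WORDS_PER_MODULE` and all names are hypothetical, chosen only for illustration.

```python
from typing import NamedTuple

BANKS_PER_MODULE = 4      # "four-way interleaved", from the text
WORDS_PER_MODULE = 8192   # hypothetical module size, for illustration only


class M2Location(NamedTuple):
    module: int  # which M2 module holds the word
    bank: int    # which interleaved bank inside that module
    row: int     # word position inside the bank


def locate(word_address: int) -> M2Location:
    """Map a flat M2 word address to (module, bank, row).

    Assumes low-order interleaving: consecutive addresses within a
    module rotate through its banks.
    """
    if word_address < 0:
        raise ValueError("address must be non-negative")
    module, within = divmod(word_address, WORDS_PER_MODULE)
    row, bank = divmod(within, BANKS_PER_MODULE)
    return M2Location(module, bank, row)


if __name__ == "__main__":
    # Consecutive addresses land in different banks of the same module.
    for addr in range(6):
        print(addr, locate(addr))
```

Under this assumption, sequential accesses from one processor spread across banks, which is one way interleaving can reduce conflicts between the four ports of a module.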
As mentioned above, several unique functions were gath-
ered together into one, unique module, which is (for convenience)
termed the I/O controller (IOC). All interfaces to the outside
world were handled via the IOC.
Communication between the processing elements of the MP
system (the P's and IOC) was handled by a separate interproces-
sor bus (IPC_).
(It should be emphasized that the basic configuration
of Figure 1 does not indicate the levels of redundancy specified
for fault detection and/or recovery. For a discussion of these
aspects, refer to Chapter 7.)
The terminology used in this report refers to the way
in which information was organized and handled in the previous
Intermetrics work. The key terms and their assumed definition
are as follows:
a) Program: This is an independently compilable section
of code containing pure procedures and/or data.
b) Procedure: A section of code to which execution control
can be passed, with or without the passage of parameters.
1) Internal, not known outside of process (see below)
2) External, known to name manager and declared in
the Process Information Area (see below)
c) Segment: A contiguous block of words defined by a
descriptor, which is the unit of memory management.
d) Process: The unit of work as recognized by the opera-
ting system. A process is represented by a stack.
e) Stack: Although strictly a LIFO list, the definition
of a stack is less rigorous when used to represent a
process.
f) Level: A demarcation in the addressing hierarchy.
Derived from the concept of lexicographical level in
block structured language (such as ALGOL or HAL), but
extended to provide convenient addressing by the
operating system.
Figure 2 illustrates the relationship and use of some
of these terms. Each process is represented by an execution
stack. The initial hierarchical level for process execution,
and therefore the lowest numerical level for any process stack,
is level 2. Subsequent procedure nesting varies the lexical
level of each process stack to 3, 4, 5, etc. The portion of a
process stack that is below level 2 contains a collection of
data termed the Process Information Area (PIA) containing names,
priorities, counters, for bookkeeping, etc., specific to each
process. Above the PIA the stack behaves more strictly as a
LIFO list.
Each process has associated with it a vector of des-
criptors defining the segments containing the procedures to be
executed by the process. These descriptors are addressed as if
the vector were a stack: by stack number and offset from the
base of the stack. For convenience, this collection of segment
descriptors is termed Level 1, since it exists at a more global
level than the individual processes, and each such vector will
be referred to as a stack (even though, strictly, it is not).
At the most fundamental level there is a single collec-
tion of basic system descriptors, variables, etc., which is
termed the Level 0 stack, again for convenience of addressing.
One descriptor at level 0 points to the stack vector, which con-
tains descriptors of all the stacks in the system including the
"pseudo-stacks" of levels 1 and 0.
Each processor contained a set of hardware registers
which indicated the actual M2 addresses of the start of each
of the system levels, i.e., the base address of the correspond-
ing stack. Figure 2 also shows the linkages that tie the Com-
pool mechanism into the system.
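The level-based addressing described above can be sketched as a lookup through per-processor base registers. This is a minimal illustration, not the design from reference [1]: the class name `LevelBaseRegisters`, its methods, and the base addresses are all hypothetical, and the only property taken from the text is that each level has a register holding the M2 address of the start of its stack.

```python
from typing import Dict


class LevelBaseRegisters:
    """Hypothetical model of one processor's level base registers."""

    def __init__(self, bases: Dict[int, int]) -> None:
        # Maps a level number to the M2 address of the start of its stack.
        self._bases = dict(bases)

    def set_base(self, level: int, m2_address: int) -> None:
        """Record a new base, e.g. when procedure nesting opens a level."""
        self._bases[level] = m2_address

    def effective_address(self, level: int, offset: int) -> int:
        """Resolve a (level, offset) pair to an absolute M2 address."""
        if level not in self._bases:
            raise KeyError(f"no base register loaded for level {level}")
        if offset < 0:
            raise ValueError("offset must be non-negative")
        return self._bases[level] + offset


if __name__ == "__main__":
    # Levels 0 and 1 are the system-wide "pseudo-stacks"; level 2 is the
    # initial level of a process stack. Addresses are invented.
    regs = LevelBaseRegisters({0: 0x0100, 1: 0x0400, 2: 0x1000})
    regs.set_base(3, 0x1040)  # a nested procedure opens level 3
    print(hex(regs.effective_address(1, 2)))
    print(hex(regs.effective_address(3, 5)))
```

Keeping the bases in registers means an operand reference needs only a small level number and an offset, rather than a full M2 address.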
The operating system design philosophy reflected an
emphasis on the achievement of reliable operation of both
hardware and software. It was assumed that only higher order
language(s) would be used in the programming of application
software. The exclusive use of HOLs allows secure system op-
eration to be realized without exhaustive runtime verification
of each request for OS functions. An intimate and well-defined
interface between OS and the compiler(s) was assumed achievable,
so that an optimal division between static (pre-run) and dyna-
mic (runtime) diagnosis could be made.
It was assumed that the language/compiler to be used
in programming the Space Station application software would
possess the facility of handling common data pools (Compools).
The MP design provided a Compool implementation.
[Figure 2. Relationship of processes, stacks, and addressing levels]
Reference for Chapter 1
1. Miller, J.S., et al., "Engineering Study for the
Functional Design of a Multiprocessor Design", In-
termetrics/NASA Contract NAS9-11745, September,
1972.
Chapter 2
MULTIPROCESSOR OPERATING SYSTEM DESIGN
2.1 Introduction
This chapter will discuss the special problems facing
the designer of an operating system for a multiprocessor com-
puter. The scope of the task which is summarized here did not
encompass all aspects of OS design. Emphasis is placed on the
more important functions and on those aspects of OS which are
unique to, or at least more significant for, multiprocessors
as compared with simplex computers.
An operating system for a space station multiprocessor
will be capable of supporting a wide variety of functions. Al-
though some of these may be unique to the application, it is
very probable that the following standard functions will always
be required in some measure:
a) Initialization
This deals with the initial introduction of informa-
tion into the computing system and its preparation for
eventual execution. It includes bootstrapping from a
cold start, establishing the minimum state from which
the complete system structure can be created, the pro-
blems associated with loading and linking of programs
and data for execution, etc. This topic is not a tri-
vial one: a real-time MP/OS is a complex structure and
the problem of establishing it as a working entity
from scratch should be considered at the time its ini-
tial design is undertaken. Initialization will not be
discussed further.
b) Process State Controller
The basic element of computational work will be termed
a Process. Processes can exist in various states: ex-
ecution, readiness, stall or suspension. This function
of the OS controls the orderly progression of processes
between these states in response to various stimuli,
such as voluntary process state changes, I/O interrupts,
priority changes, interprocess communication, etc.
c) Interrupt Servicing
A real time, general purpose, central computer for a
space station will almost certainly be required to
handle system-originated external interrupts in addi-
tion to interruptions due to arithmetic traps and
other error conditions. This OS function implements
the desired responses to randomly occurring events of
this nature.
d) Timing and Synchronization
This function provides the basic mechanism for control-
ling the time dependent execution, and the synchroniza-
tion of parallel, concurrent processes in a real-time,
multiprogrammed environment.
e) Resource Management
This is the basic function of an operating system. The
resources required by a computational process are vari-
ous. First, there are the basic hardware elements:
the processors, memory modules, and interconnecting data
paths which must be available to allow the process to
run. Then there are the less tangible items such as
common programs and data over which conflict of access
by several concurrent processes is possible. Lastly,
there is external device availability: sensors, avio-
nics data buses, disks, tapes, etc. The resource mana-
gement function is usually divided into processor allo-
cation, memory allocation, compool and shared data
management, I/O and file management. It is the function
of resource allocation to ensure that each scheduled
process is granted a sufficient share of the available
resources to execute in a timely fashion without adverse
effect on other processes.
f) Configuration Control
In a fault tolerant computer, the current status and
the configuration of all elements of the computer must
be continuously monitored and controlled by the opera-
ting system.
g) Operator and User Interfaces
The OS must provide facilities to interface with the
operator and/or user. For a complex system this is not
a simple task, especially when a major mode of operation
is interactive usage, by the crew members in controlling
the progress of a mission.
h) Performance Monitoring
This is an often under-emphasized function of an opera-
ting system, but it is an especially important one in a
new or novel application such as a space station MP.
The more sophisticated a system is the greater is the
need to measure, evaluate and influence its performance.
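The process states named under function b) above (execution, readiness, stall, suspension) can be sketched as a small state machine. The four state names come from the text; the table of permitted transitions is an assumption made for illustration, since the report does not enumerate them here, and the names `State`, `ALLOWED`, and `Process` are hypothetical.

```python
from enum import Enum


class State(Enum):
    EXECUTION = "execution"
    READINESS = "readiness"
    STALL = "stall"
    SUSPENSION = "suspension"


# Assumed transition table (not taken from the report): which states a
# process may move to from each current state.
ALLOWED = {
    State.READINESS: {State.EXECUTION, State.SUSPENSION},
    State.EXECUTION: {State.READINESS, State.STALL, State.SUSPENSION},
    State.STALL: {State.READINESS, State.SUSPENSION},
    State.SUSPENSION: {State.READINESS},
}


class Process:
    """A unit of work whose state changes are checked against ALLOWED."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.state = State.READINESS

    def move_to(self, new_state: State) -> None:
        """Perform an orderly state change, rejecting illegal ones."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"{self.name}: {self.state.value} -> {new_state.value} not allowed"
            )
        self.state = new_state


if __name__ == "__main__":
    nav = Process("navigation")
    nav.move_to(State.EXECUTION)  # dispatched to a processor
    nav.move_to(State.STALL)      # e.g. waiting on an I/O interrupt
    nav.move_to(State.READINESS)  # the awaited stimulus arrived
    print(nav.name, nav.state.value)
```

In a multiprocessor, updates to this kind of state table would themselves need exclusive access, which is the concern taken up under Exclusive Sections.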
Some of these functions will be reviewed again in the
light of the following discussion of problems facing the multi-
processing operating system designer.
2.2 Problems of Multiprocessing
The multiprocessing environment does not pose any diffi-
culties that the designer of an operating system for a multipro-
grammed, single processor system has not also had to face and
overcome. The MP adds new facets to familiar problems, however,
by reason of the concurrent, rather than sequential, execution
of the multiple processes within the system. This requires that
greater care be taken to prevent damaging interaction between
processes at a point of commonality, especially with regard to
shared data. Measures taken to protect processes against each
other usually affect performance unfavorably. The maintenance
of performance near the theoretical limit is, in any case, more
difficult for a multiprocessor than for an equivalent simplex
computer.
An attractive feature of the multiprocessor is the pro-
spect of increased performance achieved by means other than ad-
vances in processor technology, i.e., n similar processors doing
the work of one n times as fast. In practice several factors
prevent this promise from being fulfilled. If we define "through-
put" as the integral over time of the rate of "useful" computa-
tion C, then it can be shown that:
$$\int C \, dt \;\geq\; n \int \left( \frac{C}{n} \right) dt$$
where n is the number of processors. C is a discontinuous func-
tion of time, and as n increases, it becomes increasingly diffi-
cult for C to remain non-zero for long periods of time. Compu-
tation lost whenever C falls to zero may not be made up in time,
and the right hand (multiprocessor) integral continuously loses
ground to the left hand (simplex computer) integral. The reasons
for this are enumerated below.
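As a rough illustration of why the multiprocessor integral falls behind, the following Python sketch compares the two cases tick by tick. It is a toy model under stated assumptions, not a result from this study: the `ready_tasks` profile is hypothetical, the simplex machine is assumed to complete n units of work in any tick where at least one task is runnable, and each of the n processors is assumed to complete one unit only if it has its own runnable task.

```python
from typing import Sequence, Tuple


def throughput(ready_tasks: Sequence[int], n: int) -> Tuple[int, int]:
    """Return (simplex_work, multiprocessor_work) over the whole profile.

    ready_tasks[t] is the number of independent runnable tasks at tick t.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    # One processor, n times as fast: full rate whenever anything is ready.
    simplex = sum(n for ready in ready_tasks if ready > 0)
    # n processors at unit rate: processors without a task sit idle.
    multi = sum(min(n, ready) for ready in ready_tasks)
    return simplex, multi


if __name__ == "__main__":
    profile = [3, 1, 0, 2, 3, 1]  # hypothetical runnable-task counts
    simplex, multi = throughput(profile, n=3)
    print("simplex:", simplex, "multiprocessor:", multi)
```

In this model the two totals are equal only when at least n tasks are runnable at every tick; any tick with fewer runnable tasks is work the multiprocessor never recovers.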
2.2.1 Parallelism
In order for all n processors to be kept usefully at
work, their load must be capable of being organized into n or
more tasks which can be executed in parallel, continuously and
simultaneously. The degree to which this can be done depends
on the parallelism inherent in the work load. Certain types
of computation exhibit natural parallelism, e.g., signal pro-
cessing, where the same operation is applied to multiple sets
of input data (promoting the design of so-called Single Instruc-
tion Multiple Data (SIMD) computers, for example the Goodyear
Associative Processor [1]). But, in general, parallelism must
be sought out, identified and utilized. It exists potentially
on several levels:
a) On the "job" level. In a general purpose computer fac-
ility, the submitted jobs are normally completely inde-
pendent of one another, even if they share resources.
b) Within a job, at the task level.
c) Within a task, most of the statements are independent
of one another.
d) Within a single statement some computations can be done
in parallel.
Parallelism of types c) and d) is not visible to the
operating system, because the basic unit of OS is the process
(or task). For the type of application being considered for the
SUMC multiprocessor, it is not likely that the work load will
totally resemble that of a ground based general purpose facility,
although it will exhibit more of its aspects than will a simple
flight control computer. Parallelism of type a) will probably
not be present in sufficient proportion to provide the sole guar-
antee of full employment for two or more processors. It becomes
necessary to deal, additionally, with parallelism at the task
level. The trouble is that problem solving with a computer is,
in general, a serial process: programmers do not naturally think
in terms of concurrent parallel processes in arriving at their
solutions, unless such a structure is inherent in the problem.
A real time control function may consist of several, more or less
independent, activities going on in parallel, e.g., system moni-
toring, navigation, display processing, and vehicle control.
Even so, it is anticipated that there will not be sufficient
functions of this type to keep two or more processors fully oc-
cupied, all the time.
It is necessary, therefore, to uncover task parallelism
that may not be apparent, and even to create parallelism if none
exists. This imposes a constraint on the programmer, which must
be considered deleterious because it is not natural. So it is
necessary to assist the programmer with a programming language
and a compatible operating system that contain features, attrac-
tive to use, that encourage the creation of multiple, independent
processes. The use of a block-structured language encourages
programs to be written as collections of small, closed subroutines.
ALGOL, PL/I and HAL are among the languages that possess this
property. In addition to structure, a language can provide a
convenient and natural way to interface with the executive by
recognizing tasks as syntactical entities. The multi-tasking
features of PL/I and HAL encourage the programmer to think as he
programs in terms of processes which are amenable to scheduling.
5:he muitiprocessor operating system must support the
requirements of parallel tasking by providing adequate communi-
cation and synchronization primitives, and by protecting shared
data against conflicting concurrent accesses. These requirements
are discussed in more detail later.
2.2.2 Exclusive Sections
In a general purpose multiprocessor certain operations
are concerned with the manipulation of unique system data such
as, for example, information maintained by the Process State Con-
troller, which contains the current dynamic state of all proces-
ses. Execution of the Process State Controller is an exclusive
operation: only one process may perform it at a time. In a
simplex computer this is achieved trivially: it is only neces-
sary to inhibit interruption of the single processor by external
happenings to assure exclusive execution of the Process State
Controller. A multiprocessor requires a more elaborate mechanism
to prevent the simultaneous execution of such critical functions
by two or more processors. Such mechanisms cause the conflicting
processes to become serialized in time, each being admitted to
the critical section through interlocking turn-stiles (a general-
ized mechanism is described later). The net effect is that when-
ever two or more processes wish to enter an exclusive section,
only one may do so and continue executing: the other(s) must
wait. If the exclusive section is designed to inhibit the alter-
nate assignment of the processor (e.g., if it is the Process
State Controller), then throughput temporarily falls until the
other processor is through with the exclusive section. This loss
of throughput cannot be made up again. Note that in a batch en-
vironment conflicts of this type are rare, but in a real time
system of short tasks, with frequent process state changes, the
probability of conflict may become significant. This precipita-
tes the following quandary: to encourage parallelism a multipro-
cessor program should consist of many concurrent tasks, but to
avoid critical section conflict it should be organized into as
large a serially-executable piece as possible!
2.2.3 Shared Data
There is a problem with shared data, aside from the
need to protect it from simultaneous modification. It is asso-
ciated with the creation of copies of shared data. In many com-
puter designs, performance improvements have been achieved by
localizing lengthy sequences of operations within the fast logic
of the processor, rather than executing out of main memory. (The
cache memories of the IBM 370 series [2] and the task memory of
the Navy's AADC [All Applications Digital Computer] [3] are ex-
amples of localized processing.) The problem arises because data
is maintained local to the processor. If the data is shared with
other processes, changes in the original or any of the copies
must be reflected in all. Some means must be found either
a) to allow one process access to another's local storage,
b) to update all copies of shared data at the same time or
c) to prevent old values from being used by other proces-
ses until updating is performed.
It should be pointed out that this phenomenon is en-
countered whenever copies of shared data are created in any sys-
tem: in the Burroughs B6700 series the problem arises through
its use of descriptors. These are maintained in the stacks of
individual processes. Whenever a descriptor needs to be changed
(it is a common occurrence in a virtual memory system for a
descriptored item to be transferred to back-up storage: the
address field of its descriptor must be modified to reflect this
change of whereabouts), all processors in the B6700 are stopped,
and all process stacks in main memory are searched for copies
of the particular descriptor. The B6700 was not designed as a
real time controller so the ensuing loss of processing time was
not considered objectionable by the designers. It is a different
matter for a space station computer, however. The Multiprocessor
design developed by Intermetrics [4] employs a unique approach
to a similar problem. The copy of a descriptor may be maintained
in an associative memory local to a processor. This avoids acces-
sing the descriptor through three levels of indirection
each involving main memory references. Changes in the descriptor
are very quickly signalled by the provision of a specific machine
instruction which cancels the appropriate entry in the associa-
tive memory. The Intermetrics multiprocessor avoids local copies
of the data itself, and thereby foregoes the potential perfor-
mance advantages of local buffer or cache-type processing.
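The descriptor-copy scheme just described can be sketched roughly as follows. This is only an illustration in Python, not the Intermetrics design itself: the names DescriptorTable, Processor, lookup, cancel and move_item are hypothetical, the associative memory is modelled as a dictionary, and it is assumed that a cancel is issued to every processor whenever a descriptor changes.

```python
class DescriptorTable:
    """Master descriptors held in main memory (hypothetical model)."""

    def __init__(self):
        self.entries = {}            # name -> address of the described item
        self.main_memory_reads = 0   # counts references to main memory

    def read(self, name):
        self.main_memory_reads += 1
        return self.entries[name]


class Processor:
    """A processor with a local associative memory of descriptor copies."""

    def __init__(self, table):
        self.table = table
        self.assoc = {}              # local copies: name -> address

    def lookup(self, name):
        # Use the local copy if present; otherwise fetch it from main memory.
        if name not in self.assoc:
            self.assoc[name] = self.table.read(name)
        return self.assoc[name]

    def cancel(self, name):
        # Stands in for the machine instruction that cancels one entry.
        self.assoc.pop(name, None)


def move_item(table, processors, name, new_address):
    """Change a descriptor, then cancel every local copy of it (assumed policy)."""
    table.entries[name] = new_address
    for p in processors:
        p.cancel(name)


table = DescriptorTable()
table.entries["X"] = 100
cpus = [Processor(table), Processor(table)]
first = [cpu.lookup("X") for cpu in cpus]    # both fetch from main memory
second = [cpu.lookup("X") for cpu in cpus]   # both served from local copies
move_item(table, cpus, "X", 500)             # item moves; copies cancelled
third = [cpu.lookup("X") for cpu in cpus]    # both re-fetch the new address
```

No processor is stopped and no stack is searched in this sketch; the cost of a change is one cancel per processor followed by one main memory reference on the next use.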
2.2.4 Conflict Over System Resources
The most critical resource is main memory. As the number
of processors increases, the possibility of conflict between
them over the use of memory increases. As in the case of shared
data, a resolution of conflict results in one or more processors
losing processing time, and the right hand integral of the ex-
pression for throughput given earlier again loses to the left
hand. The device of interleaving the modules of a memory system
can be used to minimize the delays incurred by conflict, but it
exacts a cost in added hardware complexity. Its effect is to
randomize memory usage and thus to obtain stationary behavior.
Another approach is to partition memory among the various pro-
cesses so that processors tend to execute out of physically
[...]tic. This technique implies a sophistication of the operating
system, a well-known job stream, and a memory system of sufficient
modularity.
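The benefit of interleaving can be illustrated with a simple probabilistic model. The model is an assumption made here for illustration and is not taken from the report: in each memory cycle every processor issues one request to a module chosen uniformly and independently, and each module serves one request per cycle. The function names are hypothetical.

```python
def expected_busy_modules(processors, modules):
    """Expected number of distinct modules addressed in one memory cycle,
    assuming uniform, independent module selection by each processor."""
    return modules * (1.0 - (1.0 - 1.0 / modules) ** processors)


def conflict_loss(processors, modules):
    """Expected fraction of requests delayed by conflict in one cycle."""
    return 1.0 - expected_busy_modules(processors, modules) / processors


# Two processors: the loss shrinks as the number of interleaved modules grows.
for m in (1, 2, 4, 8, 16):
    print(m, conflict_loss(2, m))
```

Under these assumptions the loss for two processors is 1/(2m) for m modules, so doubling the number of modules halves the expected delay, at the hardware cost noted above.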
The network interconnecting processors, memories and
I/O units is a more critical element in a multiprocessor than
in a simplex system. With more than one processor requesting
memory at a time, this bus itself becomes a source of conflict.
It would seem that a technique that lowers the frequency of use
of the bus would lessen the probability of such conflict. For
example, the use of a cache memory, by encouraging local execu-
tion, would appear to make bus use less frequent. However,
analysis shows that the probability of bus conflict actually
increases with increasing speed of the cache, thereby defeating
any performance advantage.
In summary, techniques devised to minimize conflict in
a multiprocessor are susceptible to the following drawbacks, any
or all of which combine to prevent the multiprocessor throughput
from equalling that of the equivalent simplex processor:
a) Increased hardware complexity and cost;
b) Increased operating system sophistication, usually accompanied
by increased overhead in space and time;
c) Reduced throughput due to delays introduced to resolve
conflict.
The more processors in the system, the more marked is
this effect. Only in a particular application, for which the
characteristics of the work load can be anticipated, is it pos-
sible to deduce the number of processors required to achieve
a given performance cost effectively. In the absence of such
information about the environment of the multiprocessor, this
limit is very difficult to determine. As a result, almost all
practical designs of multiprocessors to date have been limited
to the degenerate case of two processors. Some designs have
even dedicated functions or resources to each processor in order
to avoid some of the above problems, resulting in configurations
of dual computers rather than dual processors.
2.2.5 Overhead
The preceding sections have cited several factors that
contribute to the complexity of functions that a multiprocessor
operating system is required to perform. Each factor contribu-
tes to the overhead of computational time and memory space con-
sumed by the operating system. Matters are further aggravated
because the many activities going on simultaneously in a mul-
tiprocessing environment take on the characteristics of a que-
ueing problem: their deleterious effects are in general worse
than additive, i.e., the loss in real throughput is a non-linear
function of the number of contributing overhead mechanisms.
But to end this section on a positive note, it should
be realized that this depressing parade of multiprocessing dif-
ficulties has a corollary: small efforts to limit the damaging
effects of each of the mechanisms discussed in this section can
yield dramatic improvements in throughput because of the expon-
ential nature of their interaction.
2.3 Exclusion and Synchronization
Any multiprogrammed system requires operating system
primitives for the communication and mutual protection of the
concurrent processes. In a multiprocessor, these activities
can be actually time-concurrent and these primitives must be
implemented in a combination of hardware and software. The
problem of protection against unwanted interactions will be
reviewed first, followed by a discussion of synchronization.
2.3.1 Exclusion Primitives
In a simplex computer a basic exclusive operation may
be implemented in software, but a multiprocessor needs hardware
assistance for such an operation, because of the true time-
concurrency of execution of two or more processes. The hardware
must be capable of reading the value of a variable, and then
rewriting the variable with a new value in one uninterruptible
operation. An example of such an instruction is the TS (Test
and Set) of the IBM 360 series, which writes all ones into a
specified byte and sets a condition code with the original
contents. The Burroughs B6700 RDLK (Read with Lock) instruction,
which stores the contents of the B register into the location
whose address is contained in the A register, but leaves the
previous contents of the location in the B register, is closer to
a generalized non-divisible read and write operation.
The actions of a set of general operating system procedures
designed to provide the exclusion primitive are as follows:
ENTER      Check for occupancy of procedure. Set Lock. If
           locked, enter wait queue.
  .
  .
(critical section)
EXIT       Check for occupancy of procedure. Remove self from
           wait queue. Inform executive to wake next in queue
           if any.
How these actions are implemented using a fictitious
non-divisible read and write instruction NDRW is illustrated
in Figure 1. Let the execution of NDRW exchange the contents
of the operand, MUEX, with the contents of the accumulator.
MUEX may contain the following values:
    0           No process executing critical section
                (i.e., section is "free")
    1           Critical section is being executed by
                one process
    2,3,...n    Critical section is being executed by one
                process, and 1,2,...n-1 are waiting to
                gain access. Requires a MUEX queue struc-
                ture to be maintained by OS.
    negative    Procedures ENTER or EXIT are being executed
                by a process. (The OS primitive itself
                must be protected against multiple use.)
[Figure 1: flowchart of the ENTER and EXIT procedures implemented with NDRW]
The actions surrounded by dotted lines indicate the
execution of the Process State Controller function. Note that
the final updating of MUEX in cases where a process is to be
placed in the wait state, or readied to execute the critical sec-
tion, must be done within the Process State Controller to prevent
interruption of the sequence.
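One way the ENTER and EXIT procedures could use NDRW with the MUEX value convention given above is sketched below. This is a single-threaded Python illustration and an assumption about the details, not a transcription of Figure 1: the class Muex and the functions enter, exit_section and _seize are hypothetical names, the exchange is modelled by an ordinary method, and the plain assignments to muex.value stand in for the final update that the text places inside the Process State Controller.

```python
from collections import deque


class Muex:
    """Exclusion variable: 0 free, 1 occupied, n > 1 occupied with n-1
    waiters, negative while ENTER or EXIT is itself being executed."""

    def __init__(self):
        self.value = 0
        self.queue = deque()   # waiting processes, maintained by the OS

    def ndrw(self, accumulator):
        """Exchange operand and accumulator. On real hardware this would be
        one uninterruptible instruction; here it is only modelled."""
        old, self.value = self.value, accumulator
        return old


def _seize(muex):
    """Lock the primitive itself and return the previous occupancy count."""
    while True:
        old = muex.ndrw(-1)
        if old >= 0:
            return old
        # Negative: another processor is inside ENTER or EXIT, so retry.


def enter(muex, process):
    """Return True if the process may proceed, False if it was queued."""
    count = _seize(muex)
    if count == 0:
        muex.value = 1             # section is now occupied
        return True
    muex.queue.append(process)     # section busy: join the wait queue
    muex.value = count + 1
    return False


def exit_section(muex):
    """Leave the section and return the process to wake, or None."""
    count = _seize(muex)
    nxt = muex.queue.popleft() if muex.queue else None
    muex.value = count - 1         # becomes 0 when nobody was waiting
    return nxt


m = Muex()
a = enter(m, "A")          # A proceeds; MUEX becomes 1
b = enter(m, "B")          # B is queued; MUEX becomes 2
woken = exit_section(m)    # A leaves; B is woken; MUEX becomes 1
last = exit_section(m)     # B leaves; MUEX returns to 0
```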
This exclusion mechanism must be expanded if it is
required to accommodate the comprehensive Update Block capability,
for controlling the accessing of common data, provided in the
HAL language [5]. It is not always necessary to prevent all types
of access to shared variables: a shared variable can be read,
as long as it is not actually being changed. The ability to
differentiate between types of access reduces the time for which
a requesting process must be made to wait, with consequent im-
provement in throughput. The HAL Update Block is in effect a
modified form of critical section. Every variable that is
addressed within an Update Block has associated with it a "lock-
type" attribute. The lock can assume the following states:
a) Free: Unlocked
b) Read: Accessed for reading only
c) Copy: Accessed for modification
d) Write: Being modified
A variable that is to be modified is first copied, and
all intermediate computations are performed on the copy. This
is the meaning of the "Copy" state. Final values are written
from the copy to the actual variable after the state of the
lock has been raised to "Write". The testing and setting of
the states of locked variables requires the use of the NDRW in-
struction. A requesting process is allowed into the Update
Block only if the type of access requested is compatible with
the current state of all locks within the block. For example,
a request to read the variables is allowed if the current state
of all locks is "Free", "Read", or "Copy", but is not allowed
if any are in "Write".
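The admission test for an Update Block can be sketched as a compatibility check. Only the rule for read requests is stated in the text; the rule shown for copy requests (at most one copier per variable) is an assumption added here to make the sketch complete, and the names COMPATIBLE and may_enter are hypothetical.

```python
FREE, READ, COPY, WRITE = "Free", "Read", "Copy", "Write"

# Lock states with which each type of request can coexist.
COMPATIBLE = {
    READ: {FREE, READ, COPY},   # as stated in the text
    COPY: {FREE, READ},         # assumption: only one copier at a time
}


def may_enter(request, lock_states):
    """True if the request is compatible with every lock in the block."""
    allowed = COMPATIBLE[request]
    return all(state in allowed for state in lock_states)


reader_ok = may_enter(READ, [FREE, READ, COPY])   # no variable is in "Write"
reader_blocked = may_enter(READ, [FREE, WRITE])   # one variable is in "Write"
```

In a real system the lock states would be tested and set with the NDRW instruction so that the check and the change of state cannot be separated.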
An operating system mechanism to implement Update Blocks
involves the maintenance of linked queues (see Figure 2). Every
locked variable has associated with it a queue of requesting
processes, each identified with its individual access type. All
queue elements associated with a given process are also linked,
to facilitate the response to changes of state of the processes.
The Intermetrics design of a multiprocessing operating
system [4] defined a pair of generalized primitives, ACQUIRE
and RELEASE, of the form: ACQUIRE (Mode, Category, Name, Access)
where each of the terms has the following meaning:
a) Mode:     The calling process is placed in the Wait
             state if access is not immediately pos-
             sible, or an immediate return may be spe-
             cified with an indication of why access
             could not be allowed.
b) Category: Data, code or device. The ACQUIRE primi-
             tive is applicable to the protection of
             shared data, the implementation of exclu-
             sive sections, or the use of a shared de-
             vice such as a printer.
c) Name:     Identifies the item in the category, e.g.,
             the name(s) of the specified shared variables.
d) Access:   Shared, update or exclusive access request.
             These are analogous to HAL's Read, Copy,
             and Write lock type states.
It is possible to define any type of required exclusive
operation in a given system with these two primitives.
2.3.2 Synchronization
In order to provide for communication between parallel
processes of a multi-tasked environment it is convenient to in-
voke the concept of an "event". An event is a variable whose
state reflects the occurrence of an activity within the system,
e.g., the completion of a lengthy computation or the arrival in
memory of a previously requested item of I/O. The process await-
ing the activity is associated with the event. The "signalling"
of the event results in the process being made ready to continue.
For illustration, let Tasks A, B and C be three independent tasks,
all scheduled during the execution of some master Program. Sup-
pose it is appropriate to schedule Task C only when certain com-
putations have been completed by Tasks A and B. Tasks A and B
may be executing on separate processors, and thus be unaware of
one another. In which case, they cannot easily cooperate in
the scheduling of Task C. However, if each were to signal
an event on completion, e.g., EVENT_A and EVENT_B respectively,
then the event mechanism can provide the synchronization that
causes Task C to be scheduled as soon as both EVENT_A and EVENT_B
have been signalled.
[Figure 2: linked queues of requesting processes for Update Block locks]
The language multi-tasking features that were advocated
earlier to help keep a multiprocessor busy are supported in PL/I,
ALGOL and HAL by event mechanisms of varying sophistication.
The Intermetrics multiprocessor design [4] specified a very com-
prehensive event structure which enabled complex logical expres-
sions to be evaluated as event signals. In this design events
are controlled by primitives of the form
    SET   }
          } (E, n, E1, E2, ..., Em)
    RESET }
which is interpreted as "set (reset) event E when n of the events
in the list E1 through Em are signalled." If n = m, this ex-
pression is the boolean "and" of all listed events, and if n = 1
it is the "or". The primitives also have a simpler form
    SET   }
          } (E)
    RESET }
Response to the signalling of events is basically of two forms:
    WAIT(n, E1, E2, ..., Em)
and
    ON(n, E1, E2, ..., Em) <code>
In the first, as the WAIT is executed the process is placed in
the Wait state until the event expression becomes true. The
second statement causes an interruption of the process as soon
as the expression becomes true, to execute the procedure spe-
cified in the "code".
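The WAIT form can be illustrated with a small model of an event table. This is a sketch under assumptions, not the structure specified in [4]: "n of the events" is read as "at least n", a single list stands in for the multiply-linked queues, and the names EventTable, wait, signal and satisfied are hypothetical.

```python
class EventTable:
    """Minimal model of n-of-m event expressions and waiting processes."""

    def __init__(self):
        self.signalled = set()   # events that have been signalled
        self.waiters = []        # (process, n, events) still waiting
        self.ready = []          # processes made ready to continue

    def satisfied(self, n, events):
        # Assumption: the expression is true when at least n are signalled.
        return sum(1 for e in events if e in self.signalled) >= n

    def wait(self, process, n, events):
        if self.satisfied(n, events):
            self.ready.append(process)
        else:
            self.waiters.append((process, n, tuple(events)))

    def signal(self, event):
        self.signalled.add(event)
        still_waiting = []
        for process, n, events in self.waiters:
            if self.satisfied(n, events):
                self.ready.append(process)
            else:
                still_waiting.append((process, n, events))
        self.waiters = still_waiting


events = EventTable()
events.wait("Task C", 2, ["EVENT_A", "EVENT_B"])   # n = m: boolean "and"
events.signal("EVENT_A")
after_a = list(events.ready)    # Task C is still waiting
events.signal("EVENT_B")
after_b = list(events.ready)    # Task C has been made ready
```

With n = 1 the same call would make Task C ready after the first signal, which corresponds to the "or" case.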
The implementation of an event structure involves mul-
tiply-linked queues of event elements which allow the associa-
tions between the processes involved in declaring, signalling
and responding to events to be established, executed, and re-
moved in a dynamic fashion. It is perhaps superfluous to point
out that such a mechanism in a multiprocessor environment re-
quires processors to be able to interrupt one another. This
ability is provided, for example, in the Burroughs B6700 by
the "HEYU", and in the RCA 215 by the "INTERRUPT CPU" instruc-
tions.
2.4 Scheduling
The scheduling function of the operating system ensures
that processes are prepared for timely execution with due regard
to their relative importance. It involves some of the functions
of Process State Control and Resource Allocation defined earlier.
This section will discuss briefly the following aspects of this
function:
a) Ensuring that computation time and space are properly
apportioned among the processes according to predeter-
mined needs, while maintaining an optimal balance be-
tween the conflicting requirements of throughput, effi-
ciency, and response. Throughput is defined as the
amount of useful work accomplished by the total multi-
processor system, efficiency is the degree of utiliza-
tion of the basic components of the system (e.g., pro-
cessors, memory modules, I/O devices), and response is
the ability to react to a given stimulus.
b) Ensuring that competition between processes in their
demands for resources does not produce catastrophic con-
ditions, such as deadlock or thrashing.
c) Preventing the resulting computational overhead, espe-
cially of time in a real-time control system, but also
of space, from becoming excessive (the definition of
"excessive" is not attempted here!).
2.4.1 Space and Time Allocation
The computational activities in a space station multi-
processor are expected to fall into the following categories:
Category      Characteristic     Criticality    Examples
              Response Time
              Range
Batch         10 secs-mins       non-critical   Lengthy computations. Off line
                                                experiment data processing.
Interactive   0.1 sec-10 secs    non-critical   Crew operational sequences.
                                                Time sharing by scientific
                                                personnel.
Real Time     1 ms-100 ms        non-critical   Control of scientific experi-
                                                ments. Operational equipment
                                                status monitoring.
Real Time     1 ms-100 ms        critical       Operational equipment servicing:
                                                strapdown IMU. Closed loop
                                                control: autopilots, etc.
Processing tasks in the batch category can, to an extent,
ignore the constraint of time. The allocation of memory space or
other system resources such as common data, input file, I/O de-
vices, processors, can be considered with more freedom. The
presence of this category in the total work load can provide a
measure of global optimization in the use of system resources to
maximize efficiency.
The time-critical real-time tasks can not make such
compromises. Resources must be ready when needed. The need
is often (but not always) randomly determined. Unless it is
composed of highly repetitive tasks, the real-time component
of the work load prevents high values of throughput and effi-
ciency from being attained.
A work load consisting of components from each category
must be so arranged and presented to the computer system that
all tasks can get sufficient cuts at the system's processing re-
sources. Obviously, no amount of intelligence built into an
operating system will supply enough computational resources to
a work load whose demands exceed the capability of the machine.
An operating system can be designed to contain features and to
operate in a way that matches the characteristics of the work
load. But it remains the responsibility of the user of the sys-
tem to assign a given work load to the machine in such a way
that it does not overload the system.
Task scheduling can be approached from two extremes:
a) Synchronous, or time slot scheduling. Each task is
allotted a different, but fixed, interval of time for
execution, which is available at multiples of fixed
minor cycle intervals.
b) Demand Scheduling. Tasks are allocated processors
and other resources on demand, at execution time, ac-
cording to the needs and importance of the task and the
availability of the resources. Tasks are differentia-
ted in importance by a priority value which stays as
initially assigned, or changes as a function of time or
the tasks' status.
The advantages of the synchronous approach are:
a) Minimal overhead, since scheduling is pre-determined;
b) The scheduler is simpler, being essentially table driven;
c) The fixed schedule of task execution eliminates problems
associated with code and data sharing, and does not re-
quire re-entrant code;
d) The load may be evenly distributed over the available
time;
e) The deterministic behavior makes system verification
easier.
The difficulties associated with it are:
a) It is difficult to structure programs so that they may
be time-sliced;
b) Each time slice must be sufficient to accommodate the
worst case, so on the average will be under-utilized;
c) It is difficult to accommodate response to random events
such as crew inputs. Response to system failures is
especially difficult, unless recovery from all classes
of failures is pre-scheduled.
d) The structure is inflexible to change.
These disadvantages are all overcome by the demand scheduling
approach, which, however, suffers from an increased degree of
difficulty because of its greater complexity, and because it
is more difficult to verify.
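The contrast between the two extremes can be made concrete with a short sketch. It is illustrative only: the task names, periods and priorities are invented, a lower priority number is assumed to mean a more important task, and synchronous_schedule and DemandScheduler are hypothetical names.

```python
import heapq


def synchronous_schedule(table, minor_cycles):
    """Table-driven time slot scheduling.
    table maps task -> (period, offset), both in minor cycles."""
    schedule = []
    for cycle in range(minor_cycles):
        due = [task for task, (period, offset) in sorted(table.items())
               if cycle % period == offset]
        schedule.append(due)
    return schedule


class DemandScheduler:
    """Tasks are dispatched on demand in order of priority."""

    def __init__(self):
        self._heap = []
        self._count = 0   # preserves arrival order among equal priorities

    def request(self, task, priority):
        # Assumption: a lower number means a more important task.
        heapq.heappush(self._heap, (priority, self._count, task))
        self._count += 1

    def dispatch(self):
        return heapq.heappop(self._heap)[2] if self._heap else None


slots = synchronous_schedule(
    {"autopilot": (1, 0), "navigation": (2, 0), "display": (4, 1)}, 4)

demand = DemandScheduler()
demand.request("telemetry", 5)
demand.request("failure recovery", 1)
demand.request("crew input", 3)
order = [demand.dispatch() for _ in range(3)]
```

The synchronous table is fixed before execution, which is why its overhead is minimal and its behavior deterministic. The demand scheduler accepts the unplanned "failure recovery" request at once, at the cost of maintaining a priority queue at execution time.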
In a functional design of an executive for the Space
Shuttle central computer, Intermetrics has proposed a combined
synchronous and demand scheduled approach [6]. The repetitive,
time-critical functions which can be implemented in short,
complete sections of code are executed by a synchronous "fore-
ground" scheduler driven by timer interrupt, at 40 ms intervals.
The majority of the remaining tasks are scheduled on demand as
a "background" activity according to pre-assigned priority
values. Communication between foreground and background is by
an event mechanism, in essence similar to that described in sec-
tion 2.3.2.
2.4.2 Deadlock Prevention
OS/360 has three resources to allocate to each job/step.
These are core storage, data sets and peripheral devices. The
allocation algorithm is summarized in Figure 3. Note that all
data sets for the entire job are allocated at job initialization
time and are bound for the duration of the job. In addition,
all devices are allocated at step initialization time and are
bound for the duration of the step. This approach may be costly
since some of the resources allocated to a task may remain un-
used for long periods.
Alternatively, resources may be allocated dynamically,
i.e., while the process is running. Unfortunately, now dead-
lock prevention becomes a more difficult problem. However,
some practical solutions have been suggested [7], although a
time overhead must be paid if they are implemented.
The suggested methods involve keeping track of the state
of the system by means of state graphs or matrices. When a re-
source is requested by an executing process, the availability
of the resource is checked. If it is presently unavailable, the
algorithm must determine if it is safe to put the requesting
task in the wait state. To determine this, it checks the state
matrices of the system as they would be if the request were
enqueued for the resource. When a safe condition results, the
request is enqueued, and the task is placed in the wait state.
On the other hand, if an unsafe condition results, the request
must be denied and the task so notified. The task can then de-
cide if it wishes to cease execution or if it can proceed with-
out the resource. (Some subtle problems to be aware of in
implementing such an algorithm have been overlooked by several
authors and are discussed by Holt [8].)
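The request-time check can be sketched for a deliberately simplified case. The assumptions here are not from the report or from [7]: every resource has a single unit, a process waits for at most one resource, and "unsafe" is taken to mean that enqueuing the request would close a cycle of processes waiting on one another. The class Allocator and its methods are hypothetical.

```python
class Allocator:
    """Single-unit resources with a wait-for check before enqueuing."""

    def __init__(self):
        self.holder = {}    # resource -> process currently holding it
        self.waiting = {}   # process -> resource it is queued for

    def _would_deadlock(self, process, resource):
        # Follow the chain holder -> awaited resource -> holder ...
        seen = set()
        current = self.holder.get(resource)
        while current is not None and current not in seen:
            if current == process:
                return True          # the chain leads back to the requester
            seen.add(current)
            blocked_on = self.waiting.get(current)
            current = (self.holder.get(blocked_on)
                       if blocked_on is not None else None)
        return False

    def request(self, process, resource):
        if resource not in self.holder:
            self.holder[resource] = process
            return "granted"
        if self._would_deadlock(process, resource):
            return "denied"          # unsafe: the task is so notified
        self.waiting[process] = resource
        return "waiting"             # safe: request enqueued, task waits


alloc = Allocator()
r1 = alloc.request("A", "tape")      # free, so granted
r2 = alloc.request("B", "printer")   # free, so granted
r3 = alloc.request("A", "printer")   # held by B, safe to wait
r4 = alloc.request("B", "tape")      # would close the cycle A -> B -> A
```

The walk along the chain is the time overhead discussed in the text: it is paid on every request for an unavailable resource and grows with the number of waiting tasks.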
While it is easy to see that dynamic allocation is
most economical in the amount of time system resources are un-
available, some time overhead must be paid each time a process
requests a resource. The OS must check the state matrices to
determine if safe states will result. This process can be
lengthy for a system with many resources and many ready tasks.
One must remember here that the overhead is really that time
used for dynamic allocation over and above that which would other-
wise be spent for allocation at job and step initialization times
as described above.
[FIG. 3: OS/360 Resource Allocation Algorithm]
Unfortunately, no analytic studies or simulations of
these algorithms have been done to evaluate overhead costs. How-
ever, with careful thought given to the implementation of a
dynamic algorithm, its overhead can be held to a minimum. In
any case, the advantages of dynamic allocation would seem to
overshadow any time overhead that results.
2.5 Memory Management
Management of the use of memory is potentially the
most critical activity of an operating system. It is very de-
pendent on:
a) The structure and characteristic behavior of the application
software. If the work load is well known and
dynamically predictable, especially with regard to its
memory requirements, allocation of space can be pre-
determined, by pre-planned overlays for example.
b) The system architecture. If sufficient operating memory
is provided to accommodate all programs at all times, dy-
namic allocation problems are eliminated. If, however,
a virtual memory design is adopted for its potential
simplification of programming and its cost effectivity,
the operating system becomes intimately involved in
creating and allocating memory space, and its detailed
design is further affected by the technique adopted for
addressing the virtual memory system.
c) Memory technology. The architecture of a virtual memory
system and the functions of its operating system are
significantly different for secondary storage with moving
head disks, than for solid state block-oriented, random
access devices such as the experimental magnetic bubble
domain memory.
Although memory management can assume a critical role in
determining operating system size and efficiency, its problems
cannot be addressed in detail in the absence of a memory hier-
archy definition. The following review of methods of operating
memory utilization is presented to underscore some of the factors
involved in providing increasing levels of operating memory uti-
lization by multiplexing.
2.5.1 Operating Memory Multiplexing
The following examples describe practical applications
of a number of techniques for increasing the utilization of op-
erating memory.
2.5.1.1 Non-multiplexed Memory: In a non-multiplexed system
the process of "assembly" of the program serves both to esta-
blish the mapping between names found in "subroutines" (which
are simply separately maintained units of program code), and
the mapping between names and physical locations in memory.
At the conclusion of the assembly, the mapping information is
completely distributed, and is saved and accessible only as a
diagnostic aid, for the computer simulator, for example. Most
flight control computers are of this design, usually because of
their modest total memory requirements, typically 8K to 32K
words.
2.5.1.2 Partitioned Memory: A simple form of memory multi-
plexing is used when the physical memory is large enough to sup-
port the requirements of more than a single program at a time.
The OS/360 MFT and MVT systems implement fixed and variable par-
titions respectively. The normal objective of concurrently-
loaded programs is to provide more efficient use of the proces-
sor by increasing the chances that some program can use the
CPU when another is waiting for completion of I/O operations.
As in sequential execution, the mapping between names
and locations is applied in all places at the time of loading,
and the map is of no further use to the execution of the program.
To further increase processor efficiency, a high-speed
secondary storage device may be used for "core-swapping". This
involves writing the contents of a partition out to the device
before its execution has been completed in order to make room
to bring in some other program ready to run. Because the name-
location mapping is not dynamically applied, the information
must be returned to its original location when its execution is
to be resumed.
2.5.1.3 Partitioned Memory with Relocation Registers: Under the
above mechanization, the application of the name-location map
takes place at one time, but over many spatial places. This has
the advantage of getting the mapping finished; however, it has
the disadvantage that the mapping is not readily reversed or
modified. Several systems (e.g., PDP-10, Univac 1108) use an
alternate scheme which re-applies the mapping each time. This
is achieved by providing one or more relocation registers, whose
function is transparent to the software, which supply offset
values to be combined with logical or virtual addresses generated
during the program's execution. A disadvantage of this approach
is that it requires additional hardware to perform the combining
as part of instruction execution. However, it has the valuable
characteristic that the mapping remains available for modifica-
tion, so that program and data sections may be relocated in the
operating memory and only the relocation values need to be
changed in the process. Thus, storage in use can be compacted
to collect available space into one contiguous piece when neces-
sary to find room to load an additional program.
As in the partitioned memory scheme, "core-swapping"
may be used for additional multiplexing. However, the use of
the relocation registers makes it possible to return the infor-
mation to any convenient location, rather than the precise place
from which it was written.
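The relocation-register mechanism can be sketched in a few lines. This is a minimal illustration, not a description of the PDP-10 or Univac 1108 hardware: the names `RelocationRegister`, `translate` and `relocate` are hypothetical, memory is modelled as a Python list, and the bound check on `length` is an added assumption that the text does not mention.

```python
from dataclasses import dataclass


@dataclass
class RelocationRegister:
    base: int    # operating-memory address at which the program currently starts
    length: int  # size of the program's logical space (bound check is an assumption)

    def translate(self, logical_address: int) -> int:
        """Combine the offset value with a logical address on every reference."""
        if not 0 <= logical_address < self.length:
            raise ValueError("logical address outside the program's space")
        return self.base + logical_address


def relocate(memory: list, reg: RelocationRegister, new_base: int) -> None:
    """Move the program's words and change only the relocation value."""
    words = memory[reg.base:reg.base + reg.length]
    memory[reg.base:reg.base + reg.length] = [None] * reg.length
    memory[new_base:new_base + reg.length] = words
    reg.base = new_base


memory = [None] * 16
memory[8:11] = ["LOAD", "ADD", "STORE"]
reg = RelocationRegister(base=8, length=3)
before = memory[reg.translate(1)]   # logical address 1 while loaded at 8
relocate(memory, reg, new_base=2)   # compaction moves the program down
after = memory[reg.translate(1)]    # same logical address, new location
```

The program's own addresses never change; only `reg.base` does, which is why a swapped-out program can be returned to any convenient location.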
2.5.1.4 Paging: An alternative to the use of relocation registers
is to divide the program and data space, linearly arranged,
into a series of "pages" of a fixed size, ordinarily a power of
2 (e.g., XDS Sigma 7, CDC 3800). In address formation, a group
of bits from the logical address is used to select a page-location
word from an array called a page-table; this word contains
the memory-address of the page if it is currently there.
Otherwise, an indication of the absence of the page is provided,
along with the secondary-storage location at which the page may
be found. The physical storage space is thus divided into fixed-
size page frames, and the mapping between names and physical
location is dynamically applied. A strong advantage of this
approach is that logically contiguous space need not be physically
contiguous, nor need it even all be present. Relaxing the
requirement that every page be present lets the system choose
which pages occupy storage space by implementing some measurement
of page reference behavior (with hardware help). Pages
appearing to be less needed may be overlaid with more lively ones.
Because all page frames are the same size, space mana-
gement is simple, and requires only modest overhead at execution
time. On the other hand, the page boundaries fall at arbitrary
locations in code or data, rather than at logical divisions. The
average usefulness of words in a page is therefore reduced, since
a logical entity may occupy only a small part of a page, or cross
a page boundary.
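The page-table lookup described in this section can be sketched as follows. The 512-word page size, the `(present, location)` entry layout, the use of a frame number rather than a full memory address, and the `PageFault` exception are all assumptions made for illustration; the machines named above differ in these details.

```python
PAGE_BITS = 9                 # assumed page size of 2**9 = 512 words
PAGE_SIZE = 1 << PAGE_BITS


class PageFault(Exception):
    """Raised when the selected page is absent from operating memory."""

    def __init__(self, page, secondary_address):
        super().__init__(f"page {page} is absent")
        self.page = page
        self.secondary_address = secondary_address


def translate(page_table, logical_address):
    """Map a logical address to a physical one through the page table."""
    page = logical_address >> PAGE_BITS           # high-order bits select the entry
    offset = logical_address & (PAGE_SIZE - 1)    # low-order bits pass through
    present, location = page_table[page]
    if not present:
        # For an absent page, the entry holds the secondary-storage location.
        raise PageFault(page, location)
    return location * PAGE_SIZE + offset          # location is a frame number here


# Logical pages 0 and 2 occupy frames 5 and 3; page 1 is on secondary storage.
page_table = [(True, 5), (False, 7040), (True, 3)]
```

Logically contiguous pages 0 and 2 sit in non-adjacent frames, and page 1 need not be present at all until it is referenced.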
2.5.1.5 Segmented Addressing: The simplest segmentation on a
logical (if not operational) basis is the scheme used in the
Burroughs B6700 and its predecessors. Each program block is
compiled into a virtual address space of its own, called a seg-
ment; locations may then be accessed by specifying a segment
number and an offset from the beginning of the segment. In
execution, the name-location mapping is applied dynamically.
Each segment has a segment descriptor which contains the physical
location of the beginning of the segment. However, this
descriptor can also contain an indication that the segment is not
in storage at the moment; in this case, the address in the
descriptor is the secondary storage location at which the segment
may be found.
An advantage of this type of segmentation is the direct
relationship between the segment size and the logical unit of
program or data it contains. This characteristic increases the
average usefulness of words transferred in a segment load.
A disadvantage of this scheme is that segments are
small, segment descriptors are therefore numerous, and must
consequently be located in operating memory rather than high-speed
processor registers. The access to these necessarily
slows down the address formation process; consequently some
scheme of buffering in a small set of fast registers is usually
utilized to shorten the access delay (see section 2.2.3). A
second disadvantage is that storage allocation occurs in variable
sized units and is therefore more complex and consumes
more processor time than for fixed-sized pages.
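A small sketch may help show how segment descriptors and a buffer of fast registers interact. It is an illustration only and not the B6700 mechanism: `SegmentDescriptor`, `AddressUnit`, the `length` field, the four-entry buffer and its least-recently-used replacement rule are all assumptions.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class SegmentDescriptor:
    present: bool
    address: int  # operating-memory base if present, else secondary-storage address
    length: int   # segment size in words (variable, unlike a page)


class AbsentSegment(Exception):
    def __init__(self, segment, secondary_address):
        super().__init__(f"segment {segment} is absent")
        self.secondary_address = secondary_address


class AddressUnit:
    """Forms addresses from (segment, offset) pairs, buffering recent descriptors."""

    def __init__(self, descriptor_table, fast_registers=4):
        self.table = descriptor_table   # descriptors held in operating memory
        self.fast = OrderedDict()       # small associative buffer (assumed LRU)
        self.capacity = fast_registers
        self.table_fetches = 0          # counts the slow operating-memory accesses

    def _descriptor(self, segment):
        if segment in self.fast:
            self.fast.move_to_end(segment)
            return self.fast[segment]
        self.table_fetches += 1
        descriptor = self.table[segment]
        self.fast[segment] = descriptor
        if len(self.fast) > self.capacity:
            self.fast.popitem(last=False)   # discard the least recently used entry
        return descriptor

    def translate(self, segment, offset):
        d = self._descriptor(segment)
        if not d.present:
            raise AbsentSegment(segment, d.address)
        if not 0 <= offset < d.length:
            raise IndexError("offset outside segment")
        return d.address + offset


unit = AddressUnit({0: SegmentDescriptor(True, 1000, 20),
                    1: SegmentDescriptor(False, 88, 300)})
first = unit.translate(0, 5)    # descriptor fetched from operating memory
second = unit.translate(0, 6)   # descriptor found in the fast registers
```

Repeated references to the same segment pay the operating-memory access only once, which is the purpose of the buffering mentioned above.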
2.5.1.6 Segmentation Plus Paging: This method of addressing
and multiplexing was developed by the Multics group at MIT Project
MAC. It is implemented most ambitiously on the GE (Honeywell)
645 designed for Multics, and also on the IBM 360/67.
In Multics, segments tend to be large, and each is divided into
fixed-size pages. Even page tables are paged, since they otherwise
would occupy too much operating memory. Paging is the
mechanism which accomplishes multiplexing; segmentation is utilized
for other purposes which are not relevant to this report.
However, it should be mentioned that segmentation is implemented
in such a way that when two independent processes refer to the
same segment, both processes utilize the same page table. Shar-
ing is thereby implemented in a general and powerful way.
The Intermetrics multiprocessor design [4] featured a
segmented virtual memory system based in principle on the Bur-
roughs designs. The policies for space allocation, segment
placement, and replacement were, however, novel implementations
of the operating system. The overall objective of the design
was to reduce the usual overhead consumed by the memory manage-
ment function, by hardware assistance of address translation
with associative memories local to the processors, and by spe-
cially tailored OS routines to handle segment I/O.
A more detailed examination of the characteristic
differences between paging and segmentation, and the factors
influencing virtual memory design is presented in Chapter 4.
2.6 Implementational Aspects
This review, far from complete, of multiprocessor op-
erating system design problems closes with some comments about
the implementational aspects. The major objectives of anyone
embarking on the design of an operating system should be:
a) That the completed system work very closely to the way
it was intended;
b) That it not take forever to finish;
c) That the resulting design be non-subtle, that it may
be easily understood, maintained, and if necessary
modified, other than by its creators.
2.6.1 System Specification
A big step towards accomplishing the first objective
is to establish clearly in the beginning what the operating
system is expected to do, and how. A considerable fraction of
the total programming effort should be devoted to identifying
the functional requirements, and then thinking out an overall,
coherent design that not only satisfies them, but possesses
enough flexibility to accommodate later modification and addition.
The end item is a detailed design specification which
deals with the structure to be implemented and its operating
characteristics, and includes a description of how the com-
pleted system is to be verified.
2.6.2 Structure
The second and third objectives are largely a matter
of the way in which the software of the operating system is
structured, and the techniques used to implement that struc-
ture.
Comprehensive operating systems have acquired a bad
reputation for complexity, cost and ultimate unreliability,
largely perhaps as the result of the widespread usage of the
IBM 360 series of computers. OS/360 was very ambitiously conceived
at a time when rigorous techniques of software construction
(and the penalties of ignoring them!) were not as well
researched and understood as today. Problems with the use of
OS/360, and other designs, have prompted much study of the
theory and practice of operating systems, especially
during the last five years or so. A gathering body
of knowledge on techniques of design and operation has become
available (see, for example, [9]).
Dijkstra has pioneered the disciplined approach to operating
system design [10]. He organized the functions of a multiprogrammed
operating system into a number of sequential processes.
These processes were then hierarchically arranged to
form several independent levels of increasing abstraction of
machine operation. For example, the lowest hierarchical level
was that of the real machine itself. At the next to lowest level
were procedures for allocating processors to processes and field-
ing interrupts from the real time clock. The level above that
managed the operation of the virtual memory, without concern
for processor availability. The next level fielded the inputs
from the operator keyboards, and so on. The application pro-
grams formed the highest level. A programmer was thus able
to view the combination of hardware and software as a "virtual
machine", representing an abstraction of the real machine. Needless
to say, the whole concept precluded the use of machine language
coding by any application programmer, since this would
have cut straight through the screening levels of "virtual machines".
Each level of the system possessed a large degree of
independence of the other levels, and could be separately conceived,
implemented and tested.
Other operating system designs with different opera-
tional requirements and system configurations would probably
depart from the functional separations made by Dijkstra, but the
basic philosophy may be adhered to.
2.6.3 Systems Programming Language
Just as a problem oriented higher level language assists
in the structuring and implementation of applications software,
the use of a language suited to the definition of OS functions
has gained much support from operating system implementers.
The advantages can be viewed from both a managerial and a technical
aspect. The managerial benefits of HOL usage are too
well established to be repeated here. Various authors have
defined the features that would make a systems programming language
easy and efficient to use [11]. Almost all agree that
the language should possess a block structure and enforce name
scope rules. It should contain control features such as pro-
cedures and functions, the statements IF THEN ELSE, DO FOR, and
DO CASE. Some language designs restrict data types to those
generally agreed to be useful to systems programming, namely
bit, character, pointer and various forms of arrays. Others,
following the example of PASCAL [12] contain more powerful and
flexible data structures, which allow the systems programmer
freedom to adapt the language to his specific problem. The
ability to address specific machine features is necessary, al-
though the major portion of any operating system can be machine-
independent. The need to generate efficient code is clear, if
only to overcome the reluctance of non-believing systems pro-
grammers to code in a higher level language! Almost all advo-
cates insist on the absolute necessity of readability in the
language, and the provision of comprehensive diagnostics by the
compiler. From these characteristics, it is evident that systems
and application programming languages have quite similar
objectives, and differ mainly in the natural incompatibility of the
data types recognized. Several attempts have been made, therefore,
to adapt existing HOLs for system programming, as the following
examples illustrate.
A subset of PL/I was chosen to code the operating system
for the comprehensive Multics system at MIT [13], which is based
on Honeywell 6000 computers. The Burroughs Corporation has
developed several versions of ALGOL 60 with differing degrees
of machine dependence [14] for different B6700 systems programming
applications, as a consequence of their long standing use
of ESPOL in the B5500. There is Extended ALGOL for the bulk of
systems programming, including the Extended ALGOL compiler it-
self; Data Communication ALGOL, which allows the control soft-
ware for communications interfaces to be conveniently programmed;
and ESPOL, the original systems language, which enables many of
the B6700 features such as stacks, registers, memory, the multi-
plexors, peripheral devices, etc., to be addressed directly.
Several languages have been developed to handle systems
programming for specific machine architectures. The University
of Toronto is developing SUE for system programming on the IBM
360 [15]. An extensible language LSD is being designed for systems
development on the IBM 360 at Brown University [16], although
it is not yet operational. PL360 [17], a language designed
by Wirth at Stanford University for the IBM 360, has
features that make it attractive for systems programming.
Carnegie-Mellon has developed and used BLISS for its DEC PDP-10 [18].
It is strongly recommended that the operating system
for the SUMC be designed and written in one of these systems
programming languages, or at least in some tailored subset.
Most of the compilers have been written in the language itself,
which lessens the difficulty of transferring the compiler from
its original host machine to another.
References for Chapter 2
1. Fulmer, L.C., "A Modular Plated-Wire Associative Processor",
GER-14727, Goodyear Aerospace Corp., Akron, Ohio, March 1970.
2. Conti, C.J., "Concepts for Buffer Storage", Computer
Group News, March 1969.
3. Entner, R.S., "The Advanced Avionic Digital Computer
System", Computer Design, September 1970.
4. Intermetrics, Inc., "Engineering Study for the Functional
Design of a Multiprocessor System", Final Report,
NASA/Intermetrics Contract NAS9-11745, September 1972.
5. Intermetrics, Inc., "The Programming Language HAL -
A Specification", NASA/Intermetrics Contract NAS9-10542,
June 1971.
6. Intermetrics, Inc., "Advanced Software Techniques for
Data Management Systems: Vol. II", Final Report, NASA/
Intermetrics Contract NAS9-11778, February 1972.
7. Coffman, E., et al., "System Deadlocks", ACM Computing
Surveys, Vol. 3, No. 2, June 1971.
8. Holt, R., "Prevention of System Deadlocks", Comm. ACM,
January 1971.
9. ACM/SIGOPS, "Operating Systems Review", Proc. 3rd
Symposium on Operating System Principles, Stanford
University, Palo Alto, California, October 1971.
10. Dijkstra, E.W., "The Structure of 'THE' Multiprogramming
System", Comm. ACM, Vol. 11, No. 5, May 1968.
11. ACM/SIGPLAN, Proceedings of Symposium on Languages for
Systems Implementation, Purdue Univ., Lafayette, Ind.,
October 1971.
12. Wirth, N., "The Programming Language PASCAL", Acta
Informatica 1, 35-63, Springer Verlag, 1971.
13. Corbato, F.J., "PL/I As A Tool for System Programming",
Datamation, Vol. 15, No. 5, May 1969.
14. Lyle, D.M., "A Hierarchy of High Order Languages for
Systems Programming", Proc. Symposium on Languages for
Systems Implementation, October 1971.
15. Clark, B.L., et al., "System Language for Project SUE",
Proceedings of Symposium on Languages for Systems
Implementation, October 1971.
16. Bergeron, R.D., et al., "Language for System Development",
Proc. Symp. on Languages for Systems Implementation,
October 1971.
17. Wirth, N., "PL360, A Programming Language for the 360
Computers", Journal ACM, Vol. 15, No. 1, January 1968.
18. Wulf, W.A., et al., "Bliss Reference Manual", Carnegie-Mellon
University, Pittsburgh, Pennsylvania, January 1970.
Chapter 3
INTERRUPT STRUCTURE
This chapter will discuss various aspects of the interrupt
structure when applied to a multiprocessor. The first
section will present a list of assumptions upon which the following
sections are based. The second section presents a brief
categorization of interrupts in the environment of
the space station multiprocessor. The third section discusses
various problems that are encountered when attempting to develop
an interrupt structure for the multiprocessor.
3.1 Assumptions
a) The basic assumption is that the concept of interrupts
is indeed required. It is possible to conceive of
computer systems that are well specified, in which all
equipments are synchronized and serviced in a predetermined
cyclic fashion. However, the system contemplated
for the space station is not well specified. It will
have to respond to conditions not anticipated in the
program flow. Therefore, the need for interrupts is
postulated.
b) A true multiprocessor is assumed. This includes a
"floating executive" and a configuration with three
or more processing units. With a floating executive,
any process can be executed on any processor. There
are no functions dedicated to any processor. This excepts
the I/OC, which does serve a specialized function.
Three or more processors are assumed so that the generalized
solution to multiprocessor interrupt handling
can be addressed.
3.2 Interrupt Categorization
An interrupt can be defined as any condition which
causes an involuntary interruption in the sequence of execu-
tion of a process. The interrupt is not explicitly anticipated
in a program's code. It can be considered to be an involuntary
procedure call to the interrupt servicing routines, with an ul-
timate return link to the original process.
Interrupts may be categorized into three distinct
classes:
3.2.1 Process Oriented
Process oriented interrupts are those associated with
the process in execution. There are a number of distinct types.
Arithmetic and control traps are caused whenever an unacceptable
condition prevents further execution. An interrupt from a
"watchdog timer" indicates that a process has been running for
an excessive time.
The above two classes of process-oriented interrupts are
synchronous with the process and occur while the process is running.
There exists a class of process-oriented interrupts which
can occur when a process is in a waiting state. These inter-
rupts, sometimes called software interrupts, result from HOL
statements of the following form, as discussed in section 2.3.2:
ON (event) <code block>
This statement establishes a linkage which causes the
specified <code block> to be executed when the specified (event)
is signalled. If the process is running when the (event)
is signalled, then it is interrupted to execute the <code block>.
If the process is not running when the (event) is signalled,
then as soon as the process which issued the ON statement enters
the running state it will be interrupted to execute the <code
block>.
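The behavior of the ON statement can be sketched as below. This is a hedged model rather than the HAL implementation: the `Process` class, its `on`, `signal` and `enter_running_state` methods, and the representation of a code block as a Python callable are all hypothetical.

```python
class Process:
    """Toy process that can link code blocks to events, as with ON (event)."""

    def __init__(self, name):
        self.name = name
        self.running = False
        self.handlers = {}   # event name -> code block (a callable)
        self.pending = []    # events signalled while the process was not running
        self.log = []

    def on(self, event, code_block):
        """ON (event) <code block>: establish the linkage."""
        self.handlers[event] = code_block

    def signal(self, event):
        if event not in self.handlers:
            return
        if self.running:
            self.handlers[event](self)    # interrupt the running process now
        else:
            self.pending.append(event)    # defer until the process next runs

    def enter_running_state(self):
        self.running = True
        while self.pending:               # deferred code blocks execute first
            self.handlers[self.pending.pop(0)](self)


nav = Process("nav")
nav.on("DATA_READY", lambda p: p.log.append("handled DATA_READY"))
nav.signal("DATA_READY")          # process is waiting: nothing executes yet
log_while_waiting = list(nav.log)
nav.enter_running_state()         # deferred code block executes on entry
nav.signal("DATA_READY")          # process is running: executes immediately
```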
3.2.2 System Oriented
This category of interrupt does not have any particu-
lar affinity for the currently running process. Conditions
such as I/O Complete, I/O Error, and Absent Segment Trap fall
into this category. Both I/O Management and Memory Management
are executive functions.
Many failures or error conditions, such as power fail-
ure, can be considered system oriented.
3.2.3 Processor Oriented
Even with a floating executive and no dedicated func-
tions to particular processors, it does become necessary to
direct an interrupt to a specific processor, independent of
the process being executed. For example, in response to an
error signal one processor might direct another processor to
terminate or restart. The entire area of system initialization
and reconfiguration requires direct communication with specific
processors. The processor-directed interrupt is a convenient
mechanism for meeting this requirement.
3.3 Multiprocessor Interrupt Problem Areas
A number of problems involved in the servicing of interrupts
exist. Some are aggravated in a multiprocessor environment
and some are unique to the multiprocessor environment.
Four major areas are discussed below.
3.3.1 Which Processor to Interrupt?
In a multiprocessor system, a question arises as to
which processor to select to handle a given interrupt. For
process oriented interrupts which occur while the process is
running, the decision is trivial. The interrupt should be
steered to the related processor. Similarly, so should proces-
sor related interrupts.
The remainder of the interrupts are system-oriented
or non-running-process-oriented, and have no affinity for any
particular processor.
A number of options are possible in assigning a pro-
cessor to service the interrupt:
a) An arbitrary processor may be interrupted based upon
some random selection algorithm. The interrupted pro-
cessor may then execute a software routine which de-
termines whether the interrupt condition is of higher
priority than the process which was interrupted. If
it is not, then the interrupted process will be sche-
duled like any other process according to its priority.
b) All the processors may be interrupted. The interrupt
service routine can be made a "critical section" of
code which can only be executed by one processor at
a time. The first processor to access this code ser-
vices the interrupt. The other processors revert to
their original processes.
c) A sequential selection employing a "round robin" style
algorithm may be used. In this way, the interrupts
are loaded equally upon all processors. This option
of course does not consider the process which is run-
ning on the processor at the instant of interruption.
d) An assigned processor might service all interrupts, or
specific interrupt conditions can be preassigned to
specific processors.
e) The processor executing the lowest priority process
will be selected to service the interrupt. If the
interrupt priority is lower than any running process,
then the required interrupt response will not be exe-
cuted until a process swap results in a lower priority
process.
The approach recommended in this Report is to provide
a combination of c) and d) by placing within the I/O control
an element of hardware which automatically determines the most
interruptable processor (based upon the priority of the process
running), and receives and distributes all potential interrupts.
Running-process-oriented interrupts can by-pass the
interrupt logic within the I/OC since the processor to be selected
is known a priori.
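The selection of the most interruptable processor, as in option e) and the recommended hardware element, can be sketched as a simple comparison. The function name `select_processor`, the convention that a larger number means higher priority, and the treatment of equal priorities as "wait" are assumptions for illustration.

```python
def select_processor(running_priorities, interrupt_priority):
    """Pick the processor running the lowest-priority process.

    running_priorities[i] is the priority of the process on processor i
    (larger means more urgent, an assumed convention). Returns the index
    of the processor to interrupt, or None if the interrupt must be held
    until a process swap leaves a lower-priority process running.
    """
    target = min(range(len(running_priorities)),
                 key=running_priorities.__getitem__)
    if interrupt_priority > running_priorities[target]:
        return target
    return None


chosen = select_processor([7, 3, 5], interrupt_priority=4)   # processor 1 runs priority 3
held = select_processor([7, 3, 5], interrupt_priority=2)     # below every running process
```

Here the interrupt of priority 4 is steered to processor 1, while the interrupt of priority 2 is held pending.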
3.3.2 Response Time
There is a small class of interrupts which require al-
most immediate response. These are system oriented and deal
with equipment failures or other emergency situations. One ex-
ample is a "power failure" interrupt. This must be responded
to within microseconds in order to move any volatile registers
into permanent storage and then systematically to shut down the
system.
The class of conditions associated with arithmetic
and control traps does not require instantaneous response but
the running process can not continue until after the trap is
serviced. Any trap condition falls into this category, even
system oriented traps such as the Absent Segment Trap.
Quite often specifications are generated and systems
built which require I/O Complete interrupts to be generated
within microseconds of an I/O completion. From a performance
point of view, most I/O interrupts can possess a response time
of the order of milliseconds. For example, if M3 requires an
average of 10 milliseconds for each access, it is clearly un-
necessary for its completion to be signalled within microsec-
onds.
3.3.3 Innovations
A number of innovations may be suggested in the I/O
interrupt area. These suggestions exploit the space station
type of I/O, namely mass storage M3, and a data bus.
a) "Quiet" I/O
When the multiprocessor system workload is heavy,
the frequency of Absent Segment Traps can be expected
to be relatively high. Conventional processing of an
Absent Segment Trap requires entry to an interrupt
handler, initiation of an M3 operation, and placement
of the process into the wait state. Upon completion
of the M3 operation, an I/O Completion interrupt is
signalled. The handler for this interrupt is then
entered, the process waiting for the segment is readied,
another I/O operation to M3 is initiated if one is
queued, and the processor allocation routine is called
to see if it is appropriate to assign a processor to
the newly readied process.
An alternate implementation is suggested to avoid,
at least in most cases, the necessity for entering the
I/O Completion interrupt handler when the segment transfer
is concluded. This is achieved by providing a
capability in the I/O controller which causes it to
make a choice of whether or not to signal I/O completion.
Thus a dynamic decision is made as to whether
the interrupt should be suppressed or signalled, depending
upon the existence of a queue of operations
waiting for the device. If the interrupt is suppressed,
the condition is made known to the system by the
setting of a bit field in a location accessible to the
absent segment trap handler. After initiating the M3
operation to make an absent segment present, this handler
checks the completion-states of M3 segment transfers
previously issued. The processes whose segments are
found to have completed their transfers are readied;
thus the utilization of the I/O Completion interrupt
handler is avoided. This diminishes the overhead for
absent segment handling, especially under heavy load,
when computational overhead is most detrimental to the
system throughput.
b) Data Bus Control
If a command response data bus, with a minor cycle
of 20 milliseconds, is employed then it is clearly un-
necessary to interrupt the system after each peripheral
device is accessed. In principle, the synchronous na-
ture of the data bus does not require interrupts for
normal processing. However, one may consider the need
for interrupts due to infrequent events:
1) An interrupt might be generated by the Data Bus
Control Unit if certain types of failures are
detected.
2) For equipments which are interrogated at a very
low frequency or even randomly an interrupt might
be considered at the end of the request.
Both of these suggestions impose little if any load on
the system due to their very low frequency of operation.
A checkout problem may, however, arise in trying to
verify successful operation for infrequent interrupts
at any point of execution in a program.
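The "Quiet" I/O suggestion of item a) can be sketched as a small simulation. Everything named here is hypothetical: the `QuietIOController` class, the representation of the bit field as a Python set, and in particular the rule that the interrupt is suppressed exactly when other operations are queued for the device, which is one reading of the text and not something it states outright.

```python
from collections import deque


class QuietIOController:
    """Hypothetical M3 controller that may suppress I/O Completion interrupts."""

    def __init__(self):
        self.queue = deque()     # segment transfers waiting for the device
        self.active = None       # process whose segment is now being transferred
        self.completed = set()   # completion bits readable by the trap handler
        self.interrupts = []     # I/O Completion interrupts actually signalled

    def start(self, process):
        if self.active is None:
            self.active = process
        else:
            self.queue.append(process)

    def transfer_done(self):
        done, self.active = self.active, None
        if self.queue:
            # Assumed rule: operations are waiting, so stay quiet, record the
            # completion, and begin the next queued transfer.
            self.completed.add(done)
            self.active = self.queue.popleft()
        else:
            # No queue: signal in the conventional way.
            self.interrupts.append(done)


def absent_segment_trap(controller, process, ready_list):
    """Initiate the M3 operation, then ready processes whose transfers finished."""
    controller.start(process)
    ready_list.extend(sorted(controller.completed))
    controller.completed.clear()


io = QuietIOController()
ready = []
absent_segment_trap(io, "A", ready)   # device idle: transfer for A starts
absent_segment_trap(io, "B", ready)   # B queues behind A
io.transfer_done()                    # A done, B waiting: interrupt suppressed
absent_segment_trap(io, "C", ready)   # handler finds A complete and readies it
io.transfer_done()                    # B done, C waiting: suppressed again
io.transfer_done()                    # C done, queue empty: interrupt signalled
```

In this run only one of three completions enters the interrupt handler. Note that B's completion bit is still set at the end; in a fuller design the I/O Completion handler would presumably sweep the bits as well, which is again an assumption.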
3.3.4 The Interrupt Sequence
When an interrupt is signalled to a processor, the de-
tails of its local environment, the processor's status, must be
saved so an eventual return is possible. In a stack oriented
machine an interrupt response can be executed parasitically on
top of the process' stack, with entrance and return functions
performed automatically.
Since procedures may be nested to multiple depth, so
can interrupts. The only limit is the number of display registers
provided to mark the beginning of each lexical level in the stack.
3.3.5 Interrupt Functional Response
System or processor-oriented interrupts possess a sta-
tic (pre-determined) response. Once the response is established
it is not changed. However, for process-oriented interrupts
(traps) one may conceive of situations where each process may
desire a different response to particular interrupts. For ex-
ample, one process might want to respond to a square root of a
negative number trap by substituting a zero for the answer.
Another process might deal with complex numbers and cause a
re-entrance into the square root instruction with a change of
sign of the argument.
For all trap conditions the system must provide a de-
fault option. It is suggested that a process be allowed to
override this system option by providing its own response to
particular traps. Any process at any lexical level should be
allowed to specify, if necessary, its own response to process-
oriented traps.
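The suggested override of the system default by a process, at any lexical level, can be sketched as a lookup through nested scopes. The `TrapScopes` class, the trap name `SQRT_NEGATIVE`, and the choice of NaN as the system default response are assumptions; the text does not say what the default should be.

```python
import math

# Assumed system default: deliver NaN for the square root of a negative number.
SYSTEM_DEFAULTS = {"SQRT_NEGATIVE": lambda x: float("nan")}


class TrapScopes:
    """Per-process trap responses, one table per lexical level."""

    def __init__(self):
        self.levels = [{}]

    def enter_block(self):
        self.levels.append({})

    def exit_block(self):
        self.levels.pop()

    def on_trap(self, trap, response):
        self.levels[-1][trap] = response

    def respond(self, trap, argument):
        # Search from the innermost lexical level outward, then fall back
        # to the system default.
        for level in reversed(self.levels):
            if trap in level:
                return level[trap](argument)
        return SYSTEM_DEFAULTS[trap](argument)


def square_root(scopes, x):
    if x < 0:
        return scopes.respond("SQRT_NEGATIVE", x)
    return math.sqrt(x)


zeroing = TrapScopes()
zeroing.on_trap("SQRT_NEGATIVE", lambda x: 0.0)              # substitute zero
sign_flip = TrapScopes()
sign_flip.on_trap("SQRT_NEGATIVE", lambda x: math.sqrt(-x))  # re-enter with sign changed
plain = TrapScopes()                                         # relies on the default
```

With these definitions `square_root(zeroing, -4.0)` yields 0.0 and `square_root(sign_flip, -4.0)` yields 2.0, mirroring the two processes described in the text.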
Chapter 4
MEMORY HIERARCHY
Memory is possibly the most difficult of any computer
element to specify, implement and use. It is in this area that
technological limits and cost factors are first encountered
in a computer system. The inability of a single, currently known,
memory technology to meet the conflicting requirements of high
access speed and high storage capacity has led to the hierarchical
concept of levels of memory.
4.1 Basic Hierarchy Description
Within the multiprocessor structure, one finds a num-
ber of levels of memory used for varying purposes.
4.1.1 M0 - Micro Level Control Memory
From one point of view, micro memory is only a parti-
cular implementation of a control unit and therefore should not
be considered part of the memory hierarchy. Alternatively, an-
other point of view suggests that micro memory should be
used for execution of the frequently used operating system pri-
mitives and subroutines. It is from this secondary point of
view that M0 is considered an element of the memory hierarchy.
4.1.2 M1 - Local Memory
M1 storage is dedicated to the processing unit. Its
function can range from a register set, as is found in the SUMC,
to a complete cache memory as used in the IBM 360/85. The
major function of M1 is to increase the performance of the sys-
tem. Its speed is in the 100 nanosecond access time range and
its size can range from 16 words (for register storage) to 4K
words (for a cache implementation).
4.1.3 M2 - Operating Memory
In a multiprocessor environment, M2 is that part of
memory which is shared by the processing units and I/O controllers.
M2 must of necessity consist of a number of separate memory
modules so that simultaneous access of different modules may be
made by the processing units and I/OC. M2 cycle time is in the
1 microsecond range and its size is of the order of 100K words.
4.1.4 M3 - Mass Memory
M3, historically a drum or disk, provides the function
of augmenting the M2 storage. It is used to hold all the pro-
grams and data segments not currently being used in the proces-
sing function. M3 is used to implement the concept of a larger
M2 virtual memory. It is characterized by an access time in the
millisecond range and a storage size consisting of millions of
words.
4.1.5 M4 - Archival Storage
Archival storage (possibly implemented with a magnetic
tape unit) is included for completeness. It is used as the re-
pository of files and other information which does not undergo
rapid change or frequent use. Conventionally, M4 is considered
to be an I/O device and is controlled accordingly.
The remainder of this chapter will concentrate on the
relationships between the major elements of the memory hier-
archy which contribute to system performance, namely M1, M2,
and M3.
4.2 Local Storage
4.2.1 The Problem - Memory Contention vs. Performance
One of the major reasons for using a multiprocessor is
to increase the overall performance or work delivered by the
system. If the extra performance were not required a unipro-
cessor would be employed. Ideally, a system with R processing
units should produce R times the work of a single processor sys-
tem. One factor which tends to reduce the overall performance
of the multiprocessor is M2 memory contention. The effect is
to degrade the M2 cycle time (t2), yielding an effectively
slower cycle time (t2eff).
One way to reduce memory contention is to provide a
limited amount of dedicated memory local to each processor
(M1). If M1 possesses a cycle time (t1) which is substantially
faster than t2 then a performance increase can be obtained.
4.2.1.1 Performance Model: Postulate the multiprocessor model
shown in Figure 4.1, and make the following definitions and
assumptions:
a) n1 = number of M1 cycles per unit time for a single
processor
b) t1 = M1 cycle time
c) n2 = number of M2 cycles per unit time executed by a
single processor
d) t2 = M2 cycle time
e) t2eff = effective M2 cycle time, degraded by memory
contention
f) W = work per unit time from a single processing unit.
This is defined as proportional to the total number
of M1 and M2 cycles per unit time. Usually
processor work is defined in terms of the number
of instructions per second. For a conventional
360 type architecture an instruction usually corresponds
to two M2 cycles. In a sense the internal
processor cycles should also be considered useful
work. Indexing which does not require an M2 access,
because it might use an internal register, is a
very useful function. If a multiprocessor makes
very large use of its internal M1 storage these
cycles are just as important as M2 cycles in estimating
overall work.
$$W = n_1 + n_2$$
g) R = number of processing units
h) M = number of independent M2 modules
i) h = fraction of all memory requests that use M1
(the hit ratio). This is for a single processing
unit.
$$h = \frac{n_1}{n_1 + n_2}$$
(Figure: R processing units, each a processor P paired with a
local memory M1, are connected through an internal bus to M
independent M2 modules.)
Notes:
1) A processing unit contains a P-M1 combination.
2) The internal bus allows all the R processing units
to communicate with all M memory modules.
3) There is no internal bus contention.
Figure 4.1: Multiprocessor Model
j) It is assumed that a processing unit is always making
an M1 or an M2 reference and that these references are
mutually exclusive, that is, they cannot occur
simultaneously. Let $n_1 t_1 + n_2 t_{2eff} = 1$ unit of time.
From the above definitions it follows that
$$W = \frac{1}{t_{2eff}} \left[ \frac{r}{h + (1 - h)\,r} \right]$$
where $r = t_{2eff}/t_1$.
The term in brackets can be considered to be an
enhancement factor by which performance is increased.
Figure 4.2 plots this factor as a function of h.
We see from this simplified model that the introduction
of M1 with a reasonably high hit ratio can potentially increase
the performance of a processing unit, especially if the $t_2/t_1$
ratio is high. Many overhead factors involved in the utilization
and control of M1 will tend to lessen the improvement.
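As a rough numerical check of the simplified model, the following Python sketch evaluates the bracketed enhancement factor k = r / (h + (1 - h) r) for a few hit ratios. The function name `enhancement_factor` and the sample values of h and r are illustrative assumptions, not part of the original study.

```python
def enhancement_factor(h: float, r: float) -> float:
    """Enhancement factor k = r / (h + (1 - h) * r).

    h is the M1 hit ratio (0..1); r is the cycle-time ratio t2eff / t1.
    """
    if not 0.0 <= h <= 1.0:
        raise ValueError("hit ratio h must lie in [0, 1]")
    if r <= 0.0:
        raise ValueError("speed ratio r must be positive")
    return r / (h + (1.0 - h) * r)


# Tabulate k for the speed ratios used as curve parameters in Figure 4.2.
for r in (1, 2, 5, 10):
    row = ", ".join(
        f"h={h:.1f}: k={enhancement_factor(h, r):.2f}" for h in (0.0, 0.5, 0.9, 1.0)
    )
    print(f"r={r:>2}: {row}")
```

With h = 0 the factor is 1 for any r (no benefit from M1), and with h = 1 it equals r, which is the upper bound set by the speed ratio.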
The effect of memory contention upon t2eff will now be
calculated. Assume that requests to M2 are independent and
randomly distributed across the address space. In reality this
assumption can be seriously questioned since program and data
both possess locality. That is, there is a strong correlation
between successive M2 access events. This is extremely difficult
to measure since the programming load is not known. For
lack of a better model, the random distribution is assumed.
A processor will request access to M2 with a probability
$$A = \frac{n_2\,t_{2eff}}{n_1 t_1 + n_2\,t_{2eff}}$$
It can be shown that
$$A = \frac{r(1 - h)}{r(1 - h) + h}$$
The probability of accessing any particular M2 module is
therefore A/M. Given that a processor is requesting access to a
particular M2 module, the probability that none of the other
R - 1 processors are requesting access to that module is:
(Figure: plot of the enhancement factor
$k = \frac{r}{h + (1 - h)\,r}$ against the "hit" ratio h from 0 to
1.0, with k ranging from 0 to 10 and one curve each for r = 1,
r = 2, r = 5 and r = 10.)
Figure 4.2: "Enhancement" Factor k Versus "Hit" Ratio h
$$P(0) = (1 - A/M)^{R-1}$$
The probability that 1 out of R-1 other processors is requesting
access to the particular M2 module is:
$$P(1) = (R-1)\,(1 - A/M)^{R-2}\,(A/M)^{1}$$
In general, the probability that i processors out of the R-1
other processors desire access to the same module is:
$$P(i) = \frac{(R-1)!}{i!\,(R-1-i)!}\,(1 - A/M)^{R-1-i}\,(A/M)^{i}$$
If there is no contention, the M2 access time is $t_2$. If one
other processor is requesting, the access time could reach $2t_2$.
In general, with i other processors the access time could reach
$(i + 1)t_2$.
The effective access time averaged over all contention
possibilities is therefore
$$t_{2eff} = \sum_{i=0}^{R-1} (i + 1)\,t_2\,P(i) = t_2 \sum_{i=0}^{R-1} P(i) + t_2 \sum_{i=0}^{R-1} i\,P(i)$$
Since P(i) is a binomial distribution [1],
$$\sum_{i=0}^{R-1} P(i) = 1$$
and
$$\sum_{i=0}^{R-1} i\,P(i) = (R - 1)(A/M)$$
therefore
$$t_{2eff} = t_2\left[1 + \frac{(R-1)\,A}{M}\right] = t_2\left[1 + \frac{(R-1)\,r(1-h)}{M\,[r(1-h) + h]}\right]$$
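The closed form for the effective cycle time can be checked against the direct average over the binomial contention probabilities. The Python sketch below does this under the same random-request assumption; the function names are hypothetical, and r is treated as a given parameter, as in the text.

```python
from math import comb


def request_probability(h: float, r: float) -> float:
    """A = r(1 - h) / (r(1 - h) + h): chance a processor is requesting M2."""
    return r * (1.0 - h) / (r * (1.0 - h) + h)


def contention_pmf(i: int, R: int, M: int, A: float) -> float:
    """P(i): probability that i of the other R - 1 processors want the same module."""
    p = A / M
    return comb(R - 1, i) * (1.0 - p) ** (R - 1 - i) * p ** i


def t2eff_by_summation(t2: float, R: int, M: int, A: float) -> float:
    """Average of (i + 1) * t2 over i = 0 .. R - 1 contending processors."""
    return sum((i + 1) * t2 * contention_pmf(i, R, M, A) for i in range(R))


def t2eff_closed_form(t2: float, R: int, M: int, A: float) -> float:
    """Closed form t2 * [1 + (R - 1) * A / M]."""
    return t2 * (1.0 + (R - 1) * A / M)


A = request_probability(h=0.5, r=10.0)
print(t2eff_by_summation(1.0, 4, 8, A), t2eff_closed_form(1.0, 4, 8, A))
```

The two values agree to rounding error because the mean of a binomial distribution over R - 1 trials with success probability A/M is (R - 1)(A/M).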
Some insight may be gained by studying the overall total system
work ($W_T$) where:
$$W_T = RW$$
$$W_T = \frac{R\,r}{t_2\left\{[r(1-h) + h] + \dfrac{(R-1)\,r(1-h)}{M}\right\}}$$
The following figures (Figure 4.3 and Figure 4.4)
depict $W_T$ for h = 0, h = .5. The following two facts should
be observed:
a) System performance is increased as more M2 modules are
added.
b) Local storage can significantly increase performance.
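The two observations can be reproduced numerically from the expression for total system work. The following Python sketch is a minimal illustration; `total_work` is a hypothetical helper name, and the parameter values (r = 10, $t_2$ = 1) are taken from the figure annotations.

```python
def total_work(R: int, M: int, h: float, r: float, t2: float = 1.0) -> float:
    """W_T = R*r / (t2 * ([r(1-h) + h] + (R-1) * r(1-h) / M))."""
    miss = r * (1.0 - h)  # the r(1 - h) term shared by both parts of the denominator
    return R * r / (t2 * ((miss + h) + (R - 1) * miss / M))


# One row per number of M2 modules, one column per number of processors R = 1..8.
for h in (0.0, 0.5):
    print(f"h = {h}")
    for M in (1, 4, 8, 16):
        row = " ".join(f"{total_work(R, M, h, r=10.0):6.2f}" for R in range(1, 9))
        print(f"  M={M:>2}: {row}")
```

With h = 0 and a single M2 module the total work stays at 1/$t_2$ regardless of R; adding modules or raising h lifts every row.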
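(Figure: plot of $W_T$ against R from 1 to 8 with no M1 (h = 0)
and $t_2$ = 1 µsec, so that
$W_T = \frac{1}{t_2}\left[\frac{R}{1 + (R-1)/M}\right]$. Curves for
M = 1, M = 4, M = 8 and M = 16 are shown below the line of
maximum possible performance.)
Figure 4.3: $W_T$ Versus R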
(Figure: plot of $W_T$ against R from 1 to 9 with h = .5, r = 10
and $t_2$ = 1 µs, so that
$W_T = \frac{1}{t_2}\left[\frac{1.82R}{1 + .91(R-1)/M}\right]$.
Curves for values of M from 1 to 16 are shown below the line of
maximum possible performance.)
Figure 4.4: $W_T$ Versus R
4.2.2 Two Approaches to an Implementation
A major design question naturally arises. How does
one use local storage to obtain a hit ratio of .5 or .9 or
more? The answer is complex and involves studying the nature
of program execution in relationship to the instruction set.
Two approaches will be mentioned.
4.2.2.1 The Cache Concept: As CPU speeds have increased with
advances in technology, computers have been able to handle
larger and more complex processing tasks, and the demand for
operating memory capacity has increased. Since capacity and
speed are conflicting factors in memory design, an hierarchical
memory organization was proposed many years ago [2] to enable
these two desirable qualities to be independently developed.
Advances in semi-conductor technology have only recently made
this concept feasible.
A backing store, M2, which de-emphasizes speed to
achieve an adequate capacity, interfaces to a buffer store or
cache, M1, whose primary design objective is speed.
(Figure: the backing store (M2) feeds the buffer (M1) over a
data path 4 to 16 words wide; the buffer serves the processor
under control of an associative mapping mechanism and is
transparent to the programmer.)
Figure 4.5: Buffer Store Organization
The concept depends for its success on the notion of
locality. Locality is an experimentally observed fact of program
behavior by which references tend to occur within a region
of the program's address space, and this region migrates
relatively slowly. Locality is a natural outcome of the way
people think and write programs: concentrating on one task
at a time, using loops, using sequential control, etc. [3].
The degree of locality is influenced by programming style,
data organization, strategy of algorithm, and the programming
language. Locality gives rise to the notion of the program
working set, which is the minimal set of blocks that a program
requires to have in the cache in order to run efficiently.
If less than the working set is in M1, the probability of
occurrence of a reference to a missing block, m, increases. This
situation is most likely to occur in a multiprogrammed
environment when the number of programs n exceeds the capacity
of the cache to contain all their working sets, as illustrated
below.
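(Figure: the probability m of a reference to a missing block,
with a maximum of 1, plotted against the number of programs n;
a point $n_0$ is marked on the n axis.)
Figure 4.6: Probability of Missing Block Versus
Number of Working Sets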
It is an experimentally verified fact that a process
favors references to a small set of its total address space,
and that provided this set is contained within M1, the need
to access program areas not in M1 arises relatively
infrequently. When access to M2 becomes necessary, more
information than is immediately required is transferred to M1,
in the expectation that references in the vicinity of the
accessed word are likely. The relationship between the size of
M1, the amount of information transferred, and the effect of
different program addressing behaviors was studied by Gibson [3].
He concluded that an M1 capacity of 2K to 4K words and a transfer
block size between 4 and 16 words provided best results. He also
found that the dynamics of buffer operation were more sensitive
to the addressing patterns of the various programs than to any
other factor.
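The interplay of buffer capacity, block size and addressing pattern can be illustrated with a toy simulation. The Python sketch below is not a reproduction of Gibson's study: the least-recently-used replacement rule, the synthetic loop trace and the name `hit_ratio` are all assumptions made for illustration.

```python
from collections import OrderedDict


def hit_ratio(trace, capacity_words: int, block_words: int) -> float:
    """Fraction of references in `trace` found in a block-organized buffer.

    Replacement is least-recently-used, which is an assumption of this sketch.
    """
    n_blocks = capacity_words // block_words
    if n_blocks < 1:
        raise ValueError("buffer must hold at least one block")
    resident = OrderedDict()  # block number -> None, oldest reference first
    hits = 0
    for address in trace:
        block = address // block_words
        if block in resident:
            hits += 1
            resident.move_to_end(block)        # now the most recently used block
        else:
            if len(resident) >= n_blocks:
                resident.popitem(last=False)   # evict the least recently used block
            resident[block] = None             # the whole block is brought into M1
    return hits / len(trace)


# A 64-word program loop executed ten times.
loop = list(range(64)) * 10
print(hit_ratio(loop, capacity_words=128, block_words=8))  # loop fits in the buffer
print(hit_ratio(loop, capacity_words=32, block_words=1))   # loop larger than buffer
```

When the loop fits in the buffer, only the first word of each block misses; when the buffer is smaller than the loop and blocks are one word long, this replacement rule misses on every reference, which is the working-set effect described above.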
To maintain a given processor's speed, data transfer
from M2 must occur at an adequate rate. The M2's slower access
can be compensated for by increasing the transfer path width.
This can be achieved by:
a) An M2 technology which yields a long physically stored
word, e.g., the pseudo-2 1/2D organized plated wire
memory [4] which allows several hundred bits to be
accessed at once.
b) Organizing M2 into a number of smaller modules and
interleaving the addresses, so that contiguous addresses
1, 2, 3, are stored at corresponding locations in modules
1, 2, 3, rather than in consecutive locations in
any one module. This has been the approach employed
by current designs such as the IBM 360/85, 91 and 195,
which use core technology for M2.
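The interleaving in item b) amounts to splitting each address into a module number and a location within that module. A minimal Python sketch, assuming zero-based addresses and low-order interleaving (the function names are hypothetical):

```python
def interleaved_location(address: int, n_modules: int) -> tuple[int, int]:
    """Return (module, offset) for a word address under low-order interleaving."""
    return address % n_modules, address // n_modules


def modules_touched(first_address: int, block_words: int, n_modules: int) -> list[int]:
    """Modules accessed when a block of contiguous words is transferred."""
    return [
        interleaved_location(a, n_modules)[0]
        for a in range(first_address, first_address + block_words)
    ]


# A 4-word block starting at address 8 is spread over all four modules,
# so the four words can be fetched in parallel.
print(modules_touched(8, 4, 4))
```

Because consecutive addresses fall in different modules, a block transfer to M1 engages several modules at once rather than cycling one module repeatedly.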
The high speed of M1 is now generally realized by bi-
polar semiconductor techniques rather than thin-film. Buffer
memories of up to 1/4 million bits with cycle times less than
200 ns have been built, although similar speeds at far lower
power dissipations are being achieved by current plated wire
designs [5].
The above discussion has been in terms of a processor
"read" operation. Writing into the buffer presents an additional
problem in that the contents of the buffer do not represent the
primary source of the program being processed. A processor
"write" must be reflected in an update of the primary source,
which is stored in M2. This can be achieved in two ways:
a) Storing through: Every "write" request causes an
immediate update of M2 as well as the cache.
b) Block update: Write requests are allowed to accumulate
in M1. Whenever a block is to be replaced by the block
replacement logic the modified block is written out to
M2.
Which of the two techniques is chosen depends on program
behavior: "writes" tend to cluster in time and in program space,
so that for small blocks of 4 to 16 words a block update
technique may result in lower average M1 to M2 traffic density
and transfer delay.
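The dependence on program behavior can be seen by counting the words written to M2 under each policy. The Python sketch below is a simplified model: it considers write references only, assumes least-recently-used block replacement, and flushes all modified blocks at the end. The name `m2_write_traffic` is hypothetical.

```python
from collections import OrderedDict


def m2_write_traffic(write_addresses, block_words: int, resident_blocks: int):
    """Return (store_through_words, block_update_words) sent to M2."""
    store_through_words = len(write_addresses)  # every "write" updates M2 at once
    modified = OrderedDict()                    # modified blocks in M1, oldest first
    block_update_words = 0
    for address in write_addresses:
        block = address // block_words
        if block in modified:
            modified.move_to_end(block)
        else:
            if len(modified) >= resident_blocks:
                modified.popitem(last=False)    # replaced block is written out to M2
                block_update_words += block_words
            modified[block] = None
    block_update_words += len(modified) * block_words  # final flush of M1
    return store_through_words, block_update_words


clustered = [0, 1, 2, 3] * 8            # repeated writes to one 4-word block
scattered = [0, 100, 200, 300, 400]     # each write falls in a different block
print(m2_write_traffic(clustered, block_words=4, resident_blocks=2))
print(m2_write_traffic(scattered, block_words=4, resident_blocks=2))
```

Clustered writes favor block update, since many writes are absorbed by one block transfer; scattered writes favor storing through, since each replaced block carries mostly unmodified words.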
There are a number of arguments which can be raised
against the use of a cache in a multiprocessor system.
a) Cost. To be effective a 4K word cache of high speed
(100 ns) monolithic memory must be employed in each
processing unit.
b) To keep the cache filled with useful data a large
bandwidth of data from M2 (128-256 bits per access) must
exist. Many of the words accessed from M2 might not
be used. This unnecessary M2 traffic tends to increase
M2 contention and thus reduce performance.
c) In a multiprocessor system the use of a cache with
COMPOOL data presents a problem in keeping the copies
present in the various caches updated. (See section 2.3.1)
d) IBM's successful use of the cache is based partially
upon the inefficiency of the 360 instruction set.
That is, quite often small program loops are inserted
by a compiler to execute primitive functions which
could have been basic instructions in other systems.
4.2.2.2 M1 in a Stack-oriented, Descriptor-based System: The
problem faced in employing M1 is to use it for information which
has a high probability of being accessed many times (a high hit
ratio, h). Traditionally, base registers and index registers
have been allocated to the local storage of a processor for
reasons of speed and their high frequency of use. However,
register management problems tend to increase overhead.
Intermetrics proposes to use M1 for specialized storage
and to have the management of M1 an automatic hardware function.
In a stack-oriented machine it was realized that the
top few entries of the stack provide the most referenced ele-
ments. For this reason the first 8 stack locations are made
resident in M1. M1 stack overflow pushes the bottom of the
M1 portion of the stack into M2.
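The split of the stack between M1 and M2 can be sketched as follows. This Python model is an illustration only: the class name `SplitStack`, the transfer counter, and the policy of refilling one entry from M2 when the M1 portion runs empty are assumptions, since the text describes only the overflow direction.

```python
class SplitStack:
    """Stack whose top `m1_depth` entries are held in M1; older entries spill to M2."""

    def __init__(self, m1_depth: int = 8):
        self.m1_depth = m1_depth
        self.m1 = []            # top of stack is the end of this list
        self.m2 = []            # spilled entries, most recently spilled last
        self.m2_transfers = 0   # words moved between M1 and M2

    def push(self, value):
        if len(self.m1) == self.m1_depth:
            # M1 overflow: the bottom of the M1 portion is pushed into M2.
            self.m2.append(self.m1.pop(0))
            self.m2_transfers += 1
        self.m1.append(value)

    def pop(self):
        if not self.m1:
            if not self.m2:
                raise IndexError("pop from empty stack")
            # Assumed refill policy: bring back one entry from M2 on underflow.
            self.m1.append(self.m2.pop())
            self.m2_transfers += 1
        return self.m1.pop()


stack = SplitStack()
for value in range(10):
    stack.push(value)
print(stack.m1, stack.m2, stack.m2_transfers)
```

As long as the stack depth stays within the M1 portion, pushes and pops generate no M2 traffic at all.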
The descriptor is the most referenced data type. For
this reason the 32 most recently referenced descriptors are
retained in M1. An associative mapping mechanism is employed for
control of this descriptor cache.
The dynamic nature of the stack creates a situation
where the starting location of each lexical level must be
quickly accessed. For this reason a set of from 16-32 base
registers is proposed. Each base register contains a pointer
to the start of a lexical level and is automatically accessed
when addressing within the stack is desired.
An instruction set which is organized around this
machine tends to be more complex than a 360 type instruction
set. For this reason more time is spent accessing M1 and
executing micro code. This tends to make the duty cycle of the
processor higher than with a 360 type instruction set, which in
turn tends to reduce memory contention.
The processor's duty cycle and the parameter h are
directly related.
$$D = \text{duty cycle} = \frac{n_1 t_1}{n_1 t_1 + n_2\,t_{2eff}}$$
$$h = \frac{n_1}{n_1 + n_2}$$
$$D = \frac{h}{(1 - h)\,r + h} \quad \text{where } r = \frac{t_{2eff}}{t_1}$$
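The step from the definitions to the final expression is not spelled out. One way to fill it in, using only $W = n_1 + n_2$ and the definition of h from Section 4.2.1.1, is sketched below.

```latex
\begin{align*}
n_1 &= hW, \qquad n_2 = (1 - h)W \\
D &= \frac{n_1 t_1}{n_1 t_1 + n_2\,t_{2eff}}
   = \frac{hW\,t_1}{hW\,t_1 + (1 - h)W\,t_{2eff}} \\
  &= \frac{h}{h + (1 - h)\,t_{2eff}/t_1}
   = \frac{h}{(1 - h)\,r + h}
\end{align*}
```

Comparing this with the M2 request probability A of Section 4.2.1.1 gives D = 1 - A, so a higher duty cycle corresponds directly to fewer M2 requests.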
4.3 Operating Memory and Memory Management
The concept of a memory hierarchy, discussed in rela-
tion to M1 and M2, can be extended to the relationship between
M2 and M3. For large file oriented systems archival storage,
M4, is also considered.
4.3.1 Background
Since program and data can only enter the computation
process via M2 one must control the flow of information across
the hierarchy of memory. This control is the job of memory man-
agement.
Virtual memory is a technique for managing the utili-
zation of memory in processing systems where program space
exceeds the actual operating memory space. The concept has evol-
ved from the need to improve on early attempts to utilize limited
amounts of memory by overlaying. This required the user to par-
tition his program into pieces which fit into the available
space, and then plan the sequence of execution of the pieces and
control their reading into and out of operating memory. As pro-
gram requirements grew larger than a few thousand words, this
became a cumbersome task. To help the programmer, automatic
overlaying (folding) techniques, by the operating system with
compiler assistance, were developed. But eventually it became
clear that a system should allow a distinction to be made
between address space, a set of identifiers used by a program to
reference information, and memory space, the set of physical
operating memory locations [6].
Since a program could be allocated any physical M2 loca-
tions, the addresses contained within the program string must be
relative and not contain any absolute M2 reference. A
translation mechanism must map the relative addresses into
absolute M2 addresses. Many machines employ a concatenation of
the address
field of the instruction with the contents of some specified base
register. Other schemes employ a descriptor mechanism, which
is used to provide an indirect reference. In either case, the
relative address is first presented to a memory map mechanism
which determines if the desired element is in M2 or whether an
M3 fetch is required. Figure 4.7 indicates the basic operations
involved in memory management.
The memory mapping mechanism usually employs a limited
associative memory to contain the most recently referenced
addresses. In the 360/67, the contents of the base register are
funnelled through an associative memory. In the Intermetrics
multiprocessor concept the descriptor's address field is
translated via an associative memory.
The first suggestion for achieving virtual memory was
published by Manchester University in England, in 1962 [7].
Virtual memory has subsequently been implemented in a number of
ways, most notably in systems designed to service a large user
body generating an unpredictable load and mix of processing
jobs (e.g., the HIS 6000 in the MIT Multics system, and the
IBM 360/67). The mapping mechanism requires address informa-
tion to be organized into blocks. Two basic schemes have been
defined for handling these blocks:
a) Segmentation organizes address space into a collection
of segments which are mapped into variable sized memory
blocks
b) Paging organizes memory space into "pages" of fixed
size.
(Figure, upper part: the processor presents a relative address
to the memory map, which yields an M2 address. If the addressed
quantity is not in M2 it is fetched from M3, and displaced data
are moved to M3.
Lower part, flow from IN to OUT: form relative address A from
the program string; go to the memory map to find the physical
M2 location; test whether A is in M2. If yes, fetch (A). If no,
allocate storage space in M2 to receive (A), if necessary
deleting some information from M2 and writing modified data into
M3; fetch (A) from M3; update the memory map and store (A) in
M2; then fetch (A).)
Figure 4.7: Memory Management
4.3.2 Segmentation
Since segmentation is concerned with the modularity and
structure of the program it is visible and controllable by the
programmer, although usually indirectly through the use of a
language. He determines the size of the segment, and attaches
its name. Each segment may be considered as an independent
virtual memory. Internal to each segment, addressing is
relative to the beginning of the segment, and thus becomes
independent of addresses in any other segment. This property
results in what has been termed two-dimensional addressing:
segment number followed by location number. McKeeman [8] points
out that this addressing structure is employed in a number of
modern programming languages, such as ALGOL, PL/I and FORTRAN.
It is also a property of HAL. These languages use a pair of
numbers to represent an address: the first number corresponds
to the nesting level (lexical level) of the occurrence of the
declaration of the name of the address, and the second
indicates the occurrence of the name within that level in the
program. The elements of a segmentation implementation mechanism
are shown below:
(Figure: the segment address S selects an entry in the segment
table; that entry is combined with the word address W to form
the memory address.)
Figure 4.8: Elements of Segmentation
Segments are located by reference to a table, each
entry of which is a segment descriptor defining the segment's
base address a and its size b. The position of the descriptor
in the table is S from the base of the table. A reference to
an address in name (address) space is of the form S,W. The
component S locates the desired segment's descriptor in the
table. If it is not in the table (i.e., the segment itself
is not in operating memory) a missing-segment trap occurs.
The segment is then brought into operating memory from mass
storage, and its descriptor is placed in the table. A test
whether W > b is made to check if a programmer's reference is
out of bounds of his own segments. Then the location a' in
physical memory to which the name space address S,W refers is
formed by a' = (a + W). This address translation mechanism
can be realized in special hardware, with a set of special
associatively addressed registers. Or the tables can be
accommodated in operating memory, with all translations
performed by multiple levels of indirect addressing. The latter
approach involves two or more memory accesses per reference and
results in a considerable penalty.
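The translation steps described above can be sketched in a few lines of Python. This is an illustrative model only: the dictionary standing in for the segment table, the exception names, and the function `translate` are hypothetical, and the bounds test mirrors the W > b comparison given in the text.

```python
class MissingSegmentTrap(Exception):
    """Raised when segment S has no descriptor in the table."""


class BoundsViolation(Exception):
    """Raised when word address W lies beyond the segment size b."""


def translate(segment_table: dict, s: int, w: int) -> int:
    """Map the name space address (S, W) to a physical address a' = a + W.

    segment_table maps S to a descriptor (a, b): base address and size.
    """
    descriptor = segment_table.get(s)
    if descriptor is None:
        # The segment is not in operating memory and must be brought in.
        raise MissingSegmentTrap(s)
    base, size = descriptor
    if w > size:  # the W > b test from the text
        raise BoundsViolation((s, w))
    return base + w


table = {3: (1000, 50)}  # segment 3 starts at address 1000 and has size 50
print(translate(table, 3, 10))
```

In a hardware realization the dictionary lookup would correspond to the associatively addressed registers; holding the table in operating memory would add a memory access to every call.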
The segmented addressing scheme offers several attrac-
tions for a large and diverse software system such as the space
station central multiprocessor:
a) Program modularity. Program modules are organized into
distinct, separately named and controlled segments.
b) Variable data structures. In a system such as the space
station, the data base will contain large and complex
data structures which will vary in size and content
during use. By creating segments of such structures
they may be assigned just the memory they require.
Their manipulation is well controlled.
c) Protection. A high degree of access control can be
provided by the segmented approach through indirect
addressing coupled with access privileges which
constrain read and write operations within a given
segment.
d) Program sharing. By enabling one physically stored
module to be known in different address spaces under
different segment names, it may be directly shared be-
tween two or more users. This obviates the usual prac-
tice of creating copies of multiply used routines, and
consequently economizes on memory space.
4.3.3 Paging
Operating memory address space is divided into a number
of equal sized pages. Each page is identified by the memory lo-
cation of its first word. Words within the page are referenced
by word number w from the first word. A page is referenced by
its position p in the page table. A virtual memory address, a,
is equivalent to the pair p,w (in a similar fashion to segment
addresses). The total number of active pages may not exceed the
page capacity of operating memory. Those pages not being execu-
ted are transferred to the next level of storage, thus realizing
the concept of virtual memory. Since all pages are equal in
size, replacement involves only the problem of finding the neces-
sary equal-sized "holes" in operating memory. "External" frag-
mentation of memory need not occur.
Page availability is maintained in the page table. The
pth entry in the table is the memory location of the page
containing address a, where p = integer [a/Z], and Z = page size.
If the pth entry is missing, the page does not reside in memory,
and must be fetched. This condition is referred to as a missing
page trap. If the page is present, the referenced word is the wth
element of the page, where w = remainder (a/Z).
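The page table lookup reduces to one integer division. A minimal Python sketch follows; the dictionary page table, the exception name and the function `translate_page` are assumptions made for illustration.

```python
class MissingPageTrap(Exception):
    """Raised when the pth entry of the page table is missing."""


def translate_page(page_table: dict, a: int, Z: int) -> int:
    """Map virtual address a to a physical address for page size Z.

    page_table maps page number p to the memory location of the page's first word.
    """
    p, w = divmod(a, Z)  # p = integer[a/Z], w = remainder(a/Z)
    first_word = page_table.get(p)
    if first_word is None:
        raise MissingPageTrap(p)  # the page must be fetched from the next level
    return first_word + w


page_table = {0: 4096, 2: 1024}  # page 1 is not resident
print(translate_page(page_table, 522, Z=256))
```

Address 522 with Z = 256 falls in page 2 at word 10, so the reference resolves to word 10 of the page stored at location 1024.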
Paging is attractive to the system designer as a tech-
nique for physical memory allocation, because of the regularity
of the equal-sized pages. It is attractive to the programmer
because he is relieved of the concern of allocating physical sto-
rage, and, indeed, need never exercise any direct control over
the mechanism.
A major design decision is the choice of page size. A
large page, say over 1000 words, may result in a high proportion
of unused page space, if natural program modules are smaller than
the page size. This is referred to as "internal" fragmentation.
With a small page, less than 10 words, an overhead problem arises
due to the large number of pages that must be controlled. The
best page size is determined by:
a) Program locality
b) The speed ratio between memory hierarchy levels
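The tension between internal fragmentation and page-count overhead can be made concrete with a small calculation. In the Python sketch below the module sizes are invented for illustration, each module is assumed to start on a fresh page, and `paging_cost` is a hypothetical helper name.

```python
def paging_cost(module_sizes, page_words: int):
    """Return (pages_needed, unused_words) for the given program modules.

    Assumes each module begins on a page boundary.
    """
    pages = sum((size + page_words - 1) // page_words for size in module_sizes)
    unused = pages * page_words - sum(module_sizes)
    return pages, unused


modules = [300, 120, 700, 50]  # illustrative module sizes in words
for Z in (1024, 64, 8):
    pages, unused = paging_cost(modules, Z)
    print(f"Z={Z:>4}: {pages:>3} pages to control, {unused:>4} words unused")
```

A large page wastes most of the allocated space on these modules, while a very small page wastes almost nothing but multiplies the number of pages that must be controlled.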
Paging cannot achieve some of the advantages of segmen-
tation that were identified previously, because page boundaries
bear no natural relationship to program content. Segmentation,
on the other hand, lacks the advantages of a fixed size. It re-
quires the availability of contiguous regions of space, of suffi-
cient size to contain the segment. The problem of searching for
and/or creating variously sized "holes" in memory is a much more
difficult task than matching pages to page spaces.
It is natural to contemplate a combination of the two
mechanisms in order to realize both their advantages. The large
Multics system at MIT has been the only example of a heavily
developed segmented and paged memory management scheme.
4.3.4 Implementing Virtual Memory
When implementing a virtual memory system a number of
properties are desired to minimize overhead:
a) An efficient memory map search. This is usually
achieved by employing a limited associative memory to hold
the most recently used page or segment descriptors.
b) An efficient M2 space allocation algorithm.
c) An efficient determination of the M3 address in the
case of a missing page or segment trap. The utilization
of a descriptor containing an M2 or M3 address,
depending upon the state of the presence bit, is
convenient.
d) One must attempt to minimize fragmentation of memory
into small unusable portions. A memory compaction al-
gorithm might be required.
e) One must minimize the possibility of overloading the
system to the extent that thrashing occurs. Thrashing
is a state which is reached when memory management be-
gins spending all its time moving pages or segments in
and out of M2 and overlaying pages or segments in use.
No time is left for processing applications programs.
Thrashing can be minimized by providing sufficient M2
and by keeping the unit of memory management small.
Figure 4.9 indicates Intermetrics' approach to memory
management via a descriptor-based, stack-oriented structure.
Absolute M2 addresses are only contained in "Mom" descriptors.
Only one "Mom" segment descriptor for a program or data segment
may exist. Many "Copy" descriptors may be created with a
pointer to the "Mom". This pointer is a two-dimensional address
specifying a stack number and offset (SNO). The SNO is the
relative address which must be translated into a physical M2
address. The 32 most recently referenced (SNO) addresses are
contained in the associative memory. The contents are updated
automatically whenever a reference is made. If the SNO
reference is found within the associative memory, the "Mom"
descriptor which contains the absolute M2 address is retrieved
from local storage (M1) and the operating memory address is
obtained.
(Figure: a "Copy" descriptor holds the pointer to the "Mom" as a
stack number (SN) and offset (O). The stack number and offset
are presented to the associative memory. If the SNO is in the
associative memory, the "Mom" descriptor in high speed local
storage M1 supplies the M2 address. If it is not in the
associative memory, an indirect fetch through the lexical level
0 stack vector in M2 yields the stack pointer, the "Mom"
descriptor is located within the stack, and the associative
memory is updated.)
Figure 4.9: Addressing Via Stack Number and Offset
If the associative memory does not contain the referenced
SNO, then a three level indirect addressing sequence
through M2 is executed. The first level fetches the stack pointer
from within the stack vector using the stack number as the
relative address from the base of the stack vector. The second
level of indirection is used to fetch the "Mom" descriptor
using the offset part of the address as the positioning
relative to the base of the stack.
If the referenced segment is not in M2 it must be
fetched from M3. This is indicated by a "presence" bit
contained in the "Mom" descriptor. If the segment is present
within M2 it is referenced directly. In either case the
associative memory is updated so the "Mom" descriptor can be
referenced more directly the next time.
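The lookup sequence can be summarized in a short Python model. This is a sketch under stated assumptions, not the proposed hardware: the class and field names are hypothetical, the stack vector is modeled as a dictionary of lists, the 32-entry associative memory drops its least recently referenced entry when full, and `fetch_from_m3` is a caller-supplied stand-in for the M3 transfer.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class MomDescriptor:
    present: bool   # the "presence" bit
    address: int    # M2 address if present, otherwise the M3 address


class SnoTranslator:
    def __init__(self, stack_vector, fetch_from_m3, capacity: int = 32):
        self.stack_vector = stack_vector    # stack number -> list of stack entries
        self.fetch_from_m3 = fetch_from_m3  # hypothetical: M3 address -> new M2 address
        self.capacity = capacity
        self.associative = OrderedDict()    # (stack number, offset) -> MomDescriptor
        self.m2_indirections = 0

    def lookup(self, stack_number: int, offset: int) -> int:
        key = (stack_number, offset)
        mom = self.associative.get(key)
        if mom is None:
            stack = self.stack_vector[stack_number]  # level 1: stack pointer
            mom = stack[offset]                      # level 2: "Mom" descriptor
            self.m2_indirections += 2
            if len(self.associative) >= self.capacity:
                self.associative.popitem(last=False)  # least recently referenced
            self.associative[key] = mom
        else:
            self.associative.move_to_end(key)
        if not mom.present:
            mom.address = self.fetch_from_m3(mom.address)  # segment brought into M2
            mom.present = True
        return mom.address


stacks = {1: [MomDescriptor(True, 5000), MomDescriptor(False, 77)]}
translator = SnoTranslator(stacks, fetch_from_m3=lambda m3: 9000 + m3)
print(translator.lookup(1, 0), translator.lookup(1, 0), translator.m2_indirections)
```

The second reference to the same SNO is satisfied from the associative memory, so the indirection count does not rise.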
References for Chapter 4
1. CRC Standard Mathematical Tables, 19th Edition, p. 570.
2. Wilkes, M.V., "Slave Memories and Dynamic Storage
Allocation", IEEE Trans. EC-14, April 1965, pp. 270-271.
3. Gibson, D.H., "Considerations in Block-Oriented Systems
Design", Proc. SJCC 1967, pp. 75-80.
4. Green, J.P., "Mass Memory Parametric Data", Task Report
MD-101, Intermetrics/NAR, June 1971.
5. "Mini-wire Sale Completed", Computer Design,
September 1971, p. 12.
6. Denning, P.J., "Virtual Memory", Computing Surveys,
September 1970, pp. 153-189.
7. Kilburn, T., et al, "One Level Storage System", IRE
Trans. EC-11, April 1962, pp. 223-235.
8. McKeeman, W.M., "Language Directed Computer Design",
Proc. FJCC 1967, pp. 413-417.
Chapter 5
ADDRESSING
The question of addressing is the most dominant feature
in the diversity of instruction architectures. It can be viewed
from many different angles: correspondence to software, ease of
usage for programmers, bit minimization, physical implementation
and execution, hierarchies of memory, and/or operating system
memory resource allocation. We shall discuss several of these
aspects and show various options or methods that may be employed.
5.1 Addressing and Instruction Architecture
When an instruction architecture is contemplated sev-
eral different independent decisions with regard to addressing
within an instruction must be reached. The number of operands
which an instruction can contain may vary from three, two, or
one explicit operand(s) to implied operands, where the implied
operands are to be obtained from a stack. The question as to
how many hardware registers, of what type, and how they are to
be addressed arises (single accumulator or "general" register,
hardware "top of stack" for a depth of two, ...). Finally,
exactly how is memory to be addressed: all memory addressable,
two-dimensional addressing, self-relative, etc.
5.1.1 The Number of Operands in an Instruction
Most operations which occur in algebraic languages are
dyadic operators. That is, the operation manipulates two inputs,
transforming them into a new output value. It is seen that dyadic
operators (+, -, ÷, x, ...) have three operands: two input
operands and one output operand. There are, of course, monadic
operations such as negate or absolute which have two operands:
one input operand and one output operand.
Instruction architectures vary as to the number of ex-
plicit memory-addressed operands which appear within the instruc-
tion, yet, of course, the necessary three operands for dyadic
operators must be present. (Two operands for monadic operators.)
Three memory operand instructions are found in several
machines including the Honeywell 800/1800 series. However,
when the actual usage of dyadic operators is examined it is seen
that seldom are three different memory addresses needed. Consi-
der for example:
A = B;
A = A + 1;
A = - B/C
In these, admittedly biased, examples the use of the three
memory address operands is wasteful. In the first example, there
is but one input and one output. In the second, one of the in-
puts is also the output address, and in the third a monadic op-
erator appears.
The waste, or non-use, of a memory address is only bad
in so far as it takes room. If the instructions are of the
three-operand form and not all three operand memory addresses
are used, the instruction still must save space for the presence
of these memory addresses which are many bits in length. It is,
therefore, usually found advantageous to have at least one of
the three operand addresses implied.
Two memory operands are occasionally met with in the
instruction architecture. In this case one of the two operands
besides being an input is usually also the output operand. The
IBM 1401 is such an example. This form of two operands can be
very useful where most of the operators are monadic such as is
commonly found in data processing where much of the computer
time is spent in moving data and editing them.
The most common architecture found is based upon single
memory address operand instructions. This is common in both the
second and current third generation computers such as the IBM
7090, IBM 360 series, Univac 1108, and the DEC PDP-10. With the
single memory address operand an accumulator (or another "regis-
ter") becomes an implied operand for the instruction. Commonly
then the implied operand serves both as one input operand and the
output operand of a dyadic operator. When monadic operators are
used one operand can be the memory address and the other the im-
plied accumulator. When the third generation of computers de-
veloped, the "implied" accumulator was often made into a set of
general registers of which one could be selected to be the ac-
cumulator. This has led to the characterization of the 360 as
having a 1.5 operand instruction set.
The single memory address form of instruction is very
useful when sequential accumulation of results occurs, such as
in:
A = B + C + D + E;
However, if a tree structure form of computation is needed (as
commonly occurs) such as:
A = (B + C) * (D + E);
the accumulator would have to be saved after calculation of
B+C before D+E can be done. One of the hopes of the general
registers development in the third generation computer with
multiple accumulators was to be better able to do efficient
calculations of this form (i.e., save on storage to memory for
temporaries).
One of the principal advantages of having fewer memory
operands with each instruction is in the space savings to
be found by not having useless fields in all instructions.
That is, it would be desirable to use instruction space for
memory address operands only when they are needed. The ultimate
in this form of space savings is to be found in the zero
memory address operand instruction. In this case all of the
necessary operands for an operator are implied. These are the
stack machines where the "top of the stack" provides the necessary
number of operands for an operator and the resultant output
value is in turn placed upon the stack. The Burroughs
B5500 and B6700 are examples of such machines. The memory address
operands, of course, must be able to be fetched from memory
and stored into memory. These are, in effect, merely two
forms of operands.
This stack form of instruction is one of the most ef-
ficient ways in which to specify an algorithm since only the
minimum amount of information needed for execution need be
present.
The stack itself can be considered in several ways.
From a HOL point of view the implied operands of the stack correspond
to many of the parse algorithms which have been developed
for compilation and hence are able to produce extremely efficient
code. From a multiple register point of view the stack provides
a method for the dynamic assignment of the general registers
rather than the static assignment at compilation time with
its inherent inefficiencies.
5.1.2 Single Accumulator and General Registers
While many second generation computers had a single
accumulator, third generation machines have tended to have a
set of general registers. This has come about for several dif-
ferent reasons. Each reason stems from the basic desire for
more efficient and quicker execution. As was seen above, a
single accumulator does not make for efficient execution of
tree structured statements. Therefore, if several accumulators
were available, storing into memory for a temporary could be
avoided; this would save both time and space since memory would
not have to be referenced. Also technology, by the third gen-
eration, had improved to the degree of allowing more complex
hardware in the processor. Thus, multiple accumulators could
be implemented.
Another aspect is invoked with the addressing of mem-
ory. Second generation machines often had separate index re-
gisters from the accumulator; these then needed a separate set
of instructions for their manipulations and similarly they were
then restricted in the operations which could be performed on
them (e.g., no multiplications with an index register). The
third generation often has truly general registers which can be
either accumulators or index registers (or base registers) thus
optimizing on the resources of the speedy registers for use as
needed.
The desire to use more accumulators was based on the
desire to improve the speed of computation by having fewer mem-
ory references and by doing manipulations and operations with the
general register set. Unfortunately, this very desire forces
the introduction of bookkeeping instructions to set up the registers
so that they can be manipulated. It is often difficult
to tell from instruction occurrence statistics for an IBM 360
whether the large number of loads (L, LH, IC) used are to keep
the register policy happy or are rather a by-product of improve-
ment.
When both base registers and index registers are avail-
able their usage is often confused. Base registers are primarily
used to address physical locations. They provide the capability
of addressing particular regions of core. Their value interpre-
tation is that of a physical memory address. Index registers
are used to locate an element within an ordered data structure.
They refer to data elements which are to be manipulated and do
not inherently indicate physical addressing. If a character
array is being indexed, then the elements are in byte units, if
word integers are being referred to, the index actually refers
to four byte quantities (in the 360). Because this distinction
is not maintained the automatic quality of element indexing can-
not in general be performed. (In the 360 the SLL instruction
proliferates in order to align the "index" properly.)
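The alignment step mentioned here can be illustrated with a short Python sketch. It is an illustration only, assuming 360-style byte addressing with one byte characters, two byte halfwords and four byte words; the names ELEMENT_SHIFT and byte_offset are hypothetical and are not taken from any machine manual.

```python
# Shift amounts that turn an element index into a byte offset,
# assuming byte addressing with 1, 2 and 4 byte elements.
ELEMENT_SHIFT = {"character": 0, "halfword": 1, "word": 2}


def byte_offset(element_index, element_type):
    """Scale an element index to a byte offset with a left shift.

    This is the work that an explicit shift instruction has to do
    when the index register holds an element number rather than a
    byte displacement.
    """
    return element_index << ELEMENT_SHIFT[element_type]
```

With these assumptions the fifth character lies 5 bytes into its array, while the fifth word lies 20 bytes into its array, so the same index value cannot be used for both without the extra shift.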
One other major problem can develop with the use of a
set of general registers. This is the question of how to opti-
mally use them. A choice has to be made as to which registers
are to be used for accumulator(s), base register(s) or index
registers. The static assignment of the use of the registers
unfortunately does not often correspond to the optimum dynamic
usage. This is the case since the flow of control through
an executing program is simply not known. Alternate paths
of execution exist since this is what the execution of an algorithm
is about. The use and savings of registers for one branch
of an IF...THEN...ELSE statement is, in general, entirely different
from that of the other branch. Similarly when a subroutine
is entered, the use of registers within the subroutine
is not correlated with use in the calling procedure (where the
CALL may be issued at distinct locations each with different
register usage).
While it could be truly argued that in any case multiple
registers are better than one, the actual policy imple-
Above, it was seen that in the zero memory address operand
form of instructions a push down stack mechanism is used
for operands. This, by its very nature, tends to optimize the
usage of the hardware registers available for accumulations and
index registers. When a subroutine is entered, the dynamic environment
stack continues to push and pop as needed for the subroutine
and hence it acts as an automatic dynamic optimization.
When optimizing is tried in code generation by trying
to identify common sub-expressions (e.g., I + 1 in A(I+1) = B(I+1)),
the stack can become inefficient. (The code needed both in time
and space to save and restore I + 1 is (can be) more than the
actual recalculation of I + 1.)
5.1.3 How to Address Operating Memory
Various methods of addressing physical memory are found
in instruction architectures. All of memory may be addressed,
a bank of memory may be addressed, addresses may be relative to
the executing instruction or, while only a small portion of memory
may be directly addressed, the rest could be addressed "indirectly".
Machines such as the IBM 7094 addressed all of memory.
This form of addressing implies that the memory address operand
must have the number of bits needed to represent all of memory.
Not only is this wasteful, since usually only a small portion
of memory is needed in the execution of a program, but it also
limits the size of memory which can then be used with the in-
structions.
In order to both reduce the size of the memory address
operand field and to remove the restriction on memory size
(or at least increase the limit beyond foreseeable needs), two
dimensional addressing is introduced. This addressing can be
in fixed banks where a certain block of memory is in use (in
the Apollo Guidance Computer there were four "banks" address-
able at any given time: fixed erasable, fixed fixed, banked
erasable and banked fixed) and hence memory address operands
then refer to addresses within the current block, or a more
dynamic form of banking can occur as in the 360 where a base
register points to a starting location and a displacement field
then refers to an offset from the base.
Thus with the use of 16 bits, 4 bits to indicate base
register and 12 bits of displacement, the IBM 360 is able to
address up to 24 bits (16 megabytes) of memory. The penalty, of
course, is the overhead which must be paid in the setting, us-
ing, and maintaining of a base register and the restriction to
a maximum displacement of 4K bytes in a program segment without
the setting of another base register (or the resetting of the
current base register).
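The base and displacement arithmetic just described can be sketched in Python. This is a minimal sketch, not a model of the real hardware: the function name, the register list and the error handling are assumptions made for the illustration, and indexing is ignored. The one detail added from general knowledge of the 360 is that a base field of zero means that no base register is applied.

```python
ADDRESS_MASK = (1 << 24) - 1  # 24 bit byte address (16 megabytes)


def effective_address(registers, base, displacement):
    """Form an address from a 4 bit base field and a 12 bit displacement."""
    if not 0 <= base < 16:
        raise ValueError("base register number must fit in 4 bits")
    if not 0 <= displacement < 4096:
        raise ValueError("displacement must fit in 12 bits (4K bytes)")
    # A base field of zero contributes nothing, which leaves 15
    # registers usable as base registers.
    base_value = registers[base] if base != 0 else 0
    return (base_value + displacement) & ADDRESS_MASK
```

For example, with register 12 holding 0x012000, a displacement of 0x0FF gives address 0x0120FF. Any location more than 4095 bytes past the base cannot be reached without setting another base register, which is the overhead noted in the text.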
Another form of two dimensional addressing appears in
those computers which have been designed for the execution of Al-
gol (e.g., B6700). Since the instructions to be executed are
reflective of Algol, the data referred to must reflect the name
scope restrictions of Algol. The B6700 makes effective use of
the name scope restrictions in Algol to have its "base" regis-
ters (i.e., Burroughs Display registers) set automatically to
the dynamic environment of the addressable data. The B6700 "base"
register points to each succeeding lexical level which is ad-
dressable within name scope rules. The displacement then refers
to a particular entity within the lexical level.
Besides having base registers, as in the 360, which are
able to address any region of core, many architectures allow "indirect"
addressing. By referring to an address word which is
within the area which you can address, you are allowed to "indirect"
your reference through this address word to what it points to.
Thus, while only a small portion of memory may be "directly" addressable,
all of memory becomes addressable.
It is apparent that when the 360 was designed, the in-
crease to 16 general registers from one accumulator and a few
index registers seemed so magnificent that the need for indirect
references was deemed not necessary. (The 4 Pi AP-1 which is a
flight computer by IBM modified from the 360 instruction set has
restored indirection.) It turns out that the use of a few in-
direct references could save immense overhead on register usage
and allocation.
When data is being addressed, the actual number of en-
tities (variable "names", e.g., A,B ... in a program) involved,
in general, is small. This comes from the simple limitation of
the human programmer. The amount of storage, however, may be
large (e.g., arrays of data, single or multiple dimension).
When an element in an array is referred to, an index is used.
This phenomenon has the very nice property of making the base-displacement
form of addressing attractive. While entries can
be directly addressed, arrays can be indexed into. The number
of different data areas is also generally limited, again due to
programming language restrictions and conventions and hence the
number of different data regions is in general small and therefore
the number of base registers for data addressing is in
general not too large.
Instructions have other characteristics. Often a routine
will [illegible] exceed the 4K byte displacement allowable with
IBM 360 addressing from one base register. Addressing of a code
segment within a code segment is concerned with control flow and
usually has a very local nature. This brings one final form of
addressing: self-relative addressing. Often branches occur to
simply skip one instruction, or a few as in an IF...THEN...ELSE.
By using self-relative addressing for control flow within an instruction
stream a very high degree of size compaction can occur;
it becomes automatically relocatable without changing any
code and the restrictions (e.g., 4K bytes per base register) of
the code segment length can be removed.
5.2 The IBM 360 and Burroughs B6700
In order to gain an appreciation of the difference in
addressing structures, a comparison between the IBM 360 and the
B6700 is given.
5.2.1 Two Dimensional Addressing (Static and Dynamic)
In order to process large computational jobs a large
amount of addressable space is needed, but with a second gen-
eration machine such as the 7090 all of this space (and hence
the limit of the memory size) must be addressable. In this
case then, it was necessary to use 15 bits in every operand ad-
dress. The IBM 360 and B6700 both have two dimensional addres-
sing. The IBM 360 uses a 12 bit displacement which is to be
added to one of 15 base registers. This allows for a full 24
bit addressing (of bytes) scheme. Here 24 bits of address space
has been compressed into 16 bits of information. The B6700
scheme uses only 14 bits with its operands, where the "base"
(DISPLAY) register is defined, with only the number of bits
needed to indicate the current lexical level (ll) (i.e., ll=1
implies 13 bit displacement, ll=2 implies 12 bit displacement)
and the B6700 displacements refer to "words". Since program segments
in the B6700 are described via a "descriptor", the actual
size of memory which could be addressed is only limited by the
number of bits so used in the descriptor. In point of fact,
Burroughs uses 20 bit word addresses in their descriptors.
It is easy to see then that if the memory of a computing
system is large compared to the modular size of "programs"
(or perhaps even procedures and routines), program string savings
are to be found by using a two dimensional address.
There is a great difference, however, between the IBM
360's and B6700's two dimensional addressing schemes. The IBM
360 base registers are assigned "statically" at compile time, and
it is up to the compiler to try and optimize base register usage.
This optimization is minimal if only one base register is needed
within a segment. This becomes difficult in large segments since
the dynamic characteristics of the segment modularization must
be considered.
This static two dimensional addressing of the IBM 360
has several aspects.
a) By using 4 bits everywhere for base registers the displacement
range is reduced, since seldom are that many
registers desirable.
b) If a program is "one big" segment, then several base
registers are needed and segment boundaries must be
carefully watched.
c) If the base registers are set upon entering and upon
returning to each module then:
1) There must be code to do this in the program
strings.
2) Name scope problems arise when variables in a
previous level are to be addressed since their
base registers are in general no longer in existence.
The B6700 optimizes upon the two dimensional address
idea by:
a) using only the number of bits necessary for the current
lexical level to indicate the number of bits for the
"base register". This leaves the rest of the bits for
displacement. (There is also the fortuitous circum-
stance experienced by all, that the more "inner" a sub-
routine the "smaller" it is, i.e., it needs less dis-
placement to fully address it.)
b) The base registers point at the beginning of each dynamic
module, hence allowing the displacement to reach
its most extreme logical dynamic range.
c) Since the usage of the "base" (display) register is
unique and well defined (versus general, e.g., base
register, an accumulator or an index register), the
initialization and resetting of them can be accomplished
automatically. Furthermore, no explicit code in the
program string is required and current dynamic name
scope is maintained.
5.2.2 Implicit Addressing
Compare the expression:
A = B + C;
on the B6700 versus the IBM 360:
B6700        IBM 360
VALC C       L   R0, C
VALC B       A   R0, B
ADD          ST  R0, A
NAMC A
STOD
In each case they execute similarly: (fetch C) , (add B to this
value) and (store value into A) . In effect it is the only se-
quential form possible (i.e., ADD before STORE) for this expres-
sion.
However, when temporary locations become necessary a
difference appears in the code, although the total effect must,
of course, remain the same. Consider A = (B + C) * (D + E):
B6700        IBM 360
VALC C       L   R0, C
VALC B       A   R0, B
ADD          ST  R0, TEMP
VALC E       L   R0, E
VALC D       A   R0, D
ADD          M   R0, TEMP
MULT         ST  R0, A
NAMC A
STOD
Assuming that there are only a few (in our case exactly one) ac-
cumulators being used, during the expression evaluation it becomes
necessary to create a temporary.
The creation of a temporary indicates an increase in the
program size for two reasons.
a) In general, the use of temporaries is a static decision
and hence cannot behave better than the dynamic usage
of the stack. Therefore, one needs more "temporary sto-
rage" locations than stack storage.
b) But more importantly, in the IBM 360 type of machine,
every instruction has an operand, therefore, the tem-
porary requires an address which in turn takes space.
The B6700 uses implicit addressing; the needed number
of operands coming from the appropriate number of loca-
tions on top of the stack.
When temporaries are needed, most often an implicit ad-
dress scheme allows for the savings of "temporary" operand addres-
ses.
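The comparison for A = (B + C) * (D + E) can be replayed with two toy interpreters in Python. This is a sketch under simplifying assumptions: the mnemonics follow the listings in this section, but the interpreters are hypothetical and ignore word sizes, register pairs and every other hardware detail. The helper address_fields simply counts how many instructions carry an explicit memory name.

```python
STACK_PROGRAM = [("VALC", "C"), ("VALC", "B"), ("ADD",),
                 ("VALC", "E"), ("VALC", "D"), ("ADD",),
                 ("MULT",), ("NAMC", "A"), ("STOD",)]

ACCUMULATOR_PROGRAM = [("L", "C"), ("A", "B"), ("ST", "TEMP"),
                       ("L", "E"), ("A", "D"), ("M", "TEMP"),
                       ("ST", "A")]


def run_stack(program, memory):
    """Zero address form: operators take their operands from the stack."""
    stack = []
    for op, *arg in program:
        if op == "VALC":            # push the value of a variable
            stack.append(memory[arg[0]])
        elif op == "NAMC":          # push the name of a variable
            stack.append(arg[0])
        elif op == "ADD":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif op == "MULT":
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        elif op == "STOD":          # name on top, value beneath it
            name, value = stack.pop(), stack.pop()
            memory[name] = value
    return memory


def run_accumulator(program, memory):
    """Single address form: one implied accumulator, one memory operand."""
    acc = 0
    for op, name in program:
        if op == "L":
            acc = memory[name]
        elif op == "A":
            acc += memory[name]
        elif op == "M":
            acc *= memory[name]
        elif op == "ST":
            memory[name] = acc
    return memory


def address_fields(program):
    """Count the explicit memory names carried by the instructions."""
    return sum(len(instruction) - 1 for instruction in program)
```

Both programs store the same value into A, but the stack form carries five explicit names while the accumulator form carries seven, two of which name the temporary.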
5.2.3 Descriptors
Descriptors can be considered either as sub-operators
or as the ideal data structure which is being manipulated. When
considered in the first manner, it is seen that the descriptor
saves on the program string length. "Fewer" operators need be
specified since the "sub" part of the operator is found in the
descriptor of the data structure. For example, the IBM 360 has
for "add":
AH, A(R), AL, AE(R), AD(R), AU(R), AW(R), AP
while the B6700 has simply "ADD". This of course requires fewer
opcodes, and in turn fewer numbers of bit states for the neces-
sary operators.
When the descriptor is regarded as the "data structure",
it shows at least two virtues. One is the fact that by being
"semantically concise" (further discussed below) it places into
one location the complicated description of the data structure,
which thereby need not be repeated in multiple references in
the program. The other is the observation that the number of
entities which are manipulated by a program is small. The reason
that a large addressing space is normally necessary is that if
the machine does not have descriptors, then each "memory cell" of
the data structure must be directly addressable. The example of
an array of 100 scalars on the IBM 360 is in fact 100 memory locations.
On the B6700 it is one entity: a descriptor which indicates
the dimensions of 100 and where it is to be found in
physical core. This very important phenomenon reduces the addressing
requirements of a program string, since the full physical
memory address need only appear in the descriptor. The descriptor
becomes one of the "few" entities which must be addressed
and hence only a small address field is needed in the program
string proper.
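The idea of addressing an array as a single entity can be sketched in Python. This is a hypothetical model, not the B6700 descriptor format: the field names base and length, the word-addressed arithmetic and the bounds check are all assumptions made for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptor:
    """One addressable entity standing for a whole array."""
    base: int     # full physical word address of the first element
    length: int   # number of elements (the dimension)


def indexed_address(descriptor, index):
    """Physical address of one element, reached through the descriptor."""
    if not 0 <= index < descriptor.length:
        raise IndexError("index outside the described array")
    return descriptor.base + index
```

The program string only has to name the descriptor, so a short address field is enough; the long physical address is held once, in the descriptor itself.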
5.2.4 Type Differences
Descriptors allow any information which can be "bottle
necked" to be placed in the descriptor once, instead of having
the information repeated throughout the program string.
Besides having character data (for I/O) and an inter-
nal arithmetic form, most machines have in fact several internal
forms. The difference between the "character" and "internal
arithmetic" comes largely from the savings yielded by compactly
storing and manipulating them in the internal form. The various
internal forms come from considerations of preciseness.
Types can be optimized by:
a) making one a proper subset of another (e.g., integer is
a subset of single precision floating point on the B6700).
Thus, the difference between the operators disappears
(except for an explicit operator to recover the proper
subset; such as INTEGERIZE).
b) the need for multiple forms of the same operator dis-
appears (e.g., IC, LH, L, LD, LE)
c) and the need for explicit type conversion operations is
reduced.
d) The program string could be further minimized
by providing an explicit operator for each type conversion
when needed (e.g., scalar to character, while integers
to scalar would be implicit by the integer definition
as a subset of scalar).
5.2.5 Semantic Conciseness
Probably the most powerful way to save in program string
length is by having semantically compact operators. By having
the operator correspond to the operations indicated in the pro-
blem language being executed, the minimum amount of translation
is needed and hence the minimum amount of expansion in the pro-
gram string.
The Burroughs B6700 is an "Algol" machine. Its operators
are those that ALGOL indicates.
The IBM 360 is semantically concise only to "BAL" which
is merely stating a tautology. The IBM 360 is not semantically
concise to any real "problem oriented language".
Besides being semantically concise with respect to the
operations needed for a problem the operators can be "semanti-
cally concise" in the way in which they are constructed. Branch-
ing occurs within a program under execution and not logically
with respect to all of physical memory. The IBM 360, as most
machines, allows the branch address to be any address of physi-
cal memory. The B6700 uses relative addressing (that is, re-
lative to the program under execution) either in the same or dif-
ferent segment. This of course reduces the address space neces-
sary, since it corresponds to the dynamic space involved at ex-
ecution time. The RC4000, although built upon similar concepts
as the IBM 360, has relative addressing, and this in turn creates
an efficient and small(er) addressing need.
In the IBM 360 each memory reference instruction gen-
erally carries 4 bits of indexing information. The B6700 in-
dexes only when needed, and since a stack is used (hence impli-
cit addressing) only an 8 bit operator is needed (which can
also load the resultant indicated entity). Assuming that not
every memory reference needs to be indexed (the indices them-
selves must be fetched from memory) the use of indices when
needed, (and semantically concise operations make the need less)
will, in most every case, minimize the program string length.
The use of short literals also compresses the program
string since the constants used are usually small integral
values. Recognition of this fact allows for their representations
in the amount of space needed and not the amount for the
worst (largest) case possible.
5.3 Implementation Aspects of a Stack Machine
5.3.1 Definitions
The stack provides the mechanism through which implicit
addressing can be accomplished in a semantically concise
and efficient manner. The control sequencing and addressing
with the stack will be discussed in this section. A specific
implementation is presented. Details can vary from machine to
machine. However, the fundamental ideas will remain the same.
In a sense the stack is a hardware element just as the
arithmetic unit is an element. It can execute three primitive
commands:
a) PUSH
The PUSH command will take the contents of the stack
buffer register and place it on top of the stack. Sim-
ultaneously it will shift all other elements of the
stack down one level. For example, the old top of the
stack becomes the second entry in the stack.
b) POP
The POP command fetches the top of the stack and places
it in the stack buffer register. Simultaneously, it
shifts the contents of all other elements of the stack
up one level. For example, the old second entry of the
stack becomes the new top of the stack.
c) Stack Fetch
PUSH and POP store or retrieve information from the
top of the stack. In many instances, information is
desired from other stack locations. The Stack Fetch
sequence accomplishes this function by fetching from
the stack location (indicated by the lexical level
and displacement) and placing the information in the
stack buffer register. Stack Fetch does not change
the state of the stack in any way.
One could implement the stack as a word parallel-shift
register. This would fix the length and make it a specialized
element of the computer. In order to achieve generality and
flexibility in the design, we choose to implement the stack by
employing a standard linear memory array with some specialized
pointers. These elements are manipulated by micro code to
create the three control sequences.
In general, the length of the stack can vary during
execution from tens of words to thousands of words. For this
reason the bulk of the stack must, due to practicality, be contained
in M2. However, the more dynamic part of the stack (we
choose 8 locations) can be placed in M1 for faster access. For
the purpose of the following description the stack word size and
M2 word size are assumed to be the same.
5.3.2 PUSH
The PUSH sequence, whose flow chart appears in Figure
5.1, involves both the M1 and M2 portion of the stack. Figure
5.2 depicts these two portions and provides definitions of the
various pointers used to control the stack.
The M1 portion of the stack can be pictured as a
wrap-around shift register. The oldest data is pointed to by
M1SL (M1 Stack Limit). The first empty location is pointed to
by M1TOS (M1 Top of Stack). Whenever M1TOS = M1SL, namely the
M1 portion of the stack overflows, the contents of (M1SL) is
moved into the M2 location indicated by M2TOS (M2 Top of Stack).
If M2TOS ever equals M2SL (M2 Stack Limit) then the M2 part of
the stack has overflowed and a trap is generated. The stack
overflow trap routine could then, depending upon conditions,
allocate more storage for stack use and change M2SL.
The data to be entered into the top of stack is contained
in M1BR (M1 Buffer Register). Upon entrance to the
routine M1TOS is compared with M1SL to see if the M1 portion of
the stack has overflowed. If it has not, an M1 write is set up.
The M1 address is M1TOS and the data is contained in M1BR.
Finally, M1TOS is incremented, modulo 8, before the exit.
If the M1 stack overflows, a determination is made as
to whether the M2 part of the stack will overflow. If so, a
trap is entered. If not, an M2 write is set up, using the M2
address indicated by (M2TOS) and the data pointed to by (M1SL).
M1SL and M2TOS are incremented, followed by the M1 write set up.
Figure 5.1: PUSH
(Flow chart. The recoverable box labels, in order of flow, are:)
Enter, with data in M1BR
M1 empty?
M1 stack overflow?
Enter Stack Overflow Trap
Set up M2 write: (M2TOS) -> M2 address, (M1SL) -> M2 data
(M1SL + 1) -> M1SL, (M2TOS + 1) -> M2TOS
Set up M1 write: (M1TOS) -> M1 address, (M1BR) -> M1 data
Reset M1 empty bit in SR, [(M1TOS + 1)] mod 8 -> M1TOS
Exit
Legend
M1BR = M1 Buffer Register
M1TOS = M1 Top of Stack
M1SL = M1 Stack Limit
M2TOS = M2 Top of Stack
M2SL = M2 Stack Limit
M1 empty = bit in status word
Figure 5.2: The Stack
(Diagram. The M1 Part of Stack is drawn as a wrap-around buffer:
the full locations lie between M1SL and M1TOS and the remaining
locations are empty. On M1 stack overflow, data moves to the
M2 Part of Stack, which is marked by M2BOS, M2TOS and M2SL.)
5.3.3 POP
The POP sequence is shown in Figure 5.3. If the M1 part
of the stack is empty, an M1 stack underflow condition exists
and a read from M2 must be initiated with an M2 address of
(M2TOS) - 1. The M2 word is placed into M1BR. On the other
hand, if the M1 stack is not empty, the contents of (M1TOS) - 1
is read and placed into M1BR.
If the stack becomes empty the condition is set into
the status register (SR) for input to the next POP sequence.
Every PUSH sequence will reset this empty condition.
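The PUSH and POP sequences of 5.3.2 and 5.3.3 can be sketched together in Python. This is a simplified model under stated assumptions, not the micro code itself: the class and attribute names are hypothetical, the M2 part is held in a Python list whose length plays the role of M2TOS, an empty flag stands in for the M1 empty bit in the status word, and M1SL is advanced modulo 8 so that the M1 part wraps around.

```python
class StackOverflowTrap(Exception):
    """Raised where the text enters the stack overflow trap."""


class TwoLevelStack:
    M1_SIZE = 8  # the eight fast locations chosen in the text

    def __init__(self, m2_limit=1024):
        self.m1 = [None] * self.M1_SIZE
        self.m1sl = 0          # oldest entry held in M1
        self.m1tos = 0         # first empty M1 location
        self.m1_empty = True   # stands in for the M1 empty status bit
        self.m2 = []           # M2 part; len(self.m2) acts as M2TOS
        self.m2sl = m2_limit   # M2 stack limit

    def push(self, m1br):
        if self.m1tos == self.m1sl and not self.m1_empty:
            # M1 overflow: move the oldest M1 word into M2 first.
            if len(self.m2) == self.m2sl:
                raise StackOverflowTrap("M2 part of the stack is full")
            self.m2.append(self.m1[self.m1sl])
            self.m1sl = (self.m1sl + 1) % self.M1_SIZE
        self.m1[self.m1tos] = m1br
        self.m1tos = (self.m1tos + 1) % self.M1_SIZE
        self.m1_empty = False  # every PUSH resets the empty condition

    def pop(self):
        if self.m1_empty:
            # M1 underflow: read the word at (M2TOS) - 1 from M2.
            if not self.m2:
                raise IndexError("pop from an empty stack")
            return self.m2.pop()
        self.m1tos = (self.m1tos - 1) % self.M1_SIZE
        m1br = self.m1[self.m1tos]
        if self.m1tos == self.m1sl:
            self.m1_empty = True  # remembered for the next POP
        return m1br
```

Pushing ten words into this model leaves the two oldest in the M2 part and the eight newest in the M1 part; popping ten times then returns them in last-in, first-out order, the final two coming from M2.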
5.4 Effective Address Generation (EA) (Lexical Level Offset Addressing)
Within the instruction architecture of a stack oriented
machine, there exists a class of instructions which refer to information
within the stack. Whenever one of these instructions is
encountered an effective address (EA) must be calculated. The
sequence to be presented depicts a specific design [1]. In general,
the details of EA calculation might be different. However,
some form of addressing with the stack must be provided.
The format of the class of instruction referencing ex-
plicitly the stack is:
# of bits:    1     2     5     8
content:     ...   code   A2    A1
The address couple A2||A1 forms a 13 bit field
A12, A11, A10, ..., A0 which is interpreted as follows:
a) The lexical level indicator, ll, is the key to the interpretation
of A2||A1. The first step is to find the
positive integer m, where:
2^(m-1) <= ll < 2^m
b) Form Field 1 where
Field 1 = A12, ..., A(13-m)
c) Fetch from M1 the base register specified by Field 1.
Denote this base register by BRm.
d) BRm is in Stack Number, Offset representation.
e) Next Field 2 is formed
Field 2 = A(12-m), A(11-m), ..., A0
f) Finally the effective address (EA) is formed where
EA = (BRm) + Field 2
This addition only occurs to the offset portion of
(BRm).
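Steps a) through f) can be sketched in Python. The sketch assumes that the inequality in step a) reads 2^(m-1) <= ll < 2^m, which is the reading under which ll = 1 gives a one bit level field; the function names and the dictionary used for the base registers are hypothetical.

```python
FIELD_BITS = 13  # width of the address couple A2||A1


def level_bits(ll):
    """Positive integer m with 2**(m-1) <= ll < 2**m."""
    if ll < 1:
        raise ValueError("lexical level indicator must be positive")
    return ll.bit_length()


def effective_address(ll, couple, base_registers):
    """Split the couple into Field 1 and Field 2 and form the EA.

    base_registers maps a Field 1 value to a base register held as a
    (stack number, offset) pair.
    """
    if not 0 <= couple < (1 << FIELD_BITS):
        raise ValueError("address couple must fit in 13 bits")
    m = level_bits(ll)
    field1 = couple >> (FIELD_BITS - m)               # A12 .. A(13-m)
    field2 = couple & ((1 << (FIELD_BITS - m)) - 1)   # A(12-m) .. A0
    stack_number, offset = base_registers[field1]
    # Only the offset portion of the base register takes part in
    # the addition.
    return stack_number, offset + field2
```

For ll = 2 the top two bits of the couple select the base register and the remaining eleven bits are the displacement, so a couple with Field 1 = 2 and Field 2 = 5, applied to a base register of (0, 40), gives the effective address (0, 45).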
5.5 Stack Fetch
When information is required from any location except
the top of stack, a stack fetch sequence must be executed (see
Figure 5.4).
The main test to be performed is to determine whether
the information to be fetched is in the M1 or M2 part of the
stack. This is accomplished by the calculation of the displace-
ment DISP. Information is then read from either M1 or M2 and
placed in the M1BR.
Reference for Chapter 5
1) Intermetrics, Inc., "Final Report -- Engineering Study
for the Functional Design of a Multiprocessor System",
Prepared Under Contract NAS9-11745, September 1972.
Chapter 6
I/O CONSIDERATIONS
The I/O interface of the computer which serves the central
computational and control element of the manned space station
is likely to be characterized by the following observations:
a) There will be a large number and variety of interfaces
with diverse avionics equipments. The recent Phase B
Space Station analysis has advocated the use of a time-shared,
high speed (10 MHz) avionics data bus to simplify
the problem of meeting this requirement. The I/O
implication of such a data bus will be discussed in this
report.
b) The computational speed and storage capacity requirements
of the Space Station are such as to make the multiplexing
of operating memory an attractive economical proposition.
(The cost of storing one bit in a core or
plated wire memory is over one thousand times the cost
of storing it on a disk.) Until the more exotic, non-
moving media, secondary storage technologies (such as
magnetic bubbles) become fully operational, the more
conventional magnetic drum and disk will probably pro-
vide the mass storage capability on the early space sta-
tions. The relatively long access time of these devices
has made it necessary to treat the problem of getting
information in and out of them as an off-line task in
parallel with the main computational functions. This
chapter will discuss the use of a drum or disk as the
tertiary level of a memory hierarchy and as the pri-
mary storage for files.
c) Although the Space Station central multiprocessor will
possess the powers of a typical large ground based com-
puter facility, it is not anticipated that its work load
will encompass as wide a variety of jobs, languages, or
users. Perhaps of even more importance, the work load
will be much more predictable. This is certainly true
of the operational requirements, and even the eventual
experimental support function will probably be fairly
carefully tailored to the available facility. The
implication of this for the I/O function is that there is
less need for a highly generalized interface to a wide
variety of the conventional peripheral equipments, and
much less need for the sophisticated data management faci-
lity usually found in the I/O hardware and software for
controlling these peripherals and providing for the orderly
management of a large number of files. It will be
assumed that the only need for standard peripheral I/O
channels in the planned SUMC MP will be to satisfy the
needs of a laboratory environment (e.g., card reader,
line printer, operators' console), and that the eventual
operational I/O will be performed almost entirely through
the avionics data bus.
d) The emphasis on the generation, processing and record-
ing of large amounts of data from experiments places
the high density, high speed tape store into a special
category of space station I/O device. Even if an impro-
ved bulk storage technology is eventually employed in
this function, the need for transferring and retrieving
large blocks of data from archival storage at rates on
the order of several million bits per second will still
have to be met. This data originates at the experiment
sensors, and enters the system for processing and reduc-
tion via the main data bus, which, as will be seen, can
typically supply 2.5 million information bits per sec-
ond. It is felt that a more specialized interface than
just another port on the bus is required for this I/O
function.
The major impacts of these observations on the I/O hard-
ware and software will now be discussed.
6.2 Data Bus I/O
In order to make more than sweeping generalizations, some
assumption of data bus characteristics must be made. Studies to
date [1] have shown that an initial Earth Orbital Space Station
can be serviced by a data bus whose elements are shown in Figure
6.1, and which has the following typical characteristics.
Multiplexing                   TDM
Frequency                      10 MHz
Number of devices (stations)   256
Command structure              Command/response
These are the important control characteristics from the point
of view of I/O communication.
Command/response implies central computer control. Bus
I/O takes place only at the behest of the computer; no device
may volunteer information. It is our opinion, however, that
although a strict C/R control policy may be shown to be quite
adequate at this stage of Space Station development, it will be
advantageous to provide a bus interrupt capability. This is
not so much in order to provide the devices with control auth-
ority, but rather it is in order to allow the bus control unit
(BCU) the ability to off-load the computer I/O routines of chores
such as error monitoring, detection of unusual conditions, response
to unsolicited communication from Station subsystems, etc.
Local processing at the device level has been proposed to
off-load from the bus any high speed repetitive functions (such as
strapdown inertial system algorithm evaluation). It is expected
that bus communication between computer and device will be com-
posed of short blocks of data from one to several bytes in length,
typically 1 to 128. Data transfers of larger blocks (e.g., CRT
display frames, experimental data recording) are usually not time
critical, and may be achieved by repeated bus I/O. If 8 bytes
suffice for device address and address echo check, and assuming
a typical 80:20 mix of short (4 byte) and long (128 byte) bus
communications, the time to service 256 devices is derived below:
Message    Control bytes     Data    Total per device    Number of    Total
type       Command   Echo    bytes   Bytes     Bits      devices      bits
Short         4        4        4      12        96         206       2 x 10^4
Long          4        4      128     136      1088          50       5 x 10^4
All messages                                                          7 x 10^4
A complete service cycle of all devices on a maximally
configured bus thus generates 70K bits. For a 10 MHz
transmission frequency this cycle can be repeated every 7 milliseconds.
In practice, delays due to finite transmission speeds will
increase the cycle time, but a 10 ms to 20 ms bus service cycle
seems to be entirely achievable. A 20 ms cycle, with the pre-
ponderance of long communications assumed, will generate about
300K bytes/sec of actual data, i.e., a data rate comparable to
that of the higher speed storage devices such as drums, disks,
and tapes. However, a data bus differs significantly in the
manner in which this data is addressed and controlled.
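The service cycle arithmetic above can be reproduced with a short calculation. This is a sketch under stated assumptions: the 10 MHz bus is taken to carry one bit per cycle with no gaps between messages, and all names are illustrative rather than taken from the report.

```python
BUS_RATE_BITS_PER_S = 10_000_000   # assumption: 10 MHz bus carries one bit per cycle
CONTROL_BYTES = 4 + 4              # command plus echo

# message type -> (data bytes per message, number of devices), from the table above
MIX = {"short": (4, 206), "long": (128, 50)}


def bits_per_service_cycle():
    """Total bits needed to service every device once."""
    return sum((CONTROL_BYTES + data) * 8 * devices for data, devices in MIX.values())


def service_cycle_ms():
    """Minimum service cycle time in milliseconds, ignoring transmission delays."""
    return bits_per_service_cycle() / BUS_RATE_BITS_PER_S * 1000.0


def data_bytes_per_second(cycle_seconds):
    """Actual data (control bytes excluded) delivered per second for a given cycle time."""
    data_bytes = sum(data * devices for data, devices in MIX.values())
    return data_bytes / cycle_seconds
```

Without rounding, the cycle comes to 74,176 bits, or roughly 7.4 ms at 10 MHz, and a 20 ms cycle yields roughly 361K data bytes per second. Both agree with the rounded figures of 70K bits, 7 ms, and "about 300K bytes/sec" quoted in the text.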
The type of bus described is essentially a table-driven
device: in practice, communication between the computer and the
avionics devices will occur as follows:
a) A number of device interfaces will need to be accessed
for real time data at the highest service cycle frequency,
i.e., every 10 ms to 20 ms.
b) Others will require accessing periodically, but at lower
frequencies than the maximum.
c) Some will require occasional sampling at random intervals.
d) Some devices may be attached but may not be components
of a computer activity. Nevertheless, their status and
health must be continuously known.
e) The remaining interfaces may not even be attached.
The mix of devices in each category is a function of mission phase
and/or station operations. It is a delicate design problem to
ensure that all the highest frequency requests are complete with-
out exceeding the basic bus service cycle, and without losing
some of the less frequent requests. Since these constraints are
known only to the system implementer, specific bus configuration
should not be wired-in to the hardware (or system software) of
the computer or I/O controller.
The device accesses can be organized into a set of I/O
tables. Each table contains the list of accesses to be accom-
plished at a given frequency. Figure 6.2 illustrates an example
of such a table, made up of entries for bus I/O to be
accomplished for K = 1 (every service cycle), K = 2 (every other cycle),
K = 4 (every fourth cycle), and so on up to K = 64. K need not
be in powers of two, but it is felt that this makes table
mechanization much easier, and is not a serious burden to the avionics
system implementer.
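The frequency-organized tables described above can be sketched as follows. This is an illustrative model only: the mapping from K to a list of requests, the rule that a K table runs when the cycle count is a multiple of K, and the device names are all assumptions, not details given in the report.

```python
def entries_due(tables, cycle):
    """Return the bus I/O requests to execute in the given service cycle.

    tables maps K (service every K-th cycle) to a list of requests.
    """
    due = []
    for k in sorted(tables):
        if cycle % k == 0:          # the K table is serviced every K-th cycle
            due.extend(tables[k])
    return due


# Hypothetical request names, for illustration only.
tables = {1: ["gyro"], 2: ["star_tracker"], 4: ["thermal"]}
```

With this simple rule every table falls due on cycle 0, so the load per cycle is uneven. A real table layout would probably stagger the lower-frequency entries so that the highest-frequency requests always fit within the basic service cycle.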
Each entry in the table is a request for bus I/O. Such
a request may consist of one or more words with fields which
contain the following information:
| IOC Command | BCU Command | Bus Command | Device Operand | Memory Address |
Figure 6.3: Typical Bus I/O Request
a) I/O controller field specifying I/O channel type (i.e.,
bus), channel number, channel command.
b) Bus controller field specifying special instruction
to BCU (e.g., table update, check device status, etc.)
c) Bus command field specifying device address and bus
operation (e.g., read, write, set mode, get status,
etc.)
d) Device operand field specifying operation to be performed
by specific avionics subsystem (interpretation known
only to device)
e) Destination field: address and length of memory area
in which result of bus I/O is to be placed, or from
which output is to be taken.
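One possible in-memory representation of such a request is sketched below. The report does not give field widths or encodings, so plain integers are used and every field name and example value is a hypothetical choice made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BusIORequest:
    """One entry of a bus I/O table (field names are illustrative, not from the report)."""
    ioc_command: int      # channel type, channel number, channel command
    bcu_command: int      # special instruction to the bus control unit
    bus_command: int      # device address and bus operation
    device_operand: int   # interpreted only by the addressed device
    memory_address: int   # start of the memory area for the result or the output
    length: int           # length of that memory area


# Hypothetical example values.
request = BusIORequest(ioc_command=1, bcu_command=0, bus_command=0x2A,
                       device_operand=7, memory_address=0x1000, length=4)
```

Making the record immutable reflects the idea that a table entry is a fixed description of an access; only the data area it points to changes from cycle to cycle.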
As each I/O request is executed the appropriate data
is transferred between memory and device. The question now arises:
how is the table of I/O requests to be interpreted and where does
it reside? Several alternatives present themselves:
a) It resides entirely in main (operating) memory and each
entry is treated as a separate I/O request to the software
executive I/O routines. If there is a large number
of high frequency entries this will create an I/O
bound condition, and much process swapping in a
multiprogrammed environment.
b) The table of I/O requests resides in the I/O controller
and is executed there independent of main processing.
Only the result of each request is transferred to memory.
This relieves the interface between the I/O controller
and the operating memory of traffic generated
by control statements.
c) The table of I/O requests and the resulting data re-
side in buffer storage local to the I/O controller.
Data transfer is in block updates between minor cycles.
The progression from a) to c) implies an increasingly
elaborate I/O controller. It also incurs the problem of buf-
fering the bus I/O data. If a user program no longer has the
ability to place each individual request, then it has no
knowledge of when an update to (or from) the requested bus device
is made. This is especially critical for blocked data, where
it is essential to ensure homogeneously updated elements of the
block. A mechanism for preventing multiple access to data blocks
must be provided such as a TEST and SET operator, or multiple
buffers with switchable pointers: the first incurs delays (cri-
tical to an I/O process), and the second consumes memory space.
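The second alternative, multiple buffers with a switchable pointer, can be sketched as follows. This is a single-threaded illustration with hypothetical names. In a real multiprocessor the pointer switch would have to be an atomic operation, which this sketch does not attempt to model.

```python
class DoubleBuffer:
    """Two copies of a data block; readers always see the last complete update."""

    def __init__(self, size):
        self._size = size
        self._buffers = [[0] * size, [0] * size]
        self._front = 0   # index of the buffer currently visible to readers

    def read(self):
        """Return a copy of the most recently published block."""
        return list(self._buffers[self._front])

    def update(self, block):
        """Fill the hidden buffer, then publish it with a single pointer switch."""
        if len(block) != self._size:
            raise ValueError("block length must match buffer size")
        back = 1 - self._front
        self._buffers[back][:] = block
        self._front = back
```

The cost noted in the text is visible here: the block occupies twice the memory, in exchange for readers never waiting on a writer.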
The localization of bus I/O in the IOC allows high
frequency bus-computer communication to be conducted without
the several milliseconds delay normally associated with I/O
devices such as drums or disks, and obviates the need for pro-
cess swapping to maintain throughput. The low frequency or
random bus communication can be handled in a conventional fash-
ion as a single I/O event. Such requests can be treated as
temporary insertions into the bus I/O request tables, which are
removed when serviced by the bus. Completion of the request
can be signalled by an I/O complete interrupt. Division of bus
I/O requests into repetitive and random categories depends on
the trade-off between IOC complexity, I/O buffer size, bus ser-
vice frequency, and throughput.
6.3 Mass Storage I/O
The most critical function of secondary storage is
as part of the multiplexed operating memory hierarchy. Whether
the technique employed organizes memory into fixed size blocks
(pages) or variable sized blocks (segments), it is essential to
be able to locate and transfer to and from secondary storage
fairly large amounts of stored information (from tens to
thousands of words), in a minimal time.
The traditional disk or drum memory systems possess
characteristically long latency and/or access times (on the
order of tens of milliseconds), and data transfer to those
devices is performed in parallel with other CPU activity by an
independent processor. It is anticipated that early Space
Stations will still employ rotating magnetic storage devices and
that I/O will continue to be concerned with their optimal usage.
It is important to realize that a subsequent change to solid
state mass storage (with little or no access delay) can radi-
cally modify the concept of memory multiplexing, to the point
where it may not be done via the I/O controller. In the present
discussion, we will assume the conventional core to disk inter-
face requirement.
The major concerns with optimal usage of the disk are:
a) Since access times are long (typically 10 to 100
milliseconds), but transfer rates are high (typically 5 to
10 MB/s), it is desirable when a request for a missing
memory block is honored, that as much "useful" asso-
ciated information is transferred along with the spe-
cified block, since the cost of so doing is relatively
low. This involves maximizing the "locality" of the
executing program which creates the I/O request, or
otherwise anticipating its accessing behavior.
b) Since requests for data take so long to honor, it is
probable that, by the time a requested block is located
and transferred, the requesting process is
no longer running. It becomes desirable to allow
the memory management to determine, at its convenience,
when to alert waiting processes of their completed I/O
requests. Rather than signalling the system via an
"I/O complete" interrupt, as is usually done, this may
be done by causing a table of completed I/O requests
to be accumulated by the I/O controller, and only when
no further requests are pending, causing the I/O
controller to interrupt the system to notify it that
all requests have been expedited. A "quiet" I/O complete
scheme such as this is expected to greatly minimize
the "thrashing" of memory transfers that occurs
when operating memory becomes overcommitted.
c) The assignment of disk space can become as critical as
that of operating memory. For a high degree of memory
multiplexing, disk space can become badly "fragmented"
with use, necessitating a compacting or rearranging of
the assignment of files. In a real-time system it may
require prohibitively long search cycles to update all
references to files that are re-assigned. Disk addresses
can be organized in a central directory which maps
logical into physical address space. This can be
accomplished in main memory, at considerable cost of space,
or on the disk, at the cost of more complex hardware in
the disk controller.
d) Other traditional I/O problems (such as the trade-off
between I/O request frequency and I/O buffer space
in main memory, and the related question of logical
file blocks and how to assign them to a device that
is organized into physical records) still remain in
a Space Station environment. But, as stated in the
beginning, these questions are of less significance
in an environment whose work load and user requirements
are less variable and more known. A less generalized
approach to file directory management may be possible
than is found in general purpose ground-based facili-
ties such as the larger IBM 360 installations.
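The "quiet" I/O complete scheme described in item b) can be sketched as follows. This is a minimal model, assuming the I/O controller keeps a set of pending requests and a table of completed ones; the class, its methods, and the interrupt callback are hypothetical names introduced for illustration.

```python
class QuietCompletionController:
    """Accumulates completed requests and interrupts only when none remain pending."""

    def __init__(self, interrupt):
        self._pending = set()
        self._completed = []          # table of completed I/O requests
        self._interrupt = interrupt   # called with the whole table when the queue drains

    def submit(self, request_id):
        self._pending.add(request_id)

    def complete(self, request_id):
        self._pending.remove(request_id)
        self._completed.append(request_id)
        if not self._pending:
            table, self._completed = self._completed, []
            self._interrupt(table)    # one interrupt for the whole batch
```

The design choice is to trade prompt notification of each request for fewer interruptions of the system, which is acceptable when the requesting process has usually been suspended anyway.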
6.4 I/O Controller Design
This section will describe the functional elements of
a proposed I/O controller design. Detailed implementation
questions are beyond the scope of the present contract. Figure 6.4
illustrates the basic functional elements.
6.4.1 Central Control (CC)
The central control unit provides for the decoding of the
I/O operations, for the initiation and synchronization of
commands, and for data transfers between the units. The CC contains
an arithmetic unit and the logic required to perform conditional
decisions. The sequences issued by CC are stored in a micro
control memory and are initiated via commands from the various
interfaces.
6.4.2 Interprocessor Communication Interface (IPCI)
Some mechanism is clearly required for communicating
between processors and the I/O controller. This is necessary
for interprocessor interrupts, I/O commands, and recovery from
processor faults.
The IPCI provides the interface to the interprocessor
communications bus. One may reasonably question whether a
separate interprocessor communication interface is required.
Cannot all the communications go through M2?
If all the interprocessor communications occur by
writing into M2 and reading from it, then the answer to the
above question is no! The overhead due to constantly polling
M2 would waste processor time and create excessive M2 conten-
tion.
If processor communication uses the internal bus, as
a communications media, by-passing M2, then the answer is pro-
bably yes. The use of the internal bus as the communications
media is just an implementation decision. The fact remains that
distinct communication between processors and between processor
and I/O must occur, outside of M2. The logical decisions per-
formed by IPCI must exist whether a physically separate inter-
processor communications bus (IPCB) is employed or not.
A wide variety of signals are communicated over an
IPCB. Some are between processors. Others involve I/O trans-
actions. Some examples are given below:
a) If local memory M1 is employed, then a potential problem
exists in updating common information (for example,
descriptors) contained within different M1's.
The control of the updating requires interprocessor
communications.
b) The loading (initialization) and dumping (for a process
swap) of M1 can be triggered within a processor
or commanded from another processor (in case of an
error condition).
c) When a processor fails or detects an M2 failure this
information must be signalled to another processor.
d) All the commands issued by I/O executive routines must
be sent to the I/O controller over some communicating
link.
e) All "done" or "error" interrupts generated by or passed
on by the IOC must be steered to a processor over
a communications link.
6.4.3 Operating Memory Interface
This interface element controls access to memory by
the various channels. It is, in effect, the DMA channel for
the I/O controller. The priority as to which I/O interface has
access when contention exists is fixed. The following is sug-
gested:
Priority 1 (highest) Channel 1: The devices which
operate in the burst mode must be serviced at a rate consistent
with their data rate. M3 can possess a data rate of up to 10
MBPS, which is three to six times less than the M2 data rate.
However, channel 1 devices cannot sustain a large delay between
a request for an M2 transfer and the final servicing of the
request since the addressed record is usually not fully buffered
and M2 and the auxiliary device must be synchronized during a
data transfer.
Priority 2 Channel 2: The devices which are driven
by tables in the local memory of channel 2 present to M2 a
data rate three to six times less than that of channel 1. Yet,
if too much delay is introduced in each M2 transfer, the minor
and major cycle times might be exceeded.
Priority 3 Central Control: When the CC receives a
command over the IPCB it often has to fetch an I/O control word
from M2. While this fetch can be delayed a reasonable amount
of time, queueing of too many IPC commands before execution
must be avoided.
Priority 4 Channel 3: The devices attached to channel
3 are all slow speed and involve only a few bytes per transaction.
A delay of ten to even one hundred M2 cycles will not appreciably
affect the performance of these devices.
Priority 5 (lowest): Since the interrupt priority
and timer elements of the I/O unit do not use M2 to a significant
extent, these elements are placed in the lowest priority
category.
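The fixed priority ordering suggested above can be sketched as a simple arbiter. The unit names are hypothetical labels for the five priority levels, and the sketch models only the selection rule, not the timing of M2 cycles.

```python
# Highest priority first, following the ordering suggested in the text.
PRIORITY_ORDER = [
    "channel_1",        # burst-mode devices (disk, drum, tape)
    "channel_2",        # table-driven data bus channel
    "central_control",  # fetches of I/O control words
    "channel_3",        # slow speed unit record equipment
    "interrupt_timer",  # interrupt priority and timer elements
]


def grant(requesters):
    """Return the unit granted the next M2 access, or None if nobody is requesting."""
    for unit in PRIORITY_ORDER:
        if unit in requesters:
            return unit
    return None
```

A fixed order like this is simple to mechanize but can starve the lowest levels under sustained load, which the text accepts because those units tolerate long delays or make little use of M2.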
6.4.4 Channels
These control the interface to the device categories
defined previously, namely:
a) the high speed disk (or drum) and tape
b) the avionics data bus
c) slow speed unit record equipment.
Each channel will contain buffer capacity appropriate to the
device, and a set of instructions tailored to the control re-
quirements of the device.
6.4.5 Interrupt Handler
Although not a unique location for the interrupt con-
trol mechanism, the I/O controller often contains this function.
There is some advantage in handling external interrupts and pro-
cessor traps with the same mechanism.
6.4.6 Timer
The real time aspects of the MP system require access
to a precise time standard. Also the capability of generating
an interrupt at a predetermined time, probably by means of a count-
down mechanism, is required. Each counter must be addressable
from a processor for initialization or readout. These counters
are placed inside the I/OC for convenience, thus saving the
cost of providing a unique piece of equipment.
6.5 I/O Configuration Organized for Recovery
The I/O configuration presented in Section 6.4 indi-
cates that a single I/OC is capable of servicing the multipro-
cessor. If this design approach is taken, how can this single
I/O meet the requirements dealing with recovery from a failure?
If two or more I/O units are required for system oper-
ation then the recovery aspects of the I/O can be made very sim-
ilar to those of a processing unit. Each of the I/O units
would be configured like a processing unit with an M2 interface,
a special interface to the Processors via dual redundant commun-
ication links, an M3 interface, and a data bus to the outside
world. Single instruction Restart could be employed as the major
recovery mechanism.
Since only a single I/O unit is proposed to meet the
performance requirements, a triple-redundant I/O unit with voting
logic is a candidate design approach. Many transients are
completely masked in this configuration. If a permanent failure
occurs then the voting elements can be reconfigured to
comparators and the bad I/O unit taken off line for repair.
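The voting and comparator behavior described above can be sketched in software terms. This is an illustration of the logic only, with hypothetical function names; the report describes hardware voters, and nothing here is taken from an actual design.

```python
from collections import Counter


def vote(outputs):
    """Majority vote over three unit outputs.

    Returns (voted value, index of the disagreeing unit or None).
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count == 3:
        return value, None
    if count == 2:
        faulty = next(i for i, out in enumerate(outputs) if out != value)
        return value, faulty
    raise ValueError("no majority among the three outputs")


def compare(a, b):
    """Comparator mode, used after one unit has been switched out."""
    if a != b:
        raise ValueError("comparator mismatch")
    return a
```

A single disagreeing unit is masked by the vote and identified by its index. Once that unit is taken off line, the remaining pair can only detect a disagreement, not correct it.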
Figure 6.5 shows a possible redundant I/OC employing
the components described in Figure 6.4. The major features of
this configuration are described below.
a) The triple redundant I/O hard core contains the cen-
tral control, timers and the interrupt control. A
failure in this critical area will allow the system
to keep running without propagating the error.
b) In order to interface the TMR section with other dual
redundant interfaces, voters and switches are provided.
The S elements, which are controlled by their asso-
ciated I/O elements, are used to select which of the
dual redundant interfaces to accept data from. The V
elements vote upon the triple redundant I/O outputs
and produce dual redundant outputs. The voters will
automatically reconfigure to comparators and switch
out a faulty I/O where required.
c) The IPCB, M2, M3, and data bus are all postulated to
be dual redundant. For this reason their interfaces
are shown to be dual and they interface to the I/O
via the S's and V's. The multiplexer channel which
contains peripherals necessary to operate a laboratory
model is only shown as a simplex subsystem, with a
corresponding single interface.
d) It is assumed that all the peripheral devices attached
to the data buses and the M3 controller possess char-
acteristics which will aid in the recovery process.
These characteristics include:
1) hardware to aid in fault isolation between dual
redundant threads
2) sufficient buffering, so that aborted commands
cannot hang up a subsystem
3) the ability to be reset and to indicate upon
request the status of the I/O device
e) Certain problems caused by locking of processes to I/O
devices must be resolved by the operating system. This
requires the capability of selectively deleting the I/O
command created by a process which is cancelled (either
purposely or as the result of a failure) from the
appropriate device queue. Also, the capability of relieving
any M2 space allocated as the I/O buffer area must be
provided.
One of the main motivations for a triple redundant
I/O central core is to reduce this problem as far
as I/O failures are concerned. A failure within the
central TMR I/O cannot propagate past the voters.
However, a voter or channel failure can cause a temporary
suspension of I/O or a re-issuing of an I/O command
and the associated problem of releasing any I/O locks.
Figure 6.5: Redundant I/O Configuration
(Diagram labels: Dual Redundant Data Bus; TMR Hard Core; Simplex Channel 3.)
Legend:
M2I    M2 Interface
DBCU   Data Bus Control Unit
IPCB   InterProcessor Communication Bus
IPCI   InterProcessor Communication Interface
I/O    I/O contains central control, interrupt control, timers
V      Voter
S      Switch
References for Chapter 6
1) North American Rockwell, Space Division, "Modular
Space Station Phase B Extension - Information Mana-
gement Advanced Development Report", Contract NAS9-
9953, MSC-02471, July 1972.
Chapter 7
FAULT TOLERANCE PHILOSOPHY FOR THE SUMC MULTIPROCESSOR
The purpose of this chapter is to present the study
results in terms of error detection, fault isolation and
recovery philosophy as applied to a multiprocessor system.
7.1 Requirements
The requirements postulated for the system, as a
result of the study, are delineated below.
a) The only interaction that the applications programmer
should possess with the fault tolerant aspects of the
system is to specify whether and under what conditions
a program or sequence of events is to be critical. A
critical program is defined to be one which must be re-
coverable in the event of a fault. A non-critical pro-
gram is one which need not recover.
When a program is classified as non-critical, certain
design considerations must be kept in mind. The abrupt
termination of a non-critical program in the middle
of any instruction should not create a situation
which will prevent the execution of other critical
tasks. Any Compool data which is used by a non-critical
program cannot be left locked. The failure of a
non-critical program cannot lock out a piece of peripheral
equipment from use by a critical program.
b) It seems reasonable that for certain applications a
recovery time of 10 to 100 ms could be required,
especially for certain real time control applications
with iteration rates of 10 to 50 times per second.
Other critical functions might take longer. The acceptance
of recovery times of 1 minute or more essentially
means that the program which is to be recovered does
not fall in the real time category.
7.2 Error Detection
The most fundamental conclusion that has been reached
in the error detection area is that detection of hardware failures
must be completely a hardware function. (We are confining our
discussion to faults within the internal structure of the mul-
tiprocessor. Peripheral I/O devices can, depending upon their
characteristics, employ central processor software to provide
diagnostic capability.) The above conclusion is based upon
the following reasoning:
a) An important aspect of any system which is to recover
from a fault is to detect an error within a period of
time which guarantees that the error hasn't propagated
to a point where recovery becomes impossible. Assum-
ing a given error is detected by a software self test
routine, it is generally impossible to determine what
information in memory has been incorrectly modified.
Without the ability to isolate the damage, repair
cannot be effected and recovery becomes unattainable.
Hardware error detection mechanisms such as parity,
comparators and specialized logic provide a continuous
monitoring upon the system. Software test routines
can only be executed periodically in time.
Error detection logic, properly designed, will
more nearly approach the goal of instantaneous error
detection which prevents the propagation of failures.
b) If software self-test were to be employed one must con-
sider the question of how long it will take to execute.
Hardware error detection need impose little if any
overhead upon the system performance. Software can
spend a considerable amount of time for two reasons:
i) To be comprehensive an extremely large number of
tests must be run.
2) They must be executed at a high frequency.
The unfortunate thing about software self-test in the
past has been that, in most cases, hardware was not
designed with self-test in mind. It was very diffi-
cult for the software to control precisely the hard-
ware state. Micro level diagnostics tend to allevi-
ate this problem to a degree. Because of an inabi-
lity to test easily all features of a system, self-
test software demonstrates the phenomenon that a
large percentage of equipment functions can be tested
with a relatively small amount of code, while the
final few percent of the equipment tests require a
very large amount of code.
c) The periodic nature of software error detection makes
transient error detection difficult. Two categories
of transients may be isolated:
1) Type 1 transients cause a temporary incorrect
electrical signal but do not change the state
of any storage element.
2) Type 2 transients occur at such a point in the
sequencing of a processor that incorrect storage
occurs. The hardware satisfies all tests that
can be invented, yet bad information may exist
which will eventually cause incorrect system
performance.
If a type 1 transient is not detected it hardly matters
to the functioning of the system. However, an
undetected type 2 transient could possibly be
catastrophic. An error detection philosophy which provides
a continuous monitoring at critical points is necessary
in order to prevent type 2 transients from going
undetected and propagating.
Micro diagnostics, although more comprehensive
and easier to write than software, must still be
executed on a periodic basis. Their ability to detect
transient failures must be seriously questioned.
d) The final point against software diagnostics as the
sole error detection mechanism is that failures can
occur which disable the execution of the software.
Therefore, the signalling of a fault condition cannot
occur.
7.2.1 Implementing Hardware Error Detection
Error detection is intimately involved with the specific
failure modes of devices and equipment. If the various failure
modes and the propagation dynamics of the failures are studied,
then, in specific instances, the addition of a moderate amount
of logic can detect the anticipated failures. On the other
hand, one would like to employ techniques which are not very
dependent upon the specifics of the equipment in order to pro-
vide a degree of flexibility and generality. The appropriate
decision between specialized and generalized error detecting
logic is a matter of engineering judgement.
7.2.1.1 Processing Unit: The processing units of the multipro-
cessor are the major sources of error propagation. If incorrect
write operations are executed, due to a failed component, then
the normal sequencing of the processing units, using this in-
correct data, can cause propagation of the error to other por-
tions of memory. Propagation of errors can extend beyond the
multiprocessor system, if incorrect I/O commands are issued
and executed. Because of the potential devastation caused by
a processing unit failure, a maximum design effort must be un-
dertaken to detect P failures before they propagate to other
parts of the system. Within the limits of practicality, an
effort must be made to detect almost all failures within P, be-
fore incorrect write operations or invalid I/O operations are
executed.
Based upon these objectives, the study conclusions sug-
gest that processing unit error detection be accomplished by em-
ploying two synchronized but independently operating processors
with a fail-safe comparator placed across the memory interface.
Some of the reasons for this conclusion are presented below:
a) Periodic software self-test cannot catch all failures
before they propagate to multiple errors.
b) Error detecting codes internal to the processing unit
cannot detect a large category of failures. For ex-
ample, the failure of a control signal can cause al-
most every bit in a word to be incorrect. The use of
arithmetic codes, such as a Modulo 3 check, produces
inconsistent results under operations such as AND, OR,
NOT.
c) It will require at least twice the logic, and incur
more than twice the cost, to detect all possible single
component failures in P. Therefore, the cost of a
dual P unit is reasonable.
d) The redundant processors can be packaged separately
with independent power distribution. This will more
closely meet the failure independence assumption.
e) Redundancy with a comparator at only one interface will
reduce the number of interconnections between the re-
dundant processors.
f) Errors are detected before bad outputs may propagate
from the P. The comparator placed at the output of P
might allow an error to propagate within P, but no
bad information leaves P.
g) If one were to design a processor considering error de-
tection as one of the main specifications, then each
module could be designed to detect its own errors. Ap-
propriate design efforts must be spent in maintaining
statistical independence between failures and preven-
ting errors in the error detection logic itself from
going undetected. This innovation to the logic design
effort would prove to be an interesting research topic.
As far as employing the present SUMC design as the
processing element of the multiprocessor, the use of
two SUMC elements with a comparator seems to be the
most reasonable approach.
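The remark in item b) about arithmetic codes can be made concrete with a small sketch. The Python fragment below is an illustration only and is not taken from the study; the function names are hypothetical. It shows that a Modulo 3 residue of a sum can be predicted from the residues of the operands, whereas the residue of a logical AND cannot, so a residue checker has nothing to compare against for logical operations.

```python
def residue3(x: int) -> int:
    """Modulo 3 check symbol of a non-negative integer."""
    return x % 3


def add_check_holds(a: int, b: int) -> bool:
    """Addition: the sum's residue follows from the operand residues."""
    return residue3(a + b) == (residue3(a) + residue3(b)) % 3


def and_residue_is_predictable(limit: int = 16) -> bool:
    """True only if residue3(a & b) depends solely on the operand residues.

    Exhaustively tries every operand pair below `limit`.
    """
    seen = {}
    for a in range(limit):
        for b in range(limit):
            key = (residue3(a), residue3(b))
            out = residue3(a & b)
            if seen.setdefault(key, out) != out:
                return False  # same operand residues, different result residue
    return True
```

For instance, the operand pairs (1, 1) and (1, 4) have identical residues, yet 1 AND 1 = 1 while 1 AND 4 = 0, so the check symbol of the result is not determined by the check symbols of the inputs.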
7.2.1.2 Memories: The irregular structure of the processor leads
one to consider the use of dual processors as a cost effective
error detection mechanism. Memory structures tend to be very
periodic in nature, possess little if any combinatorial logic
outside of the addressing area, and therefore, are more amen-
able to the use of error detection codes. Simple word parity
is a degenerate case of an error detection code.
Memory can be a significant contributor to the hard-
ware cost of a multiprocessor system. For this reason, tech-
niques other than brute force duplication of memory modules
should be considered for error detection purposes. Depending
upon the details of the construction of memories, different
techniques can be employed. The following suggestions are made
and seem to serve the purpose for most state of the art mem-
ory architectures.
a) Word parity can detect single memory cell failures,
sense amplifier failures, and other failures which
manifest themselves as single bit errors.
b) The incorporation of parity upon the address of the
word proves satisfactory in detecting the failure of
a single bit in the memory address register.
c) Employment of special current threshold circuitry
can detect the simultaneous selection of more than
one memory word at a time.
d) The use of a time-out indication can detect the fail-
ure of a memory module to sequence.
e) The use of a write-and-verify mode of operation, where
every word written into memory is immediately read
again, can verify correct storage. This is particu-
larly applicable to NDRO type memory structure. For
a DRO memory system one must face the problem that the
read operation which is used for verification must be
followed by a write-for-restoration of the data. A
failure can occur during the second write operation
which would go undetected until the stored word is
used again. However, the write-and-verify operation
is still useful in detecting failure modes associated
with transient addressing, control or bit storage
failures.
f) Integrated circuit memories possess enough redundant
addressing logic so that a partitioning of the memory
into independent bit planes allows word parity to de-
tect a large number of addressing errors. Present
state of the art integrated circuit memories contain
address decoding on each memory chip. Chips can be
configured to contain one, two or four bits of 1024
words on each chip. Since each chip contains its own
address decoding, a failure of a chip can only mani-
fest itself as an error on the output of the chip it-
self. That is, it is localized to a few bits of the
word. If each chip contained only one bit of each
word, then a single word parity bit would detect all
address decoding failures.
g) The use of separate read and write logic in the con-
trol area of the memory module will prevent a read
command from turning into a write command, due to a
single component failure.
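Items a) and b) of the list above can be sketched in a few lines. The following Python model is an illustration under stated assumptions, not the study's design: it folds data parity and address parity into one stored check bit, and it simulates a stuck address register bit with a hypothetical `mar_fault_mask` argument.

```python
class ParityError(Exception):
    """Raised when the stored check bit disagrees with the word read."""


def parity(value: int) -> int:
    """Return 1 if `value` contains an odd number of 1 bits, else 0."""
    return bin(value).count("1") & 1


class ParityMemory:
    """Toy memory whose check bit covers both the data and its address."""

    def __init__(self):
        self.cells = {}

    def write(self, address: int, data: int) -> None:
        self.cells[address] = (data, parity(data) ^ parity(address))

    def read(self, address: int, mar_fault_mask: int = 0) -> int:
        # A non-zero mask models a failed bit in the memory address register:
        # the wrong cell is selected although `address` was requested.
        actual = address ^ mar_fault_mask
        data, check = self.cells[actual]
        if parity(data) ^ parity(address) != check:
            raise ParityError(f"parity error reading address {address}")
        return data
```

A single flipped data bit changes the data parity, and a single flipped address bit changes the address parity, so either fault makes the recomputed check bit disagree with the stored one.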
7.3 Recovery
When a module of the multiprocessor fails, the presence
of a spare (physically identical module) which can execute the
same function does not necessarily mean that recovery can be
accomplished. A failure not only eliminates certain physical
resources (hardware) from potential allocation to executing pro-
cesses, it also destroys information (program, data and status),
which is required for execution. The major problem associated
with recovery is not the necessity of providing spare hardware
with an appropriate reconfiguration switching mechanism. It
is, instead, the problem of re-establishing all the information
required by the process to recover. In order to achieve
recovery, the system must be returned to some past state which
is known to be correct.
What exactly determines the state of a system? If
real time is ignored, for the moment, then the system's state
can be defined to be represented by the contents of all the
storage elements, including M1, M2 and the Processor's control flip
flops. The more dynamic changes to a system's state are contained
within M1 and P. M2 possesses less dynamic changes with time.
M3 is even more static. As one proceeds from the more dynamic
to more static elements of a system's state, time becomes less
important to the recovery process; therefore, software, which
is more time consuming than hardware, can be employed.
The discussion on recovery will address three major
areas:
a) The processing unit, P and M1
b) Operating memory, M2
c) Input output controller (I/OC) and its channels
Suggested approaches to recovery from both transients
and permanent failures in these three hardware areas are pre-
sented.
7.3.1 Processing Unit (P-M1)
7.3.1.1 Restartable Instructions: One of the main suggestions
generated by this study, relative to a recovery from a proces-
sing unit failure, is to design all instructions to be restart-
able. This means that the point of recovery is the instruction
during which the failure was detected. It is assumed that all
failures are detected essentially instantaneously so that pro-
pagation of the failure does not cause incorrect information to
be written into M2 or bad I/O commands to be executed.
Although a restartable instruction is not a difficult
technical feat, it does require a design effort. The following
ground rules must be applied during the design implementation
of each instruction:
a) Each instruction must be partitioned into two phases.
During phase 1 the instruction is fetched, data is
read, computations are made and all memory write op-
erations are placed into a temporary buffer area for
execution during phase 2.
b) During phase 2 the buffered information is copied
into its final destination in M1 and M2. The contents
of the buffer area are not destroyed until all the copy
cycles are completed and verified. Each phase is de-
signed to be separately restartable. Figure 7.1 sch-
ematically represents the execution of a generic re-
startable instruction.
[Figure 7.1: schematic execution of a generic restartable instruction]
c) If a failure indication occurs during phase 1, then
the old copy of the program counter indicates which
instruction was being executed. None of the infor-
mation needed to execute the instruction has
changed, so phase 1 can be re-initiated. If a fail-
ure occurs during phase 2, then, even though some in-
formation might have been copied, the information tem-
porarily buffered in M1 is still valid, and a complete
re-initiation of phase 2 is indicated.
d) Interrupt testing can either occur at the end of phase
2 or at the beginning of phase 1. It is assumed that
all interrupt conditions are caught in latches, so that
the interrupt test is just a matter of reading these
latches and determining whether to fetch the next in-
struction in the instruction stream or to enter the in-
terrupt control micro-routine. The interrupt control
micro-routine must be designed to be restartable and
it must incorporate the concepts of a double phase
operation with a buffer area, i.e., the interrupt con-
trol micro-routine can be considered to be a restart-
able instruction.
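The ground rules above can be sketched as a small software model. The Python fragment below is only an illustration: the exchange instruction, the `Fault` exception and the fault injection arguments are hypothetical and stand in for micro-level hardware behaviour. Phase 1 reads operands and buffers the intended writes without touching storage; phase 2 copies the buffer to its destinations and leaves the buffer intact, so either phase can simply be repeated.

```python
class Fault(Exception):
    """Hypothetical stand-in for a hardware error indication."""


def phase1(state, a, b, fail=False):
    """Read operands and buffer the intended writes; `state` is not modified."""
    if fail:
        raise Fault("transient during phase 1")
    return [(a, state[b]), (b, state[a])]


def phase2(state, buffer, fail_at=None):
    """Copy buffered writes to their destinations; the buffer is left intact."""
    for n, (dest, value) in enumerate(buffer):
        if n == fail_at:
            raise Fault("transient during phase 2")
        state[dest] = value


def swap(state, a, b, fail_phase1=False, fail_phase2_at=None):
    """Exchange two storage locations, restarting whichever phase faults."""
    try:
        buffer = phase1(state, a, b, fail=fail_phase1)
    except Fault:
        buffer = phase1(state, a, b)  # nothing was written: re-initiate phase 1
    try:
        phase2(state, buffer, fail_at=fail_phase2_at)
    except Fault:
        phase2(state, buffer)  # buffer still valid: re-initiate phase 2 in full
    return state
```

An exchange is a useful test case because re-executing it from the start after a partial write would give the wrong answer; with the buffered second phase, a fault after the first copy still ends in the correct state.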
What does a restartable instruction design allow the
system to do?
a) For transients which interrupt the normal execution
sequence, but do not destroy data, the retry of an
instruction will provide a simple method of recovery.
b) For transients, where information is modified, the
information must be restored before the instruction
is retried. The restoration of the lost P or M1 in-
formation can be accomplished by either error correc-
tion codes or by duplexed storage.
It is proposed that each instruction be designed
so that after an instruction is executed, the state of
the processing unit is always contained within M1.
Each processing unit would contain two Ml's so that in
the event of failure of one, the information contained
in the second could be used. The size of M1 should
not be more than 100 words and so its duplication pre-
sents little hardware impact.
c) Recovery at the instruction level allows the entire
operation to be independent of the application pro-
grammer. Hardware and operating system primitives
can determine when and how to restart. All considera-
tions are based upon detailed information below the
instruction level. The application programmer could
not care less about these details.
Because single instruction restart (SIR) allows
a very quick recovery mechanism, one is not even con-
cerned about the impact of the delay between error
detection and recovery. This should be well within
the iteration period of the highest frequency periodic
application function.
d) Error detection within the instruction cycle as well
as SIR tends to eliminate questions of error propaga-
tion and the interactions between a failure and the
informational content of the rest of the system's
storage.
7.3.1.2 Critique of Alternatives: Why the emphasis upon a re-
startable instruction? What are the alternatives?
a) In a batch processing system where multiprogramming
is not used, the failure of a processing unit catches
only one program in a running state. All the submit-
ted programs are completely independent and recovery
is simply a matter of reloading the program and data.
Many functions on the space station can be handled
by this "fresh start" approach. It is simple and
imposes minimum overhead.
However, the real time aspect of some of the
space station processing requirements makes the "fresh
start" approach unfeasible.
b) A "checkpoint restart" approach to recovery has been
applied to systems where problems requiring hours of
computer time are being run. At fixed intervals the
complete contents of core as well as the processor
registers are dumped onto a back up area on disk or
tape. A snapshot is taken of the system's state.
A superficial look indicates that with a 1 μsec
cycle time and a 100K word memory, a memory dump can
be accomplished in 100 milliseconds. This is not an
unreasonable time. However, let us investigate the
implications of "checkpoint restart" a little more
deeply.
1) If a snapshot requires 100 ms then one must con-
sider its effect upon system throughput. If one
desires to limit the overhead imposed by this
function to less than 5%, then a snapshot can not
be taken more than once every 2 seconds. If real
time requirements allow a recovery time of 2 sec-
onds, then "checkpoint restart" might be a viable
candidate.
2) If the contents of operating memory and the proces-
sing units are rolled back 2 seconds in time, can
one guarantee that the state of the mass memory
is always consistent? Must the contents of mass
memory also be dumped when operating memory is
dumped? In general, the answer is yes. In a
virtual memory system the memory hierarchy must
not contain inconsistent information. Dumping M3
periodically onto some archival storage device
such as tape (M4) seems to eliminate checkpoint
restart as a valid candidate for recovery in a real
time environment.
3) Even though M2 can be dumped in 100 ms, a disk,
drum or tape probably couldn't absorb the data at
a rate higher than 10 MBPS. This will increase
the snapshot time for 32 bit words to 320 milli-
seconds and the snapshot period to once every 6.4
seconds.
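The figures quoted in items 1) and 3) can be reproduced with a short calculation. The Python sketch below uses integer microseconds to avoid rounding; the function names are illustrative, and reading "MBPS" as megabits per second is an assumption, chosen because it is the reading that yields the 320 millisecond figure in the text.

```python
def core_dump_us(words: int, cycle_us: int = 1) -> int:
    """Time to read every word once at the given memory cycle time."""
    return words * cycle_us


def device_dump_us(words: int, bits_per_word: int, bits_per_us: int) -> int:
    """Time for the backup device to absorb the dump at its transfer rate."""
    return words * bits_per_word // bits_per_us


def min_period_us(snapshot_us: int, overhead_percent: int) -> int:
    """Shortest snapshot period that keeps overhead at the stated percentage."""
    return snapshot_us * 100 // overhead_percent


memory_limited = core_dump_us(100_000)            # 100 000 us = 100 ms
device_limited = device_dump_us(100_000, 32, 10)  # 10 Mbit/s = 10 bits per us
print(memory_limited, min_period_us(memory_limited, 5))
print(device_limited, min_period_us(device_limited, 5))
```

With a 5% overhead limit the period is twenty times the snapshot time, which gives 2 seconds for the memory-limited case and 6.4 seconds for the device-limited case.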
7.3.2 Recovery From an Operating Memory (M2) Failure
Hardware failures and electrical transients in memory
systems cause information to be destroyed. Recovery from a mem-
ory failure would be very easy if the error patterns caused by
failures and transients could be known with certainty. Many
error patterns could then be corrected by employing error cor-
recting codes. Unfortunately, it is impossible to analyze all
possible failure modes under all possible environments to de-
termine all possible error patterns. Failures exist which can
not be corrected by error correcting codes. Error correcting
codes are not useful when the timing mechanism fails in such a
way as to prevent memory access. A failure in the addressing
mechanism can not be corrected by the encoding of data.
Error correcting codes can be successful when the pre-
dominant error modes are single bit failure or small burst fail-
ures. In general, however, duplication of the information con-
tained within the memory cells is required for successful re-
covery from an M2 failure.
7.3.2.1 Problem Areas: When attempting to design a system which
is recoverable from M2 failures, a number of distinct problem
areas must be resolved:
a) Memory Management
Normal (non-failure-tolerant) memory management
deals with the allocation, deletion, and control of
memory space for program and data entities. When fail-
ure recovery is made a requirement, additional questions
arise: how to deal with redundant storage of critical
information? How shall the hardware and software in-
teract to:
1) enable the continuous storage of redundant infor-
mation?
2) allow the accessing of valid information in the
presence of a fault?
b) Hardware Fault Isolation
When a memory error is discovered, how can it be
isolated to a repairable piece of equipment?
c) Information Fault Isolation
If the failure is isolated to a specific memory
module, one must be able to determine what informa-
tion was destroyed so that recovery action can be con-
trolled.
d) Storage of Redundant Information
Since the redundant storage of information be-
comes a necessity for critical programs and data, a
question arises as to how and where the redundant in-
formation should be stored: in M2 or M3 or a combina-
tion of both?
7.3.2.2 Factors Behind the M2 Recovery Approach: A number of
considerations pointed to the suggested M2 configuration. The
following items consist of assumptions, observations and the
philosophy which leads to the approach presented in the next
section.
a) Consistent with the processing unit's failure recovery
philosophy, the applications programmer should not be
concerned with the details of the recovery procedure.
This is handled by the hardware and operating system.
There is however, one aspect that must involve the
application programmer. He is the only one who can
initiate the specification as to which program and/or
data segments are critical. By definition, critical
segments are all those segments used by programs which
must recover and continue execution after a failure.
Non-critical programs need not recover. They must,
however, be terminated in such a way so as not to inter-
fere with critical programs. This is called Fail Safe.
Some observations and requirements necessary to enable
a program to Fail Safe are presented in Section 7.4.
Once the applications programmer indicates the
programs which are critical the compiler can statically
assign critical or non-critical status to segments it
creates. Similarly, the operating system must also as-
sign criticality status to segments it dynamically
creates; for example, a stack.
b) The recommended approach to memory management is to
employ a segmented virtual memory system.
The virtual memory approach allows an exploita-
tion of the difference between read-only (program and
fixed data segments), and read-write (variable data
segments) information. If an M2 module which contains
program segments fails, it is desirable to exploit the
virtual memory mechanism, already implemented within
the system, to aid in the recovery process.
Most program segments can be considered to reside
in M3. They are brought into M2, on demand, for exe-
cution. If the program segments contained within the
failed M2 module were, as the result of the failure,
made "not present", then the M3 to M2 transfer mech-
anisms will allocate space and transfer anew the re-
quired segments automatically. The "not present" seg-
ment indication is contained within the program seg-
ment descriptor. Descriptors are considered to be
data and are in turn stored redundantly in M2.
c) For a large computational system on-board a space
station, it is reasonable to assume that repair or
replacement of a failed M2 module will be performed
relatively quickly. The hardware error detection mech-
anisms should be able to isolate to a repairable unit,
and to indicate the action to be initiated by the soft-
ware.
However, there must be sufficient M2 space avail-
able so the system can run without "thrashing". This
entails modifying the work load so as to reduce the
memory required to accomodate the working sets of the
remaining processes. Possibly, the number of processes
of particular types might be limited to reduce the work
load.
7.3.2.3 Proposed Configuration for M2 Failure Recovery: The
proposed configuration defines an M2 module as four M2 units
which are interleaved on their low order address bits (see Fig-
ure 7.2).
Information segments may either be stored in a simplex
or duplex mode. The mode is specified within the descriptor.
Most program code would be stored simplexed and interleaved
across the four memory units. Most critical data segments would
be stored duplexed. In the duplexed storage mode address i and
i + 1 contain identical information. That is, two adjacent mem-
ory units contain identical copies of the redundant words.
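The addressing scheme just described can be illustrated with a small model. The Python sketch below is an assumption-laden illustration, not the proposed hardware: the class and argument names are hypothetical, and the `failed_unit` argument simply simulates the loss of one memory unit. The two low order address bits select the unit, so addresses i and i + 1 always fall in different units and a duplexed word survives the loss of either one.

```python
UNITS = 4  # memory units per M2 module, interleaved on low order address bits


def unit_of(address: int) -> int:
    """Memory unit selected by the low order address bits."""
    return address % UNITS


class M2Module:
    """Toy model of one M2 module built from four interleaved units."""

    def __init__(self, words_per_unit: int):
        self.units = [[0] * words_per_unit for _ in range(UNITS)]

    def _store(self, address: int, value: int) -> None:
        self.units[unit_of(address)][address // UNITS] = value

    def _load(self, address: int) -> int:
        return self.units[unit_of(address)][address // UNITS]

    def write(self, address: int, value: int, duplex: bool = False) -> None:
        self._store(address, value)
        if duplex:
            self._store(address + 1, value)  # identical copy in the next unit

    def read(self, address: int, duplex: bool = False, failed_unit=None) -> int:
        if unit_of(address) != failed_unit:
            return self._load(address)
        if duplex:
            return self._load(address + 1)  # fall back to the redundant copy
        raise LookupError("simplex word lost with its memory unit")
```

A simplex word in a failed unit cannot be read back from M2, which is why the text relies on M3 as the backup for such segments.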
A minimum of two memory ports connect to the redundant
P interfaces. Communication with any M2 unit can occur through
either port. This is under control of the command issued from
the processing units.
M3 is used to backup most program segments. M2 is used
as the backup for data and certain critical program segments.
Program and Data Segments can be stored anywhere in M2. When
space is assigned to a critical data segment, a double size "hole"
must be found in M2. This does not impose any extra effort upon
the memory management function.
Redundant writes into independent units of M2 are ac-
complished automatically via the dual redundant processing unit
bus links. Recovery of M3-backed-up information requires making
the segment "not present". The memory management routine which
handles segment faults will automatically reload the M3 segments
when required, on demand.
Whenever an M2 error is detected, the error indications
are communicated to both halves of the processing unit so they
can continue to perform identical operations. The ability to
restart an instruction can be exploited in attaining system re-
covery after an M2 failure. As soon as an M2 error is detected,
the processing unit traps to a special micro-routine which boot-
straps into the sequence indicated in Figure 7.3. After recov-
ery, the instruction which was terminated by the trap can be
re-executed (if the M2 error was detected during phase 1 of the in-
struction) or the instruction may be completed (if the M2 error
was detected during phase 2 of the instruction). It is interesting
to note that M2 read operations occur only during phase 1 while M2
write operations occur only during phase 2.
[Figure 7.2: proposed M2 module of four interleaved memory units]
[Figure 7.3: recovery sequence following an M2 error]
When an M2 error indication is first recorded, the M2
operation will be tried again. If the error does not recur
then a type 1 error is indicated. However, if the error indi-
cation persists a search is made to determine which segments
are stored in the suspect unit. To accomplish this search in
the presence of a failed unit, the header word containing a
pointer to the segment descriptor as well as a link to the next
segment is redundantly stored. Figure 7.4 shows the storage
allocation for both simplex and redundantly stored segments.
All non-critical segments within the suspect module
are put into a "dead" state.
Critical segments can be either redundantly stored or
not. A redundantly stored critical segment is written out to
M3 so normal memory management can be used to allocate new space
for it when required. Since it is assumed that failures do not
simultaneously affect both copies of redundantly stored infor-
mation, the good copy can be accessed after a failure.
Non redundantly stored critical segments are made not
present. Fixed data and programs fall into this category.
For all M2 failures, statistics are maintained indica-
ting a failure history. If an M2 module develops a bad history
of failure, then it will be removed from an active status. The
definition of how many failures within a given time period indi-
cates a bad history, can be considered a design parameter depend-
ing upon whether transient or permanent hardware failures con-
stitute the predominant failure mode.
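The disposition rules of the last few paragraphs can be summarized in a short sketch. The Python fragment below is an illustration only; the `Segment` record, the action strings, and the window and limit arguments are hypothetical names, and the actual thresholds are left by the text as design parameters.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    name: str
    critical: bool
    redundant: bool


def disposition(seg: Segment) -> str:
    """Action for one segment found in the suspect memory unit."""
    if not seg.critical:
        return "dead"
    if seg.redundant:
        return "write good copy to M3"
    return "not present"  # reloaded from M3 on demand


def handle_m2_error(error_recurs: bool, segments) -> dict:
    """Retry first; only a persistent error triggers the segment search."""
    if not error_recurs:
        return {}  # the retry succeeded, so the error was transient
    return {seg.name: disposition(seg) for seg in segments}


def bad_history(failure_times, now, window, limit) -> bool:
    """True when more than `limit` failures fall within the last `window`."""
    recent = [t for t in failure_times if now - t <= window]
    return len(recent) > limit
```

A module for which `bad_history` returns true would be the one removed from active status under the policy described above.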
7.3.3 Fault Tolerant Aspects of the I/OC, Channel
This section will address problems associated with re-
covery from a transient or permanent failure in the I/OC or com-
munication channel between the I/OC and the device.
Many constraints must be placed upon the I/OC, channel
and the attached devices and controller. Figure 7.5 presents
schematically the elements which will enter into the discussion.
Only one I/OC, channel and device is shown. Clearly more exist
in a real system. Our discussion will focus on only one I/OC,
channel and device at a time.
7.3.3.1 Incorrect I/O Commands: The basic recommended approach
is to eliminate the possibility of executing incorrect I/O com-
mands. As a general principle all I/O devices require some de-
gree of feed back to the MP, if any fault tolerant design goals
are to be achieved. No external device can be allowed to run in
an open loop mode without communications back to the I/OC.
[Figure 7.4: Storage Allocation in Interleaved M2. The diagram shows
one M2 memory module of four memory units (MU 1 to MU 4) holding
simplex and redundantly stored segments.]
Legend:
MU k = k-th memory unit
HW i = header word of the i-th segment
WjSi = j-th word of the i-th segment
HW'i = redundant copy of HW i
(WjSi) = redundant copy of WjSi
P = pointer to start of next segment, contained in HW i
P' = redundant pointer, contained in HW'i
[Figure 7.5: I/O Elements]
Legend:
P = Processing unit
M2 = Operating memory
I/OC = Input output controller
Channel = Communication channel between I/OC and device
D = Device and associated controller (if required), e.g.,
printers, CRTs, IMU's, other computers, etc.
E1 = Device error
E2 = Channel error detected by device
E3 = Channel error detected by I/OC
E4 = I/OC error
E5 = M2 error detected by I/OC
E6 = Interprocessor communications error detected by I/OC
E7 = Processing unit error
One of the most devastating aspects of I/O failure is
the possible execution of illegal unwanted I/O commands. A major
design effort, which will impose constraints on the elements of
figure 7.5, must be undertaken to eliminate or minimize the pos-
sibility of incorrect I/O. Let us look at a typical I/O sequence,
with safeguards to minimize this possibility.
a) The processing unit issues an I/O command to the I/OC.
b) The I/OC reads the indicated M2 location to obtain the
I/O descriptor.
c) The I/OC sets up the channel and issues the command to
the device.
d) The device echoes the command back to the I/OC for ver-
ification.
e) If correct, the I/OC issues an execute sequence to the
device. The device then executes the command which may
require reading or writing into M2.
f) After execution, a finished indication is sent from the
device to the I/OC and this status is set into the I/O
descriptor in M2, or an interrupt is generated.
Let us investigate the effect of a failure during any of
the sequential steps listed above. Error indications can occur
from many sources including P, M2, I/OC, channel and device.
A failure indication, E5, E6, or E7 during steps a and
b allows time so the I/O can prevent the issuance of the command.
If an error, E1, E2, or E3 is detected during step c, then the
I/O must also terminate the command, since an execute has not
been issued to the device. An E failure indication during step
c should result in an emergency sequence to cancel the I/O re-
quest already issued to the device.
The echo check, step d, provides a positive verification
(feedback) that the device has successfully received the command.
An I/OC or channel failure indication during the execu-
tion of a command must result in a sequence of operations which
is very device dependent. This will be discussed in section
7.3.3.7.
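Steps c) through f) of the sequence, including the echo check, can be sketched as follows. This Python model is illustrative only: the `Device` class, its method names and the `channel` argument are hypothetical, and the return path of the echo is assumed to be fault free for simplicity.

```python
class Device:
    """Toy device that holds a command until told to execute it."""

    def __init__(self):
        self.pending = None
        self.executed = []

    def receive(self, command):
        self.pending = command
        return command  # step d: echo what was actually received

    def execute(self):
        self.executed.append(self.pending)
        self.pending = None
        return "finished"  # step f: finished indication

    def cancel(self):
        self.pending = None


def issue_command(device, command, channel=lambda c: c):
    """I/OC side: send, verify the echo, and only then issue the execute."""
    echo = device.receive(channel(command))  # steps c and d
    if echo != command:
        device.cancel()  # nothing has been executed yet
        return "aborted"
    return device.execute()  # step e
```

Because the execute sequence is issued only after a matching echo, a command corrupted on the channel is cancelled before the device acts on it.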
7.3.3.2 Super Critical Commands: Although the I/O portion of
the space station is inadequately defined, it seems reasonable
to postulate the necessity for a small number of super critical
commands with the following properties.
a) It is most disastrous if the command is executed when
it shouldn't be.
b) It is better to abort the command or action if anything
seems to be going wrong rather than execute it incor-
rectly.
Examples of such commands might be "Stage the Rocket",
"Purge the Airlock", etc. What should be done if failure occurs
during the execution of a super critical command? The answer is
to make the command fail safe, by issuing it or a facsimile thru
multiple channels to the device. Only when all the arming con-
ditions for the command are properly set is the device allowed
to execute. If any discrepancy is noted at the device, command
execution must be held up for resolution by the MP.
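A minimal sketch of the arming rule follows, assuming the device receives one copy of the command per channel. The function name, the channel labels and the "hold" and "execute" results are hypothetical; the text does not specify how the arming conditions are encoded.

```python
def arming_decision(expected_channels, received):
    """Device-side decision for a super critical command.

    `received` maps each channel name to the command copy that arrived on it.
    The command executes only if every expected channel delivered a copy and
    all copies agree; any discrepancy holds execution for resolution.
    """
    if set(received) != set(expected_channels):
        return "hold"  # a copy is missing or came from an unexpected channel
    if len(set(received.values())) != 1:
        return "hold"  # the copies disagree
    return "execute"
```

The rule deliberately favours holding over executing, in line with properties a) and b) of super critical commands.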
7.3.3.3 Interrupts: In many instances, the system is faced with
the problem of "phantom" interrupts or missing interrupts. Fault
conditions within the interrupt logic can cause undesired inter-
rupts (phantom interrupts) or can possibly prevent the generation
of interrupts which should occur. The action to be taken by the
system in these cases is very dependent upon the interrupt condi-
tion one is considering.
Let us consider two cases:
a) The Expected Interrupt
Often interrupts are expected when an I/O device
command finishes. The exact time of occurrence of the
completion of the I/O command is not known, but the
worst case time may be estimated.
A time out error indication is a simple mechanism
which will inform the system that the I/O device has
not finished executing the command or at least the "done"
interrupt has not been received within a given time
period. If the I/OC and channel have a sufficient amount
of internal error detection, the failure can probably
be attributed to the device itself.
The action to take might involve a limited number
of retries of the operation or a call for system re-
configuration which eliminates the device from use.
If a "phantom" interrupt occurs, which indicates
a device end condition for a device which wasn't being
used, then clearly this interrupt should be ignored by
the system. This feature can be incorporated into the
interrupt handling routines.
b) Unexpected Interrupts
These are a class of interrupt conditions which
are provided for but which are unexpected. For example,
the failure of a P or M2 unit might cause a different P
to get interrupted. If this failure interrupt is sign-
alled when the condition really doesn't exist, it is
probably still wise to service the interrupt rather than
ignore it. It is better to configure into a degraded
mode of operation, for a short while, when it isn't nec-
essary, rather than not to reconfigure when it is nec-
essary.
Other interrupts which are unexpected are not as-
sociated with failures. Many are traps, such as absent
segment trap conditions. The servicing of an absent
segment trap condition when one doesn't exist can lead
to inconsistent situations and ultimately system failure.
One design feature, which can be applied to certain
I/O interrupts, involves a handshaking or interrupt ver-
ification concept. This feature would have the system
verify that the interrupt which was signalled really
does exist. The device which signalled the interrupt
must retain the interrupt condition information until
after the verification cycle. The verification can
either be performed directly by the I/O unit or by a
processor through an I/O command.
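The handling of expected interrupts described above (a time-out armed at the worst case estimate, a limited number of retries, and the discarding of a "done" interrupt from a device which was not in use) can be sketched as follows. This is only an illustrative sketch: the class names, the retry limit of three and the log entries are assumptions and do not appear in the report.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """Hypothetical model of an I/O device as seen by the interrupt handler."""
    name: str
    busy: bool = False          # True while a command is outstanding
    retries_left: int = 3       # assumed retry limit before reconfiguration
    deadline: float = 0.0       # time by which the "done" interrupt must arrive


@dataclass
class InterruptMonitor:
    devices: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def issue(self, name, now, worst_case):
        """Start a command and arm the time-out at the worst case estimate."""
        dev = self.devices.setdefault(name, Device(name))
        dev.busy = True
        dev.deadline = now + worst_case

    def on_done_interrupt(self, name):
        """Accept a "done" interrupt only if the device was actually in use."""
        dev = self.devices.get(name)
        if dev is None or not dev.busy:
            self.log.append(("ignored_phantom", name))
            return False
        dev.busy = False
        self.log.append(("done", name))
        return True

    def check_timeouts(self, now, worst_case):
        """Retry overdue commands; call for reconfiguration when retries run out."""
        for dev in self.devices.values():
            if dev.busy and now > dev.deadline:
                if dev.retries_left > 0:
                    dev.retries_left -= 1
                    dev.deadline = now + worst_case
                    self.log.append(("retry", dev.name))
                else:
                    dev.busy = False
                    self.log.append(("reconfigure_without", dev.name))
```

The unexpected failure interrupts of case b) are deliberately left out of this sketch, since the text recommends servicing them even when they may be spurious.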
7.3.3.4 Non-State-Dependent Sequences: If an I/OC or channel
sustains a transient, which causes the termination of an I/O
sequence, then it would be desirable to rely upon a recovery
policy which would cause the reissuance of the I/O command. In
order for this recovery policy to be satisfactory, the response
of the I/O device to the command must be only a function of the
command and not of the state of the device itself. This feature
can be designed into the device if one is careful about the ini-
tial design specification and the type of commands one allows.
For example: Assume a tape unit is at the end of record 6
of file 1. A command which says "Read the next record"
is very dependent upon the state of the tape unit, namely the
position of the tape. A better command structure would be "Read
record 7 of file 1". The result of this command will always be
the same, independent of the position of the tape.
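The difference between the two command styles under a "reissue on transient" recovery policy can be illustrated with a small sketch. The tape model, the method names and the fail_midway switch are hypothetical and serve only to make the example runnable; it is also assumed that the tape has already moved past the record when the transient occurs.

```python
class TapeUnit:
    """Hypothetical tape model: records addressed by (file, record) number."""

    def __init__(self, files):
        self.files = files            # {file_no: [record 1, record 2, ...]}
        self.position = (1, 0)        # (file, number of the last record passed)

    def read_next(self, fail_midway=False):
        """State-dependent command: the result depends on the tape position."""
        file_no, rec = self.position
        self.position = (file_no, rec + 1)   # tape moves even if the transfer fails
        if fail_midway:
            raise IOError("channel transient")
        return self.files[file_no][rec]

    def read_record(self, file_no, rec_no, fail_midway=False):
        """State-independent command: always returns the named record."""
        self.position = (file_no, rec_no)
        if fail_midway:
            raise IOError("channel transient")
        return self.files[file_no][rec_no - 1]


def reissue(command, *args):
    """Recovery policy: on a transient, reissue the same command once."""
    try:
        return command(*args, fail_midway=True)
    except IOError:
        return command(*args)
```

With the tape positioned at the end of record 6, reissuing read_next after a transient delivers record 8, while reissuing read_record(1, 7) delivers record 7.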
It should be clear that "Read the next record" would not
prove to be a satisfactory command to reissue in case of a
failure in the middle of reading record 7. Record 8 could be
accessed instead of 7.
7.3.3.5 Complete Message Buffer: If errors can be detected as
soon as they occur and if recovery from transient errors is
required, both the I/OC and the device must have enough buffer
storage so that a retransmission of the entire message (data and
command) can be made. The I/OC buffer may, indeed, be M2 and
the buffer storage element of the archival memory might be the
tape itself.
It is undesirable to have to recreate the entire message
because of a channel transient error. Retransmission appears to
be a reasonable approach.
7.3.3.6 Real Time Aspects: When the MP is used as an element
of a real time control loop, outputs can be required periodically.
If a failure occurs during a real time I/O command, the device
could possibly have to wait for a number of iteration cycles for
the recovery cycle to be complete.
In this instance, the device must be provided with a
capability to extrapolate from old updates until the system has
recovered. This might require nothing more than assuming the
last update is still valid. Possibly, more complex methods are
required.
7.3.3.7 Failure During the Execution of an I/O Command: If a
transient occurs, the actions to pursue in order to recover
become extremely device dependent.
Consider the following examples:
a) Many of the external devices attached to the space sta-
tion data bus are transducers, to monitor temperature,
pressure, gas mixture, etc. If an I/OC or channel fail-
ure occurs, the appropriate action for recovery would
be to ignore the results of the command in progress,
clear the buffer or reset the device if necessary, and
reissue the command.
Any non-destructive read operation can be reissued
for recovery purposes. Destructive read operations should
be eliminated from the system specification or temporary
redundant storage or redundant devices must be employed.
b) Consider the case of updating the refresh memory of a
CRT output device. Assume a failure occurs during the
update operation and the possibility of incorrect
information on the CRT exists. Recovery action can consist
of nothing more than reissuing the update command.
If recovery takes 100 ms the human operator might only
notice a small flicker on the screen and no damage is
done to the overall system.
c) Consider the case of a printer. Assume a failure occurs
in the middle of a print cycle. It should be clear that
the reissuance of the PRINT command is inappropriate for
recovery since the old printed output, possibly incorrect,
would exist immediately on top of the new valid
printed output. Page boundaries would be incorrect.
Before reissuance of the print command, the page must
be spaced. If a plotter instead of a printer were being
used, the computer operator would have to be informed
to insert a new sheet of paper in the plotter.
d) Inter-Computer Communication. Quite possibly, the space
station will contain pre-processors in addition to a
large central multiprocessor. Pre-processors are employed
so as to buffer the high bit rate of the device.
(See Figure 7.6.) They perform high frequency interactive
calculations and provide a data rate reduction
for the system.
Unlike simple input output devices which can recover
with reissuance of commands, a pre-processor reaction
to a command can be very dependent upon its own
state.
All the concepts of command verification and message
buffering must be built into the pre-processor.
The programs in the pre-processor must also be designed
to run asynchronously from the multiprocessor.
7.3.3.8 I/O Locks: When a software process requires access to
an I/O device, the device may be required to be locked to the
process. That is, no other process can access the selected device
until the previous I/O request is finished. Problems of deadlock
exist when the initiating process fails.
If the software process recovers quickly enough, then
the lock does not remain on the I/O device for an excessive time.
However, if recovery takes a long time or if the process is
specified to be non-critical (that is, it need not recover), then
some mechanism must be designed into the system to release the
I/O lock. This is one of the elements to consider in allowing
a process to fail safe.
Figure 7.6: Pre-Processors
(Diagram: the central multiprocessor sends commands and
initialization data to the pre-processor and receives interrupts
and result data; the pre-processor carries out high frequency
periodic processing and handles the high bit rate of the device.)
Even though the process need not continue operation in
case of a failure, a special I/O routine must be executed to
search for, find and release all locks created by the process
which terminated.
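A minimal sketch of such a lock release routine is given below. The table layout (one owner per device) and all names are assumptions made for illustration; the report does not specify how the executive records I/O locks.

```python
class IOLockTable:
    """Hypothetical executive table mapping each locked device to its owner."""

    def __init__(self):
        self._owner = {}                      # device name -> process id

    def acquire(self, device, pid):
        """Lock the device to pid; refuse if another process holds it."""
        holder = self._owner.get(device)
        if holder is not None and holder != pid:
            return False
        self._owner[device] = pid
        return True

    def release(self, device, pid):
        """Normal release at the end of an I/O request."""
        if self._owner.get(device) == pid:
            del self._owner[device]

    def release_all(self, pid):
        """Search for and release every lock created by a terminated process."""
        freed = [d for d, p in self._owner.items() if p == pid]
        for device in freed:
            del self._owner[device]
        return sorted(freed)
```

The executive would call release_all(pid) as part of terminating a process, so that devices held by a non-critical process which is not recovered become available to other processes.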
7.4 The Implications of Fail Safe
Although it is the physical hardware that fails, it is
conceptually useful to consider the process being executed at the
time of failure to have failed. Only one process can fail when
a processor fails. In the case of an M2 module many processes
can be affected.
It is assumed that in the space station environment all
processes are either required to recover or fail safe. None are
allowed to be abruptly terminated without consideration of the
interaction between the termination and the rest of the system.
A number of problem areas arise when one considers the
implications of Fail Safe. Some of these are discussed below:
a) In order to maintain system throughput in a multipro-
cessor, the intrinsic parallelism within a function
must be exploited. Parallel processes are spawned and
executed simultaneously on different processors.
If a process is to fail "safe", all the fork points
which were created must be examined and all the spawned
processes terminated. This feature must exist within
the executive function of the system which controls the
termination of processes.
b) If a process is to fail "safe", all the I/O commands
issued by the process must either be cancelled, terminated
or completed. None may be left indefinitely
on queue. The various commands issued to each device
must be studied to ascertain the effect of a premature
termination of the issuing process. If a tape was in
the middle of reading a record, the read cycle can be
completed. Upon receipt of the "Done" indication, the
read data can be discarded. If a command is still on
an I/O queue, it can be cancelled. If a device is being
written into, it is not clear that the write operation
can continue when the initiating process is terminated.
All these types of questions must be considered for
each I/O device when one desires a process to fail safe.
c) When a memory unit fails and a segment of a non-critical
process is made dead, questions must be raised as to the
disposition of the other valid segments within M2 and
M3 associated with the failed process. The following
suggestion is made:
One of the conditions which will cause a process
to be killed will be when it attempts to access data
contained within a dead segment. When this occurs,
control will be transferred to an executive routine
which will control the operation of systematically
terminating the process. This includes:
1) Placing the process in the dead state
2) Placing all spawned processes in the dead state
3) Releasing the stack number (in the case of a stack
machine) and the space used by the process and
dependent processes.
Contained within the process stack are descriptors
of all the local data segments currently being used by
the process. The space used by these segments must
eventually be reclaimed for other uses.
d) During the normal execution of the memory management
function, any segment not referred to within a period
of time will be replaced by more active segments. This
includes any dead data segments that may exist. Eventually,
all the dead segments in M2 will be overwritten
by just letting the system run normally. However, it
is possible for dead data segments to occupy space on
M3 which could possibly be used for other segments or
for file storage.
At some point a "Garbage Collection" routine will
have to be executed in order to reclaim this lost space.
Most probably, the normal reclamation of fragmented M3,
due to M3-M4 control, will provide the required service.
e) In general, the executive design must consider the actions
to take when a process enters the dead state. If an
interrupt is directed to a process which is in the dead
state, it should be ignored and any other process which
is dependent upon the dead process must be informed so
that appropriate action can take place.
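The termination steps listed above (dead state for the process and for everything it spawned, release of stack numbers, notification of dependent processes, and the ignoring of interrupts directed to a dead process) might be sketched as follows. The Process record and its fields are hypothetical; the report does not define the executive's data structures.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Process:
    """Hypothetical process record kept by the executive."""
    pid: int
    state: str = "running"
    children: list = field(default_factory=list)    # spawned at fork points
    dependents: list = field(default_factory=list)  # processes relying on this one
    stack_no: Optional[int] = None


def terminate(proc, free_stacks, notices):
    """Place proc and all spawned processes in the dead state, releasing stacks."""
    if proc.state == "dead":
        return
    proc.state = "dead"
    for child in proc.children:
        terminate(child, free_stacks, notices)
    if proc.stack_no is not None:
        free_stacks.append(proc.stack_no)       # stack number becomes reusable
        proc.stack_no = None
    for dep in proc.dependents:
        if dep.state != "dead":
            notices.append((dep.pid, proc.pid))  # inform the dependent process


def deliver_interrupt(proc):
    """Interrupts directed to a dead process are ignored (returns False)."""
    return proc.state != "dead"
```

Reclaiming the M2 and M3 space held by dead segments is left to the normal memory management function, as the text suggests, and is not modeled here.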
Chapter 8
CONCEPT VERIFICATION
8.1 Background
The multiprocessor (MP) system proposed for future manned
space stations will employ many new concepts which will
hopefully enhance the performance and reliability of the system.
This chapter will discuss the validation of various concepts
proposed for the space station MP. The concepts to which
reference is made are not applications software or SUMC hardware but
rather those aspects of the system which interact with
applications software and SUMC hardware to control the operation of
space station subsystems and experiments. One wishes to verify
that the ideas which will be implemented do indeed yield the
required performance with an efficient utilization of resources.
How does one go about validating a new concept, or at
least establishing confidence that a given approach will prove
satisfactory? The ultimate answer is to build the system, run
it, and evaluate its performance. This of course is an expensive
process, especially if many new ideas have to be frozen
into a design before it is evaluated. In order to provide a
more orderly, cost effective approach a two level simulation is
proposed, both levels being carried out before the system is
committed to operational use.
This chapter will discuss both a high-level and a more
detailed low-level concept verification process.
a) The first verification phase involves both analytical
techniques as well as a high-level computer simulation
employing idealized work loads and environments. The
results of this effort will verify that a given design
concept can achieve specified qualitative goals.
b) The second phase involves a more detailed, low-level
simulation requiring both simulated and actual hardware
and software modules. The objective of this phase is
to verify quantitative goals, by means of measurements
and design modification.
As part of both verification steps, measurements are
made and design parameters are modified so as to optimize
system performance. The specific activities involved in the
design verification and performance optimization of the space
station multiprocessor concept will be presented in the
remainder of this section.
8.2 Phase 1 -- Initial Analysis and High-Level Simulation
8.2.1 Objectives
The initial analysis and high-level simulation attempt
to achieve the following objectives:
8.2.1.1 Design Features: The major design features must be
established. In a MP system this will include:
a) A definition of memory management philosophy
b) The appropriate utilization of local memory
c) Interrupt and I/O analysis
d) The structure of the MP internal bus
For example, the application of simple analytical techniques
will demonstrate the inappropriateness, from a performance stand-
point, of a single 32 MBPS internal bus which is time-shared be-
tween P's and M2 elements.
8.2.1.2 Parameters: The parameters which should be made
variable in the low-level simulation must be identified and
segregated, so that performance can be optimized. For example, the
simple analysis of local memory and its effect upon performance
indicates that the major parameters are the M2/M1 speed ratio,
r, and the hit ratio, h.
The isolation of these parameters is significant in that
performance improvement or degradation is very sensitive to h.
Clearly those hardware and software elements which control h
should be made as variable and flexible as possible.
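The sensitivity to h can be illustrated with the usual two-level average access time model. The model is an assumption introduced here for illustration (a hit costs one M1 access, a miss costs one M2 access); the report's own analysis of local memory is not reproduced in this section.

```python
def speedup(h, r):
    """Speed gain of an M1/M2 pair over M2 alone.

    h: hit ratio, the fraction of references satisfied by M1 (0 <= h <= 1)
    r: M2/M1 speed ratio, i.e. M2 access time divided by M1 access time
    Assumed model: a hit costs one M1 access, a miss costs one M2 access.
    """
    effective = h * 1.0 + (1.0 - h) * r     # mean access time in units of t(M1)
    return r / effective


# Tabulate the gain for an assumed speed ratio of 10.
for h in (0.5, 0.8, 0.9, 0.95, 0.99):
    print(f"h={h:.2f}  speedup={speedup(h, r=10):.2f}")
```

Under this model most of the attainable gain is realized only as h approaches 1, which is why the elements controlling h deserve the most flexibility in the simulation.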
If a virtual memory is employed a simple, high-level
simulation or analysis will show that the following parameters
should be made variable:
a) The page size (if paging is employed).
b) The replacement algorithm.
c) If an associative memory is proposed, its size should
be variable. Performance is very sensitive to the
search time of the page location algorithm.
d) Possibly, the utilization of a variety of different
access times to M3 devices should be considered.
e) The size of M2 could be a parameter. The "thrashing"
threshold has to be established if software expandability
is to be achieved.
The main objective of this effort is to isolate as
many parameters of design as possible through a careful
scrutiny of all major design features.
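A high-level experiment of the kind suggested by items b) and e) can be very small. The sketch below counts page faults for one replacement algorithm (least recently used, chosen here only as an example) while the number of page frames in M2 is varied. The synthetic reference string and all parameter values are assumptions, standing in for an idealized work load.

```python
from collections import OrderedDict
import random


def lru_faults(references, frames):
    """Count page faults for a reference string under LRU with `frames` frames."""
    resident = OrderedDict()
    faults = 0
    for page in references:
        if page in resident:
            resident.move_to_end(page)        # mark as most recently used
        else:
            faults += 1
            if len(resident) >= frames:
                resident.popitem(last=False)  # evict the least recently used page
            resident[page] = True
    return faults


def synthetic_references(n, working_set, total_pages, locality, seed=1):
    """Idealized work load: most references fall inside a small working set."""
    rng = random.Random(seed)
    return [rng.randrange(working_set) if rng.random() < locality
            else rng.randrange(total_pages) for _ in range(n)]


refs = synthetic_references(20000, working_set=8, total_pages=64, locality=0.95)
for frames in (2, 4, 8, 16, 32):
    print(frames, lru_faults(refs, frames) / len(refs))
```

The fault rate rises sharply once the number of frames drops below the working set size, which is one simple way of locating a "thrashing" threshold for a given work load.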
8.2.1.3 Assumptions: Another objective of this first phase
effort is to establish clearly all the assumptions, implicit
or explicit, that formed the basis of major design decisions.
For example, why was a multiprocessor chosen? Three answers
are possible:
a) A cost effective performance increase.
b) Reliability improvement through the use of identical
elements and an ability to recover.
c) Expandability.
All three of these assumptions or desires drive one
to the conclusion that the executive system, which interfaces
the hardware and applications software, must be generalized
enough for expandability, yet it must be implemented in such
a way as not to produce an excessive overhead. Reliability
implies a comprehensive error detection scheme. Recovery
implies a specific communication interface between the hardware
and executive.
8.2.2 Tools for High-Level Simulation
8.2.2.1 Simulation in General: How does one approach the pro-
blem of developing a high level simulation? What tools are
available? Reference 1 discusses techniques available for both
macro (high level) simulation and micro (detailed low level sim-
ulation). Macro level simulation is concerned with abstractions
of computer systems which are designed to expose and analyze
critical design parameters. Generally speaking, these simula-
tion techniques deliberately suppress design detail, and con-
centrate on broadly defined measures of system effectiveness.
Computer simulation at this level has its basis in
queueing theory, the probabilistic analysis of the interaction
between users and facilities. The role of simulation is to
exercise, by Monte Carlo methods, user and facility interactions
whose complexity exceeds the bounds of known or feasible
analytic solutions.
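As a small illustration of the Monte Carlo approach, the sketch below simulates a single facility with random arrivals and compares the estimated mean wait with the known analytic result. The exponential interarrival and service times (an M/M/1 queue) are an assumption made here only because a closed-form answer exists for comparison; real job streams need not behave this way.

```python
import random


def simulate_single_server(arrival_rate, service_rate, customers, seed=0):
    """Monte Carlo estimate of the mean wait in queue for one facility."""
    rng = random.Random(seed)
    clock = 0.0            # arrival time of the current user
    server_free_at = 0.0   # time at which the facility next becomes idle
    total_wait = 0.0
    for _ in range(customers):
        clock += rng.expovariate(arrival_rate)
        start = max(clock, server_free_at)      # wait if the facility is busy
        total_wait += start - clock
        server_free_at = start + rng.expovariate(service_rate)
    return total_wait / customers


def analytic_wait(arrival_rate, service_rate):
    """Known M/M/1 result: Wq = rho / (mu - lambda), with rho = lambda / mu."""
    rho = arrival_rate / service_rate
    return rho / (service_rate - arrival_rate)


print(simulate_single_server(0.5, 1.0, 200000), analytic_wait(0.5, 1.0))
```

Once the model is validated against a case with a known solution, the same loop can be extended to interactions for which no analytic solution is available.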
Digital computer facilities have long exhibited the
symptoms dear to the queueing analyst: namely, bottlenecks.
The reader will probably have personal familiarity with
situations where a data processing facility has become hopelessly
inefficient due to one, or a combination of, bottleneck elements.
The objective of high-level simulation is to obtain
an advance estimate of the performance of a computing facility
at the design stage. To be successful, the simulation must
anticipate the way the system would work if it were built. The
successful simulation designer must accomplish all of the fol-
lowing steps:
a) He must satisfy himself that simulation is an appropri-
ate analytical method, and that the elements of the
system and the job stream are sufficiently defined.
b) He must verify that the results of the simulation are
correct, and that they are appropriate to his purpose.
c) He must explain and substantiate his results and pro-
selytize his conclusions in order to affect future
events in a constructive way.
These generalizations are noted here because there
seems to be an uneasiness among professional personnel about
high-level simulation of computers. This is probably because
the technique of simulation has been often misused, particularly
by neglecting the fundamentals listed above.
8.2.2.2. GPSS: A generalized macro simulation language GPSS was
developed by Gordon [2] of IBM. GPSS deals in transactions,
events, facilities, storages, and queues. A transaction is
generated for each element in the job stream. Events mark the
movement of the transaction through the system of facilities,
storages and queues. A facility is a system element that can
accommodate only one transaction at a time. A storage is a
system element that can accommodate many transactions, up to a
specified limit, at a time. A queue is a waiting line. Gordon
gives examples of these concepts as they might occur in
different systems:

System            Transaction   Facility     Storage
Communications    Message       Switch       Trunk Lines
Transportation    Car           Toll Booth   Road
Data Processing   Record        Key Punch    Memory

There have been at least two efforts to develop a
specialized simulation language for computer systems. These
languages are CSS II [3] and IMSIM [4].
8.2.2.3 CSS II: This simulator was developed by IBM to support
its own system analysis needs, and to aid in analysis of
customer facility requirements.
IBM now provides CSS II as proprietary software on a
rental basis. CSS II is similar in concept to GPSS but differs
in one important aspect: it is not general but applies speci-
fically to computer systems. Thus its language speaks in terms
of tape units, disk files, communication lines, and terminals,
and provides instructions for the modeling of programming systems.
CSS programming consists of a specification of system
elements, a specification that generates job streams, and a
specification of the logical operations to be performed on the job
elements. Its generality is enhanced by permitting a more or
less complete construction of both the system hardware confi-
guration and the software operating system, to a level depen-
dent on the user's needs and interests.
8.2.2.4 IMSIM: IMSIM was developed by Systems Development
Corporation for the NASA Manned Spacecraft Center. It presents a
less general approach to computer simulation, in comparison to
CSS, because user constructions are confined to the preparation
of input tables which define the configuration of computer system
elements and the job stream. The algorithms that define the
software operating system cannot be modified, except for a few
switch setting choices. The operating system programmed into
IMSIM includes the capability of simulating priority-dependent
multiprogrammed and multiprocessor computing systems. IMSIM
is supported only at the Manned Spacecraft Center, NASA. It
is written in Modlit, a language similar in many respects to
GPSS.
8.3 Phase 2 -- Low-Level, Detailed, Mixed Simulation
The attractiveness of high-level simulation lies in
its ability to discover major conceptual flaws before the de-
sign is committed to hardware and before the operating system
software is frozen. Hopefully, this effort also builds confi-
dence in the system concepts at a low cost. The major short-
coming of high-level simulation is that design flaws may have
been obscured due to simplifications in the models employed.
The low-level simulation employing various degrees
of real hardware, software and a simulated environment will
provide a more definitive verification of system performance,
albeit at a significantly higher cost. The CVT program pre-
sently being carried out at MSFC is an example of a simulation
with a real computer and data bus. The space station environ-
ment and typical work loads will, however, have to be simulated
by artificial means.
8.3.1 The Simulation Process
The simulator is a device (both hardware and software)
which provides the developer of the system with overall exter-
nal control of the system being tested. The simulator provides
hardware and software required for specifying, monitoring, and
testing the system under well controlled conditions. Reference
1 describes the simulation process, which can be organized into
four factors as shown in Figure 8.1. These are:
a) the user (USER),
b) the simulator itself (SIMULATOR),
c) the computer system being simulated (SYSTEM), and
d) the simulation output (OUTPUT).
Let it be made clear that the SYSTEM being simulated
may be implemented either as a complete software effort on a
host computer or it may contain certain elements of real hardware
and software. There are advantages and disadvantages to both
approaches. These shall be made clear in the discussions to
follow.
Figure 8.1: Simulator Logical Partitions
The geometry of the logical partitions in the simulator
is shown in Figure 8.2, and the physical control is shown
in Figure 8.3 following. The control path labeled A in the
two figures provides the user with the capability of specifying
the load module to be simulated, the start location and initial
SIMULATOR clock setting, the maximum allowable SIMULATOR clock
setting (to assure run termination), the configuration of the
SYSTEM (levels of redundancy, numbers of spares, initial fault
states, etc.), and information relative to automatic reconfiguration,
illegal instruction detection, execution of instructions in
read/write memory, etc.
The primary control, which the USER specifies, follows
path B. By this path, and the return path C, the USER will be
capable of ordering entry to routines which he provides, upon
the occurrence of events or situations he specifies. The
trigger-directives can include time conditions, location refer-
ence (instruction or operand access), and state changes (I/O,
interrupt, hardware error detection signals, etc.). Once his
routines have been entered as a consequence of a trigger direc-
tive, the USER is capable of accessing all locations, registers,
states, and conditions in the SYSTEM, and modifying them as he
sees fit. Through an interface language, the USER may implement
actions based upon conditions of almost arbitrary complexity, by
simply programming the testing of these conditions in his rou-
tines.
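The trigger-directive mechanism of paths B and C can be sketched as a table of (condition, routine) pairs consulted by the SIMULATOR after every simulated event. Everything in the sketch is hypothetical: the class, the event format and the method names are assumptions and do not describe the interface language of any actual simulator.

```python
class Simulator:
    """Hypothetical skeleton: steps a SYSTEM model and fires USER triggers."""

    def __init__(self, memory, max_clock):
        self.memory = dict(memory)   # SYSTEM state visible to USER routines
        self.clock = 0
        self.max_clock = max_clock   # assures run termination (path A)
        self.triggers = []           # (condition, routine) pairs (path B)
        self.output = []             # records destined for OUTPUT

    def on(self, condition, routine):
        """Register a USER routine entered when condition(sim, event) holds."""
        self.triggers.append((condition, routine))

    def emit(self, event):
        for condition, routine in self.triggers:
            if condition(self, event):
                routine(self, event)   # routine may inspect or modify the SYSTEM

    def run(self, program):
        """program: iterable of (kind, address, value) events from the SYSTEM."""
        for kind, address, value in program:
            if self.clock >= self.max_clock:
                break
            self.clock += 1
            if kind == "store":
                self.memory[address] = value
            self.emit((kind, address, value))
```

A location-reference trigger and a time trigger could then be written as:

```python
sim = Simulator({100: 0}, max_clock=10)
# Location reference: record every store to address 100.
sim.on(lambda s, e: e[0] == "store" and e[1] == 100,
       lambda s, e: s.output.append((s.clock, e[2])))
# Time condition: inject a fault marker at simulated time 2.
sim.on(lambda s, e: s.clock == 2,
       lambda s, e: s.memory.__setitem__(200, "fault injected"))
sim.run([("store", 100, 7), ("load", 100, None), ("store", 100, 9)])
```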
Control paths D and D' provide information for OUTPUT,
such as trace, flow-trace (output produced by branches only),
interrupt-occurrences, faults, or output directly from the USER.
Information is not required on path "a" since the USER
only interacts with the SYSTEM once the run starts and needs no
interaction with the SIMULATOR. Figure 8.3 shows that the SYS-
TEM is actually implemented within the SIMULATOR, and that the
control paths to it actually interact via the SIMULATOR.
Path E of Figure 8.3 represents the closed-loop dynamic
flow capability which the USER can exercise within his interface-
language routines. These routines may, in turn, call routines
prepared in other languages to perform further processing. Us-
ing external routines via this path allows the convenient addi-
tion of a data-recording capability to the system to allow post-
run processing and the addition of almost any conceivable envi-
ronmental model.
Figure 8.2: Basic Simulator: Input, Simulator, Output
(Diagram: the USER supplies the information which controls the
simulation run; the SIMULATOR contains the SYSTEM being
simulated; OUTPUT is the listing produced by the simulation run.)
Figure 8.3: Simulator Physical Control Flow
(Diagram: the lettered control paths connecting the USER, the
user-provided routines, the SIMULATOR with the SYSTEM implemented
inside it, and OUTPUT.)
8.3.2 Simulator Design Issues
8.3.2.1 User Interface: For any simulation effort to be
successful, the user or experimenter must be provided with a capability
of exercising complete control over the simulation from beginning
to end. This control includes the ability to:
a) Specify all initial conditions, including default
conditions, before the simulation is run. This includes
the ability to specify the contents of memory locations,
control bits and processor registers.
b) Specify the work load to be run in the system, including
hardware elements to be used.
c) Specify the environment to be simulated, including
extraordinary events such as failures.
d) Specify the outputs to be generated and reported.
e) Specify modes of operation, the ability to roll back,
and snapshot times.
8.3.2.2 Work Load: The simulation of the processing unit or
employment of real hardware is only the first step in the
simulation of a computer system. In order to provide meaningful
information on complex system interaction a "work load" for the
system must be specified. For the SUMC MP this will include a
reasonably complete set of actual or simulated applications
software modules as well as the real executive system.
If one attempts to evade the issue of generating a
realistic work load, many important design factors may be over-
looked. For example, if a simulated work load is generated by
a collection of subroutines, each one occupying a given amount
of memory space, and a specified execution time (as simulated
by a countdown loop), the information concerning instruction
frequency is lost. Also, since memory requirements for each
subroutine are assigned arbitrary values, many factors con-
cerning memory management become distorted.
It is suggested that an effort be made to generate
the real application software to be used as the work load.
Space qualified software is not required for a system simula-
tion. Therefore, the use of real applications software, to
the extent permitted by the simulator's limitations, may be
less difficult than trying to generate a realistic model of
the work load.
Because of the interaction between hardware and the
executive, it seems only reasonable that the executive system
model must contain as many as possible of the features of the
real executive. A large number of the parameters or algorithms
which will be modified because of the simulation experience are
implemented in the executive software.
8.3.2.3 The Environment: In simulating aerospace computer
systems, the work load must often interact with the spacecraft
and its environment. For example, navigation programs must
receive accelerometer inputs before they can correctly update
vehicle position and velocity. A high degree of similarity must
be maintained between the real and modeled environments so that
the simulated computer can be subjected to computational loads
and dynamic situations closely approximating the conditions of
the actual mission.
The simulation environment developed for the Apollo
Guidance, Navigation and Control System included modeling space-
craft dynamics, engines, optics, astronaut interactions, atmos-
pheric and gravity effects, motions of celestial bodies, etc.
For the SUMC MP, the environment cannot be simulated
within the SUMC itself. This would distort memory management,
I/O, processor allocation and real time factors. The simulated
environment must be provided by external equipment. For example,
the H316 computer can provide such a vehicle by simulating the
data bus and all its peripherals. If a real data bus is employed
with a limited amount of real avionics equipment then the H316
could be interfaced to the data bus to simulate those equipments
which are impossible to exercise satisfactorily in the
laboratory (e.g., IMU's, fault detectors within BITE).
8.3.2.4 Measurement of the System Under Test: The accumulation
of statistics and the output presentation of this data are the
ultimate result of any simulation run. If a real computer
is used instead of a simulated model then a major problem can
arise due to the lack of computer memory capacity for trace and
dump routines and data. If the memory is used for trace and
dump data then it cannot be used to process the workload. The
results of the simulation run will therefore be distorted.
A secondary problem also arises in that real time aerospace
computers usually do not possess a full complement of high
speed recording equipment, such as card readers, high
speed printers, or tape units. The attachment of this equipment
could also distort the results since it puts an abnormal
load on the I/O.
A complete software simulated system will not suffer
the problems mentioned above since time and memory space are
also simulated entities. When real hardware is used within the
simulator, it is difficult to compensate for inadequate memory
or the loss of real time.
Three features which were incorporated into the Apollo
computer simulator are presented below as examples of the
interaction of the simulator and the simulated system. These
interactions imply that if a real SUMC MP is to be employed as an
element of the simulated system, a design effort must be
undertaken to provide the correct "hooks" into the hardware so that
useful results may be obtained.
a) Rollback
A useful feature to be used in microsimulation is
rollback [5]. Long missions such as Apollo require
simulation time on the order of hours. Should the host
computer (on which the simulation is being executed)
malfunction, the simulation will abnormally terminate.
Upon restart one does not want to go back and duplicate
the execution of this simulation from the beginning of
flight. By establishing rollback points in the simula-
tion this problem is avoided. At rollback times com-
plete core and register dumps are taken, and this infor-
mation is put on a secondary storage device. Then upon
system failure the simulation can be restarted at the
last rollback point by loading memory with this stored
information. The overhead associated with rollback is
well justified with long simulations, such as Apollo.
However, to prevent this overhead from becoming too
high the system designer must decide upon a judicious
criterion for establishing rollback points. That is,
he must trade off the cost of frequently storing rollback
information against the savings in not having to
resimulate a large part of the flight.
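The rollback trade-off described above can be illustrated with a small sketch. It is not the Apollo simulator's implementation: the `Simulator` class, its `state` dictionary (standing in for core and register contents) and the `checkpoint_every` criterion are hypothetical names chosen for the example, and an in-memory copy stands in for the secondary storage device.

```python
import copy


class Simulator:
    """Toy simulator; `state` stands in for core and register contents."""

    def __init__(self, checkpoint_every):
        self.state = {"step": 0, "acc": 0}
        self.checkpoint_every = checkpoint_every  # hypothetical rollback criterion
        self.saved = copy.deepcopy(self.state)    # stands in for secondary storage
        self.checkpoints_taken = 0

    def step(self):
        # One simulated step of work.
        self.state["step"] += 1
        self.state["acc"] += self.state["step"]
        # Establish a rollback point every `checkpoint_every` steps.
        if self.state["step"] % self.checkpoint_every == 0:
            self.saved = copy.deepcopy(self.state)
            self.checkpoints_taken += 1

    def rollback(self):
        # Restart from the last rollback point instead of from step 0.
        self.state = copy.deepcopy(self.saved)


def run_with_failure(total_steps, checkpoint_every, fail_at):
    """Run to total_steps, with one host failure when step == fail_at."""
    sim = Simulator(checkpoint_every)
    executed = 0      # steps actually simulated, including re-simulated ones
    failed = False
    while sim.state["step"] < total_steps:
        sim.step()
        executed += 1
        if not failed and sim.state["step"] == fail_at:
            failed = True
            sim.rollback()
    return sim.state, executed, sim.checkpoints_taken
```

For a 100 step run with a failure at step 57, a rollback interval of 10 costs 10 dumps and 107 simulated steps, while an interval of 40 costs 2 dumps and 117 simulated steps; the final state is the same in both cases.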
b) Stress Testing
Stress testing can be provided in a simulator to
help determine if combinations of application programs
will exceed their combined time budgets under the executed
conditions of operation. This request reduces
the speed of the object computer. If a group of appli-
cation programs is run in a simulation with a computer
whose speed is, say, 75% of the real computer capabi-
lity, successful operation may be interpreted to mean
that no more than three-fourths of the computer capa-
city has been absorbed. This special request can thus
be used to "diminish" the capability of the computer
until a point is reached where timing requirements
are not satisfied. This level then is a guide to the
amount of computer capability still available for
other software.
Stress testing can also be used to verify the
"thrashing" threshold of memory management. If the
amount of available memory is reduced but the workload
and environment are held constant then a measure can
be obtained as to how much excess memory is available
for multi-programming.
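The speed-reduction form of stress testing can be sketched as follows. This is a simplified illustration under stated assumptions: the task times, the single frame budget and the function names are hypothetical, and a real simulator would check each program's own timing requirement rather than one combined budget.

```python
def meets_budget(task_times_ms, frame_ms, speed_percent):
    """True if all tasks fit in the frame at speed_percent of full speed."""
    # Running at p% of full speed stretches every task by 100/p.
    # Integer form of sum(t) * 100 / p <= frame, to avoid rounding.
    return sum(task_times_ms) * 100 <= frame_ms * speed_percent


def lowest_passing_speed(task_times_ms, frame_ms, step_percent=5):
    """Diminish the computer until timing fails; return the last passing speed."""
    speed = 100
    lowest = None  # None means the workload fails even at full speed
    while speed > 0 and meets_budget(task_times_ms, frame_ms, speed):
        lowest = speed
        speed -= step_percent
    return lowest
```

With hypothetical task times of 4, 6 and 5 ms in a 20 ms frame, `lowest_passing_speed([4, 6, 5], 20)` returns 75, which would be read as roughly one quarter of the computer capability remaining for other software.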
c) The Coroner Request
A "coroner" special request can be implemented in
a simulator for post-mortem diagnosis. The request
causes storage of information from each simulated in-
struction in a circular buffer of size n. If the run
abnormally terminates, a list of the last n instructions
simulated is produced. This list is a valuable aid in
determining the reason for the abnormal termination.
However, the overhead associated with this request re-
quires that it only be used when its cost is outweighed
by the enhancement of debugging efficiency.
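A minimal sketch of the coroner buffer, assuming a simulated instruction can be summarized as an (address, opcode) pair. The `Coroner` class, the `run` loop and the `HALT_ERROR` opcode are hypothetical and only illustrate the circular buffer of size n; they do not describe the Apollo simulator's actual record format.

```python
from collections import deque


class Coroner:
    """Keeps a record of the last n simulated instructions."""

    def __init__(self, n):
        # A deque with maxlen behaves as a circular buffer:
        # appending to a full deque discards the oldest entry.
        self.trace = deque(maxlen=n)

    def record(self, address, opcode):
        self.trace.append((address, opcode))

    def report(self):
        # Oldest entry first, most recent entry last.
        return list(self.trace)


def run(program, coroner):
    """Simulate a list of opcodes; return the post-mortem list on abnormal end."""
    for address, opcode in enumerate(program):
        coroner.record(address, opcode)   # per-instruction overhead of the request
        if opcode == "HALT_ERROR":        # hypothetical abnormal termination
            return coroner.report()
    return None                           # normal completion, no post-mortem
```

The per-instruction `record` call is the overhead the text refers to, which is why the request would normally be enabled only while debugging.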
d) Knobs and Dials
A system simulation is undertaken not only to
verify specific design concepts, but also to make per-
formance measurements under various parametric condi-
tions. In order to achieve this objective the system
(hardware and software) must be provided with enough
flexibility (knobs and dials) so that the various de-
sign parameters may be adjusted.
Although the details of the SUMC MP have not been
published by MSFC a number of suggestions can be made
concerning those entities which should remain as vari-
able parameters during the simulation. Implicit in
the following listing are obviously a number of assump-
tions which, if incorrect, could make the variable un-
necessary. For example, if a management directive ex-
ists that only two processing units are to be employed
with no concern for future expansion then a number of
problem areas associated with multiprocessor design
degenerate into trivial solutions.
The following list describes some of the design
parameters which should be kept variable during the
low level simulation process.
1) Operating Memory (M2)
Assume a paged virtual memory concept is
employed. The following items should be adaptable
in order to optimize performance.
i) Page size
ii) Page replacement algorithm
iii) Page presence algorithm. If an associative
memory is employed to determine the presence
of a page in M2, then the number of words in
the associative memory should be made a
parameter.
iv) Total size of M2 storage as well as the number
of M2 modules.
v) Possibly the speed ratio between M2 and M3.
2) Processing Unit and Local Storage (M1)
i) Instruction architecture. A measure of
instruction frequency will indicate which
instructions are not needed. Similarly the
measurement of subroutine usage of various
control features will indicate which
instructions need to be incorporated into the
design.
ii) Depending upon the use of M1 its size should
be variable.
iii) The algorithm used to assign processes to
processors should remain a variable as should
most of the executive functions dealing with
resource allocation.
3) Communication
i) The P-M2 internal bus width and rate should
be changeable especially if a bottleneck is
anticipated, based upon phase 1 simulation.
ii) The communications link from processor to
processor as well as from processor to I/O
should be made flexible so that the traffic
capacity can be increased if a bottleneck
is discovered.
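One way to keep such "knobs and dials" adjustable in a software simulator is to gather them in a single parameter record and generate the runs from it. The sketch below is only an illustration: the `Knobs` field names and every default value are placeholders invented for the example, not SUMC MP figures.

```python
from dataclasses import dataclass, replace
from itertools import product


@dataclass(frozen=True)
class Knobs:
    """Hypothetical parameter record; all values are placeholders."""
    page_size_words: int = 256
    replacement_policy: str = "LRU"
    associative_words: int = 16
    m2_total_words: int = 65536
    m2_modules: int = 4
    m1_size_words: int = 64
    bus_width_bits: int = 32


def sweep(base, **choices):
    """Yield one Knobs record per combination of the listed choices."""
    names = list(choices)
    for values in product(*(choices[name] for name in names)):
        # replace() copies the base record, overriding only the swept fields.
        yield replace(base, **dict(zip(names, values)))
```

For example, `list(sweep(Knobs(), page_size_words=[128, 256, 512], m2_modules=[2, 4]))` yields six configurations, each of which would drive one simulation run.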
References for Chapter 8
1) Intermetrics, Inc., Final Report, Contract NAS9-12119,
"Advanced Data Management System Analysis Techniques
Study", July 1972.
2) Gordon, Geoffrey, System Simulation, (Prentice-Hall,
Englewood Cliffs, New Jersey, 1969).
3) IBM, CSS II General Information, Technical Publications
Department, 1133 Westchester Avenue, White Plains, New
York.
4) System Development Corporation, "Information Management
System Design For Future Missions, Users Manual",
(Report TM-(L)-4719/001/01, Contract NAS9-11211, NASA
Manned Spacecraft Center, Houston, Texas).
5) Chandy, K.M., and Ramamoorthy, C.V., "Rollback and
Recovery Strategies for Computer Programs" (IEEE Trans.
on Comp., C-21(6), June 1972), pp. 546-555.
Chapter 9
CRITIQUE OF SUMC's ARCHITECTURAL CHARACTERISTICS
9.1
A critical evaluation of the SUMC design is provided
in order that future efforts may have the benefit of the present
results. This critique will not be primarily directed at
the implementation aspects of the circuit and/or logic design,
but rather at the higher level architectural features of the
hardware. An evaluation of any design must of necessity rest
with a determination of how well the design approaches a set
of goals. Therefore, a set of design goals is now presented
which is Intermetrics' interpretation of MSFC's desires in the
development of the SUMC project.
a) The MSFC desire to use a basic SUMC hardware design on
a wide variety of missions, which will require a wide
range of computation power, leads to the requirement
for a hardware design which is expandable. "Expandability"
should be considered with respect to such features
as word length and sizes of the various memory and processing
structures, including the micro memory, scratch
pad, ALU, multiplexers and main memory.
b) The variety of application requirements leads to a desire
to create an architecture which is flexible and
adaptable to changing conditions. For example, the
instruction set should be able to be modified or
changed. Similarly, components should be able to be
utilized within the same architectural structure
regardless of their execution speed. As various
technologies improve, this then allows the smaller and
faster logic and/or memory elements to be incorporated
into the design with a minimal impact.
c) A specific requirement of the SUMC expandability and
adaptability design is the ability to utilize the design
as either a stand-alone uniprocessor or as a
larger multiprocessor system.
d) The "U" in SUMC stands for "ultra" reliability. This
must not only include the ability to operate for a long
period of time without failure, but also (from a
practical point of view with respect to current tech-
nology), must indicate the ability to detect failures.
The detection of failures is required if a multipro-
cessor is to constrain error propagation and possess
the ability to reconfigure.
e) Since the SUMC family of computers is meant primarily
for aerospace applications, the conservation of weight
and power becomes of primary importance.
Keeping in mind these different criteria, the following
sections will examine various aspects of the SUMC design.
Not all of the aspects are independent of each other, but they
are presented in such a manner so as to highlight different
points of view.
9.2 Micro Instruction Sequencing
In a microprogrammed machine where flexibility is one
of the objectives, it is extremely important that the micro
sequence control itself be flexible. The sequencing control
presently available in SUMC is described in Figure 9.1. The
only control actions possible are
a) stepping thru the micro code (0., 1., 2., 3.)
b) branching to a location described by
1) an ALU output (4.)
2) associated with an opcode (5.)
3) given in the micro code (13.)
c) alternate choice in either
1) branching or holding (6., 7.) or,
2) branching or stepping (8., 9., 10., 11., 12., 14.,
15.)
Although these forms of sequencing do allow the generation
of a static set of linked micro code, they do not allow for
easy modularization of micro code.
While modularization becomes particularly important when
the instruction architecture contains powerful semantically
concise operations, it is also extremely important with standard
current forms of instructions. The execution of an
instruction can be viewed as occurring in three phases: instruction
fetch, operator decode, and execution of the operation.

Binary  Conditions           SEQ Action         I.C. Action
Code
0000    None                 +1                 Hold
0001    None                 +1                 IR (26-31)->IC
0010    None                 +1                 MROM (C11-C16)->IC
0011    None                 +1                 PRM (20-31)->IC
0100    None                 PRM (22-31)->SEQ   Hold
0101    None                 IAROM->SEQ         Hold
0110    IC>0                 Hold               IC-1->IC
        IC=0                 MROM (C7-C16)      Hold
0111    IC≥4                 Hold               -4
        IC<4                 MROM (C7-C16)      Hold
1000    CNT = 1              MROM (C7-C16)      Hold
        CNT = 0              +1                 Hold
1001    INT + DOT = 1        MROM (C7-C16)      Hold
        INT + DOT = 0        +1                 Hold
1010    INT Req. = 1         MROM (C7-C16)      Hold
        INT Req. = 0         +1                 Hold
1011    INT + DOT + DIN = 1  MROM (C7-C16)      Hold
        INT + DOT + DIN = 0  +1                 Hold
1100    ACCS = 1             +1                 Hold
        ACCS = 0             MROM (C7-C16)      Hold
1101    None                 MROM (C7-C16)      Hold
1110    IC>0                 +1                 -1
        IC=0                 MROM (C7-C16)      Hold
1111    IC≥4                 +1                 -4
        IC<4                 MROM (C7-C16)      Hold

CNT = EALU overflow or ALU overflow or DEX3 as specified by the ACCS, CNT
field
ACCS = PRM sign or ER sign as specified by the ACCS, CNT field

Figure 9.1: Control Conditions and Actions

The SUMC allows for common manipulation of all instructions
in both the instruction fetch and operator decode phases
of execution. It is interesting to note that after the instruc-
tion has been fetched, the memory operand is fetched, if the
operator is of the appropriate "class" of instructions. This
differentiation is performed by the hardware and is completely
dependent, therefore, upon both the instruction architecture and
its physical bit mapping. There is no general way to have several
classes of instructions, each with its own idiosyncrasies,
without this special hardware help. This is because the decision
on whether or not to read memory must be performed in the "com-
mon" section of code.
If there were to be the ability to call and link in the
micro code, then the question as to whether to read an operand
from memory could be decided after the operator had been decoded
and the execution of the operation had been entered.
(The one current possibility for modularization within
the SUMC micro code would be:
a) Place the return micro address in the PRM
b) Branch to the micro subroutine
c) Upon entering the subroutine, save the return address
in the SPM
d) To return, gate the return address from SPM to the PRM
and into the SEQ.
This would effectively take four micro words.)
Besides the desire for micro code modularization for
complex instruction sets, the next section will point out the
need to be able to do much more micro condition testing for
sequence control.
9.3 Choosing Functions to Optimize
It has been observed that the SUMC hardware has been
optimized for the implementation of the multiply, divide and
square-root operations. However, what is the actual expected
percentage of occurrence of these operations? In particular,
what is the frequency distribution of all the implemented
machine instructions?
C.C. Foster et al. [1] has made a study of OP code usage
on the CDC 3600 and has found that in scientific Fortran programs,
the compiled code contained only 10% arithmetic instructions. The
remaining instructions were involved with load, store, subroutine
linking and various other control operations. For the more
commercial type of application, the total percentage of all
arithmetic instructions fell to less than 5% [2]. The most common
arithmetic operation was clearly addition. Even for the program
with the most arithmetic functions, multiply and divide
were less than 2%.
C.C. Church [3] states:
"In instruction occurrence we found arithmetic 8.3 percent
... these commands? Obviously, we need the "Data Move"
function, but do flow charts call for anything near 40
percent? And what of the transfers? My flow charts do not
call for anything near 23 percent of the problem to be
involved in transferring."
While these types of statistics can be interpreted as indicating
a mismatch between the problem to be solved (i.e., the
program) and the operations provided (i.e., the machine
instructions), they can also provide insight into the design and
implementation of instruction sets. If the instruction set
provided is of the current machine level form (e.g., IBM 360)
then, for example, the multiply and divide instructions are
not driving design features. If these instructions are truly
less than 2% in occurrence, then their optimization and reduction
of their execution time by half will save only about 1% of the
overall execution time. On the other hand, an optimization of
branches by half their execution time would make a dramatic
savings in actual execution time.
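The arithmetic behind this argument can be checked with a short sketch. It assumes, as the text implicitly does, that an instruction class's share of occurrences can be taken as its share of execution time; the function name is hypothetical, and the 23 percent figure is simply the transfer frequency quoted from Church above, used here for illustration.

```python
def time_saved(time_fraction, speedup):
    """Fraction of total run time saved when a part of the run that takes
    time_fraction of the time is made `speedup` times faster."""
    return time_fraction * (1 - 1 / speedup)


# Multiply and divide: under 2% of the run, execution time halved.
multiply_divide_saving = time_saved(0.02, 2)

# Transfers (branches): 23% of the run, execution time halved.
transfer_saving = time_saved(0.23, 2)
```

Halving a 2 percent share saves 1 percent of the run, whereas halving a 23 percent share saves 11.5 percent, which is the contrast the paragraph above draws.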
While it is understood that certain data reduction or
filtering problems do require an above normal amount of multiplication,
this is not a common occurrence and the multiply and
divide instructions should not form the basis of the machine
architecture.
If one takes the SUMC JZ (Jump Zero) instruction for
an example, (Figure 9.2) it can be seen that not only can it
be made faster, but the number of micro instructions can be
reduced if necessary conditions to be tested are generated by
the hardware. The testing of conditions is indeed the method
of determining control flow through an algorithm, and therefore
will always either have to be in some fashion artificially
produced or explicitly tested. The cost of providing these extra
dynamic conditions is small when compared to the gains in
execution time and savings in micro-memory.
Figure 9.2: Micro Program Flowchart (Jump Zero)
JZ: if (A) = 0, go to Z; if (A) ≠ 0, go to NI. Legible flowchart
elements: test (A = 0); (MAR) - 1 -> (PC); transfer to FETCH.
It can be noted that further savings can be had in the
revised JZ flow chart (Figure 9.3) by either placing the
(MAR)-1->(PC) function as a special entrance to the FETCH routine,
since it occurs in several SUMC instructions, or for this same
reason have this action as part of a sequence control state.
9.4 Field Manipulation - Masking - Shifting - Bit Addressing
and Shifting
The word length of a computer is often chosen because
of arithmetic precision and, contemporaneously, the instruction
format size. Once chosen, this word length then becomes an
artificial quantum of addressability. This is the case with
the SUMC. The implementation of an instruction set often
requires the efficient manipulation of variable length fields,
masking, bit manipulation and testing. While the SUMC can
accomplish all these functions at a Macro level by using shift
and logical instructions, it is suggested that if flexibility
is to be obtained the high frequency of use of these functions
in various instruction architectures requires that they should
be more directly under Micro level control.
The SUMC does recognize this fact, in a limited way,
by providing in the hardware the extraction of mantissa,
characteristic and sign of floating point arithmetic words stored
in scratch pad. However, this is rigid. The hardware design
should not initially presume to know the desired arithmetic
precision of the application. For example, the queues and
control bits required for the executive functions of the SUMC are
not given special hardware since they are not known in advance.
What is desired is a generalized bit manipulation,
masking, field insertion and extracting mechanism which can
be micro controlled. In the actual implementation of a particular
instruction set for a particular mission, it is recognized
that this generally could be specialized in order to optimize
the actual usage. An example of the need for testing
of certain bits efficiently would be if it were decided to
implement indirect addressing and hence the "indirect" bit of
the operand would have to be efficiently known during the
effective memory address calculation. Besides changes in the
meaning of instruction fields, it could also be possible to
realize other data types or other physical forms of current
data types.
Figure 9.3: Micro Program Flowchart (Jump Zero) Revised
JZ: if (A) = 0, go to Z; if (A) ≠ 0, go to NI. Legible flowchart
elements: JZ entry; (MAR) - 1 -> (PC); test on AFO.

9.5 Limited Scratch Pad Addressing
The philosophy of a generalized register set contained
in a scratch pad structure is very good as far as providing an
adaptable design. It would be desired on the hardware level to
allow any register to serve any function. The specification,
therefore, as to the assignment of registers would be contained
within the micro code. The location within the scratch pad of
the macro level program counter, base register, etc., should be
specified by the micro program and not dictated by an arbitrary
hard wired location. One can easily conceive of instruction
sets with automatic base registers or none at all, or with a
return address stack. The present SUMC design does not allow
this generalization.
The internal interconnection of the scratch pad address
register (SPAR) to the instruction register (IR) and the
micro memory buffer register indicates that addressing of the
scratch pad must be specified in advance, within instruction
code or micro memory
code. The ability to dynamically deduce or calculate a scratch
pad address is not possible because the SPAR can not be loaded
from one of the SUMC's internal registers, such as the PRR,
MQR or MAR. The dynamic determination of scratch pad address
would be required if one wished to implement a stack within the
scratch pad.
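The reason a stack needs a computed scratch pad address can be seen in a small model. The class below is a hypothetical illustration, not a description of the SUMC scratch pad: its size, the region reserved for the stack and all names are assumptions. The point is only that the address used by push and pop comes from a changing register value rather than from a field fixed in the instruction or micro word.

```python
class ScratchPadStack:
    """Toy scratch pad whose upper locations are used as a stack."""

    def __init__(self, size=16, base=8):
        self.regs = [0] * size   # scratch pad locations
        self.base = base         # first location reserved for the stack
        self.sp = base           # stack pointer: a value computed at run time

    def push(self, value):
        if self.sp >= len(self.regs):
            raise OverflowError("scratch pad stack full")
        # The scratch pad address is the current stack pointer, so it
        # cannot be written into the micro code in advance.
        self.regs[self.sp] = value
        self.sp += 1

    def pop(self):
        if self.sp == self.base:
            raise IndexError("scratch pad stack empty")
        self.sp -= 1
        return self.regs[self.sp]
```

In hardware terms, `self.regs[self.sp]` corresponds to loading the scratch pad address register from an internal register, which is the path the text identifies as missing.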
9.6 Micro and Main Memory Speed Ratio
The current T2L version of SUMC operates with a micro
memory cycle time of 330 nanoseconds, while main memory
possesses a 660 nanosecond cycle time. It is suggested that the
speed ratio between micro and main memory should be closer to
5 or 10 to 1 instead of 2 to 1. This becomes especially
desirable when an instruction set is more complex and semantically
more powerful than the IBM 360 instruction set. In more powerful
instruction sets, one finds both:
a) an instruction operation specified in fewer bits, and
hence memory does not have to be read as often, and
b) the operations to be performed are themselves more
complex and therefore take more computational steps.
9.7 Main Memory Synchronization
While reviewing the micro code flow charts, it was
observed that the processor or micro memory cycle time was
synchronized to the main memory cycle time by executing micro
level NOPS. The main memory cycle time therefore was an in-
tegral part of the micro code. This can be disastrous for two
entirely different reasons.
a) If a slower or faster main memory were employed many
changes would be required in the actual micro code.
b) In a multiprocessor one can not determine the exact
time between a memory request and the response, since
the addressed memory module might be busy with another
processor and the request might take a number of memory
cycles to be satisfied.
Multiprocessors, therefore, can not guarantee their exact
response time with respect to memory.
What is required is a completely asynchronous operating
memory interface where the execution of micro code and memory
timing are not intertwined.
In a multiprocessor environment it is necessary that
a process be able to read the contents of a memory location
and change its value by writing into it all in one period of
time at the exclusion of all other processors. This form of
read/write mechanism must be provided by any potential
multiprocessor.
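The indivisible read/write described above is commonly realized as a test-and-set operation; the text does not name a particular primitive, so that choice is an assumption of this sketch. Threads stand in for processors, and a `threading.Lock` inside the hypothetical `MemoryCell` class models the memory module holding off all other processors for the duration of one read/write cycle.

```python
import threading
import time


class MemoryCell:
    """One memory location offering an indivisible read-then-write."""

    def __init__(self, value=0):
        self._value = value
        self._cycle = threading.Lock()  # models exclusive use of the module

    def test_and_set(self):
        # Read the old value and write 1 with no other access in between.
        with self._cycle:
            old = self._value
            self._value = 1
            return old

    def clear(self):
        with self._cycle:
            self._value = 0


def worker(flag, counter, iterations):
    for _ in range(iterations):
        while flag.test_and_set() == 1:   # flag already held: try again
            time.sleep(0)                 # yield to the other threads
        counter[0] += 1                   # exclusive section
        flag.clear()


def run(processors=4, iterations=200):
    flag, counter = MemoryCell(), [0]
    threads = [threading.Thread(target=worker, args=(flag, counter, iterations))
               for _ in range(processors)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

Because only one worker at a time can observe the flag as 0, the shared counter ends at exactly `processors * iterations`.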
9.8 Limited Modularity Concept
The "M" in SUMC, which stands for modularity, seems
to extend only to the packaging of arithmetic and register func-
tions into 4 bit entities. The concept of modularity can be
extended to the higher level of internal architecture by
providing an internal structure which is organized around 1, 2
or 3 buses. These buses allow all the internal structures to
communicate between one another. As needed, new structures
may be added, such as a floating point unit or an associative
memory unit. Most present day mini computers (see Figure 9.4)
are designed around an internal bus structure.
This concept can be extended as in the MLP 900 (IC
9000) which also provides what are called program cards. These
are hardware modules addressed by micro memory to provide spe-
cific hardware functions.
Mini computers such as the HP 2000 series, PDP-11,
MODCOMP I, GRI-909, etc., are all built around an internal bus
structure. Often it is this internal bus structure which en-
ables the system to expand and contract to meet varying re-
quirements.
The "M" in SUMC is severely limited with respect to
this described form of modularity.
[Figure 9.4]
9.9 The "U" in SUMC- Ultra Reliability
Reliability clearly requires "good" components. The
SUMC program does attempt to achieve component level reliability
by experimenting with advanced state of the art component
and packaging and fabrication techniques. Reliability is one
of the major design goals of the SUMC architecture. This being
the case, it is surprising that the architecture of the SUMC
does not consider hardware detection of the major fault conditions
of integrated circuit implementation. The packaging and
definition of the modules should consider the effect of failure
and should attempt to make detectable failures more statistically
independent. For example, integrated circuit modules
should tend to be more bit oriented than function oriented.
It is necessary in reliable systems to have "immediate"
fault detection within the hardware in order to prevent
propagation of errors. The interaction of transient faults and
the micro execution of instructions must be carefully considered,
and made part of the basic structure.
9.10 Confusion Between Design Levels
A basic philosophical comment seems appropriate. A
truly modular design should possess maximum independence be-
tween design levels. That is, the architecture (block diagram)
level, instruction definition level, and the implementation
(logic design, circuit technology) level should be approached
as independently as possible. A change of definition at one
level should not cause major impact on the other levels.
When flexibility is desired the implementation archi-
tecture should be generalized enough to allow the implementation
of a wide variety of instruction sets. This is particularly
true when one considers a large future time framework. While
most current instruction architectures are similar to the 360,
they will become more and more problem oriented such as the
Burroughs B6700. The instruction set should reflect the major
application for which the system is to be used. For example,
when a Higher Order Language is employed, the instruction set
should be so specified as to aid in the generation of, and
hence the efficient execution of, compiled code.
Similarly, the introduction of new technology at the
implementation level of design affects speed, weight, power and
cost, but should have no major impact upon the instruction set
or (processor, memory, I/O) architecture.
Clearly one can not be too pedantic in the utilization
of the principle stated above and must appreciate the practica-
lities of all design levels.
The SUMC design has greatly intertwined the (processor,
memory, I/O) architectural, instruction set definition,
and implementational levels of the design.
References for Chapter 9
1) Foster, C.C., et al., "Measure of OP Code Utilization",
IEEE Trans. on Comp., May, 1971, pp. 582-584.
2) Bingham and Kauffman, "Analysis of Static Object Code
Produced by Algol and Cobol Compilers for the Burroughs
B5500", Burroughs Corporation, Paoli, Penna.,
February, 1969.
3) Church, C.C., "Computer Instruction Repertoire-Time
for a Change", SJCC, 1970.

