Techniques for the realization of ultrareliable spaceborne computers  Interim scientific report by Goldberg, J. et al.
Produced by the NASA Center for Aerospace Information (CASI) 
https://ntrs.nasa.gov/search.jsp?R=19690020584 2020-03-23T21:44:41+00:00Z
Interim Scientific Report 2
October 1967
TECHNIQUES FOR THE REALIZATION
OF ULTRARELIABLE SPACEBORNE COMPUTERS
Prepared for:
NATIONAL AERONAUTICS AND SPACE ADMINISTRATION
ELECTRONICS RESEARCH CENTER
575 TECHNOLOGY SQUARE
CAMBRIDGE, MASSACHUSETTS 02139 	 CONTRACT NAS12-33
By: J. GOLDBERG	 M. W. GREEN	 K. N. LEVITT	 H. S. STONE
SRI Project 5580
Approved: D. R. BROWN, MANAGER
COMPUTER TECHNIQUES LABORATORY
J. D. NOE, EXECUTIVE DIRECTOR
INFORMATION SCIENCE AND ENGINEERING
ABSTRACT
This is the second scientific report of a study of the development of
techniques for the realization of ultrareliable, high-performance,
spaceborne computers. The techniques developed are in support of
computer structures in which reliability is achieved through
autonomously controlled logical reconfiguration and fault masking. A
multiprocessor model is described that is particularly appropriate to
the attainment of ultrareliability. Local design techniques that
facilitate reconfiguration are discussed for various computer functions,
including memory and microprogram control. Design techniques are
presented for economical, fault-tolerant data commutation networks. An
initial effort is being directed toward a formal description of
program-design techniques that will facilitate hardware diagnosis and,
hopefully, yield mistake-free programs.
FOREWORD
This is an interim report, summarizing work accomplished during the
first six months of the second phase of a two-year program, the goal of
which is the development of techniques for the realization of
ultrareliable space computers. This study has been conducted in the
Computer Techniques Laboratory of Stanford Research Institute, under the
sponsorship of the Electronics Research Center of the National
Aeronautics and Space Administration.
The goals of the first phase were to survey the state of the art of
design for achieving ultrareliable spaceborne computers, and to form a
basis for research which would advance that art. The final report, which
resulted from the first phase of the program, was concerned with the
following:
(1) The basic characteristics of an advanced spaceborne
computer
(2) A description of fault-masking techniques for general
logic functions
(3) A survey of codes for storage and arithmetic operations
(4) Problems of system organization for dynamic error control
(5) Tests for diagnosis of fault conditions
(6) Some initial descriptions of network designs for a
reconfigurable computer, including commutation or
interconnection networks, programmable processing modules,
and programmable control units
(7) Error-control techniques for memory systems
(8) Distributed power supply systems
(9) The application of magnetic logic
(10) A survey of the published literature on the attainment of
reliable systems through the use of redundancy.
The goal of the second phase is to develop detailed techniques for
the logical design of an advanced, ultrareliable spaceborne computer.
The techniques are to be used in support of computer structures in
which reliability is achieved through autonomously controlled logical
reconfiguration and fault masking. They have been developed in the
following steps. First, a system organization is developed that
facilitates dynamic maintenance processes. Second, on the basis of the
selected system organization, a detailed logical design is performed of
networks that are uniquely appropriate for a reconfigurable computer.
Third, diagnostic procedures, reliability enhancement techniques, and
reliability analysis measures are developed for these networks where
the requirement exists. Fourth, software techniques are developed to
aid in the diagnosis and detection of failures, together with
techniques for designing reliable programs. Finally, reliability
analysis techniques are developed for the overall system.
The report is organized into six chapters and one appendix. The
first chapter, which serves as an overall introduction to the report,
contains the statement of the problem; the goals, methods, and
assumptions of the study; a brief review of prior work on pertinent
aspects of reliability enhancement; and an outline of the organization
of the report.
Chapter II is concerned with the principles of multiprocessor system
design that are particularly appropriate to the attainment of
ultrareliability. Chapter III contains logical design techniques for
networks identified with the memory, control, and microprogram control
functions. Chapter IV contains design details for commutation networks
that are to perform the important function of data switching in a
multiprocessor computer. Chapter V contains a formal description of
program-design techniques, which will facilitate hardware diagnosis and
which will, hopefully, yield mistake-free programs. Chapter VI contains
a brief description of other topics considered in the study, namely
network diagnosis and a survey of the pertinent literature, together
with the conclusions and a summary of our plans for the remainder of
the program.
The report is self-contained, at least as far as the statement of
principles is concerned. Detailed mathematical proofs and the
description of some hardware designs, particularly concerning
commutation networks and arithmetic processing elements, have been
deferred until publication of the final report. The reader is referred
to the first-phase final report for detailed background information.
The technical studies reported herein are the work of the following
members of the Computer Techniques Laboratory:
Mr. J. Goldberg
Mr. M. W. Green
Professor E. L. Lawler (University of Michigan; summer
employee of SRI)
Dr. K. N. Levitt
Dr. R. A. Short
Dr. H. S. Stone
Dr. J. B. Turner
All of the individuals contributed to the writing of the various
sections of the report. Mr. Goldberg, who serves as Project Supervisor,
and Dr. K. N. Levitt, who serves as Project Leader, are responsible for
the organization and editing of the report.
CONTENTS
ABSTRACT
FOREWORD
LIST OF ILLUSTRATIONS
LIST OF TABLES
I INTRODUCTION
    A. Statement of the Problem
        1. Multiple-Problem Sets
        2. Highly-Varied and Complex Computations
        3. Computations with a Range of Priorities
        4. Variable Capacity of the Earth-Vehicle Communication Link
        5. Failures Which May Be Transient or Permanent
        6. Failures Which May Not Be Independent or May Not Embrace
           Single Components
        7. Severe Constraints on Weight and Power
        8. Failures in Most Components
    B. Statement of Program Goals
    C. Brief Review of Prior Work
    D. Brief Summary of Report
II PRINCIPLES OF MULTIPROCESSOR SYSTEM DESIGN
    A. Introduction
    B. Multiprocessor System Organization
    C. Description of Error-Control Policies
    D. Logical Design, Strategy, and Software Problems Associated
       with Multiprocessor-System Design
III TECHNIQUES OF LOGICAL DESIGN
    A. Introduction
    B. The Organization of a Reliable Memory Module
        1. Introduction
        2. System Description
        3. Coordination of Error-Control Modes
    C. Design Techniques for a Modular, Microprogrammed Control Unit
        1. Introduction
        2. Schemes for Selection of the Next μ-Instruction
            a. Composition of the Address Code
            b. Generation of Test Functions for Branching-Type
               μ-Instructions
        3. Schemes for Programmable Selections
        4. Schemes for Programmable Generation of Boolean Implicants
        5. Hierarchy Schemes
        6. Conclusions
    D. An Improved Realization for Switched-Adaptive Voting
        1. Introduction
        2. Description of a New Scheme
IV PRINCIPLES OF COMMUTATION NETWORK DESIGN
    A. Introduction
        1. Commutation Requirements
        2. Prior Solutions to the Commutation-Network Design Problem
        3. The Primitive Building Block of Commutation Networks
    B. Commutation Networks for Complete Permutation--Complete
       Utilization
        1. Nonredundant Networks
        2. Byte-Sliced Commutation Networks
        3. CPCU Networks Insensitive to Cell Failures
            a. The Stuck-Function Fault
            b. An Alternative Single Stuck-Function Correcting
               Construction
            c. Correction of Bad-Output Fault Types
    C. Commutation Networks for Complete Permutation--Incomplete
       Utilization
    D. Commutation Networks for Incomplete Permutation--Nonorder
       Preserving
    E. Commutation Networks for Incomplete Permutation--Order
       Preserving
    F. Commutation Networks for "Shorting"
    G. Summary
V ULTRARELIABLE PROGRAMMING
    A. Classification of Program Faults
    B. Faults Arising From Numerical Analysis
        1. The Need for Analysis
        2. Design of Floating-Point Hardware to Aid Numerical Analysis
        3. Detection of Failures Arising from Numerical Analysis
        4. Recovery from Detected Numerical Computation Failures
    C. Failures Arising From Program Faults
        1. Prevention of Programming Faults
            a. High-Level Languages
            b. Independent Check Calculations
            c. Software Maintenance and Modification
        2. Summary
    D. Techniques for Detecting Software Failures
        1. Protection Against Incorrect Memory Accesses
        2. If and Only If Programming
        3. Recovery From Detected Faults
    E. Summary and Conclusions
VI CONCLUSIONS AND SUMMARY OF OTHER STUDIES IN PROGRESS
    A. Conclusions
    B. Summary of Other Work in Progress
APPENDIX
REFERENCES
DD Form 1473
ILLUSTRATIONS
Fig. II-1   Multiprocessor Computer System Block Design
Fig. III-1  Redundant Memory Module
Fig. III-2  Fixed-Structure Selection Network
Fig. III-3  Reconfigurable Selection Network
Fig. III-4  Programmable Boolean-Implicant Function Network
Fig. III-5  Two-Level Microprogram Scheme (After Graselli)
Fig. III-6  Fixed Program Variable Translation Microprogram Scheme
Fig. III-7  Switched-Adaptive Voting
Fig. IV-1   Classification of Data Commutation Requirements
Fig. IV-2   "Crossbar" Realization of Commutation Function
Fig. IV-3   Basic Cell for Commutation Networks
Fig. IV-4   Network for Complete Permutation--Complete Utilization
Fig. IV-5   Byte-Sliced Permutation Network
Fig. IV-6   Permutation Networks Insensitive to Single
            "Stuck-Function" Fault
Fig. IV-7   Non-Minimal Redundant 4-Permuter, Single
            "Stuck-Function" Correcting
Fig. IV-8   Network for Correcting Bad-Output Faults
Fig. IV-9   Schematic Representation of Decomposition of a Complete
            Permutation--Incomplete Utilization Network
Fig. IV-10  Basic Cell with Augmented Set of Inputs
Fig. IV-11  An N, m Combination Network
Fig. IV-12  Recursive Approach to m-N Combination Network
Fig. IV-13  4-8 Combination Network
Fig. IV-14  Redundant 4-8 Combination Network for Correction of
            Single "Stuck-Function" Failures
Fig. IV-15  Recursive Approach to Incomplete Permutation--
            Nonorder-Preserving Network
Fig. IV-16  An Incomplete Permutation--Order Preserving Network
Fig. IV-17  Recursive Approach to Incomplete Permutation--Order
            Preserving Network
Fig. IV-18  "Shorting" Network
Fig. IV-19  Redundant Shorting Network
TABLES
Table II-1  Flow Description of Multiprocessor
Table II-2  Functional Requirements of Supervisory Control Unit
Table II-3  Functional Requirements of Executive
Table IV-1  Failure Conditions for Basic Cell
Table IV-2  Number of Cells in Single Stuck-Function Correcting
            CPCU Networks
I INTRODUCTION
In this chapter we discuss, briefly and in general terms, the problem
of realizing ultrareliable spaceborne computers. Specifically, the
chapter contains a discussion of the problem of designing computers that
are appropriate to the characteristics of space missions, the statement
of the program goals, a brief review of prior work on the attainment of
ultrareliability, and a brief summary of this report.
A. Statement of the Problem
In this section we consider the basic characteristics of space mission
computation, which impose severe constraints on the spaceborne computer
design, and the conclusions concerning the design that follow from
these constraints.* The advanced spaceborne computer must respond to
the following.
1. Multiple-Problem Sets
Several problems must be accommodated simultaneously, implying that
a multiprocessing and/or a multiprogramming capability is required.
2. Highly-Varied and Complex Computations
It is anticipated that the scope of the mission will require a
general scientific-type computer that will accommodate a wide variety
of sensors and output devices.
3. Computations with a Range of Priorities
It is convenient to assign to each mission computation three priority
measures. The first of these is a critical value, which indicates the
relative need for the existence of a particular computation, compared
with the other computations at the moment. For example, certain launch
computations are probably essential compared with certain ion-density
computations. The second such measure is an accuracy value, reflecting
the value attached to varying degrees of accuracy in the particular
computation. For example, although a 2000-mile pass may be desired, a
20,000-mile pass may still be tolerable if the closer pass simply cannot
be attained. The third measure is an urgency value, which reflects the
required speed with which a certain computation must be performed and,
hence, the amount of equipment that must be devoted to its execution.
These characteristics imply one attractive approach: a computer may be
designed that is capable of altering the logical interconnections among
its components, and the tasks may be scheduled to match the available
performance capability. Such an organization is, colloquially, said to
embody reconfiguration with graceful degradation.
* References are listed at the end of the report.
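The idea of scheduling tasks to match the available capability can be
illustrated with a minimal sketch. It is not the report's design: the
`Task` record, the `schedule` function, and the greedy policy are all
assumptions made for illustration, with the urgency and accuracy
measures collapsed into a minimum and a desired processor count.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    criticality: int   # higher means more essential to the mission
    min_units: int     # fewest processors giving tolerable accuracy/speed
    want_units: int    # processors needed for the desired accuracy/speed

def schedule(tasks, available_units):
    """Greedy allocation: serve the most critical tasks first.

    Each admitted task first receives its minimum; leftover units then
    upgrade tasks toward their desired allocation, again in order of
    criticality. Tasks whose minimum cannot be met are dropped.
    """
    ranked = sorted(tasks, key=lambda t: t.criticality, reverse=True)
    allocation = {}
    remaining = available_units
    for task in ranked:
        if task.min_units <= remaining:
            allocation[task.name] = task.min_units
            remaining -= task.min_units
    for task in ranked:
        if task.name in allocation and remaining > 0:
            extra = min(task.want_units - task.min_units, remaining)
            allocation[task.name] += extra
            remaining -= extra
    return allocation

tasks = [
    Task("launch_guidance", criticality=10, min_units=2, want_units=3),
    Task("ion_density", criticality=2, min_units=1, want_units=2),
]
full = schedule(tasks, available_units=5)      # all equipment working
degraded = schedule(tasks, available_units=2)  # after several failures
```

With five working units both tasks run at their desired allocation;
with two, only the launch computation survives, at its minimum
allocation. That is the sense in which performance degrades gracefully
rather than failing outright.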
4. Variable Capacity of the Earth-Vehicle
Communication Link
During certain phases of the mission, communication to the vehicle
will not be possible, although there will be many instances when
high-speed communication, at a rate possibly exceeding 1 megabit/sec,
will be quite feasible. Hence, the computer must be capable of
autonomously carrying out diagnosis and repair routines during part of
the mission and, conversely, must be capable of responding to (and
indeed taking advantage of) external control information at other times.
5. Failures Which May Be Transient or Permanent
That failures may be of either kind is obvious, but the optimum
technique for distinguishing between the two fault types is not at all
obvious. One approach is to treat all failures as transient and effect
a try-again procedure in response to all failures.
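A try-again procedure of this kind might be sketched as follows. The
function `run_with_retry`, the attempt limit, and the `FlakyAdder` test
unit are hypothetical; in particular, the check used here compares
against a known answer, whereas a real system would presumably rely on
coding or comparison of replicated outputs to detect the failure.

```python
def run_with_retry(operation, check, max_attempts=3):
    """Run `operation` until `check` accepts its result.

    Returns (result, classification), where classification is "ok",
    "transient" (a retry succeeded), or "permanent" (every attempt
    failed the check).
    """
    for attempt in range(1, max_attempts + 1):
        result = operation()
        if check(result):
            return result, ("ok" if attempt == 1 else "transient")
    return None, "permanent"

class FlakyAdder:
    """Hypothetical unit that returns bad sums a fixed number of times."""

    def __init__(self, bad_results):
        self.bad_results = bad_results

    def add(self, a, b):
        if self.bad_results > 0:
            self.bad_results -= 1
            return a + b + 1   # corrupted result
        return a + b

unit = FlakyAdder(bad_results=1)
result, kind = run_with_retry(lambda: unit.add(2, 3), lambda r: r == 5)
```

A failure that disappears on retry is classified as transient; one that
persists through every attempt is treated as permanent and would be
handed to the repair or reconfiguration process.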
6. Failures Which May Not Be Independent or May Not
Embrace Single Components
It is clear that the potentially high-stress space environment can
result in a failure that is not confined to a single component. For
example, a radiation pulse could affect a sizable portion of the system,
and a sudden unexpected acceleration could result in a fractured chip.
It has been convenient in the past to assume that failures (possibly
embracing many elements, as described above) occur one at a time. This
assumption might be tenable if each failed subsystem is repaired
immediately following the fault occurrence. One possible counterexample
is the case wherein a system relies upon the switching of standby units
to achieve ultrareliability. It is required that such standby units
either possess fault-detection capabilities or be diagnosed immediately
prior to insertion in the system.
7. Severe Constraints on Weight and Power
Although the constraints on the weight and power of spaceborne
equipment have been relaxed in recent years, it remains imperative to
achieve a design that provides the maximum ratio of computing power to
total equipment. This observation provides additional evidence for a
multiprocessing, graceful-degradation system with a minimum amount of
pure standby redundancy.
8. Failures in Most Components
Statistical reliability measures, comprehending various risk policies,
have been shown to provide useful estimates of system performance.
However, since the variance of most semiconductor failure distributions
is quite large, it is important to minimize the number of blocks for
which a single component failure would disable the entire system.
Essentially, it is important to incorporate redundancy into as much of
the system as possible, even though, on a statistical basis, not much
improvement is realized by the inclusion of such redundancy.*
* It is apparent that the design philosophy should reflect Murphy's Law.
B. Statement of Program Goals
In recognition of the severe problem of designing ultrareliable
spaceborne computers, NASA Electronics Research Center has set up the
following study goals, the first three of which relate to the completed
first phase, and the latter four of which relate to the half-completed
second phase:
(1) To survey the state of the art of logical design of
spaceborne computers as it pertains to the enhancement of
reliability
(2) To conceive and evaluate new schemes of system design and
operation that offer promise of advancing the state of
the art
(3) To recommend further directions of research that will aid
in the improvement of present techniques
(4) To investigate the organization of aerospace computers
having high computational performance in which autonomously
controlled logical reconfiguration of equipment and fault
masking are employed for the purpose of achieving
ultrareliable operation. Derive specifications for classes
of networks and processes that are appropriate to the
realization of such organizations
(5) To investigate the design of networks that realize the
functions of general logic and commutation for
reconfiguration for the systems derived in Item 4. These
networks will incorporate various features of error control
including fault masking, diagnosability, and
reconfigurability. The design will employ criteria
appropriate to advanced integrated-semiconductor array
technology. Develop logical designs and techniques of
analysis and synthesis appropriate to the particular
networks designed. Develop criteria for the evaluation of
the reliability of such networks.
(6) To investigate the design of programs that facilitate the
reconfiguration processes in systems of the type developed
in Item 4 and that facilitate the flexible variation of
computational performance with amount of available
equipment. Propose the specification of requirements on
executive programs that consider reconfiguration. Develop
schemes for re-addressing and replacement of hardware
control by software subroutines.
(7) To investigate techniques for the evaluation of the
reconfigurable computers proposed in Item 4. Develop
approaches to the construction of theoretical reliability
models for such computers.
C. Brief Review of Prior Work
A significant effort has been devoted in the past decade toward
solving various facets of the problem of realizing ultrareliable digital
systems. This work, summarized in detail in the Phase I Final Report
(Ref. 1), ranges from investigations of fault-masking techniques for
various computer sub-blocks, such as arithmetic processor and memory
units, to studies of reliability-enhancement policies for large systems.
It is felt that most of the problems pertaining to enhancing the
reliability of isolated digital sub-blocks are now understood, at least
from the standpoint of making sound engineering judgments concerning the
use of the various techniques.* Strictly passive redundancy techniques
have been applied to the control and arithmetic processing sections of
the Saturn IVb guidance computer, and it has been concluded that the
application of such techniques, exclusively, cannot economically satisfy
the computation and reliability requirements of future spaceborne
computers.
The anticipation of this conclusion prompted an increased effort on
the part of many organizations in the investigation of dynamic
error-control mechanisms, in which the logical interconnections among
the components of the computer may be altered. Theoretically, this
approach enables more efficient utilization of redundancy than the
passive approach. In an attempt to substantiate this supposition,
Avizienis et al. (Refs. 3, 4) have been working on the construction of
the JPL-STAR reconfigurable guidance computer, which embodies probably
the simplest form of reconfiguration. This computer consists of a set
of identical processing units, each with self-contained error-detection
control by means of an arithmetic code, and a reliable magnetic power
stepping switch,† which can connect power to any one of the units.
IBM (Ref. 5) is presently building a reconfigurable version of the
Saturn IVb guidance computer. In these systems the reconfiguration is
employed only at very high functional levels, but it is well known that
there is potentially greater gain to be achieved by employing the
reconfiguration at lower system levels.
* Three of the most promising so-called passive techniques discussed in
Ref. 1 are the use of replicated-voting logic, error-correcting codes,
and adaptive voting logic.
† In a forthcoming SRI report the operation of this power switch will be
described.
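The text above does not say which arithmetic code the JPL-STAR units
employ. As a general illustration of error detection by an arithmetic
code, the following minimal sketch uses an AN code, in which every
operand is multiplied by a check constant before use. The constant
A = 3 and the function names are assumptions chosen for illustration.

```python
A = 3  # check constant; an illustrative choice, not taken from the report

def encode(x):
    """Represent the integer x by the code word A * x."""
    return A * x

def is_valid(word):
    """A word is a legal code word only if it is a multiple of A."""
    return word % A == 0

def decode(word):
    if not is_valid(word):
        raise ValueError("arithmetic error detected")
    return word // A

# The code is preserved by addition: the sum of two code words is the
# code word of the sum, so the adder output can be checked directly.
total = encode(5) + encode(7)

# Flipping a single bit changes the word by a power of two, which is
# never a multiple of 3, so the corrupted word fails the check.
corrupted = total ^ 0b100
```

Here `decode(total)` recovers 12, while `is_valid(corrupted)` is false,
so a processing unit built this way can flag its own arithmetic faults
without reference to a second unit.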
Graceful-degradation systems, which enable more efficient utilization
of available equipment, have been conceived and, purportedly, evaluated.
Among the many references on this approach, Ref. 6 assumes a
single-processor structure, and Refs. 7 and 8 postulate a multiprocessor
structure. In these studies, as well as many others, several important
items are not treated in depth, namely those relating to the following:
(1) Diagnostic and replacement policies
(2) Logical design techniques for memory, control, and
processing units so that diagnosis and repair are
facilitated (or indeed feasible)
(3) Reliable commutation (or data switching) required for the
execution of subsystem replacement
(4) The specification of software for the control of diagnosis
and repair.
When the new sources of failure that are introduced by the
mechanization of the four items are included in the reliability
analysis, it is not clear that the systems will perform as promised.
This is an especially intricate problem when reconfigurability is
extended to low system levels, or when the capability for graceful
degradation is provided.
D. Brief Summary of Report
The succeeding chapters represent an attempt to study in depth many
of the detailed problems that must be investigated before a design for
a reconfigurable spaceborne computer with graceful degradation can be
specified, or indeed even before an intelligent estimate of the
feasibility of designing such a system can be formulated.
In Chapter II, the system organization of a multiprocessing structure
is discussed, along with the particular characteristics of such a
structure that facilitate reconfiguration and graceful degradation, and
the unique logical design problems attendant to the achievement of an
ultrareliable multiprocessor. Maintenance policies, which reflect the
computation and design constraints defined in Sec. I-A, are discussed,
and a flow description is given that points out the system response to
various error, input, and interrupt conditions.
In Chapter III, we summarize an initial effort to perform a detailed
logical design of the various functions associated with the selected
computer organization. It appears that the reliability will be enhanced
if some capability for repair is incorporated into the memory,
processing, and control units. Concerning the processing units and
possibly a portion of the control and memory units, the most attractive
repair scheme is one for which the logical realization is a
one-dimensional cascade of identical elements, whence the repair
operation requires only the routing to a succeeding element in the
cascade of the signals destined for a faulty element. We call such a
logical realization a byte-sliced realization, since it is natural to
assign a byte (containing at present an undetermined number of bits) of
each of the registers, adders, decoders, etc. to each element or slice
in the cascade. A design of a byte-sliced microprogram control unit is
described, as is a memory module that embraces byte slicing in addition
to such reliability enhancement techniques as data channel coding and
access switch-failure detection. A novel digital realization is given
that relates to the voting-switchover scheme described in detail in
Ref. 1.
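The repair operation for a byte-sliced cascade can be pictured with a
minimal sketch. It assumes, as the summary above does not spell out,
that the cascade contains more physical slices than the word requires,
so that spare slices sit at the far end; the function `route_slices`
and its parameters are hypothetical.

```python
def route_slices(num_logical, faulty, num_physical):
    """Map logical byte slices onto working physical slices, in order.

    Signals destined for a faulty element are shifted to the next
    working element further along the cascade. The i-th entry of the
    returned list is the physical slice that serves logical slice i.
    """
    working = [p for p in range(num_physical) if p not in faulty]
    if len(working) < num_logical:
        raise RuntimeError("not enough working slices")
    return working[:num_logical]

# Four logical slices on a five-slice cascade in which slice 1 has failed.
mapping = route_slices(num_logical=4, faulty={1}, num_physical=5)
```

The mapping preserves the order of the slices, so only signals at or
beyond the faulty position move, each by one place along the cascade.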
Chapter IV is concerned with the design of commutation networks for
the various data switching functions associated with a byte-sliced
multiprocessor. In particular, we consider permutation networks that,
for example, route data from a selected memory to a selected control
unit, and order-preserving networks that route data between the working
(i.e., failure-free) byte slices of memory and control units. An
important feature incorporated in the designs is the capability to
accommodate hardware failures in the commutation networks without
requiring the discarding of the entire memory and processing units or
even the byte slices served by the network.
Chapter V is concerned with what seems to be an initial attempt to
specify formal rules for the synthesis of programs in an ultrareliable
computer. The description is in a general framework so as to embrace
two important problems. The first of these is the synthesis of programs
that are mistake-free, wherein a mistake is assumed to be the result of
human frailty, or of programs for which mistakes are readily detected
and recovery to correct execution is possible. The second is the
synthesis of programs that will facilitate the detection of hardware
failures. We also consider the interrelationship between hardware and
software, in particular, the ancillary hardware that will facilitate
the specification of reliable software.
II PRINCIPLES OF AWLTIPROCESSOR SYSTEM DESIGN
A. Introduction
In Sec. I-A a brief list was presented of the characteristics of
advanced spaceborne computation and the constraints on a computer design
that are a consequence of these characteristics. It is observed--although
clearly the observation is not original--that the multiprocessor organiza-
tion provides an excellent match to the severe spaceborne computer design
constraints. The basis for this conclusion is that in a multiprocessor
structure:
(1) The facility exists for simultaneous execution of several
programs
(2) It is possible to satisfy a wide range of computation
time requirements by assigning a varying number of
processors to a given task
(3) It is possible to satisfy a wide range of accuracy*
requirements for the same reason as in (2)
(4) Reconfiguration, to enhance reliability, is easily
accommodated, at least at the processor level
(5) Accommodation to different urgency* levels is easily
achieved by assigning a varying number of processors
to a given task, although in this case each processor
will operate with the same set of data†
(6) The data switching does not appear to be significantly
more complex than would be expected for a single
processor structure with extensive reconfiguration
capability.
Reference 1 concluded that the key problems of system design in
achieving a reliable reconfigurable computer are flexibility of struc-
ture, modularity, simplicity of diagnosis, and reliability of control.
These terms were informally defined in Sec. I-A.
† It is implied that a set of processors is operating in the replicated
mode, and the data outputs are in some way compared.
The first two problems, at least at the processor level, are clearly
solved by the multiprocessor organization; the solution to the latter
two problems must await the detailed study of maintenance policies and
of the logical design of the processors.
Many descriptions of multiprocessor systems have appeared in the
literature (Refs. 8, 9, 10), and several contemporary computer systems (Ref. 11)
rely upon
multiprocessing. Most of these previous descriptions have been concerned
with (1) gross estimates of system reliability, assuming for example,
that diagnosis and switchover are always executed correctly; (2) schedul-
ing analyses and simulations to facilitate the determination of system
responses to various inputs; and (3) the specification of software that
will enable the optimum utilization of the hardware. Our intention in
this program is to study maintenance policies and logical designs that
will maximize the system reliability,* and then to formulate a realistic
estimate of system reliability, which will, hopefully, compare favorably
with the previous gross estimates.
In this chapter we discuss a possible multiprocessing system organiza-
tion, considering those policies which apply particularly to error control
and the functional requirements of the various blocks.
B. Multiprocessor System Organization
The multiprocessor model with which we are concerned is depicted in
Fig. II-1. This system consists of a set of M high-speed working memories
(WM); a set of N simple processor and control units (SP); a set of Q
arithmetic logic units (ALU); a set of back-up memories (BAM); an input-
output device controller (I/O) for which we provide sets of spare registers,
counters, buffers, and real-time clocks; two commutation networks (CN); a
supervisory control unit (SCU); and two registers for setting up the
commutation networks (it is convenient to view these registers as form-
ing a component of the SCU).
* At present the "reliability of a multiprocessor" has not been formally
defined, but we will temporarily assume on a qualitative basis that
the reliability measure reflects the capability of the unit to carry
out the set of mission computation tasks weighted according to priority,
urgency, and accuracy.
[FIG. II-1: multiprocessor system model (figure not legible in this copy)]
In operation a given set of WM's, SP's, and ALU's will be in com-
munication by means of links established in the two commutation networks.
In addition, the commutation networks* perform the function of directing
data to the failure-free byte slices, thus effectively "repairing" the
units.
It is envisioned that each SP unit will have the capability of
executing comparatively simple decision and arithmetic algorithms, the
capability of controlling program flow, and the capability of controlling
processor allocation and scheduling.† An ALU will be used for the execu-
tion of complex algorithms requiring extensive processing hardware. The
SCU will function as a referee in all error-control processes, and,
essentially, represents the system hard-core, although it can be super-
seded by a command from the ground. The BAM's will store the task pro-
grams, diagnostic programs, and the setup programs for the commutation
networks.
In addition to the inter-unit communication links provided by the
commutation networks, a single data channel (probably serial), shown as
bold-faced lines in Fig. II-1, is provided. This data channel links the
SP and WM units with the supervisory control unit. It was noted by
Alonso (Ref. 10) that a complete multiprocessing system could be designed, con-
taining only this single data-channel communication link, although in
practice it is doubtful that this link would be serial. However, we have
included the possibility of multiple-simultaneous communication between
SP and WM blocks because of the additional flexibility thus provided, and
because it seems appropriate, at this stage, to work with a general model.
* Detailed logical designs of commutation networks are presented in
Chapter IV.
† It is implied here that each processor can function as an executive.
This "floating-executive" technique, which is developed in greater
detail in Sec. II-C, is also discussed in Ref. 12.
One important feature of the system, which is not shown explicitly
in the figure, is that each defined block of the system will have at
least one distinct power supply associated with it.* Furthermore, it is
assumed that the power can be disconnected from a faulty block without
resulting in the propagation of errors into connecting blocks due to
excessive loading on the part of the disconnected unit.
C. Description of Error-Control Policies
In this section we will describe a possible set of error-control
policies by indicating in flow-form the system response to various input
and error conditions. It should be recognized that the system will not
function exactly according to this description, but, by reference to this
description, we are provided with a reasonably complete set of functional
requirements for the various blocks.
In Table II-1 we present the flow description, and in Tables II-2,
and II-3 respectively, we summarize the functional requirements of the
supervisory control unit and the simple processor and control unit per-
forming the role of the executive. At the conclusion of Table II-1, a
set of comments is presented to clarify anomalies in the flow descrip-
tion and to point out some instances where there is some doubt concerning
the optimality of the strategy selected.
* It was shown in Ref. 1 that the distribution of power supplies does
not appreciably increase the weight or raw power requirements beyond
that associated with the use of a single power supply.
Table II-1
FLOW DESCRIPTION OF MULTIPROCESSOR
1. SCU selects an SP (say SP_E), which is listed in the table of
available SP's, to function as executive.
2. SCU instructs SP_E to access an available BAM.
    2.1 If accessing capability of SP_E has failed, SCU selects
    another executive, SP_E is removed from the table of
    available SP's, and Step 2 is repeated.
3. SP_E selects an available WM to serve as the executive working
memory, WM_E.
4. SP_E retrieves from a small store, possibly associated with the
SCU, the setup data so that SP_E can access WM_E.
5. Setup data is transferred to the setup register of CN1.
6. A WM and SP diagnostic program is directed from BAM through
SP_E to WM_E.
    6.1 If the diagnostic program indicates SP_E cannot function as
    an executive, another SP is chosen and Step 2 is repeated.
    6.2 If diagnosis of WM_E fails and WM_E cannot be repaired,
    another executive memory is selected, and WM_E is removed
    from the available table.
    6.3 If WM_E is repairable, SP_E performs the task.
7. The executive program is transferred from BAM through SP_E
to WM_E.
8. SP_E supervises diagnosis of the remaining set of SP's.
9. SP_E selects a set of WM's, SP's and ALU's to communicate with
each other.
10. SP_E computes the setup data to effect this communication and
transfers setup data to the two setup registers.
11. Each SP diagnoses its associated ALU and WM.
    11.1 If diagnosis fails and associated WM's and ALU's are not
    repairable, a different assignment is effected.
    11.2 If the WM's and ALU's are repairable, the associated SP
    performs the task.
12. SP_E responds to an input device requesting service.
    12.1 Assume urgency value of input is 1 (simplex):
        12.1.1 SP_E selects SP_S1, WM_S1, ALU_S1 to handle program.
        12.1.2 Pertinent program is transferred from BAM through
        SP_S1 to WM_S1.
        12.1.3 Input data is transferred to WM_S1 and also to WM_E.
            12.1.3.1 Results of intermediate calculations
            might be stored in WM_E (as well as in
            WM_S1) to provide a convenient roll-back
            point in the event of a failure.
        12.1.4 Assume SP_S1 detects a computational error:
            12.1.4.1 The computation is repeated, using the
            data and results stored in WM_E, to de-
            termine if failure is transient or
            permanent.
            12.1.4.2 If error continues, the problem, data,
            and intermediate results are transferred
            to another set of SP, WM and ALU units
            for continuation of computation.
                12.1.4.2.1 SP_S1 is diagnosed and either
                retained or discarded.
                12.1.4.2.2 WM_S1 and ALU_S1 are either
                diagnosed by SP_S1 (if it does
                not contain the failure) or
                retained for other tasks.
            12.1.4.3 If computation error continues with the
            replacement set of SP, WM, and ALU units,
            the error resides in the program or
            input data.
    12.2 Assume urgency value of input is 2 (duplex):
        12.2.1 SP_E selects the sets of units (SP_D1, WM_D1, ALU_D1)
        and (SP_D2, WM_D2, ALU_D2) to handle program.
        12.2.2 Pertinent program and input data are transferred
        from BAM through SP_D1, SP_D2 to WM_D1, WM_D2.
        12.2.3 After each intermediate computation is completed,
        a change-commutation network command is directed
        to SP_E.
        12.2.4 The assignment of CN1 is altered so that SP_D1
        and SP_D2 exchange working memories.
        12.2.5 The results computed by the two sets of units are
        compared.
            12.2.5.1 The intermediate computation is repeated,
            commencing at the latest point in the
            program where the computations were in
            agreement.
            12.2.5.2 If the discrepancy ceases, then the fail-
            ure was transient.
            12.2.5.3 If the discrepancy continues, then the
            failure is immediately attributed to one
            of the sets of units (SP_D1, WM_D1, ALU_D1)
            or (SP_D2, WM_D2, ALU_D2).
            12.2.5.4 The faulty set is disconnected, and CN2
            is altered by SP_E so that an available
            set of blocks is connected.
            12.2.5.5 The faulty units are then diagnosed.
    12.3 Assume urgency value of input is 3 (triplicated):
        12.3.1 SP_E selects three sets of units (SP_T1, WM_T1, ALU_T1),
        (SP_T2, WM_T2, ALU_T2) and (SP_T3, WM_T3, ALU_T3) to
        handle program.
        12.3.2 Pertinent program and input data are transferred
        to the three working memories.
12.3.3 After each intermediate computation is completed,
the results obtained by each of the three sets of
units are compared.
12.3.4 If a unit's results disagree with those computed by
the remaining two units, the dissenting set is dis-
connected and replaced by another set.
Comments
1. It is assumed throughout that the decisions of the executive
are continuously checked by the SCU. For example, an exec-
utive that consistently requests the diagnosis of the other
SP's would be disconnected from service and subsequently
diagnosed.
2. The status of the SCU can be periodically monitored by, for
example, a ground station, whence the SCU can be discon-
nected from service if it appears to be faulty. The ground
station can then assume the "veto" power previously assigned
to the SCU.
3. It has not been assumed for a general system that only a
single unit can fail during the period between the execution
of diagnostic routines, although a system which contained
only two SP units would probably be disabled by the occur-
rence of simultaneous faults in each SP.
4. In the description, it is implied that all units are diag-
nosed immediately prior to insertion in the system. Although
this policy would make a single-failure assumption seem more
tenable, it is not strictly required since the system can
accommodate a policy wherein diagnosis is deferred until a
failure is detected.
5. In Step 9 it is assumed that a sufficient quantity of WM's,
SP's and ALU's are available. If this is not the case, then
several tasks might be executed with the same equipment in a
multiprogrammed mode, or an SP unit might execute a program
without recourse to an ALU. In addition, in order to satisfy
stringent accuracy requirements, several ALU's might be
assigned to function with a single SP in the execution of a
program.
6. We have assumed that each task could be assigned one of three
urgency values, corresponding to simplex, duplex and triplicated
modes of operation. The simplex mode, wherein a processor de-
tects its own faults, would be attractive for tasks that could
be interrupted and for which the data and program code permit
convenient roll-back to a known error-free state. The duplex
mode, wherein two sets of units function simultaneously, would
be used for interruptable tasks that require immediate error
detection, but for which convenient roll-back is not possible,
and for which an accurate record of all calculations must be
available. The triplicated mode, wherein three (or possibly
more) sets of units function simultaneously, would be used for
the most critical mission tasks.
7. The reliability status of the various SP, WM, and ALU blocks
is stored in a table of available equipment, which can be
conveniently considered to form a portion of the SCU. A fail-
ure of this table could be circumvented by assuming that all
units are faulty, whence each unit is diagnosed prior to
insertion in the system.
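The three urgency modes of the flow description can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `run` callable (standing in for one execution on an SP/WM/ALU set from the roll-back point), and the use of plain equality for result comparison are all hypothetical and are not part of the system described.

```python
from collections import Counter


def classify_simplex(run, retries=1):
    """Simplex mode (Step 12.1.4): repeat a computation that reported an error.

    run -- callable returning (ok, result); hypothetical stand-in for one
           execution from the roll-back point held in the executive memory.
    """
    ok, result = run()
    if ok:
        return "no error", result
    for _ in range(retries):
        ok, result = run()
        if ok:
            return "transient", result
    return "permanent", None


def duplex_agree(result_1, result_2):
    """Duplex mode (Step 12.2.5): a mismatch detects an error but does not locate it."""
    return result_1 == result_2


def triplex_vote(results):
    """Triplicated mode (Steps 12.3.3 and 12.3.4): majority value and dissenting sets."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among the sets of units")
    dissenters = [i for i, r in enumerate(results) if r != value]
    return value, dissenters


attempts = iter([(False, None), (True, 42)])
print(classify_simplex(lambda: next(attempts)))  # ('transient', 42)
print(duplex_agree(5, 6))                        # False
print(triplex_vote([7, 7, 9]))                   # (7, [2])
```

The sketch makes the trade-off in Comment 6 visible: simplex needs a roll-back point and a repeat, duplex detects immediately but cannot say which set failed, and the triplicated mode both detects and identifies the dissenting set in one comparison.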
Table II-2
FUNCTIONAL REQUIREMENTS OF SUPERVISORY CONTROL UNIT
1. Store a table of available equipment.
2. Select an SP to function as the executive, and direct
the chosen SP to access a diagnostic program.
3. Control the diagnosis of an SP that will function as
the executive.
4. Monitor all directives of the executive, but the SCU
can only "veto" the commands of the executive.
5. Respond to earth commands to disconnect itself from
service.
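The first SCU functions listed above (keeping the table of available equipment and choosing an executive) correspond to Steps 1, 2.1 and 6.1 of Table II-1. A minimal Python sketch follows; the names `select_executive`, `can_access_bam` and `passes_diagnosis` are hypothetical, and the two callables merely stand in for the outcome of the access attempt and of the diagnostic program.

```python
def select_executive(available_sps, can_access_bam, passes_diagnosis):
    """Pick an executive SP and return it with the updated availability table.

    available_sps    -- list of SP names (the SCU's table of available SP's)
    can_access_bam   -- callable(sp) -> bool, hypothetical access test
    passes_diagnosis -- callable(sp) -> bool, hypothetical diagnostic outcome
    """
    table = list(available_sps)
    for sp in list(table):
        if not can_access_bam(sp):
            table.remove(sp)   # Step 2.1: remove SP from the table, try another
            continue
        if not passes_diagnosis(sp):
            continue           # Step 6.1: choose another SP as executive
        return sp, table
    raise RuntimeError("no SP can serve as executive")


chosen, table = select_executive(
    ["SP1", "SP2", "SP3"],
    can_access_bam=lambda sp: sp != "SP1",
    passes_diagnosis=lambda sp: sp == "SP3",
)
print(chosen, table)  # SP3 ['SP2', 'SP3']
```

In this sketch an SP that fails only the executive diagnosis stays in the table, since Step 6.1 does not call for its removal; that reading is an assumption.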
Table II-3
FUNCTIONAL REQUIREMENTS OF EXECUTIVE
1. Compute the set up data for the commutation networks.
2. Respond to all input and interrupt commands.
3. Select sets of SP's, WM's and ALU's for various tasks.
4. Organize the diagnosis and repair of units to be in-
serted in system.
5. Respond to the error and program status conditions of
the other SP units.
D. Logical Design, Strategy, and Software Problems
Associated with Multiprocessor-System Design
As indicated in Table II-1, many detailed problems of design and
analysis are critical to the functioning of the multiprocessor. Among
these are the following:
(1) A study of the simultaneous data transfer capacity required
for the commutation networks. Of concern here are the
trade-offs between rate of program execution and the com-
plexity of both the commutation network and the setup
algorithm.
(2) The design of commutation networks that offer economy of
design, ease of setup, ease of diagnosis, and a tolerance
to failures in the sense that a failure in a portion of
the commutation network should disable a minimum amount of
processor and memory capability. (See Chapter IV)
(3) The design of memory and arithmetic logic modules that
offer combinations of fault masking and ease of reconfigura-
tion. A discussion of such ALU's is presented in Ref. 1,
and the techniques of microprogram control for such units
are discussed in Sec. III-C of this report. A description
of the organization of a reliable memory module is given in
Sec. III-B of this report.
(4) The design of simple processor and control modules. It is
envisioned that these modules would be of minimal complexity
and would possess the capability of either controlling the
flow of program data between a WM and an ALU, or performing
the role of both controller and processor upon the occur-
rence of an insufficient quantity of available ALU units.
A convenient framework for such a module is contained in a
paper by Frankel, 19 which discusses the minimum complexity
required for a digital computer. Some reliability enhance-
ment might be incorporated within the module, either in the
form of reconfiguration capability at the register and
adder-byte level, or in the form of fault masking for ir-
regularly structured control functions.
(5) The incorporation of protection for the back-up memories.
If these memories are in the form of tapes, reasonably
simple error-correction coding techniques can be applied
for the protection of stored data.
(6) The synthesis of programs that are amenable to simplex
type error detection, and the development of techniques
which permit the detection of, and the recovery from,
errors introduced by programming mistakes. (See
Chapter V.)
(7) A reliability analysis of the overall multiprocessor system
and an evaluation of the assumed maintenance policies in
order to uncover a possible "optimum" set of strategies.
III TECHNIQUES OF LOGICAL DESIGN
A. Introduction
This portion of the report is concerned with schemes for the logical
design of networks that realize major functions appropriate to a recon-
figurable, reliable computer. The functions to which we have given the
greatest attention thus far in the second phase of the study are memory,
microprogram control, and adaptive voting. The important case of an
arithmetic-logical processor was examined in the first phase. The approach
taken was to organize a processor as an iterative array of byte-oriented
modules, called byte-slices. This approach, which appears to be quite
powerful, is applicable to a number of special networks (e.g., in micro-
program control units). Further development of processor networks will
be examined in the coming period.
B. The Organization of a Reliable Memory Module
1. Introduction
The working memory of a computer comprises a natural organisational
unit for the application of error control. It is a vital unit, accounting
for a large fraction of the equipment of a modern computer, and its sim-
plicity permits a wide range of error-control approaches; hence, it is of
interest to develop effective and flexible design techniques appropriate
for memory systems.
In Appendix A of the Final Report--Phase I (Ref. 1), a number of techniques
for error control in conventional working memories were reviewed. As a
result, the following observations were made. First, it is generally
impractical to apply fault masking at the circuit level to the circuits
involved in memory selection and storage; hence, error-control techniques
must be applied at the logical level. Secondly, there are several signif-
icant sources of error, including data channels, word selection (access)
circuits, basic timing and power circuits, and the logical schemes most
effective for error control for each source are quite different from one
another. Furthermore, both transient and permanent errors are significant.
In this section we will describe the organization of a memory module
in which several kinds of error-control schemes, primarily logical, are
incorporated. Only the basic schemes are presented. Further analysis
is needed to determine the optimum allocation of redundancy in the several
schemes.
We assume a model that is a magnetic core memory. Thus, we include
a power-decoding-switch (access switch), for word selection, a passive,
destructive-read, recording medium, a set of power drivers for recording,
a set of sense amplifiers, and associated timing and power-supply circuits.
The techniques to be described allow for other kinds of memory, such as
nondestructive-read, or active-element storage, or even some cyclic mem-
ories, but we shall not attempt to provide full generality.
Experience has indicated the need to consider the following error
types as very significant:
(1) Transient errors, primarily in the data-read channels, due
to external noise or to internal, data-sensitive signal
cross-coupling
(2) Permanent, single-element, independent component failures
(3) Permanent, multiple-element, nonindependent component
failures (e.g., a cracked integrated circuit device)
For the present we ignore the timing and power-supply circuits, since
their design problems do not seem to be unique to memory systems.
2. System Description
The memory system, which we consider, is described with reference
to Fig. III-1. It is a memory system of W words, each containing B bytes
of data, where a byte contains D binary digits. In this system, the
following redundancy schemes are incorporated.
[FIG. III-1 REDUNDANT MEMORY MODULE. Block labels: address, memory access register, bytes 1 through B (each with access switch, store, sense/drive, selection check, and error check/correct), associative replacement memory, commutator, repair control, and normal timing, decoding and control.]
(1) The system is partitioned into B subsystems, each serving
all W words. Each subsystem is an independent memory,
containing its own access switch and drivers, store and
data sense and drive amplifiers (in a complete design,
separate subsystem power supplies and timing circuits
would also be appropriate). B is greater than the number
of bytes needed for a minimally acceptable word, BN; hence,
an order-preserving commutator network is provided to
connect any subset of BN out of the B bytes to the memory
system interface.
(2) Incorporated in each store is a special circuit for deter-
mining the number of words selected at a given cycle,
quantized to the levels 0, 1, and more than 1. Reliable
circuits for this determination are well known. For
example, one may provide one or two storage elements
(cores) per word, operated so as to switch on a standard
word selection excitation, with a common three-level sense
amplifier. This circuit is very useful in determining
whether or not the access circuits are faulty, since by
far the most common modes of failures within access
switches result in either no output or in multiple word
selection.
This circuit is not absolutely essential since the cases
of zero or multiple word selections could be inferred
from analysis of the data channels. For example, if
multiple parity-check redundancy is used (it is recom-
mended in the next paragraph), the code could be designed
to have a large error-detecting capability, and most
multiple selections would be detected. However, the
redundancy required for such error detection would be
better used for error correction.
(3) The data of each subsystem are encoded with an error-
correcting code. The most attractive codes for this
purpose are those based on threshold decoding (see
Sec. II, part B-2a of Final Report--Phase I, Ref. 1) because
most single faults within the decoder network for such
codes are masked, and the codes are reasonably efficient.
Furthermore, as discussed in the Final Report--Phase I,
double-error correcting codes are far superior to single-
error correcting codes with respect to reliability-weight
tradeoffs. The disadvantage of the former lies in the
doubled cost (per data bit) of the decoding network, but
with the new LSI technology, this is not a serious
disadvantage.
An important consideration here is the problem of multiple
transient errors due to environmental noise. The problem
that arises is that if there are more than t + 1 errors
in a word in a 2t + 1 error-correcting system, the
resulting pattern may appear (falsely) as a valid symbol,
or as a symbol with t or fewer bits different from a
valid symbol. As discussed in the Appendix (Ref. 1), this
problem is reasonably well solved by using the error-
detecting capability of the code; thus, the error-
correction logic should provide an alarm when more than
t bit errors are computed in a 2t + 1 error-correcting code.
(4) An associative memory is employed to provide relocation
for words that become unusable. The primary failure source
for which this scheme is intended is the access switch
and its drivers, although it is also useful for the case
in which only a small number of storage locations suffer
more bit damage than can be accommodated by the error-
correcting codes on the data channels. In the scheme,
a block of words, say the last $2^a$ words in a $2^c$-word
memory, is reserved for relocated data. An associative
memory with capacity for $2^a$ entries is driven in parallel
with the main memory subsystems. If one of its entries
is excited, indicating that the external computer is
addressing a word that has been relocated, a substitute
location number, or alias, is emitted. For economy of
storage, only a bits are needed for the alias since the
high-order digits may be provided by a constant (c - a)-bit
number source; thus, the associative memory contains $2^a$
words, of c + a bits each. The high- and low-order digits
are combined, and the resulting number replaces the
original address in the memory access register.
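The address arithmetic of the relocation scheme in item (4) can be sketched as follows. This is a minimal illustration, not a design: a Python dictionary stands in for the associative memory, the name `resolve_address` is hypothetical, and the constant high-order field is taken to be all ones because the reserved block is assumed to be the last $2^a$ words.

```python
def resolve_address(address, alias_table, c, a):
    """Return the word address actually presented to the memory access register.

    address     -- c-bit address issued by the external computer
    alias_table -- dict mapping a relocated address to its a-bit alias
                   (stand-in for the associative memory, at most 2**a entries)
    """
    if len(alias_table) > 2 ** a:
        raise ValueError("associative memory holds at most 2**a entries")
    if address not in alias_table:
        return address                  # no entry excited: address used as given
    alias = alias_table[address]
    if not 0 <= alias < 2 ** a:
        raise ValueError("alias must fit in a bits")
    high = (2 ** (c - a) - 1) << a      # constant (c - a)-bit source, all ones
    return high | alias                 # combine high- and low-order digits


# 1024-word memory (c = 10) with the last 8 words (a = 3) reserved.
print(resolve_address(37, {37: 2}, c=10, a=3))  # 1018
print(resolve_address(38, {37: 2}, c=10, a=3))  # 38
```

Each table entry pairs a c-bit key with an a-bit alias, which matches the c + a bits per associative word stated in the text.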
3.	 Coordination of Error-Control Modes
Mode 1, partitioning with switch-over replacement, would be used
when the error-correcting capability of all lower-level schemes is ex-
ceeded. It requires external diagnosis and control. Mode 2, detection
of over- or under-selection, is important in preventing the data errors
due to access switch failures from being falsely corrected by the error-
correcting logic in the data channels. Mode 3, error correction in the
data channels, is the primary means of masking transient errors. Mode 4,
word-relocation, is important for accommodation of access-switch faults.
It appears to be a valuable scheme, because, in commonly used switches,
the average single fault results in the loss of only a small fraction of
the outputs. The value of the scheme would be increased further if the
switch were designed so that the maximum fraction of outputs lost, due
to any single switch fault, was limited.
A reliability analysis of the total system is needed in order to
determine the optimum allocation of redundancy among the various error-
control modes.
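The ordering of Modes 2 and 3 can be illustrated with a short Python sketch: the selection check gates the error-correcting logic, and the corrector raises an alarm rather than correcting when the nearest code word is more than t bits away. The four-word code below (minimum distance 5, so t = 2) is a toy chosen for this example, and the names `hamming` and `read_word` are hypothetical; the report itself recommends threshold-decodable codes.

```python
def hamming(u, v):
    """Number of bit positions in which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(u, v))


def read_word(received, words_selected, codebook, t):
    """Return (status, word) for one read of a byte subsystem.

    words_selected -- output of the selection check: 0, 1, or 2 (meaning more than 1)
    codebook       -- valid code words with minimum distance 2t + 1
    """
    if words_selected != 1:
        return "access fault", None       # Mode 2: do not attempt correction
    best = min(codebook, key=lambda cw: hamming(received, cw))
    distance = hamming(received, best)
    if distance == 0:
        return "ok", best
    if distance <= t:
        return "corrected", best          # Mode 3: mask up to t bit errors
    return "alarm", None                  # too many errors to correct safely


CODE = ["0000000000", "1111100000", "0000011111", "1111111111"]
print(read_word("1110000000", 1, CODE, t=2))  # ('corrected', '1111100000')
print(read_word("1100011000", 1, CODE, t=2))  # ('alarm', None)
print(read_word("1111100000", 0, CODE, t=2))  # ('access fault', None)
```

An access fault or an alarm would then be a candidate for Mode 4 (relocation of the word) or, if the damage is extensive, for Mode 1 (replacement of the byte subsystem).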
C. Design Techniques for a Modular, Microprogrammed Control Unit
1. Introduction
The use of a microprogram (µ-program) structure for the control
unit of a reliable computer is attractive because it simplifies the task
of modifying the behavior of the control unit in order to accommodate
system failures. In addition, the inherent modularity of the major part
of the structure potentially allows the use of highly efficient forms of
error control within the control unit itself.
In this section we shall examine techniques for improving existing
µ-program schemes in two ways that are significant to reliable computers;
these are, increasing structural modularity, and increasing the effective-
ness of reprogramming for system reconfiguration. The important problem
of applying error control to the new schemes will be considered in sub-
sequent work.
In a µ-program control unit, a control algorithm is represented by
data called µ-instructions. These data are usually treated as words
stored in an addressable memory. In practice, this memory may be realized
as a matrix of logic elements (e.g., diodes), and frequently logic opera-
tions other than simple memory functions are employed within the matrix.
In this study we assume that only memory functions are allowed. This
assumption not only increases the generality of the results, but it per-
mits the use of the main working memory of a computer as a back up for
the µ-program data store.
The basic components of a µ-instruction are as follows:
(1) Specification of the active system data paths
(2) Specification of the operating modes of functional units
(3) Specification of the rules for selection of the next
µ-instruction.
The third component is the source of most of the nonmodularity of struc-
ture. In the next section we discuss various kinds of next-instruction
rules, and means for increasing modularity of implementation.
2. Schemes for Selection of the Next µ-Instruction
There are two aspects to the selection of the next µ-instruction.
These two aspects, the composition of the address code and the generation
of test functions for branching-type µ-instructions, will be discussed
separately.
a. Composition of the Address Code
The address code may be composed, using data stored explicitly
in the µ-instruction, or data stored in temporary data registers or coun-
ters, which may be considered to be implicit in the µ-instruction, or by
some combination of explicit and implicit data. The explicit form has
the advantage of speed, since no time is needed for loading special reg-
isters, and the disadvantage of storage cost. There is a wide variety
of useful schemes, employing combinations of implicit or explicit
references.
It is convenient to distinguish four types of µ-instructions,
nonbranching, unconditional branching, two-way branching, and multi-way
branching.
The explicit form is feasible and commonly used for the first
three of the four types. For two-way branching, providing memory space
in each µ-instruction for two explicit next-location fields may be very
extravagant if only a small fraction of the µ-instructions are of this
kind.
The implicit form is appropriate for the first type, non-
branching, µ-instructions; the natural rule is to take the new location
as the previous location plus one. The second type, unconditional
branching, µ-instructions require explicit data. In simple control units,
the fraction of unconditional branching type µ-instructions may be low
enough to justify placing the next-address data in a separate memory word
so that a given word holds either external control information or next
µ-instruction information.
For the third type, two-way branching, µ-instructions, one of
the next-address codes may be implicit (present address plus one).
Another important scheme is to use two implicit addresses in a "skip"
manner, as follows:
first branch: next address = present address plus 2
second branch: next address = present address plus 1.
In the second case, the next µ-instruction clearly must be an unconditional
branch, the second type. The "present address" may also be explicit.
For the fourth type, multi-way branching, µ-instructions, the
"skip" sequencing scheme may be generalized as follows:
next address = present address plus N,
where N is generated as a result of the test.
All but one of the next µ-instructions must, again, clearly be uncondi-
tional branches. Again, the "present address" may be explicit.
The foregoing descriptions illustrate the wide range of choices
that are present in the design of the address-composition function.
Inherently, all of them are consistent with a modular structure, since
they require only the operations of register transfer, counting and
addition. The particular choices require detailed examination of engi-
neering trade-offs of speed and equipment cost within the context of a
particular system.
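The four sequencing rules described above can be summarized in a short sketch. This is only an illustration under stated assumptions: the class `MicroInstruction`, the kind names, and the convention that a true test selects the "first branch" are hypothetical and are not taken from the report.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MicroInstruction:
    # kind is one of "sequential", "branch", "skip2", "skipN" (hypothetical names)
    kind: str
    target: Optional[int] = None  # explicit next address, used by "branch" only


def next_address(present: int, instr: MicroInstruction, test: int = 0) -> int:
    """Compose the next µ-instruction address from the present address."""
    if instr.kind == "sequential":      # first type: implicit, present plus one
        return present + 1
    if instr.kind == "branch":          # second type: explicit next address
        if instr.target is None:
            raise ValueError("unconditional branch needs an explicit target")
        return instr.target
    if instr.kind == "skip2":           # third type: two implicit addresses
        # Assumption: a true test selects the first branch (present plus 2).
        return present + 2 if test else present + 1
    if instr.kind == "skipN":           # fourth type: present plus N, N from the test
        return present + test
    raise ValueError(f"unknown µ-instruction kind: {instr.kind}")
```

Only register transfer, counting, and addition are needed, which is the property the text relies on when it calls these schemes consistent with a modular structure.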
b. Generation of Test Functions for Branching-Type µ-Instructions
Let (x_1, x_2, ..., x_n) be the set of system status indicators
(arithmetic overflow, input/output requests, program symbols, etc.).
There are several functions of these indicator variables that
are of practical importance in the testing of system status. For the case
of a binary test, i.e., for a two-way conditional-branching µ-instruction,
the test function is a single variable, say f. This function, f, may be
a general boolean function, but some functions are of special practical
importance. These are f = x_i and f = x_i1 x_i2 ... x_ib, where each x_i
may be the true or complement form of a system status variable. Since f
may be tested to be true or false (and the x's may be complemented), the
second form is equivalent to the form f = x_i1 + x_i2 + ... + x_ib.
3. Schemes for Programmable Selections
In order to expand the power of the branch instruction, it is desirable
to generate f under program control (Ref. 14). This also has the obvious
advantage of enhancing reconfigurability for error control. This may be
accomplished by providing a function-selection vector, say
T = (t_1, t_2, ..., t_s), as a subfield in a µ-instruction word, and a
logic network with inputs T and X = (x_1, ..., x_n), and output f.
A convenient network for this purpose may be built using a decoder
with n = 2^s outputs and a simple AND/OR network, as shown in Fig. III-2.
An extension of this scheme, which provides for a programmable reassignment
of selection code, is shown in Fig. III-3. A Σ stage acts as a +1
adder, such that output T_(i+1) = T_i + 1 (arithmetic sum) if r_i = 1, or
T_(i+1) = T_i if r_i = 0. An arithmetic overflow, y_i, is produced if T_i = 2^s.
Furthermore, only one y will be 1. The cascade of U stages serves to
select the x corresponding to the uniquely energized y, and to deliver
it to the output f. The logic for a U stage is u_(i+1) = u_i + y_i x_i. The
r variables serve to program the selection, since r_i = 0 causes stage i
to be bypassed. This way of reassigning T codes to X variables permits
reconfiguration without changing the T data in the µ-program memory.
This may be advantageous, since it permits use of a read-only memory.
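The two selection networks can be modeled behaviorally as follows. This is a sketch of one reading of the text, not the circuits of Figs. III-2 and III-3: the function names are hypothetical, indices are 0-based, and the overflow is assumed to occur at the enabled stage whose increment brings the running count to 2^s (after which the count wraps to zero).

```python
def fixed_select(t: int, x: list) -> int:
    """Decoder plus AND/OR network: f is the OR over j of (d_j AND x_j),
    where d is the one-hot decoding of the selection code t."""
    decoded = [1 if j == t else 0 for j in range(len(x))]
    f = 0
    for d_j, x_j in zip(decoded, x):
        f |= d_j & x_j
    return f


def reconfigurable_select(t: int, r: list, x: list, s: int) -> int:
    """Chain of +1 adder stages feeding a cascade of U stages.

    Each stage adds r_i to the running count; the overflow y_i marks the
    stage at which the count reaches 2**s, and u accumulates y_i AND x_i.
    """
    count = t
    u = 0
    for r_i, x_i in zip(r, x):
        count += r_i
        y_i = 1 if (r_i == 1 and count == 2 ** s) else 0
        if y_i:
            count = 0          # assumed wrap-around after the overflow
        u |= y_i & x_i         # U stage: u_(i+1) = u_i + y_i x_i (boolean)
    return u
```

Under this model a code T selects the (2^s - T)-th enabled stage, so clearing one r bit shifts every later assignment by one position without any change to the stored T codes.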
[Figure: a decoding switch driven by T = (t_1, ..., t_s) gates the inputs
x_1, ..., x_n through an AND/OR network to the output f.]
FIG. III-2 FIXED-STRUCTURE SELECTION NETWORK
[Figure: a reconfiguration register supplies r_1, ..., r_n to a chain of
adder stages, T_(i+1) = T_i + r_i (arithmetic sum, r_i = 1 or 0); the
overflows y_1, ..., y_n drive a cascade of U stages,
u_(i+1) = u_i + y_i x_i (boolean), with inputs x_1, ..., x_n and output f.]
FIG. III-3 RECONFIGURABLE SELECTION NETWORK
4. Schemes for Programmable Generation of Boolean Implicants
A high degree of modularity and programmability in the generation
of Boolean implicants may be achieved by using a certain functional unit
that appears to have great utility in modular computers. As shown in
Fig. III-4, this unit consists of a single binary parallel adder and a
scratchpad memory. (Such memories will be widely available in LSI form,
and, in fact, the functional unit itself is a good candidate for LSI.)
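[Figure: a mask memory, addressed by a mask selection code, feeds one side
of a parallel adder; the status variables and their complements feed the
other side.]
FIG. III-4 PROGRAMMABLE BOOLEAN-IMPLICANT FUNCTION NETWORK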
In this application we employ the following carry function for
each stage:
c_i = (a_i + b_i) c_(i-1),
instead of the usual arithmetic function,
c_i = (a_i + b_i) c_(i-1) + a_i b_i.
If a standard arithmetic unit is used, the term a_i b_i must be suppressed
under separate control.
Let the set of system status variables be X = (x_s, x_(s-1), ..., x_2, x_1).
We wish to evaluate boolean implicants, using any specified subset of the
input variables. For example, we may wish to determine if x5 x3 x1 = 1.
In what follows a prime denotes the complement of a variable.
We apply both the variables and their complements to one side of the adder,
e.g., in the order (x_s, x_s', x_(s-1), x_(s-1)', ..., x_2, x_2', x_1, x_1'),
with x_1' at the least significant binary input, and we apply to the other
side of the adder a "mask" vector, obtained from the memory in response to
an externally specified address code.
Let the elements of the stored mask vectors be
M = (p_s, q_s, p_(s-1), q_(s-1), ..., p_2, q_2, p_1, q_1); thus elements
p_i, q_i correspond to the inputs x_i, x_i', respectively. If x_i appears
in the implicant in true form, set (p_i q_i) to (01); if it appears in
complement form, set (p_i q_i) to (10); otherwise, set (p_i q_i) to (11).
The carry function at the 2i-th stage is
c_2i = (p_i + x_i)(q_i + x_i') c_(2i-2);
thus, depending on the values of p_i and q_i, we have
c_2i = x_i c_(2i-2), x_i' c_(2i-2), or 1·c_(2i-2).
If the addition is carried out with c_0 set to 1, the final carry
c_2s will be 1 if and only if the implicant is true.
As an example, let s = 5, and let the desired implicant be x5 x3' x2.
The appropriate mask vector is M = (01, 11, 10, 01, 11). The Boolean
vector sum M + X is (1, 1, ..., 1) (necessary and sufficient for c_2s = 1)
if and only if x5 x3' x2 = 1.
A similar operation may be used to obtain the boolean sum of an
arbitrary set of variables, each complemented or not. For example, in
order to determine if (x4 + x3 + x1) = 1, a test is made to determine if
x4' x3' x1' = 0. This is done by creating a mask for x4' x3' x1' as described
previously, and observing if the final carry c_2s is 0 (rather than 1).
More complex expressions may be realized using a single flip-flop
to store the results of a succession of implicants. If the flip-flop is
initially set, and if a zero carry resets the flip-flop, a given sum of
products will be true if and only if the flip-flop remains set following
the last product test.
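The mask rule and the modified carry chain can be checked with a small model. This is a minimal sketch, assuming the pairing of adder stages described above (stage 2i-1 receives q_i and the complement of x_i, stage 2i receives p_i and x_i); the function names `make_mask` and `final_carry` are hypothetical.

```python
def make_mask(s, true_vars=(), complemented_vars=()):
    """Return {i: (p_i, q_i)} for i = 1..s, following the rule in the text:
    (0,1) for a true-form variable, (1,0) for a complemented one, (1,1) otherwise."""
    mask = {}
    for i in range(1, s + 1):
        if i in true_vars:
            mask[i] = (0, 1)
        elif i in complemented_vars:
            mask[i] = (1, 0)
        else:
            mask[i] = (1, 1)
    return mask


def final_carry(x, mask):
    """Ripple the modified carry c_i = (a_i + b_i) c_(i-1) from c_0 = 1.

    x maps the index i to the value (0 or 1) of the status variable x_i.
    """
    c = 1
    for i in sorted(x):                  # least significant pair first
        p, q = mask[i]
        c = (q | (1 - x[i])) & c         # stage 2i-1: mask bit q_i, input x_i'
        c = (p | x[i]) & c               # stage 2i:   mask bit p_i, input x_i
    return c


# Mask for the worked example: implicant x5 x3' x2 with s = 5.
EXAMPLE_MASK = make_mask(5, true_vars={5, 2}, complemented_vars={3})
```

Listed from i = 5 down to i = 1, `EXAMPLE_MASK` reproduces the mask vector (01, 11, 10, 01, 11) of the example. A Boolean sum is tested the same way through its complemented product, observing a final carry of 0.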
5. Hierarchy Schemes
The simplest structure for a µ-program unit provides for the direct
application of the binary elements of the µ-instruction word to the
external control lines. This structure is used in practice, but there are
some advantages to be derived from the use of a hierarchical structure.
One form, proposed by Grasselli (Ref. 15), is analogous to the so-called
interpretive programming systems. In this form, illustrated in Fig. III-5,
the number of different µ-instructions is limited, and a compact code is
established for their representation (each word is called a µ-order).
A µ-program is recorded in a first memory as a string of µ-orders. A
"dictionary" of µ-instructions is held in a separate memory, and, in
operation, a selected µ-order serves to address that memory. The merits
of this scheme are several. First, most of the data storage, i.e., the
dictionary, may be held in a fixed memory, while still preserving some
flexibility in the µ-program sequence. Secondly, since the µ-orders are
in a compact code, several may be packed into a single word of a relatively
slow memory (e.g., the main working memory) without unduly slowing the
control sequences.
[Figure: a µ-order memory, addressed through access logic and an
instruction counter, drives a selector that addresses the µ-instruction
memory, whose output provides system control.]
FIG. III-5 TWO-LEVEL MICROPROGRAM SCHEME (after Grasselli)
Another form (to which the authors know of no prior reference), of
special value in error control, is to provide some kind of modifiable
translation on the components of the µ-instruction word. The purpose of
this translation would be to change the assignment of activation signals
to the external functional units, thus effectively reconfiguring the
system. This translation would be effected in a special network (a useful
form for such a network would be an associative memory). This scheme also
has the merit of allowing the main memory to use a fixed data store.
Another advantage is that for high degrees of redundancy, the capacity
of the µ-instruction store is minimized, since only the nonredundant
control codes need be recorded. Figure III-6 illustrates the application
of this method.
Both hierarchy schemes have attractive features, but further inves-
tigation is needed to determine the costs in speed and amount of logic
needed for their implementation.
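Both hierarchy schemes amount to a table lookup placed between the stored program and the control lines. The sketch below illustrates the idea only; the dictionary contents, the three-bit µ-instruction width, and the names `expand` and `translate` are hypothetical and are not taken from the report.

```python
# Hypothetical dictionary held in a fixed memory: µ-order code -> µ-instruction,
# where a µ-instruction is a tuple of logical control bits.
DICTIONARY = {
    0: (1, 0, 0),
    1: (0, 1, 1),
    2: (0, 0, 1),
}


def expand(micro_orders, dictionary=DICTIONARY):
    """Two-level scheme: each compact µ-order addresses the dictionary."""
    return [dictionary[order] for order in micro_orders]


def translate(micro_instruction, assignment, n_lines):
    """Translation scheme: assignment[k] is the physical control line that
    logical bit k activates. Changing the assignment reconfigures the system
    without touching the stored µ-instruction."""
    lines = [0] * n_lines
    for bit, line in zip(micro_instruction, assignment):
        lines[line] |= bit
    return lines
```

For example, `translate((0, 1, 1), [0, 3, 2], 4)` routes the second logical control bit to a spare fourth line, which models the replacement of a faulty functional unit while the µ-program stays in a fixed store.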
[Figure: a µ-instruction memory with access switch and sequence control;
translation data, under an operation mode control, drive a data path
translator (to system control) and a test function translator, which
feeds the test function selector acting on the system test variables to
produce the next instruction address.]
FIG. III-6 FIXED PROGRAM-VARIABLE TRANSLATION MICROPROGRAM SCHEME
6. Conclusions
Various alternatives in the design of µ-program type control units
have been reviewed. A number of schemes have been developed for realizing
major functions in modular and programmable forms and for permitting
a substantial use of fixed-data type memory stores. However, the
merits and costs of the various schemes need to be evaluated. It is also
necessary to investigate appropriate means for applying logical redundancy
for error control within the control unit.
D. An Improved Realization for Switched-Adaptive Voting
1. Introduction
In the von Neumann voting scheme, up to m errors in the outputs of
2m + 1 nominally identical binary channels may be corrected by obtaining
the system output as the majority function of the channel outputs. Pierce
demonstrated the value of combining the outputs by a function that weights
the contribution of each channel according to its recent error rate. A
continuous weighting function is discussed on page 371 of the Phase I,
Final Report. As a special case, it may be seen that it is beneficial
to completely disconnect a permanently-failed channel.
Several logical realizations of the latter scheme are also described
in Sec. IIA, 2b of the Final Report--Phase I. The simplest structure was
obtained by use of a linear-input logic element in which the 0 and 1 states
of the inputs were encoded as +1 and -1 signals with the output 1 if the
sum of the inputs is at least +1. Since linear-input elements are difficult
to realize in microelectronic technology, a realization was developed,
using only binary-valued logic elements. This scheme required rather
costly logic networks that effectively counted the number of active channels
and also the number of channels displaying a 1 value. These counts
were needed because the scheme called for replacing the output of a
disconnected channel by 0; hence, the number of inputs (to the combining
network) that constituted a majority depended upon the number of active
channels.
2. Description of a New Scheme
The following approach permits the use of a fixed majority network
as the entire combining element. Let only one channel be disconnected
at a time; then, as successive channels fail, replace their outputs by
constant 0 or 1 signals, alternately.
It is convenient to represent the inputs to the majority network as
C0, C1, and x, representing, respectively, the constants 0 and 1, and
the free variable x. As an example, for the case 2m + 1 = 7, the following
useful sets of inputs appear at the input to the majority network:
(7x), (6x, 1C0), (5x, 1C0, 1C1), (4x, 2C0, 1C1), (3x, 2C0, 2C1), and
(2x, 3C0, 2C1). Since the system output is 1 when four 1's are input,
the corresponding minimum ratios of free 1 inputs to total free inputs
required for system output 1 are 4/7, 4/6, 3/5, 3/4, 2/3, and 2/2,
respectively.
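The sequence of input sets and the associated ratios follow mechanically from the alternation rule, as the sketch below shows. It assumes, as in the example, that the first disconnected channel is replaced by 0 and that the sequence stops at two free inputs; the function name and the dictionary keys are hypothetical.

```python
def voting_configurations(total_channels, min_free=2):
    """List the input sets seen by a fixed majority network as channels
    are disconnected one at a time and replaced by 0, 1, 0, 1, ..."""
    majority = total_channels // 2 + 1
    free, zeros, ones = total_channels, 0, 0
    next_constant = 0
    configurations = []
    while True:
        configurations.append({
            "free": free,
            "C0": zeros,
            "C1": ones,
            # free inputs that must be 1 for the network output to be 1
            "ones_needed": majority - ones,
        })
        if free == min_free:
            break
        free -= 1
        if next_constant == 0:
            zeros += 1
        else:
            ones += 1
        next_constant ^= 1
    return configurations


RATIOS_FOR_SEVEN = [(c["ones_needed"], c["free"]) for c in voting_configurations(7)]
```

For seven channels `RATIOS_FOR_SEVEN` holds the pairs (4, 7), (4, 6), (3, 5), (3, 4), (2, 3), (2, 2), i.e., the ratios quoted in the text, kept as unreduced pairs.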
The overall structure of a realization of this approach is given in
Fig. III-7a. Each of N channels, x_1, ..., x_N, is presented to a Majority
Net via an AND/OR cascade, whose other inputs are obtained from a control
unit. Thus, for example, the first input to the Majority Net is the function
(x_1)(C_11) + (C_12). By proper choice of the control signals, C_11 and
C_12, the input may be set to the values x_1, 0, or 1. The Comparison and
Control Unit determines when a channel is to be disconnected and also
generates the appropriate control signals. Inputs to the unit are ob-
tained from the channels and from the system data output.
A logical realization of the Comparison and Control Unit is illus-
trated in Fig. III-7b. The subnetworks corresponding to channels are
arranged vertically, between dashed lines. The control outputs for a
given channel are obtained from two flip-flops; thus, for channel 1,
C_11 = F_11' F_12' and C_12 = F_12, where a prime denotes the complement.
The Majority Net inputs x_1, 0 and 1 are produced by control outputs
C_11 C_12' = 1, C_11' C_12' = 1, and C_12 = 1, respectively. These outputs,
in turn, correspond to the flip-flop states F_11' F_12' = 1,
F_11 F_12' = 1, and F_12 = 1, respectively. Initially, all flip-flops
are cleared to the F' state, thus setting all Majority Net inputs
to the x state. In order to set the channel 1 Majority Net input to the
0 or the 1 state, either F_11 or F_12 is set to the F state, depending on
the parity of the number of previously disabled states. This parity is
recorded by a trigger flip-flop, F_p, the state of which is sampled by
the signal g_1.
Signal g_1, which is derived from signal f_1, is gated by an output of
the STEPPER unit. Signal f_1 is true whenever the channel data, x_1, and
the system data output differ, the difference being taken as indicative
of a channel error. The STEPPER unit serves to scan the channels serially;
this is done to avoid ambiguity in the determination of the parity of
active channels, in the case that several channels may fail at the same
moment. The STEPPER is assumed to return to a rest state autonomously.
[Figure: (a) OVERALL SCHEME: channels X_1 ... X_N pass through AND/OR
gates, controlled by C_11, C_12, ..., C_N1, C_N2, to the Majority Net and
the data output. (b) DETAILS OF COMPARISON AND CONTROL UNIT: status
flip-flops F_11, F_12, ..., F_N1, F_N2, with TEST and clear inputs.]
FIG. III-7 SWITCHED-ADAPTIVE VOTING
The presence of a difference in any channel is indicated by signal
y_1 = f_1 + f_2 + ... + f_N, which, when true, starts the STEPPER, if
permitted by the external control signal TEST. For each channel, if a g
signal is produced, a y_2 = g_1 + g_2 + ... + g_N signal is produced, which
triggers the parity flip-flop, F_p. The information in F_p is also
available in the set of F_i1 status flip-flops.
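A behavioral model of one voting stage helps to check the control equations. The sketch below makes several assumptions that the text leaves open: the class name `VotingStage` is hypothetical, only channels that are still connected are compared with the output, the output sampled at the start of a scan is used throughout that scan, and even parity selects the constant 0.

```python
def majority(bits):
    """Fixed majority function of an odd number of binary inputs."""
    return 1 if sum(bits) > len(bits) // 2 else 0


class VotingStage:
    def __init__(self, n_channels):
        self.f1 = [0] * n_channels   # F_i1 set: Majority Net input forced to 0
        self.f2 = [0] * n_channels   # F_i2 set: Majority Net input forced to 1
        self.fp = 0                  # parity of the number of disconnected channels

    def net_inputs(self, x):
        inputs = []
        for x_i, a, b in zip(x, self.f1, self.f2):
            c1 = (1 - a) & (1 - b)   # C_i1 = F_i1' F_i2'
            c2 = b                   # C_i2 = F_i2
            inputs.append((x_i & c1) | c2)   # x_i C_i1 + C_i2
        return inputs

    def output(self, x):
        return majority(self.net_inputs(x))

    def scan(self, x):
        """One serial STEPPER scan: disconnect each disagreeing channel."""
        out = self.output(x)
        for i, x_i in enumerate(x):
            connected = not (self.f1[i] or self.f2[i])
            if connected and x_i != out:      # f_i true, gated to give g_i
                if self.fp == 0:
                    self.f1[i] = 1            # replace by constant 0
                else:
                    self.f2[i] = 1            # replace by constant 1
                self.fp ^= 1                  # y_2 triggers the parity flip-flop
```

Because the channels are examined serially, two channels that fail in the same scan still receive opposite constants, which is the ambiguity the STEPPER is said to avoid.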
In a system composed of a number of such voting stages, various
portions of the control unit may be centralized. For example, a single
STEPPER unit could serve all stages, and in the extreme, all the control
logic could be programmed, with the exception of the status flip-flops.
The minimum cost of the stage, assuming that, of the control unit,
only the status flip-flops and their input gates are realized in hardware,
is 2N flip-flops, 4N gates, and one majority network. The cost of the
latter is about eight three-input gates for N = 5, and about eighteen
three-input gates for N = 7. The total number is thus 10 flip-flops and
28 gates for N = 5, or 14 flip-flops and 46 gates for N = 7. For comparison
(also excluding the comparison logic), the scheme described in the Phase I
report (pp. 74 and 85) requires about 5 flip-flops and 50 gates for N = 5,
and 7 flip-flops and 95 gates for N = 7 (Ref. 3).
Assuming that a flip-flop is equivalent to four gates, the equivalent
gate costs for five- and seven-input stages are 68 and 102, for the new
scheme, and 90 and 123, respectively, for the old scheme. The availability
of a three-state storage element would allow further economies.
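The cost figures for the new scheme can be reproduced from the stated component counts. The sketch below covers the new scheme only; the function name is hypothetical, and the majority-network gate counts are the approximate values quoted in the text.

```python
# Approximate three-input gate counts for the majority network, from the text.
MAJORITY_NETWORK_GATES = {5: 8, 7: 18}


def new_scheme_cost(n_channels, gates_per_flip_flop=4):
    """Return (flip_flops, gates, equivalent_gates) for one voting stage,
    counting 2N status flip-flops, 4N input gates, and one majority network."""
    flip_flops = 2 * n_channels
    gates = 4 * n_channels + MAJORITY_NETWORK_GATES[n_channels]
    equivalent_gates = gates + gates_per_flip_flop * flip_flops
    return flip_flops, gates, equivalent_gates
```

With four gates per flip-flop this gives 68 equivalent gates for N = 5 and 102 for N = 7, in agreement with the figures above.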
A. Introduction
1. Commutation Requirements
In Chapter III, a multiprocessor system was proposed for which a
set of simultaneous one-to-one connections could be established between
working memory modules (WM) and simple processor and control modules (SP),
and also between SP modules and arithmetic logic units (ALU) or back-up
memories (BUM) or input-output devices (I/O). Also discussed was the
possibility of "repairing" an SP, WM or ALU module wherein the module is
realized in a byte-sliced manner. In such a case, the repair operation
requires the routing of the data, previously destined for a faulty slice,
to a succeeding slice. The networks that perform the data switching,
which is inherent in the operations of module assignment and repair, are
called commutation networks.
We have classified two types of data commutation for the module
assignment, namely (1) complete permutation--complete utilization,
(2) complete permutation--incomplete utilization, and three types of
commutation for the repair operation, namely (3) incomplete permutation--
order preserving, (4) incomplete permutation--nonorder preserving, and
(5) "shorting." These five commutation functions are described schemati-
cally in Fig. IV-1, where the specific applications of each function are
also listed.
[Figure: chart of commutation functions, schematics, and applications.
The schematics are not reproduced; the entries are:
  Complete permutation, complete utilization: communication between
    working memory and simple processor (full capacity of interconnection).
  Complete permutation, incomplete utilization: communication between
    back-up memory and simple processor (limited capacity of interconnection).
  Incomplete permutation, nonorder preserving: interconnection when
    redundant units exist; (1) communication between spare registers,
    (2) communication between ALU and simple processor.
  Incomplete permutation, order preserving: interconnection when redundant
    units exist and data is order sensitive; (1) communication between
    bytes of working memory and simple processor.
  "Shorting": routing of data around faulty byte slices.]
FIG. IV-1 CLASSIFICATION OF DATA COMMUTATION REQUIREMENTS
The assignments associated with the complete permutation--complete
utilization [CPCU(N)] function are probably obvious. The commutation
network is to be capable of permuting, in an arbitrary manner, a set of
N input data lines, emerging, for example, from a set of memories, to a
set of N output lines incident to, for example, a set of SP units. In
the illustration, a data transfer path represents a parallel set of
lines containing many bits (24-56). The assignments associated with
the complete permutation--incomplete utilization [CPIU(N,m)] function
differ from those associated with the CPCU function in that for the former
only a subset containing m inputs of the total of N inputs and outputs*
needs to be interconnected at a given time. For the incomplete permutation--
order preserving [IPOP(r,m)] function a subset containing m inputs of the
r inputs, say for example associated with the working byte slices of a
simple processor and control unit, is to be connected to a subset of the
outputs, say associated with the working byte slices of an arithmetic
logic unit, but with the restriction that spatial ordering of the input
signals is to be preserved at the output. The preservation of order is
clearly required since the data to be commutated is a binary number. The
assignments associated with the incomplete permutation--nonorder-
preserving [IPNOP(r,m)] function differ from those of the IPOP case in
that, for the former, preservation of order is not a requirement. For
the shorting function the outputs of a given byte slice are either to be
connected to the succeeding stage (slice) or "shorted" around that suc-
ceeding slice.
For any commutation function it is desirable to synthesize networks
for which the following are true:
(1) The design is economical
(2) The network setup is not difficult
(3) The data transfer is rapid
(4) Failures in the commutation network do not disable
either the commutation network or the modules served
by the network. This tolerance to CN failures should
be achieved with minimal increase of network complexity.
(5) If the commutation network is "repairable," the diagnostic
routines should be easy to specify and of minimal
length.†
* We are assuming that for all commutation requirements there are an equal
number of input and output terminals; this is clearly not a necessary
restriction and it is only postulated for convenience in description.
† The "length" of a diagnostic routine is defined to be the maximum number
of sequences of input data required to diagnose the network.
2. Prior Solutions to the Commutation-Network Design Problem
The obvious solution to the commutation-network design problem relies
upon the use of a single-level crossbar switch, similar to the type commonly
found in central telephone exchanges. Figure IV-2 represents,
schematically, a crossbar switch serving a set of WM's and SP's. Here
the crossbar, where each single-pole single-throw switch represents a
crosspoint, fulfills the requirements of both the CPCU(N) and IPOP(r,m).
Clearly a crossbar with N^2 r^2 crosspoints would be sufficient, but it was
shown' that actually N^2 (r^2 - m^2) crosspoints are sufficient. In any event,
it is seen that 28 x 10^4 crosspoints are required to serve 25 (= N)
processors and memories, where each has 32 (= r) total bytes, of which
24 (= m) bytes are required for computation. Although a multiprocessor of
this complexity might appear unreasonable, there is considerable motivation
to seek more economical commutation network designs.
[Figure: crossbar array of crosspoints between the WM lines and the SP lines.]
FIG. IV-2 "CROSSBAR" REALIZATION OF COMMUTATION FUNCTION
Goldberg'," described a serial transfer network for the IPOP function
which exhibited a complexity proportional to 2N, but, even with the
use of speed-independent logic, the data-transfer rate is probably not
adequate for our application.
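The crosspoint count quoted for the crossbar can be verified directly from the two expressions in the text. The function names below are hypothetical; the arithmetic is only a check of the stated figures.

```python
def crosspoints_full(n, r):
    """Crosspoints in a crossbar joining every byte line of every module: N^2 r^2."""
    return n ** 2 * r ** 2


def crosspoints_sufficient(n, r, m):
    """Crosspoints stated to be sufficient: N^2 (r^2 - m^2)."""
    return n ** 2 * (r ** 2 - m ** 2)


# Values from the text: N = 25 modules, r = 32 bytes in total, m = 24 bytes in use.
EXAMPLE_FULL = crosspoints_full(25, 32)
EXAMPLE_SUFFICIENT = crosspoints_sufficient(25, 32, 24)
```

Here `EXAMPLE_SUFFICIENT` is 280,000, i.e., 28 x 10^4, against 640,000 for the full crossbar.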
3. The Primitive Building Block of Commutation Networks
Most of the commutation networks described in following sections
will be composed of interconnections of the "cell" shown in Fig. IV-3.
The cell, which is very simple, behaves as a reversing switch, under the
control of a single memory flip-flop.
In essence, it is a double-pole, double-throw reversing switch,
controlled by a storage element (e.g., a flip-flop). In addition some
means is provided by which the storage element is set to the desired
state. Figure IV-3(a) shows a relay-contact version, analogous to circuits
in the MOS technology, and Fig. IV-3(b) represents a NOR-gate
realization of the cell in question. The two modes consist of a crossing,
Fig. IV-3(c), and a bending, Fig. IV-3(d), of the pair of input leads to the
pair of output leads. Figure IV-3(e) represents a redundant flip-flop
version of the cell, for which any single component failure will result
in one of two possible failure conditions. In the first failure condition,
which we will call the "stuck-function" condition, the cell can realize
only one of the two possible modes, i.e., the bend or the cross. In the
second such condition, which we shall call the "bad-output" condition, one
output lead contains a faulty signal. Table IV-1 summarizes the failure
conditions resulting from various component failures of the cell of
Fig. IV-3(e).
Table IV-1
FAILURE CONDITIONS FOR BASIC CELL

Component Fault                        Failure Condition
Faulty OR gate                         Bad-Output
Faulty 2-input AND gate                Bad-Output
Faulty 3-input AND gate                Stuck-Function
Flip-flop stuck in a mode              Stuck-Function
Same logic value on two outputs        Bad-Output
  of a flip-flop
[Figure: (a) relay-contact version; (b) NOR-gate realization; (c) crossing
mode; (d) bending mode; (e) redundant flip-flop version.]
FIG. IV-3 BASIC CELL FOR COMMUTATION NETWORKS
B. Commutation Networks for Complete Permutation--Complete Utilization
1. Nonredundant Networks*
It is easy to see that a CPCU(N) network must contain enough two-state
cells† to specify all N! possible permutations; thus, N_1(N), the
number of cells in the network, is given by:
N_1(N) ≥ log_2 (N!)
or, asymptotically (from Stirling's formula),
N_1(N) ≥ N log_2 N - 1.443 N + 0.5 log_2 N.
In Fig. IV-4(a) we depict a CPCU network. The subnetworks P_A and P_B are
themselves complete interconnection networks, each with half of the number
of inputs of the total network. This arrangement requires a number, N_1(N),
of cells which satisfies:
N_1(N) = N_1(⌊N/2⌋) + N_1(N - ⌊N/2⌋) + N - 1.
Solution of this recursion for N = 2^r [using N_1(2) = 1] gives
N_1(2^r) = 2^r (r - 1) + 1
or
N_1(N) = N log_2 N - N + 1.§
* The nonredundant networks for the case N = 2^k described below were
discovered under another contract, and are described in two forthcoming
papers (Refs. 17, 18). All of the other results were obtained under this
present contract, as well as the generalization of the nonredundant network
to cover the case where N ≠ 2^k.
† For the CPCU applications of interest to us, the cell is actually an
r-pole, double-throw, reversing switch, where r is the number of output
lines contained in, for example, a WM. It is shown in Sec. IV-B-3 that
it is convenient for reliability enhancement purposes to byte slice the
CPCU network so that the number of "poles" each cell contains is
considerably less than r.
§ It can be shown that the same value of N_1(N) is obtained for arbitrary N.
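The recursion, its closed form for N = 2^r, and the information-theoretic lower bound can be compared numerically. This is a small check under the assumption that N_1(1) = 0 (a single line needs no cell), which reproduces the stated N_1(2) = 1; the function names are hypothetical, and only powers of two are compared with the closed form.

```python
import math
from functools import lru_cache


def lower_bound(n):
    """Fewest two-state cells able to distinguish all N! permutations."""
    return math.ceil(math.log2(math.factorial(n)))


@lru_cache(maxsize=None)
def cells_recursive(n):
    """N_1(N) = N_1(floor(N/2)) + N_1(N - floor(N/2)) + N - 1, with N_1(1) = 0."""
    if n <= 1:
        return 0
    half = n // 2
    return cells_recursive(half) + cells_recursive(n - half) + n - 1


def cells_closed_form(r):
    """N_1(2^r) = 2^r (r - 1) + 1."""
    return 2 ** r * (r - 1) + 1
```

For N = 8 the network uses 17 cells, while at least 16 two-state cells are needed to specify all 8! = 40,320 permutations.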
A constructive proof that this network is capable of performing an
arbitrary permutation is as follows. As previously, let the inputs and
outputs be labelled X_1, X_2, ... X_N and Y_1, Y_2, ... Y_N, respectively, in
order as shown in Fig. IV-4(a). We start by building a path backwards
from Y_1 through P_A, and through whichever input cell is connected to the
input that is supposed to be connected to Y_1. (The routing of signals
within P_A and P_B will be determined in later steps.) A forward path is
now formed from the other input, the one sharing the same input cell,
through this input cell, through P_B, and through whichever output cell is
connected to the desired destination (Y_s, say) of this input line. In
like manner, the mate to this new output is traced back through P_A to its
associated input, and the mate of this new input is connected in a forward
path through P_B to its intended output. This path-building process is
continued, alternating use of P_A and P_B, until output Y_2 is reached. If not
all input-output connections have yet been completed, then start with any
unconnected output line and continue the process, going through the cycle
as many times as necessary, until all input and output cells are set in
either the "bending" or "crossing" mode.
The entire procedures are now repeated separately for each of the
interconnection arrays P_A and P_B, etc., until the entire array has been
set up. The procedure will never be forced to stop because of the lack
of connection, since P_A and P_B are each connected to every input and output
cell except Y_1 for P_B and Y_2 for P_A, and these connections are avoided by
starting as indicated. The array for N = 8 (with some of the connections
straightened and the outputs renumbered) is shown in Fig. IV-4(b).
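The path-building argument above translates into a recursive setup procedure. The sketch below is an illustration for N a power of two, not the authors' implementation: it uses 0-based indices (so outputs 0 and 1 play the roles of Y_1 and Y_2), the names `route` and `check_all` are hypothetical, and the first output of every later cycle is arbitrarily served from P_A.

```python
import itertools


def route(dest, values):
    """Send values[i] to output dest[i] through the recursive network.

    Returns (outputs, cells), where cells counts the two-state cells used:
    N/2 input cells, N/2 - 1 output cells, plus those inside P_A and P_B.
    """
    n = len(dest)
    if n == 1:
        return list(values), 0
    src = [0] * n
    for i, o in enumerate(dest):
        src[o] = i
    side = [None] * n                      # subnetwork used by each input
    for start in range(0, n, 2):           # output 0 first, then any unconnected pair
        o = start
        while side[src[o]] is None:
            i = src[o]
            side[i] = "A"                  # path traced backwards through P_A
            side[i ^ 1] = "B"              # its cell mate goes forward through P_B
            o = dest[i ^ 1] ^ 1            # mate of the output just reached
    half = n // 2
    dest_a, dest_b = [0] * half, [0] * half
    vals_a, vals_b = [None] * half, [None] * half
    for i in range(n):
        if side[i] == "A":
            dest_a[i // 2], vals_a[i // 2] = dest[i] // 2, values[i]
        else:
            dest_b[i // 2], vals_b[i // 2] = dest[i] // 2, values[i]
    out_a, cells_a = route(dest_a, vals_a)
    out_b, cells_b = route(dest_b, vals_b)
    outputs = [None] * n
    for i in range(n):
        sub = out_a if side[i] == "A" else out_b
        outputs[dest[i]] = sub[dest[i] // 2]
    return outputs, half + (half - 1) + cells_a + cells_b


def check_all(n):
    """Route every permutation of n lines; return the cell count if all succeed."""
    counts = set()
    for perm in itertools.permutations(range(n)):
        outputs, cells = route(list(perm), list(range(n)))
        if any(outputs[perm[i]] != i for i in range(n)):
            return None
        counts.add(cells)
    return counts.pop() if len(counts) == 1 else None
```

Because output 0 is always served from P_A and output 1 from P_B, the output cell for that pair never needs to change state and can be omitted, which accounts for the N - 1 (rather than N) term in the recursion for the cell count.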
It is of interest to investigate efficient techniques for setting up
the network cells to realize the necessary mode for a particular permutation.
Consider, for example, the network of Fig. IV-4(b) redrawn in
Fig. IV-4(c), and assume that we require the setting of the cells as shown.
Referring to Fig. IV-3 we see that each cell is in the crossing mode upon
resetting the flip-flop. The cell is set to the bending mode by the
coincidence of logic-one values on the data inputs and the P input. Clearly,
then, by applying a logic-one value to the P input of all cells in a given
level of the network--in Fig. IV-4(c) a set of values, X_5 = X_6 = P_1 = 1,
will set the cell serving X_5 and X_6--and appropriately setting the N input
signals, the network can be set up in a period proportional to the number
of levels in the network.
[Figure: (a) recursive structure with subnetworks P_A and P_B; (b) array
for N = 8; (c) the same array with cell settings and setup inputs.]
FIG. IV-4 NETWORK FOR COMPLETE PERMUTATION--COMPLETE UTILIZATION
2. Byte-Sliced Commutation Networks
In this section we are concerned with the behavior of the CPCU networks
under cell fault conditions and a simple technique of accommodating
to these faults.
It is clear that for the case wherein the basic cells are double-pole,
double-throw, reversing switches, each bad-output cell failure results in
an error only on a single output of the network. Similarly, each
stuck-function failure results in an error on a maximum of two outputs. In the
latter case, it is sometimes possible to accommodate to this failure type
by appropriately setting the working cells of the network. This
accommodation technique is discussed in detail in Sec. IV-B-3. Unfortunately,
for the case of r-pole cells, a single failure could result in the inability
of the CPCU network to realize several of the WM-SP assignments.
This state of affairs is significantly improved by byte slicing the
CPCU network as shown in Fig. IV-5. In this case, the bytes (where each
byte is assumed to contain b bits) of the WM's are permuted in separate
networks. That is, the first CPCU has, as inputs, the first byte of
each WM, the second CPCU has, as inputs, the second byte of each WM, etc.
The outputs of the first CPCU are ultimately directed to the first byte of
each SP, etc. It is thus seen that for this byte-sliced realization,
which, of course, requires the distribution of the cell memory flip-flops
among each of the CPCU's, a cell failure disables a single byte. These
commutation network byte failures can be accommodated in the identical
manner proposed for other byte failures, i.e., by the use of the
incomplete permutation--order-preserving networks described in Sec. IV-D.
[Figure: each byte of the WM's enters its own N-permuter built of b-pole
cells; the permuter outputs go to the order-preserving network and SP unit.]
FIG. IV-5 BYTE-SLICED PERMUTATION NETWORK
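The containment property of byte slicing can be illustrated with a simple behavioral model. This sketch is not a circuit description: the function name is hypothetical, and a faulty slice is modeled crudely as delivering no data (`None`) on its byte position for every SP.

```python
def byte_sliced_permute(words, dest, faulty_slice=None):
    """Deliver words[i] (a tuple of bytes from WM i) to SP dest[i].

    Each byte position is carried by its own permutation network, so a
    failure in one slice affects only that byte position.
    """
    n_modules = len(words)
    n_bytes = len(words[0])
    delivered = [[None] * n_bytes for _ in range(n_modules)]
    for b in range(n_bytes):
        if b == faulty_slice:
            continue                      # this slice's network is out of service
        for i in range(n_modules):
            delivered[dest[i]][b] = words[i][b]
    return delivered
```

With one slice marked faulty, every SP still receives all other bytes intact; the lost byte position is the one that the order-preserving repair network must route around.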
3. CPCU Networks Insensitive to Cell Failures
a. The Stuck-Function Fault
For the moment, consider only the case in which the network
is fault-free or has precisely one bad switch (cell). Figure IV-6(a)
illustrates a straightforward solution to the single error-correction
problem. If P_1 and P_2 are both full permutation networks, then a fault
occurring in one of them (such a fault being of the stuck-function type
that does not disturb lead continuity) has no effect on the operations
of the other network. Obviously, this is a rather wasteful approach
since all of the remaining switches in the network containing the fault
contribute nothing toward forming the desired permutation. Instead, let
P_2 of Fig. IV-6(a) represent a permutation (CPCU) network, and P_1 be a
network specifically designed to undo the damage caused by a fault in P_2.
Then if a fault occurs in P_2, it can be repaired by P_1, while a fault
in P_1 causes no trouble because P_2 is a full permuter. What is required
of the network P_1? A single fault in P_2 causes a simple interchange of
some particular pair of leads at some cell within the network. This can
only become manifest as a spurious reversal of exactly two leads at the
output. Of course, the possibility exists that the switch might fail in
the correct position and no trouble would occur. In any event it would
be sufficient that the network P_1 be capable of effecting the interchange
of an arbitrary pair of input leads without changing the relative
assignments of the other input leads.
The double-tree (D-T) network of Fig. IV-d(b) can do this job.
Any pair of input leads can be directed to some switch on the left of
the center switch. At this switch (to the left of the center switch),
the leads may be interchanged. Whatever switch settings are required to
do this interchnngtng, w-L 12 v by reflection in the (imaginary) center-
line to the right-hand part of the network with the exception of the
switch that actually effects the interchange. The corresponding switch
in the right-hand part of the network is set in the opposite state.
The scheme is illustrated in Fig. IV-6(b) for the particular case of
N s 8 in which we wish to interchange inputs X2 and Xg. Here the switch
settings to the right and left of the center line are identical and the
center switch effects the desired interchange. Similar networks exist
for all values of N, and are obtained byrp uning the corresponding tree
networks for the next largest power of two greater than h.
When one of these double-tree networks is placed in tandem with
a full permutation network, we are able to correct the effect of one
switch failure wherever it may occur in the composite total network formed
by P1 and P2. If P2 is one of the CPCU type commutation networks we have
already considered, it will be found that the input peripheral switches
of P2 and the output peripheral switches of the D-T network match up into
tandem pairs of individual switches. Note, for example, the pairing of
leads at the input of the network of Fig. IV-4(b) and the similar pairing
of output leads in Fig. IV-6(b). Whenever such a pairing occurs, we may
omit one of the two switches if we have provided for the possible failure
of one or the other of them. We may therefore omit the entire column of
peripheral output switches from any of the D-T networks whenever we adjoin
them to one of the CPCU networks. The single-error correcting capability
is not affected by this pruning operation. Call the network that
results from the removal of the output switches from the double-tree network
the TDT(N) network (for truncated double-tree of N leads). Then
single-error correction of any CPCU(N) network can be obtained by the
tandem addition of one TDT(N) network. Furthermore, if the possible effect
at the output of the CPCU(N) of the multiple failure of switches is
considered, it turns out that the addition of p TDT(N) networks in tandem
FIG. IV-6 PERMUTATION NETWORKS INSENSITIVE TO SINGLE "STUCK-FUNCTION" FAULT
(b) A "DOUBLE-TREE" NETWORK
(c) MINIMAL REDUNDANT 4-PERMUTER, SINGLE STUCK-FUNCTION CORRECTING
to any CPCU(N) network will suffice to correct as many as p switch failures
in the total network. The argument, which is much the same as the foregoing
one, depends on the possibility of decomposing all such multiple
failures into separate pairwise lead interchanges.
To estimate the cost of error protection according to the foregoing
scheme, we note that the TDT(N) networks contain approximately
(3/2)N switches. To correct p errors then takes about (3/2)pN switches.
Thus, if N is very large, we can correct a "few" errors at a cost that
is small compared with the total number of switches in the network
(~N log2 N). On the other hand, correction of multiple errors with
TDT(N) networks does not furnish a recipe for creating arbitrarily reliable
networks (in the Shannon sense) while still meeting the asymptotic
cell count, i.e., ~N log2 N. We have not yet discovered how to resolve
this problem although a solution seems possible. If anything can be concluded
from results obtained for small values of N, it seems that single
fault correction should be obtainable at a cost of ⌈log2 N⌉ extra switches
in excess of the CPCU(N) count. In particular we can exhibit specific
networks that correct one fault and have switch counts as indicated in
Table IV-2. In each case the number of switches shown in Table IV-2 is
Table IV-2
NUMBER OF CELLS IN SINGLE STUCK-FUNCTION
CORRECTING CPCU NETWORKS

Leads (N)    Switches in Redundant Network    Switches in CPCU(N)
    2                      2                           1
    3                      5                           3
    4                      7                           5
    5                     11                           8

exactly ⌈log2 N⌉ larger than the corresponding value of CPCU(N). In the
cases N = 2, 3, and 4, it is reasonably certain that these realizations
are the minimal ones that exhibit the single-fault-correcting property.
An example of a single-fault-correcting network for N = 4 is illustrated
in Fig. IV-6(c).
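The relationship stated for Table IV-2 can be checked with a few lines of arithmetic. The sketch below assumes that the nonredundant CPCU(N) count equals the sum of ⌈log2 k⌉ for k = 1, ..., N. That formula is not given in this section; it is adopted here only because it is consistent with the relationship the text states for the table. All function names are illustrative.

```python
def ceil_log2(k: int) -> int:
    """Smallest integer e with 2**e >= k (exact integer arithmetic, k >= 1)."""
    return (k - 1).bit_length()


def cpcu_cells(n: int) -> int:
    """ASSUMED nonredundant CPCU(N) count: sum of ceil(log2 k), k = 1..n."""
    return sum(ceil_log2(k) for k in range(1, n + 1))


def redundant_cells(n: int) -> int:
    """Single stuck-function correcting count as stated in the text:
    the CPCU(N) count plus ceil(log2 N) extra switches."""
    return cpcu_cells(n) + ceil_log2(n)


# Print one row per lead count: N, redundant count, nonredundant count.
for n in range(2, 6):
    print(n, redundant_cells(n), cpcu_cells(n))
```

Under this assumption the excess of the redundant count over the nonredundant count is ⌈log2 N⌉ by construction, so the sketch tests only the assumed CPCU(N) formula, not the conjectured minimality of the networks.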
A considerable number of network forms and iterative combining
rules were studied during attempts to establish the minimum cost of fault
correction. Several time-shared computer programs were written that allow
heuristic tinkering with networks, e.g., the program BAD-ONE that verifies
whether a proposed network is indeed a full permuter if one cell has a
stuck-function fault. As a result of all this experimentation, one feels
impelled to make the following conjecture:
    The cost of protection against (correction of)
    p faults in a CPCU(N) network is no more than
    the difference in cost between a CPCU(N) network
    and a CPCU(N + p) network. That is, the cost
    of correcting each additional fault, say fault i,
    is smaller than ⌈log2 (N + i)⌉.
If this conjecture is indeed true, then there would exist permutation
networks of arbitrarily high degree of reliability whose cell-count
would not exceed K·N1(N), where K is a constant related to the
probability of a switch failure.
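The kind of check attributed to BAD-ONE can be illustrated by brute force for very small networks. The sketch below is not the authors' program. It assumes a network is a cascade of two-state cells, each either passing or interchanging a fixed pair of leads, and that a stuck-function fault freezes one cell in one of its two states. The five-cell chain in the usage lines is an arbitrary example chosen for illustration.

```python
from itertools import product
from math import factorial


def reachable(n, cells, stuck=None):
    """Set of lead arrangements a cascade of two-state cells can produce.

    cells: list of (a, b) lead pairs, applied left to right.
    stuck: optional (cell_index, state) freezing one cell (1 = interchange).
    """
    perms = set()
    for states in product((0, 1), repeat=len(cells)):
        if stuck is not None and states[stuck[0]] != stuck[1]:
            continue  # this setting is not available with the fault present
        leads = list(range(n))
        for (a, b), crossed in zip(cells, states):
            if crossed:
                leads[a], leads[b] = leads[b], leads[a]
        perms.add(tuple(leads))
    return perms


def is_full_permuter(n, cells, stuck=None):
    """True if all n! arrangements remain reachable."""
    return len(reachable(n, cells, stuck)) == factorial(n)


def tolerates_single_stuck(n, cells):
    """True if the network stays a full permuter for every single cell
    stuck in either state."""
    return all(
        is_full_permuter(n, cells, (i, s))
        for i in range(len(cells))
        for s in (0, 1)
    )


three_cell = [(0, 1), (1, 2), (0, 1)]                 # nonredundant 3-permuter
five_cell = [(0, 1), (1, 2), (0, 1), (1, 2), (0, 1)]  # illustrative redundant chain
print(is_full_permuter(3, three_cell), tolerates_single_stuck(3, three_cell))
print(tolerates_single_stuck(3, five_cell))
```

The search visits 2^c settings for c cells, so it is practical only for the small values of N for which hand-built candidate networks are being examined.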
b. An Alternative Single Stuck-Function Correcting Construction
A different and slightly more economical [than the TDT(N) device]
method of providing single-fault protection in CPCU(N) networks stems from
the observation that the construction of Fig. IV-4(b) can tolerate one
cell failure in any peripheral cell if an extra cell, serving outputs Y1
and YN in the output column, is retained rather than deleted. The reason for deleting
this cell in the first place was that exactly one peripheral cell is unnecessary
in the nonredundant case. Since all of the peripheral switches
are exactly equivalent in function, it makes no difference which one we
delete. Hence, by retaining all of them we can ignore exactly one failure
in any of them. If each of the "internal" permutation networks that implement
the construction of Fig. IV-4(b) is augmented in the same way
FIG. IV-7 NON-MINIMAL REDUNDANT 4-PERMUTER, SINGLE STUCK-FUNCTION
CORRECTING
c. Correction of Bad-Output Fault Types
Considering the effect of a single fault of this type on an
otherwise full permutation network, it becomes apparent that exactly one
output lead will receive an incorrect signal in response to the supplied
inputs. The situation can be remedied by adding one extra lead to the
N-permutation network if we can also be certain that the input signals
are applied to, and the output signals derived from, the correct subset
of N leads. One way of accomplishing this is illustrated in Fig. IV-8.*

* Clearly a bad-output failure on a cell immediately preceding a network
can never be accommodated. In this case the appropriate byte is
disabled.

FIG. IV-8 NETWORK FOR CORRECTING BAD-OUTPUT FAULTS

The figure is drawn for the case N = 4, but the method is perfectly
general. To permute N leads, we employ a nonredundant N + 1 permuter
flanked on input and output with a ladder network having N switches. If
the N + 1 permuter has a bad cell lead, this will show up as a failure
of one of the signals on leads A, B, C, D or E of the internal N + 1 permuter
to arrive at its specified output. By setting the switches of the
ladder network in the obvious manner, the fault can be corrected, as
illustrated in Fig. IV-8 for the case wherein input C does not arrive
correctly at internal output 4.
The cost of correcting one failure by the method of Fig. IV-8
is 2N switches for the ladder networks plus ⌈log2 (N + 1)⌉ extra switches
[assuming the CPCU(N) construction of Fig. IV-4(b)] to implement the
N + 1 permuter rather than the N permuter. This cost, 2N + log2 N, is
asymptotically negligible compared with the cost of a CPCU(N) network
for large N. It is obvious that the foregoing construction can be extended
to correct multiple faults of the bad-output type.
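The cost comparison above can be made concrete with a short calculation. The sketch below uses the overhead stated in the text, 2N + ⌈log2 (N + 1)⌉, and the asymptotic CPCU(N) count of about N log2 N quoted earlier in this chapter. The function names and the sample values of N are illustrative only.

```python
import math


def bad_output_overhead(n: int) -> int:
    """Extra switches for single bad-output correction, per the text:
    2N for the two ladder networks plus ceil(log2(N + 1))."""
    # For n >= 1, n.bit_length() equals ceil(log2(n + 1)).
    return 2 * n + n.bit_length()


def cpcu_estimate(n: int) -> float:
    """Asymptotic CPCU(N) cell count quoted in the text, about N log2 N."""
    return n * math.log2(n)


for n in (16, 256, 4096, 65536):
    ratio = bad_output_overhead(n) / cpcu_estimate(n)
    print(n, bad_output_overhead(n), round(ratio, 3))
```

The ratio falls roughly as 2 / log2 N, so the overhead becomes negligible only slowly as N grows.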
C. Commutation Networks for Complete Permutation--Incomplete Utilization
We are assuming here that the commutation network is to serve N inputs
and N outputs [similar to the function of the CPCU(N) network], but
in this case it is only necessary to provide simultaneous connections
between m (m < N) inputs and outputs. Such a commutation function, referred
to as embodying complete permutation--incomplete utilization,
is denoted as CPIU(N,m). It is desired to specify a network that is more
economical in terms of cell-count and/or is easier to set up than a CPCU(N)
network which, of course, also achieves the CPIU(N,m) function.
It is easy to see that a CPIU(N,m) network must contain enough two-state
cells to specify $\binom{N}{m}^2 (m!)$ possible permutations; thus, N2(N,m), the
number of cells in the network, is given by
$$N_2(N,m) = \log_2\left[\binom{N}{m}^2 (m!)\right]$$
or
$$N_2(N,m) \approx 2\log_2\binom{N}{m} + N_1(m).$$
The above formulas suggest that the CPIU(N,m) function could be realized,
as depicted in Fig. IV-9, by a network composed of a CPCU(m) block (m-permuter)
sandwiched between two combination networks.
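The counting argument behind the CPIU(N,m) bound can be evaluated numerically. The sketch below computes the bound and its split into two combination terms and one permuter term. It treats log2 m! as the information-theoretic stand-in for N1(m), which is an assumption made here for illustration, as are the function names.

```python
from math import comb, factorial, log2


def cpiu_lower_bound(n: int, m: int) -> float:
    """log2 of the number of settings a CPIU(N,m) network must distinguish:
    choose m inputs, choose m outputs, then pair them in any order."""
    return log2(comb(n, m) ** 2 * factorial(m))


def decomposition(n: int, m: int) -> float:
    """The same bound written as two combination terms plus a permuter term."""
    combination_term = log2(comb(n, m))   # one COM network
    permuter_term = log2(factorial(m))    # ASSUMED stand-in for N1(m)
    return 2 * combination_term + permuter_term


def cpcu_lower_bound(n: int) -> float:
    """log2 N!, the corresponding bound for a full CPCU(N) network."""
    return log2(factorial(n))


for n, m in ((16, 4), (16, 8), (64, 8)):
    print(n, m, round(cpiu_lower_bound(n, m), 2), round(cpcu_lower_bound(n), 2))
```

When m equals N the binomial factor is 1 and the bound reduces to log2 N!, the bound for full utilization.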
It is assumed that a combination network serving m inputs and N outputs,
denoted as a COM(m,N) network, is to have the capability of connecting
FIG. IV-9 SCHEMATIC REPRESENTATION OF DECOMPOSITION OF A COMPLETE
PERMUTATION--INCOMPLETE UTILIZATION NETWORK
the set of m inputs onto any specified set of m output leads, without
regard to the order of these signals on the outputs. A similar definition
applies to a COM(N,m) network where the number of inputs is assumed to
exceed the number of outputs.
The number of cells, Nc(N,m), required for an m,N (or an N,m) combination
network is given by:
$$N_c(N,m) = \log_2\binom{N}{m} = \log_2\left[\frac{N!}{m!\,(N-m)!}\right]$$
Asymptotically, from Stirling's formula, we find that a network composed
of N - 1 cells should be sufficient to perform the combination function.
We have not found combination networks, composed of the 2-input basic
cell, which approach this N - 1 cell bound as closely as the CPCU(N) networks
approach the ⌈log2 N!⌉ bound. However, it is not difficult to specify
a COM(N,m) network that requires only N two-state cells, but wherein each
cell contains m + 1 inputs.*

* As indicated in Sec. IV-B-1, if a network is commutating parallel data
channels, each cell input is in reality a bundle of many wires.

Consider the two-state cell depicted in Fig. IV-10, with m horizontal
inputs (m = 3 for the case shown), I1, I2, ..., Im, m horizontal outputs
O1, O2, ..., Om, one vertical input, Xi, and one dummy (unused) vertical
output. In the "cross" mode, the three horizontal inputs are transferred
unaltered to the three horizontal outputs. In the "bend" mode we effect
the transformation Xi → Om, Im → Om-1, ..., I2 → O1. An m,N combination
network is easily synthesized as a cascade, containing N of these
augmented input cells, as shown in Fig. IV-11 for the case m = 3, N = 6.
The appropriate cell modes are shown for the case where it is desired to
connect the inputs X[?], X[?] and X5 to the three outputs, Y1, Y2, Y3, which
are the horizontal outputs of the last cell in the cascade. It is noted
that the first cell in the cascade actually requires no horizontal inputs,
the second cell only 1 horizontal input, ..., the mth cell only m - 1 horizontal
inputs. However, if we consider the approximate cost of a cell to
be proportional to the number of terminals served, then the cost of the
combinational realization of Fig. IV-11 is of the order of mN. We can
find realizations that require a number of cells in excess of N, yet
exhibit a cost measure significantly less than the network previously
described.

FIG. IV-10 BASIC CELL WITH AUGMENTED SET OF INPUTS: (a) CROSS MODE, (b) BEND MODE
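The behavior of the cascade of augmented cells can be illustrated with a short simulation. The sketch below models each cell as either passing its m horizontal signals unchanged ("cross") or shifting them down by one position and admitting its vertical input at the top ("bend"), following the transformation given above. The choice of inputs X2, X4 and X5 is hypothetical and is not taken from Fig. IV-11; the function names are illustrative.

```python
def cascade(inputs, modes, m):
    """Simulate a cascade of augmented cells, one cell per vertical input.

    inputs: labels of the N vertical inputs, in cascade order.
    modes:  "cross" or "bend" for each cell.
    Returns the m horizontal outputs of the last cell.
    """
    horizontal = [None] * m  # the first cell has no real horizontal inputs
    for x, mode in zip(inputs, modes):
        if mode == "bend":
            # I2 -> O1, ..., Im -> O(m-1), and the vertical input Xi -> Om
            horizontal = horizontal[1:] + [x]
        elif mode != "cross":
            raise ValueError(f"unknown cell mode: {mode}")
    return horizontal


def modes_for(selected, n):
    """Bend exactly the cells whose (1-based) vertical input is selected."""
    return ["bend" if i in selected else "cross" for i in range(1, n + 1)]


inputs = [f"X{i}" for i in range(1, 7)]  # N = 6
print(cascade(inputs, modes_for({2, 4, 5}, 6), m=3))
```

In this model the selected inputs emerge in their original spatial order, although a combination network is not required to preserve order.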
We will recursively synthesize a COM(m,N) network that is composed
of the basic two-input cell, using a procedure suggestive of the one
described in Sec. IV-B-1. Consider the COM(m,N) network represented by
Fig. IV-12, where it is assumed that 2|m and 2|N. The subnetworks, Q1
and Q2, are themselves combination networks, each with half of the number
of inputs of the total network. This arrangement requires a number,
Nc(m,N), of cells which satisfies:
$$N_c(m,N) = 2N_c(m/2, N/2) + N/2 - 1$$
Solution of this recursion for N = 2^i, m = 2^(i-1), using Nc(1,2) = 1,*
yields
$$N_c(2^{i-1}, 2^i) = (i - 1)\,2^{i-1} + 1$$
or
$$N_c(N/2, N) = (N/2)(\log_2 N - 1) + 1$$
A similar expression can be derived for the case wherein m and N are not
powers of two.

* It is clear that a single two-input cell, where one of the inputs is
not used, is a COM(1,2) network.

FIG. IV-12 RECURSIVE APPROACH TO m-N COMBINATION NETWORK
A constructive proof that this network is capable of developing a
path from the m inputs to an arbitrarily selected set of m outputs is as
follows. Let the inputs and outputs be labeled X1, X2, ..., Xm, and
Y1, Y2, ..., YN as indicated in Fig. IV-12. We will start by indicating
the appropriate modes for the (N/2) - 1 output cells serving outputs
Y2, ..., YN-1, so that for an arbitrary selection of m outputs,
exactly m/2 of these outputs are connected to Q1 and m/2 to Q2. If an
output cell serves two selected outputs, it can be arbitrarily set to
either mode. The output cells which serve the remaining set of selected
outputs are then set to the appropriate mode so that the first of these
outputs (including possibly Y1) is connected to Q1, the second connected
to Q2, etc.
The entire procedure is now repeated for each of the networks Q1
and Q2, etc., until the entire network has been set up. The network for
m = 4, N = 8, is shown in Fig. IV-13, where the connections are indicated
for the selected outputs Y[?], Y[?], Y[?], Y[?].

FIG. IV-13 4-8 COMBINATION NETWORK
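The set-up procedure for the recursive combination network lends itself to an exhaustive check for small N. The sketch below is an illustration under stated assumptions and is not taken from the report: it assumes that Y1 is wired directly to Q1, that YN is wired directly to Q2, that output cell k serves outputs Y(2k) and Y(2k+1), that the first half of the inputs feeds Q1, and that the cell count therefore obeys Nc(m,N) = 2 Nc(m/2, N/2) + N/2 - 1 with Nc(1,2) = 1. Positions are 0-based in the code and all names are hypothetical.

```python
from itertools import combinations


def route(selected, n, inputs):
    """Connect `inputs` to the sorted 0-based output positions `selected`
    of an ASSUMED recursive COM(n/2, n) network; returns {output: input}."""
    if n == 2:
        # Base case: one two-input cell with an unused input is a COM(1,2).
        assert len(selected) == 1 and len(inputs) == 1
        return {selected[0]: inputs[0]}
    half, chosen = n // 2, set(selected)
    q1, q2 = {}, {}   # subnetwork lead index -> original output position
    to_q1 = True      # the next unpaired selected output goes to Q1
    if 0 in chosen:   # Y1 is wired directly to Q1
        q1[0] = 0
        to_q1 = False
    for k in range(1, half):          # cell k serves positions 2k-1 and 2k
        a, b = 2 * k - 1, 2 * k
        if a in chosen and b in chosen:
            q1[k], q2[k - 1] = a, b   # either cell mode would do
        elif a in chosen or b in chosen:
            s = a if a in chosen else b
            if to_q1:
                q1[k] = s
            else:
                q2[k - 1] = s
            to_q1 = not to_q1
    if n - 1 in chosen:               # YN is wired directly to Q2
        assert not to_q1
        q2[half - 1] = n - 1
    assert len(q1) == len(q2) == len(selected) // 2
    mid = len(inputs) // 2
    result = {}
    for sub, part in ((q1, inputs[:mid]), (q2, inputs[mid:])):
        for lead, x in route(sorted(sub), half, part).items():
            result[sub[lead]] = x
    return result


def cell_count(n):
    """Cells in the ASSUMED recursive COM(n/2, n) construction."""
    return 1 if n == 2 else 2 * cell_count(n // 2) + n // 2 - 1


xs = ["X1", "X2", "X3", "X4"]
ok = all(set(route(list(sel), 8, xs)) == set(sel)
         for sel in combinations(range(8), 4))
print(ok, cell_count(8))
```

The check covers every choice of 4 outputs out of 8. It verifies the routing decisions of the procedure under the assumed wiring, not the cell modes drawn in Fig. IV-13.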
It is interesting to investigate techniques for incorporating redundancy
into these combination networks such that they can continue to
effect a useful assignment in the presence of cell failures. Considering
the stuck-function failure, a simple technique can be specified, for
single failures of this type, similar to the method described in Sec. IV-B-3(b).
Referring to Fig. IV-12, we note that an output cell was not required to
serve outputs Y1 and YN, and a corresponding cell was omitted from each
internal subnetwork. If we insert this cell in each case, and in addition
duplicate the cells serving the input leads, we will have added
N - 1 cells and produced a network that is tolerant to single stuck-function
failures. In Fig. IV-14 we display such a redundant 4-8 combination
network and indicate the appropriate cell modes, so as to connect
the inputs to Y1, Y3, wherein cell S is assumed stuck in the cross mode.
For the case of bad-output failures, a technique similar to that
described in Sec. IV-B-3(c) can be applied. The technique for a single
failure in the combination network case would require a COM(m + 1, N + 1)
network flanked by a ladder containing m cells on the input and a ladder
containing N cells on the output. The technique generalizes easily to
accommodate multiple failures.

FIG. IV-14 REDUNDANT 4-8 COMBINATION NETWORK FOR CORRECTION OF SINGLE
STUCK-FUNCTION FAILURES
D. Commutation Networks for Incomplete Permutation--Nonorder Preserving
It may be recalled that incomplete permutation--nonorder-preserving
networks are required to establish connection paths between a set of inputs
and outputs, where only a subset of the inputs and outputs are required
for a particular task, and the spatial ordering of the signals at
the input does not have to be retained at the output. One application
of these networks, which has been described, is concerned with the transfer
of data between registers where a redundant set of registers is
specified.*
We note, somewhat trivially, that a COM(m,N) network would function
as an incomplete permutation--nonorder-preserving network serving a non-
redundant set of m inputs and a set of N outputs of which N-m are redun-
dant. In this section we will be concerned with networks for the incom-
plete permutation--nonorder-preserving function, IPNOP(r,m), wherein there
are r inputs and r outputs, and it is necessary to connect m input-output
pairs together without regard for spatial order. For example, if r = 6,
m = 3 and we distinguish inputs X1, X3, X4 and outputs Y2, Y4, Y6, then
the network is functioning properly if it would establish one of the following
assignment sets, [X1 → Y4, X3 → Y2, X4 → Y6], or [X1 → Y6, X3 → Y2,
X4 → Y4], etc.

* Note that this application is quite different from the case wherein
we are concerned with the transfer of information between registers
that have redundant bits. In the latter case we would require an incomplete
permutation--order-preserving network.

It can be shown that a lower bound on the number of two-state cells
required for an IPNOP(r,m) network is between $\log_2\binom{r}{m}$ and $2\log_2\binom{r}{m}$.*
In Section IV-E, which is concerned with the order preserving case, we
will describe an incomplete permutation--order-preserving network composed
of 2r two-state cells that can, of course, function as an IPNOP network.
However, each cell in the network must serve up to m + 1 inputs,
providing an overall network cost that approaches 2r(m + 1). We will now
describe an IPNOP(r,m) network that requires more than 2r cells, yet exhibits
a cost measure significantly less than 2r(m + 1).

* The proof of this bound is deferred until the final report under this
contract. It appears for this case that the lower value is the tighter
bound.

Consider the network shown in Fig. IV-15, which we will demonstrate
yields recursively an IPNOP(r,m) network, where R1 and R2 are each
IPNOP(r/2, m/2) networks. This arrangement yields a number of two-state
cells, N3(r,m), which satisfies
$$N_3(r,m) = 2N_3(r/2, m/2) + r - 2$$
Solution of this recursion for r = 2^k, m = 2^(k-1) and, using N3(2,1) = 1,†
gives
$$N_3(2^k, 2^{k-1}) = 2^k (k - 1) - 2^{k-1} + 2$$
or
$$N_3(r, r/2) = r \log_2 r - \tfrac{3}{2}\, r + 2$$
Although we considered only the case m = r/2 = 2^(k-1), the recursive technique
is quite general and will yield a network corresponding to arbitrary
parameters, r, m.

† It is clear that a single two-input, two-output cell is an IPNOP(2,1)
network. It is also an IPOP(2,1) network, a property which will be exploited
in the succeeding section.

FIG. IV-15 RECURSIVE APPROACH TO INCOMPLETE PERMUTATION--
NONORDER PRESERVING NETWORK
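The closed form for N3 can be checked directly against the recursion N3(r,m) = 2 N3(r/2, m/2) + r - 2 with N3(2,1) = 1. The sketch below does so in exact integer arithmetic for r a power of two and m = r/2; the function names are illustrative.

```python
def n3_recursive(r: int) -> int:
    """N3(r, r/2) from the recursion, for r a power of two, N3(2,1) = 1."""
    return 1 if r == 2 else 2 * n3_recursive(r // 2) + r - 2


def n3_closed(r: int) -> int:
    """Closed form r*log2(r) - (3/2)*r + 2, evaluated in integers."""
    k = r.bit_length() - 1          # k = log2(r) for r a power of two
    return r * k - (3 * r) // 2 + 2


for k in range(1, 8):
    r = 2 ** k
    print(r, n3_recursive(r), n3_closed(r))
```

The coefficient of r in the closed form is 3/2: for r = 4 the recursion gives 2(1) + 2 = 4, which a coefficient of 3 would not reproduce.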
A constructive proof that this network is capable of finding a mate
for each input and output contained in an arbitrary set of m inputs and
m outputs is quite similar to the proof provided in Sec. IV-C for the
combination network. For the network of Fig. IV-15 it is apparent that
the peripheral cells immediately serving inputs X2, ..., Xr-1, and the
cells serving outputs Y1, ..., Yr can be set so that exactly half of
the m distinguished inputs are directed to R1 and half are directed to R2;
an identical requirement is satisfied for the m distinguished output
leads. The entire procedure is repeated for each of the networks R1
and R2, etc., until the entire network has been set up.
Procedures, similar to those described in Secs. IV-B-3(b) and IV-B-3(c),
can be applied to the IPNOP(r,m) network, so that the network is tolerant
to stuck-function and bad-output type faults. For the stuck-function case
we note, by referring to Fig. IV-15, that a single cell is missing on the
input and output portions of each subnetwork. The insertion of this cell
at each level in the network will yield a network tolerant to single stuck-
function cell failures.
E. Commutation Networks for Incomplete Permutation--Order Preserving
It may be recalled that the memory modules, arithmetic logic units,
and simple processor and control units can be realized as a cascade of
identical byte slices. These modules can continue to function, upon the
occurrence of failures in slices, if several spare slices are provided,
and if a commutation network is provided to route the signals between
operating slices. The function of such a commutation network, denoted
as an incomplete permutation--order-preserving network, IPOP(r,m), is to
set up connecting paths between an arbitrary set of m inputs and an ar-
bitrary set of m outputs, both sets of which are subsets of the r inputs
and r outputs, r ≥ m, so that the signal order is the same at both input
and output.
Similar to the nonorder-preserving case, it can be shown that a
lower bound on the number of two-state cells required is between $\log_2\binom{r}{m}$ and
$2\log_2\binom{r}{m}$, although for the order-preserving case, the upper value appears
to be tighter. It is possible to realize the IPOP(r,m) function
in a network composed of 2r two-state cells, where each cell contains
m + 1 inputs. It is seen that this network approximately satisfies the
$2\log_2\binom{r}{m}$ bound for the case m = r/2 since
$$\lim_{r \to \infty} \log_2\left[\binom{r}{r/2}^2\right] = 2r - 2$$
The basic cell is the type shown in Fig. IV-10, and the network is dis-
played in Fig. IV-16 for the case r = 6, m = 3: the modes of the cells
are such as to realize order-preserving connections between inputs X2,
X3, X5 and outputs Y4, Y5, Y6. Even though the number of horizontal
inputs served by the first m - 1 cells in the cascade and the number of
horizontal outputs served by the last m - 1 cells in the cascade can both
be reduced, the cost of this network is of the order of 2mr, a considerable
cost. Similar to the situation involving the other commutation
functions, the cost of the realization is significantly reduced if the
two-input cell is used as the basic primitive block.
A network composed of two-input cells, which realizes the IPOP(r,m)
function, is identical to the network (Fig. IV-15) for the nonorder-
preserving case. We have redrawn the network of Fig. IV-15 as Fig. IV-17
for the case r = 8, m = 4. It may be recalled that for the nonorder-
preserving case the function of the (r/2 - 1) output cells and the (r/2 - 1) input
cells was to connect the distinguished m input and m output leads to
the two subnetworks, R1 and R2, such that exactly m/2 distinguished inputs
and m/2 outputs were connected to each of the networks R1 and R2. For
the nonorder-preserving case the input and output cells could be set independently;
such is not the case if the network is to be used as an
order preserving network.
FIG. IV-17 RECURSIVE APPROACH TO INCOMPLETE PERMUTATION--
ORDER PRESERVING NETWORK
The following procedure for the order-preserving case will indicate
the proper mode of each cell of the network of Fig. IV-17, for a given
set of m inputs and outputs; and, hence, will prove that the network can
function as an order-preserving network. Let the distinguished set of
m inputs be Xi1, Xi2, ..., Xim, where iα > iβ if α > β, and the m outputs
be Yj1, Yj2, ..., Yjm, where jα > jβ if α > β. Consider each of the
m inputs as residing in one of two disjoint groups. Group AI contains
those inputs that do not share an input cell with another distinguished
input, and group BI contains those inputs that do share an input cell.
We will similarly define groups AO and BO for the distinguished outputs.
The goal is to assign Xi1 and Yj1 to the same subnetwork (S1 or S2),
Xi2 and Yj2 to the same subnetwork, etc., and the procedure is as follows.
Assign Xi1 and Yj1 to network S1, by appropriately setting the pertinent
input and output cells (except for the case where Xi1 = X1 and/or
Yj1 = Y1, in which case the assignment to S1 is automatic). Then assign
Xi2 and Yj2 to network S2; if Xi1 and Xi2 are in group BI, and/or Yj1
and Yj2 are in group BO, the assignment to S2 is automatic. Next, assign
Xi3 and Yj3 to S1, etc., until all of the m distinguished inputs and outputs
have been set. This procedure is then applied to set the pertinent
output and input cells of the networks S1 and S2, etc. It is clear that
this assignment procedure can always be carried out. In Fig. IV-17 we
show the setting of the input and output cells for the case Xi1 = X2,
Xi2 = X[?], Xi3 = X[?], Xi4 = X[?] and Yj1 = Y1, Yj2 = Y3, Yj3 = Y6,
Yj4 = Y7.
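The alternating assignment procedure can be exercised mechanically for the case r = 8, m = 4. The sketch below is an illustration under assumptions that the text does not state explicitly: lead 1 is taken to be wired directly to S1, lead r directly to S2, and peripheral cell k to serve leads 2k and 2k+1 (1-based), sending one to each subnetwork. Positions are 0-based in the code. The input selection X2, X4, X5, X8 is hypothetical, while the output selection Y1, Y3, Y6, Y7 follows the example in the text.

```python
from itertools import combinations


def split(leads, r):
    """Assign distinguished leads alternately to S1 and S2.

    leads: list of (position, label), sorted by 0-based position, on r leads.
    Returns two lists of (subnetwork position, label), each sorted.
    """
    s1, s2 = [], []
    for rank, (pos, label) in enumerate(leads):
        to_s1 = rank % 2 == 0            # first to S1, second to S2, ...
        if pos == 0:
            assert to_s1                 # lead 1 can only reach S1
            s1.append((0, label))
        elif pos == r - 1:
            assert not to_s1             # lead r can only reach S2
            s2.append((r // 2 - 1, label))
        else:
            k = (pos + 1) // 2           # peripheral cell serving this lead
            if to_s1:
                s1.append((k, label))
            else:
                s2.append((k - 1, label))
    # A cell must never send both of its leads to the same subnetwork.
    assert len({p for p, _ in s1}) == len(s1)
    assert len({p for p, _ in s2}) == len(s2)
    return s1, s2


def connect(ins, outs, r):
    """Return the (input, output) label pairs established by the procedure."""
    if r == 2:
        # Base case: a single two-input, two-output cell is an IPOP(2,1).
        assert len(ins) == len(outs) == 1
        return [(ins[0][1], outs[0][1])]
    i1, i2 = split(ins, r)
    o1, o2 = split(outs, r)
    return connect(i1, o1, r // 2) + connect(i2, o2, r // 2)


ins = [(1, "X2"), (3, "X4"), (4, "X5"), (7, "X8")]    # hypothetical inputs
outs = [(0, "Y1"), (2, "Y3"), (5, "Y6"), (6, "Y7")]   # outputs as in the text
print(sorted(connect(ins, outs, 8)))
```

Because each lead's path depends only on its rank among the distinguished leads, the k-th input and the k-th output always meet in the same base cell, which is the order-preserving property the procedure is meant to establish.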
The techniques for providing failure tolerant IPOP networks are not
discussed in this section since they are quite similar to the techniques
described in Secs. IV-B-3(b) and IV-B-3(c). We note that a cell failure
in the IPOP network (two-input cell type) can disable no more than two
byte slices each for the input and output. Since it is assumed that redundant
slices are provided, it is possible that a nonredundant network
would be used, and when cell failures are detected, the slices that could
not be served by the network would be discarded.
F. Commutation Networks for "Shorting"
In Sec. IV-E we described networks that, for a redundant byte-sliced
network, can serve to route external data between the operating slices
of distinct networks (e.g., between an SP and ALU). It was noted that
internal data (e.g., control and carry information) must be routed between
the stages of the byte-sliced cascade. If a stage (or slice) has failed,
then the internal data intended for that stage, which clearly comes from
its immediate predecessor or successor, must be shorted around that failed
stage. If this shorting process is not accomplished reliably, then the
entire network will be disabled.
The shorting function is quite naturally achieved with the two-input
basic cell, as illustrated in Fig. IV-18. For simplicity, only a signal
flow to the right has been indicated although it is clear that the network
could be modified to handle bi-directional flow. We have shown the appropriate
cell modes so that byte slice 2 and byte slices 5 and 6 are shorted
out. We note that the network could recover from a single component failure
within a cell, which results in either the stuck-function failure or
the bad-output failure. However, a more severe cell failure which results
in, for example, a permanent logical zero signal on both outputs of a cell
would clearly disable the network, i.e., interrupt the signal flow.
Such a failure, which could only result from two component failures
within a cell, could be accommodated by the redundant shorting network
of Fig. IV-19. Also indicated are the appropriate modes for cells S1,
S1', S2, S2' such that byte-slice 1 is shorted out, i.e., the output from
slice 0 is directed to the input terminal of slice 2. We have also shown
the appropriate modes for cells S3, S3', S4 such that the network continues
to function although both outputs of S4' are faulty. In this case
byte-slice 3 cannot be used, but the signal flow is not interrupted.
Similarly we have shown how the network accommodates to a double-output
failure in S[?], in which case slice 5 is bypassed. This technique can
clearly be extended to handle failures of greater multiplicity.

G. Summary
In this chapter we have studied in detail the logical design of networks
that could perform the various data switching or commutation required
in a multiprocessor organization where the various modules are repairable.
It is assumed that the memory, arithmetic logic, and possibly the simple
processor and control modules are realized in a byte-sliced manner--a
realization that has been demonstrated to be practical. It is felt that
the designs we have presented, based upon the primitive two-input, two-output
reversing cell, represent adequate engineering solutions to all of
the commutation problems posed, although some theoretical minimization
problems still remain. These problems relate to minimum cost designs
for the complete- and incomplete-permutation functions considering both
the nonredundant realizations and the realizations that are tolerant to
cell failures.
V ULTRARELIABLE PROGRAMMING
In this section the problem of constructing ultrareliable computer
programs is treated. The term "ultrareliable" in this context refers to
a program that not only operates completely flawlessly in normal circum-
stances, but one which is insensitive to faults that are introduced through
the input data stream or certain hardware failures. The approach used here
is to classify the common reasons for faulty programs, and discuss methods
for the prevention of these faults, the detection of failures arising
from such faults, and the recovery within the computer system from detected
failures. Although the principal interest of this section is computer
software, because of the intimate relation of computer software to computer
hardware, in many cases the appropriate means for attaining reliable pro-
grams will be hardware oriented or will be a combination of hardware and
software techniques.
In Sec. A the classification of common software faults is presented.
Each class is treated separately in Secs. B-D, and Sec. E contains a short
summary and a list of several unresolved problems that are identified in
the course of this discussion.
A. Classification of Program Faults
For purposes of exposition, we define a fault to be a program characteristic
that can cause a program to execute improperly under some set
of conditions that may depend on program state, input data, or timing.
An instance in which a program executes improperly is said to be a failure.
In this section, common program faults are grouped into one of three categories
according to qualities that determine methods for eliminating the
faults. The categories can be briefly described as faults due to problems
of data analysis, program checkout, or the execution of a program in a
context that is outside of the scope of its validation and definition.
These faults can be attributed to human mistakes in the composition of
the program or to hardware faults in, for example, a back-up memory
where a program is stored. More specifically the categories are:
Type I.   Program algorithm is essentially correct, but
          program produces inaccurate results or fails
          to terminate because of problems in numerical
          analysis.
Type II.  Program contains bugs, i.e., has faults such
          that it fails to perform processes according
          to specifications.
Type III. Program is completely correct for its scope
          of activities, but fails when operating outside
          its scope.
Because the statements above are stated rather broadly, the cate-
gorization immediately allows us to make some statements about a general
methodology for achieving reliable programs.
Type I faults are principally due to failure of an algorithm to
account for cumulative effects of round-off and truncation errors, or
to its operating on a range of data for which the algorithm is unstable.* The
methodology to be used for this class of faults is oriented to problems
in the representation and manipulation of numerical data in a computer.
Rather than attempt to address this section to the entire field of
numerical analysis, we shall merely call attention to the need for such
analysis and devote the greatest portion of the discussion to the design
of hardware that will alleviate many problems of numerical analysis.

* At initial observation numerical analysis appears to be strictly a software
problem which is solved once, i.e., when the program is written, but
within the graceful degradation concept we are proposing, it is imperative
to detect instabilities in algorithms caused by an insufficient
quantity of equipment remaining available for program execution.

The methodology to be used to eliminate Type II faults, the programming
bugs, is slanted to the use of redundancy in program specification
to permit software support programs to aid the creation and check-out of
reliable programs. Program bugs can arise from many sources and all are
susceptible to some checks. Transcription errors can be caught by redundancy
in the language, logical errors by the use of aids like decision
tables, and blunders by consistency and completeness checks that are
programmed to be independent of the algorithm that they check.
Type III faults are intended as a catch-all category. Failures that
arise either from undetected Type I or II faults or from minor transient
hardware failures generally appear in programs that are otherwise
error-free. It is common practice to validate programs by checking their
behavior with test data that lie in their scope of validation, i.e., for the
"if" conditions of hypothetical cases. Few programs are written to work
properly "only if" the data and the state of the program are confined to
the scope of validation. To guard against Type III faults, we use the
methodology of "if and only if" programming. This is the practice of using
extensive checks, both software and hardware, to validate all input data
and, when possible, internally generated data in order to guarantee that
they are in the range of definition and remain so during the course of
a computation.
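The practice described above can be sketched in Python. This is a minimal illustration only: the routine `normalized_root`, its validated ranges, and the exception class are hypothetical and are not taken from the report.

```python
import math


class RangeOfDefinitionError(ValueError):
    """Raised when a datum leaves the range for which a routine was validated."""


def require(condition, message):
    # Single point through which every range-of-definition check is reported.
    if not condition:
        raise RangeOfDefinitionError(message)


def normalized_root(reading, full_scale):
    # Validate all input data before they are used (hypothetical ranges).
    require(math.isfinite(reading) and math.isfinite(full_scale), "non-finite input")
    require(full_scale > 0.0, "full_scale must be positive")
    require(0.0 <= reading <= full_scale, "reading outside validated range")

    ratio = reading / full_scale

    # Validate internally generated data as well, so that the computation
    # proceeds only if its state remains inside the scope of validation.
    require(0.0 <= ratio <= 1.0, "intermediate ratio left [0, 1]")
    return math.sqrt(ratio)
```

A caller that receives `RangeOfDefinitionError` knows that the routine was asked to work outside its scope of validation, rather than receiving a plausible but unvalidated number.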
B. Faults Arising from Numerical Analysis
1. The Need for Analysis
Most of the problems of numerical analysis that give rise to computa-
tion failures are due to the finiteness of the representation of numbers
in a computer. If, in practice, numerical representations can be made
arbitrarily long, representation errors can be made arbitrarily small.
Nevertheless, for practical reasons, numerical data are fitted into fixed-
length fields that are deemed to be sufficiently long to give accurate
results in most cases. The length, which varies from machine to machine,
is typically between 24 and 56 bits for floating-point mantissas.
A calculation, i.e., the implementation and execution of an algorithm on
a specific machine, must be subjected to thorough numerical analysis to
guarantee that it will produce results of sufficient accuracy.
To borrow an example from Cody,19 let us examine the computation of
the mean of two numbers and use the formula
$$M = (x_1 + x_2)/2$$
If the representation of $x_1$ and $x_2$ is floating point, and the base is
other than 2, then in a binary computer it is possible to compute a mean
$M$ that is less than either $x_1$ or $x_2$. For example, consider the
computation using radix 10 arithmetic with two digit precision. Let
$x_1 = 51$, $x_2 = 52$, and note that
$51 \cdot 10^0 + 52 \cdot 10^0 = 103 \cdot 10^0$, which, to two places
accuracy, is $10 \cdot 10^1$. Division by two yields $50 \cdot 10^0$, which
is less than either operand.
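The two-digit decimal example can be reproduced with Python's `decimal` module, which lets the working precision be set explicitly. The sketch below is illustrative; the function names are hypothetical, and the rearranged formula in `offset_mean` is a commonly used alternative that is not taken from the report.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

# Two-digit radix-10 arithmetic, as in the example above.
TWO_DIGITS = Context(prec=2, rounding=ROUND_HALF_EVEN)


def naive_mean(x1, x2, ctx=TWO_DIGITS):
    # M = (x1 + x2) / 2, with each intermediate result rounded to ctx.prec digits.
    return ctx.divide(ctx.add(x1, x2), Decimal(2))


def offset_mean(x1, x2, ctx=TWO_DIGITS):
    # M = x1 + (x2 - x1) / 2; the sum x1 + x2 is never formed, so no digit
    # is lost to the carry out of the two-digit field.
    return ctx.add(x1, ctx.divide(ctx.subtract(x2, x1), Decimal(2)))


x1, x2 = Decimal(51), Decimal(52)
print(naive_mean(x1, x2))   # numerically 50, below both operands
print(offset_mean(x1, x2))  # remains within [51, 52]
```

For these operands the rearranged form keeps the result inside the interval spanned by the operands; whether it does so in general depends on the rounding rule and the data.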
There are two serious effects of this type of fault. The most
obvious is in the introduction of a small error in the least significant
digit that is slightly greater than the apparent accuracy of the computa-
tion. A more subtle effect of the error is to place the result outside
the theoretical range of possible answers. The latter effect could lead
to instability of a computation.
A second example of the pitfalls of computation is given by Neely,20
who illustrates several alternate computations for the mean, standard
deviation, and correlation coefficient calculation on typical statistical
data. Although the several alternatives are algebraically equivalent,
the computations give widely disparate results. In his example, the
most direct computations generate answers that are among the least
accurate, and the simple expedient of carrying out the calculation in
double precision results in answers that are among the most accurate.
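The effect can be illustrated for the variance alone. The sketch below is not Neely's computation and does not use his data; the four sample values are hypothetical and were chosen so that the data have a large mean and a small spread. Both functions are algebraically equivalent, yet in IEEE double precision the direct one-pass formula is far from the true sample variance of 30 while the two-pass form is exact for these data.

```python
def variance_one_pass(data):
    # Direct formula: (sum(x^2) - (sum x)^2 / n) / (n - 1).
    # Two nearly equal large quantities are subtracted at the end.
    n = len(data)
    total = 0.0
    total_sq = 0.0
    for x in data:
        total += x
        total_sq += x * x
    return (total_sq - total * total / n) / (n - 1)


def variance_two_pass(data):
    # First pass: the mean. Second pass: squared deviations from the mean,
    # which are small numbers and therefore lose little significance.
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)


samples = [1.0e9 + 4.0, 1.0e9 + 7.0, 1.0e9 + 13.0, 1.0e9 + 16.0]
print(variance_one_pass(samples))  # far from 30 in double precision
print(variance_two_pass(samples))  # 30.0
```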
The point of these examples is to demonstrate the need for numerical
analysis to aid in the development and analysis of computations for
particular computer systems. It is vital that the analysis be done in
the context of the computer on which the algorithm is to be executed,
of course, because it is precisely the idiosyncratic behavior of computer
arithmetic units that requires the detailed numerical analysis. The
grosser characteristics of numerical processing are inherently consistent
from machine to machine.
Because numerical analysis is so broad a subject, we cannot be
more explicit than to issue a caveat to the programmer to take into account
the problems of numerical analysis. However, there are other general
guidelines that fall in the area of computer design which we can explore
here. In particular, the hardware can be designed so as to minimize the
errors of numerical representations, and specific steps can be taken for
the detection of computational errors, and for the programmatic recovery
from errors when they are detected. We pursue these questions separately
in the following subsections.
2. Design of Floating-Point Hardware to Aid Numerical Analysis
Two differing viewpoints have come to be reflected in the design of
floating-point hardware. The first focuses on obtaining the greatest
possible precision of an operation and usually depends on normalized
arithmetic. The second viewpoint, which is somewhat opposed to the first,
is concerned primarily with obtaining results that are known to be sig-
nificant at the possible expense of precision. The latter is implemented
primarily in unnormalized arithmetic. Both viewpoints are compatible
with reliable computing, because the best estimate of the true value of
an answer can be obtained through the use of normalized operations,
whereas unnormalized operations can be used to give an estimate of the
range of possible values that may contain the true answer. In this
section we discuss the design of floating-point hardware for both normalized
and unnormalized operations.
We first consider normalized floating-point operations. In the previous
section, the calculation of a mean yields an erroneous result because
significant data is shifted off the right hand portion of an intermediate
result and replaced by significant 0's. Even if the operation is normalized
floating-point addition, it is possible to lose significance as shown in
the example when the floating-point radix is greater than 2. When the radix
is not binary, then the representation of a normalized quantity in a binary
machine may have leading 0 bits ($10 \cdot 10^1$ in the example) and the
leading 0's are obtained at the expense of bits lost from the least
significant portion of an operand. Hence, in a binary computer, normalized
floating-point arithmetic in a radix other than 2 contributes to the
inaccuracy of computation. The number of bits of lost precision is
approximately one half the base two logarithm of the floating-point radix.
For radix 2, this loss is 1/2 bit, which is equal to the intrinsic loss of
precision in the binary representation of real quantities. For larger radices,
the loss of significance becomes greater and can become quite noticeable
because it introduces a biased error in the representation. Single
precision on System/360 computers, for example, is radix 16 with 24 bit
mantissas, yielding a significance loss of two bits in 24 (a precision
of one part in $6 \cdot 10^6$). This is very low for general purpose
computation and must be used with caution.
To obtain the most reliable results, it is clear that radix 2 yields
the greatest precision for a fixed mantissa length in a binary computer.
There is a trade-off involved in using radix 2 arithmetic because the
increased precision of mantissas comes at the cost of increased length
of exponent. Larger radices appear to be attractive because for each
bit of exponent that is saved, only 1/2 bit of significance of mantissa
is lost. However, because larger radices introduce a biased error, and
because the apparent economy of representation, using large radices, is
only one or two bits for radices that are reasonable to implement, it is
recommended that radix 2 arithmetic be used for normalized floating-point
operations, especially where reliability is an important factor.
There are several details in the manipulation of numerical quantities
that require examination. It is straightforward to construct a repertoire
of normalized arithmetic operations that consistently yield the highest
possible precision. Nevertheless, the most important details are given
here because so few computers have incorporated all of these details into
their hardware. In the material that follows it will be assumed that the
floating-point operations are radix 2.
Floating-point addition and subtraction operations introduce signifi-
cance loss when an operand is shifted during exponent adjustment. If we
assume that operands are normalized, then the smaller of a pair of operands
must be right-shifted until its exponent is equal to that of the larger
operand. Hence, significant digits are shifted off the right of an operand
prior to forming a sum.
A single guard bit at the right-hand end of an adder can be used to
ensure that the smaller operand is rounded rather than truncated prior
to addition. Both truncation and rounding introduce errors in significance,
but rounding errors are preferable because they tend to be unbiased.
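The difference in bias can be made concrete with a small integer model of exponent alignment. The sketch below is an assumption-laden illustration: mantissas are modeled as unsigned integers, the 8-bit width and the shift of three places are arbitrary, and the rounding rule is round-half-up on a single guard bit, which is the simplest rule one guard bit supports.

```python
from fractions import Fraction


def align_truncate(mantissa, shift):
    # Right-shift and discard every bit that is shifted out.
    return mantissa >> shift


def align_round(mantissa, shift):
    # Keep one guard bit below the retained field, then round on that bit.
    if shift == 0:
        return mantissa
    with_guard = mantissa >> (shift - 1)
    return (with_guard + 1) >> 1


def mean_error(align, shift, bits=8):
    # Average of (aligned - exact) over all normalized mantissas of the given
    # width, measured in units of the least significant retained bit.
    lo, hi = 1 << (bits - 1), 1 << bits
    total = sum(Fraction(align(m, shift)) - Fraction(m, 1 << shift)
                for m in range(lo, hi))
    return total / (hi - lo)


print(mean_error(align_truncate, 3))  # -7/16: always errs in the same direction
print(mean_error(align_round, 3))     # 1/16: a much smaller residual bias
```

The residual bias of the rounded alignment comes from the half-way case always rounding upward; a rule that breaks ties toward an even result would need more than the single guard bit modeled here.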
Multiplication in floating-point representation can sometimes yield
unnormalized products, even when both operands are normalized. Post-
normalization of the product involves a left shift of no more than one
bit, but that bit should be significant to ensure full precision. Hence,
the multiplication hardware must make provision for saving a guard bit
from the partial product in case a postnormalization left shift is
necessary. Rounding of the final product should occur after postnormaliza-
tion so that there must be provision for a second guard digit for rounding
the product in case postnormalization is necessary.
Accumulators in computers can be constructed with several guard
digits so that several intermediate operations can use extended precision
operands, then round the numbers to single precision prior to restoring
them in main memory. In actual practice, many algorithms make use of
double-length accumulators for intermediate calculations, following the
philosophy that is suggested here. There is a trade-off to be considered
here because the number of guard digits that should be kept in an
accumulator depends on the relative magnitude of the operands and the
number of operations to be performed prior to restoration of data in
memory. It may be possible, for example, to obtain satisfactory precision
by extending the accumulator by four to six bits, and, thereby, save a
second full-length accumulator for other purposes. When the nature of the
calculations is known, their characteristics should be taken into account
to determine how extended precision operations can best be implemented.
When single precision is inadequate for a computation, all arithmetic
operations can be made sufficiently accurate by using multiple precision
arithmetic and extended length numerical representations. Some multiple
precision operations might be included in the computer instruction reper-
toire, but provision should be made for performing extended precision
operations of arbitrary length by using appropriate software subroutines.
To simplify the multiple precision software, it is essential that the
mantissa overflow bit be program-accessible so that overflow bits can
be treated as carries from word to word in the extended precision
representation of a number. It is also necessary that overflow be
signalled after the completion of an operation and that the single
precision result of an operation that produces an overflow condition
be significant except for the information that is held in the overflow
bit. In practice, some computers treat overflow as an error condition
and place meaningless information in the mantissa of the result. This
places an undue burden on the task of programming the multiple precision
operations. Overflow of exponent should be treated in the same manner
as overflow of mantissa.
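A minimal sketch of why a program-accessible overflow bit simplifies multiple precision software follows. It models only unsigned mantissa addition; the 24-bit word length and all names are assumptions made for illustration.

```python
WORD_BITS = 24                      # assumed single-precision mantissa length
WORD_MASK = (1 << WORD_BITS) - 1


def add_word(a, b, carry_in=0):
    # Single-precision add that leaves a significant low-order result and
    # reports the overflow bit separately, instead of treating overflow as
    # an error and destroying the result.
    total = a + b + carry_in
    return total & WORD_MASK, total >> WORD_BITS


def add_multiword(x, y):
    # x and y are equal-length lists of words, least significant word first.
    # The overflow bit of each word is the carry into the next word.
    result, carry = [], 0
    for a, b in zip(x, y):
        word, carry = add_word(a, b, carry)
        result.append(word)
    return result, carry


def to_int(words):
    # Helper for checking: value of a multiword number as one integer.
    return sum(w << (WORD_BITS * i) for i, w in enumerate(words))
```

If the hardware placed meaningless information in the result word on overflow, `add_word` could not be written as a single machine operation and each word addition would have to be split or pre-tested by the software.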
The steps above are suggestions for obtaining the greatest possible
precision from floating-point hardware. An alternative viewpoint is to
obtain results that are guaranteed to be fully significant. For this
purpose the unnormalized mode of floating-point arithmetic has been
recommended and is described in detail elsewhere.21 Unnormalized
arithmetic achieves its goal by controlling the postnormalization of
results to eliminate the introduction of insignificant data. For addition
and subtraction, postnormalization is eliminated completely except
when overflow occurs. Multiplication and division use more complex
formulas to determine the amount of postnormalization. The discussion
on the dangers of truncation as opposed to rounding and the need for
guard digits applies to unnormalized operations as well as to normalized
operations.
3. Detection of Failures Arising from Numerical Analysis
Numerical analysis problems usually lead to one of two types of failures,
loss of numerical significance or algorithm instability. We deal with
each of these separately.
To detect loss of significance, several competitive methods might
be implemented in an ultrareliable system. We have already discussed
the use of unnormalized arithmetic briefly, and note that this is
attractive for guaranteeing the significance of the result. Calculations can
be performed in both normalized and unnormalized modes so that the best
estimate of the true result can be obtained from the normalized answer
while the unnormalized answer can be used to indicate the precision of
the calculation.
Another scheme is to perform a calculation in two modes in order
to define the endpoints of the interval of numbers that contains the
true answer. One endpoint is calculated by using basic operations that
always truncate their results and by using algorithms that provide a
lower bound for the true result. The second mode is identical to the
first except that results are always rounded upwards and the algorithm
is programmed to provide an upper bound for the true result. Intermediate
operands and final results are intervals on the real line instead
of simple real numbers, hence the name interval arithmetic. A full
discussion appears in Sec. V-A-4.
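The two-mode scheme can be sketched with Python's `decimal` module, using one context that always rounds downward and one that always rounds upward. The six-digit precision and the function names are assumptions; only addition and the division of strictly positive intervals are shown, and for positive numbers rounding downward coincides with the truncation mentioned above.

```python
from decimal import Context, Decimal, ROUND_CEILING, ROUND_FLOOR
from fractions import Fraction

DIGITS = 6                                     # assumed working precision
DOWN = Context(prec=DIGITS, rounding=ROUND_FLOOR)
UP = Context(prec=DIGITS, rounding=ROUND_CEILING)


def interval(value):
    # Degenerate interval [v, v] for an exactly representable value.
    d = Decimal(value)
    return (d, d)


def add(a, b):
    # Lower endpoints are rounded down, upper endpoints are rounded up.
    return (DOWN.add(a[0], b[0]), UP.add(a[1], b[1]))


def div_positive(a, b):
    # Valid only when both intervals contain strictly positive numbers.
    return (DOWN.divide(a[0], b[1]), UP.divide(a[1], b[0]))


def contains(a, exact):
    # exact is a Fraction; Decimal endpoints convert to Fraction exactly.
    return Fraction(a[0]) <= exact <= Fraction(a[1])


third = div_positive(interval(1), interval(3))
print(third, contains(third, Fraction(1, 3)))
```

The width of the final interval is a direct, if sometimes pessimistic, indication of how much significance has been lost.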
Check algorithms can be used to reveal the accuracy of a result.
Linear problems, for example, usually yield a set of residuals that should
sum to zero. The actual sum of the residuals provides an estimate of
the significance of the result. In many other cases, there is sufficient
data available at the end of a calculation that can be used for similar
check computations.
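One familiar instance of such a check is a least-squares straight-line fit with an intercept, whose residuals sum to zero in exact arithmetic. The sketch below is illustrative only; the function names and the normalization by the data scale are assumptions, not the report's procedure.

```python
def fit_line(xs, ys):
    # Ordinary least-squares fit y = intercept + slope * x.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residual_check(xs, ys, intercept, slope):
    # Sum of residuals relative to the scale of the data. A value near the
    # machine precision indicates a significant result; a large value
    # indicates lost significance or a faulty fit.
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    scale = sum(abs(y) for y in ys) or 1.0
    return abs(sum(residuals)) / scale
```

The check uses only the inputs and the outputs of the fit, so it does not depend on how the coefficients were computed.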
Another scheme that has been implemented in some commercial computers
is the use of a significance alarm to detect excessive prenormalization
or postnormalization shifts during floating-point addition operations.
One or more bits per operand can be allocated to store the significance of
the operand. A single significance bit can only differentiate between
significant and insignificant operands, whereas several significance bits
permit several levels of significance to be represented. These bits can
be maintained automatically by the hardware. This is seen to be another
form of unnormalized representation when the number of significance levels
equals the number of bits in the mantissa.
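A software analogue of a significance alarm can be sketched as follows. It covers only the postnormalization case, in which the subtraction of nearly equal operands cancels leading bits; the 53-bit mantissa (IEEE double precision), the threshold of 20 bits, and the function names are assumptions of the sketch and do not describe any particular commercial machine.

```python
import math

MANTISSA_BITS = 53  # IEEE double precision, an assumption of this sketch


def cancelled_bits(a, b):
    # Number of leading bits cancelled in forming a - b, i.e. the size of
    # the postnormalization left shift the hardware would have to perform.
    diff = a - b
    if diff == 0.0:
        return MANTISSA_BITS
    _, exp_operand = math.frexp(max(abs(a), abs(b)))
    _, exp_diff = math.frexp(diff)
    return max(0, exp_operand - exp_diff)


def subtract_with_alarm(a, b, max_shift=20):
    # Returns the difference together with a flag that plays the role of
    # the significance alarm.
    return a - b, cancelled_bits(a, b) > max_shift
```

Carrying the count along with each operand, instead of a single alarm flag, would correspond to the several significance bits per operand described above.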
The second topic for this section is that of the detection of insta-
bility in numerical algorithms. The problem is to determine the rate of
convergence of calculations dynamically so that nonconvergent calculations
can be discovered programmatically. Calculations that iterate to a
solution typically assume an initial transient phase characterized by
large fluctuations. After a number of iterations that depends on the
nature of the calculation, the transient fluctuations die away and the
calculation enters a phase in which it converges uniformly to a solution.
The difficulty in determining the rate of convergence lies in the problem
of differentiating the transient fluctuations from divergence. When it
is possible to bound the number of iterations that are subject to
transient fluctuations of large magnitude, it becomes a simple matter of
computing the rate of convergence after the transients are known to have
died away. Calculations that do not lend themselves to this method must
be treated in other ways. A simple expedient is to fix an absolute
upper bound on the number of iterations that can be performed.
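Both safeguards, a bounded transient phase followed by a convergence test and an absolute cap on iterations, can be combined in one monitor. The sketch below is a simple illustration under stated assumptions: the convergence criterion (each step must be smaller than the one before once the transient bound has passed) and all parameter names are hypothetical.

```python
def iterate(step, x0, tol=1e-12, transient=5, max_iter=100):
    # Returns (x, status). During the first `transient` iterations large
    # fluctuations are tolerated; afterwards the change per iteration must
    # keep shrinking. `max_iter` is the absolute upper bound.
    x = x0
    prev_change = None
    for k in range(1, max_iter + 1):
        x_new = step(x)
        change = abs(x_new - x)
        x = x_new
        if change <= tol:
            return x, "converged"
        if k > transient and prev_change is not None and change >= prev_change:
            return x, "not converging"
        prev_change = change
    return x, "iteration limit"


# Newton's iteration for the square root of 2.
root, status = iterate(lambda x: 0.5 * (x + 2.0 / x), 10.0)
print(root, status)
```

A calculation whose transient length cannot be bounded would rely on the `max_iter` cap alone, as suggested in the text.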
4. Recovery from Detected Numerical Computation Failures
In this section it is assumed that the two types of detected failures
are loss of significance or excessively poor rate of convergence of a
calculation.
Failures involving the loss of significance can be eliminated by
recomputation of numerical quantities in multiple precision. Problems
for which double precision is insufficient can be eliminated by triple
or quadruple precision calculations. Pathological cases do exist, however,
that require calculations of even greater precision, but precision require-
ments for these calculations can be reduced significantly by preconditioning
and scaling the data. Recovery from loss of significance should call for
recalculation using progressively greater precision until either a
significant result is obtained, or the cost of continuing the recalculation
exceeds the potential usefulness of obtaining increased significance in
the result.
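The recovery loop can be sketched with Python's `decimal` module, in which the precision of a calculation is selected by a context. The example problem (the smaller root of a quadratic, computed by a formula that cancels badly), the use of a substitution check as the detector of lost significance, the list of precisions standing in for the cost limit, and all names are assumptions made for illustration.

```python
from decimal import Context, Decimal

CHECK = Context(prec=60)  # generous precision used only by the check


def small_root(b, c, ctx):
    # Smaller root of x^2 - b*x + c = 0 by the direct formula, which loses
    # significance when b*b is much larger than c.
    disc = ctx.subtract(ctx.multiply(b, b), ctx.multiply(Decimal(4), c))
    return ctx.divide(ctx.subtract(b, ctx.sqrt(disc)), Decimal(2))


def residual(x, b, c):
    # Independent check: substitute the root back into the polynomial.
    value = CHECK.add(CHECK.subtract(CHECK.multiply(x, x), CHECK.multiply(b, x)), c)
    return abs(value)


def solve_with_recovery(b, c, tol, precisions=(8, 16, 32, 64)):
    # Recompute at progressively greater precision until the check passes
    # or the permitted precisions (the cost budget) are exhausted.
    for prec in precisions:
        x = small_root(b, c, Context(prec=prec))
        if residual(x, b, c) <= tol:
            return x, prec
    return None, None


x, prec = solve_with_recovery(Decimal(10) ** 8, Decimal(1), Decimal("1E-6"))
print(x, prec)  # the 8-digit attempt fails the check; 16 digits suffice
```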
Since the recalculation with increased significance should be invoked
automatically by the programming system, we shall consider how this might
be effected. There are at least three different methods for accomplishing
this.
The first method is based on the control of precision through the
program instruction repertoire. Multiple versions of a program for
calculations of varying precision can be prewritten or compiled on the
spot from a stored high-level language description of the algorithm. In
either case, increased precision is obtained by using multiple precision
instructions in the machine instruction repertoire or by making explicit
calls on multiple precision software routines.
A second method is to make the precision of the calculation data
dependent. Each datum is tagged with a field that indicates its
precision. Upon execution of an arithmetic instruction, the lengths of
the operands and the length of the result of the operation are functions
of the tags of the operands. (The Burroughs B-6500 uses this mode of
operation.)
The third method is to make use of microprogrammed, primitive
sequences of instructions to implement arithmetic commands, and to use
a microprogram memory that is modifiable. The precision of the arith-
metic operations can be altered by making changes in the microprogram
sequences for these instructions.
Each of these three methods requires further study in context to
determine how to handle problems of memory allocation of data so that
change of precision can occur with relative ease. Note that in each
of these systems, constants must be stored in the greatest precision
that might take part in a calculation.
C. Failures Arising from Program Faults
Programs can be extremely complex entities, sometimes far more
complex than the computer system on which they are executed. In spite
of careful preparation and extensive checkout, rarely do complex
programming systems contain no faults. There are several reasons for this.
First, programming a complex system is a difficult task. Every contin-
gency must be foreseen and explicit instructions must be written to
handle each contingency. The larger systems require a cooperative
effort of a large number of individuals, but the demands are such that
the output of each individual must mesh perfectly into the programming
system. When systems are tested, their behavior depends not only on
the input data, but on the internal state of the system, and quite
frequently depends on random timing of events. Exhaustive testing of
programming systems over all possible internal state configurations,
representative input data, and typical timing conditions is not feasible.
For complex systems, even after several continuous months of testing,
only an infinitesimally small fraction of possible conditions can be
tested.
Hence, to create fault-free programs, several techniques that do
not involve exhaustive testing must be used. In this section we describe
several techniques for the prevention and detection of, and recovery
from, program faults.
1. Prevention of Programming Faults
There is no panacea for preventing programming faults. It is the
nature of programming that every detail must be specified, either explic-
itly or implicitly. Any single incorrect detail in the program can lead
to its failure, and the total volume of detail easily exceeds the volume
that a normal human can comfortably focus attention upon at one time.
In this section we explore a body of techniques that aid the human in
describing the details that constitute a program, checking out the program,
and maintaining that program after checkout.
a. High-Level Languages
The value of high-level languages is well known. They relieve
the programmer of a great deal of the burden of specification, and
thereby eliminate an important source of error. As the understanding of
application programs progresses, the compilers for application programs
likewise evolve to include greater power and flexibility. The original
FORTRAN compiler gave the programmer primarily the facility for alge-
braic manipulation, control of loops, definition of data arrays, and
formatted input/output. High-level processes have evolved so that
recent compilers include processes for memory management, file control
and manipulation, sorting and searching of data bases, and symbolic
manipulation of data structures. Languages will continue to evolve,
and as they include more complex processes as basic language elements,
the problems of writing programs that duplicate these processes will
virtually disappear.
The future is not as rosy as the previous paragraph may indicate.
The evolution of high-level languages has seen one level of
programming errors disappear to be replaced by totally new programming
errors. Today's programmers need never write out the details for an
addressing polynomial, but they must be capable of determining the
proper controls for system executive, language compiler, linkage editor,
and relocating loader. Program checkout has actually increased in
difficulty because high-level languages let the programmer create more
complex programs than might be attempted without their aid, thereby
greatly increasing the number of operations within the program that
might be faulty. Evolution of programming languages will undoubtedly
continue in the current direction of increased facility and power of
expression, but the claim here is that languages have tended to over-
look the need for improving their reliability.
FORTRAN illustrates the case in point. The language is so
defined that programmers need not declare variables. The first reference
to an otherwise undeclared variable constitutes a declaration by default.
This relieves the programmer of a small burden in return for increasing
the unreliability of the language. Every misspelled name of a variable
constitutes a default declaration of a new variable so that misspellings
cannot be detected by the compiler. On the other hand, the programmer
finds that he must declare a large fraction of his variables anyway
with DIMENSION, COMMON, and EQUIVALENCE statements so that the small
service of default declaration comes at a great cost of reliability.
There are several means for increasing the reliability of a
programming language. The most important is to incorporate redundancy
into the language so that compilers can perform consistency checks on
statements in the language. Declarations constitute one form of redundancy.
Given declarations, compilers can check such things as the type and
precision of all variables to determine if they satisfy the conditions imposed
by context, the number of dimensions on references to subscripted variables,
the agreement of actual and formal parameters of subprograms both in number
and type, etc. Formal languages have tended to avoid redundancy instead
of embracing it. Yet redundancy is a vital part of natural language, and
is essential for communication between humans. The problem then is to
investigate how programming languages can be made redundant to increase
their reliability without placing undue burden on the programmer.
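The value of declarations as redundancy can be illustrated in Python, which, like FORTRAN, creates a variable on first assignment. The sketch below is a hypothetical checker, not a feature of any compiler: the programmer supplies the set of declared names, and the checker reports every assignment to a name outside that set, which is where a misspelling would otherwise pass silently.

```python
import ast


def undeclared_assignments(source, declared):
    # Names that are assigned to without appearing in the declared set.
    tree = ast.parse(source)
    found = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            if node.id not in declared:
                found.add(node.id)
    return sorted(found)


PROGRAM = """
total = 0
for i in range(10):
    totl = total + i
"""

print(undeclared_assignments(PROGRAM, declared={"total", "i"}))  # ['totl']
```

Without the declared set, the misspelled `totl` is an internally consistent program and `total` simply remains zero.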
One answer along this line of thought is to allow the programmer
to declare contexts for each variable name. The compiler could check each
executable statement to determine if the statement operands can appear in
the same local context. Another method is to specify the processes that
can alter a variable or may read its value and program the compiler to
check references against this specification. These suggestions concern
primarily the detection of errors during compilation. Redundancy in the
language also gives facility for detecting errors at execution time that
cannot be detected during compilation. One such error, for example, is
an out-of-bounds index to an array. The remarks here are intended to
point out the problem and indicate a direction to take for its solution.
A deeper treatment is beyond the scope of this memo.
There are other features of a language that affect its reliability.
Some language constructs are error-prone and can be replaced by equivalent
constructs that do not lend themselves to error. An example of an error-
prone feature in FORTRAN is the specification of Hollerith data using the
"nH" format where the n preceding the "H" is a decimal digit that specifies
the number of literal characters in the Hollerith string. It has proved
to be difficult for programmers to count characters in a string accurately,
particularly if the string terminates in blank characters. Later FORTRAN
compilers have permitted strings to be delimited at both ends by the "S"
symbol and have eliminated the need to count the number of characters in
the string. From this example it is suggested that a study be undertaken
to determine what general features of high-level languages are
unnecessarily error-prone. What alternatives exist to replace these language
elements with equivalent but less error-prone features? An interesting
subject for this study is the role of default options of the language.
It is conjectured that ill-considered default options share a large
responsibility for the unreliability of languages. For example, PL/I
default options lead to the unusual side effect that execution of the
statement
I = 25 + 1/3
yields a value for I of 5.333333.
b. Independent Check Calculations
The reliability attained through the use of high-level languages
is achieved because the language processors can mechanically generate
correct sequences of instructions from terse statements in the language,
and because the processors can detect those blunders and transcription
errors that lead to inconsistencies in the high-level program. Many
faults can escape detection by a compiler because there are elements of
processes that must be specified in part by the programmer, and the
faulty specification is internally consistent with grammatical and
semantic rules of the language. To detect this class of faults, it is
necessary to rely on other approaches.
One of the most useful methods of checking for faults is to
incorporate independent check calculations into a program. The nature
of the check calculations should be to check the functional behavior of
program modules in a manner that is not dependent on the internal
structure of the modules. For example, a matrix inversion can be checked by
matrix multiplication regardless of the particular algorithm that was
used to compute the inverse. Check calculations should be done so that
they do not merely check functional modules against module specifications
because the specifications may have suffered transcription errors. The
modules should be checked in their system context to see if they meet the
performance requirements of the system. For example, consistency checks
on a guidance computer should not merely determine if the guidance computer
solves a set of trajectory equations, but rather if the guidance computer
solves the set of equations that correctly model the behavior of the space
vehicle that carries the computer. Thus programmed consistency checks
for this example would compare measurements of the actual vehicle
trajectory to the calculated trajectory in order to determine the correctness
of the guidance computation.
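The matrix inversion example can be sketched as follows. The check receives only the matrix and the claimed inverse and therefore makes no assumption about the inversion algorithm; the function names and the tolerance are hypothetical.

```python
def matmul(a, b):
    # Plain matrix product of two lists of rows.
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]


def inverse_is_consistent(matrix, claimed_inverse, tol=1e-9):
    # Functional check: matrix * claimed_inverse must be (nearly) the
    # identity, whatever algorithm produced claimed_inverse.
    product = matmul(matrix, claimed_inverse)
    n = len(matrix)
    return all(abs(product[i][j] - (1.0 if i == j else 0.0)) <= tol
               for i in range(n) for j in range(n))


A = [[4.0, 7.0], [2.0, 6.0]]
print(inverse_is_consistent(A, [[0.6, -0.7], [-0.2, 0.4]]))  # True
print(inverse_is_consistent(A, [[0.6, -0.7], [-0.2, 0.5]]))  # False
```

As the text cautions, such a check confirms only that the module inverts the matrix it was given; whether that matrix correctly models the physical system must be checked at the system level.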
Consistency checks can be performed at several levels within a
programming system. At the lowest level, checks can be performed for
many primitive machine operations such as arithmetic and logical
operations. When the operation is invertible, it can be checked exactly.
When not, it is possible to check if the results of the operation are
consistent. Checks at this level will not detect programming faults per
se, but rather detect hardware failures that affect particular primitive
operations.
Consistency checks of higher-level processes can be performed
in much the same way as for lower-level processes. When processes have
inverses, the consistency check can be the inverse process. For example,
root extraction from polynomial equations can be checked by evaluating
the polynomial with the extracted root as an argument. Processes without
inverses can be checked by determining if the output values are self-
consistent and consistent with the input values and initial state of the
process. As an example of a checkable process of this type, consider
the programmed model of a physical process undergoing a smoothly fluctuat-
ing change of state while under the influence of smoothly fluctuating
inputs. If a sharp discontinuity is discovered in the output of the
process, then it is likely that there is a fault in the program, or else
the discontinuity is characteristic of the process, and its detection
could have been predicted beforehand. Thus, monitoring the output of
the process for discontinuities is a satisfactory check.
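Both kinds of higher-level check can be sketched briefly: an inverse-process check for root extraction and a self-consistency check for a smoothly varying output. The function names, the relative tolerance, and the step threshold are assumptions chosen for illustration.

```python
def horner(coefficients, x):
    # Evaluate a polynomial whose coefficients are given highest degree first.
    value = 0.0
    for c in coefficients:
        value = value * x + c
    return value


def root_is_consistent(coefficients, root, tol=1e-9):
    # Inverse-process check: the polynomial must nearly vanish at the root,
    # relative to the size of its individual terms.
    scale = sum(abs(c) * abs(root) ** k
                for k, c in enumerate(reversed(coefficients))) or 1.0
    return abs(horner(coefficients, root)) <= tol * scale


def discontinuities(samples, max_step):
    # Self-consistency check: indices at which successive outputs of a
    # smoothly varying process differ by more than max_step.
    return [i for i in range(1, len(samples))
            if abs(samples[i] - samples[i - 1]) > max_step]


print(root_is_consistent([1.0, -3.0, 2.0], 2.0))          # True
print(discontinuities([0.0, 0.1, 0.2, 5.0, 5.1], 0.5))    # [3]
```

The threshold `max_step` would in practice be derived from the known dynamics of the modeled process, so that a discontinuity characteristic of the process is not reported as a fault.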
To ensure ultrareliability of a program, all processes should
be checked at some level, possibly at several different levels by several
different checks. Among the factors that affect the placement and number
of consistency checks are:
(1) The cost of performing the check calculation in
terms of programming effort, execution time
relative to the execution time of the process
that it checks, and total system memory
requirements.
(2) The cost of experiencing an undetected failure
in the module that is checked.
(3) The probability that a failure can escape
detection if a check calculation is performed.
(4) The resolution of the check calculation in terms
of its capability of determining the location of
a failure.
(5) The probability that the check calculation contains
a fault.
Of course, during the initial phases of program checkout, a great many
checks should be performed in order to obtain high resolution of the
location of faults causing detected failures. This entire discussion
is directed to the final phases of checkout and full operational status
when it is desirable to execute without the overhead of elaborate check
calculations, yet maintain a certain level of reliability.
The problem of determining how many check calculations to per-
form and where to place them is the software equivalent of the classic
problem of designing redundant hardware. Software redundancy, like
hardware redundancy, has a definite place in an ultrareliable computer
system. It should be designed into the system from its inception.
The form and degree of redundancy that should be put
into software is a function of factors particular to the situation. As
this subject is studied further, perhaps there will emerge some general
methods for building redundant software as has happened for hardware.
c. Software Maintenance and Modification
The inherent flexibility of software with respect to the fixed
structure of hardware usually results in system modifications being
thrust upon programming systems rather than upon the hardware that
supports the programming system. The cost of these modifications is quite
high in terms of reliability because changes can cause new errors to
appear in systems that are otherwise error free, and the new errors may
escape detection during system checkout. There are two problems to be
faced here: how the programs are to be written to minimize the possibility
of introducing new errors when they are modified, and how newly modified
systems are to be tested to determine if new errors have appeared. In
this section we consider several techniques for attacking these problems.
An important factor in the solution to the problem of software
modification is in the design of the software system. Software should be
designed to accommodate changes easily, as if changes are unavoidable.
In most cases, changes are truly unavoidable because it is difficult to
anticipate all of the required characteristics of a software system
until it has been constructed.
A good method for accommodating changes easily is to design
modular software systems. Subprograms should be organized functionally
so that they operate independently, communicating only through a minimum
set of parameters. System constants should be treated as parameters so
that the modification of a parameter in a single point in the program
causes all references to that parameter to be altered. A process that
is common to several subprograms should be factored into a module that
is called by the several subprograms. For languages like ALGOL and
FORTRAN, factoring is essentially equivalent to the construction of a
subroutine, block or overlay.
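
The two practices just described, treating a system constant as a
parameter defined at a single point and factoring a common process into
one module, can be illustrated with a minimal sketch. It is written in
Python purely for illustration; the names MAX_CHANNELS, checksum,
pack_frame and verify_frame are hypothetical and do not come from this
report.

```python
# Hypothetical system constant, defined at exactly one point in the program.
# Changing it here alters every reference to it.
MAX_CHANNELS = 8


def checksum(words):
    """Process shared by several subprograms, factored into one module."""
    return sum(words) % 256


def pack_frame(samples):
    """Build a frame: the samples followed by their checksum."""
    if len(samples) > MAX_CHANNELS:
        raise ValueError("too many samples for one frame")
    return list(samples) + [checksum(samples)]


def verify_frame(frame):
    """Check a frame using the same factored checksum routine."""
    *samples, check = frame
    return len(samples) <= MAX_CHANNELS and checksum(samples) == check
```

Because both subprograms call the same checksum routine and refer to the
same constant, a modification to either is made in one place only.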
Factoring of processes may not require that the processes be
physically partitioned, as is the case when processes are factored into
subroutines. The modularity required for ease of program maintenance
is semantic modularity, and this is a by-product of physical modularity.
Physical partitioning produces a certain amount of program inefficiency
because of the overhead for establishing linkages during program execu-
tion. Semantic modularity can be attained without physical partitioning
and its attendant inefficiency, by using the programming device known as
macro-expansion. Macros have been available for many years in low-level
assembly languages and are only recently being used in high-level languages.
To date, the implementation of macros in high-level languages has been
somewhat limited, but it appears that macros are a natural adjunct to
common algorithmic languages. Undoubtedly, further effort will be given
to increasing the power and utility of macros in high-level languages,
so that it will become increasingly easier to modularize programs with-
out loss of efficiency.
It is possible to carry the partitioning process too far so
that the net result is a loss of reliability rather than an increase.
This occurs when similar but nonidentical processes are grouped in a
common module. Usually this is done by including several tests to
differentiate among the various processes at appropriate points in the
module. The problem with this type of partitioning is that a modifi-
cation to one process can affect all other processes that share the same
module.
The problem then remains one of determining how and when to
modularize. Should the programmer use a macro or subroutine? Should
two processes be placed in separate modules if they differ in only one
respect? If not, then how different should they be in order to be
separated? To answer these questions, the programmer must apply his
experience and judgement and consider the factors of the particular
situation. The fact that only subjective characteristics are considered
here is indicative that program partitioning is still very much an art.
Its importance has been recognized to the extent that aids for parti-
tioning are common in high-level languages. These include block struc-
tures, multiple job-step programs, program segments, overlays, and
asynchronous tasking. There is still much to be done to aid modulariza-
tion so that standard software packages can be written, completely
checked, and debugged, then used in many systems. This requires that
standards be established to guarantee standard program interfaces, file
structures and languages.
A second important factor affecting the maintenance of software
is the documentation of the software system. A problem in developing
adequate documentation currently exists, because it has not been firmly
established what constitutes adequate documentation. Several methods
have been commonly used, none of which has been entirely satisfactory.
In our discussion here we distinguish between [illegible] documentation
[illegible] source [illegible] that is mechanically [illegible] program
[illegible] documentation [illegible] programs [illegible]. It is
generally conceded that [illegible] as we [illegible] it today
[illegible] inadequate [illegible]. The reason [illegible] is that
[illegible] software systems [illegible] computer can process are quite
different [illegible] from those for describing the same [illegible] can
gain an understanding of it. The role of program documentation is to
describe both the intent and the methodology of a program and to describe
the structure and meaning of the program data. The primary document
describes the methodology, but not necessarily the intent, of the program,
and gives the structure of the data but not necessarily its meaning.
Thus the primary document may not be sufficient documentation, even
though it is a complete description of the executable program.
Through the use of symbolic programming languages, it is
possible to enhance the descriptive qualities of the source document to
make clear the intent and meaning of a program. The programmer can use
descriptive names and phrases for program elements and data structures
so that the source document takes on the qualities of a narrative descrip-
tion. Consider, for example, the mnemonic content of the name INPUT as
the name of a subroutine and how much more descriptive that name is
compared to a name like ASIA, which might be equally acceptable as a

increase the readability of his source document. Partitioning can also
be used to keep the flow of control in a program in a narrow context.
Thus, when properly used, high-level languages come close to flowcharts
with respect to their capability for displaying the flow of control.
Machine produced cross-references are commonly used to document
the interaction of data and program modules. Cross references often
contain information as to the context of each reference to a variable
name. They might indicate whether the variable is read or written, or
passed as an argument to a subprogram, or if the reference is in a
comment or declaration. This information is extremely useful when
modifications are made because the effect of changing a single variable
can be traced to all pertinent contexts in the program. Information of
this type is not present in flowcharts or in the source program itself.
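
As a rough modern illustration of a machine-produced cross-reference,
the sketch below lists every use of each name in a small program together
with its context (written, read, or passed as an argument). It is only a
sketch: it is written in Python, relies on the standard `ast` module, and
the function name cross_reference and the sample program are assumptions
made for this example, not part of the report.

```python
import ast
from collections import defaultdict


def cross_reference(source):
    """Return {name: [(line, context), ...]} for every name in `source`."""
    tree = ast.parse(source)
    # Name nodes that appear directly as positional arguments of a call.
    argument_nodes = {
        id(arg)
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        for arg in node.args
    }
    table = defaultdict(list)
    for node in ast.walk(tree):
        if not isinstance(node, ast.Name):
            continue
        if isinstance(node.ctx, (ast.Store, ast.Del)):
            context = "written"
        elif id(node) in argument_nodes:
            context = "argument"
        else:
            context = "read"
        table[node.id].append((node.lineno, context))
    return {name: sorted(refs) for name, refs in table.items()}


# Hypothetical program to be cross-referenced.
SAMPLE = "total = 0\nfor x in data:\n    total = total + x\nreport(total)\n"

for name, refs in sorted(cross_reference(SAMPLE).items()):
    print(name, refs)
```

With such a listing, every context in which a variable such as `total`
is touched can be inspected before the variable is changed.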
An important type of documentation known as the decision table
has grown popular as a replacement for the flowchart. Decision tables
are two-dimensional tables in which a list of processes is arranged
along the row axis and a list of conditions arrayed along the column
axis. An entry at a row-column intersection of the table indicates that
the row process is to be executed when the column conditions are satisfied.
Usually the columns represent a set of mutually exclusive conditions, and
multiple entries in a single column are assumed to be executed sequentially
from top to bottom of the table. The value of decision tables is that
they show the flow of control of high-order processes, and, hence, lend
insight into the intent of the program. Mechanically generated flowcharts
tend to show more detail than necessary, and thereby tend to hide the
intent of the program.
A novel aspect of decision tables is that it is possible to use
them as a primary source rather than as a secondary source of documentation.
That is, the tables constitute a high-level programming language that can
be translated into executable form. Thus, decision tables are a form of
self-documenting program.
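
A decision table used as a primary source can be sketched as a small
interpreter. The example below is a minimal illustration in Python, not a
description of any particular decision-table language: columns are
mutually exclusive conditions, rows are processes, and a true entry means
that the row process is executed when the column condition holds. All
names and the sample table are hypothetical.

```python
def run_decision_table(conditions, actions, entries, state):
    """Execute the row processes marked in the first satisfied column.

    conditions: list of (name, predicate) pairs, one per column
    actions:    list of (name, procedure) pairs, one per row
    entries:    entries[row][col] is True when the row process applies
    """
    for col, (cond_name, predicate) in enumerate(conditions):
        if predicate(state):
            executed = []
            # Multiple entries in a column run from top to bottom.
            for row, (act_name, action) in enumerate(actions):
                if entries[row][col]:
                    action(state)
                    executed.append(act_name)
            return cond_name, executed
    raise ValueError("no column condition is satisfied")


# Hypothetical table: what to do with a sensor reading.
CONDITIONS = [
    ("negative", lambda s: s["reading"] < 0),
    ("in_range", lambda s: 0 <= s["reading"] <= 10),
    ("too_large", lambda s: s["reading"] > 10),
]
ACTIONS = [
    ("record_fault", lambda s: s["log"].append("fault")),
    ("store_reading", lambda s: s["log"].append("stored")),
    ("raise_alarm", lambda s: s["log"].append("alarm")),
]
ENTRIES = [
    # negative  in_range  too_large
    [True, False, True],    # record_fault
    [False, True, False],   # store_reading
    [False, False, True],   # raise_alarm
]
```

The table itself is the documentation: reading down a column shows
exactly which processes run under that condition.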
a set of data and operate correctly during one iteration and fail during
the second iteration on the very same set of data. Assuming that an
exhaustive test of the program is not feasible, a good approach is to
design a thorough test that takes into consideration the state of the
program. To do this, the input data is partitioned into several possible
classes and similarly the possible program states are divided into sev-
eral different classes. Given both the classes of input and the program
state, one then designs test data that will test the behavior
of the program for every pair of input data-program state classes. The
thoroughness of the check depends on the fineness of the state and input
class partition.
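
The pairing of input classes with program-state classes can be sketched
as follows. This is a minimal Python illustration under stated
assumptions: the program under test (a bounded buffer called push), the
three input classes and the three state classes are all hypothetical, and
one representative value stands in for each class.

```python
import itertools

CAPACITY = 4  # hypothetical capacity of the buffer under test

# One representative per class; a finer partition gives a more thorough test.
INPUT_CLASSES = {"negative": -5, "zero": 0, "positive": 7}
STATE_CLASSES = {"empty": [], "partly_full": [1, 2], "full": [1, 2, 3, 4]}


def push(buffer, value):
    """Hypothetical program under test: append unless the buffer is full."""
    if len(buffer) >= CAPACITY:
        return False
    buffer.append(value)
    return True


def generate_cases():
    """Yield one test case for every (input class, state class) pair."""
    pairs = itertools.product(INPUT_CLASSES.items(), STATE_CLASSES.items())
    for (input_name, value), (state_name, state) in pairs:
        # Copy the state so that one case cannot disturb the next.
        yield input_name, state_name, value, list(state)


def run_all():
    """Run the program on every pair and record whether the push succeeded."""
    results = {}
    for input_name, state_name, value, state in generate_cases():
        results[(input_name, state_name)] = push(state, value)
    return results
```

Three input classes and three state classes give nine cases; refining
either partition increases the number of cases multiplicatively.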
When modifications are made to programming systems, the common
practice of checking only the changes that were made to the system often
leads to undetected errors in the program. The problem here is that if an
error is introduced into the system, it can occur because of subtle inter-
actions that are connected only remotely to the sections of the program
that were changed. Furthermore, if the interactions had been foreseen
while the changes were being made, they might have been avoided. When
the system is checked only for those cases which are known to be affected,
the subtle problems may not appear. It is important to recognize, there-
fore, that a change to a program produces a completely new program that
must be treated as if it were completely untested.
2. Summary
In this section we have considered a large number of problems related
to writing and checking out programs. The single most promising approach
for achieving ultrareliable programs appears to be in the use of high-
level languages. Through high-level languages, it may be possible
to mechanically generate the check calculations, test data, and program
documentation that have been described elsewhere in this chapter. In
effect, the computer is the ideal tool for aiding the programmer in
developing a reliable program.
The most serious type of failure is the write operation, because
it is inherently destructive. However, for very little extra cost, it
is possible to protect both read and write operations and gain additional
reliability.
When critical information is constant, it can be placed in a read-
only memory to assure its nonvolatility. It might be "wired-in," or
stored in a memory that has both a nondestructive and destructive read-
out capability. Once the data is stored, the assumption is that the
memory is switched to nondestructive mode. Privileged subprograms
might be given the capability to alter the contents of the protected
memory, but the memory is always placed in the nondestructive mode after
modification. Note here that it is possible to use conventional
memories together with special hardware in the access circuitry to imple-
ment protection against unauthorized attempts to modify critical data.
In a sense, the conventional memories can be made to simulate memories
with both destructive and nondestructive read-out, and a protection
capability. This type of system has been implemented in several
commercial computers.
The protected memory is most effective in protecting against faults
in nonprivileged programs. Privileged programs can be faulty also, so
that there must be some protection against faults that successfully
penetrate the memory protection. The most important aspect of assuring
reliability in this case is to be certain that critical data can be
restored if it is lost. One method for doing this is to dump the
contents of critical memory periodically, and restore the memory from
the last checkpoint when a fault is detected. Another method is to
retain the last few values for each item in critical memory so that
values can be restored selectively. Care must be used in the imple-
mentation of the latter technique because data cannot, in general, be
altered out of context. It is necessary to change sets of data so that
all values have meaning as a collection in the context of the programming
system.
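
The first method, periodic dumping with restoration from the last
checkpoint, can be sketched in a few lines. The Python class below is a
hypothetical illustration only; the name CriticalMemory and its methods
are assumptions of this sketch. It restores the whole collection at once,
so that the restored values remain meaningful as a set.

```python
import copy


class CriticalMemory:
    """Critical data with a periodic dump (checkpoint) and whole-set restore."""

    def __init__(self, initial):
        self._data = dict(initial)
        self._checkpoint = copy.deepcopy(self._data)

    def read(self, key):
        return self._data[key]

    def write(self, key, value):
        self._data[key] = value

    def checkpoint(self):
        """Dump the current contents; called periodically."""
        self._checkpoint = copy.deepcopy(self._data)

    def restore(self):
        """Restore every item from the last checkpoint after a detected fault.

        The items are restored together, never one at a time, so that the
        collection stays consistent.
        """
        self._data = copy.deepcopy(self._checkpoint)
```

A usage sequence would be: write valid values, call checkpoint(), and,
when a later fault is detected, call restore() to discard everything
written since the dump.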
The technique of using a real or simulated read-only memory is
valuable for protection of noncritical data, and for protection of the
programs that manipulate these data. Suppose that all of memory can be
partitioned into blocks that are given either read-write or read-only
status. We shall assume that the status can be altered dynamically by
some programming mechanism. Then invalid addresses that are generated
may be detectable when they lead to attempts to write destructively in
a read-only block. Programs can always be placed in read-only status
when they are written as reentrant programs, i.e., they do not modify
themselves. Data structures might be capable of being in either status.
A valuable feature in a reliable programming system is the ability to
set and alter the status of data structures, as required. The techniques
can be extended further by considering other states that might charac-
terize blocks of memory. For example, execute-only is a state that
could protect machine instructions from being read erroneously as data,
or could also prevent data from being erroneously treated as instructions.
The ability to define protected states places some demands on both
the hardware and the software. The state information must be held at
some point in memory from where it can be retrieved quickly during each
memory cycle. It might be held in a register or in a special memory
plane. It might be stored in a tag field that is associated with every
word in memory so that it is available for checking during any read
operation or write following a destructive read operation. In any case,
it is clear that the status of a memory block must be matched against
the type of memory request to determine if an illegal operation has
occurred.
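
The matching of block status against request type can be simulated in
software. The following Python sketch is an illustration under stated
assumptions: the class BlockMemory, the three status values, and in
particular the table of which request types each status permits are
choices made for this example, not a description of any actual machine.

```python
from enum import Enum


class Status(Enum):
    READ_WRITE = "read-write"
    READ_ONLY = "read-only"
    EXECUTE_ONLY = "execute-only"


class Request(Enum):
    READ = "read"
    WRITE = "write"
    EXECUTE = "execute"


# Assumed policy: which request types each block status permits.
PERMITTED = {
    Status.READ_WRITE: {Request.READ, Request.WRITE},
    Status.READ_ONLY: {Request.READ, Request.EXECUTE},
    Status.EXECUTE_ONLY: {Request.EXECUTE},
}


class ProtectionFault(Exception):
    """Signals a memory request that does not match the block status."""


class BlockMemory:
    def __init__(self, n_blocks, block_size):
        self.block_size = block_size
        self.words = [0] * (n_blocks * block_size)
        self.status = [Status.READ_WRITE] * n_blocks

    def set_status(self, block, status):
        """Alter the status of one block dynamically."""
        self.status[block] = status

    def _check(self, address, request):
        if not 0 <= address < len(self.words):
            raise ProtectionFault(f"address {address} is outside memory")
        status = self.status[address // self.block_size]
        if request not in PERMITTED[status]:
            raise ProtectionFault(
                f"{request.value} request at {address} in {status.value} block"
            )

    def read(self, address):
        self._check(address, Request.READ)
        return self.words[address]

    def write(self, address, value):
        self._check(address, Request.WRITE)
        self.words[address] = value
```

Once a block is switched to read-only status, an erroneous write into it
raises ProtectionFault instead of silently destroying the data.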
Several hardware features would be useful in supporting the memory
protection techniques mentioned above. Reentrant programs benefit from
instruction sets that include special instructions for facilitating
reentrant coding. For example, the common practice of depositing sub-
routine addresses at fixed linkage points within the body of a program
violates the rule that reentrant programs must be read-only. To aid
reentry, the return link should be stored in a register or in a data
area that is accessed indirectly through a linkage address. To aid the
protection of the critical data area, ultrareliable hardware designs
should be utilized in the block of memory that holds critical data.
This lowers the probability that critical data will be lost because of
hardware failures. Memory access circuitry should be designed redun-
dantly to minimize the probability that valid memory requests will be
honored at invalid addresses. Paging hardware and relocation registers
are useful items for storing status information of memory blocks. Status
checking is a natural extension of the processing that occurs with paging
and relocation hardware during the computation of effective addresses.
The main point of this section is to bring to light techniques for
protecting against invalid memory requests. In the next section, we
consider techniques for broadening the scope of the protection.
2. If and Only If Programming
The previous material on the protection of memory operations is an
important example of the notion of if and only if programming. Memory
protection is possible because the program and data essentially define
the region of memory that is addressable at any given time. But they
also define the region of memory that is not addressable. Furthermore,
different portions of a program may have different addressable and
unaddressable regions. The protective measures generally compute for
each effective address whether it is in a region that is accessible or
inaccessible by the process requesting a memory operation.
Thus a request for a memory access is honored:
(1) if a process issues the request, and
(2) only if the request is valid.
Carrying this notion further, we postulate that programs should
be analyzed to determine not only what combinations of input data and
internal states are valid, but to determine those combinations that
are invalid. When an invalid combination occurs, a failure has
occurred, which should trigger a detection mechanism.
As another illustration of the notion of if and only if programming,
suppose that a program is tested thoroughly, and it is determined that
the program is completely correct. It may be the case, however, that
some variable ranged only through the values between 0 and 10 during
the checkout phase, and it is known beforehand that no other values
can be encountered during normal operation. What happens to the program
if the variable in question takes on a negative value? By assumption,
this cannot occur in normal circumstances, but suppose the negative
value were introduced by an undetected failure. The behavior of the
program is not defined, and its true behavior in this circumstance can
be anything from harmless to disastrous.
The use of if and only if programming in this example would begin
with the determination that the true range of validation of the input
variable is the range 0 to 10, and would follow with the insertion of
a test of the variable to determine if it lies in the range of validation.
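
The inserted test can be sketched as below. This is a minimal Python
illustration that assumes the variable is an integer; the names
ValidationFailure, validated and gain_for, and the computation itself,
are hypothetical.

```python
class ValidationFailure(Exception):
    """Raised when a variable leaves its range of validation."""


# The integers 0 through 10: the only values exercised during checkout.
LEVEL_RANGE = range(0, 11)


def validated(value, valid_range, name):
    """Return `value` if it lies in its range of validation, else signal."""
    if value not in valid_range:
        raise ValidationFailure(f"{name}={value!r} is outside {valid_range!r}")
    return value


def gain_for(level):
    """Hypothetical computation that was only ever tested for levels 0..10."""
    level = validated(level, LEVEL_RANGE, "level")
    return 10 * level
```

With the test in place, a negative level produced by an undetected
failure triggers a detection mechanism instead of undefined behavior.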
One class of items that are easily checked is the class of variables
that are used as indices of elements of a data structure. When a variable
is an index into a particular data structure, then the valid range of
values for that variable is completely determined by the data structure.
The variable should be checked before every instance in which it is
used to address an element of the data structure. This type of check
in one sense is comparable to the memory checks described in the previous
section, but the following difference is observed. The memory checks
described previously are made after a memory request is issued, and are
made on the basis of the status of the memory at the effective address and
the type of memory request. The check of the index variable is made
before a request is issued, and is made on the basis of a variable that
is used to calculate an effective address, rather than on the nature of
the data found at the effective address.
Indices to data structures can be checked by range checks to deter-
mine if the index is within the actual bounds of the structure. In
commercial systems, bound checks of this type are common, and have been
implemented by both hardware and software techniques. The techniques
generally postulate that the bounds of a structure are associated with
[illegible] is not sufficient to protect physically large data structures when
there is a high probability that any incorrect address lies within bounds.
One class of data that satisfies this criterion is the linked-
list data structure. Most of the data within linked lists are addresses
(or indices) of other elements in the list structure. Within the list
structure, the linkages may identify several different classes of data,
but the linkages usually have one form that is common to the entire
structure. Faults that generate incorrect linkages might be undetectable
by bounds calculations. A better check is to store with each pointer
the class of the item to which it must point. With each item, a field is
used to indicate its class. Whenever a pointer is used to access an
item, the class of the fetched item is compared to the class that was
expected, and an error is signalled if they do not agree. A problem
arises when a pointer might point to one of several classes. The check
operation then becomes rather lengthy and contributes system inefficiency.
A variation of this technique eliminates the problem of overhead. Instead
of associating classes with data, each item is associated with a randomly
selected tag. When pointers to the item are created, a copy of the tag
is placed in each pointer. Whenever an indirect access is made through
a pointer, the tag of the pointer must agree with the tag found with the
item addressed.
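
The random-tag variation can be sketched as follows. The Python classes
below are a hypothetical illustration: the names TaggedStore, Item and
Pointer, and the 16-bit tag width, are assumptions of this sketch. A
bounds check alone would accept the corrupted pointer in the usage
example, because its address lies within the structure; the tag
comparison rejects it.

```python
import secrets
from dataclasses import dataclass


class LinkageFault(Exception):
    """Signals a pointer whose tag disagrees with the item it addresses."""


@dataclass
class Item:
    tag: int        # randomly selected when the item is created
    value: object


@dataclass(frozen=True)
class Pointer:
    address: int
    tag: int        # copy of the tag of the item pointed to


class TaggedStore:
    def __init__(self):
        self._cells = []

    def allocate(self, value):
        """Create an item with a random tag and return a pointer to it."""
        tag = secrets.randbits(16)  # assumed tag width
        self._cells.append(Item(tag, value))
        return Pointer(len(self._cells) - 1, tag)

    def fetch(self, pointer):
        """Indirect access: bounds check first, then tag agreement."""
        if not 0 <= pointer.address < len(self._cells):
            raise LinkageFault("address out of bounds")
        item = self._cells[pointer.address]
        if item.tag != pointer.tag:
            raise LinkageFault("pointer tag does not match item tag")
        return item.value
```

For example, after `a = store.allocate("alpha")` and
`b = store.allocate("beta")`, the pointer `Pointer(b.address, b.tag ^ 1)`
is within bounds but is rejected by fetch because its tag is wrong. Two
items may by chance receive the same tag, so the check is probabilistic
rather than certain.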
Indices and address linkages are just one class of variables that
hardware in case an error is detected. In this way, a recovery procedure
will be available for every possible detected error.
A problem exists when a detected error occurs for programs at the
highest level. Where should control be transferred in this case?
Clearly, if the wrong transfer is taken, the error may be repeated, and
thereby begin an infinite loop. There is no [illegible] answer for this
problem; it may be necessary to attempt recovery in [illegible] fashion
and simultaneously notify a human operator or an external computer of
the nature of the problem. A technique that will tend to reduce the
magnitude of the problem is to make highest level programs as small as
possible to reduce the probability of [illegible] and detecting errors
at that level. The common practice of using "two-level hierarchies"
places a large burden on the executive program, and thereby increases
the probability of faults occurring while operating in executive state.
The use of a third level for the highest level [illegible] would be far
more satisfactory, because most executive processes could be second
level processes and receive a greater degree of protection than can be
obtained from two-level systems.
Assuming that recovery points for every type of detectable error can
be defined properly, except possibly for those errors that are detected
in the top level programs, the problem remains one of diagnosing the
reason for the detected error. It is essential that the context of the
processor at the time of the error be saved and made available for
diagnostic purposes. The context is the contents of all visible registers,
status bits, the instruction counter, the effective address of the last
memory access, and other pertinent data. If the data is to be returned
to a human, then it should be analyzed and translated into terms that will
ease the problem of interpreting the data. Symbolic names can be supplied
from symbol tables that might be stored in auxiliary memory for diagnostic
purposes. Linkage traces can be used to determine the return linkages of
subroutine calls. Arguments of subroutines and the values of important
variables can be posted to give further diagnostic aid. Since program
jumps usually prevent program steps from being traced backwards, it is
recommended that jumps leave a return address in a special register or
Several interesting problems have emerged from the preceding dis-
cussion. These are collected here in the hopes of stimulating further
interest in the area. The problems are:
1. What methods exist for automatically changing the precision
of a calculation?
2. What features of languages are error-prone? How might
they be eliminated in favor of equally powerful but
less error-prone features?
3. How can redundancy be utilized in programming languages
to protect programmers from errors?
4. Is it possible to develop an abstract model of a program
and use the model to determine the placement of check
calculations? Can the model be used to guide the
generation and placement of check calculations for
practical programs?
5. What is adequate program documentation? How can
languages alleviate the problem of documentation by
becoming more self-documentary?
6. Can updating be computer assisted to eliminate the
propensity for introducing new errors by updating?
7. How can programs be designed to be modular? What
standards are necessary to aid modularity?
8. How can tests be generated or built into the hardware
to give extensive protection for a small decrease of
efficiency?
9. How can recovery be made from errors detected in the
hard-core software?
10. What diagnostics should be
returned after any detected failure and how should they
be returned? Can special routines perform diagnosis?
VI CONCLUSIONS AND SUMMARY OF OTHER STUDIES IN PROGRESS
In this chapter we briefly present our conclusions on techniques for
the realization of ultrareliable spaceborne computers, together with
recommendations for the future direction of research in this area. The
conclusions are based upon both the research conducted during the second
phase and prior related work. We also summarize other studies that have
either not yet progressed to a state where reporting is appropriate or
do not bear directly on the technical contents of this report.
A. Conclusions
(1) The computer reliability requirements of an advanced space-
borne mission cannot be satisfied without the use of redun-
dant logic structures.
(2) Practically any reliability constraint can be satisfied by
the exclusive use of passive masking techniques. But,
with the exception of mission tasks of relatively minimal
complexity, the cost of such a computer would be excessive.
Significantly improved utilization of resources is
theoretically achieved with a reconfiguration technique
in which the logical interconnections can be altered so that
faulty units are disconnected from the system and, moreover,
with a graceful degradation technique in which the schedul-
ing of tasks can be altered to match the available perfor-
mance capability.
(3) For advanced spaceborne missions there exists, in addition
to the severe reliability constraint, the requirement to
accommodate to simultaneous introduction of several problems
and to varying measures among the mission problems of
priority, accuracy, and urgency. The multiprocessor frame-
work appears to be the best match to these requirements.
(4) The reliability of the multiprocessor is significantly
enhanced if a limited degree of fault-masking and recon-
figuration (or repair) capability is incorporated within
the memory, control, and processor modules.
(5) The major problems associated with this repairable multi-
processor are the design of memory, control and processor
modules, which are either amenable to repair or fault mask-
ing; the design of commutation networks for the data
switching; the specification of diagnostic techniques for
the detection and location of faults; and the overall
reliability analysis of the system.
(6) Processor modules, which include microprogram control and
which are amenable to repair, can be designed by organiz-
ing the logic associated with each byte of computation
into a slice or module of moderate complexity (i.e.,
approximately 1000 gates/module). The repair operation is
then the electrical shorting of data around a faulty slice.
(7) Commutation networks can be designed which are suitable for
the routing of data between memory and processor modules
and also for the above defined repair operation. These net-
works can be easily set up and diagnosed, and can be made
insensitive to failures within the commutation network with
a moderate increase in complexity.
(8) It appears feasible to synthesize programs that can detect,
utilizing combinations of software and hardware redundancy,
the occurrence of many hardware faults. These techniques
are also appropriate to the specification of formal rules
for the synthesis of programs that do not contain human
mistakes.
B. Summary of Other Work in Progress
(1) Work is continuing on the description of logical design
techniques for the various module types of the multi-
processor. During the next period, a substantial effort
will be devoted toward the design of the irregularly
structured control module, and in particular, to the
investigation of the optimum balancing of the various
reliability-enhancement schemes.
(2) Work will be initiated on diagnostic techniques for
locating the block within a module that is suspected
of being faulty. We expect to consider the diagnosis
of byte-sliced processor modules, commutation networks,
associative memories of the type used for memory re-
addressing and the table of available equipment, and
general control logic. We have looked at the possibility
of including auxiliary outputs to facilitate diagnosis,
and formulated the problem of specifying the optimum set
of such outputs as a covering problem of the type similar
to that described in Sec. III-B of Ref. 1.
(3) Work has begun on the analysis of the multiprocessor
system, and we are seeking models which are amenable to
analysis.
(4) The survey of the literature pertinent to the problem
of improving reliability by the use of redundant struc-
tures has been completed, and a paper summarizing this
work will be submitted for publication to one of the
IEEE Transactions.
Appendix
USE OF CODES FOR CORRECTION AND DETECTION
It is of interest to determine the probability of an undetected
error due to a coherent noise signal that affects all channels.
The number of detectable errors for a double-error correcting code
may be computed as follows:
Let
n = the number of bits in the code word
k = the number of non-redundant bits;
then the combined number of valid and corrected patterns is

$$2^{k}\left[1 + n + \binom{n}{2}\right]$$

out of $2^{n}$ total patterns. The fraction of error patterns that are only
detected is then

$$1 - 2^{k-n}\left[1 + n + \binom{n}{2}\right]$$

or approximately,

$$1 - n^{2}\,2^{k-n-1}.$$
	 16, k = 8 cede,
to	 and incre r,ses rapidly for lamer word ?erg-ths, If P bytes are used, and
if all channel errors are in iopendent, the probability of nn undetected
error is f B ; for example, for tare 16,8 code, with B = 3 (i.e., 24 non-
redundant bits per word), f B is 1/8. In the caso in which noise tran-
sient lasts for several memory cycles (very likely for noise clue io
113
MA
electrical arcing), this probability decreases by a factor of f  for
each cycle.
We conclude that the normal error detection capability of error
correcting codes provides reliable warning of the existence of massive
errors.
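
The counting argument of this appendix can be checked numerically. The
short Python sketch below is an illustration only; the function names are
hypothetical. It assumes, as the text does, that each of the 2^k valid
words accounts for itself, its n single-error patterns and its C(n, 2)
double-error patterns, and it uses exact rational arithmetic.

```python
from fractions import Fraction
from math import comb


def covered_patterns(n, k):
    """Number of valid and corrected patterns of a double-error correcting code."""
    return 2**k * (1 + n + comb(n, 2))


def detected_only_fraction(n, k):
    """Exact fraction of the 2**n patterns that are detected but not corrected."""
    return 1 - Fraction(covered_patterns(n, k), 2**n)


def detected_only_approx(n, k):
    """Approximation 1 - n^2 * 2^(k-n-1), which keeps only the n^2/2 term."""
    return 1 - n * n * Fraction(2) ** (k - n - 1)


for n, k in [(16, 8), (24, 12), (32, 16)]:
    exact = detected_only_fraction(n, k)
    approx = detected_only_approx(n, k)
    print(n, k, float(exact), float(approx))
```

For the n = 16, k = 8 code the exact fraction is 119/256 and the
approximation is exactly 1/2, in agreement with the figure quoted above.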
REFERENCES
1. J. Goldberg, K. N. Levitt, and It. A. Short, "Techniques for the
Realization f Ultra-Reliable Spaceborne Computers," Final Report-
Phase I, Contract NAS 12-33, SRI Proj ct 5580, Stanford Research
Institute, Menlo Parl,, California (September 1966).
2. AES-EPO Staff, "AES-EPO Study Program" Fii-il Study Report, VOlUmes 1
and 2, IBM Electronics System Center, Owego, New York (December 1965).
3. A. Avizienis, "A Set of Algorithms for a Diagnosable Arithmetic Unit,"
Tech. Report No. 32-546, Jet Propulsion Laboratory, Pasadena,
California (1964).
4. A. Avizienis, "A Design of Fault-Tolerant Computers," Proc. Fall
Joint Computer Conference (AFIPS) (1967).
5. W. G. Bouricius, et al., "Investigations in the Design of an
Automatically Repaired Computer," Digest of the First Annual IEEE
Computer Conference, IEEE Publication 16C51 (September 1967).
6. P. W. Agnew, et al., "An Approach to Self-Repairing Computer," Digest
of the First Annual IEEE Computer Conference, IEEE Publication 16C51
(September 1967).
7. R. P. Hassett and E. H. Miller, "Multithreading Design of a Reliable
Aerospace Computer," presented at 1966 Aerospace and Electronic
Systems Convention (3-5 October 1966).
8. L. J. Koczela, "Study of Spaceborne Multiprocessing," 2nd Quarterly
Report, Volume II, Contract NAS 12-108, Autonetics Division of North
American Aviation, Anaheim, California (October 1966).
9. E. C. Joseph, "Self Repair: Fault Detection and Automatic
Reconfigurability," Proceedings of the Spaceborne Multiprocessing Seminar,
NASA Electronics Research Center, Boston, pp. 41-49 (31 October 1966).
10. R. L. Alonso, et al., "A Multiprocessing Structure," Digest of the
First Annual IEEE Computer Conference, IEEE Publication 16C51
(September 1967).
11. J. F. Keeley, et al., "An Application-Oriented Multiprocessing System,"
IBM Systems Journal, Volume 6, No. 2 (Entire Issue) (1967).
12. J. J. Pariser, "Multiprocessing with Floating Executive Control,"
IEEE International Convention Record (1965).
13. S. P. Frankel, "On the Minimum Logical Complexity Required for a
General Purpose Computer," IRE Trans. on Electronic Computers,
Volume EC-7, No. 4, pp. 282-285 (December 1958).
14. H. Weber, "A Microprogrammed Implementation of EULER on IBM System/
360 Model 30," Communications of the ACM, Volume 10, No. 9, pp. 549-
558 (September 1967).
15. A. Grasselli, "The Design of Program-Modifiable Microprogrammed
Control Units," IEEE Trans. on Electronic Computers, pp. 336-339
(June 1962).
16. J. Goldberg, "Logical Design Techniques for Error Control," WESCON
paper 9/3, Session 9 (September 1966).
17. W. H. Kautz, K. N. Levitt, and A. Waksman, "Cellular Interconnection
Networks," accepted for publication in IEEE Transactions on Electronic
Computers.
18. A. Waksman, "A Permutation Network," accepted for publication in the
Journal of the ACM.
19. W. J. Cody, "The Influence of Machine Design on Numerical Algorithms,"
AFIPS, Proceedings of the SJCC, Thompson Books, Washington, D.C.,
pp. 305-310 (1967).
20. Peter M. Neely, "Comparison of Several Algorithms for Computation of
Means, Standard Deviations and Correlation Coefficients,"
Communications of the ACM, Volume 9, No. 7, pp. 497-499 (July 1966).
21. N. Metropolis and R. L. Ashenhurst, "Basic Operations in an
Unnormalized Arithmetic System," IEEE Trans. on Electronic Computers,
Volume EC-12, No. 3, pp. 896-904 (December 1963).
