Fault-tolerant building-block computer study by Rennels, D. A.
Produced by the NASA Center for Aerospace Information (CASI) 
https://ntrs.nasa.gov/search.jsp?R=19780022892 2020-03-22T03:16:14+00:00Z
JPL PUBLICATION 78-67
Fault-Tolerant Building-Block Computer Study
David A. Rennels
N78-30815
July 15, 1978
Prepared for
Naval Ocean Systems Center
by
Jet Propulsion Laboratory
California Institute of Technology
Pasadena, California
CONTENTS
EXECUTIVE SUMMARY
I. INTRODUCTION
II. THE SELF-CHECKING COMPUTER MODULE
III. SYSTEM CONFIGURATIONS
IV. COST AND EFFECTIVENESS OF BUILDING-BLOCK
FAULT-TOLERANT COMPUTERS
V. CONCLUSIONS
APPENDIX 1 - CHARACTERIZATION OF BASELINE (NON-REDUNDANT) AND
SCCM COMPUTER MODULES
ACKNOWLEDGMENT
REFERENCES
Figures
1. Fault-Tolerant SCCM Configurations
2. A Self-Checking Computer Module
3. Self-Checking Computer Module Architecture
4. The Memory Interface Building Block
5. The Bus Interface Building Block
6. The Core Building Block
7. A Distributed SCCM Architecture
8. Reliability Improvement Using SCCMs
Tables
1. SCCM Cost and Reliability Example
2. Computer Module Physical Properties
A-1. Complexity of Component Elements
A-2a. Characteristics of Computer Modules
A-2b. Packaging Density
A-2c. Cost Model
ABSTRACT
The development of ultra-reliable core computers is a starting point
for improving the reliability of complex military systems. Such
computers can provide reliable fault diagnosis and failure circumvention
and, in some cases, serve as an automated repairman for their host
systems.
This report describes a small set of building-block circuits which
can be implemented as single VLSI devices, and which can be used with
off-the-shelf microprocessors and memories to build Self-Checking
Computer Modules (SCCMs). Each SCCM is a microcomputer which is capable
of detecting its own faults during normal operation and is designed to
communicate with other identical modules over one or more Mil Standard
1553A buses. Several SCCMs can be connected into a network with backup
spares to provide fault-tolerant operation, i.e., automated recovery
from faults. Alternative fault-tolerant SCCM configurations are
discussed along with the cost and reliability associated with their
implementation.
[Figure 1 diagram not reproduced. Panels: a) standby redundant,
b) distributed network, c) voted configuration. Legend: (S) - spare
module; HLM - high-level SCCM; SSM - SCCM in subsystems; modules are
linked by an intercommunications bus.]
FAULT-TOLERANT BUILDING-BLOCK COMPUTER STUDY
EXECUTIVE SUMMARY
Reliability is a continuing problem in military
electronic systems. The cost of electronic
system failures shows up as reduced operational
readiness and as requirements for logistics supply
chains and maintenance personnel. It is estimated
that the cost of supporting an electronic system
for its life is often more than its initial
procurement cost.
Today both the knowledge and the technology
exist to build highly reliable computers at only
small penalties of size, weight and cost. The
reliability of these computers is achieved through
a redundant or fault-tolerant architecture. VLSI
technology provides the capability of putting large
amounts of circuitry in small and inexpensive
packages. By using a standby redundant architecture,
in which unpowered elements of the computer
serve as spares, computer system reliability can be
improved in increments by adding spares. The
long-term potential result is systems which are
sufficiently reliable that they do not require
technician or logistic support for the life of their
mission. In the shorter term, systems can be built
which utilize a highly reliable core computer which
can significantly aid in the diagnosis and
maintenance of the entire system [1].
The purpose of this study is to define and
characterize the VLSI building blocks required to
combine existing microprocessors and memory into a
wide variety of self-checking and fault-tolerant
computing systems. Fault tolerance is the ability
to continue correct operation in the presence of
failures. Self-checking circuits are capable of
detecting their own malfunctions. This study has
resulted in the definition of four VLSI circuits
which allow the construction of a single or a
distributed fault-tolerant computer system. The four
building-block circuits are: 1) an error detecting
(and correcting) Memory Interface, 2) a programmable
Bus Interface, 3) a Core building block, and
4) an I/O building block. These circuits interface
with two commercial microprocessors and commercial
memory to form a Self-Checking Computer Module
(SCCM). This computer operates just like any regular
microcomputer, but additionally it signals its
own malfunctions and can disable its outputs upon
detection of an internal fault. It is designed to
communicate with other identical modules over one
or more Mil Standard 1553A buses.
The Self-Checking Computer Module (SCCM),
which is constructed using the VLSI building-block
circuits, is itself a much larger building block
which is combined with other SCCMs to form a variety
of fault-tolerant computing systems. Examples
of fault-tolerant SCCM configurations are shown in
Figure 1. The first is a standby redundant
configuration. A single computer (SCCM) is backed up
by one or more spares. Upon failure of the primary
SCCM, a backup spare automatically takes over the
ongoing computations. The second configuration
represents a network of SCCMs operating as a
distributed system. In this case, redundant spare
modules can be employed to provide automated fault
recovery in critical functions as designated by
the system designer. This type of configuration
is applicable to avionics and shipboard control
systems. In the third configuration, a number of
SCCMs perform the same computations simultaneously.
Their outputs are voted in peripheral devices.
This type of configuration is used for extremely
high reliability applications which are human-life
dependent, such as commercial aircraft control.
Figure 1. Fault-Tolerant SCCM Configurations
In summary, the important attributes of this
building-block approach to fault-tolerant computing
are:
(1) Using the four VLSI building blocks,
Self-Checking Computer Modules can be constructed from
a variety of commercial microprocessors and
memories.
(2) The self-checking property of the SCCM
allows these machines to instantly detect and signal
internal faults, thus allowing straightforward
implementation of automated recovery by backup
spares.
(3) Using the SCCMs as building blocks allows
the system designer to choose from a wide variety
of system architectures. The designer has full
flexibility in the tradeoff between redundancy and
performance in adding or deleting computers in the
system.
The following report describes the individual VLSI
building blocks, the resulting Self-Checking
Computer Module (SCCM), fault-tolerant configurations
of several SCCMs, and finally, an evaluation of
the cost and effectiveness of this approach.
I. INTRODUCTION
Reliability, and consequently maintenance, is
a continuing problem in complex military systems.
The cost of failures shows up in many ways, including
reduced operational readiness and high dollar
costs associated with the large number of personnel
involved in maintenance and logistics. It is
estimated that costs of ownership often exceed
procurement costs in major electronic systems. It is
likely that life-cycle costs can be significantly
reduced by increasing system testability and
maintainability, and by adding self-repairing features
in the early stages of a system design. A moderate
increase of initial hardware costs can yield
improved system reliability and reduce maintenance
costs for a system's operational lifetime.
The computers within a system provide the
starting point for automated maintenance. If
computer reliability is assured, the computers
can be used for 1) subsystem testing and failure
diagnosis, 2) automatically replacing failed
subsystems with spare parts, or 3) where no backup
spares are available, modifying on-board processing
to account for the degraded subsystem state.
Stated in another way, the computer becomes an
automated repairman which may bring within reach
the ultimate goal of maintenance-free systems. To
achieve the level of computer reliability required
for this goal, fault-tolerant design techniques and
extensive use of VLSI must be employed.
The use of VLSI circuitry enhances the basic
component reliability, and combining reliable
components with fault-tolerant design can lead to
failure-free operation. Using fault-tolerance
techniques, spare modules are included in a
computer and are automatically substituted for
faulty modules when a fault occurs. Discarded
modules can be replaced by a repairman at regularly
scheduled maintenance intervals without disruption
in service. The number of spares can be adjusted
to provide fault-free operation over various
required time intervals. Fault-tolerant computer
design is a mature discipline, and the use of VLSI
technology makes such machines relatively
inexpensive.
The purpose of this study is to define and
characterize the VLSI building-block circuits
required to combine existing microprocessors and
memories into a wide variety of fault-tolerant
computing systems. The study has resulted in the
definition of four VLSI building-block circuits
which allow the construction of single or
distributed (multiple) computer systems. These systems
are fault-tolerant, and thus have the ability to
continue correct computation in the presence of
failures and transient malfunctions. Specifically,
the building-block circuits are connected with
existing ("off-the-shelf") microprocessors and
memory devices to form Self-Checking Computer
Modules (SCCMs). Each SCCM contains a computer and
the circuits necessary to communicate with other
SCCM computer modules or with dedicated I/O
circuits. Each SCCM is also self-checking in that
it is capable of detecting internal hardware faults
concurrently with normal operation. It generates
fault-alarm signals so that recovery can be
implemented and other, redundant SCCM computer
modules can take over in case of failure.
The four building-block circuits are designated
the Memory Interface, Programmable Bus
Interface, I/O Block, and Core Block. They provide the
following important properties:
1. They can be used with a variety of existing
microprocessors. This means that standard
computers with existing software can be used
in constructing building-block computer
systems.
2. The building-block concept is oriented toward
current and future standards through the use
of standard interfaces and the ability to
accept different microprocessors and memory.
3. Computers can be arranged in a distributed
configuration. Modules can be added to augment
performance or to provide redundancy for
fault-tolerant operation.
A typical Self-Checking Computer Module is
shown in Figure 2. It provides a computer
building block with a great deal of computing
power. A 16-bit processor with instruction cycles
in the range of 1-2 microseconds would be provided
along with 32 thousand words of memory.
[Figure 2 diagram not reproduced. It shows 23 RAMs, one MI, one
Core, two microprocessors, three BIs, and two I/O blocks. Legend:
MI - Memory Interface building block; BI - Bus Interface building
block; CORE - Core building block; I/O - I/O building block;
MP - microprocessor.]
Figure 2. A Self-Checking Computer Module
Typical packaging would be a 50 square inch
multilayer board containing 23 commercial RAMs and two
commercial microprocessors. The building-block
circuits are one Memory Interface (MI), one Core
circuit (C), three Bus Interfaces (BI), and two
I/O devices (IO).
A description of the individual building
blocks and the resulting self-checking computer
modules is presented in Section 2. This is
followed by a description of their use in
multicomputer networks, presented in Section 3. Upon
definition of the building blocks, a study was
conducted to estimate the cost, physical
characteristics, and reliability of the
building-block computers. This is presented in Section 4
and summarized below.
In order to evaluate the cost and reliability
of this approach, we postulated a hypothetical
Baseline Computer which has performance equivalent
to the SCCM, but does not have any of its
fault-tolerant properties. This baseline machine was
estimated to cost between $8,000 and $10,000 when
constructed with high-rel VLSI parts. The
Self-Checking Computer Module was then compared with the
baseline computer. The relative increase in cost
for its fault-tolerant properties (over the baseline
machine) was tabulated, along with the relative
improvement in reliability. An example is
shown in Table 1. For various design options, the
expected percentage of failures is shown for an
ensemble of machines.
                         Relative   Expected Percent Failures in
                         Cost       Ensemble of Machines in:
                                    6 Months   1 Year   2 Years
Baseline Computer        1          22%        39%      63%
Self-Checking Computer
Module (SCCM)            1.5        6%         16%      40%
SCCM plus:
  1 Spare                3          0.5%       2%       16%
  2 Spares               4.5        0.1%       0.5%     6%
Table 1. SCCM Cost and Reliability Example
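As a reading aid, the Baseline Computer row of Table 1 is numerically consistent with a constant-failure-rate (exponential) model, P(failure by time t) = 1 - exp(-rate * t). The report does not state which model produced the table, so the following Python sketch is only a check under that assumption; the function names are hypothetical.

```python
import math

def failure_rate_from(prob_fail: float, years: float) -> float:
    """Rate (per year) implied by P(fail by t) = 1 - exp(-rate * t).

    Assumes a constant failure rate; this is an assumption of the
    sketch, not a model stated in the report.
    """
    return -math.log(1.0 - prob_fail) / years

def prob_fail(rate: float, years: float) -> float:
    """Probability that a machine has failed by the given time."""
    return 1.0 - math.exp(-rate * years)

# Fit the rate to the 6-month baseline figure (22%) from Table 1,
# then predict the 1-year and 2-year columns.
rate = failure_rate_from(0.22, 0.5)
predicted = {t: round(100 * prob_fail(rate, t)) for t in (0.5, 1.0, 2.0)}
print(predicted)
```

Under this assumption the fitted rate reproduces the 39% and 63% baseline entries. The SCCM rows do not follow a single exponential, which is what one would expect from modules with internal redundancy and spares.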
To assess the incremental cost of an SCCM,
we examine the additional features which must be
included in the Baseline Computer. The additional
logic for self-checking circuitry adds about 15%
to the complexity of the Baseline Computer. This
property is essential for fault-tolerant operation
since prompt and dependable detection of faults is
necessary for automated recovery. Each SCCM also
contains error correcting (SEC/DED) codes in
memory and a spare replacement bit such that most
single faults in memory can be corrected. Since
the memory is by far the least reliable portion of
the SCCM (using current technology), single-fault
recovery in this area significantly improves the
reliability of the whole module. More common than
hard failures are transient mistakes caused by
erroneous bits stored in memory. Most of these
errors are also corrected by this error correction
capability, which represents an addition of 25% to
the complexity of the SCCM.
Thus the SCCM is about 1.5 times as expensive
as a conventional computer of similar capability
but, due to its internal memory fault correction,
it is expected to have several times fewer failures
in the field. The self-checking property allows
the SCCM to be easily connected with one or more
backup spares which provide automatic fault
recovery. In this case, the expected failures can be
reduced by one or more orders of magnitude.
When spares are used, their number is chosen
to assure reliable operation over some specified
time interval at which scheduled maintenance is
performed (e.g., 6 months, 2 years, etc.). When
the maintenance person arrives, the
computer system indicates which SCCMs have
been discarded as faulty and automatically
replaced with spares. New SCCMs are then substituted
for the faulty modules without disrupting computer
service. In the future, when the cost of VLSI
circuitry drops even further, it is expected that
sufficient redundancy can be included so that the
core computers will function properly over the
entire operational life of their host systems
without any external maintenance.
The relative increase in internal gates and
registers needed to provide a self-checking and
fault-tolerant computer is roughly constant over
various levels of circuit integration (i.e., SSI,
MSI, LSI, VLSI). VLSI implementation makes this
type of design more attractive for several reasons.
First, packaging is more efficient. When a
computer is partitioned into chips, pin limitations
often do not allow full utilization of the
potential gates inside. Extra space is often available
for a nearly "free" implementation of
self-checking logic. A side benefit is that the
self-checking chips are more easily tested by the
manufacturer. The general reduction in circuit costs
associated with VLSI similarly reduces the
incremental costs of fault tolerance, i.e., checking
circuitry and spares. In past computers, fault
tolerance cost an additional $100,000 or more and
was only applied to one-of-a-kind problems such as
space missions. With VLSI technology, this
incremental cost is reduced by an order of magnitude,
making fault tolerance applicable to a much wider
range of applications. This report is directed
toward this area of LSI and VLSI devices which can
currently be produced. Extrapolating to future
VLSI developments, the cost of fault-tolerant
computing will be reduced by an additional order of
magnitude. It is expected that self-checking
computers will be packaged on a single chip along
with redundancy for local fault recovery at a cost
of a few hundred dollars.
II. THE SELF-CHECKING COMPUTER MODULE (SCCM)
The Self-Checking Computer Module, as previously
described, is a small computer which is
capable of detecting its own malfunctions. It
contains I/O and bus interface logic which allows
it to be connected to other SCCMs to form
fault-tolerant configurations. The SCCM contains
commercially available microprocessors, memories, and
four types of VLSI building-block circuits as
shown in Figure 3. The building blocks are:
1) an error detecting (and correcting) Memory
Interface (MI-BB), 2) a programmable Bus
Interface (BI-BB), 3) a Core building block (Core-BB),
and 4) an I/O building block (IO-BB).
The building-block circuits control and
interface the various processor, bus interface,
memory, and I/O functions to the SCCM's internal
bus. Each building block is responsible for
detecting faults in its associated circuitry and
then signaling the fault condition to the Core
building block by means of an internal fault
indicator. The Memory Interface Building Block
implements fault detection and correction in the
memory, as well as providing detection of faults
in its own internal circuitry. Similarly, the Bus
Interface and I/O Building Blocks provide
intercommunications and input-output functions, along
with detecting faults within themselves and their
associated communications circuitry. The Core
Building Block checks the processing function by
running two CPUs in synchronism and comparing
their outputs. It is also responsible for fault
collection and fault handling within the module.
[Figure 3 block diagram not legible in this reproduction.]
Figure 3. Self-Checking Computer Module Architecture
The Core Building Block receives fault indicators
from the other building-block circuits and
also checks internal bus information for proper
coding. Upon detecting an error, the Core-BB
disables the bus interface and I/O functions, isolating
the SCCM from its surrounding environment.
The Core-BB can optionally 1) halt further
processing until external intervention, 2)
attempt a rollback or restart of the processor, or,
when fitted, 3) initiate a memory reload from a
local nonvolatile store and execute a program
restart. Repeated errors result in the disabling of
the faulty computer module by the Core Building
Block. Recovery can be effected by an external
SCCM which is programmed to recognize the lack of
activity from an SCCM and take over the ongoing
computations.
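The fault-handling behavior described above can be sketched as a small Python model. It is illustrative only: the class name CoreBB, its fields, and the retry limit of two are assumptions, since the report says only that "repeated errors" disable the module, and the two CPUs are modeled as plain functions.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

@dataclass
class CoreBB:
    """Toy model of the Core building block (names are hypothetical)."""
    cpu_a: Callable[[int], int]
    cpu_b: Callable[[int], int]
    max_retries: int = 2           # assumed limit; not given in the report
    outputs_enabled: bool = True   # bus interface and I/O enabled
    module_disabled: bool = False  # set after repeated errors
    error_count: int = 0

    def step(self, operand: int,
             fault_indicators: Sequence[bool] = ()) -> Optional[int]:
        """Run both CPUs on one operand; return the output or None."""
        if self.module_disabled:
            return None
        out_a, out_b = self.cpu_a(operand), self.cpu_b(operand)
        if out_a == out_b and not any(fault_indicators):
            self.outputs_enabled = True
            return out_a
        # Disagreement or a fault indicator from another building
        # block: isolate the module from its buses and I/O.
        self.outputs_enabled = False
        self.error_count += 1
        if self.error_count > self.max_retries:
            self.module_disabled = True  # repeated errors: give up
        return None

healthy = CoreBB(lambda x: x + 1, lambda x: x + 1)
faulty = CoreBB(lambda x: x + 1, lambda x: x + 2)
results = [faulty.step(1) for _ in range(3)]
```

In this model a disabled module simply goes silent, which is the condition an external SCCM would watch for before taking over the computation.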
An important attribute of the building blocks
is that they are interconnected via the internal
processor-memory bus. All I/O, external bus
transmissions, reconfigurations, and external diagnoses
are commanded by reading from or writing into
out-of-range addresses. This use of "memory-mapped
I/O" avoids dependence on processor-specific I/O
operations, and thus allows use of a wide range of
existing microprocessors in the SCCM.
The following is a brief description of the
building-block circuits.
	
The Memory Interface Building Block (MI-BB)
The fault-detecting and correcting MI-BB
interfaces a storage array (consisting of a
redundant set of memory chips) to the SCCM internal bus.
It provides Hamming correction of damaged memory
data, replacement of a faulty bit with a spare,
parity encoding and decoding to the internal bus,
and detection of internal faults within its own
circuitry.
The MI-BB needs only to be capable of detecting
errors to satisfy the requirement of an SCCM.
However, memory is the source of the largest number
of failures within the SCCM, and single-fault
repair in this area will greatly improve the
reliability of the whole module, even though the basic
computer module is treated as a replaceable
(throw-away) item with backup spares. A block diagram of
the Memory Interface building block is shown in
Figure 4.
[Figure 4 block diagram not reproduced. It shows the Access Element
(AE), Error Control (EC), Memory Control, Data Bus Interface (DBI),
and Bit Replacement (BR) elements between the SCCM internal bus
(address, control, and data buses) and the storage array. Legend:
fault indication lines; control lines; a - address bit; d - data bit;
p - parity bit; c - check bit; s - spare.]
Figure 4. The Memory Interface Building Block
The Access Element (AE) provides the address
parity checking and decoding required to select a
memory module (storage array plus MI-BB). It names
and validates the incoming address by means of a
self-checking parity checker circuit.
The Error Control (EC) element is responsible for
generation of Hamming code check bits and syndromes,
byte-parity generation and checking (for the SCCM
internal bus), and error analysis. The circuits
used in the EC are also self-testing. A single-bit
error is corrected by decoding the syndrome
generated from the word read from memory, in order to
localize the faulty bit. The correction is
performed by complementing the faulty bit.
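Syndrome decoding of this kind can be illustrated with a textbook extended Hamming (SEC/DED) code over a 16-bit data word. The report does not give the actual check-bit layout, so the bit positions below are an assumption of this Python sketch, as are the function names. With 16 data bits the code needs 5 Hamming check bits plus 1 overall parity bit, for 22 stored bits per word.

```python
from typing import List, Optional, Tuple

# Assumed layout: positions 1..21 hold a (21,16) Hamming code with
# check bits at the powers of two; position 0 is overall parity.
CHECK_POSITIONS = [1, 2, 4, 8, 16]
DATA_POSITIONS = [p for p in range(1, 22) if p & (p - 1)]  # 16 slots

def hamming_encode(data: int) -> List[int]:
    """Encode a 16-bit value as a list of 22 bits."""
    word = [0] * 22
    for i, pos in enumerate(DATA_POSITIONS):
        word[pos] = (data >> i) & 1
    for c in CHECK_POSITIONS:
        word[c] = sum(word[p] for p in range(1, 22) if p & c and p != c) % 2
    word[0] = sum(word[1:]) % 2  # makes the overall parity even
    return word

def hamming_decode(word: List[int]) -> Tuple[Optional[int], str]:
    """Return (data, status); data is None if uncorrectable."""
    syndrome = 0
    for p in range(1, 22):
        if word[p]:
            syndrome ^= p          # XOR of the positions holding a 1
    overall_odd = sum(word) % 2 == 1
    fixed = list(word)
    if syndrome == 0 and not overall_odd:
        status = "no error"
    elif overall_odd and syndrome < 22:
        fixed[syndrome] ^= 1       # complement the faulty bit
        status = "single error corrected"
    elif overall_odd:
        return None, "uncorrectable error"
    else:
        return None, "double error detected"
    data = sum(fixed[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return data, status

stored = hamming_encode(0xBEEF)
stored[7] ^= 1                     # inject a single-bit fault
```

Here the syndrome is the position of the flipped bit, so correction is a single complement, matching the description above. A second flipped bit leaves the overall parity even with a nonzero syndrome, which is reported as a double error rather than corrected.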
The error analyzer collects various error
indications such as single error, double error, and
circuit error, which are recorded in an Error
Status Word and which can be transmitted over the
bus on system demand.
The Bit Replacement (BR) element performs the
reconfiguration of the storage array. It contains
a multiplexer circuit which can replace any one bit
plane in the memory with a single standby spare.
The bit to be replaced is specified by external
command.
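The bit-plane substitution can be pictured as a pair of multiplexers, one on the write path and one on the read path. The Python sketch below assumes a 22-bit stored word plus one spare plane; the width, class name, and method names are assumptions, not details taken from the report.

```python
from typing import List, Optional

class BitReplacer:
    """Toy model of the Bit Replacement element (hypothetical names)."""

    def __init__(self, width: int = 22):
        self.width = width                    # assumed word width
        self.replaced: Optional[int] = None   # bit steered to the spare

    def replace(self, bit: int) -> None:
        """External command: route one logical bit to the spare plane."""
        if not 0 <= bit < self.width:
            raise ValueError("bit index out of range")
        self.replaced = bit

    def to_planes(self, bits: List[int]) -> List[int]:
        """Write path: logical bits -> physical planes (last is spare)."""
        planes = list(bits) + [0]
        if self.replaced is not None:
            planes[self.width] = bits[self.replaced]
        return planes

    def from_planes(self, planes: List[int]) -> List[int]:
        """Read path: take the replaced bit from the spare plane."""
        bits = list(planes[:self.width])
        if self.replaced is not None:
            bits[self.replaced] = planes[self.width]
        return bits

br = BitReplacer()
word = [1] * 22
planes = br.to_planes(word)
planes[5] = 0                      # plane 5 is stuck at zero
before = br.from_planes(planes)
br.replace(5)                      # command the substitution
planes = br.to_planes(word)
planes[5] = 0                      # the fault is still present
after = br.from_planes(planes)
```

After the substitution the stuck plane is still written and read, but its output is ignored in favor of the spare.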
The Data Bus Interface (DBI) contains a Memory
Data Register and the tri-state drivers and
receivers used to interface with the SCCM internal
data bus. Bit inversion for Hamming correction is
performed in the Data Register.
The Memory Control (MC) element receives
commands from the SCCM internal control bus which
specify "read" and "write" operations. For
addresses less than 61,440, the commands are
interpreted as normal memory operations. "Read" and
"write" instructions with addresses larger than
61,440 are reserved for memory-mapped I/O. A set
of these out-of-range addresses is reserved for
commands to the MI-BB. Among these commands are:
1) read error status word, 2) read error position
of faulty word, 3) read address of last error, 4)
reset, 5) disable correction, 6) read redundant
check bits, 7) replace a bit plane with a spare,
and 8) set soft name.
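The address split can be sketched as a simple decoder in Python. Several details are assumptions of the sketch rather than facts from the report: a 16-bit address space, treating address 61,440 itself as part of the I/O range (the text leaves that address unspecified), and the base address and offsets assigned to the eight MI-BB commands, which are purely hypothetical.

```python
from enum import IntEnum
from typing import Tuple, Union

IO_BASE = 61_440      # 0xF000, the boundary quoted in the text
MI_BB_BASE = 0xF000   # hypothetical location of the MI-BB commands

class MiBbCommand(IntEnum):
    """The eight commands listed above; the offsets are invented."""
    READ_ERROR_STATUS_WORD = 0
    READ_ERROR_POSITION = 1
    READ_LAST_ERROR_ADDRESS = 2
    RESET = 3
    DISABLE_CORRECTION = 4
    READ_REDUNDANT_CHECK_BITS = 5
    REPLACE_BIT_PLANE = 6
    SET_SOFT_NAME = 7

def decode_address(address: int) -> Tuple[str, Union[int, MiBbCommand]]:
    """Classify a bus address as memory, an MI-BB command, or other I/O."""
    if not 0 <= address < 65_536:   # assumes 16-bit addressing
        raise ValueError("address outside the assumed 16-bit range")
    if address < IO_BASE:
        return ("memory", address)
    offset = address - MI_BB_BASE
    if 0 <= offset < len(MiBbCommand):
        return ("mi-bb", MiBbCommand(offset))
    return ("other-io", address)
```

Because every command is an ordinary read or write to an address, the same decoder works with any processor that can issue memory cycles, which is the portability argument made for memory-mapped I/O.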
From preliminary designs, the complexity of
the Memory Interface building block is estimated
to be equivalent to 2000 gates. This represents
a small failure rate with respect to the storage
array and is readily implemented as a single LSI
building block.
The Bus Interface Building Block (BI-BB)
The Bus Interface Building Block provides the
mechanism by which information is transferred
between SCCMs via an intercommunications bus system.
This external bus system consists of several
redundant buses and is being designed to utilize MIL STD
1553A communications formats. Each BI-BB can be
microprogrammed to function as either a Bus
Controller or Remote Terminal (designated Bus Adaptor)
for a single 1553A bus. Several BI-BBs are
employed in each SCCM so that each computer module
can communicate over several buses simultaneously.
The Bus Controller and Adaptor functions
provided by the BI-BBs are much more powerful than
those normally implemented for 1553A controllers
and terminals. These BI-BBs are capable of moving
data directly between the memories of the SCCMs
attached to a given data bus with a minimum of
software support. The controller and adaptors on
a given bus operate together in a relatively
autonomous fashion similar to the data channels on much
larger machines, as described below.
The SCCM which controls an intercommunications
bus contains a BI-BB which is microprogrammed to be
a Bus Controller. When the SCCM wishes to initiate a
data transfer between the memories of the SCCM
modules on its bus, it alerts the Bus Controller.
The Bus Controller reads a control table from
its host SCCM's memory which specifies the source
and destination of information required for the
bus transfer along with the length of the
transmission. The Controller then broadcasts the
appropriate commands over the bus system to "set up" the
transmitting and receiving adaptor circuits. It
monitors the transfer of information, records
status messages, and notifies the host SCCM upon
completion of the transfer.
BI-BBs in other SCCMs connected to the same
intercommunications bus are microprogrammed as Bus
Adaptors. These Bus Adaptors serve as remote
terminals on the bus. The Bus Controller, in
setting up a transfer, specifies one adaptor as a
data source. It then specifies one or more Bus
Adaptors as data acceptors and names the data to
be moved.
The source Adaptor then finds and extracts
the specified information from its host SCCM's
memory, using cycle-stealing, and places this in-
formation on the bus. Simultaneously, the acceptor
Adaptor(s) takes this information off the bus and
loads it into its host SCCM's memory, via cycle-
stealing.
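The controller-driven transfer described above amounts to a memory-to-memory block move from one source to one or more acceptors. The Python sketch below models only that data movement; the ControlTable field names and the use of plain lists for SCCM memories are assumptions, and bus commands, status messages, and cycle-stealing are not modeled.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ControlTable:
    """Hypothetical layout of the control table read by the controller."""
    source: str            # name of the source SCCM
    acceptors: List[str]   # names of one or more accepting SCCMs
    source_address: int
    dest_address: int
    length: int            # number of words to move

def run_transfer(table: ControlTable,
                 memories: Dict[str, List[int]]) -> str:
    """Copy a block from the source memory into each acceptor memory."""
    src = memories[table.source]
    if table.source_address + table.length > len(src):
        raise ValueError("source block exceeds source memory")
    block = src[table.source_address:table.source_address + table.length]
    for name in table.acceptors:
        dest = memories[name]
        if table.dest_address + table.length > len(dest):
            raise ValueError("destination block exceeds memory")
        # All acceptors take the same words off the bus.
        dest[table.dest_address:table.dest_address + table.length] = block
    return "complete"      # the controller would notify its host here

memories = {"A": list(range(10)), "B": [0] * 10, "C": [0] * 10}
table = ControlTable("A", ["B", "C"], source_address=2,
                     dest_address=4, length=3)
status = run_transfer(table, memories)
```

The point of the design is that the host processors take no part in the move beyond preparing the table, much like a data channel on a larger machine.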
An SCCM can contain several bus adaptors to
provide an interface to a number of redundant
external buses. Communication can occur
simultaneously over as many as three buses with an SCCM
without conflict (time delays) seen on any bus. A
Bus Adaptor cannot initiate a bus transfer, but
only responds to the commands of a Bus Controller.
Provision is made for sending discrete commands
through Bus Adaptors, such as power on, power off,
halt, interrupt, reconfigure, etc.
The functioning of the Intercommunications Bus
System is further described in a following section.
A block diagram of the BI-BB is shown in Figure 5.
It consists of five major elements, a Manchester/
NRZ translator, a Microprogram Control Unit, a
Control ROM, a Data Path Element, and a DMA (direct
memory access) Controller.
	
NO	 (SCCM)  STHOST(SCCM)   	 FAULT IIFI	 ACKNOWLEDGE,
INTERNAL
	
BUS REQUEST,
HOST SS BUS  	  
A
(^	 SIGNALS	 {t W, CPL
 18	 #[ Te	 1111	 ll
DATA PATH ELEMENT	 CONTROL
	 (`IF)
(DATA-ADORES S REGI STERS,
ROM,000N T ERS,ADDER)--^ -----
V	 11 HOST	 10
(IF) NRZ	 COMMANDS	
CONTROL
	
OUTPUT	 DATA 2	 A	 STATUS	
ROM
	
INHIBIT	
Z	 Z	 OUTPUT ENABLL	 2 10 MICRO-
PROGRAM
CODE ERROR	
AoORESs
OUTPUT I
 MANCH-
DRIVER	 ESTER	 VNCS
	
(-	 J	 TRANSMIT;	 MICROPROGRAM
NRZ	 RECEIVE	 CONTROL UNIT
INPUT	 TRANS-
RECEIVERr	 LATOR	 TONE TZRO	 ^--
2
	
CLOCK	 INTERNALB
us
	
EXTERNAL
	 FAULT (IF)
Figure 5. The Bus Interface Building Block
The Manchester/NRZ Translator translates in-
coming Biphase Manchester into commands and data
by supplying a bus-synchronized clock, command and
data word-sync indicators, NRZ data, and parity and
Manchester-error detection signals.	 It will also
accept NRZ data, encode it, and output Manchester
data for bus transmission, along with the associ-
ated command and data sync signals.
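The NRZ-to-Manchester translation with parity checking can be sketched in Python as follows. The sketch covers only the 16 data bits and one parity bit of a word and omits the sync pattern. It assumes the conventions commonly described for 1553-style buses (most significant bit first, odd parity, and a logic one sent as high-then-low); these details and the function names are assumptions and are not taken from the report.

```python
from typing import List, Optional, Tuple

def manchester_encode(data: int) -> List[int]:
    """Encode 16 data bits plus odd parity as 34 half-bit levels."""
    bits = [(data >> i) & 1 for i in range(15, -1, -1)]  # MSB first
    bits.append(1 - sum(bits) % 2)   # odd parity over all 17 bits
    halves: List[int] = []
    for b in bits:
        # Assumed polarity: 1 -> high, low; 0 -> low, high.
        halves += [1, 0] if b else [0, 1]
    return halves

def manchester_decode(halves: List[int]) -> Tuple[Optional[int], str]:
    """Return (data, status); data is None when an error is detected."""
    if len(halves) != 34:
        raise ValueError("expected 34 half-bit samples")
    bits: List[int] = []
    for i in range(0, 34, 2):
        pair = (halves[i], halves[i + 1])
        if pair == (1, 0):
            bits.append(1)
        elif pair == (0, 1):
            bits.append(0)
        else:
            return None, "manchester error"  # no mid-bit transition
    if sum(bits) % 2 != 1:
        return None, "parity error"
    data = 0
    for b in bits[:16]:
        data = (data << 1) | b
    return data, "ok"

line = manchester_encode(0xA5C3)
```

A missing mid-bit transition is caught as a Manchester error, while a cleanly inverted bit passes the waveform check and is caught by parity, which is why the translator supplies both kinds of error signal.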
The Microprogram Control Unit (MCU) is a
microprogram sequencer. A microprogram location
counter is started at one of several fixed addresses
by command sync, data sync, or a host processor
command (detection of an out-of-range address).
The location counter proceeds through sequential
addresses or branches on the basis of incoming data,
internal flags, or other internal circuit
conditions. The microprogram sequencer is programmed
to generate a unique set of address sequences for
each type of incoming bus command, data sequence,
or computer command. This output sequence is then
mapped through a Control ROM to generate the
detailed control signals required to drive the
Data Path, MCU, and DMA Control Elements.
The Control ROM maps the microprogram address
sequence into control signals for the various
circuit elements.
The Data Path element contains 1) registers
necessary to buffer addresses and data, 2) ROM to
store memory protection bounds, data keys, and
table addresses, and 3) an arithmetic logic unit
for addressing computations. This circuit is not
unlike existing bit-slice processors, with the
exception that serial-parallel conversion registers,
ROM, and several holding registers are required
for the unique bus interface and DMA functions.
The DMA Control element is responsible for
obtaining control of the host SCCM's internal bus
and transferring data between the BI-BB and the
host SCCM's memory.
The fault detection techniques employed in
the Bus Interface Building Block are based on
parity coding to protect memory information and
duplication with morphic comparison for most of
the logic circuitry.' Preliminary designs indicate
that this building block will have complexity
equivalent to 7,000-10,000 gates.
The I/O Building Blocks
Input-output requirements of host systems
vary widely in voltage ranges, currents, and timing
parameters. The approach best suited to building-
block development is to provide a standard set of
functions which serve a majority of general appli-
cations. The user is required to supply any addi-
tional functions unique to his applications.
To be consistent with the SCCM design, all
building blocks must provide memory mapped I/O.
That is, each I/O building block must recognize
its identification and the function being requested
from an out-of-range address appearing on the host
SCCM's internal address bus. Data for output or
input is transferred over the data bus in response
to a processor write or read to the specified
address.
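The memory-mapped scheme above can be sketched as follows. The address split (RAM below 0x8000, a 4-bit function field) and the names decode_address and IOBlock are hypothetical choices for illustration only; the report does not define an address layout.

```python
MEMORY_WORDS = 0x8000   # assumed: RAM occupies addresses 0x0000-0x7FFF
FUNCTION_BITS = 4       # assumed: low 4 bits of the offset select a function


def decode_address(addr):
    """Return (block_id, function) for an out-of-range address, else None."""
    if not 0 <= addr <= 0xFFFF:
        raise ValueError("address must fit in 16 bits")
    if addr < MEMORY_WORDS:
        return None  # ordinary memory reference; no I/O block responds
    offset = addr - MEMORY_WORDS
    return offset >> FUNCTION_BITS, offset & ((1 << FUNCTION_BITS) - 1)


class IOBlock:
    """Hypothetical I/O building block that answers only to its own id."""

    def __init__(self, block_id):
        self.block_id = block_id
        self.registers = {}

    def bus_cycle(self, addr, write, data=None):
        """Respond to a processor read or write seen on the internal bus."""
        decoded = decode_address(addr)
        if decoded is None or decoded[0] != self.block_id:
            return None  # address belongs to memory or to another block
        function = decoded[1]
        if write:
            self.registers[function] = data & 0xFFFF
            return None
        return self.registers.get(function, 0)
```

Under these assumptions a write to 0x8025 reaches function 5 of block 2, and every other block ignores the cycle.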
	
Candidate I/O functions are: 1) 16-bit parallel
data in and out, 2) 16-bit serial data in
and out, 3) a pulse sampling circuit, 4) a pulse
counter, 5) a pulse generator, 6) an adjustable
frequency generator, 7) an analog multiplexer with
A/D converter, and 8) a high-rate DMA channel.
The density of VLSI technology is sufficiently
high that a number of I/O functions can be supplied
on a single chip. The specific function which is
required can be activated by connecting pins.
This approach can reduce the inventory of different
I/O building blocks to one or two.
The Core Building Block (Core-BB)
The Core-BB is responsible for: 1) detecting
CPU faults by synchronizing and comparing two
duplex CPUs, 2) collecting fault indications from
itself and other building blocks, and 3) disabling
its host SCCM upon detection of a permanent fault.
Three options are provided for attempted recovery
from transient faults. These are: 1) Stop at
first fault indication; wait for outside help;
2) Roll back at first fault indication; stop if
the fault recurs; 3) Reload memory and restart;
stop if the fault recurs. In all cases, Bus Controller
and I/O outputs are inhibited as long as the SCCM
is suspect; e.g. before a rollback or restart has
been successfully completed. Specific functions
of the Core-BB are listed: 1) Compare two CPUs
for disagreement; 2) Parity encode CPU output for
internal bus transmission; 3) Check parity on the
internal bus; 4) Recognize Core-BB commands which
can be sent from an external module via a bus
adaptor as out-of-range addresses (these are
commands to halt and inhibit outputs, restart, and
enable outputs of the receiving module);
5) Allocate the internal tristate bus among several DMA
requests from the Bus Controller, Adaptors, and
I/O-BBs; 6) Detect internal faults within the
Core-BB; 7) Collect internal fault indications from all
building blocks within the SCCM; 8) Disable SCCM
output under fault conditions; 9) Provide optional
rollback/restart compatibility for optional
transient fault recovery; and 10) Halt computation on
recurring faults.
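The three recovery options can be sketched as a small state machine. This is an illustrative model, not the Core-BB design: the class and method names are hypothetical, and "the fault recurs" is assumed here to mean a second fault indication arriving before the first recovery attempt has completed.

```python
from enum import Enum


class Option(Enum):
    STOP = 1       # stop at first fault indication; wait for outside help
    ROLLBACK = 2   # roll back at first fault; stop if the fault recurs
    RESTART = 3    # reload memory and restart; stop if the fault recurs


class RecoverySequencer:
    """Hypothetical model of one recovery sequencer and its option strap."""

    def __init__(self, option):
        self.option = option
        self.halted = False
        self.outputs_inhibited = False
        self.recovering = False  # True from a fault until recovery completes

    def fault(self):
        """React to a fault indication and return the action taken."""
        self.outputs_inhibited = True  # outputs stay off while the SCCM is suspect
        if self.option is Option.STOP or self.recovering:
            self.halted = True
            return "halt"
        self.recovering = True
        if self.option is Option.ROLLBACK:
            return "rollback"
        return "reload_and_restart"

    def recovery_complete(self):
        """Re-enable outputs once a rollback or restart has succeeded."""
        if not self.halted:
            self.recovering = False
            self.outputs_inhibited = False
```

With the ROLLBACK strap, the first fault returns "rollback" with outputs inhibited, and a second fault before recovery_complete() halts the module.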
[Figure 6 block diagram, three panels: a) Processor Check Element (x 2), with a self-checking compare tree on the address or data bus and control lines between the master CPU and the check CPU; b) Bus Arbitration Element, with duplicated priority resolvers for the internal bus requests checked by a self-checked compare; c) Fault Handler Element, with a self-checked compare tree over the error signals and recovery sequencers set by recovery option straps, producing processor reset, run/stop, output inhibit, and bulk load.]
Figure 6. The Core Building Block
[Figure 7 diagram: High-Level Modules and Terminal Modules connected by buses No. 1, No. 2, and No. 3, with each Terminal Module wired to its subsystem. Legend: BB - building block; µP - microprocessor; BC - bus controller; BA - bus adaptor; RTI - real-time interrupt; Pi - priority chain for i'th bus; dashed line - BC connection for bus-sharing.]
The Fault-Handler element accepts morphic
fault indicators from the other BBs and from within
the Core-BB. It reduces these to a single two-wire
master fault indicator which indicates a fault
somewhere in the SCCM. This fault indicator
triggers the removal of power from bus controller
and output drivers, isolating the computer module
from the rest of the system. Duplex Recovery
Sequencers are employed to implement optional
transient recovery sequences. They are checked with a
self-checking comparator.
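The reduction of many two-wire indicators to one master indicator can be sketched with the classic two-rail checker cell. Using that cell here is an assumption: the report does not give the gate-level form of its morphic reduction, and the function names are hypothetical. A pair is treated as valid ("no fault") when its two wires are complementary.

```python
from functools import reduce


def two_rail_cell(x, y):
    """Combine two two-wire signals into one two-wire signal.

    The output pair is complementary only if both input pairs are.
    """
    x0, x1 = x
    y0, y1 = y
    z0 = (x0 & y0) | (x1 & y1)
    z1 = (x0 & y1) | (x1 & y0)
    return z0, z1


def is_code(pair):
    """A two-wire signal is a valid code word when its wires differ."""
    return pair[0] != pair[1]


def master_fault_indicator(pairs):
    """Reduce a non-empty list of two-wire indicators to a single pair."""
    return reduce(two_rail_cell, pairs)
```

Any single input stuck at (0, 0) or (1, 1) forces the master pair to a non-complementary value, which is the property a fault handler needs from such a reduction.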
The Core-BB consists of three elements as
shown in Figure 6. The Processor Check element
serves three functions: 1) to compare the outputs
of two synchronous CPUs, 2) to encode and check
internal bus parity, and 3) to recognize and
decode commands sent to the Core-BB through the
internal bus. It contains self-checking parity
checkers, a duplex command decoder, and morphic
reduction trees.
The Bus Arbitration element accepts two-wire
bus request signals from the various DMA control-
lers on other BBs. It obtains release of the bus
by the CPUs, and grants access to requesting BBs
on the basis of hardware priority. The Bus Avail-
able signal is sent as a two-wire indicator to
each requesting BB.
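A minimal sketch of fixed-priority arbitration over two-wire request signals follows. The two-wire encodings, the convention that list position 0 has the highest priority, and the function name are illustrative assumptions rather than details from the report.

```python
ASSERTED = (1, 0)   # assumed two-wire code for "signal active"
NEGATED = (0, 1)    # assumed two-wire code for "signal inactive"


def arbitrate(requests):
    """Grant the bus to the highest-priority requester.

    requests: list of two-wire pairs, index 0 = highest priority (assumed).
    Returns (grants, error), where grants holds one two-wire Bus Available
    signal per requester and error flags an invalid request code.
    """
    error = any(req not in (ASSERTED, NEGATED) for req in requests)
    grants = [NEGATED] * len(requests)
    if not error:
        for index, req in enumerate(requests):
            if req == ASSERTED:
                grants[index] = ASSERTED
                break  # lower-priority requesters wait
    return grants, error
```

If any request line carries a non-complementary pair, no grant is issued and the error is reported, so a faulty requester cannot seize the bus in this model.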
The Building-Block Designs are all directed
toward developing a computer module which removes
itself from operation whenever a fault occurs
inside. The next section deals with the use of
these modules in a computer network.
III. SYSTEM CONFIGURATIONS
The SCCMs can be used in a number of different
fault-tolerant configurations to support a
variety of applications. This includes single
computer applications which employ standby
redundancy (i.e. a single computer protected with
backup spares) and hybrid redundancy (i.e. several
computers execute the same program and their
outputs are voted; additional spares may also be
employed). The SCCMs may also be combined into
distributed computer networks in which the
individual computing functions can be protected with
standby or hybrid redundancy.
A distributed computer network architecture
has been developed at the Jet Propulsion Laboratory
for control and data handling applications, which
allows SCCMs to be connected into a variety of
configurations. This architecture, designated the
Unified Data System (UDS), has been constructed in
the form of a breadboard system. Software has
been developed, and the results have been widely
reported. This distributed computing architecture
provides a framework for a number of computers
to work together, performing a collection
of different dedicated computing tasks.
Individual computer modules may be protected by standby
redundancy or voting, and thus the architecture
provides a model for single computer configurations
by considering a single internal computer
with its redundant protection, or a model for
distributed computer fault tolerance by implementing
the complete processing ensemble.
The next section describes the hardware and
intercommunications structure of the UDS
architecture. This is followed by a brief descrip-
tion of the fault-tolerant configurations that can
be constructed.
The UDS Architecture Using Self-Checking Computer
Modules
A fault-tolerant UDS architecture consists of
a set of Self-Checking Computer Modules connected
by a redundant set of intercommunications buses as
shown in Figure 7. There are two types of modules:
Terminal Modules (TM) and High-Level Modules (HLM).
Figure 7. A Distributed SCCM Architecture
Terminal Modules (TM) are SCCMs located within
various subsystems which are responsible for
control and data gathering within their associated
subsystem. The TM contains two microprocessors
(µP), memory (RAM) with an MI-BB, I/O-BBs, a
Core-BB, and several BI-BBs configured as Bus Adaptors
(BA). The TM interfaces with the other modules in
two ways: 1) It receives a single Real-Time
Interrupt (RTI) which is common to all modules and which
is used for timing and synchronization, and 2) Each
TM contains a Bus Adaptor interface to each of
several intercommunications buses. Data words can be
entered or extracted from the memory of the TM
computer using Direct Memory Access (DMA) techniques.
Each BA can be commanded over its bus to fetch or
deposit data into the TM memory. A TM cannot
initiate bus communications but it actively supports
DMA transactions into and out of its memory. An
external High-Level Module enters commands, data,
and timing information into the memory of the TM.
The TM delivers information to the system by
placing outgoing messages into predetermined locations
of its memory, which can then be extracted by the
HLM over the bus. The TM can be accessed through
several buses simultaneously. The associated BAs
provide hardware conflict resolution between
competing DMA requests from different buses.
High-Level Modules (HLM) are SCCMs which are
responsible for coordinating the processing which
is carried out in the remote TMs, for control of
intercommunications over the bus system, and for
high-level processing such as data compression and
decision making. They contain the same internal
components as the TMs, augmented by an additional
BI-BB which is configured as a Bus Controller (BC).
Each BC, which is unique to an HLM, can move blocks
of data between the memories of all computer mod-
ules connected to its bus via commands to their Bus
Adaptors. The BC specifies a source module, one or
more destination modules and identifies the data to
be moved. The BAs in the addressed modules then
move the specified data over the bus between their
host SCCM memories.
The computer in the HLM activates its BC by
presenting it with the address of a Bus Control
Table in the HLM memory. This table specifies the
source module, destination modules, data names, and
the length of the requested information transfer.
The BC initiates and controls the specified trans-
mission, monitors status messages to verify a cor-
rect transfer of information, and notifies the HLM
computer when it is completed. The BC is the
mechanism by which the HLM can coordinate the
processing in other computer modules by entering
commands into their memories and reading out
information to monitor ongoing processes.
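The Bus Control Table mechanism can be sketched as a small data structure plus a transfer routine. The field names, the dictionary used as a stand-in for DMA-addressed memory, and the boolean status result are all hypothetical; the report names only the kinds of information the table holds.

```python
from dataclasses import dataclass


@dataclass
class BusControlTable:
    """One transfer request, holding the fields listed in the text."""
    source: str          # name of the source module
    destinations: list   # names of one or more destination modules
    data_name: str       # name identifying the data to be moved
    length: int          # number of words to transfer


class Module:
    """Stand-in for an HLM or TM whose memory is reached through its BA."""

    def __init__(self, name):
        self.name = name
        self.memory = {}  # data name -> list of words (assumed layout)


def run_transfer(table, modules):
    """Carry out one table entry; return True if the transfer completed."""
    block = modules[table.source].memory.get(table.data_name, [])
    words = block[:table.length]
    if len(words) != table.length:
        return False  # source could not supply the requested block
    for dest in table.destinations:
        modules[dest].memory[table.data_name] = list(words)
    return True
```

In this sketch the returned flag plays the role of the status check that the BC reports back to the HLM computer.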
The Intercommunication Bus System (IBS)
consists of several independent serial buses, each of
which provides a bandwidth of approximately 1
megabit. Each bus is connected to one Bus Adaptor in
each of the computer modules (HLMs or TMs) to
which it is connected. To each bus is assigned a
primary Bus Controller whose HLM has complete
control over that bus. It can relinquish control
over its bus to another HLM under two conditions:
1) it is not powered, or 2) its processor commands
release of the bus to lower priority Bus
Controllers for a designated time interval. Thus, the
set of buses may be operated simultaneously with
each bus controlled by a different HLM or with
individual buses time-shared between several such
modules.
Access to each bus by the various HLMs is
based on a fixed hardware priority assignment be-
tween Bus Controllers. A daisy-chain structure is
utilized for each bus to establish this priority
assignment as shown in Figure 7. Modules of higher
priority signal release of a bus via its "daisy-
chain," which then activates the hardware necessary
to allow bus access within modules of lower prior-
ity. Thus spare modules can gain access to a bus
whose controlling HLM has failed, or if a bus
fails, another bus can be shared between two con-
trollers.	 The individual buses are physically
independent; each has its own set of hardware bus
access control circuits and a daisy-chain for
priority assignments. Therefore, no central bus
system controller exists as a potential catastroph-
ic failure mechanism. Similarly, there is no com-
mon clock. Each bus uses clock signals generated
by the transmitting module. The number of
buses within the IBS may be selected to meet re-
quirements of data throughput and redundancy. The
concept facilitates reconfiguring throughout a
mission if failures occur. 	 In the extreme, a
single remaining bus can support essential
functions.
The UDS design is oriented toward removing
"hard core" items whose failure can cause cata-
strophic system failure. Intercommunications and
clocks have often presented significant problems
in this area. These are dealt with in the UDS in
the following ways. The buses are made independ-
ent to avoid any common failure mechanisms. Each
HLM or TM uses its own internal clock, and the
buses use the clock of whatever module is trans-
mitting. With independent clocks in each computer
module there is also a distributed mechanism for
protecting against failure of the common RTI. If
two or more independent RTI signals are generated,
each module can decide whether the RTI is correct
by comparing this signal with its own internal
clock. If an RTI generator fails, the modules
will automatically switch to a backup, and if an
individual computer clock fails, damage is
contained to the faulty module.
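The RTI check described above might look like the following sketch, in which a module measures each RTI interval in ticks of its own clock. The tolerance window, the tick units, and the function names are assumptions for illustration; the report does not state how the comparison is made.

```python
def rti_ok(interval_ticks, nominal_ticks, tolerance_ticks):
    """Accept an RTI interval that lies within the assumed tolerance window."""
    return abs(interval_ticks - nominal_ticks) <= tolerance_ticks


def select_rti(observed, nominal_ticks, tolerance_ticks):
    """Pick the first RTI source whose measured interval looks correct.

    observed: measured intervals in local clock ticks, primary source first;
    None means no pulse was seen from that source.
    Returns the index of the chosen source, or None if all look wrong.
    """
    for index, interval in enumerate(observed):
        if interval is not None and rti_ok(interval, nominal_ticks, tolerance_ticks):
            return index
    return None
```

A result of None would suggest that the module's own clock, rather than every RTI generator, is at fault, which matches the containment argument in the text.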
Each SCCM, as previously described, is capa-
ble of detecting its own faults concurrent with
normal operation and will disable itself upon
detecting a permanent internal fault before dam-
aged information can propagate beyond the faulty
SCCM. This allows automated recovery to be imple-
mented with backup spares.
Fault-Tolerant Configurations
A number of fault-tolerant configurations can
be derived from the UDS architecture using SCCMs
as HLMs and TMs. Several are described below.
a) Standby Redundant Uniprocessor. This
configuration consists of a set of HLMs. One HLM is
designated the "primary" module, another is
designated a "hot" spare, and other HLMs may be employed as
additional spares which are maintained in a
dormant state. The primary module performs the
required computations, generates outputs, and
collects inputs for both itself and the "hot" spare
module via the intercommunications bus system.
The "hot" spare performs identical computations
to the primary module but does not produce outputs.
At each Real-Time Interrupt these two modules
perform cross-checks via the bus to see if the
other is functioning properly. Failure of either
module triggers recovery by the "good" module.
Should either active module detect an internal
fault and disable its outputs, the other machine
continues the ongoing computations. At a later
scheduled time, the surviving machine diagnoses
the faulty module and activates a new "hot" spare
if additional spares are available. Under special
conditions two or more "hot" spares can be
assigned to back up the primary module and provide a
greater degree of redundant protection.
b) Voted Uniprocessors. For applications
demanding extremely high reliability, three HLMs can be
assigned to perform identical computations, and
each communicates with peripheral devices with a
separate dedicated bus. Outputs are voted in the
peripheral devices. Backup spare HLMs may
optionally be employed to provide hybrid redundancy.
This type of configuration provides voting in
addition to the self-checking features of the
HLMs. It has primarily been applied to systems in
which the cost of failure is clearly unacceptable,
such as flight control of commercial aircraft and
the Space Shuttle.
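Output voting in a peripheral device can be sketched as a simple majority vote. The function name and return convention are hypothetical, and real voters in such systems are hardware circuits; this only illustrates the rule being applied.

```python
from collections import Counter


def vote(outputs):
    """Majority-vote a non-empty list of module outputs.

    Returns (value, dissenters): the majority value and the indexes of
    modules that disagreed with it. If no value has a strict majority,
    value is None and every module is listed as a dissenter.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        return None, list(range(len(outputs)))
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters
```

The dissenter list is what a hybrid-redundant system would use to decide which module to replace with a spare.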
c) Distributed Systems. Since the computer mod-
ules in a distributed system are performing a var-
iety of different functions, the requirements for
fault recovery vary in the different modules of
the system. Some functions are deemed more criti-
cal than others and redundancy is employed in a
selective fashion. Some computer modules,
embedded in fully redundant subsystems, may not
require fault recovery. Others performing critical
functions may be very heavily protected.
The HLM which serves as the system executive,
and in some cases other HLMs, must survive a fault
without interruption in computations. These
critical modules are protected by standby redundancy
with "hot" spares as described above. There are
often other HLMs assigned to functions which are
deemed noncritical and which can tolerate a program
interruption of up to several seconds. These HLMs
do not have dedicated "hot" spares. When one of
these modules develops an internal fault, the
system executive HLM replaces the faulty module with
a "blank" spare, loads the spare's memory, and
restarts the internal programs. Since the HLMs have
nondedicated connections, a common set of spare
modules can back up a number of HLMs performing
different critical and noncritical computations.
The Terminal Modules are attached by a number
of wires to the specific subsystem in which they
are located so they must have dedicated spares
which are also connected to the same subsystem.
The number of spares is determined by the
criticality and failure rate of its associated subsystem.
Since a Terminal Module does not have the ability
to initiate a bus communication, it can only halt
and signal an error. Recognition of a failed TM
and commands for reconfiguration are the
responsibility of a High-Level Module.
Each controlling HLM is responsible for
polling the various modules under its control to
determine if a module has isolated itself due to a
failure. This polling process can be carried out
nearly automatically, using the bus system every
few (10-100) milliseconds. (Bus system failures
are determined by rerouting suspect messages
through a different HLM-bus combination.) The HLM
is then responsible for issuing the reload and
reconfiguration commands necessary to replace a
faulty module with a spare or reinitialize a
transient-disturbed module.
These commands are sent to the various UDS
modules through one of several redundant Bus
Adaptors, which provide the following functions
with respect to their associated HLM or TM: 1)
power or unpower the internal computer, 2) load or
read out memory, 3) halt or start the processor at
specified locations, 4) sample error status from
the BBs, and 5) send reconfiguration commands to
the BBs. Since there are several Bus Adaptors in
each UDS module which are connected to independent
bus systems, there are redundant paths for carrying
out reconfiguration. The Bus Adaptors are powered
at all times.
IV. COST AND EFFECTIVENESS OF BUILDING-BLOCK
FAULT-TOLERANT COMPUTERS
In order to establish the cost and effectiveness
of fault-tolerant SCCM configurations, it is
first necessary to examine the properties of
individual SCCM modules. The user determines how many
are used for a given system and the number of
spares which are to be employed, and thus his cost
and reliability results are quite system-specific.
Appendix I is a characterization of the cost,
weight, power, volume, and reliability of a single
SCCM. Also included is a hypothetical baseline
computer module which does not have any of the
error detection and memory-fault correction
capabilities of the SCCM. (This baseline computer is
presented so that the relative increase in cost
for the SCCM can be determined.) The cost increase
is what one pays for increased reliability and
fault tolerance.
The estimates in Appendix I are made for an
SCCM containing 1) two 16-bit microprocessors, 2)
32,000 words of RAM, 3) three bus interface
building blocks, 4) one Core building block, and
5) one Memory Interface building block. This
configuration corresponds to a High-Level Module
previously described. (A Terminal Module would
typically contain one less Bus Interface and
additional I/O building blocks and is of similar
complexity.) The technology is assumed to be
hi-rel CMOS/SOS circuits with up to 10,000 gates on a
single chip. The results are summarized below in
Table 2.
                       Cost         Power        Weight       Volume
Baseline
Computer Module...     [illegible]  [illegible]  [illegible]  [illegible]
Self-Checking
Computer Module...     $13.6K       [illegible]  [illegible]  [illegible]
Table 2. Computer Module Physical Properties
Absolute estimates of SCCM properties are, at
best, an approximation since the VLSI devices have
not been developed. Our basis of estimation lies
in projection of past and current experience. The
relative estimates between the properties of the
Baseline and SCCM modules should be more accurate,
since they are less dependent upon specific
assumptions. In a relative sense, the SCCM is [illegible] to
50% more expensive, more power consuming, and more
voluminous than a nonredundant baseline module.
This represents the "front-end" [illegible] which is
paid in order to reduce [illegible]
costs.
[Figure 8 plot: reliability curves for the nonredundant baseline computer, a single SCCM, and an SCCM backed up by a spare; axis values not recoverable.]
Figure 8. Reliability Improvement Using SCCMs
Figure 8 shows the comparative reliability of
the nonredundant baseline computer, a single SCCM,
and a duplex pair consisting of an SCCM backed up
by a spare.
If a large number of nonredundant computers
are deployed in the field, the user can expect a
large percentage of them to fail in a year's time.
He must then support maintenance personnel and
their associated test equipment to correct these
failures. At an extra 50% initial investment,
SCCMs can be employed which, without backup spares,
reduce this failure rate by a factor between 2 and
3. At 3 times the price (an extra $10-$20K per
computer) he can employ backup spares and reduce
the failure rate 10 to 20 times. Additional spares
can be employed to reduce in-field failures to
extremely rare occurrences.
P69wv f'r+nap
Through the use of LSI, the cost of hardware
has dropped to the point where a fault-tolerant
computer costs much less than a non-fault-tolerant
computer did only a few years ago. Using LSI
circuits, the intrinsic reliability of computer
systems has improved greatly, but not enough to
provide fault-free operation. This is achieved by
fault-tolerant, i.e. self-repairing, architectures
which offer fault-free operation of a year or more
with current technology. The current cost of
fault tolerance is on the order of an extra
$10,000-20,000 per computer. This is largely due
to the cost of the additional high reliability
parts which are many times more expensive than
commercial quality devices. Current costs may be
significantly reduced in two ways: 1) the use of
self-testing logic should make them much more
easily tested, thus reducing a large production
cost, and 2) using fault tolerance, less component
screening is required since an occasional failure
is automatically corrected.
VLSI circuit development is not static and by
the mid-1980s there should be major improvements
in this fault-tolerant computer technology. For
example, we can expect an SCCM to be packaged on
one or two chips at a cost of less than $1000.
Components can be expected to be several times
more reliable, producing an equivalent increase in
the reliable life of the fault-tolerant
configurations. One can project conservatively that
fault-tolerant machines can be built in a few years
which provide 5-10 years of maintenance-free
operation at less than $1,000 in increased costs.
V. CONCLUSIONS
This study has resulted in the definition and
preliminary design of a small set of VLSI
building-block circuits, which, if implemented, would allow
the user to construct fault-tolerant computing
systems out of existing processors in a
straightforward fashion. VLSI circuitry is the key to
this implementation. The redundant circuitry
required for fault tolerance, which was once
expensive, can now be obtained in small inexpensive
packages.
The VLSI building-block circuits are used to
construct small (SCCM) computers which are capable
of detecting their own internal faults (and
correcting some faults which are most likely to
occur). These Self-Checking Computer Modules can
then be combined with backup spares to provide
automatic recovery from faults. A standard busing
system allows these SCCMs to be connected into a
variety of fault-tolerant distributed computing
networks.
The ultimate goal of this work is maintenance-
free systems. By providing an ultrareliable core
computing system, it can act as an automated re-
pairman, to provide automated testing and redun-
dancy management in complex systems.
The next phase of this effort will be to do a
detailed logic design of the building-block
circuits and implement a breadboard consisting of
several self-checking building-block computers
interconnected by the redundant bus system. This
step will be completed in 1979 and will provide the
detailed experience necessary for a subsequent VLSI
implementation of the building-block circuits.
When this is accomplished, we hope that the
system engineer will have in hand the tools for the
immediate and routine use of fault-tolerant
computing. This building-block development has a near
term goal of developing the enabling technology
(circuits) for the off-the-shelf use of
fault-tolerant computers. It is a first step in a long
term goal of using the increasing density and
reliability of new VLSI circuit technologies to
provide fault-free computing at lower and lower costs.
APPENDIX I.
CHARACTERIZATION OF BASELINE (NONREDUNDANT)
AND SCCM COMPUTER MODULES
There are two approaches to evaluating the
fault-tolerant building-block systems which are
of value to the potential user. First, there is
the absolute approach: "How much does it cost and
what are its reliability characteristics?" And
secondly, there is the relative measure: "How much
more does it cost than an equivalent nonredundant
structure, and what is the relative improvement in
reliability?" Both approaches will be examined.
The cost and reliability of each computer
system is highly dependent on the number of integrated
circuits which are required for implementing the
building blocks. Two levels of integration are
considered in estimating cost and reliability.
These correspond to the current and projected
state-of-the-art in device and technology as
listed below:
1. SSI/MSI - Current breadboard technology
2. 10,000 gates/chip - Next generation custom VLSI
It should be recognized that absolute cost and
reliability estimation results, at best, in a rough
have not been developed, and thus our basis of
estimation lies in a projection of past and cur-
rent experience.
The following is an examination of the
characteristics of a Building-Block Self-Checking
Computer Module. These computers make up the
basic modules from which fault-tolerant networks
are constructed. In order to evaluate the
additional costs of self-checking building-block
circuitry, estimates are also included for a
nonredundant computer module of similar capability
and constructed out of similar technology.
The parameters to be studied are power,
weight, volume, cost, and reliability. Since the
computer circuits have yet to be constructed,
estimates of these parameters must be regarded as only
approximations projected from current experience.
Underlying assumptions are given in the following
discussion in order to aid the reader in evaluating
the validity of the results.
.c'.	 ..J:Jtiop"4 Ci,	 nt8
Table A-1 examines the logical complexity of
the various component elements which make up 1) a
building-block self-checking computer, or 2) an
equivalent nonredundant machine. Complexity is
determined in terms of equivalent logic gates
(e.g. 1 RAM cell = 1 gate, 1 ROM cell = 1/4 gate,
1 bit of parallel-serial shift register = 6 gates)
for each element which makes up the computers. It
is assumed that each computer has a memory of
32,000 16-bit words. (This is admittedly small
for VLSI implementations, but it is expected that
memory development will not proceed as rapidly in
low power, hardenable technologies required by the
military.) An estimated component count is tabu-
lated for each of two implementations. The first
assumes that all circuitry surrounding the CPU and
Memory will be constructed out of existing SSI and
MSI circuits. The second implementation assumes a
VLSI technology with on the order of 10,000 gates
per chip.
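The equivalent-gate conversion quoted above is simple arithmetic, sketched here for convenience. The conversion factors come from the text; the function name, its parameters, and the sample inputs in the note below are illustrative choices.

```python
from fractions import Fraction

# Conversion factors stated in the text.
GATES_PER_RAM_BIT = Fraction(1)       # 1 RAM cell = 1 gate
GATES_PER_ROM_BIT = Fraction(1, 4)    # 1 ROM cell = 1/4 gate
GATES_PER_SHIFT_BIT = Fraction(6)     # 1 shift-register bit = 6 gates


def equivalent_gates(logic_gates=0, ram_bits=0, rom_bits=0, shift_bits=0):
    """Return the complexity of an element in equivalent logic gates."""
    return (logic_gates
            + ram_bits * GATES_PER_RAM_BIT
            + rom_bits * GATES_PER_ROM_BIT
            + shift_bits * GATES_PER_SHIFT_BIT)
```

For instance, 12,000 bits of ROM count as 3,000 equivalent gates, and a memory of 32,000 16-bit words counts as 512,000.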
Table A-2a is a tabulation of the component
requirements for self-checking high-level and
terminal (computer) modules along with equivalent
nonredundant computers.
Two patterns become obvious in these results.
First, as the level of integration is increased in
the building-block circuits, memory becomes the
dominant cost in a number of devices. Second, as
                                     COMPLEXITY            IMPLEMENTATION
ELEMENT                   QUANTITY   (equivalent gates)    SSI-MSI              VLSI (10^4 g/chip)

1. BUILDING-BLOCK, SELF-CHECKING COMPUTER
a. BUS INTERFACE          2-4        2500-3000 gates       200 Chips            1 VLSI
                                     12,000 Bits ROM       3 ROMs (512x8)       (128 pin pkg.)
b. CPU                    2          8000 gates            2 CPU                2 CPU
c. CORE BUILDING BLOCK    1          1000 gates            150 Chips            1 VLSI
d. MEMORY INTERFACE       n          1700 gates            150 Chips            1 VLSI
   BUILDING BLOCK
   (for 32K module)
e. MEMORY                 32Kx16     1 gate/bit            184 RAMs (@ 4Kx1)    23 RAMs (@ 32Kx1)
f. I/O BUILDING BLOCKS    k          100-600 gates each    10-40 Chips          1 VLSI for 3 or 4
                                     (10 types)            (each type)          I/O functions
                                                                                concurrently

2. NONREDUNDANT COMPUTER - of similar technology and capability
a. BUS INTERFACE          2-4        1700-2000 gates       130 Chips            1 VLSI
                                     10,000 Bits ROM       3 ROMs (512x8)
b. CPU                    1          8000 gates            1 CPU                1 CPU
c. MEMORY                 32Kx16     1 gate/bit            136 RAMs (@ 4Kx1)    17 RAMs (@ 32Kx1)
d. BUS ARBITER            1          100 gates             16 Chips             Included in VLSI
                                                                                Bus Interface
e. I/O STANDARD CIRCUITS  k          75-400 gates each     8-32 Chips each      1 VLSI for several
                                     (10 types)                                 functions (1 type)

Extra discrete logic folded into custom LSI circuits. Common elements are clock, power supply,
and bus drivers.
Table A-1. Complexity of Component Elements
the level of integration is increased, the relative
cost of a self-checking computer (over a nonredun-
dant computer) decreases dramatically.
If memory fault detection (but not correction)
is employed in the self-checking computer, the cost
of self-checking is negligible over the nonredun-
dant machine. If error correction is employed in
the memory of the self-checking computer (an addi-
tional feature to improve reliability), the incre-
ment in cost between machines is primarily related
to the cost of SEC/DED codes.
Several authors have demonstrated that self-checking
circuitry is relatively inexpensive if used. One
reason is that system partitioning, i.e.,
partitioning of a computer into VLSI chips, does
not occur on the basis of gate count alone. It is
based on separation of computing functions (I/O,
Processor, Memory, Intercommunications), which
results in breaking up the computer at points where
the number of interconnections and the control
interfaces are reasonable. When partitioning is
completed, the VLSI chips are seldom used to full
capacity, and enough area remains to implement
checking circuits at negligible cost on each chip.

A "TYPICAL" BUILDING-BLOCK COMPUTER MODULE

COMPONENT               NUMBER   AREA (in²)   COST ($K, ICs only)

1. SSI/MSI VERSION (14 lbs):
   SSI/MSI                 900        300         31.5
   ROM                       9          7          1.4
   CPU                       2          3          0.4
   RAM                     184        138         36.8
   LSI                       0          0          0
   Total (ICs)            1095        448         70.1
   Board & fabrication*                           20.1
   Module total                                   90.1K

2. VLSI VERSION (1.4 lbs):
   SSI/MSI                   0          0          0
   ROM                       0          0          0
   CPU                       2          3          0.4
   RAM                      23         17          9.2
   VLSI                      5         25          2
   Total (ICs)              30         45         11.6
   Board & fabrication*                            2.0
   Module total                                   13.6

* Board & Fabrication Costs

Table A-2a. Characteristics of Computer Modules

1. VLSI                 123-pin pkg.      5 in²
2. LSI                  64-pin pkg.       1.5 in²
3. ROM, RAM, some MSI   20-24 pin pkg.    0.75 in²
4. SSI                  14-18 pin pkg.    0.25 in²
(avg. area for SSI/MSI = 1/3 in²)

Table A-2b. Packaging Density

The methodology for estimating packaging
density is to assume the use of flat packs on
multilayer boards. The area for each circuit type
is shown in Table A-2b, and the resulting board
area is computed for each of the computers
being examined. The circuit boards require
about 1/2 inch of depth, and thus volume is
estimated at 0.5 times the circuit area requirements.
Cost is determined on the basis of three
items--parts costs, board costs, and assembly
costs--as shown in Table A-2c. Assembly costs
are related to board area, which is in turn
related to the number of pins to be soldered. From
our experience, a technician can assemble and
inspect, on average, one square inch per hour.
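
The estimation procedure described above can be sketched as a short calculation. This is an illustration, not the study's own tooling: the per-package areas and unit costs are transcribed from Tables A-2b and A-2c, the parts list is the SSI/MSI version of the building-block module from Table A-2a, and the dictionary keys and function name are hypothetical. It also assumes the CPU is packaged and priced as an LSI part (64-pin, $200), which is consistent with the CPU rows of Table A-2a but not stated outright.

```python
# Figures transcribed from Table A-2b (area per package) and Table A-2c (cost model).
AREA_IN2 = {"SSI/MSI": 1 / 3, "ROM": 0.75, "RAM": 0.75, "LSI": 1.5, "VLSI": 5.0}
CHIP_COST = {"SSI/MSI": 35, "ROM": 150, "RAM_4K": 200, "RAM_32K": 400,
             "LSI": 200, "VLSI": 400}
BOARD_COST_PER_IN2 = 25      # multilayer board, $/in^2
ASSEMBLY_COST_PER_IN2 = 20   # assembly and inspection, $/in^2
BOARD_DEPTH_IN = 0.5         # boards need about 1/2 inch of depth


def estimate(parts):
    """Estimate area, volume, and cost for a list of (package, price_key, count)."""
    area = sum(AREA_IN2[pkg] * count for pkg, _, count in parts)
    ic_cost = sum(CHIP_COST[price] * count for _, price, count in parts)
    board_cost = area * (BOARD_COST_PER_IN2 + ASSEMBLY_COST_PER_IN2)
    return {
        "area_in2": area,
        "volume_in3": area * BOARD_DEPTH_IN,
        "ic_cost": ic_cost,
        "board_cost": board_cost,
        "total_cost": ic_cost + board_cost,
    }


# SSI/MSI version of the building-block module (counts from Table A-2a).
# Assumption: each CPU is treated as one LSI package.
ssi_msi_module = [
    ("SSI/MSI", "SSI/MSI", 900),
    ("ROM", "ROM", 9),
    ("LSI", "LSI", 2),
    ("RAM", "RAM_4K", 184),
]

result = estimate(ssi_msi_module)
for name, value in result.items():
    print(f"{name}: {value:,.1f}")
```

The result comes to roughly 448 in² of board area, about $70.1K in ICs, and about $20.1K in board and assembly costs, which agrees with the SSI/MSI entries of Table A-2a to within rounding.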

NONREDUNDANT COMPUTER MODULE OF
SIMILAR CAPABILITY AND TECHNOLOGY

COMPONENT               NUMBER   AREA (in²)   COST ($K)

1. SSI/MSI VERSION (8 lbs):
   SSI/MSI                 406        135         14.2
   ROM                       9          7          1.4
   CPU                       1          2          0.2
   RAM                     136        102         27.2
   LSI                       0          0          0
   Total (ICs)             552        247         43.0
   Board & fabrication*                           11.1
   Module total                                   54.1K

2. VLSI VERSION (1.1 lbs):
   SSI/MSI                   0          0          0
   ROM                       0          0          0
   CPU                       1          2          0.2
   RAM                      17         13          5.2
   VLSI                      4         20          1.6
   Total (ICs)              22         35          7.0
   Board & fabrication*                            1.6
   Module total                                    8.6

* Board & Fabrication Costs

Table A-2a (continued). Characteristics of Computer Modules

VLSI                      $400/chip
LSI                       $200/chip
RAM                       $200/chip (4Kx1, or 8Kx1)
RAM                       $400/chip (32Kx1)
ROM                       $150/chip (512x8)
SSI/MSI                   $35/chip, avg.
Multilayer board          $25/in²
Assembly and Inspection   $20/in²

Table A-2c. Cost Model

Note: These estimates do not include the power supply, which is estimated at 4 lbs, 102.5 in³,
$10,000, and 75% efficiency.
