SUMC fault tolerant computer system by unknown
  
 
 
N O T I C E 
 
THIS DOCUMENT HAS BEEN REPRODUCED FROM 
MICROFICHE. ALTHOUGH IT IS RECOGNIZED THAT 
CERTAIN PORTIONS ARE ILLEGIBLE, IT IS BEING RELEASED 
IN THE INTEREST OF MAKING AVAILABLE AS MUCH 
INFORMATION AS POSSIBLE 
https://ntrs.nasa.gov/search.jsp?R=19800025603 2020-03-21T15:37:52+00:00Z
(NASA-CR-161581) SUMC FAULT TOLERANT COMPUTER SYSTEM (NASA) 74 p HC A04/MF A01 CSCL 09B
N80-34111
Unclas
G3/60 29013
SUMC Fault Tolerant Computer System
Final Report
for
Contract NAS8-31747
September 10, 1980
Prepared For:
The Marshall Space Flight Center
Huntsville, Alabama
Section
1.0 INTRODUCTION
2.0 TRADE STUDIES
2.1 Configurations
2.2 Redundancy Management Unit (RMU) Concept
2.3 Interfaces
2.4 FTM Control Strategy
2.5 Storage Address Expansion
3.0 IMPLEMENTATION
3.1 Fault Tolerant Memory
3.1.1 Storage Array
3.1.2 Translator
3.1.2.1 Parity Trees
3.1.2.1.1 Check Bit Matrix
3.1.2.1.2 Error Detection and Location
3.1.2.1.3 Self-Testing
3.1.2.2 Corrector
3.1.2.3 Error Analysis
3.1.2.4 Command and Status Registers
3.1.2.5 Spare Assignment Register
3.1.3 Error Correction Algorithms
3.1.3.1 Error Location
3.1.3.2 Address Tally
3.1.3.3 Fault Tally
3.1.3.4 Reconfigure
3.1.3.5 Test/Copy/Correct
3.2 FTM System Management
3.3 Implementation of Address Extension
Figure
2-1 Simplified SUMC-IIC
2-2 Trade-Off Configurations
2-3 Dual Port Configurations 1 and 2
2-4 Dual Port Configuration 4
2-5 Dual Port Configurations 5 and 6
2-6 SUMC-IIC Configuration
2-7 Cross-strapping with no Single Failure Modes
2-8 Simple Cross-strapping
2-9 Extended Memory Segmenting
2-10 Flow Diagram of the S/360 Effective Address Calculation
2-11 SUMC-IIC Block Diagram
2-12 Sectored Memory Implementation
2-13 Double Precision EA Microprogram Hardware
2-14 Data Path Extension
3-1 Storage Array Organization
3-2 Translator Functional Block Diagram
3-3 Parity Check Matrix for 16 Data Bits
3-4 Functional Parity Tree Representation
3-5 Correction Decoder
3-6 Error Analysis
3-7 Command 'NO-OP' Command Structure
3-8 Translator Error/Status Word
3-9 Diagnose Repair Flow
3-10 RMU Status Codes
3-11 Form Fault Word
3-12 Address Fault Tally
3-13 Fault Tally Routine
3-14 Reconfigure Routine
3-15 Test/Copy/Correct Routine
3-16 Multiple Error Correction Algorithm
3-17 Extended Addressing - Data Path Extension

Table
2-1 Configuration Trade Data
2-2 Summary of Address Expansion Trades
3-1 FTM Diagnostic Execution
3-2 Control Equations
1.0 INTRODUCTION
This report presents the results of the trade studies conducted on
contract number NAS 8-31747. These trades cover: establishing the
basic configuration, establishing the CPU/memory configuration,
establishing an approach to crosstrapping interfaces, defining the
requirements of the redundancy management unit (RMU), establishing a
spare plane switching strategy for the fault-tolerant memory (FTM),
and identifying the most cost effective way of extending the memory
addressing capability beyond the 64K bytes (K=1024) of SUMC-II B.
The results of the design are compiled in the Contract End Item (CEI)
Specification for the NASA Standard Spacecraft Computer II (NSSC-II),
IBM 7934507. The report also presents, in Section 3, the
implementation of the FTM and memory address expansion. The scope of
the original contract was reduced so that the IOUs and RMU were not
designed.
2.0 TRADE STUDIES
2.1 CONFIGURATIONS
The first item requiring resolution in the SUMC-II C is the basic
configuration. The objectives and constraints of the program are
listed below:

o  Provide flexibility to be able to meet varying levels of
   redundancy.
o  Make maximum use of hardware developed for the HTC/SUMC-II B.
o  Use error correcting memory techniques since they provide the
   most reliability for the least amount of hardware duplication.
o  Use spare memory planes and a bit-plane organized memory.
o  Use a fault-tolerant redundancy management unit (RMU) to control
   the redundant elements of the system.

Providing redundancy within the CPU or IOU would require development
of completely new units and would not represent a cost effective way
of providing reliable systems. Configurations were limited to CPUs
and IOUs as "stand alone" units. The CPU to memory interface is a
significant part of the study. The top level system configuration is
shown in Figure 2-1. Subsequent sections of this report will develop
this general concept into a working approach to the SUMC-II C.

[FIGURE 2-1. SIMPLIFIED SUMC-IIC (block labeled "CPUs and memories (configuration unknown)")]
Over a period of time a number of candidate subsystems were defined
for the CPU/memory portion of the SUMC-II C. These configurations
are shown in Figure 2-2 and are explained below:

o  Configuration 1 consists of two CPUs sharing a common fault
   tolerant memory built from existing Main Memory Unit (MMU)
   hardware. No attempt is made in this configuration to eliminate
   single point failures in the memory. A new design is required for
   the power supply and the dual port interfaces.
o  Configuration 2 consists of two CPUs sharing a common fault
   tolerant memory. This memory is an extensively redesigned MMU and
   contains no single point failures. A new redundant power supply
   would be required for this design. A new design would be required
   for the dual port interfaces.
o  Configuration 3 consists of two CPUs, each having a fault
   tolerant MMU memory. The computers will be used in a redundant
   manner and no new design is required.
o  Configuration 4 consists of two CPUs, each having a dedicated
   memory except that the dual port interfaces are used to provide
   cross-strapping of memory write when both systems are powered on.
   Each computer supplies power to its dedicated memory and reads
   only from its memory. The cross-strapping is provided for rapid
   information transfer between memories during CPU switchover. This
   approach requires a new design for the dual port interfaces but
   no new power supply design.
o  Configuration 5 consists of two CPUs cross-strapped to two fault
   tolerant MMU memories. Both memories share a redundant power
   supply and remain in a powered on state. Either CPU can read from
   either memory but always writes both. The dual port design is
   used for cross-strapping and the memory read selection is
   delegated to the Redundancy Management Unit (RMU). A redundant
   power supply and interface designs are required for this
   configuration.
o  Configuration 6 is the same as configuration 5 except that each
   memory has its own power supply which can be powered up or down
   by the RMU to save power. A new power supply and new interface
   designs are required.

[FIGURE 2-2. TRADE-OFF CONFIGURATIONS]
All configurations use commonly designed interface boards which can
be populated/depopulated and/or jumpered to fit each requirement.
Figures 2-3 through 2-5 show the interfaces except for the clock
switching circuitry. Two CPU interfaces (CPA, CPB) and three
memory interfaces (MPA, MPB, MPC) are shown.

The trade data derived for the six configurations is summarized in
Table 2-1. Based on that data, configuration three was selected.
The slight increase in reliability of six over that of three does
not justify the major difference in development cost. The resultant
configuration is shown in Figure 2-6.
[FIGURE 2-3. DUAL PORT CONFIGURATIONS 1 AND 2]
[FIGURE 2-4. DUAL PORT CONFIGURATION 4]
[FIGURE 2-5. DUAL PORT CONFIGURATIONS 5 AND 6]
[TABLE 2-1. CONFIGURATION TRADE DATA (entries illegible in the microfiche reproduction)]
[FIGURE 2-6. SUMC-IIC CONFIGURATION]
2.2 REDUNDANCY MANAGEMENT UNIT (RMU) CONCEPT
Optimization of the RMU must be done at the system level and is,
therefore, application dependent. Several major items which are
application dependent are: the time allowed for system recovery,
the need for a degraded mode of operation, the availability of
program load capability, the degree of operator participation vs.
automatic operation, the need for complex interface safing, the
need for selecting from among multiple prime power sources, and the
need to store recovery parameters for transferring control from one
computer to another.

Since mission success is dependent upon the operation of the RMU,
it must be essentially failure proof. If the RMU performs many
functions, then its simplex implementation requires significant
amounts of hardware and it is both difficult and costly to achieve
the required reliability.

Effective management of system redundancy can involve many
activities. A list follows which represents candidate features which
might be included in an RMU:

o  Comparators for checking the outputs of multiple computers. This
   could be required for instantaneous recovery or where the
   consequence of an undetected error is catastrophic.
o  Storage of present status parameters for assistance in
   restarting after a detected failure.
o  Safing all critical spacecraft functions which are under control
   of the computer.
o  Check "status" signals from the computer to assure that all
   critical program segments were executed and in the correct
   sequence.
o  Gather status and send it to operator/monitor personnel in the
   spacecraft or on the ground.
o  Provide auxiliary storage for error and status logging, program
   reload, and diagnostic program storage.
o  Provide command facilities for remote (probably ground)
   personnel to override the automatic redundancy management
   features.
o  Provide an independent "Watchdog" timer to detect complete loss
   of computer operations or excessive time in completing a phase
   of operations.
o  Provide highly redundant (fail-safe) crosstrapping between/among
   redundant system elements.
o  Control application of power to each redundant element of the
   system.
o  Enable/Disable computer I/O.
Since high level redundancy is required in the RMU to meet the
necessary failure tolerance, even small amounts of functional
hardware can have significant impacts on power, weight, etc.
Therefore, two precepts are proposed for the design of the RMU:

1) Don't include any function which can be performed outside the
   RMU.
2) Make provisions for but don't include functions which are not
   required by all applications.

Applying these rules to the candidate list gives the following
results:
o  Accommodation should be made for the future addition of
   comparators but don't include them.
o  If required for rapid recovery on a specific mission, a few
   restart parameters might be included. A large number should be
   put in a system storage device and not included in the RMU. The
   basic RMU should not include restart storage.
o  By putting all I/O into a predetermined state when system
   confidence has been lost, proper system design can ensure the
   spacecraft to be safe at all times.
o  Status checking and proper program sequencing are important;
   however, they should be handled by software checking, not the
   RMU.
o  General status collection should be done by the operating
   computer which can send it to the ground via telemetry if
   desired.
o  Auxiliary storage should be provided as a system function, not
   by the RMU.
o  The ability to use human intervention to override the RMU is
   desirable and should be provided both for test purposes and as a
   "last ditch" precaution.
o  A "watchdog timer" is an essential part of any RMU.
o  Crosstrapping should only be provided in the RMU when required
   by the application and there is no other place for it.
o  The RMU should control power to each unit, although the power
   switches themselves need not be a part of the RMU.
o  The RMU should have the ability to enable/disable outputs.
   Between the time that power is applied to a computer and the
   time the computer has initialized and performed self-test, the
   RMU should ensure that all I/O is disabled.

This results in a basic RMU which can be used "as-is" or can be
expanded to suit special requirements. The basic RMU features are
summarized below.

BASIC RMU SUMMARY
o  Watchdog Timer
o  Ground Commanded Override of RMU
o  Power Control to all Units
o  I/O Enable/Disable Capability
o  Growth for Special Missions
o  Fail Safe Implementation

To communicate with the RMU, the SUMC-II C CPU will use a dedicated
line, FLAG RMU, which will indicate that the output bus has data for
the RMU. FLAG RMU is just a pulse, so the RMU must store the data
on the bus or react quickly to it. In addition to an "I'M OK" code,
several codes are generated to signal operational status of the
microprogram handling the FTM analysis and spare plane switching.
This is explained more fully in Section 3.1.2.
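
The interaction between the FLAG RMU pulse and the watchdog timer can be pictured with a small model. This is only an illustrative sketch: the class name, the numeric value of the "I'M OK" code, and the tick-based timeout are assumptions made for the example and do not come from the RMU design, which this contract did not carry to implementation.

```python
class WatchdogRMU:
    """Toy model of the basic RMU watchdog; all names and codes are hypothetical."""

    IM_OK = 0x01  # assumed encoding of the "I'M OK" code

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.ticks_since_ok = 0
        self.latched_code = None
        self.switchover_requested = False

    def flag_rmu(self, bus_value):
        """FLAG RMU pulse: latch the bus value now, since the pulse does not persist."""
        self.latched_code = bus_value
        if bus_value == self.IM_OK:
            self.ticks_since_ok = 0

    def tick(self):
        """Advance the independent timer by one tick and report the timeout state."""
        self.ticks_since_ok += 1
        if self.ticks_since_ok >= self.timeout_ticks:
            self.switchover_requested = True
        return self.switchover_requested
```

In this sketch the computer must pulse FLAG RMU with the "I'M OK" code more often than every `timeout_ticks` ticks; any other code is latched for later inspection but does not reset the timer.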
2.3 INTERFACES
Crosstrapping and other redundancy considerations can impact the
design of unit-to-unit interfaces. If mission reliability or
application design groundrules will not tolerate single point
failures, the interfaces used must be carefully designed to prevent
a short in one circuit or wire from dragging down an entire
function and precluding operation of that function. The interface
shown in Figure 2-7 illustrates a design approach to eliminate
all single point failures in a crosstrapped interface.

The circuit of Figure 2-8 is much simpler than that of Figure 2-7
since all drivers and receivers are the ones which would normally
be provided in a simplex unit, if one precaution is taken in
selection of the three state driver. The driver must be selected
such that in its powered OFF state its output goes to the high
impedance (disabled) state.

[FIGURE 2-7. CROSS-STRAPPING WITH NO SINGLE FAILURE MODES]
[FIGURE 2-8. SIMPLE CROSS-STRAPPING (signal source, system cabling, signal receiver)]

The impact on systems reliability of the failure (in a shorted mode)
of one of the interfaces in Figure 2-8 is negligible. However,
the problem of implementing the interface of Figure 2-7 is
significant. This study, therefore, recommends the use of the
simpler interface since the more complex one can be added later for
any application which requires it.
2.4 FTM CONTROL STRATEGY
The fault-tolerant memory subsystem automatically corrects single
bit errors. This action is completely transparent to both the
computer user and the microprograms. The FTM subsystem does,
however, have control functions to control switching of spare memory
planes and also to perform testing of the memory. The controls are
a combination of hardware and microprogram. The strategy for
switching spare planes can have a major impact on the reliability
of the unit; however, exact techniques for calculating the effect
of switching strategy on reliability are not yet developed.

To understand the significance of the switching strategies, it is
necessary first to understand the types of failures, their impact
on operation, and their method of detection.
o  A "fault" is a hardware malfunction such that the equipment is
   not capable of doing everything it was designed to do, but it
   might not be causing any problem at the present time. For
   example, if a bit is unable to go to a 1 state but the correct
   current value is a zero, it isn't causing a problem; but it is
   still a fault.
o  An "error" is the process of the computer getting an improper
   result at the present time. For example, if a bit read from
   memory should be a 1 but it is being read as a 0, that is an
   error.
o  A "random fault" in the FTM is one which affects a single bit in
   a single word in memory.
o  A "systematic fault" is either one which affects the same bit in
   many words or many bits in the same word. The system has been
   designed, however, to nearly eliminate the ability of a single
   fault in memory altering the value of more than one bit in a
   word.
o  An "address fault" is a fault which causes a valid word to be
   stored in and retrieved from the wrong storage location.
o  A "transient store" error is where a transient condition in the
   memory caused a word to be stored with a bad bit in it. However,
   there is no "hard" fault.
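
The distinction between a fault and an error in the definitions above can be made concrete with a stuck-at cell model. The sketch below is illustrative only; the function names and the `stuck_at` parameter are hypothetical and are not part of the FTM design.

```python
def read_bit(stored_bit, stuck_at=None):
    """Return the bit a storage cell delivers.

    stuck_at models a hard fault: None means a healthy cell, 0 or 1 means the
    cell can only ever deliver that value.
    """
    return stored_bit if stuck_at is None else stuck_at


def is_error(stored_bit, stuck_at=None):
    """An error exists only when the delivered bit differs from the intended one."""
    return read_bit(stored_bit, stuck_at) != stored_bit


# A cell stuck at 0 is always a fault, but it produces an error only when a 1
# is supposed to be stored there.
print(is_error(0, stuck_at=0))  # fault present, no error
print(is_error(1, stuck_at=0))  # fault present, error on read
```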
Since the memory can correct single errors, there is not much
concern about the existence of words in storage which will give
single (correctable) errors when they are read. The most significant
problem with single errors is that they have the potential to
become double errors which cannot be corrected by the translator
hardware. Random faults are the most probable failure type in a
semiconductor memory and there is a low probability that two random
failures will occur in the same word. A memory containing n words
has a probability of about .5 that there is a double failure after
√n failures. For n=8K that is about 90 failures and for n=64K it is
about 256 failures. Therefore, random failures are predicted to be
a problem only if a very large number of them have occurred.
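
The failure counts quoted above (about 90 and about 256) are the square roots of 8K and 64K, which is the familiar "birthday problem" scaling. The sketch below checks that estimate numerically under one stated assumption: each random failure lands in a uniformly chosen word, independently of the others. The function names are hypothetical. Under that assumption the probability of a double failure after √n failures comes out near 0.39, and 0.5 is reached at roughly 1.2 √n failures, so the figure of "about .5" is best read as an order-of-magnitude estimate.

```python
import math


def p_double_failure(n_words, k_failures):
    """Probability that at least two of k random single-bit failures share a word."""
    p_all_distinct = 1.0
    for i in range(k_failures):
        p_all_distinct *= 1.0 - i / n_words
    return 1.0 - p_all_distinct


def failures_for_probability(n_words, target):
    """Smallest number of failures whose double-failure probability reaches target."""
    k = 0
    while p_double_failure(n_words, k) < target:
        k += 1
    return k


for n in (8 * 1024, 64 * 1024):
    k = round(math.sqrt(n))
    print(n, k, round(p_double_failure(n, k), 2), failures_for_probability(n, 0.5))
```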
Three strategies will be considered for switching out bad hardware,
to restore system operation or reestablish a high level of
redundancy. Each approach and its relative merits are discussed
below.

STRATEGY A
Reconfigure the system to eliminate the failed bit-plane every time
an error is encountered.

Advantages: This is a simple concept which could be easily
implemented in hardware without any modification of the current
CPU/Memory interface.

Disadvantages: This strategy wastes spare plane usage on random
errors which are amply taken care of by the basic correction code.
When spare planes are needed for systematic faults they will not
be available.

STRATEGY B
Reconfigure to switch out a bit-plane whenever it has been
determined, during normal operation, to contain a predetermined
number of faults, or it is the one which has the most faults whenever
a word containing a double error is encountered.
Advantages: This strategy focuses on the distinction between
systematic faults and random faults, and would significantly enhance
the long term mission reliability.

Disadvantages: Interruption of the operational program to log the
location of errors during program execution would have a significant
impact on computer operation. There is also a risk that areas of
the memory infrequently used because of mission structure could
accumulate multiple errors in words which, when they are needed,
cannot be practically reconstructed with the diagnostic decoding
algorithms, thus impacting the mission. Some memory performance
capacity would necessarily have to be allocated to the error
logging operation, especially if errors occur in frequently used
programs such as vehicle control loops.

STRATEGY C
Rely on the ECC capability during normal program execution, then
revert to a test mode whenever a double error is encountered in a
word, and at periodic intervals. The test mode would utilize the
diagnostic decoding algorithms and properties of the error code
to locate and log faults, verify that single errors signaled by
the translator are not triple errors, and provide the data for the
reconfiguration decision. Reconfiguration would only be performed
if a double fault or a systematic fault was detected.

Advantages: This strategy has minimum impact on the operating
program yet it utilizes the powerful diagnostic decoding techniques
to effectively attain the full potential of the bit-plane switching
capability. All memory locations would be tested frequently,
thus minimizing the likelihood of accumulations of many errors in
any word before detection of the errors.
Disadvantages: It is difficult to know how often to go into a self
test mode; however, this can be programmer controlled rather than
"built-in" to the system either in hardware or microcode.

SELECTION: Strategy A clearly does not provide an effective use of
the tremendous potential of the spare planes. Strategy B makes a
major improvement in the effectiveness of the spare planes, but
Strategy C makes several improvements over that of B. Since both
B and C require a significant amount of microprogram support, there
seems to be little difference in the cost of implementation.
Strategy C is selected. The microprogram to support the strategy
is described in Section 3.1.2 and the FTM hardware is described in
3.1.1.
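
The reconfiguration decision in Strategy C can be sketched as a tally over the fault log produced by the test mode. This is a hedged illustration and not the microprogram described later in the report: the log format, the `SYSTEMATIC_THRESHOLD` value, and the tie-breaking rule are assumptions chosen for the example. It replaces a plane only for a systematic fault or a double fault, and leaves isolated random faults to the correction code.

```python
from collections import Counter

SYSTEMATIC_THRESHOLD = 4  # hypothetical: faults in one plane treated as "systematic"


def planes_to_switch(fault_log):
    """Decide which bit planes to replace with spares.

    fault_log: list of (word_address, bit_plane) pairs found in test mode.
    Returns a sorted list of bit-plane numbers.
    """
    per_plane = Counter(plane for _, plane in fault_log)
    per_word = {}
    for addr, plane in fault_log:
        per_word.setdefault(addr, set()).add(plane)

    # Systematic faults: the same bit plane failing in many words.
    switch = {p for p, count in per_plane.items() if count >= SYSTEMATIC_THRESHOLD}

    # Double faults: retire the plane with the most faults until each word has
    # at most one faulty plane left, which the correction code can handle.
    for planes in per_word.values():
        remaining = planes - switch
        while len(remaining) >= 2:
            worst = max(remaining, key=lambda p: (per_plane[p], -p))
            switch.add(worst)
            remaining.discard(worst)
    return sorted(switch)
```

For example, a single random fault such as `[(0x10, 3)]` yields no reconfiguration, while two faulty planes in the same word cause the plane with the larger tally to be switched out.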
2.5 STORAGE ADDRESS EXPANSION
The basic HTC computer calculates storage addresses using 16-bit
arithmetic. Since the byte is the basic element addressed in the
computer, this results in 2^16 = 65,536 bytes maximum. Applications
which require more storage than this must either manage the
movement of data and programs in and out of the computer or provide
some means to expand memory beyond 64K bytes (K=1024).

Memory expansion involves several facets:

o  Generating and holding addresses beyond 16 bits.
o  Decoding the most significant bits of the address to form "page
   select" signals.
o  Driving the signals going to the additional memory.
o  Housing the additional memory.
o  Powering the additional memory.

The first item was the primary subject of the study, whereas the
other items were resolved during the implementation phase.
Memory addressing is from three basic sources: operand addresses
calculated within the CPU, next instruction addresses taken from
the program counter, and storage addresses taken from a device
over the direct memory access (DMA) interface. In the HTC all
address paths, registers and calculations are 16 bits. If the
memory is to be expanded beyond 64K bytes, the maximum memory
size must be determined. Discussions with MSFC personnel identified
that most foreseeable requirements could be met with 18-bit
addressing (256K bytes) and that 20-bit (1M bytes) would certainly
satisfy all requirements.
The next primary issue is to establish the basic approach to the
addressing of memory. Both aerospace and commercial computers
have been successfully applied using a "sectored" memory where the
CPU, I/O, etc. never has access to the full memory at one time.
This approach usually involves one or more sector registers to hold
the most significant part of the address while the CPU manipulates
only the least significant part of the address. With sectored
memory the most significant directly controllable address bit
usually identifies whether the lowest sector or the "current"
sector is being referenced. Sectoring is illustrated in Figure
2-9 assuming the computer has a basic addressing capability of
16 bits. The 15 least significant bits (LSBs) identify which byte,
of the 32K bytes in a sector, is being addressed. If the most
significant address bit (MSB) is a zero, the lowest sector is
implied. If the MSB is a one, the value stored in the sector
register is used to identify the desired sector. Thus continued
access is provided to the low sector of memory, and one other
32K-byte sector can be selected.

[FIGURE 2-9. EXTENDED MEMORY SEGMENTING]
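
The sectored mapping just described can be written out as a short function. This is a minimal sketch of the scheme as stated in the text (15 offset bits, MSB selecting between sector zero and the sector register); the function and parameter names are hypothetical.

```python
SECTOR_SIZE = 1 << 15  # 32K bytes, addressed by the 15 least significant bits


def physical_address(addr16, sector_register):
    """Map a 16-bit CPU address to an extended storage address.

    addr16: 16-bit address produced by the CPU.
    sector_register: value of the (hypothetical) register naming the "current" sector.
    """
    if not 0 <= addr16 <= 0xFFFF:
        raise ValueError("CPU address must fit in 16 bits")
    offset = addr16 & 0x7FFF   # byte within the 32K-byte sector
    msb = addr16 >> 15         # 0 selects the lowest sector, 1 the current sector
    sector = sector_register if msb else 0
    return sector * SECTOR_SIZE + offset


# With an 18-bit memory there are eight sectors; sector 7 ends at 0x3FFFF.
print(hex(physical_address(0xFFFF, 7)))
print(hex(physical_address(0x1234, 7)))  # MSB = 0, so the sector register is ignored
```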
The advantage of sectored addressing is that it minimizes the amount
of special addressing hardware required. There are, however, some
disadvantages with this approach to addressing:

o  The technique is not S/360 compatible.
o  Special instructions are required for loading and storing the
   sector values.
o  The programmer must be concerned with memory management as well
   as application programming.
o  If a program or data file resides in two different sectors, many
   changes in the sector register might be required.
o  The MOVE instruction cannot be used to move across sector
   boundaries except sector zero.
If sectored addressing is not to be used, address computations must
involve the full storage address (18 or 20 bits). Figure 2-10
shows a symbolic flow chart of the S/360 address calculation.
Provision must be made to provide all the identified additions in 18
or 20 bit arithmetic.

[FIGURE 2-10. FLOW DIAGRAM OF THE S/360 EFFECTIVE ADDRESS CALCULATION.
Instruction register fields: X (bits 12-15, RX format only), B (bits
16-19), D (bits 20-31); for SS format only, Bs (bits 32-35) and Ds
(bits 36-47). Notes: B and X are the numbers of the base and index
registers but (B) and (X) are their contents. P is any convenient
working register.]

There are three approaches to providing 18/20-bit addressing:

o  Use the HTC data path as-is and use double precision
   calculations (two passes through the ALU) to get the addresses.
o  Expand the data path to 32 bits for greater arithmetic
   performance and get the extended addressing free in the process.
o  Extend the data path two or four bits for address calculations
   only.

The double precision microprogramming of the effective address
calculation involves the least hardware but reduces machine
performance by stretching execution of all but the
register-to-register (RR) instructions. Going to a 32-bit data path
is a major change in implementation, increases power and weight, and
cannot be justified for the sole purpose of extending memory
addressing.

Regardless of the approach to address expansion, the following
changes to the HTC must be implemented. See the HTC Block
Diagram of Figure 2-11.

o  The program counter (PC) must be extended either as a counter
   (preferred) or with a sector register (poor design).
o  The CPU's storage address register (SAR) must be expanded.
o  The address multiplexer combining CPU and DMA addresses should
   be expanded.
o  The DMA address interface should be expanded (if the DMA is to
   have full access to memory).
o  A path must be provided from the PC to the data flow (for store
   PSW) and to the SAR for I-FETCH.

[FIGURE 2-11. SUMC-IIC BLOCK DIAGRAM]
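
The double precision approach discussed above can be illustrated by computing the S/360-style effective address, (B) + (X) + D, in two ways: directly at the extended width, and using only 16-bit additions with the carry propagated into a second pass. This is a sketch under stated assumptions and not the report's microprogram: the function names are hypothetical, registers are modeled as plain integers, and a register number of zero is taken to contribute zero, as in the S/360 convention.

```python
def effective_address(regs, b, x, d, width=18):
    """Effective address (B) + (X) + D truncated to `width` bits.

    regs: list of 16 general register values; b, x: base and index register
    numbers (0 means no register is used); d: 12-bit displacement.
    """
    base = regs[b] if b else 0
    index = regs[x] if x else 0
    return (base + index + (d & 0xFFF)) & ((1 << width) - 1)


def add16_with_carry(a, b, carry_in=0):
    """One pass through a 16-bit adder: returns (16-bit sum, carry out)."""
    total = (a & 0xFFFF) + (b & 0xFFFF) + carry_in
    return total & 0xFFFF, total >> 16


def effective_address_double_precision(regs, b, x, d, width=18):
    """Same result using only 16-bit additions: low halves first, then high halves."""
    base = regs[b] if b else 0
    index = regs[x] if x else 0
    low, carry1 = add16_with_carry(base, index)
    low, carry2 = add16_with_carry(low, d & 0xFFF)
    high, _ = add16_with_carry(base >> 16, index >> 16, carry1 + carry2)
    return ((high << 16) | low) & ((1 << width) - 1)
```

The second function needs extra adder passes for every storage-referencing instruction, which is the performance cost the text attributes to the microprogrammed approach; widening the data path by two or four bits removes those passes.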
The implementation requirements of the various approaches to memory
extension are illustrated in Figures 2-12, 2-13 and 2-14. In those
illustrations, it can be seen that sectored memory provides minimum
hardware impact but not significantly less than microprogrammed
addressing. Figure 2-12 assumes the PC to remain at 16 bits and
share the same sector register as the operands of the effective
address calculation. Additional sector registers can be provided
for separate sectoring of the program counter, second operand, and
DMA. The additional hardware, however, does not seem warranted in
an approach with such inherent limitations.

Comparison of Figures 2-13 and 2-14 shows that providing and
controlling two to four extra bits of ALU and general register is
the significant difference between hardware and microprogrammed
addressing. Table 2-2 shows a summary comparison between the three
approaches.
The 32-bit data path approach was eliminated because of the extreme
impact both on production hardware (an extra slice) and on
development cost. The extended calculation approaches are shown with
four bits per flat pack. The sectored memory, however, is shown with
extension to only 19 bits. The larger the memory, the more problems
encountered (with sectored memory), so 18-bit addressing (256K bytes)
would probably not be exceeded. Also, 19 bits represents an
efficient use of hardware. The hardware estimates in the table
are those which were believed at the time of the trade study.

The problems associated with the management of a sectored memory
plus the loss of S/360 compatibility make sectored memory very
unattractive. The expanded ALU/data path approach was selected over
the microprogrammed approach to get a 10% increase in performance
at the cost of a few flat packs. Section 3.2 of this report shows
the implementation of the address expansion.

[FIGURE 2-12. SECTORED MEMORY IMPLEMENTATION]
[FIGURE 2-13. DOUBLE PRECISION EA MICROPROGRAM HARDWARE]
[FIGURE 2-14. DATA PATH EXTENSION]
[TABLE 2-2. SUMMARY OF ADDRESS EXPANSION TRADES (entries illegible in the microfiche reproduction)]
3.0 IMPLEMENTATION
3.1 FAULT-TOLERANT MEMORY
The Fault-Tolerant Memory (FTM) is a storage system which tolerates
single and multiple errors within words read from memory. The FTM
system is comprised of three major segments: Storage Array,
Translator, and Error Correction Algorithms.

3.1.1 Storage Array
The storage array is composed of basic memory modules (BMMs) which
are hermetically sealed and each contains 8192 bits of N-channel
FET random access storage organized as an 8K x 1 bit array with
on-chip address decoding. Each memory module is part of a bit
plane. There are 16 bit planes corresponding to the 16 bits to be
stored from the CPU, 6 check bit planes and 4 spare bit planes. A
bit plane provides one bit to the word read from storage. The bit
plane organization ensures that any failure in a BMM will affect at
most one bit in any word read from storage. This feature
significantly enhances the effectiveness of the error correction
code. Any failure in a module can mutate only one bit in any given
word stored. Figure 3-1 contrasts the storage array organization for
64K simplex and fault-tolerant machines.
3.1.2	 Translator
Refer-ing to Figure 3-2, the Translator is functionally partitioned
into six mu'ar data flow areas.	 a storage data re,4x2 ter (SM) which
includes an input multiplexer, parity trees, corrector, error
analysis, command and status registers, and spare assignment
register. The SDR is the major working register for the translator.
All data inputs to be stored from the CPU are read into the SDR and
all data read from the storage array is read into the SDR.
3.1.2.1	 Parity Trees
Parity trees in the translator are used for four purposes: 	 gene-
rating check bits on store operations, checking byte parity bits on
store operations, generating syndromes on read operations, and
self-testing of translator circuits.
3.1.2.1.1	 Check Bit Matrix
Referring to Figure 3-3, there are 16 data bit positions labeled 1
through 16.	 There are six check bit columns labeled C1-C6. 	 Each
check bit is generated by parity trees to give odd parity over the
field consisting of itself and eight associated data bits in the
same row of the matrix. Thus, C1 would be generated as zero or one
if data bits 1 through 8 had odd or even parity, respectively.
Similarly, C2 would be generated to give odd parity over the field
consisting of itself and data bits 6 through 13. It should be
noted that each column of the parity check matrix consists of an
odd number of 1's. The data bit columns have three 1's and the
check bit columns have a single 1. This constitutes a Hamming
code of odd weight.
Bit Position
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 C1 C2 C3 C4 C5 C6
S1 11111111 1
S2 1111 1 1 1 1 1
S3 1 1 11 1 1 1 1  1
S4 1 1 1 1 1 1  1 1  1
S5 1 1 1 1 1 1 1 1 1
S6 11 1 1 1  1 1 1  1
Figure 3-3 Parity Check Matrix for 16 Data Bits
On store operations, the 16-bit word from the CPU is first stored
in the SDR. The 16-bit word is then flushed through the parity
trees to generate check bits which are also stored in the SDR.
Subsequently, a 22-bit word is transmitted to the storage array (16
data bits and 6 check bits).
3.1.2.1.2 Error Detection and Location
On read operations, each of six fields, consisting of eight data
bits and an associated check bit, is checked for odd parity. The
parity indication signals generated for these six 9-bit fields are
called syndromes, labeled S1 - S6 in Figure 3-3. In the event that
one or more syndromes indicate a discrepancy, an error is flagged.
The pattern of the syndromes is analyzed to determine the type of
error and, in the event of a single error, the syndrome pattern
indicates the position of the errant bit.
Each data bit and each check bit has a unique pattern of 1's in
its column. Thus, if data bit 1 was in error, then syndromes 1,
3, and 4 would indicate discrepancies. The combination of syn-
dromes (1, 3, 4) uniquely identifies data bit 1 as the errant bit.
In this way, the syndrome patterns are decoded to locate a single-
bit error.
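The check bit generation and syndrome decoding described above can be sketched in a few lines. This is an illustration only: the column assignment below is a hypothetical choice of sixteen three-of-six patterns, picked so that each syndrome covers eight data bits, and it is not the assignment of Figure 3-3. The names `encode`, `failing_syndromes` and `read_word` are likewise assumptions made for the example.

```python
from itertools import combinations

# Hypothetical column assignment (NOT the report's Figure 3-3): 16 of the 20
# three-of-six patterns, chosen so every syndrome row covers eight data bits.
UNUSED = {(0, 1, 2), (3, 4, 5), (0, 3, 4), (1, 2, 5)}
DATA_COLS = [c for c in combinations(range(6), 3) if c not in UNUSED]
CHECK_COLS = [(row,) for row in range(6)]
COLUMNS = DATA_COLS + CHECK_COLS  # 22 bit positions: 16 data, 6 check


def field_parity(word, row):
    """XOR of every bit of `word` whose column has a 1 in syndrome `row`."""
    parity = 0
    for bit, col in zip(word, COLUMNS):
        if row in col:
            parity ^= bit
    return parity


def encode(data):
    """Append six check bits giving odd parity over each 9-bit field."""
    word = list(data) + [0] * 6
    for row in range(6):
        word[16 + row] = field_parity(word, row) ^ 1
    return word


def failing_syndromes(word):
    """Rows whose 9-bit field no longer has odd parity."""
    return tuple(row for row in range(6) if field_parity(word, row) != 1)


def read_word(word):
    """Classify the word and correct a single-bit error if one is located."""
    bad = failing_syndromes(word)
    if not bad:
        return "no error", list(word)
    if bad in COLUMNS:  # pattern matches exactly one column: single error
        fixed = list(word)
        fixed[COLUMNS.index(bad)] ^= 1
        return "single error", fixed
    if len(bad) % 2 == 0:  # two odd-weight columns sum to an even weight
        return "double error", list(word)
    return "uncorrectable", list(word)
```

Because every column has odd weight, any two single-bit errors combine into an even number of failing syndromes, which is how a double error is told apart from a single one.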
3.1.2.1.3 Self-Testing
The construction of the parity trees used in the translator augments
the self-checking/self-testing properties of the translator. Figure
3-4 illustrates the organization for one of the three parity trees
on a parity chip. There are nine input bits per tree and each tree
is divided into a six-bit section and a three-bit section. There
is a partial output for the six-bit section of the tree and another
for the three-bit section. There is also a combined output which
represents the parity over all nine input bits. The pair of
partial outputs is called the "morphic" output of the parity net-
work, while the combined output is the usual logical output. Since
odd parity is being used, an error-free syndrome from the morphic
output is indicated by an 01 or a 10 signal on read cycles. In the
event of a fault within the parity tree network which results in an
erroneous output, only one leg of the morphic output will be
affected. Therefore, single gate failures in these circuits
propagate to the output where they may be detected. Thus an odd
parity input to a parity circuit, containing an error causing
fault, will result in an 00 or 11 morphic output. Of course, an
even parity input to a fault-free parity tree, will also cause
the morphic output to be 00 or 11.
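A small behavioral model may make the morphic output easier to follow. It is a sketch only: the gate-level structure of the parity chip is not modeled, and the `stuck_leg` argument is a hypothetical fault-injection hook added for the illustration.

```python
from functools import reduce
from operator import xor


def parity_tree(bits, stuck_leg=None):
    """Return ((six_bit_partial, three_bit_partial), combined) for nine inputs.

    stuck_leg is a hypothetical fault injection: (leg_index, forced_value).
    """
    if len(bits) != 9:
        raise ValueError("a tree takes nine input bits")
    legs = [reduce(xor, bits[:6]), reduce(xor, bits[6:])]
    if stuck_leg is not None:
        index, value = stuck_leg
        legs[index] = value  # a fault disturbs only one leg of the pair
    return (legs[0], legs[1]), legs[0] ^ legs[1]


def morphic_ok(pair):
    """Odd parity over the nine inputs appears as unequal legs: 01 or 10."""
    return pair[0] != pair[1]
```

With an odd-parity input the two legs differ; forcing either leg to the wrong value makes them equal, which is the 00 or 11 indication described above.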
[Figure 3-4: inputs 1 through 9 feed a tree of XOR gates, split into a
six-bit section and a three-bit section, producing two partial outputs
and one combined output.]
Figure 3-4. FUNCTIONAL PARITY TREE REPRESENTATION
3.1.2.2 Corrector
The corrector consists of a correction decoder and exclusive OR gates
which are in line with each of the 22 bits. Correction occurs as
data is being transmitted to the CPU. Correction is complete
before the CPU receives a read data ready indication. The
syndromes generated by the parity trees are decoded into 22 bits.
The correction decoder illustrated in Figure 3-5 consists
functionally of 27 six-input AND gates which decode each of the 20
combinations of six things taken three at a time, the six
combinations of six things taken one at a time, and the single
combination of none of six. The outputs from the decoder are wired to
the appropriate SDR bit positions. All twenty of the three-of-six
combinations are available on the chip; however, only the
appropriate 16 are utilized for the sixteen data bit positions in the
code chosen. The inputs to the decoder are the combined outputs of the
six syndrome parity trees.
The outputs of the correction decoder are then exclusive-ORed bit-by-bit
with the data as read from the storage array. Whenever the two inputs
disagree, correction occurs.
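The gate count of the correction decoder follows directly from the three groups of combinations, as the sketch below shows. The wiring of decoder outputs to bit positions used here (the first sixteen three-of-six gates to the data bits, the six one-of-six gates to the check bits) is an assumption for illustration; the actual wiring is that of the code chosen in Figure 3-3.

```python
from itertools import combinations

SYNDROMES = range(6)
THREE_OF_SIX = list(combinations(SYNDROMES, 3))  # 20 patterns
ONE_OF_SIX = list(combinations(SYNDROMES, 1))    # 6 patterns
NONE_OF_SIX = list(combinations(SYNDROMES, 0))   # 1 pattern (no error)
DECODER_PATTERNS = THREE_OF_SIX + ONE_OF_SIX + NONE_OF_SIX  # one AND gate each


def decode(failing):
    """One-hot decoder outputs: a gate fires only on its exact pattern."""
    key = tuple(sorted(failing))
    return [int(pattern == key) for pattern in DECODER_PATTERNS]


def correction_mask(failing):
    """22-bit mask under the assumed wiring (16 data gates, 6 check gates)."""
    outputs = decode(failing)
    return outputs[:16] + outputs[20:26]


def correct(word, mask):
    """Exclusive-OR the decoder outputs with the word as read."""
    return [bit ^ flip for bit, flip in zip(word, mask)]
```

A syndrome pattern that matches one of the four unused three-of-six gates produces an all-zero mask under this wiring, so no bit is changed.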
3.1.2.3 Error Analysis
The error analysis portion of the translator is perhaps the most
unique portion. It is implemented in morphic logic. The morphic
logic uses dual line pairs to replace the single lines in conven-
tional logic gates arranged as two independent tree structures so
	
that a fault of a single gate in the morphic logic propagates to
the output where it can be detected. Circuits representing morphic
invert, morphic AND, morphic OR, morphic exclusive-OR, etc., have
been devised. Combinations of these morphic gates can be utilized
to implement any logical function. The morphic logic equivalent of
a conventional logic 1 is a 01 or a 10 on the line pair, written
[01/10]. The morphic logic equivalent of a conventional logic 0 is
a 00 or a 11 on the morphic line pair, written [00/11]. For
explanation of the error analysis circuits for this translator, the
nomenclature 1M for [01/10] and 0M for [00/11] will be used.
The translator error analysis will be illustrated by explaining its
operation for checking the word read from storage. The verifica-
tion of byte parity checking and generation and check bit generation
will then be explained.
The ANDM whose output is labeled A in Figure 3-6 has inputs from
the morphic output of the syndrome generation parity trees S1 - S6.
Since odd parity is used in the encoding, on read-out all of the
syndrome partial signals should be 1M if no error has occurred.
Thus, the output A should be 1M. Two parity trees are shown as
the input B in Figure 3-6. There are an even number of syndromes
(6). One each of the two morphic lines from each of the syndrome
generators (the byte parity tree inputs are inhibited during read
cycles) are inputs to the two parity trees whose outputs form B.
Since the syndrome no-error condition is 1M and there are overall
an even number of syndromes, there should be in total an even
number of morphic (and logical) 1's under a no-error condition.
Since this is true, both parity trees should have like parity, either
odd or even, since the sum of two odds or of two evens is even. The
input B in Figure 3-6 should be 0M under a no-error condition. The
ANDM gate whose output is P indicates NO ERROR as 1M when the A and
B signals are normal.
The ANDM gate whose output is Q indicates a translator circuit
failure condition. There is no valid condition of the inputs which
causes outputs A and B to be 1M simultaneously. This condition is
indicative of a failure in the circuits which generate A or B.
Therefore, the ANDM gate whose output is Q senses this condition
as a circuit error.
A single error is manifest as an odd number of 0M syndromes -- one
syndrome or three syndromes having a value of 0M. Under this
condition, the output A will be a 0M and the output B will be 1M.
The output A is inverted to make it a 1M and combined with the
output B, which will be 1M, to cause signal R to be 1M -- the
single-error condition signal.
The ANDM gate whose output is S senses a double-error condition.
The output A will be 0M, as will the output B, in the presence of a
double error in the word read from storage. Inversion of both
these outputs makes them both 1M and, when combined in the ANDM gate
whose output is S, indicates the double-error condition.
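The four conditions P, Q, R and S amount to a two-input truth table on A and B. The sketch below models only the logical value carried by each morphic pair (True for 1M, False for 0M), not the dual-line circuits themselves; the function names are assumptions for the example.

```python
def analysis_signals(syndrome_is_1m):
    """Derive A and B from six syndromes (True where a syndrome is 1M)."""
    if len(syndrome_is_1m) != 6:
        raise ValueError("six syndromes expected")
    a = all(syndrome_is_1m)           # ANDM over all six syndromes
    b = sum(syndrome_is_1m) % 2 == 1  # 1M when an odd number of syndromes are 1M
    return a, b


def classify(a, b):
    """Logical equivalents of the ANDM gates P, Q, R and S."""
    if a and not b:
        return "P: no error"
    if a and b:
        return "Q: translator circuit failure"
    if (not a) and b:
        return "R: single error"
    return "S: double error"
```

In a fault-free translator `analysis_signals` can never return A and B both true, so the Q case is reachable only by passing the two values to `classify` directly, which corresponds to a failure in the circuits that generate A or B.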
The ANDM whose output is D indicates a circuit-error condition.
The input U is for byte parity circuit checks. The signal Q
(mentioned previously) is inverted because its normal (no error)
indication is a 0M. In order to maintain consistency, all inputs
to the ANDM should, under normal conditions, be 1M's.
Therefore, the signal Q inverted is 1M during normal operation. The
signal T is used for checking the validity of the generated check
bits on write operations and the validity of the generated byte
parity bits on read operations. It is proved that, with the code
structure herein utilized, the parity of the byte parity bits and
the parity of the code check bits should be the same; therefore,
their combined parity should always be even. The signal T is the
morphic output of a parity tree whose inputs are the two byte
parity bits and the six combined outputs of the parity trees which
generate the check bits.
The two parity trees with T and R inputs together with an inverter
perform the logical operation T . R which is true when there is a
valid single error condition and no circuit failures. Certain
circuit failures might be detected as a single data error without
this check.
A check is made to see that there is not an even number of the
inputs P, R, and S in a 1M state because P, R, and S are mutually
exclusive conditions. Should none or two of these three signals
be up, there will be an even number of logical 1's which,
distributed between the two parity trees whose inputs are PRS, will
make their output 0M. That is a failure indication causing the
output D to be 0M.
This discussion of the read cycle operation is intended to
illustrate how the morphic logic is utilized to provide self-
checking during normal operation. Since the normal data flow
constantly changes, the translator circuits assume both states,
which provides the self-testing property.
3.1.2.4 Command and Status Registers
Microcode in the NSSC-II is used to communicate with the translator
to support fault isolation and correction. Each time the microcode
command NO-OP is executed, the contents of the SDR are interpreted
as a command to the translator and loaded into the command register.
There are four basic types of commands: spare assignment, mode
changes, loading the fake checkbit register, and reset storage address
register (SAR) freeze latch. The spare assignment command is used
to load a spare assignment register which substitutes the spare
assigned for the bit specified. The load fake checkbit command is used
when checkbits other than the ones normally generated are to be stored
in memory in support of memory diagnostics. The SAR in the CPU is
frozen when a double error occurs so the location may be
interrogated and the failing bits identified. At the end of this
interrogation, the latch prohibiting reloading of the SAR is reset
with the reset SAR freeze latch command. Special modes of operation
are required to execute the advanced microcoded storage diagnostics
employed. These modes are specified to the translator by the mode
command. Figure 3-7 shows the command NO-OP structure and gives a
brief definition of the various modes.
The status register contains the Translator Error/Status word and
is used by the diagnostic decoding algorithms to determine status
after an FTM error interrupt and to support multiple error cor-
rection. Definition of the Translator Error/Status Word is
contained in Figure 3-8.
3.1.2.5 Spare Assignment Register
When the decision is made in microcode to substitute a faulted
bit with a spare bit plane, one of four spare assignment registers
is loaded with the syndrome pattern of the faulted bit. This
syndrome pattern is subsequently decoded and the identification of the
faulted bit is provided as input to the SDR. The SDR then switches
the faulted bit plane out of the data path and switches the spare
bit plane into the data path.

Bit:  0  1  2  3  4  5  6  7  8  9  10 11 12 13 | 14 15
Load Spare Info:      Load Reg 1-4, Spare Assignment |  0  0
Load Mode Info:       Modes (See Below)              |  0  1
Load Fake Checkbits:  Fake Checkbits                 |  1  0
Reset SAR Freeze Latch                               |  1  1

Bit  Mode                    Function
6    Use Fake Checkbits      Uses bits stored in Fake Checkbit register
                             instead of stored checkbits on read or
                             generated checkbits on writes
7    Inhibit Write           Inhibits checking for CPU parity errors or
     Error Checks            translator errors on write cycles
8    Inhibit Correct         Forces translator timing to not generate the
                             correction clock; used in test mode
9    Inhibit Load            Forces translator timing to not generate the
                             SDR load pulse; used in test mode
10   Error Check SDR         Causes read cycle to use data in CPU SDR
                             rather than memory data; when used with bit 8,
                             can error check generated word without using
                             memory; used in test mode; if used with bit 15,
                             SDR will contain error status of word at end
                             of cycle
11   Reconfigure Mode        Forces translator to read from old bit plane
                             (bit being reconned only) and write TDR data
                             to assigned bit plane
12   Test Mode               Forces translator to use TDR rather than
                             exclusive ORs for corrections and allows for
                             additional control
13   Put Errors On           Data in bus will contain checkbits and error
     Data Bus                status
Figure 3-7. COMMAND 'NO-OP' COMMAND STRUCTURE

Bit   Meaning
0     No error detected during a read in test mode.
1     Single data error detected during a read in test mode.
2     Double data error detected during a read in test mode.
3     Translator error detected during test mode.
4-9   Checkbits as read from memory.
10    Translator or CPU parity error detected on write.
11    Spare.
12    Double data error during read in normal mode.
13    Translator error during read in normal mode.
14    Spare.
15    Spare.
Figure 3-8. TRANSLATOR ERROR/STATUS WORD
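The command and status word layouts of Figures 3-7 and 3-8 can be exercised with a short sketch. It assumes that bit 0 is the most significant bit of the 16-bit word, which the figures do not state explicitly; the helper names and the flag descriptions are likewise assumptions for the example.

```python
# Assumption: bit 0 is the most significant bit of the 16-bit word.
MODE_COMMAND_TYPE = 0b01  # bits 14-15 = 0 1 selects "Load Mode Info"

STATUS_FLAGS = {
    0: "no error (test mode read)",
    1: "single data error (test mode read)",
    2: "double data error (test mode read)",
    3: "translator error (test mode)",
    10: "translator or CPU parity error on write",
    12: "double data error (normal mode read)",
    13: "translator error (normal mode read)",
}


def bit(word, n):
    """Value of bit n, counting from the most significant end."""
    return (word >> (15 - n)) & 1


def mode_command(*mode_bits):
    """Build a Load Mode Info command with the given mode bits (6-13) set."""
    word = MODE_COMMAND_TYPE
    for n in mode_bits:
        if not 6 <= n <= 13:
            raise ValueError("mode bits occupy positions 6 through 13")
        word |= 1 << (15 - n)
    return word


def decode_status(word):
    """Return (list of raised flags, six checkbits from bits 4-9)."""
    flags = [text for n, text in STATUS_FLAGS.items() if bit(word, n)]
    checkbits = (word >> 6) & 0x3F
    return flags, checkbits
```

For instance, `mode_command(8, 12)` sets Inhibit Correct and Test Mode together in a single command word under these assumptions.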
3.1.3 Error Correction Algorithms
Several algorithms, implemented in microcode, permit maximum
utilization of the fault tolerant memory feature. A flow diagram
of the microcode routines containing these algorithms is shown in
Figure 3-9. During program execution, the FTM operates under
direct hardware control and corrects single errors as they occur.
If an uncorrectable memory error occurs, the microcode routines
depicted in Figure 3-9 are invoked which analyze the error and
take appropriate actions to reconfigure memory to eliminate the
error. As the routines are executed, the FTM system provides
status reports to the outside world via the FLAG RMU control
signal and a code word placed on the I/O channel output bus. The
code words and their meaning are listed in Figure 3-10.
3.1.3.1 Error Location
If a double read error is detected, the address of that word is
frozen by the memory interface hardware and the microprogram takes
two actions: First, the READ is retried 64 times to determine
that it was a hard failure. If any retry results in a correctable
error, the normal program execution will be resumed.
[Flow chart, legible boxes: Enter From Interrupt* (Address Frozen);
Transient Error*? (Yes: Exit to Recovery); Error Location: Form Fault
Word, Identify and Save Faulted Bits; 0 or 1 Faults: Address Tally,
Check for and Tally Address Faults*, Correct Isolated Transient Write
Errors, Identify Bit Having Stuck Address; 2 Faults: Fault Tally,
Check for and Tally Stuck Bits; Identify Plane to be Reconfigured;
Reconfigure; Copy and Test Correct*; Exit to Recovery*.
Notes: Catastrophic Errors Caused CPU to Hang; Stuck Addresses Don't
Show Up Here; Triple Errors Appearing As Single Errors Will Be
Improved; *Footprints Sent to RMU Along the Way.]
Figure 3-9. DIAGNOSE REPAIR FLOW
If the error was "hard", the microprogram initiates the second
action of identifying the "stuck" locations in the word with the
routine called Form Fault Word which is flow charted in Figure
3-11.
o Read the data word without correction and place in
temporary storage.
o Read the check bits and place in temporary storage.
o Store the bit-by-bit inversion of the data and check bits
in the same storage location.
o Reread the data and check bits and compare (exclusive
OR) with the original data and check bits.
o Zeros will identify the locations of any bits which are
not able to be inverted (stuck at 1 or 0).
o If two or more bits are stuck, the fault location data
is stored and the microprogram proceeds to the fault
tally segment.
o If fewer than two bits are stuck, there is an addressing
error or a transient WRITE error which caused the
storage of a bad bit.
o If fewer than two stuck bits were found, the analysis
proceeds to find stuck addresses in the entire memory
(ADDRESS TALLY).
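The READ/INVERT/STORE/READ/COMPARE steps listed above can be tried out against a toy memory model. Everything in this sketch is illustrative: `StuckMemory` is a hypothetical stand-in for the storage array with correction inhibited, and the returned fault word is inverted relative to the text (a 1 marks a stuck bit) purely for convenience.

```python
class StuckMemory:
    """Toy 22-bit-wide memory with hypothetical stuck-at cells."""

    WIDTH = 22

    def __init__(self, size, stuck=None):
        self.cells = [0] * size
        self.stuck = stuck or {}  # {address: {bit_index: forced_value}}

    def write(self, address, word):
        self.cells[address] = word & ((1 << self.WIDTH) - 1)

    def read(self, address):
        word = self.cells[address]
        for bit_index, value in self.stuck.get(address, {}).items():
            mask = 1 << bit_index
            word = (word | mask) if value else (word & ~mask)
        return word


def form_fault_word(memory, address):
    """Return a word with a 1 in every bit position that could not be inverted."""
    mask = (1 << memory.WIDTH) - 1
    original = memory.read(address)          # uncorrected read, saved
    memory.write(address, ~original & mask)  # store the bit-by-bit inversion
    reread = memory.read(address)
    memory.write(address, original)          # restore original contents
    changed = original ^ reread              # zeros mark stuck bits
    return ~changed & mask


def next_routine(fault_word):
    """Two or more stuck bits lead to FAULT TALLY, otherwise ADDRESS TALLY."""
    return "FAULT TALLY" if bin(fault_word).count("1") >= 2 else "ADDRESS TALLY"
```

A location with no stuck cells yields a fault word of zero, which corresponds to the addressing-error or transient-WRITE case.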
[Flow chart: Start -> Set Up Inhibit Correct Mode -> Read Word From
Memory and Save -> Complement Word and Write it Back -> Read Memory
Again; XOR with Original Contents; Save -> Restore Original Memory
Contents -> Exit]
Figure 3-11. FORM FAULT WORD
3.1.3.2 ADDRESS TALLY
ADDRESS TALLY is invoked because a double error condition exists
and the fault word routine found fewer than two faults. This routine
is flow charted in Figure 3-12.
o	 Stuck addresses are found as follows:
-	 Read word A and store it temporarily.
-	 Read word B and store it (where the address of B is
different from A by a single bit).
-	 Complement A and store it in B.
-	 Check to see if A changed. If so one bit of B is
being stored in A.
o	 Each stuck address which is found is tallied for sub-
sequent reconfiguration.
o	 If there was a transient WRITE problem, it will be
corrected during the final microprogram segment which
is TEST/CORRECT.
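The stuck-address test can be illustrated with a toy memory in which a faulty address line makes two addresses select the same cell. This is a sketch under assumptions: `AliasedMemory` is hypothetical, only address 0 is used as location A, and the block-by-block scan of the full routine is not modeled.

```python
class AliasedMemory:
    """Toy memory whose address lines may be stuck (hypothetical fault model)."""

    def __init__(self, address_bits, stuck_address_bits=None):
        self.address_bits = address_bits
        self.stuck = stuck_address_bits or {}  # {address_bit: forced_value}
        self.cells = [0] * (1 << address_bits)

    def _effective(self, address):
        for bit, value in self.stuck.items():
            mask = 1 << bit
            address = (address | mask) if value else (address & ~mask)
        return address

    def read(self, address):
        return self.cells[self._effective(address)]

    def write(self, address, word):
        self.cells[self._effective(address)] = word


def address_tally(memory, word_mask=0xFFFF):
    """Tally address bits for which a write to B disturbs A."""
    tally = {}
    a = 0
    for bit in range(memory.address_bits):
        b = a ^ (1 << bit)                     # B differs from A by one bit
        saved_a = memory.read(a)
        saved_b = memory.read(b)
        memory.write(b, ~saved_a & word_mask)  # complement of A stored in B
        if memory.read(a) != saved_a:          # A changed: address fault
            tally[bit] = tally.get(bit, 0) + 1
        memory.write(b, saved_b)               # restore location B
    return tally
```

Restoring B also restores A when the two addresses alias, because both names then refer to the same cell.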
3.1.3.3 FAULT TALLY
FAULT TALLY is invoked because the fault word routine found two or
more faults at the error location. This routine is flow charted in
Figure 3-13. The FAULT TALLY routine checks the entire memory
for "stuck at" faults and tallies them by bit number. At the
conclusion of this routine the counts for each bit location are
evaluated and a reconfiguration decision is made.
[Flow chart, legible boxes: Start; Set 0 Address For Memory Block N
(Location 'A'); Read Location 'A', Save; Set Location 'B' Address;
Read Location 'B'; Complement Location 'B' and Store Back; Read
Location 'A'; Bit of 'A' Changed?; Address Fault, Locate Faulted Bit
Plane; All Blocks Tested?; Set 0 Address For Memory Block N + 1;
Restore Location 'B'; Exit]
Figure 3-12. ADDRESS FAULT TALLY
[Flow chart, legible boxes: Start -> Set Up Inhibit Correct Mode ->
Set Up Error Counters, Zero -> Read a Memory Location -> Any Error?
(No: Increment Address) -> Form Fault Word -> Locate Faulted Bits ->
Restore Original Word -> End of Memory? (Yes: Exit)]
Figure 3-13. FAULT TALLY ROUTINE
The decision logic is as follows:
o  If a double error caused the diagnostics to be run, a
   plane will be substituted in the RECONFIGURE routine as
   follows:
   -  If ERROR LOCATION found two or more stuck-at bits,
      the one substituted will be the one with the largest
      number of faults detected by FAULT TALLY.
   -  If there was only one stuck-at fault and an
      addressing fault, the address fault will be
      substituted.
   -  If there was one stuck bit and a transient WRITE,
      the stuck bit will be substituted.
o  If the analysis microprogram was entered via the DIAGNOSE
   instruction, the only reason for a plane substitution
   is if the tally of one or more bits exceeded the
   threshold of 512 faults. Since each chip contains 2048 bits,
   this threshold seems reasonable for a systematic fault.
   A different threshold can be substituted by burning in
   a different value in the microprogram memory.
The tally operation is essentially the same as the fault location
operation, as it looks for stuck bits by the READ/INVERT/STORE/
READ/COMPARE sequence. 	 This tally operation can be performed on
the entire memory at one time or can be broken into segments.
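The reconfiguration decision can be written as a single function, shown below as a sketch. The argument names are assumptions, and two points are choices made for the example rather than statements from the text: when several bits exceed the threshold on a DIAGNOSE entry, the one with the largest tally is picked, and an address fault found with no stuck bit is also substituted.

```python
THRESHOLD = 512  # faults per bit plane; the text notes this value can be changed


def choose_plane(entered_via, tallies, stuck_bits=(), address_fault_bit=None):
    """Return the bit plane to replace with a spare, or None for no change.

    entered_via: "double error" or "diagnose"
    tallies: {bit_number: fault count from FAULT TALLY}
    stuck_bits: stuck bit numbers found at the failing location
    address_fault_bit: bit plane with an address fault, if one was found
    """
    stuck_bits = list(stuck_bits)
    if entered_via == "double error":
        if len(stuck_bits) >= 2:
            return max(stuck_bits, key=lambda bit: tallies.get(bit, 0))
        if address_fault_bit is not None:
            return address_fault_bit
        if len(stuck_bits) == 1:
            return stuck_bits[0]  # stuck bit plus a transient WRITE
        return None
    # Entered via the DIAGNOSE instruction: substitute only above threshold.
    over = {bit: count for bit, count in tallies.items() if count > THRESHOLD}
    if over:
        return max(over, key=over.get)  # assumption: worst plane first
    return None
```

A tally exactly equal to 512 does not trigger a substitution here, since the text speaks of exceeding the threshold.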
3.1.3.4 RECONFIGURE
The RECONFIGURE routine is the short microprogram sequence which
"tells" the translator which spare plane number should be sub-
stituted for which data or check bit. If the bad bit plane is a
spare bit plane, it must be unassigned to permit the new spare bit
plane to be effective. This routine is flow charted in Figure 3-14.
3.1.3.5 TEST/COPY/CORRECT
The TEST/COPY/CORRECT routine has two functions: copy the informa-
tion from the old plane to the new plane and correct all possible
errors. This routine works with each location in memory in
sequential order reading from the old plane and writing to the new
plane. This routine is flow charted in Figure 3-15.
Each time a READ operation results in any error, a special correc-
tion routine is used to correct the data. This proprietary routine
is so constructed that it can form the correct word for all single,
double, and triple "stuck at" type faults and one "soft-error" in
combination with zero, one, or two "stuck at" type faults. This
routine is flow charted in Figure 3-16.
[Flow chart, legible boxes: Double Error?; Retrieve Error Tally for
1st Error Bit; Retrieve Error Tally for 2nd Error Bit; Get Reconfigure
Code for Bit with Greatest No. of Errors; Get Reconfigure Code for 1st
Double Error Bit; Get Reconfigure Code for 2nd Double Error Bit; All
Spares Used?; Reconfigure 1st Available Spare Bit, Set Spare Bit Used
Ind; Signal Error to Controller; Hang; Exit]
Figure 3-14. RECONFIGURE ROUTINE
[Flow chart: Start -> Start Address Counter to First Location -> Read
a Location (Failed Plane Participates) -> Any Errors? (Yes: Form Fault
Word -> Perform Error Correction Algorithm) -> Write Data Back
(Replacement Plane Participates) -> End of Memory? (No: Increment
Location Address Counter and repeat; Yes: Exit)]
Figure 3-15. TEST/COPY/CORRECT ROUTINE
[Flow chart, legible boxes: Start; Initialize Trial Fault Word From
Results of "Form Fault Word" Routine; Do "Test Read" With All Fault
Bits Complemented; Any Errors? (No: Save Trial Word, Increment
Solution Counter A); Multiple Error? (No: Single Error, Do Normal
Error Correction, Save Corrected Word, Increment Solution Counter);
Strike LS Fault Bit From Trial Fault Word; Any Fault Bits Left?;
Solution Counter A; Solution Counter B; Store Corrected Word in
Memory; No Solution or Too Many: Correction Failure; Exit]
Figure 3-16. MULTIPLE ERROR CORRECTION ALGORITHM
3.2 FTM SYSTEM MANAGEMENT
The previous sections presented the implementation of the Fault
Tolerant Memory hardware and microcode. The benefits derived from
this system implementation are: (1) the capability to recover from
permanent and transient type memory errors, (2) extending the
usable life of the storage array, and (3) enhancing the overall reliability
of the SUMC-IIC computer. Associated with these benefits is the recurring
cost of time. Whenever a double error occurs, the microcode algorithms
which locate the error, reconfigure memory and correct data errors take
the CPU "off-line" for the duration of time that it takes to execute the
algorithms. Table 3-1 lists the execution time for each of the microcode
routines previously discussed. As per the example presented in Table 3-1,
1 double faulted location in a 64K memory size machine would take the CPU
"off-line" for 687 msec. It is conceivable that at certain "mission
critical" periods of time, 687 msec of "off-time" would be intolerable.
The risk of this occurring cannot be completely eliminated, but it can be
greatly reduced by managing the tools which the FTM system provides.
The management concept is this: on a periodic basis during non-critical
mission phases, enter via the DIAGNOSE instruction the Fault Tally and
Address Fault Tally microcode routines. This provides the advantages of
(1) letting the flight programmer select the time periods for fault location
and correction and (2) greatly reducing the probability of a double error
occurrence. To prevent the accumulation of "soft errors,"
the flight program should, on a periodic basis, read and rewrite all memory
locations. This could be done in increments of 1K or 4K or 16K bytes.
This will further minimize the probability of a double error occurrence.
The disadvantage of this concept is the increased software overhead but,
when weighed against a 687 msec "off-line" time during "mission critical"
phases, it seems reasonable to conclude that the FTM management concept
should be included as a part of the flight software.
Table 3-1
FTM DIAGNOSTIC EXECUTION

DIAGNOSTIC                     EXECUTION TIME
A. Fault Tally                 62 µsec + 19 µsec/halfword location
B. Address Fault Tally         253 µsec/4K block
C. Reconfigure                 7 µsec
D. Copy/Test/Correct           5 µsec/halfword location (+ 62 µsec/
                               halfword location if correction needed)
E. Multiple Error Overhead     390 µsec

TOTAL EXECUTION TIME FOR 64K MACHINE WITH 1 DOUBLE FAULTED LOCATION
(A) 522654 µsec + (C) 7 µsec + (D) 164902 µsec + (E) 390 µsec = 687 msec
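The total quoted for the 64K example can be checked by adding the four terms of the total line exactly as printed. The sketch below does only that; it does not attempt to re-derive the individual terms from the per-location rates, and the dictionary keys are labels chosen for the example.

```python
# Terms of the total line in Table 3-1, in microseconds, as printed.
TERMS_USEC = {
    "A fault tally": 522654,
    "C reconfigure": 7,
    "D copy/test/correct": 164902,
    "E multiple error overhead": 390,
}

total_usec = sum(TERMS_USEC.values())
total_msec = total_usec / 1000

print(f"{total_usec} usec = {total_msec:.3f} msec")
```

The sum is 687,953 µsec, which the table reports as 687 msec.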
3.3 IMPLEMENTATION OF ADDRESS EXTENSION
Expanding the ALU and appropriate data paths and registers was
selected as the approach to address expansion for the SUMC-IIC.
This provides the addition of 20 bit numbers when address
calculations are being done as a part of the effective address
calculation (EA Calc).
The EA Calculation is: EA = D + (B) + (X) where D is the 12 bit
displacement from the instruction register, (B) is the contents of
one of the general registers (now 20 bits) and (X) is the contents
of one of the general registers (20 bits). This addition is
performed by calculating an interim number INTR = D + (B) then getting
the final EA = INTR + (X).
Performing the 20-bit arithmetic for either INTR or EA requires
simultaneous access to 20 bits of the SPM (where the general
registers are implemented) and a 20 bit wide ALU for adding
positive integers. Since D is a positive integer, the
extended part of the ALU does not have to propagate any negative
signs from the lower part of the ALU.
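The two-step effective address calculation can be modeled as a 16-bit addition with a 4-bit extension that receives the carry. This is a behavioral sketch only: the function names are assumptions, and the wrap-around of a sum that exceeds 20 bits is a choice made for the example rather than something stated in the text.

```python
MASK16 = 0xFFFF
MASK4 = 0xF


def add20(a, b):
    """Add two 20-bit values as a 16-bit ALU plus a 4-bit ALU extension."""
    low = (a & MASK16) + (b & MASK16)
    carry = low >> 16                       # carry out of the 16-bit ALU
    high = ((a >> 16) & MASK4) + ((b >> 16) & MASK4) + carry
    return ((high & MASK4) << 16) | (low & MASK16)


def effective_address(d, base, index):
    """EA = D + (B) + (X), formed as INTR = D + (B), then EA = INTR + (X)."""
    if not 0 <= d < (1 << 12):
        raise ValueError("D is a 12-bit positive displacement")
    intr = add20(d, base)   # upper four bits of D are zeros
    return add20(intr, index)
```

Because D is at most 12 bits wide, the extension adds only zeros and the carry to the upper four bits of (B) in the first step.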
Figure 3-17 shows the parts of the SUMC-IIC block diagram which
are affected by the expanded addressing. The salient features of
this hardware are discussed below:
o  A four bit extension to the PC was added to accommodate
   the 20-bit addressing.
o  The extension to the PC can be loaded from the extended
   ALU for branch instructions the same as the regular
   part of the PC can be loaded from the PRM.
o  The PC extension also goes through a new four bit MUX
   into the SAR extended for instruction fetching.
o  The PC extension can be read into the data path (MQM)
   through a new four bit MUX. This is used for BAL and
   store PSW type operations.
o  The address MUX was expanded to handle the 20 bit address
   both from the SAR and from DMA.
o  The address decode logic was expanded to provide six page
   select signals, for up to 96K bytes in the prime CPU box,
   and the full address is available to be sent to an
   external memory unit so that page selects can also be generated
   externally.
o  The storage protect registers (not shown) were expanded to
   1024 X 2 to accommodate the 1024 segments of 1024 bytes
   each associated with the 20 bit address capability.
o  Manipulation of SPM for 20 bit arithmetic is not as
   straightforward as the items just mentioned and it is
   described in the immediately following paragraphs.
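The address decode and storage protect sizing in the list above can be illustrated with simple integer arithmetic. The sketch assumes that the six page selects cover six equal 16K-byte pages (96K divided by six), which the text does not state; the function name and return shape are also assumptions.

```python
ADDRESS_BITS = 20
SEGMENT_BYTES = 1024        # storage protect granularity from the text
PAGE_BYTES = 16 * 1024      # assumption: 96K bytes / six page selects
INTERNAL_PAGES = 6


def decode_address(address):
    """Return (storage protect segment, internal page select or None)."""
    if not 0 <= address < (1 << ADDRESS_BITS):
        raise ValueError("address must fit in 20 bits")
    segment = address // SEGMENT_BYTES
    page = address // PAGE_BYTES
    internal_page = page if page < INTERNAL_PAGES else None  # None: external unit
    return segment, internal_page
```

With a 20-bit address the segment number runs from 0 to 1023, matching the 1024 storage protect entries.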
To form the intermediate sum INTR = D + (B) in the extended ALU,
the most significant four bits (MSB) of the base address, B, must
be added to the zeros which represent the MSBs of the 12-bit D
field. This poses a problem, however, since the 32 bit general
register holding the 20 bit base value is located in two separate
16-bit locations in the SPM. Two SPM locations cannot be read at
the same time so a portion of the SPM is duplicated. Therefore,
when the SPM is reading the least significant 16 bits of a general
register, the SPM extension must be reading what amounts to the
least significant four bits of the most significant half of the
general register. Thus the 20 bit base address (or index
register) is made available simultaneously to the ALU and extended ALU.
The ALU extension register, which is cleared, is fed back to the
ALU extension to provide the most significant zeros corresponding
to the upper part of D (which is only 12 bits). At the end of
this cycle, the ALU extension register contains the four MSBs of
the 20-bit Base Register plus any carry from the displacement
addition. The second addition is like the first except the interim
sum (20 bits) is added to the 20-bit X register value. The full
20-bit EA is loaded into either the PC or the SAR according to the
instruction being executed.
The next problem is getting the MSB portion of the general registers
into the SPM extension. Since the general registers are loaded in
a double precision manner (first the lower half then the upper half),
the extended SPM must be loaded when the upper half of the register
is loaded. Thus special manipulation of the most significant bits
of the SPM address is used to load when an upper half general
register is being loaded and read at all other times.
The overall operation has been described above. Table 3-2
contains the control equations for the blocks shown in Figure 3-17.
For definition of some of the signals make reference to the
microprogram control word. For example: S1 is the first bit of the
SPM field, and R2 is the second bit of the REGISTER field.
Table 3-2. CONTROL EQUATIONS
Block	 Control
Read S ' -37
Write ST ' 17 ' SPM write
Read S1 =.1
Write S1 3`^ SPM Write
Output = MAR if MUXA = MAR (Al ' A'f ' A3)
Output = SPM if MUXA = SPM (Al '
Else Output = 0
Output = SPM if MUXB = SPM (A5 ' A6 ' A'T)
Output = F if ALU = A-B (7 ' A9 ' A10)
Else Output = 0
Output Enable MAM = MAR or MAM = ALU or MAM = PC (R 	 )
Output = XMAR if MAM = MAR (R1 R2 ' R3 ' R4)
XALU if MAM = ALU (R1 R ' RY ' RRT)
XPC if MAM = PC (R1 ' R2 ' -R7 ' R4)
Load = R5 . CKZ (No reset)
Output = MQR Bits 12-15 if MQR to PC EXT (-M7 M8 M9 M10)
Output = 0 if MAM = PC10 (M5 + M6 = 1)
Else PC EXT
Load = Load PC
Output = PC EXT if MQR = PSW (R11	 R12 -R-TY) and
MAM = PCIO (M5 + M6 = 1)
Load = CPSAR Load Clock
Reset = POR
Same as A MUX
Blocks: A, B, C, D, E, F, G, H, I, J, K, L
